Processing streaming events with Apache Kafka
The Streamer
Kafka Partitions
If a topic had to live exclusively on one machine, Kafka's scalability would be severely limited. Therefore, unlike classical messaging systems, Kafka divides a logical topic into partitions and distributes them across different machines (Figure 4), allowing individual topics to scale in terms of data throughput and read and write access.
Partitioning takes the single topic and splits it (without redundancy) into several logs, each of which exists on a separate node in the Kafka cluster. This approach distributes the work of storing messages, writing new messages, and processing existing messages across many nodes in the cluster.
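To make this concrete, the following minimal Java sketch creates a topic with several partitions through Kafka's AdminClient API. The topic name customer-events, the partition count, and the broker address are assumptions for illustration, not values from the article.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; point this at your own cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Six partitions let Kafka spread the topic across brokers;
            // a replication factor of 1 keeps this example minimal
            NewTopic topic = new NewTopic("customer-events", 6, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}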
Once Kafka has partitioned a topic, you need a way to decide which messages Kafka writes to which partitions. If messages arrive without a key, Kafka usually distributes them round robin across all partitions of the topic. In this case, all partitions receive an equal share of the data, but the messages are not kept in any particular order.
If the message includes a key, Kafka calculates the target partition from a hash of the key. In this way, the software guarantees that messages with the same key always end up in the same partition and therefore always remain in the correct order.
Keys are not unique, however; they reference an identifiable value in the system. For example, if the events all belong to the same customer, using the customer ID as the key guarantees that all events for a particular customer always arrive in the correct order. This guaranteed order is one of the great advantages of Kafka's append-only commit log.
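The producer sketch below illustrates both cases: it first sends a record without a key and then two records keyed by a customer ID. The topic customer-events and the customer ID 4711 are hypothetical values chosen for this example.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the producer spreads records across partitions,
            // so no ordering between them is guaranteed
            producer.send(new ProducerRecord<>("customer-events", "login"));
            // With a key: all records for customer 4711 hash to the same
            // partition and therefore stay in order
            producer.send(new ProducerRecord<>("customer-events", "4711", "login"));
            producer.send(new ProducerRecord<>("customer-events", "4711", "purchase"));
        }
    }
}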
Kafka Brokers
Thus far, I have explained events, topics, and partitions, but I have not yet explicitly addressed the actual computers. From a physical infrastructure perspective, Kafka comprises a network of machines known as brokers. Today, these are probably not physical servers but, more typically, containers running on pods, which in turn run on virtualized servers running on actual processors in a physical data center somewhere in the world.
Whatever form they take, brokers are independent machines, each running the Kafka broker process. Each broker hosts a certain set of partitions and handles incoming requests to write new events to those partitions or read events from them. Brokers also replicate the partitions among themselves.
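If you want to see which brokers make up a cluster, a small sketch such as the following queries them with the AdminClient's describeCluster() call; the broker address localhost:9092 is again an assumption.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;
import java.util.Properties;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Each Node in the result is one broker process in the cluster
            for (Node broker : admin.describeCluster().nodes().get()) {
                System.out.printf("Broker %d at %s:%d%n",
                        broker.id(), broker.host(), broker.port());
            }
        }
    }
}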
Replication
Storing each partition on just one broker is not enough: Regardless of whether the brokers are bare-metal servers or managed containers, they and the storage space on which they are based are vulnerable to failure. For this reason, Kafka copies the partition data to several other brokers to keep it safe.
These copies are known as "follower replicas," whereas the main partition is known as the "leader replica" in Kafka-speak. When a producer writes data to the leader (which generally handles both read and write operations), the leader and the followers work together to replicate these new writes to the followers automatically. If one node in the cluster dies, developers can be confident that the data is safe, because another node automatically takes over its role.
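A sketch of how you might request this redundancy when creating a topic: a replication factor of 3 tells Kafka to keep one leader and two follower replicas of every partition on different brokers (topic name and broker address are assumed as before).

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Replication factor 3: one leader replica plus two follower
            // replicas per partition, each hosted on a different broker
            NewTopic topic = new NewTopic("customer-events", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}

On the producer side, setting acks=all additionally ensures that a write is only acknowledged once the in-sync followers have also received it.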