Processing streaming events with Apache Kafka

The Streamer

Article from ADMIN 61/2021
Apache Kafka reads and writes events virtually in real time, and you can extend it to take on a wide range of roles in today's world of big data and event streaming.

Event streaming is a modern concept that aims at continuously processing large volumes of data. The open source Apache Kafka [1] has established itself as a leader in the field. Kafka was originally developed by the career platform LinkedIn to process massive volumes of data in real time. Today, Kafka is used by more than 80 percent of the Fortune 100 companies, according to the project's own information.

Apache Kafka captures, processes, stores, and integrates data on a large scale. The software supports numerous applications, including distributed logging, stream processing, data integration, and pub/sub messaging. Kafka continuously processes the data almost in real time, without first writing to a database.

Kafka can connect to virtually any other data source in traditional enterprise information systems, modern databases, or the cloud. Together with the connectors available for Kafka Connect, Kafka forms an efficient integration point without hiding the logic or routing inside a centralized infrastructure.

Many organizations use Kafka to monitor operating data. Kafka collects statistics from distributed applications to create centralized feeds with real-time metrics. Kafka also serves as a central source of truth for bundling data generated by various components of a distributed system. Kafka fields, stores, and processes data streams of any size, both real-time streams and data from other interfaces, such as files or databases. As a result, companies entrust Kafka with technical tasks such as transforming, filtering, and aggregating data, and they also use it for critical business applications, such as payment transactions.

Kafka also works well as a modernized version of the traditional message broker, efficiently decoupling a process that generates events from one or more other processes that receive events.

What Is Event Streaming?

In the parlance of the event-streaming community, an event is any type of action, incident, or change that a piece of software or an application identifies or records. An event could be a payment, a mouse click on a website, or a temperature reading recorded by a sensor, along with a description of what happened in each case.

An event connects a notification (a temporal element on the basis of which the system can trigger another action) with a state. In most cases the message is quite small – usually a few bytes or kilobytes. It is usually in a structured format, such as JSON, or is included in an object serialized with Apache Avro or Protocol Buffers (protobuf).
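
For illustration only, a small sensor event of this kind might look like the following JSON document (the field names are hypothetical and not part of any Kafka-defined schema):

{
  "sensor_id": "sensor-42",
  "timestamp": "2021-03-01T10:21:00Z",
  "temperature_celsius": 21.5
}

Here the timestamp carries the notification aspect, while the reading itself is the state.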

Architecture and Concepts

Figure 1 shows a sensor analysis use case in the Internet of Things (IoT) environment with Kafka. This scenario provides a good overview of the individual Kafka components and their interaction with other technologies.

Figure 1: Kafka evaluates the measuring points of the IoT sensors. A Kafka producer feeds them into the platform, and a Kafka consumer receives them for monitoring. At the same time, they are transferred to a Spark-based analysis platform by Kafka Connect. © Confluent

Kafka's architecture is based on the abstract idea of a distributed commit log. By dividing the log into partitions, Kafka can distribute it across multiple servers and scale out. Kafka models events as key-value pairs.

Internally these keys and values consist only of byte sequences. However, in your preferred programming language they can often be represented as structured objects in that language's type system. Conversion between language types and internal bytes is known as (de-)serialization in Kafka-speak. As mentioned earlier, serialized formats are mostly JSON, JSON Schema, Avro, or protobuf.
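
A minimal sketch of this serialization setup with the Java client might look as follows. The broker address, topic name, and payload are assumptions for illustration, and the built-in StringSerializer stands in for an Avro or protobuf serializer:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Serializers convert language-level keys and values into byte sequences
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key: a sensor ID; value: a JSON-encoded reading (see the example above)
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-42",
                "{\"sensor_id\":\"sensor-42\",\"temperature_celsius\":21.5}"));
        }
    }
}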

What exactly do the keys and values represent? Values typically represent an application domain object or some form of raw message input in serialized form – such as the output of a sensor.

Although complex domain objects can be used as keys, they usually consist of primitive types like strings or integers. The key part of a Kafka event does not necessarily uniquely identify an event, as would the primary key of a row in a relational database. Instead, it is used to determine an identifiable variable in the system (e.g., a user, a job, or a specific connected device).

Although it might not sound that significant at first, keys determine how Kafka deals with things like parallelization and data localization, as you will see.
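
Continuing the producer sketch above (topic and key names are again assumptions): with the Java client's default partitioner, records that share a key are hashed to the same partition, so all readings from one device stay in order within that partition.

// Both records carry the key "sensor-42", so the default partitioner
// hashes them to the same partition of the "sensor-readings" topic.
producer.send(new ProducerRecord<>("sensor-readings", "sensor-42",
    "{\"temperature_celsius\":21.5}"));
producer.send(new ProducerRecord<>("sensor-readings", "sensor-42",
    "{\"temperature_celsius\":21.7}"));

// A record from another device hashes to a (possibly) different partition.
producer.send(new ProducerRecord<>("sensor-readings", "sensor-17",
    "{\"temperature_celsius\":19.2}"));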

Kafka Topics

Events tend to accumulate. That is why the IT world needs a system to organize them. Kafka's most basic organizational unit is the "topic," which roughly corresponds to a table in a relational database. For developers working with Kafka, the topic is the abstraction they think about most. You create topics to store different types of events. Topics can also comprise filtered and transformed versions of existing topics.
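
As a sketch of how a topic comes into being, the Java AdminClient can create one with a chosen number of partitions and a replication factor; the broker address, topic name, and sizing below are assumptions for illustration:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Six partitions for parallelism, replication factor 3 for fault tolerance
            NewTopic topic = new NewTopic("sensor-readings", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}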

A topic is a logical construct. Kafka stores the events for a topic in a log. These logs are easy to understand because they are simple data structures with known semantics.

You should keep three things in mind: (1) Kafka always appends events to the end of the log. When the software writes a new message, it always ends up in the last position. (2) You read events by seeking to an arbitrary position (offset) in the log and then browsing the entries sequentially. Kafka does not support queries in the style of ANSI SQL that search for a specific value. (3) The events in the log are immutable – once written, past events are very difficult to undo.
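
A hedged sketch of the read side: a consumer subscribes to the topic and reads events sequentially from its current offset in a poll loop. The broker address, group ID, and topic name are assumptions for illustration:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SensorMonitor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "monitoring");              // hypothetical consumer group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // start at the oldest retained offset
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-readings"));
            while (true) {
                // Each poll returns the next batch of events after the consumer's current offset
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
                }
            }
        }
    }
}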

The logs themselves are basically perpetual. Traditional messaging systems in companies use queues as well as topics. These queues buffer messages on their way from source to destination, but they usually delete the messages after consumption. They are not designed to retain messages so that they can later be fed to the same or to another application (Figure 2).

Figure 2: Kafka's consumers sometimes consume the data in real time. © Confluent

Because Kafka topics are available as logfiles, the data they contain is not just temporarily available, as in traditional messaging systems, but available permanently. You can configure each topic so that the data expires either after a certain age (retention time) or as soon as the topic reaches a certain size. The time span can range from seconds to years, or the data can be kept indefinitely.
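
One hedged way to set such a limit: the retention.ms topic configuration can be changed at runtime, for example with the Java AdminClient. The topic name and the seven-day value below are assumptions for illustration:

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "sensor-readings");
            // Keep events for seven days (in milliseconds), then let Kafka expire them
            AlterConfigOp setRetention = new AlterConfigOp(
                new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> update =
                Collections.singletonMap(topic, Collections.singletonList(setRetention));
            admin.incrementalAlterConfigs(update).all().get();
        }
    }
}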

The logs underlying Kafka topics are stored as files on a disk. When Kafka writes an event to a topic, it is as permanent as the data in a classical relational database.

The simplicity of the log and the immutability of the content it contains are the key to Kafka's success as a critical component in modern data infrastructures. A (real) decoupling of the systems therefore works far better than with traditional middleware (Figure 3), which relies on the extract, transform, load (ETL) mechanism or an enterprise service bus (ESB) and which is based either on web services or message queues. Kafka simplifies domain-driven design (DDD) but also allows communication with outdated interfaces [2].

Figure 3: Apache Kafka's domain-driven design helps decouple middleware. © Confluent
