Processing streaming events with Apache Kafka
The Streamer
Kafka Ecosystem
If there were only brokers managing partitioned, replicated topics with an ever-growing collection of producers and consumers, which in turn write and read events, this would already be quite a useful system.
However, the Kafka developer community has learned that further application scenarios soon emerge among users in practice. To implement them, users usually end up developing similar functionality around the Kafka core, building layers of application code to handle certain recurring tasks.
The code that Kafka users develop this way may do important work, but it usually has little to do with their actual line of business and generates value only indirectly. Ideally, the Kafka community or infrastructure providers would supply such code.
And they do: Kafka Connect [3], the Confluent Schema Registry [4], Kafka Streams [5], and ksqlDB [6] are examples of this kind of infrastructure code. Here, I look at each of these examples in turn.
Data Integration with Kafka Connect
Information often resides in systems other than Kafka. Sometimes you want to import data from these systems into Kafka topics; sometimes you want to export data from Kafka topics to these systems. The Kafka Connect integration API is the right tool for the job.
Kafka Connect comprises an ecosystem of connectors on the one hand and a client application on the other. The client application runs as a server process on hardware separate from the Kafka brokers – not just as a single Connect worker, but as a cluster of Connect workers that share the load of transferring data into and out of Kafka from and to external systems. This arrangement makes the service scalable and fault tolerant.
Kafka Connect also relieves the user of the need to write complicated code: A JSON configuration is all that is required. The excerpt in Listing 3 shows how to stream data from Kafka into an Elasticsearch installation.
Listing 3
JSON for Kafka Connect
[...]
{
  "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
  "topics": "my_topic",
  "connection.url": "http://elasticsearch:9200",
  "type.name": "_doc",
  "key.ignore": "true",
  "schema.ignore": "true"
}
[...]
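To take effect, a configuration like this is submitted to the REST interface of a Connect worker. What follows is a rough sketch, assuming a worker running in distributed mode and reachable at connect-worker:8083 and an arbitrary connector name; any HTTP client would do, but this example uses Java's built-in HttpClient (the text block requires Java 15 or later):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
  public static void main(String[] args) throws Exception {
    // Connector name plus the configuration from Listing 3, wrapped in the
    // JSON envelope that the Connect REST interface expects.
    String body = """
        {
          "name": "elasticsearch-sink",
          "config": {
            "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
            "topics": "my_topic",
            "connection.url": "http://elasticsearch:9200",
            "type.name": "_doc",
            "key.ignore": "true",
            "schema.ignore": "true"
          }
        }
        """;

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://connect-worker:8083/connectors"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    // A 201 Created response indicates that the worker accepted the connector.
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}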
Kafka Streams and ksqlDB
In a very large Kafka-based application, consumers tend to grow in complexity. For example, you might start with a simple stateless transformation (e.g., obfuscating personal information or changing the format of a message to meet internal schema requirements), but Kafka users soon end up with complex aggregations, enrichments, and more.
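A minimal sketch of such a stateless transformation with the plain consumer and producer APIs might look like the following; the topic names, the broker address, and the rather crude email masking are made up for illustration:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ObfuscatingBridge {
  public static void main(String[] args) {
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "broker:9092");
    consumerProps.put("group.id", "obfuscator");
    consumerProps.put("key.deserializer", StringDeserializer.class.getName());
    consumerProps.put("value.deserializer", StringDeserializer.class.getName());

    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "broker:9092");
    producerProps.put("key.serializer", StringSerializer.class.getName());
    producerProps.put("value.serializer", StringSerializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
         KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
      consumer.subscribe(List.of("raw_events"));
      while (true) {
        // Read raw events, blank out anything that looks like an email
        // address, and write the result to a second topic.
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
          String masked = rec.value().replaceAll("\\S+@\\S+", "***");
          producer.send(new ProducerRecord<>("clean_events", rec.key(), masked));
        }
      }
    }
  }
}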
The KafkaConsumer API does not provide much support for such operations: Developers on the user side therefore have to write a fair amount of framework code to deal with time windows, late-arriving messages, lookup tables, aggregations by key, and more.
When programming, it is also important to remember that operations such as aggregation and enrichment are typically stateful. The Kafka application must not lose this state, yet it must remain highly available at the same time: If an application instance fails, its state must not be lost with it.
You could try to devise a scheme for maintaining this state somewhere, but such code is devilishly complicated to write and debug at scale, and it would not directly improve the lives of Kafka users. Therefore, Apache Kafka offers an API for stream processing: Kafka Streams.
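With the Streams DSL, the framework takes care of windowing, per-key aggregation, and the backing state stores. A minimal sketch that counts events per key in five-minute windows could look like this; the topic names, application ID, and broker address are placeholders, and TimeWindows.ofSizeWithNoGrace requires Kafka Streams 3.0 or later:

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.TimeWindows;

public class ClickCounter {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counter");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    builder.<String, String>stream("clicks")               // one event per click, keyed by user
        .groupByKey()                                      // aggregate per user
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
        .count()                                           // stateful count, kept in a local store
        .toStream()
        // flatten the windowed key into a plain string before writing
        .map((windowedKey, count) -> KeyValue.pair(
            windowedKey.key() + "@" + windowedKey.window().start(), count.toString()))
        .to("clicks_per_user_5min");

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}

The count() step is the stateful part: Kafka Streams materializes it in a local store and backs it with a changelog topic in Kafka, so the aggregation state survives the failure of an application instance.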