Processing streaming events with Apache Kafka

The Streamer

Kafka Streams

The Kafka Streams Java API allows easy access to all computational primitives of stateless and stateful stream processing (actions such as filtering, grouping, aggregating, merging, etc.), removing the need to write framework code against the consumer API to do all these things. Kafka Streams also supports the potentially large number of states that result from the calculations of the data stream processing. It also keeps the data collections and enrichments either in memory or in a local key-value store (based on RocksDB).

Combining stateful data processing and high scalability turns out to be a big challenge. The Streams API solves both problems: On the one hand, it maintains the status on the local hard disk and of internal topics in the Kafka cluster; on the other hand, the client application (i.e., the Kafka Streams cluster) automatically scales as Kafka adds or removes new client instances.

In a typical microservice, the application performs stream processing in addition to other functions. For example, a mail order company combines shipment events with events in a product information change log, which contains customer records to create shipment notification objects that other services then convert into email and text messages. However, the shipment notification service might also be required to provide a REST API for synchronous key queries by mobile applications once the apps render views that show the status of a particular shipment.

The service reacts to events. In this case, it first merges three data streams with each other and may perform further state calculations (state windows) according to the merges. Nevertheless, the service also serves HTTP requests at its REST endpoint.

Because Kafka Streams is a Java library and not a set of dedicated infrastructure components, it is trivial to integrate directly into other applications and develop sophisticated, scalable, fault-tolerant stream processing. This feature is one of the key differences from other stream-processing frameworks like ksqlDB, Apache Storm, or Apache Flink.

ksqlDB

Kafka Streams, as a Java-based stream-processing API, is very well suited for creating scalable, standalone stream-processing applications. However, it is also suitable for enriching the stream-processing functions available in Java applications.

What if applications are not in Java or the developers are looking for a simpler solution? What if it seems advantageous from an architectural or operational point of view to implement a pure stream-processing job without a web interface or API to provide the results to the frontend? In this case, ksqlDB enters the play.

The highly specialized database is optimized for applications that process data streams. It runs on its own scalable, fault-tolerant cluster and provides a REST interface for applications that then submit new stream-processing jobs to execute and retrieve results.

The stream-processing jobs and queries are written in SQL. Thanks to the interface options by REST and the command line, it does not matter which programming language the applications use. It is a good idea to start in development mode, either with Docker or a single node running natively on a development machine, or directly in a supported service.

In summary, ksqlDB is a standalone, SQL-based, stream-processing engine that continuously processes event streams and makes the results available to database-like applications. It aims to provide a conceptual model for most Kafka-based stream-processing application workloads. For comparison, Listing 4 shows an example of logic that continuously collects and counts the values of a message attribute. The beginning of the listing shows the Kafka Streams version; the second part shows the version written in ksqlDB.

Listing 4

Kafka Streams vs ksqlDB

01 [...]
02 // Kafka Streams (Java):
03 builder
04   .stream("input-stream",
05     Consumed.with(Serdes.String(), Serdes.String()))
06   .groupBy((key, value) -> value)
07   .count()
08   .toStream()
09   .to("counts", Produced.with(Serdes.String(), Serdes.Long()));
10
11 // ksqlDB (SQL):
12
13 SELECT x, count(*) FROM stream GROUP BY x EMIT CHANGES;
14 [...]

Conclusions and Outlook

Kafka has established itself on the market as the de facto standard for event streaming; many companies use it in production in various projects. Meanwhile, Kafka continues to develop.

With all the advantages of Apache Kafka, it is important not to ignore the disadvantages: Event streaming is a fundamentally new concept. Development, testing, and operation is therefore completely different from using known infrastructures. For example, Apache Kafka uses rolling upgrades instead of active-passive deployments.

That Kafka is a distributed system also has an effect on production operations. Kafka is more complex than plain vanilla messaging systems, and the hardware requirements are also completely different. For example, Apache ZooKeeper requires stable and low latencies. In return, the software processes large amounts of data (or small but critical business transactions) in real time and is highly available. This process not only involves sending from A to B, but also loosely coupled and scalable data integration of source and target systems with Kafka Connect and continuous event processing (stream processing) with Kafka Streams or ksqlDB.

In this article, I explained the basic concepts of Apache Kafka, but there are other helpful components as well:

  • The REST proxy [7] takes care of communication over HTTP(S) with Kafka (producer, consumer, administration).
  • The Schema Registry [4] regulates data governance; it manages and versions schemas and enforces certain data structures.
  • Cloud services simplify the operation of fully managed serverless infrastructures, in particular, but also of platform-as-a-service offerings.

Apache Kafka version 3.0 will remove the dependency on ZooKeeper (for easier operation and even better scalability and performance), offer fully managed (serverless) Kafka Cloud services, and allow hybrid deployments (edge, data center, and multicloud).

Two videos from this year's Kafka Summit [8] [9], an annual conference of the Kafka community, also offer an outlook on the future and a review of the history of Apache Kafka. The conference took place online for the first time in 2020 and counted more than 30,000 registered developers.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy ADMIN Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

comments powered by Disqus