Dec 7, 2021 by Monica Puisoru
What Are Kafka Streams?
What you need to know about one of the most popular data processing systems
Kafka Streams has grown in popularity in recent years. It is a Java library created to facilitate the development of applications and microservices that process input messages and convert them into output records.
The Kafka Streams API is used to process messages from a specific topic in real time, before they are consumed by their subscribers. Its major benefit is parallelism while performing complex data processing, as the messages are handled as a continuous real-time flow of records. Many well-known companies use Kafka Streams, including Pinterest, The New York Times, Zalando, Rabobank, and Trivago.
In Kafka Streams, a topology can be represented as a graph in which the nodes are stream processors and the edges are streams. This representation shows how the original flow of records undergoes a series of transformations to be converted into the final output.
Fig. 1 – Processor topology
The source processor produces an input stream from the messages collected from the input topics and forwards it to the stream processors downstream. A stream processor represents a processing step in the topology, which can alter the content of a message or be applied only for tracking purposes. Finally, the sink processor takes the records it receives and forwards them to a Kafka topic.
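The flow from source to sink can be sketched in plain Java. This is only a simplified model of a linear topology and does not use the Kafka Streams library: the class name TopologySketch, the run method and the use of lists as stand-ins for topics are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Plain-Java model of a linear processor topology (not the Kafka Streams API).
final class TopologySketch {

    // inputTopic and the returned list stand in for Kafka topics.
    static List<String> run(List<String> inputTopic, List<UnaryOperator<String>> processors) {
        List<String> outputTopic = new ArrayList<>();
        for (String message : inputTopic) {                  // source processor: reads the input topic
            String current = message;
            for (UnaryOperator<String> processor : processors) {
                current = processor.apply(current);          // stream processors: applied in order
            }
            outputTopic.add(current);                        // sink processor: writes to the output topic
        }
        return outputTopic;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> processors = List.of(s -> s.trim(), s -> s.toUpperCase());
        System.out.println(run(List.of(" hello ", "kafka"), processors)); // prints [HELLO, KAFKA]
    }
}
```

In a real application the graph may fan out or merge, and each node would be backed by the library's own processors rather than simple functions.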
Streams and tables are the most important abstractions in Kafka Streams, used for storing and processing the data present in topics.
A stateful application has to remember state about previous events when processing the next ones, and tables in Kafka Streams illustrate this approach. While a KStream is an abstraction of a record stream, a KTable is defined by a set of key-value pairs. Such a table supports a series of operations on the data flow, like queries, inserts or updates.
One of the major differences between streams and tables is that in a KStream each new data record is interpreted as an insert, while in a KTable it is interpreted as an upsert: an update of the last value for the same record key if the key already exists, or an insert if it doesn’t. In a KTable, a record with a null value is interpreted as a delete for that key.
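The difference between the two interpretations can be illustrated with a small plain-Java model. It does not use the KStream or KTable classes; the class StreamVsTableSketch, its methods and the sample records are hypothetical and exist only to show the insert, upsert and delete semantics described above.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of stream vs. table semantics (not the Kafka Streams API).
final class StreamVsTableSketch {

    // Helper that builds a key-value record; the value may be null.
    static Map.Entry<String, String> rec(String key, String value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Stream view: every record is an independent insert, so all records are kept.
    static List<Map.Entry<String, String>> asStream(List<Map.Entry<String, String>> records) {
        return new ArrayList<>(records);
    }

    // Table view: a record upserts the value for its key; a null value deletes the key.
    static Map<String, String> asTable(List<Map.Entry<String, String>> records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Map.Entry<String, String> r : records) {
            if (r.getValue() == null) {
                table.remove(r.getKey());
            } else {
                table.put(r.getKey(), r.getValue());
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = List.of(
                rec("alice", "Paris"), rec("bob", "Rome"),
                rec("alice", "Berlin"), rec("bob", null));
        System.out.println(asStream(records).size()); // prints 4
        System.out.println(asTable(records));         // prints {alice=Berlin}
    }
}
```

The same four records yield four stream entries but a single table row, because the second "alice" record overwrites the first and the null value removes "bob".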
Stateful transformations are usually used for complex stream processing: the state is kept locally in a so-called “state store”, and subsequent actions are performed based on that state. A state store is automatically created when applying one of the following operations:
- Aggregations take an input stream or table and combine its records into a single output table. They are applied to records having the same key, on windowed or non-windowed messages.
- Joins are used to merge two input streams or tables based on their keys and create a new stream or table. A record with a null key or value will not trigger the join operation.
  - Inner joins produce an output when both input sources have records with the same key.
  - Left joins produce an output for each record in the left or primary input source. If the other source doesn’t have a value for a given key, that side of the result is set to null.
  - Outer joins produce an output for each record in either input source. If only one of the sources contains a key, the value from the other one will be null.
Fig. 3 – Join types
Fig. 4 – KTable/KStream duality
- Windowing is used to group records having the same key into windows for stateful operations, such as aggregations or joins. A grace period can be specified, which determines how long after a window ends late records are still accepted, and therefore when the window results become final. Records that arrive with a small delay but within the grace period are still processed in their window. Once the grace period has expired, any further records for that window are discarded.
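To make the interaction between windows and the grace period more concrete, here is a minimal plain-Java sketch of a windowed count. It is a model, not the library implementation: the class WindowedCountSketch, fixed-size non-overlapping windows, non-negative timestamps and the exact boundary rule (a window is treated as closed once the highest timestamp seen so far reaches window end plus grace) are all assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a windowed count with a grace period (not the Kafka Streams API).
final class WindowedCountSketch {
    private final long windowSizeMs;
    private final long graceMs;
    private long streamTime = Long.MIN_VALUE;                  // highest timestamp seen so far
    private final Map<String, Long> counts = new TreeMap<>();  // "key@windowStart" -> count

    WindowedCountSketch(long windowSizeMs, long graceMs) {
        this.windowSizeMs = windowSizeMs;
        this.graceMs = graceMs;
    }

    // Returns true if the record was counted, false if its window was already closed.
    boolean process(String key, long timestampMs) {
        streamTime = Math.max(streamTime, timestampMs);
        long windowStart = timestampMs - (timestampMs % windowSizeMs);
        long windowEnd = windowStart + windowSizeMs;           // exclusive upper bound
        if (streamTime >= windowEnd + graceMs) {
            return false;                                      // grace period expired: record is dropped
        }
        counts.merge(key + "@" + windowStart, 1L, Long::sum);  // aggregate per key and window
        return true;
    }

    Map<String, Long> counts() {
        return counts;
    }

    public static void main(String[] args) {
        WindowedCountSketch counter = new WindowedCountSketch(10, 5);
        counter.process("a", 1);   // window [0, 10)
        counter.process("a", 12);  // window [10, 20)
        counter.process("a", 8);   // late, but within the grace period of window [0, 10)
        counter.process("a", 16);  // advances time past the grace period of window [0, 10)
        counter.process("a", 9);   // too late: dropped
        System.out.println(counter.counts()); // prints {a@0=2, a@10=2}
    }
}
```

A longer grace period tolerates more out-of-order data, at the cost of keeping window state around longer before the results can be considered final.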
Stateless operations are methods that can be applied to Kafka messages without depending on previous processing. Some of the most common ones are:
- Branch is used to route records based on given predicates. Each record is evaluated against the predicates in order and forwarded to the output stream of the first one that matches. A record that doesn’t match any of the predicates will be discarded.
- Filter evaluates a boolean function and keeps only the records matching the condition. The opposite transformation is FilterNot, which discards the records matching the condition and returns only the records for which the boolean function returns false.
- FlatMap generates zero, one or more output records based on a single input record. Map generates one output record based on one input record.
- SelectKey assigns a new key to each record based on the original (key, value) pair.
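- Peek is used for logging or troubleshooting, as it allows tracking the records without altering them and returns the stream unchanged so that processing can continue. The forEach operation also performs an action on each record, but it is terminal, so no further processing can be chained after it.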
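Several of these operations have close analogues in Java's own java.util.stream package, which makes them easy to try out without a Kafka cluster. The sketch below is an analogy rather than Kafka Streams code: StatelessOpsSketch, branch and longWordsUpper are hypothetical names, and branch is written by hand to show the first-match routing described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Plain-Java analogy for stateless operations (not the Kafka Streams API).
final class StatelessOpsSketch {

    // Branch: each record goes to the output of the first matching predicate;
    // records matching no predicate are dropped.
    static <T> List<List<T>> branch(List<T> records, List<Predicate<T>> predicates) {
        List<List<T>> outputs = new ArrayList<>();
        for (int i = 0; i < predicates.size(); i++) {
            outputs.add(new ArrayList<>());
        }
        for (T r : records) {
            for (int i = 0; i < predicates.size(); i++) {
                if (predicates.get(i).test(r)) {
                    outputs.get(i).add(r);
                    break;                                   // first match wins
                }
            }
        }
        return outputs;
    }

    // flatMap: zero or more outputs per input; filter: keep matches; map: one output per input.
    static List<String> longWordsUpper(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .filter(word -> word.length() > 1)
                .map(word -> word.toUpperCase())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Predicate<Integer>> predicates = List.of(n -> n < 0, n -> n % 2 == 0);
        System.out.println(branch(List.of(-4, 2, 3, -1, 8), predicates)); // prints [[-4, -1], [2, 8]]
        System.out.println(longWordsUpper(List.of("kafka streams", "a b"))); // prints [KAFKA, STREAMS]
    }
}
```

Note that -4 is both negative and even but lands only in the first output, and 3 matches neither predicate and is dropped.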
Benefits of Kafka Streams
In recent years, Kafka Streams has gained popularity due to its important benefits in data processing, especially because it is known as the first library to provide “exactly-once” processing semantics. This means that the read, process and write operations are executed exactly one time per record, so duplicates are avoided and no record is skipped.
Moreover, Kafka Streams is elastic, highly scalable and fault-tolerant, and it is supported on all three major platforms (Mac, Windows, Linux). In addition, it supports stateless, stateful and windowed operations on streams, providing complex transformations on high-volume data.
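Exactly-once processing is not the default and has to be enabled through the processing.guarantee setting. The sketch below builds such a configuration with a plain java.util.Properties object; the application id and broker address are placeholders, and the value exactly_once_v2 assumes Kafka Streams 3.0 or newer (earlier versions use exactly_once).

```java
import java.util.Properties;

// Builds a Kafka Streams configuration with exactly-once processing enabled.
final class ExactlyOnceConfigSketch {

    static Properties streamsProperties() {
        Properties props = new Properties();
        props.setProperty("application.id", "demo-streams-app");     // placeholder application name
        props.setProperty("bootstrap.servers", "localhost:9092");    // placeholder broker address
        // The default guarantee is at_least_once; exactly_once_v2 assumes Kafka Streams 3.0+.
        props.setProperty("processing.guarantee", "exactly_once_v2");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProperties().getProperty("processing.guarantee")); // prints exactly_once_v2
    }
}
```

In an application, these properties would be passed to the KafkaStreams constructor together with the topology.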
Kafka Streams Alternatives
Besides Kafka Streams, there are other open-source frameworks that can be used to process data within Kafka. One of them is Apache Spark, developed to perform batch processing, streaming, machine learning and interactive queries. Its major benefit is that it can process vast amounts of data and allows monitoring and analyzing the performance of jobs in real time.
We can also rely on Apache Flink, used for data analytics in clusters, which supports Java and Scala. Its major advantage is that it can process the data almost right away. Moreover, it is a hybrid framework that supports both batch and stream processing.
Apache Storm is also widely used in real-time analytics for processing streams of data because it can process over a million tuples per second per node. It is also easy to set up and use, and it guarantees that every record will be processed through the topology at least once.
In conclusion, there are a variety of frameworks developed for processing large amounts of data, which offer major benefits and are easy to use. Kafka Streams remains one of the most popular data processing systems, as it can perform complex operations on a continuous real-time flow of records.