What Is Apache Kafka?

What you need to know about one of the most-used tools in the world

Over the last few years, Apache Kafka has become one of the most widely used tools in the world, as more and more companies adopt the distributed event streaming platform.

Now maintained by the Apache Software Foundation, Apache Kafka lets applications publish and subscribe to streams of records through a message-queue-style mechanism. As a broker-based solution, it maintains streams of data as records within a cluster of servers, and it provides data persistence by storing messages across multiple server instances, organized into topics.

A Brief History

Kafka was initially developed at LinkedIn by Jay Kreps, Neha Narkhede and Jun Rao, and had its initial release in January 2011. Jay Kreps named the framework after the author Franz Kafka because it is “a system optimized for writing,” and he admired Kafka’s work. Kreps is currently co-founder and CEO of Confluent and also an author of several popular projects, such as Apache Samza, Voldemort and Azkaban.

More than a decade ago, as the site’s complexity increased, LinkedIn’s infrastructure had to be completely redesigned. The process started with a migration from a monolithic architecture to one based on microservices, which enabled higher scalability across multiple platforms, such as search, profiles and communications.

Along the way, several custom data pipelines were built for different streaming and queuing needs. Instead of maintaining and scaling each of them separately, the developers decided to create a single distributed publisher-subscriber platform, known today as Kafka.

Who Uses Kafka?

Apache Kafka has become one of the most widely used open-source stream-processing platforms for collecting, processing, storing and analyzing data. Today it is used by companies across many fields, including computer software, financial services, healthcare, government and transportation. Some of the most well-known organizations using Kafka are Spotify, Pinterest, Netflix, The New York Times, Airbnb, Coursera, Cisco, Uber, PayPal, Strava, Tumblr, Twitter and Yahoo.

Architecture

Fig. 1 – Interaction between producers and consumers
Source: https://programmertoday.com/apache-kafka-architecture-and-components/

APIs

Kafka exposes five core APIs for Java and Scala (a minimal producer sketch follows the list):

  1. Admin API: Used to manage topics, brokers and other Kafka objects
  2. Producer API: Used to write streams of data to Kafka topics
  3. Consumer API: Used to read streams of data from topics in the Kafka cluster
  4. Kafka Streams API: Used to build stream-processing applications and microservices
  5. Kafka Connect API: Used to build and run reusable data import/export connectors that consume or produce streams from and to external systems, integrating them with Kafka
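To make the list concrete, here is a minimal sketch of the Producer API in Java. The broker address (localhost:9092) and the topic name (demo-topic) are placeholder assumptions, not values from this article:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // "demo-topic" is a hypothetical topic name
                producer.send(new ProducerRecord<>("demo-topic", "hello, Kafka"));
            } // close() flushes any buffered records before returning
        }
    }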

Key Concepts

To have a better understanding of the way Kafka works, we must get familiar with several key concepts.

A topic is a category of messages that consumers can subscribe to, identified by its name. Consumers therefore don’t automatically receive every message published to the cluster; instead, they subscribe only to the topics that are relevant to them.

Topics are split into ordered partitions, numbered incrementally from 0. When creating a new topic, we must explicitly specify its number of partitions (see the Admin API sketch below).
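For example, the Admin API can create a topic with an explicit partition count. This is a sketch; the broker address, the topic name, and the values 3 and 2 are illustrative assumptions:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions (numbered 0, 1, 2) and a replication factor of 2
                NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }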

Fig. 2 – Distribution of a topic into partitions
Source: https://kafka.apache.org/081/documentation.html

Each message within a partition gets an incremental ID called an offset. Data is stored only for a configurable retention period, and once written to a partition it cannot be modified. Unless a key is provided, messages are distributed across partitions by the producer; messages that share a key always land in the same partition. A topic can have any number of partitions, and more partitions allow more parallelism.
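To illustrate how keys influence partition assignment, the sketch below reuses the producer from the earlier example; the key "user-42" and both payloads are hypothetical values:

    // Records sharing a key always hash to the same partition,
    // which preserves per-key ordering.
    producer.send(new ProducerRecord<>("demo-topic", "user-42", "profile updated"));

    // A record without a key lets the producer choose the partition itself.
    producer.send(new ProducerRecord<>("demo-topic", "no key attached"));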

A Kafka cluster consists of one or more brokers, each identified by a unique ID. After connecting to any single broker, a client is connected to the entire cluster.

Fig. 3 – Data distribution within three brokers
Source: https://kontext.tech/column/streaming-analytics/474/kafka-topic-partitions-walkthrough-via-python

Distribution

Every partition is replicated across one or more servers to ensure fault tolerance. Exactly one server acts as the leader and handles all read/write requests for the partition, while zero or more servers act as followers, which replicate the leader.

If the leader fails, a new leader is automatically elected from among the followers.
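One way to observe this layout is to describe a topic with the Admin API and print each partition’s leader and replicas. A sketch, with the broker address and topic name again as placeholder assumptions:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class DescribeTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singletonList("demo-topic"))
                        .all().get().get("demo-topic");
                // Each partition reports its current leader and its replica set
                desc.partitions().forEach(p -> System.out.printf(
                        "partition %d: leader=%s replicas=%s%n",
                        p.partition(), p.leader(), p.replicas()));
            }
        }
    }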

Fig. 4 – Leaders and followers example
Source: https://medium.com/@mukeshkumar_46704/in-depth-kafka-message-queue-principles-of-high-reliability-42e464e66172

Producers write data to topics and can choose among three acknowledgement settings for data writes (a config sketch follows the list):

  • acks = 0: doesn’t wait for any acknowledgement, which may cause data loss
  • acks = 1: waits only for the leader’s acknowledgement, which may cause limited data loss
  • acks = all: waits for acknowledgements from the leader and its in-sync replicas, which prevents data loss (given a sufficient replication factor)
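In the Java producer, the guarantee level is a single configuration property. A short sketch, extending the producer setup shown earlier:

    // "all" waits for the leader and all in-sync replicas;
    // use "1" or "0" to trade durability for lower latency.
    props.put("acks", "all");
    KafkaProducer<String, String> producer = new KafkaProducer<>(props);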

Consumers read data from the brokers and can run in separate processes or on separate machines. Each consumer is labeled with a consumer group name, and each message published to a topic is delivered to exactly one consumer instance within each subscribed consumer group.
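A minimal consumer sketch in Java; the group name demo-group, the broker address and the topic name are hypothetical values:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class SimpleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "demo-group"); // consumers sharing this name split the partitions
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("demo-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // Every record carries the partition and offset it was read from
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }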

Fig. 5 – Consumer groups
Source: http://bhargavisatpathy.github.io/apache-kafka/

Pros and Cons of Using Kafka

One of the major benefits of using Kafka is its scalability, which can be achieved simply by adding nodes. Other advantages are high throughput and low latency: Kafka can process high-volume, real-time data with latencies in the range of milliseconds.

Another important advantage is the variety of use cases it supports, such as log aggregation, web activity tracking, payment and financial transaction processing, monitoring patients in hospitals, and so on.

Thanks to its message-broker design, Kafka is persistent by default: messages are written to disk and replicated across brokers, so they are not lost, which makes the platform durable and reliable.

Apart from its numerous benefits, we must also mention some of Kafka’s limitations. One of them is that compressing and decompressing the data flow on brokers and consumers can reduce performance. Also, Kafka doesn’t ship with a complete set of monitoring tools, and some of its client APIs don’t support wildcard topic selection, matching only exact topic names.

Conclusion

All things considered, over the last few years the benefits of Kafka have outweighed its downsides, making it one of the most widely used tools across a wide range of today’s leading companies. Kafka’s impact has also been amplified by the capabilities introduced with the Kafka Streams API.

If you want to learn more about Kafka Streams, read this article by Cognizant Softvision Enterprise Coffee Engineer, Monica Puisoru.
