Apache Kafka is a highly scalable, fast, and fault-tolerant distributed messaging system used for real-time data processing and streaming applications.
Kafka relies on the filesystem for storage and caching, which makes it fast and reliable. It averts data loss, which makes it a fault-tolerant messaging system. With real-time processing, data is delivered rapidly and replicated with guaranteed durability and availability.
Kafka follows a publish-subscribe messaging pattern. It is therefore used in cases where traditional messaging systems like JMS, RabbitMQ, and ActiveMQ may not even be considered due to requirements on volume, responsiveness, and fault tolerance.
Moreover, Kafka has a built-in mechanism to resend data if a failure occurs during processing, which makes it highly fault-tolerant.
As for performance, in a typical environment Kafka easily reaches a throughput of 100,000 messages per second.
Kafka APIs can be generally divided into the following:
- Producer API: Used for sending a stream of data to one or more topics.
- Consumer API: Used for reading data from one or more topics.
- Connector API: Used for building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems.
- Streams API: Used for consuming an input stream from one or more topics and producing an output stream to one or more topics.
Common terms used in Kafka
A topic is a named category to which producers publish a stream of data. A topic can be subscribed to by zero or more consumers, which read the data published to it.
For distributed processing, each topic can have multiple partitions, and several producers can write to these partitions simultaneously.
The topic is one of the main abstractions in Kafka, while partitions can be considered subsets of a topic.
Partitions are managed by Kafka brokers. Producers write messages to a topic or to specific partitions within it. Each message in a partition is identified by a unique, sequential ID called an offset. The offset can be thought of as a monotonically increasing logical timestamp within the partition. Consumers request messages from a given offset onward. Within each consumer group, messages are guaranteed to be consumed at least once.
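The partition-and-offset model above can be sketched with a minimal in-memory log. This is an illustration of the concept only, not Kafka's actual implementation; the class and method names are made up for this example:

```python
# Minimal in-memory sketch of a partition log and offsets (illustrative only).

class Partition:
    def __init__(self):
        self._log = []  # append-only message log

    def append(self, message):
        """Append a message and return its offset (its position in the log)."""
        self._log.append(message)
        return len(self._log) - 1

    def read_from(self, offset):
        """Return all messages from the given offset onward."""
        return self._log[offset:]

p = Partition()
assert p.append("m0") == 0        # first message gets offset 0
assert p.append("m1") == 1        # offsets increase monotonically
assert p.read_from(1) == ["m1"]   # a consumer resumes from a stored offset
```

Because the log is append-only and offsets are sequential, a consumer that remembers its last offset can resume exactly where it left off, which is the basis of the at-least-once guarantee.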
Suppose there are two brokers (Broker 1 and Broker 2) and a message has been published to Broker 1. What happens if Broker 1 fails due to some error? Without replication, the message would be lost and could never be recovered.
To overcome this, the replication concept comes into the picture. Replication means duplicating messages across different brokers. This guarantees that a published message is not lost and can still be consumed even if a broker fails due to a program or machine error. Replication therefore provides better durability and higher availability.
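The failure scenario above can be sketched with a toy replication model. In this hedged sketch, every published message is copied to all brokers in the replica set (real Kafka replicates per partition through a leader and followers, which this model deliberately simplifies):

```python
# Illustrative model: publishing copies a message to every live broker,
# so losing one broker does not lose the message.

class Broker:
    def __init__(self, name):
        self.name = name
        self.log = []
        self.alive = True

class ReplicatedTopic:
    def __init__(self, brokers):
        self.brokers = brokers

    def publish(self, message):
        for b in self.brokers:
            if b.alive:
                b.log.append(message)

    def consume(self):
        # Read from any surviving replica.
        for b in self.brokers:
            if b.alive:
                return list(b.log)
        raise RuntimeError("all replicas down")

b1, b2 = Broker("broker-1"), Broker("broker-2")
topic = ReplicatedTopic([b1, b2])
topic.publish("order-created")
b1.alive = False                              # Broker 1 crashes
assert topic.consume() == ["order-created"]   # message survives on Broker 2
```

The same publish that would have been lost in the two-broker scenario above now survives the crash, because a second copy exists on Broker 2.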
Replication is handled by the brokers and is largely transparent to producers and consumers. The diagram below shows the concept of replication.
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic.
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines. If all the consumer instances have the same consumer group, then the records will be effectively load-balanced over the consumer instances. If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
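The two delivery behaviors described above can be sketched with a small simulation. Here each record is handed to exactly one instance per group (round-robin, for simplicity; real Kafka assigns partitions to instances rather than individual records), so a shared group load-balances while distinct groups each receive every record. All names here are invented for the example:

```python
# Sketch of consumer-group delivery semantics (illustrative only).
from collections import defaultdict
from itertools import cycle

def deliver(records, consumers_by_group):
    """consumers_by_group: {group_name: [consumer_name, ...]}.
    Returns {consumer_name: [records it received]}."""
    received = defaultdict(list)
    pickers = {g: cycle(cs) for g, cs in consumers_by_group.items()}
    for record in records:
        # Each group independently picks one of its instances per record.
        for group, picker in pickers.items():
            received[next(picker)].append(record)
    return dict(received)

# One group with two instances: records are load-balanced across them.
out = deliver(["r1", "r2"], {"g1": ["c1", "c2"]})
assert out == {"c1": ["r1"], "c2": ["r2"]}

# Two separate groups: every record is broadcast to each group.
out = deliver(["r1", "r2"], {"g1": ["a"], "g2": ["b"]})
assert out == {"a": ["r1", "r2"], "b": ["r1", "r2"]}
```

This is why scaling out consumers within one group increases throughput, while adding a new group gives an independent full copy of the stream.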
The producer sends messages to Kafka in the form of records. A record is a key-value pair that also carries the topic name and, optionally, the partition number to send it to. The Kafka broker stores records inside topic partitions. Record order is maintained at the partition level, and the record's key provides the basis on which the partition is determined.
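Key-based partition selection can be sketched as a deterministic hash of the key modulo the partition count. Kafka's default partitioner uses a Murmur2 hash; the CRC32 used below is just a stand-in to keep the sketch self-contained:

```python
# Sketch: records with the same key always land in the same partition,
# which preserves per-key ordering. (Kafka's real default partitioner
# uses Murmur2, not CRC32.)
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

p1 = choose_partition(b"user-42", 3)
p2 = choose_partition(b"user-42", 3)
assert p1 == p2      # same key -> same partition, so per-key order is kept
assert 0 <= p1 < 3   # result is always a valid partition index
```

A practical consequence: all events for one key (say, one user ID) are read in order by whichever consumer owns that partition, while different keys spread across partitions for parallelism.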
Advantages of Kafka
- High throughput
- High scalability
- Continuous streaming
- Low latency
- Fault tolerance
Notable Use Cases and Clients:
- Tracking web activity
- The New York Times