At Acceldata we spend a lot of time working with enterprises to optimize high throughput, low latency data streaming applications using Apache Kafka.
Some of our customers include ad networks with over 100 billion events a day on their Kafka infrastructure. Amongst various metrics that Kafka monitoring includes consumer lag is nearly the most important of them all.
In this blog we will explore potential reasons for Kafka consumer lag and what you could do when you experience lag.
Apache Kafka is no longer used just by the Internet hyperscalers. Apache Kafka is used in the enterprise to deal with exploding streaming data. It powers compelling consumer experiences such as real-time personalization, recommendation and next best action.
Kafka allows low latency ingestion of large amounts of data into data lakes or data warehouses. Kafka allows businesses to get real-time intelligence into their business operations that allows them to react in real -time to changing business conditions.
Mission critical business processes are plagued by consumer lags, and experienced practitioners agree that preventing consumer lag is the biggest challenge in Kafka.
Kafka is a distributed, partitioned, replicated commit log service. Kafka is run as a cluster of multiple servers or containers.The cluster stores streams of records in categories called topics with each record consisting of a key, value and a timestamp.
Kafka Producers are processes that publish data into Kafka topics, while Consumers are processes that read messages off a Kafka topic.Topics are divided into partitions which contain messages in an append-only sequence. Each message in a partition is assigned and identified by its unique offset.
Partitions can hold multiple partition logs allowing consumers to read from in parallel.These partitions are replicated across multiple Kafka clusters for resilience. A variety of applications could produce data and send them towards the Kafka broker.
Consumers are applications that read messages from such Kafka brokers. Consumers read messages from a specific offset and are allowed to read from any offset point they choose. Consumer groups include a set of consumer processes subscribing to a given topic.Each consumer group is assigned a set of partitions to consume from.
They will receive messages from a different subset of the partitions in the topic. Kafka guarantees that the message is only read by a single consumer in the group. This philosophy is called exactly once delivery.
Consumer lag indicates the lag between Kafka producers and consumers. If the rate of production of data far exceeds the rate at which it is getting consumed, consumer groups will exhibit lag.
It can be understood very succinctly as the gap between the difference between the latest offset and consumer offset. In general, enterprises talk about Kafka but they are referring to the physical Kafka brokers - a server either physical or container that runs Kafka. Brokers are the physical repositories of logs that store and serve Kafka messages.
Data storage inside a Kafka broker is done through topics. Topics are divided into partitions and brokers write data into specific partitions. As the broker writes data - it keeps track of the last offset and records it as the log end offset.
Consumers on the other end may have complex application logic embedded inside the consumer processes. If there are way too many producers writing data to the same topic when there are a limited number of consumers then then the reading processes will always be slow. The real time objectives are lost.
So just like multiple producers which can write to the same topic, multiple consumers can read from the same topic, by getting data from one of the partitions inside the topic. It is common for consumer groups to have equal numbers of consumers as partitions, since they are doing low-latency operations. Good design includes the creation of a large number of partitions and is a fundamental way of scaling.
Just like Brokers keep track of their write position in each Partition, each Consumer keeps track of “read position” in each partition whose data it is consuming. It is the only way to keep track of the data that it has read, this is periodically persisted to Zookeeper or a Kafka Topic itself.
It’s possible that some consumer groups exhibit more lag than others, because they may have more complex logic. It can also occur because of stuck consumers, slow message processing, incrementally more messages produced than consumed.
Rebalance events can also be unpleasant contributors to consumer lag. In real time conditions, new addition of new consumers to the consumer group causes partition ownership to change - this is helpful if it’s done to increase parallelism.
However, such changes are undesirable when triggered due to a consumer process crashing down. During this event, consumers can’t consume messages, and hence consumer lag occurs. Also, when partitions are moved from one consumer to another, the consumer loses its current state including caches.
There are several Kafka monitoring tools both in the open-source community and commercially. We’ve provided end-to-end visibility into Kafka and allow enterprises to scale technology adoption without worrying about operational blindness.