In Part 1 of this blog post, "Gain visibility with Aiven Kafka dashboards", we talked about how important it is to understand the behavior of your production Kafka system in order to keep it running continuously. In this post, we'll cover the latest addition to our Kafka dashboards: the consumer group graph, which shows consumer group lag.
The graph and its associated telemetry give you insight into the behavior of consumers within your production Kafka system. These consumers can be part of a business-critical, revenue-generating data pipeline with a strict uptime SLA, which makes it essential to know how they are behaving at all times. Before we jump into the graph, let's cover some basics.
A topic can be thought of as a unique channel over which a discussion can take place. The channel typically has producers and consumers; the former send messages while the latter read them. For example, let's say there's a channel called "soccer".
On that channel, you can read what others are saying about soccer (in this case you are acting as a consumer) and post soccer-related messages (in this case you are acting as a producer).
The producers and consumers working within a topic also work within topic partitions. But what are topic partitions, and why are they important?
Topics can be segmented, or partitioned: this is the primary mechanism of Kafka's scalability because topic partitions increase read and write bandwidth. They do this by spreading the messages across one or more partitions.
Each partition is a standalone append-only log where messages are stored in monotonically increasing positions called offsets.
For example, let's say a topic has 100 messages and is set up with 5 partitions. If the messages are spread evenly, each partition will hold around 20 messages. Partitioning a topic makes it much easier to read large volumes of Kafka messages in parallel, which leads us to consumer groups.
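To make the 100-message example concrete, here is a minimal Python sketch. It assumes a simple keyless round-robin spread, which is an illustration only: real Kafka producers pick a partition based on the message key or their partitioner configuration. The function name `partition_round_robin` is hypothetical.

```python
from collections import defaultdict


def partition_round_robin(messages, num_partitions):
    """Spread messages over partitions in turn (illustrative, not Kafka's partitioner).

    Each partition is modeled as an append-only list, so a message's
    index in that list plays the role of its offset.
    """
    partitions = defaultdict(list)
    for i, message in enumerate(messages):
        partitions[i % num_partitions].append(message)
    return dict(partitions)


messages = [f"msg-{n}" for n in range(100)]
partitions = partition_round_robin(messages, 5)

# Number of messages held by each partition.
sizes = {p: len(log) for p, log in partitions.items()}
print(sizes)
```

With 100 messages and 5 partitions, every partition ends up with 20 messages, and offsets within each partition run from 0 to 19.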
In a Kafka system, multiple consumers can read messages from a Kafka topic to improve the message consumption rate. However, how messages are spread across partitions and how consumers are mapped to those partitions can be a little unintuitive.
It also raises the question of whether a message can be read by multiple consumers, which could be undesirable for certain classes of business applications. Let's dive a little deeper.
We'll use the example of a 100-message topic with 5 partitions, and add a consumer group of 5 consumers. In this case, each consumer will be assigned to one partition and consume that partition's messages: 20 messages each.
A consumer can only read from partitions that it is assigned to.
What if the consumer group only has 3 consumers? Consumers A through C will be assigned to partitions 1 through 3. A and B will then pick up the remaining slack and be assigned to partitions 4 and 5.
The reverse can also happen: there may be more consumers than partitions. In that case, the unassigned consumers in the group sit idle until, for example, another member of the group terminates unexpectedly and its partitions are reassigned.
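The assignment described above can be sketched in a few lines of Python. This is a simplified round-robin model that mirrors the example in the text; it is not Kafka's actual assignor code, and the real result depends on the configured assignment strategy. The function name `assign_partitions` is hypothetical.

```python
def assign_partitions(consumers, partitions):
    """Hand out partitions to consumers in turn (simplified round-robin model)."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [1, 2, 3, 4, 5]

# Fewer consumers than partitions: A and B pick up the slack.
three = assign_partitions(["A", "B", "C"], partitions)

# More consumers than partitions: the extra consumers get nothing and sit idle.
seven = assign_partitions(["A", "B", "C", "D", "E", "F", "G"], partitions)
idle = [consumer for consumer, owned in seven.items() if not owned]

print(three)
print(idle)
```

Note that every partition has exactly one owner within the group, which is why a message is not read twice by members of the same consumer group.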
As you can see, Kafka's topics, partitions, and consumer groups work in a coordinated, straightforward manner. However, problems will occur, and tracking key metrics such as consumer group lag can help you avoid major ones.
Consumer group lag is one of the most important graphs in your Kafka dashboard. Significant lag could indicate one of two scenarios: a terminated consumer, or a consumer that is alive but unable to keep up with the rate of incoming messages.
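As a rough illustration of what the lag number means, consumer lag for a partition is commonly computed as the partition's latest (log end) offset minus the offset the consumer group has committed. The sketch below uses plain dictionaries with made-up offsets instead of a live cluster; treating a missing committed offset as 0 is an assumption made for simplicity, and the function name `partition_lag` is hypothetical.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset.

    Assumption: a partition with no committed offset is treated as
    committed at 0, i.e. nothing has been consumed yet.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }


# Made-up sample offsets for a 3-partition topic.
log_end_offsets = {0: 120, 1: 95, 2: 200}
committed_offsets = {0: 120, 1: 90, 2: 150}

lag = partition_lag(log_end_offsets, committed_offsets)
total_lag = sum(lag.values())

print(lag)
print(total_lag)
```

In this sample, partition 0 is fully caught up while partition 2 accounts for most of the group's lag, which is the kind of imbalance a per-partition view makes visible.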
If the lag is due to a short spike in messages, it is usually nothing to worry about. However, if the lag persists for tens of minutes or hours, it is a bad sign and you will want to dig into why it is happening.
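One way to tell a short spike from persistent lag is to check whether the lag stayed above a threshold for a minimum duration. The following is a minimal sketch over sampled lag values; the threshold of 1000 messages and the 10-minute window are arbitrary assumptions for illustration, not recommended values, and `lag_persists` is a hypothetical helper.

```python
def lag_persists(samples, threshold, min_duration_s):
    """Return True if lag stayed above threshold for at least min_duration_s.

    samples: list of (timestamp_seconds, total_lag) tuples sorted by time.
    """
    breach_start = None
    for timestamp, lag in samples:
        if lag > threshold:
            if breach_start is None:
                breach_start = timestamp  # lag just crossed the threshold
            if timestamp - breach_start >= min_duration_s:
                return True
        else:
            breach_start = None  # lag recovered, reset the window
    return False


# A one-minute spike that recovers on its own.
spike = [(0, 10), (60, 5000), (120, 20)]

# Lag sampled every minute that stays high for 15 minutes.
sustained = [(t * 60, 5000) for t in range(16)]

print(lag_persists(spike, threshold=1000, min_duration_s=600))
print(lag_persists(sustained, threshold=1000, min_duration_s=600))
```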
The fix could be anything from something as trivial as restarting a dead consumer to investigating what's happening with the brokers, or with the consumers and the underlying hosts they are running on.
If the brokers are behaving normally and aren't overloaded, it's likely that the issues are with consumers that are unable to keep up with the incoming message rate, and troubleshooting resources should be directed towards them.
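To illustrate the difference between a terminated consumer and a slow one, the sketch below compares two offset snapshots for a single partition. It is a heuristic under stated assumptions, not a feature of Kafka or of the dashboard: a consumer whose committed offset does not move while new messages arrive is labeled "stalled", and one that advances more slowly than messages are produced is labeled "falling behind". The function name and labels are hypothetical.

```python
def classify_consumer(prev, curr):
    """Classify consumer progress between two snapshots of one partition.

    prev / curr: dicts with 'end' (log end offset) and
    'committed' (the consumer group's committed offset).
    """
    produced = curr["end"] - prev["end"]
    consumed = curr["committed"] - prev["committed"]
    lag = curr["end"] - curr["committed"]

    if produced > 0 and consumed == 0 and lag > 0:
        return "stalled"  # possibly a terminated consumer
    if consumed < produced:
        return "falling behind"  # alive, but slower than the producers
    return "keeping up"


stalled = classify_consumer({"end": 100, "committed": 100}, {"end": 200, "committed": 100})
slow = classify_consumer({"end": 100, "committed": 90}, {"end": 300, "committed": 150})
healthy = classify_consumer({"end": 100, "committed": 90}, {"end": 150, "committed": 150})

print(stalled, slow, healthy)
```

A "stalled" result points towards restarting or replacing the consumer, while "falling behind" points towards the consumer's throughput or the hosts it runs on.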
Consumer group telemetry and our latest graph give you insight into consumer behavior within your production Kafka system, so you can resolve potential problems such as failed or slow-moving consumers.
For business-critical data pipelines, you need to know if a consumer (or a consumer group) is lagging or has terminated. A consumer that has terminated and is not restarted in a timely manner could have serious consequences downstream. Therefore, it's very important to monitor the progress and status of consumers at all times.
By making additional consumer telemetry and the associated graph available, we hope to make it possible to detect such consumer failures and help you take preventative action in order to keep your data pipeline running smoothly.
If you're not using Aiven Kafka dashboards yet, head to our documentation site and follow the simple setup instructions.