Troubleshoot Apache Kafka® consumer connections
Apache Kafka® consumers sometimes experience disconnections from a cluster node.
On one hand, isolated occurrences are expected and are well mitigated by the Apache Kafka® protocol, for example when partition leadership moves to a different node so that the cluster remains available and balanced. On the other hand, recurring occurrences often raise concerns because they can increase consumer lag and degrade cluster performance.
What is observed
The client behaviour may vary depending on the library and version in use
(remember to keep it up to date). A typical consumer disconnection log looks
like this:

```
[AdminClient] Node 3 disconnected. (org.apache.kafka.clients.NetworkClient:937)
```
On the server, the GroupCoordinator performs some consumer-group
rebalancing activity and logs messages such as:

```
Member ABC has left group my-app through explicit LeaveGroup; client reason: the consumer unsubscribed from all topics (kafka.coordinator.group.GroupCoordinator)
```

or

```
Timed out HeartbeatRequest
```
What it means
While processing messages, consumer instances may break, fail to acquire more local resources, or fail to complete their current task on time. In such cases, they leave their group either explicitly or by failing to send a heartbeat to the cluster.
As a consequence, the partition assignments of the consumer leaving the group are automatically reassigned to the other consumers of the group, if there are any.
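
You can observe these rebalances from the client side by registering a rebalance listener when subscribing. The following is a minimal sketch using the Kafka Java client; the bootstrap server, topic name, and `group.id` are placeholders.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceLoggingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-app");                  // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Called when this consumer gives up partitions, for example because
                    // another member left or joined the group.
                    System.out.println("Partitions revoked: " + partitions);
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Called once the rebalance completes with the new assignment.
                    System.out.println("Partitions assigned: " + partitions);
                }
            });

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Process each record here.
                    System.out.printf("%s-%d@%d: %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

When another member leaves or joins the group, the two callbacks fire on this consumer, making the reassignment described above visible in the application logs.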
Why it occurs now and not before
- The consumer of a given topic A can get dropped because it relies on the same `group.id` as another consumer leaving or joining the group, even if the latter consumes a totally different topic B. A rebalance triggered by any one consumer still affects the other consumers in the group.
- Although the consumer heartbeat and the message processing are executed by two different client threads, the machine on which the consumer application runs can get overloaded, so that it cannot send a heartbeat to the group coordinator after `heartbeat.interval.ms` (default: 3 s) and before `session.timeout.ms` (default: 45 s).
- If the consumer logic consists of expensive transformations or synchronous API calls, the average single-message processing time can increase dramatically, as in the following example scenarios:
  - A subsystem it depends on has an issue (for example, the destination server is not available).
  - The incoming message size increased.
- If the consumer lag has grown enough that the consumer now processes `max.poll.records` (default: 500) on every poll, the average batch processing time and resource usage can increase dramatically, with the following consequences:
  - The consumer may run out of memory, resulting in an application crash.
  - `max.poll.interval.ms` is exceeded between polls, so the consumer sends a LeaveGroup request to the coordinator rather than a heartbeat (see the sketch after this list).
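
As a rough illustration of the last point, the sketch below times how long each polled batch takes to process and warns when it approaches `max.poll.interval.ms`. It assumes the Kafka Java client; the bootstrap server, topic, `group.id`, the 80% warning threshold, and the `process` method are illustrative placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollIntervalWatchdog {
    // Mirrors the consumer's max.poll.interval.ms setting (default: 300000 ms = 5 min).
    private static final int MAX_POLL_INTERVAL_MS = 300_000;

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-app");                  // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, MAX_POLL_INTERVAL_MS);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // placeholder topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));

                long start = System.currentTimeMillis();
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                long elapsed = System.currentTimeMillis() - start;

                // If handling a batch (up to max.poll.records messages) takes longer than
                // max.poll.interval.ms, the consumer is considered failed and leaves the group.
                if (elapsed > MAX_POLL_INTERVAL_MS * 0.8) {
                    System.out.printf("Processed %d records in %d ms, close to max.poll.interval.ms (%d ms)%n",
                            records.count(), elapsed, MAX_POLL_INTERVAL_MS);
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // Stand-in for an expensive transformation or synchronous API call.
        System.out.printf("%s-%d@%d%n", record.topic(), record.partition(), record.offset());
    }
}
```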
How to solve it
- Make sure that your consumers rely on a unique `group.id`, not shared with unrelated applications.
- It is generally a good idea to reduce `max.poll.records` or `max.partition.fetch.bytes`, and monitor whether the situation improves (see the configuration sketch after this list).
- If you need to stick to large batches (such as in windowing and data lake ingestion use cases), temporarily increasing `session.timeout.ms` might do the trick.
- Remember that `session.timeout.ms` must be in the allowable range configured on the broker by `group.min.session.timeout.ms` and `group.max.session.timeout.ms`.
- It is recommended to configure `heartbeat.interval.ms` to be no more than a third of `session.timeout.ms`. This ensures that even if a heartbeat or two are lost to a transient network issue, the consumer is still considered alive.
- If the problem arose because of an increase in the volume of incoming messages, scale the consumers up (vertically) or out (horizontally).
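
The sketch below combines these recommendations into a single consumer configuration using the Kafka Java client. The specific values, as well as the `group.id` and bootstrap server, are examples only, not recommended settings; adjust them to your own workload and to the broker's limits.

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public final class TunedConsumerConfig {
    // Example values only; tune them against your own processing times and the broker's
    // group.min.session.timeout.ms / group.max.session.timeout.ms limits.
    public static Properties build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-ingestion");           // group.id dedicated to this application
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 200);                  // down from the default 500
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 512 * 1024);  // 512 KiB, down from the default 1 MiB
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60_000);             // raised from the default 45 s
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 15_000);          // well under a third of session.timeout.ms
        return props;
    }
}
```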
If the above didn't help you resolve your issue and you have further concerns or questions, contact Aiven support.