Apache Kafka® metrics sent to Datadog

When you configure a Datadog service integration for Aiven for Apache Kafka®, Aiven sends Kafka metrics to Datadog. You can also customize selected metrics using the Aiven CLI.

Prerequisites

note

Datadog integration is not available for new Startup-2 plans in Aiven for Apache Kafka®.

Existing customers who already use Startup-2 with Datadog integration can continue to create Startup-2 services with Datadog integration and use their existing services without upgrading to a higher plan.

Aiven recommends Business-4 or higher for Aiven for Apache Kafka® services with Datadog integration to avoid resource pressure on Startup-2 plans.

If you are an existing customer and cannot create a Startup-2 service with Datadog integration in a new project, contact Aiven Support.

Metrics sent to Datadog

When you configure a Datadog integration for your Aiven for Apache Kafka® service, Aiven collects Kafka metrics covering broker throughput, request handling, replication, the controller, consumers, producers, the JVM, quotas, and tiered storage.

Datadog metric names follow a different naming convention from Prometheus metric names. The metrics in this section are grouped by category to help you find the relevant Datadog metric.

note

The available metrics depend on your Apache Kafka version, service configuration, and metadata mode:

  • KRaft controller metrics apply to services running in KRaft mode (Apache Kafka 3.9 and later).
  • Group coordinator metrics apply to the new group coordinator (Apache Kafka 4.x and later).
  • ZooKeeper metrics apply only to services running in ZooKeeper mode.

Broker throughput metrics

Metric | Description
kafka.messages_in.rate | Rate of messages received by the broker
kafka.net.bytes_in.rate | Rate of bytes received by the broker across all topics
kafka.net.bytes_out.rate | Rate of bytes sent by the broker across all topics
kafka.net.bytes_rejected.rate | Rate of bytes rejected by the broker
kafka.net.processor.avg.idle.pct.rate | Average idle percentage of network processor threads

Per-topic throughput metrics

Metric | Description
kafka.topic.messages_in.rate | Rate of incoming messages per topic (tagged by topic and partition)
kafka.topic.net.bytes_in.rate | Rate of incoming bytes per topic
kafka.topic.net.bytes_out.rate | Rate of outgoing bytes per topic
kafka.topic.net.bytes_rejected.rate | Rate of rejected bytes per topic

Request handling metrics

Metric | Description
kafka.request.channel.queue.size | Number of requests in the request queue
kafka.request.handler.avg.idle.pct.rate | Average idle percentage of request handler threads (1-minute rate)
kafka.request.produce.rate | Rate of Produce requests
kafka.request.produce.failed.rate | Rate of failed Produce requests
kafka.request.produce.time.avg | Average total time for Produce requests (ms)
kafka.request.produce.time.99percentile | 99th-percentile total time for Produce requests (ms)
kafka.request.fetch_consumer.rate | Rate of FetchConsumer requests
kafka.request.fetch_consumer.time.avg | Average total time for FetchConsumer requests (ms)
kafka.request.fetch_consumer.time.99percentile | 99th-percentile total time for FetchConsumer requests (ms)
kafka.request.fetch_follower.rate | Rate of FetchFollower requests
kafka.request.fetch_follower.time.avg | Average total time for FetchFollower requests (ms)
kafka.request.fetch_follower.time.99percentile | 99th-percentile total time for FetchFollower requests (ms)
kafka.request.fetch.failed.rate | Rate of failed Fetch requests
kafka.request.metadata.time.avg | Average total time for Metadata requests (ms)
kafka.request.metadata.time.99percentile | 99th-percentile total time for Metadata requests (ms)
kafka.request.offsets.time.avg | Average total time for Offsets requests (ms)
kafka.request.offsets.time.99percentile | 99th-percentile total time for Offsets requests (ms)
kafka.request.update_metadata.time.avg | Average total time for UpdateMetadata requests (ms)
kafka.request.update_metadata.time.99percentile | 99th-percentile total time for UpdateMetadata requests (ms)
kafka.request.producer_request_purgatory.size | Number of requests waiting in the producer purgatory
kafka.request.fetch_request_purgatory.size | Number of requests waiting in the fetch purgatory

Replication metrics

Metric | Description
kafka.replication.active_controller_count | Number of active controllers (should be 1)
kafka.replication.leader_count | Number of partitions for which this broker is the leader
kafka.replication.partition_count | Number of partitions on this broker
kafka.replication.under_replicated_partitions | Number of under-replicated partitions
kafka.replication.offline_partitions_count | Number of offline partitions
kafka.replication.isr_expands.rate | Rate of in-sync replica (ISR) expand events
kafka.replication.isr_shrinks.rate | Rate of ISR shrink events
kafka.replication.leader_elections.rate | Rate of leader elections
kafka.replication.unclean_leader_elections.rate | Rate of unclean leader elections. Unclean leader elections can lead to data loss
kafka.replication.max_lag | Maximum replication lag across all followers

KRaft controller metrics

Metric | Description
kafka.controller.active_broker_count | Number of active brokers
kafka.controller.global_topic_count | Total number of topics
kafka.controller.global_partition_count | Total number of partitions
kafka.controller.topics_to_delete_count | Number of topics pending deletion
kafka.controller.replicas_to_delete_count | Number of replicas pending deletion
kafka.controller.election_from_eligible_leader_replicas_per_sec | Rate of leader elections from eligible replicas

Group coordinator metrics

Metric | Description
kafka.server.group_coordinator_metrics.group_count | Total number of consumer groups
kafka.server.group_coordinator_metrics.consumer_group_count | Number of consumer groups in the new protocol
kafka.server.group_coordinator_metrics.consumer_group_rebalance_count | Total count of consumer group rebalances
kafka.server.group_coordinator_metrics.consumer_group_rebalance_rate | Rate of consumer group rebalances
kafka.server.group_coordinator_metrics.group_completed_rebalance_count | Total count of completed group rebalances
kafka.server.group_coordinator_metrics.group_completed_rebalance_rate | Rate of completed group rebalances
kafka.server.group_coordinator_metrics.streams_group_count | Number of Kafka Streams groups
kafka.server.group_coordinator_metrics.streams_group_rebalance_count | Total count of Kafka Streams group rebalances
kafka.server.group_coordinator_metrics.streams_group_rebalance_rate | Rate of Kafka Streams group rebalances
kafka.server.group_coordinator_metrics.offset_commit_count | Total count of offset commits
kafka.server.group_coordinator_metrics.offset_commit_rate | Rate of offset commits
kafka.server.group_coordinator_metrics.offset_deletion_count | Total count of offset deletions
kafka.server.group_coordinator_metrics.offset_deletion_rate | Rate of offset deletions
kafka.server.group_coordinator_metrics.offset_expiration_count | Total count of offset expirations
kafka.server.group_coordinator_metrics.offset_expiration_rate | Rate of offset expirations
kafka.server.group_coordinator_metrics.num_partitions | Number of partitions managed by the group coordinator
kafka.server.group_coordinator_metrics.partition_load_time_avg | Average partition load time for the group coordinator (ms)
kafka.server.group_coordinator_metrics.partition_load_time_max | Maximum partition load time for the group coordinator (ms)
kafka.server.group_coordinator_metrics.batch_flush_rate | Rate of batch flushes in the group coordinator
kafka.server.group_coordinator_metrics.batch_flush_time_ms_max | Maximum batch flush time in the group coordinator (ms)
kafka.server.group_coordinator_metrics.batch_linger_time_ms_max | Maximum batch linger time in the group coordinator (ms)
kafka.server.group_coordinator_metrics.event_processing_time_ms_max | Maximum event processing time in the group coordinator (ms)
kafka.server.group_coordinator_metrics.event_purgatory_time_ms_max | Maximum time events spent in the group coordinator purgatory (ms)
kafka.server.group_coordinator_metrics.event_queue_size | Size of the group coordinator event queue
kafka.server.group_coordinator_metrics.event_queue_time_ms_max | Maximum time events waited in the group coordinator queue (ms)
kafka.server.group_coordinator_metrics.thread_idle_ratio_avg | Average idle ratio of group coordinator threads

Group metadata manager metrics

Metric | Description
kafka.server.group_metadata_manager.num_groups | Number of consumer groups managed by this broker
kafka.server.group_metadata_manager.num_groups_preparing_rebalance | Number of consumer groups preparing for rebalance
kafka.server.group_metadata_manager.num_offsets | Number of committed offsets stored by this broker

Log metrics

Metric | Description
kafka.log.flush_rate.rate | Rate of log flush operations

note

Per-partition log size and offset metrics (kafka.log.log_size, kafka.log.log_start_offset, and kafka.log.log_end_offset) are not collected by default. You can enable them by configuring the Datadog integration. See Configurable custom metrics.

JVM metrics

Metric | Description
jvm.heap_memory | Heap memory used (bytes)
jvm.heap_memory_committed | Heap memory committed (bytes)
jvm.heap_memory_init | Heap memory initially requested (bytes)
jvm.heap_memory_max | Maximum heap memory (bytes)
jvm.non_heap_memory | Non-heap memory used (bytes)
jvm.non_heap_memory_committed | Non-heap memory committed (bytes)
jvm.non_heap_memory_init | Non-heap memory initially requested (bytes)
jvm.non_heap_memory_max | Maximum non-heap memory (bytes)
jvm.gc.cms.count | Number of concurrent (CMS) garbage collections
jvm.gc.parnew.time | Time spent in ParNew garbage collection (ms)
jvm.gc.eden_size | Size of the eden space (bytes)
jvm.gc.survivor_size | Size of the survivor space (bytes)
jvm.gc.old_gen_size | Size of the old generation (bytes)
jvm.gc.metaspace_size | Size of the metaspace (bytes)
jvm.buffer_pool.direct.capacity | Capacity of direct buffer pools (bytes)
jvm.buffer_pool.direct.count | Number of direct buffers in the pool
jvm.buffer_pool.direct.used | Memory used by direct buffer pools (bytes)
jvm.buffer_pool.mapped.capacity | Capacity of mapped buffer pools (bytes)
jvm.buffer_pool.mapped.count | Number of mapped buffers in the pool
jvm.buffer_pool.mapped.used | Memory used by mapped buffer pools (bytes)
jvm.cpu_load.process | JVM process CPU load
jvm.cpu_load.system | System CPU load
jvm.thread_count | Number of live JVM threads
jvm.loaded_classes | Number of currently loaded classes
jvm.unloaded_classes | Number of classes unloaded since JVM start
jvm.os.open_file_descriptors | Number of open file descriptors

Quota metrics

Metric | Description
kafka.bandwidth.quota.byte.rate | Bandwidth quota byte rate per client or user
kafka.bandwidth.quota.throttle.time | Bandwidth quota throttle time per client or user (ms)
kafka.request.quota.request.time | Request quota request time per client or user (ms)
kafka.request.quota.throttle.time | Request quota throttle time per client or user (ms)

Producer metrics

Metric | Description
kafka.producer.request_rate | Producer request rate
kafka.producer.response_rate | Producer response rate
kafka.producer.request_latency_avg | Average producer request latency (ms)
kafka.producer.request_latency_max | Maximum producer request latency (ms)
kafka.producer.requests_in_flight | Number of producer requests in flight
kafka.producer.message_rate | Rate of messages sent by producers
kafka.producer.bytes_out | Rate of bytes sent by producers
kafka.producer.record_send_rate | Rate of records sent per topic (tagged by topic and partition)
kafka.producer.records_send_rate | Rate of records sent by producers
kafka.producer.records_per_request | Average records per producer request
kafka.producer.record_error_rate | Rate of errored producer records
kafka.producer.record_retry_rate | Rate of retried producer records
kafka.producer.record_size_avg | Average producer record size (bytes)
kafka.producer.record_size_max | Maximum producer record size (bytes)
kafka.producer.record_queue_time_avg | Average record queue time for producers (ms)
kafka.producer.record_queue_time_max | Maximum record queue time for producers (ms)
kafka.producer.batch_size_avg | Average batch size for producers (bytes)
kafka.producer.batch_size_max | Maximum batch size for producers (bytes)
kafka.producer.compression_rate | Compression rate per topic (tagged by topic and partition)
kafka.producer.compression_rate_avg | Average compression rate for producers
kafka.producer.available_buffer_bytes | Available producer buffer memory (bytes)
kafka.producer.buffer_bytes_total | Total producer buffer memory (bytes)
kafka.producer.bufferpool_wait_time | Time producer threads blocked on the buffer pool (ms)
kafka.producer.waiting_threads | Number of waiting producer threads
kafka.producer.io_wait | Average I/O wait time for producers (ns)
kafka.producer.metadata_age | Age of producer metadata (seconds)
kafka.producer.throttle_time_avg | Average producer throttle time (ms)
kafka.producer.throttle_time_max | Maximum producer throttle time (ms)

Consumer metrics

Metric | Description
kafka.consumer.messages_in | Rate of messages consumed
kafka.consumer.bytes_in | Rate of bytes consumed
kafka.consumer.bytes_consumed | Rate of bytes consumed per topic (tagged by topic and partition)
kafka.consumer.records_consumed | Rate of records consumed per topic (tagged by topic and partition)
kafka.consumer.records_per_request_avg | Average records per fetch request per topic (tagged by topic and partition)
kafka.consumer.fetch_rate | Consumer fetch rate
kafka.consumer.fetch_size_avg | Average fetch size per topic (tagged by topic and partition)
kafka.consumer.fetch_size_max | Maximum fetch size per topic (tagged by topic and partition)
kafka.consumer.max_lag | Maximum consumer lag
kafka.consumer.kafka_commits | Rate of offset commits through Kafka (legacy)
kafka.consumer.zookeeper_commits | Rate of offset commits through ZooKeeper (legacy)

note

Some client-side producer and consumer metrics require additional configuration to appear in Datadog. See Add client-side Apache Kafka® producer and consumer Datadog metrics.

Consumer lag and offset metrics

Metric | Description
kafka.broker_offset | Latest offset on the broker for a topic-partition (tagged by topic and partition)
kafka.consumer_offset | Committed consumer offset for a topic-partition (tagged by topic and partition)
kafka.consumer_lag | Consumer lag in messages (tagged by topic and partition)
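
The three metrics above are related. This page does not state how kafka.consumer_lag is derived, so the following is only a sketch under the common assumption that lag is the latest broker offset minus the committed consumer offset for the same topic-partition and consumer group. The offset values are made-up samples, not real measurements.

```shell
# Illustrative sample values, not real measurements.
broker_offset=1500     # kafka.broker_offset for one topic-partition
consumer_offset=1420   # kafka.consumer_offset for the same topic-partition and group

# Assumption: lag is the number of messages the group has not yet committed.
consumer_lag=$((broker_offset - consumer_offset))
echo "consumer_lag=$consumer_lag"
# prints: consumer_lag=80
```

Under this assumption, a lag that keeps growing means consumers are committing offsets more slowly than producers are appending messages.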

ZooKeeper metrics

These metrics apply only to services running in ZooKeeper mode.

Metric | Description
kafka.session.zookeeper.disconnect.rate | Rate of ZooKeeper disconnections
kafka.session.zookeeper.expire.rate | Rate of ZooKeeper session expirations
kafka.session.zookeeper.readonly.rate | Rate of ZooKeeper read-only connections
kafka.session.zookeeper.sync.rate | Rate of ZooKeeper sync connections

Tiered storage metrics

For services with tiered storage enabled, the following metrics are collected automatically to monitor the health and performance of tiered storage operations.

Throughput metrics

Metric | Description
kafka.tiered_storage.remote_copy_bytes.rate | Rate of bytes copied to remote storage
kafka.tiered_storage.remote_copy_requests.rate | Rate of copy requests to remote storage
kafka.tiered_storage.remote_fetch_bytes.rate | Rate of bytes fetched from remote storage
kafka.tiered_storage.remote_fetch_requests.rate | Rate of fetch requests from remote storage
kafka.tiered_storage.remote_delete_requests.rate | Rate of delete requests to remote storage
kafka.tiered_storage.build_remote_log_aux_state_requests.rate | Rate of remote log aux-state build requests

Error metrics

Metric | Description
kafka.tiered_storage.remote_copy_errors.rate | Rate of errors when copying segments to remote storage
kafka.tiered_storage.remote_fetch_errors.rate | Rate of errors when fetching segments from remote storage
kafka.tiered_storage.remote_delete_errors.rate | Rate of errors when deleting segments from remote storage
kafka.tiered_storage.build_remote_log_aux_state_errors.rate | Rate of errors rebuilding remote log auxiliary state

Lag metrics

Metric | Description
kafka.tiered_storage.remote_copy_lag_bytes | Bytes eligible for tiering but not yet copied
kafka.tiered_storage.remote_copy_lag_segments | Segments eligible for tiering but not yet copied
kafka.tiered_storage.remote_delete_lag_bytes | Bytes eligible for deletion but not yet deleted
kafka.tiered_storage.remote_delete_lag_segments | Segments eligible for deletion but not yet deleted

Storage metrics

Metric | Description
kafka.tiered_storage.remote_log_size_bytes | Total size of remote log in bytes
kafka.tiered_storage.remote_log_size_computation_time | Time taken to compute remote log size (ms)
kafka.tiered_storage.remote_log_metadata_count | Number of remote log metadata entries

Thread pool metrics

Metric | Description
kafka.tiered_storage.remote_log_manager_tasks_avg_idle_percent | Average idle percentage of remote log manager task threads
kafka.tiered_storage.remote_log_reader_avg_idle_percent | Average idle percentage of remote log reader threads
kafka.tiered_storage.remote_log_reader_task_queue_size | Size of the remote log reader task queue
kafka.tiered_storage.remote_log_reader_fetch.rate | Rate of remote log reader fetch operations
kafka.tiered_storage.remote_log_reader_fetch_time_avg | Average time for remote log reader fetch operations (ms)
kafka.tiered_storage.remote_log_reader_fetch_time_99percentile | 99th-percentile time for remote log reader fetch (ms)
kafka.tiered_storage.delayed_remote_fetch_expires.rate | Rate of expired delayed remote fetch operations

Throttling metrics

Metric | Description
kafka.tiered_storage.remote_copy_throttle_time_avg | Average copy throttle time for remote storage (ms)
kafka.tiered_storage.remote_copy_throttle_time_max | Maximum copy throttle time for remote storage (ms)
kafka.tiered_storage.remote_fetch_throttle_time_avg | Average fetch throttle time for remote storage (ms)
kafka.tiered_storage.remote_fetch_throttle_time_max | Maximum fetch throttle time for remote storage (ms)

Cache metrics

Metric | Description
kafka.tiered_storage.cache.chunk_cache_size | Total size of the chunk cache
kafka.tiered_storage.cache.chunk_cache_hits | Chunk cache hit count
kafka.tiered_storage.cache.chunk_cache_misses | Chunk cache miss count
kafka.tiered_storage.cache.chunk_cache_evictions | Chunk cache eviction count
kafka.tiered_storage.cache.chunk_cache_eviction_weight | Total eviction weight from the chunk cache
kafka.tiered_storage.cache.segment_manifest_cache_size | Total size of the segment manifest cache
kafka.tiered_storage.cache.segment_manifest_cache_hits | Segment manifest cache hit count
kafka.tiered_storage.cache.segment_manifest_cache_misses | Segment manifest cache miss count
kafka.tiered_storage.cache.segment_manifest_cache_evictions | Segment manifest cache eviction count
kafka.tiered_storage.cache.segment_indexes_cache_size | Total size of the segment indexes cache
kafka.tiered_storage.cache.segment_indexes_cache_hits | Segment indexes cache hit count
kafka.tiered_storage.cache.segment_indexes_cache_misses | Segment indexes cache miss count
kafka.tiered_storage.cache.segment_indexes_cache_evictions | Segment indexes cache eviction count
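
Hit and miss counts are often easier to read as a hit ratio. The sketch below shows the arithmetic for the chunk cache with made-up sample values. It assumes both counts cover the same time window; this page does not say whether the counts are cumulative or per interval, so check that in Datadog before relying on the result.

```shell
# Illustrative sample values, not real measurements.
chunk_cache_hits=900     # kafka.tiered_storage.cache.chunk_cache_hits
chunk_cache_misses=100   # kafka.tiered_storage.cache.chunk_cache_misses

total=$((chunk_cache_hits + chunk_cache_misses))
if [ "$total" -gt 0 ]; then
  # Integer percentage of lookups served from the cache.
  hit_pct=$((100 * chunk_cache_hits / total))
else
  hit_pct=0   # no lookups yet, so report 0 rather than divide by zero
fi
echo "chunk_cache_hit_pct=$hit_pct"
# prints: chunk_cache_hit_pct=90
```

The same calculation applies to the segment manifest and segment indexes caches by swapping in their hit and miss metrics.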

Metrics specific to your cloud storage provider and to the Aiven remote storage manager are listed by backend below.

Amazon S3 tiered storage metrics

Metric | Description
kafka.tiered_storage.s3.get_object_requests_rate | Rate of S3 GetObject requests
kafka.tiered_storage.s3.get_object_time_avg | Average latency of S3 GetObject requests (ms)
kafka.tiered_storage.s3.delete_object_requests_rate | Rate of S3 DeleteObject requests
kafka.tiered_storage.s3.upload_part_requests_rate | Rate of S3 UploadPart requests
kafka.tiered_storage.s3.create_multipart_upload_time_avg | Average latency of S3 CreateMultipartUpload requests (ms)
kafka.tiered_storage.s3.complete_multipart_upload_time_avg | Average latency of S3 CompleteMultipartUpload requests (ms)

Google Cloud Storage tiered storage metrics

Metric | Description
kafka.tiered_storage.gcs.object_get_rate | Rate of GCS object get requests
kafka.tiered_storage.gcs.object_delete_rate | Rate of GCS object delete requests
kafka.tiered_storage.gcs.resumable_upload_initiate_rate | Rate of GCS resumable upload initiations
kafka.tiered_storage.gcs.resumable_chunk_upload_rate | Rate of GCS resumable chunk uploads

Azure Blob Storage tiered storage metrics

Metric | Description
kafka.tiered_storage.azure.blob_get_rate | Rate of Azure Blob get requests
kafka.tiered_storage.azure.blob_upload_rate | Rate of Azure Blob upload requests
kafka.tiered_storage.azure.blob_delete_rate | Rate of Azure Blob delete requests
kafka.tiered_storage.azure.block_upload_rate | Rate of Azure Block upload requests
kafka.tiered_storage.azure.block_list_upload_rate | Rate of Azure BlockList upload requests

Aiven remote storage manager (RSM) tiered storage metrics

Metric | Description
kafka.tiered_storage.aiven.segment_copy_bytes_rate | Rate of bytes copied to remote storage (Aiven RSM)
kafka.tiered_storage.aiven.segment_copy_time_avg | Average time to copy a segment to remote storage (ms)
kafka.tiered_storage.aiven.segment_copy_time_max | Maximum time to copy a segment to remote storage (ms)
kafka.tiered_storage.aiven.segment_delete_bytes_rate | Rate of bytes deleted from remote storage (Aiven RSM)
kafka.tiered_storage.aiven.segment_delete_time_avg | Average time to delete a segment from remote storage (ms)
kafka.tiered_storage.aiven.segment_delete_time_max | Maximum time to delete a segment from remote storage (ms)

Configurable custom metrics

The following per-partition log metrics are not collected by default. You can enable them by configuring the Datadog integration. These metrics are tagged with topic and partition, enabling independent monitoring of each topic and partition:

  • kafka.log.log_size
  • kafka.log.log_start_offset
  • kafka.log.log_end_offset

Variables

Replace the following placeholders in the code samples:

Variable | Description
SERVICE_NAME | Aiven for Apache Kafka® service name
INTEGRATION_ID | ID of the integration between the Aiven for Apache Kafka® service and Datadog

To find the INTEGRATION_ID value, run:

avn service integration-list SERVICE_NAME

Customize metrics for Datadog

Before customizing metrics, configure and enable a Datadog endpoint in your Aiven for Apache Kafka® service. For setup instructions, see Send metrics to Datadog.

Format list-valued parameters as a comma-separated list in square brackets, with each value in double quotes: ["value0", "value1", "value2", ...].

To customize Datadog metrics, use the service integration-update command with the kafka_custom_metrics parameter. Specify a comma-separated list of custom metrics, such as kafka.log.log_size, kafka.log.log_start_offset, and kafka.log.log_end_offset.

For example, to send the kafka.log.log_size and kafka.log.log_end_offset metrics, run:

avn service integration-update \
  -c 'kafka_custom_metrics=["kafka.log.log_size","kafka.log.log_end_offset"]' \
  INTEGRATION_ID
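
When the list of metrics is long, quoting each entry by hand is error-prone. The following is a minimal sketch of a helper that builds the list-valued argument from plain names. The function name build_list_param is hypothetical and is not part of the Aiven CLI, and the sketch assumes the values contain no double quotes or backslashes.

```shell
# Hypothetical helper (not part of avn): build a list-valued -c argument.
# Usage: build_list_param PARAM_NAME VALUE...
build_list_param() {
  name=$1
  shift
  list=""
  for value in "$@"; do
    # Separate entries with commas and wrap each one in double quotes.
    list="${list:+$list,}\"$value\""
  done
  printf '%s=[%s]\n' "$name" "$list"
}

build_list_param kafka_custom_metrics kafka.log.log_size kafka.log.log_end_offset
# prints: kafka_custom_metrics=["kafka.log.log_size","kafka.log.log_end_offset"]

# Pass the result to avn (not run here; INTEGRATION_ID is a placeholder):
# avn service integration-update \
#   -c "$(build_list_param kafka_custom_metrics kafka.log.log_size kafka.log.log_end_offset)" \
#   INTEGRATION_ID
```

The helper only formats text, so you can check its output before running any command against your service.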

After updating settings, view the collected metrics in the Datadog Metrics Explorer.

Customize consumer metrics for Datadog

The Apache Kafka Consumer Integration collects metrics for message offsets. To customize the metrics that this integration sends to Datadog, use the service integration-update command with the following parameters:

  • include_topics: A comma-separated list of topics to include.

    note

    By default, all topics are included.

  • exclude_topics: A comma-separated list of topics to exclude.

    note

    To use exclude_topics, specify at least one include_consumer_groups value. Otherwise, exclude_topics does not take effect.

  • include_consumer_groups: A comma-separated list of consumer groups to include.

  • exclude_consumer_groups: A comma-separated list of consumer groups to exclude.

For example, to include topics topic1 and topic2 while also sending the kafka.log.log_size and kafka.log.log_end_offset metrics, run:

avn service integration-update \
  -c 'kafka_custom_metrics=["kafka.log.log_size","kafka.log.log_end_offset"]' \
  -c 'include_topics=["topic1","topic2"]' \
  INTEGRATION_ID
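
Because exclude_topics takes effect only when at least one include_consumer_groups value is set, the two parameters are typically passed together. The sketch below prints such a command for review instead of running it. The function print_update_cmd is hypothetical, and group1 and topic3 are placeholder names.

```shell
# Hypothetical helper: print an integration-update command with two -c options.
# Usage: print_update_cmd FIRST_PARAM SECOND_PARAM INTEGRATION_ID
print_update_cmd() {
  printf "avn service integration-update -c '%s' -c '%s' %s\n" "$1" "$2" "$3"
}

# group1 and topic3 are placeholder names; INTEGRATION_ID is the usual placeholder.
print_update_cmd \
  'include_consumer_groups=["group1"]' \
  'exclude_topics=["topic3"]' \
  INTEGRATION_ID
```

Copy the printed line into your terminal once the group, topic, and integration ID are correct for your service.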

After updating settings, view the collected metrics in the Datadog Metrics Explorer.
