Aiven for Apache Kafka® metrics available via Prometheus

Explore common metrics available via Prometheus for your Aiven for Apache Kafka® service. The available metrics depend on whether your service runs in KRaft mode or ZooKeeper mode.

How to retrieve metrics

You can retrieve a complete list of metrics from your service by querying the Prometheus endpoint. To do this:

Gather the necessary details:
- Aiven project certificate: ca.pem. To download the CA certificate, see Download CA certificates.
- Prometheus credentials: <PROMETHEUS_USER>:<PROMETHEUS_PASSWORD>
- Aiven for Apache Kafka hostname: <KAFKA_HOSTNAME>
- Prometheus port: <PROMETHEUS_PORT>

Run the following curl command to query the Prometheus endpoint:

   curl --cacert ca.pem \
     --user '<PROMETHEUS_USER>:<PROMETHEUS_PASSWORD>' \
     'https://<KAFKA_HOSTNAME>:<PROMETHEUS_PORT>/metrics'

For more information about setting up Prometheus integration, see Use Prometheus with Aiven.
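To script the same query, the following Python sketch (3.9 or later, standard library only) sends the authenticated HTTPS request and lists the distinct metric names in the response. The function names are hypothetical. It assumes the endpoint returns the Prometheus text exposition format and that ca.pem is in the working directory.

```python
import base64
import re
import ssl
import urllib.request

# Prometheus metric names start with a letter, underscore, or colon.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def fetch_metrics(host: str, port: int, user: str, password: str,
                  cafile: str = "ca.pem") -> str:
    """Return the raw text served by https://<host>:<port>/metrics."""
    # Trust the Aiven project CA, as curl --cacert does.
    context = ssl.create_default_context(cafile=cafile)
    # HTTP Basic authentication, as curl --user does.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{host}:{port}/metrics",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        return response.read().decode("utf-8")


def metric_names(exposition: str) -> set[str]:
    """Collect the distinct metric names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = METRIC_NAME.match(line)
        if match:
            names.add(match.group(0))
    return names
```

For example, `sorted(metric_names(fetch_metrics(host, port, user, password)))` returns every metric name the service exposed at the time of the request.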

Host metrics

Host metrics provide insights into system-level performance, including CPU, memory, disk, and network usage.

CPU utilization

CPU utilization metrics show how CPU time is divided between user processes, system processes, interrupt handling, idle time, and virtualization overhead. They also include system load averages, the number of CPUs, and uptime.

Metric	Description
`cpu_usage_guest`	CPU time spent running a virtual CPU for guest operating systems
`cpu_usage_guest_nice`	CPU time spent running a niced (low-priority) virtual CPU for guest operating systems
`cpu_usage_idle`	Time the CPU spends doing nothing
`cpu_usage_iowait`	Time waiting for I/O to complete
`cpu_usage_irq`	Time servicing interrupts
`cpu_usage_nice`	Time spent running niced (low-priority) user processes
`cpu_usage_softirq`	Time spent servicing software interrupts (softirqs)
`cpu_usage_steal`	Time the virtual CPU waited while the hypervisor was serving other virtual machines
`cpu_usage_system`	Time spent running system processes
`cpu_usage_user`	Time spent running user processes
`system_load1`	System load average for the last minute
`system_load15`	System load average for the last 15 minutes
`system_load5`	System load average for the last 5 minutes
`system_n_cpus`	Number of CPU cores available
`system_n_users`	Number of users logged in
`system_uptime`	Time for which the system has been up and running
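Load averages are easier to compare across service plans when divided by the number of CPUs. The following Python sketch assumes you read `system_load1` (or `system_load5`, `system_load15`) and `system_n_cpus` from the same scrape. The function name is hypothetical, and any alerting threshold you apply to the result is your own choice.

```python
def load_per_cpu(load_average: float, n_cpus: int) -> float:
    """Normalize a load average (system_load1/5/15) by system_n_cpus.

    On Linux the load average counts runnable tasks plus tasks in
    uninterruptible sleep (usually waiting on I/O). A result above 1.0
    means that, on average, there were more such tasks than CPUs.
    """
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return load_average / n_cpus
```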

Disk space utilization

Disk space utilization metrics provide a snapshot of disk usage. These metrics include information about free and used disk space, as well as inode usage and total disk capacity.

Metric	Description
`disk_free`	Amount of free disk space
`disk_inodes_free`	Number of free inodes
`disk_inodes_total`	Total number of inodes
`disk_inodes_used`	Number of used inodes
`disk_total`	Total disk space
`disk_used`	Amount of used disk space
`disk_used_percent`	Percentage of disk space used

Disk input and output

Metrics such as diskio_io_time and diskio_iops_in_progress provide insights into disk I/O operations. These metrics cover read/write operations, the duration of these operations, and the number of bytes read/written.

Metric	Description
`diskio_io_time`	Total time spent on I/O operations
`diskio_iops_in_progress`	Number of I/O operations currently in progress
`diskio_merged_reads`	Number of read operations that were merged
`diskio_merged_writes`	Number of write operations that were merged
`diskio_read_bytes`	Total bytes read from disk
`diskio_read_time`	Total time spent on read operations
`diskio_reads`	Total number of read operations
`diskio_weighted_io_time`	Time spent on I/O operations, weighted by the number of I/O operations in progress; indicates both I/O completion time and accumulating backlog
`diskio_write_bytes`	Total bytes written to disk
`diskio_write_time`	Total time spent on write operations
`diskio_writes`	Total number of write operations
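Because the disk I/O metrics are cumulative counters, useful figures such as average latency per operation come from the difference between two scrapes. The following Python sketch derives iostat-style values for one device. It assumes the `*_time` counters are in milliseconds (verify against your own data) and that no counter reset occurred between the samples. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DiskIOSample:
    """One scrape of selected diskio_* counters for a single device."""
    timestamp_s: float   # scrape time in seconds
    reads: int           # diskio_reads
    writes: int          # diskio_writes
    read_time_ms: int    # diskio_read_time (assumed milliseconds)
    write_time_ms: int   # diskio_write_time (assumed milliseconds)
    io_time_ms: int      # diskio_io_time (assumed milliseconds)


def disk_io_stats(prev: DiskIOSample, curr: DiskIOSample) -> dict[str, float]:
    """Derive per-interval figures from two consecutive samples.

    Assumes the counters did not reset between the two samples.
    """
    interval_ms = (curr.timestamp_s - prev.timestamp_s) * 1000
    if interval_ms <= 0:
        raise ValueError("samples must be in chronological order")
    reads = curr.reads - prev.reads
    writes = curr.writes - prev.writes
    return {
        # Average time per completed read or write during the interval.
        "avg_read_ms": (curr.read_time_ms - prev.read_time_ms) / reads if reads else 0.0,
        "avg_write_ms": (curr.write_time_ms - prev.write_time_ms) / writes if writes else 0.0,
        # Share of wall-clock time the device had at least one I/O in flight.
        "busy_percent": 100 * (curr.io_time_ms - prev.io_time_ms) / interval_ms,
    }
```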

Generic memory

The following metrics, including mem_active and mem_available, provide insights into your system's memory usage.

Metric	Description
`mem_active`	Amount of actively used memory
`mem_available`	Amount of available memory
`mem_available_percent`	Percentage of available memory
`mem_buffered`	Amount of memory used for buffering I/O
`mem_cached`	Amount of memory used for caching
`mem_commit_limit`	Maximum amount of memory that can be committed
`mem_committed_as`	Total amount of committed memory
`mem_dirty`	Amount of memory waiting to be written to disk
`mem_free`	Amount of free memory
`mem_high_free`	Amount of free memory in the high memory zone
`mem_high_total`	Total amount of memory in the high memory zone
`mem_huge_pages_free`	Number of free huge pages
`mem_huge_page_size`	Size of huge pages
`mem_huge_pages_total`	Total number of huge pages
`mem_inactive`	Amount of inactive memory
`mem_low_free`	Amount of free memory in the low memory zone
`mem_low_total`	Total amount of memory in the low memory zone
`mem_mapped`	Amount of memory mapped into the process's address space
`mem_page_tables`	Amount of memory used by page tables
`mem_shared`	Amount of memory shared between processes
`mem_slab`	Amount of memory used by the kernel for data structure caches
`mem_swap_cached`	Amount of swap memory cached
`mem_swap_free`	Amount of free swap memory
`mem_swap_total`	Total amount of swap memory
`mem_total`	Total amount of memory
`mem_used`	Amount of used memory
`mem_used_percent`	Percentage of used memory
`mem_vmalloc_chunk`	Largest contiguous block of vmalloc memory available
`mem_vmalloc_total`	Total amount of vmalloc memory
`mem_vmalloc_used`	Amount of used vmalloc memory
`mem_wired`	Amount of wired memory
`mem_write_back`	Amount of memory being written back to disk
`mem_write_back_tmp`	Amount of memory used by FUSE for temporary write-back buffers

Network

The following metrics, including net_bytes_recv and net_packets_sent, provide insights into your system's network operations.

Metric	Description
`net_bytes_recv`	Total bytes received on the network interfaces
`net_bytes_sent`	Total bytes sent on the network interfaces
`net_drop_in`	Incoming packets dropped
`net_drop_out`	Outgoing packets dropped
`net_err_in`	Incoming packets with errors
`net_err_out`	Outgoing packets with errors
`net_icmp_inaddrmaskreps`	Number of ICMP address mask replies received
`net_icmp_inaddrmasks`	Number of ICMP address mask requests received
`net_icmp_incsumerrors`	Number of ICMP checksum errors
`net_icmp_indestunreachs`	Number of ICMP destination unreachable messages received
`net_icmp_inechoreps`	Number of ICMP echo replies received
`net_icmp_inechos`	Number of ICMP echo requests received
`net_icmp_inerrors`	Number of ICMP messages received with errors
`net_icmp_inmsgs`	Total number of ICMP messages received
`net_icmp_inparmprobs`	Number of ICMP parameter problem messages received
`net_icmp_inredirects`	Number of ICMP redirect messages received
`net_icmp_insrcquenchs`	Number of ICMP source quench messages received
`net_icmp_intimeexcds`	Number of ICMP time exceeded messages received
`net_icmp_intimestampreps`	Number of ICMP timestamp reply messages received
`net_icmp_intimestamps`	Number of ICMP timestamp request messages received
`net_icmpmsg_intype3`	Number of ICMP type 3 (destination unreachable) messages received
`net_icmpmsg_intype8`	Number of ICMP type 8 (echo request) messages received
`net_icmpmsg_outtype0`	Number of ICMP type 0 (echo reply) messages sent
`net_icmpmsg_outtype3`	Number of ICMP type 3 (destination unreachable) messages sent
`net_icmp_outaddrmaskreps`	Number of ICMP address mask reply messages sent
`net_icmp_outaddrmasks`	Number of ICMP address mask request messages sent
`net_icmp_outdestunreachs`	Number of ICMP destination unreachable messages sent
`net_icmp_outechoreps`	Number of ICMP echo reply messages sent
`net_icmp_outechos`	Number of ICMP echo request messages sent
`net_icmp_outerrors`	Number of ICMP messages sent with errors
`net_icmp_outmsgs`	Total number of ICMP messages sent
`net_icmp_outparmprobs`	Number of ICMP parameter problem messages sent
`net_icmp_outredirects`	Number of ICMP redirect messages sent
`net_icmp_outsrcquenchs`	Number of ICMP source quench messages sent
`net_icmp_outtimeexcds`	Number of ICMP time exceeded messages sent
`net_icmp_outtimestampreps`	Number of ICMP timestamp reply messages sent
`net_icmp_outtimestamps`	Number of ICMP timestamp request messages sent
`net_icmp_outratelimitglobal`	Number of outgoing ICMP messages suppressed by the global rate limit
`net_icmp_outratelimithost`	Number of outgoing ICMP messages suppressed by the per-host rate limit
`net_ip_defaultttl`	Default time-to-live for IP packets
`net_ip_forwarding`	Indicates if IP forwarding is enabled
`net_ip_forwdatagrams`	Number of forwarded IP datagrams
`net_ip_fragcreates`	Number of IP fragments created
`net_ip_fragfails`	Number of failed IP fragmentations
`net_ip_fragoks`	Number of successful IP fragmentations
`net_ip_inaddrerrors`	Number of incoming IP packets with address errors
`net_ip_indelivers`	Number of incoming IP packets delivered to higher layers
`net_ip_indiscards`	Number of incoming IP packets discarded
`net_ip_inhdrerrors`	Number of incoming IP packets with header errors
`net_ip_inreceives`	Total number of incoming IP packets received
`net_ip_inunknownprotos`	Number of incoming IP packets with unknown protocols
`net_ip_outdiscards`	Number of outgoing IP packets discarded
`net_ip_outnoroutes`	Number of outgoing IP packets with no route available
`net_ip_outrequests`	Total number of outgoing IP packets requested to be sent
`net_ip_outtransmits`	Number of IP packets transmitted successfully
`net_ip_reasmfails`	Number of failed IP reassembly attempts
`net_ip_reasmoks`	Number of successful IP reassembly attempts
`net_ip_reasmreqds`	Number of IP fragments received needing reassembly
`net_ip_reasmtimeout`	Number of IP reassembly timeouts
`net_packets_recv`	Total number of packets received on the network interfaces
`net_packets_sent`	Total number of packets sent on the network interfaces
`netstat_tcp_close`	Number of TCP connections in the CLOSE state
`netstat_tcp_close_wait`	Number of TCP connections in the CLOSE_WAIT state
`netstat_tcp_closing`	Number of TCP connections in the CLOSING state
`netstat_tcp_established`	Number of TCP connections in the ESTABLISHED state
`netstat_tcp_fin_wait1`	Number of TCP connections in the FIN_WAIT_1 state
`netstat_tcp_fin_wait2`	Number of TCP connections in the FIN_WAIT_2 state
`netstat_tcp_last_ack`	Number of TCP connections in the LAST_ACK state
`netstat_tcp_listen`	Number of TCP connections in the LISTEN state
`netstat_tcp_none`	Number of TCP connections in the NONE state
`netstat_tcp_syn_recv`	Number of TCP connections in the SYN_RECV state
`netstat_tcp_syn_sent`	Number of TCP connections in the SYN_SENT state
`netstat_tcp_time_wait`	Number of TCP connections in the TIME_WAIT state
`netstat_udp_socket`	Number of UDP sockets
`net_tcp_activeopens`	Number of TCP connections opened actively (initiated by this host)
`net_tcp_attemptfails`	Number of failed TCP connection attempts
`net_tcp_currestab`	Number of currently established TCP connections
`net_tcp_estabresets`	Number of established TCP connections reset
`net_tcp_incsumerrors`	Number of TCP checksum errors in incoming packets
`net_tcp_inerrs`	Number of incoming TCP packets with errors
`net_tcp_insegs`	Number of TCP segments received
`net_tcp_maxconn`	Maximum number of TCP connections supported
`net_tcp_outrsts`	Number of TCP reset packets sent
`net_tcp_outsegs`	Number of TCP segments sent
`net_tcp_passiveopens`	Number of TCP connections opened passively (accepted from remote hosts)
`net_tcp_retranssegs`	Number of TCP segments retransmitted
`net_tcp_rtoalgorithm`	TCP retransmission timeout algorithm
`net_tcp_rtomax`	Maximum TCP retransmission timeout
`net_tcp_rtomin`	Minimum TCP retransmission timeout
`net_udp_ignoredmulti`	Number of UDP multicast packets ignored
`net_udp_incsumerrors`	Number of UDP checksum errors in incoming packets
`net_udp_indatagrams`	Number of UDP datagrams received
`net_udp_inerrors`	Number of incoming UDP packets with errors
`net_udp_memerrors`	Number of UDP packets dropped due to memory errors
`net_udplite_ignoredmulti`	Number of UDP-Lite multicast packets ignored
`net_udplite_incsumerrors`	Number of UDP-Lite checksum errors in incoming packets
`net_udplite_indatagrams`	Number of UDP-Lite datagrams received
`net_udplite_inerrors`	Number of incoming UDP-Lite packets with errors
`net_udplite_memerrors`	Number of UDP-Lite packets dropped due to memory errors
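A common health signal derived from the TCP counters is the share of sent segments that were retransmissions over an interval. The following Python sketch computes it from two samples of `net_tcp_outsegs` and `net_tcp_retranssegs`. The function name is hypothetical. Kernel implementations differ on whether the sent-segment counter includes retransmissions, so treat the ratio as approximate.

```python
def tcp_retransmit_ratio(prev_out: int, curr_out: int,
                         prev_retrans: int, curr_retrans: int) -> float:
    """Approximate fraction of sent TCP segments that were retransmitted.

    prev_out/curr_out are net_tcp_outsegs samples; prev_retrans/curr_retrans
    are net_tcp_retranssegs samples taken at the same two scrape times.
    """
    sent = curr_out - prev_out
    retransmitted = curr_retrans - prev_retrans
    if sent <= 0 or retransmitted < 0:
        # No traffic in the interval, or a counter reset (for example, a reboot).
        return 0.0
    return retransmitted / sent
```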

Kernel

The metrics listed below, such as kernel_boot_time and kernel_context_switches, provide insights into the operations of your system's kernel.

Metric	Description
`kernel_boot_time`	Time at which the system was last booted
`kernel_context_switches`	Number of context switches that have occurred in the kernel
`kernel_entropy_avail`	Amount of available entropy in the kernel's entropy pool
`kernel_interrupts`	Number of interrupts that have occurred
`kernel_processes_forked`	Number of processes that have been forked

Process

Metrics such as processes_running and processes_zombies show how many of the system's processes are in each state.

Metric	Description
`processes_blocked`	Number of processes that are blocked
`processes_dead`	Number of processes that have terminated
`processes_idle`	Number of processes that are idle
`processes_paging`	Number of processes that are paging
`processes_running`	Number of processes currently running
`processes_sleeping`	Number of processes that are sleeping
`processes_stopped`	Number of processes that are stopped
`processes_total`	Total number of processes
`processes_total_threads`	Total number of threads across all processes
`processes_unknown`	Number of processes in an unknown state
`processes_zombies`	Number of zombie processes (terminated but not reaped by parent process)

Swap usage

Metrics such as swap_free and swap_used provide insights into the usage of the system's swap memory.

Metric	Description
`swap_free`	Amount of free swap memory
`swap_in`	Amount of data swapped in from disk
`swap_out`	Amount of data swapped out to disk
`swap_total`	Total amount of swap memory
`swap_used`	Amount of used swap memory
`swap_used_percent`	Percentage of swap memory used

Aiven for Apache Kafka-specific metrics

Metrics specific to Apache Kafka provide detailed insights into the health and performance of your Kafka clusters, including broker, controller, and topic-level metrics.

Garbage collector `MXBean`

Metrics prefixed with java_lang_GarbageCollector provide insights into the JVM's garbage collection process, including the number of collections and the time spent on them.

Metric	Description
`java_lang_GarbageCollector_G1_Young_Generation_CollectionCount`	Returns the total number of collections that have occurred
`java_lang_GarbageCollector_G1_Young_Generation_CollectionTime`	Returns the approximate accumulated collection elapsed time in milliseconds
`java_lang_GarbageCollector_G1_Young_Generation_duration`	Duration of G1 Young Generation garbage collections
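Because `CollectionTime` is an accumulated value in milliseconds, the share of time the JVM spends in young-generation collections can be estimated from two samples. The following Python sketch is hypothetical and treats a decrease in the accumulated time as a broker restart.

```python
def gc_time_fraction(prev_time_ms: float, curr_time_ms: float,
                     interval_s: float) -> float:
    """Approximate fraction of wall-clock time spent in G1 Young Generation GC.

    Inputs are two samples of
    java_lang_GarbageCollector_G1_Young_Generation_CollectionTime
    (accumulated milliseconds) taken interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_ms = curr_time_ms - prev_time_ms
    if delta_ms < 0:
        # The accumulated time went backwards: the JVM (broker) restarted.
        return 0.0
    return delta_ms / (interval_s * 1000)
```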

Memory usage

Metrics starting with java_lang_Memory provide insights into the JVM's memory usage, including committed memory, initial memory, max memory, and used memory.

Metric	Description
`java_lang_Memory_committed`	Returns the amount of memory in bytes that is committed for the Java virtual machine to use
`java_lang_Memory_init`	Returns the amount of memory in bytes that the Java virtual machine initially requests from the operating system for memory management
`java_lang_Memory_max`	Returns the maximum amount of memory in bytes that can be used for memory management
`java_lang_Memory_used`	Returns the amount of used memory in bytes
`java_lang_Memory_ObjectPendingFinalizationCount`	Number of objects pending finalization
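A typical way to use these values is to express used memory as a percentage of the limit. The following Python sketch is hypothetical. JMX reports a maximum of -1 when no limit is defined, and this sketch then falls back to committed memory, which is a design choice rather than a standard. This page does not say whether heap and non-heap values are exposed as separate series, so apply the calculation per series.

```python
from typing import Optional


def memory_utilization(used: int, committed: int, max_bytes: int) -> Optional[float]:
    """Percentage of the JVM memory limit currently in use.

    Arguments correspond to java_lang_Memory_used, java_lang_Memory_committed,
    and java_lang_Memory_max for the same series. A max of -1 (undefined)
    falls back to committed memory; returns None if no limit is known.
    """
    limit = max_bytes if max_bytes > 0 else committed
    if limit <= 0:
        return None
    return 100 * used / limit
```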

Apache Kafka Connect

For a comprehensive list of Apache Kafka Connect metrics exposed through Prometheus, see Apache Kafka® Connect available via Prometheus.

Apache Kafka broker metrics

Apache Kafka brokers expose metrics that provide insights into the health and performance of the Apache Kafka cluster. For detailed descriptions of these metrics, see the monitoring section of the Apache Kafka documentation.

Metric types

Cumulative counters (`_Count`)

Metrics with a _Count suffix are cumulative counters. They track the total number of occurrences of a specific event since the broker started.

Example:

kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Count: Total number of leader elections that have occurred in the controller.

Rate counters (`PerSec`)

Metrics with PerSec in their name are also cumulative counters. Despite the name, their _Count value is the total number of events since the broker started, not the current per-second rate.

Example:

kafka_server_BrokerTopicMetrics_MessagesInPerSec_Count: Total number of incoming messages received by the broker.

Note

To calculate the per-second rate of change of these _Count metrics, use a PromQL function such as rate().
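To show what a rate calculation does with these counters, the following Python sketch computes the average per-second increase from a series of samples and treats any drop in value as a counter reset, such as a broker restart. It is a simplified, hypothetical illustration: PromQL's rate(), for example `rate(kafka_server_BrokerTopicMetrics_MessagesInPerSec_Count[5m])`, also extrapolates to the edges of the query window, which this sketch does not.

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase of a cumulative counter such as a *_Count metric.

    samples is a chronologically ordered list of (timestamp_s, value) pairs.
    A drop in value is treated as a counter reset, after which the counter
    is assumed to have restarted from zero.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset, the whole current value counts as new increase.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed
```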

Apache Kafka controller metrics

Apache Kafka offers a range of metrics to help you assess the performance and health of your Apache Kafka controller.

- Percentile metrics: Metrics such as kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_XthPercentile (where X is 50, 75, 95, 98, 99, or 999 for the 99.9th percentile) show the time taken for leader elections to complete at various percentiles. This helps you understand the distribution of leader election times.
- Interval metrics: Metrics that follow kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_ and end with OneMinuteRate, FiveMinuteRate, or FifteenMinuteRate show the rate of leader elections over different time intervals.
- Statistical metrics: Metrics that follow kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_ and end with Max, Mean, Min, or StdDev provide statistical measures of leader election times.
- Controller state metrics: Metrics starting with kafka_controller_KafkaController_ give insights into the state of the Kafka controller, such as the number of active brokers, offline partitions, and replicas to delete.

ZooKeeper mode-only metrics

In ZooKeeper mode, Apache Kafka relies on a separate ZooKeeper process to store cluster metadata, among other tasks. The following metrics are only available when running Kafka in ZooKeeper mode:

Metric	Description
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_50thPercentile`	Time taken for leader elections to complete at the 50th percentile
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_75thPercentile`	Time taken for leader elections to complete at the 75th percentile
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_95thPercentile`	Time taken for leader elections to complete at the 95th percentile
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_98thPercentile`	Time taken for leader elections to complete at the 98th percentile
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_99thPercentile`	Time taken for leader elections to complete at the 99th percentile
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_999thPercentile`	Time taken for leader elections to complete at the 99.9th percentile
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Count`	The total number of leader elections
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_FifteenMinuteRate`	Rate of leader elections over the last 15 minutes
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_FiveMinuteRate`	Rate of leader elections over the last 5 minutes
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Max`	Maximum time taken for a leader election
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Mean`	Mean time taken for leader elections
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_MeanRate`	Mean rate of leader elections
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Min`	Minimum time taken for a leader election
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_OneMinuteRate`	Rate of leader elections over the last minute
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_StdDev`	Standard deviation of leader election times
`kafka_controller_ControllerStats_UncleanLeaderElectionsPerSec_Count`	Total number of unclean leader elections. Unclean leader elections can lead to data loss
`kafka_controller_KafkaController_ActiveBrokerCount_Value`	Number of active brokers
`kafka_controller_KafkaController_ActiveControllerCount_Value`	Number of active controllers
`kafka_controller_KafkaController_FencedBrokerCount_Value`	Number of fenced brokers
`kafka_controller_KafkaController_OfflinePartitionsCount_Value`	Number of offline partitions
`kafka_controller_KafkaController_PreferredReplicaImbalanceCount_Value`	Number of preferred replica imbalances
`kafka_controller_KafkaController_ReplicasIneligibleToDeleteCount_Value`	Number of replicas ineligible to delete
`kafka_controller_KafkaController_ReplicasToDeleteCount_Value`	Number of replicas to delete
`kafka_controller_KafkaController_TopicsIneligibleToDeleteCount_Value`	Number of topics ineligible to delete
`kafka_controller_KafkaController_TopicsToDeleteCount_Value`	Number of topics to delete
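Two controller metrics are commonly checked together in ZooKeeper mode: in a healthy cluster exactly one broker reports an active controller count of 1, and no partitions are offline. The following Python sketch applies those two checks. The function name and input shapes are hypothetical, and because these metrics are not available in KRaft mode, it applies only to ZooKeeper-mode services.

```python
def controller_health(active_controller_counts: dict[str, float],
                      offline_partitions: float) -> list[str]:
    """Return human-readable problems found in ZooKeeper-mode controller metrics.

    active_controller_counts maps each broker to its
    kafka_controller_KafkaController_ActiveControllerCount_Value.
    offline_partitions is
    kafka_controller_KafkaController_OfflinePartitionsCount_Value summed
    across brokers.
    """
    problems = []
    total = sum(active_controller_counts.values())
    if total != 1:
        problems.append(f"expected exactly 1 active controller, found {total:g}")
    if offline_partitions > 0:
        problems.append(f"{offline_partitions:g} partition(s) offline")
    return problems
```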

KRaft mode and metrics changes

Aiven for Apache Kafka services running Apache Kafka 3.9 use KRaft mode, which replaces ZooKeeper for metadata and controller management. Although KRaft tracks similar controller metrics, Aiven does not expose them because they are internal and not needed for monitoring. This mirrors ZooKeeper mode, where ZooKeeper-specific metrics were not exposed either.

The following controller metrics are not available in KRaft mode:

Metric	Description
`kafka_controller_KafkaController_ActiveControllerCount_Value`	Number of active controllers
`kafka_controller_KafkaController_OfflinePartitionsCount_Value`	Number of offline partitions
`kafka_controller_KafkaController_PreferredReplicaImbalanceCount_Value`	Number of preferred replica imbalances
`kafka_controller_KafkaController_TopicsToDeleteCount_Value`	Number of topics to delete
`kafka_controller_KafkaController_ReplicasToDeleteCount_Value`	Number of replicas to delete
`kafka_controller_KafkaController_TopicsIneligibleToDeleteCount_Value`	Number of topics ineligible for deletion
`kafka_controller_KafkaController_ReplicasIneligibleToDeleteCount_Value`	Number of replicas ineligible for deletion
`kafka_controller_KafkaController_ActiveBrokerCount_Value`	Number of active brokers
`kafka_controller_KafkaController_FencedBrokerCount_Value`	Number of fenced brokers
`kafka_controller_KafkaController_NewActiveControllersCount_Value`	Number of new active controllers
`kafka_controller_KafkaController_TimedOutBrokerHeartbeatCount_Value`	Number of timed-out broker heartbeats
`kafka_controller_KafkaController_ControllerState_Value`	Controller state
`kafka_controller_KafkaController_GlobalTopicCount_Value`	Number of global topics
`kafka_controller_KafkaController_GlobalPartitionCount_Value`	Number of global partitions
`kafka_controller_KafkaController_MetadataErrorCount_Value`	Number of metadata errors
`kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Count`	Total number of leader elections
`kafka_controller_ControllerStats_UncleanLeaderElectionsPerSec_Count`	Number of unclean leader elections
`kafka_controller_KafkaController_EventQueueOperationsStartedCount_Value`	Number of event queue operations started
`kafka_controller_KafkaController_EventQueueOperationsTimedOutCount_Value`	Number of event queue operations timed out
`kafka_controller_KafkaController_LastAppliedRecordOffset_Value`	Last applied record offset
`kafka_controller_KafkaController_LastCommittedRecordOffset_Value`	Last committed record offset
`kafka_controller_KafkaController_LastAppliedRecordTimestamp_Value`	Last applied record timestamp
`kafka_controller_KafkaController_LastAppliedRecordLagMs_Value`	Last applied record lag in ms
`kafka_controller_ControllerEventManager_EventQueueProcessingTimeMs_Value`	Event queue processing time in ms
`kafka_controller_ControllerEventManager_EventQueueSize_Value`	Event queue size
`kafka_controller_ControllerEventManager_EventQueueTimeMs_Value`	Event queue time in ms
`kafka_controller_ControllerChannelManager_TotalQueueSize_Value`	Total queue size
`kafka_controller_ControllerChannelManager_QueueSize_Value`	Queue size per broker
`kafka_controller_ControllerChannelManager_RequestRateAndQueueTimeMs_Value`	Request rate and queue time per broker
`kafka_controller_KafkaController_MigratingZkBrokerCount_Value`	Number of brokers migrating from ZooKeeper to KRaft
`kafka_controller_KafkaController_ZkMigrationState_Value`	ZooKeeper migration state

Jolokia collector collect time

Jolokia is a JMX-HTTP bridge that provides an alternative to native JMX access. The following metric provides insights into the time taken by the Jolokia collector to collect metrics.

Metric	Description
`kafka_jolokia_collector_collect_time`	Represents the time taken by the Jolokia collector to collect metrics

Apache Kafka log

Apache Kafka exposes log metrics that help you understand log cleaner activity, log flush operations, and the size and offsets of each log.

Log cleaner metrics

These metrics provide insights into the operation of the log cleaner, which compacts Apache Kafka logs for topics that use log compaction.

Metric	Description
`kafka_log_LogCleaner_cleaner_recopy_percent_Value`	Percentage of log segments that were recopied during cleaning
`kafka_log_LogCleanerManager_time_since_last_run_ms_Value`	Time in milliseconds since the last log cleaner run
`kafka_log_LogCleaner_max_clean_time_secs_Value`	Maximum time in seconds taken for a log cleaning operation

Log flush rate metrics

These metrics describe log flush operations, in which the broker writes data from memory to disk. Percentile metrics such as kafka_log_LogFlushStats_LogFlushRateAndTimeMs_XthPercentile show the time taken to flush logs at various percentiles, while the remaining metrics report flush counts, rates, and summary statistics.

Metric	Description
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_50thPercentile`	Time taken to flush logs at the 50th percentile
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_75thPercentile`	Time taken to flush logs at the 75th percentile
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_95thPercentile`	Time taken to flush logs at the 95th percentile
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_98thPercentile`	Time taken to flush logs at the 98th percentile
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_99thPercentile`	Time taken to flush logs at the 99th percentile
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_999thPercentile`	Time taken to flush logs at the 99.9th percentile
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Count`	Total number of log flush operations
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_FifteenMinuteRate`	Rate of log flush operations over the last 15 minutes
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_FiveMinuteRate`	Rate of log flush operations over the last 5 minutes
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Max`	Maximum time taken for a log flush operation
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Mean`	Mean time taken for log flush operations
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_MeanRate`	Mean rate of log flush operations
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Min`	Minimum time taken for a log flush operation
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_OneMinuteRate`	Rate of log flush operations over the last minute
`kafka_log_LogFlushStats_LogFlushRateAndTimeMs_StdDev`	Standard deviation of log flush times

Log metrics

These metrics provide general information about log sizes and offsets.

Metric	Description
`kafka_log_Log_LogEndOffset_Value`	End offset of the log
`kafka_log_Log_LogStartOffset_Value`	Start offset of the log
`kafka_log_Log_Size_Value`	Size of the log in bytes
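The start and end offsets can be combined to estimate how many offsets a partition currently retains. The following Python sketch assumes both values come from series for the same topic partition; the function name is hypothetical. The log end offset is the offset of the next message to be written, so the difference is the size of the offset range. With compaction or transaction markers, this range can exceed the number of records actually stored.

```python
def retained_offsets(log_start_offset: int, log_end_offset: int) -> int:
    """Size of the offset range currently held in a partition's log.

    Uses kafka_log_Log_LogStartOffset_Value and kafka_log_Log_LogEndOffset_Value
    for the same topic partition. This is an upper bound on the number of
    records stored, not an exact record count.
    """
    if log_end_offset < log_start_offset:
        raise ValueError("log end offset cannot precede log start offset")
    return log_end_offset - log_start_offset
```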

Apache Kafka network

Apache Kafka provides several metrics, such as kafka_network_RequestMetrics_RequestsPerSec_Count and kafka_network_RequestMetrics_TotalTimeMs_Mean, to monitor the performance and health of network requests made to the Apache Kafka brokers.

Metric	Description
`kafka_network_RequestChannel_RequestQueueSize_Value`	Size of the request queue
`kafka_network_RequestChannel_ResponseQueueSize_Value`	Size of the response queue
`kafka_network_RequestMetrics_RequestsPerSec_Count`	Total number of requests (cumulative count)
`kafka_network_RequestMetrics_TotalTimeMs_95thPercentile`	Total time for requests at the 95th percentile
`kafka_network_RequestMetrics_TotalTimeMs_Count`	Total number of requests
`kafka_network_RequestMetrics_TotalTimeMs_Mean`	Mean total time for requests
`kafka_network_SocketServer_NetworkProcessorAvgIdlePercent_Value`	Average idle percentage of the network processor

Apache Kafka server

Apache Kafka provides a range of metrics that help monitor the server's performance and health.

- Topic metrics: BrokerTopicMetrics offer insights into operations related to topics, such as bytes in and out and failed fetch and produce requests.
- Replica metrics: ReplicaManager metrics, such as kafka_server_ReplicaManager_LeaderCount_Value, provide insights into the state of replicas within the Apache Kafka cluster.

The topic label is important for these metrics. If you don't filter on it, a query returns the combined value for all topics as well as the value for each individual topic. To view values for specific topics, filter on the topic label. To exclude the combined series and list only per-topic metrics, filter with topic!="".
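Because the combined series and the per-topic series are returned together, summing everything without a topic filter counts each value twice. The following Python sketch separates the two kinds of series for a single broker, mirroring the topic!="" filter. It assumes each series is given as a label dictionary plus a value and that the combined series has a missing or empty topic label; the function name is hypothetical. In PromQL, `sum(kafka_server_BrokerTopicMetrics_BytesInPerSec_Count{topic!=""})` avoids the double count.

```python
from typing import Optional


def split_by_topic(series: list[tuple[dict[str, str], float]]
                   ) -> tuple[Optional[float], dict[str, float]]:
    """Separate the all-topics series from the per-topic series of one broker.

    A series whose topic label is missing or empty is treated as the
    combined value for all topics, matching the PromQL filter topic!="".
    """
    combined: Optional[float] = None
    per_topic: dict[str, float] = {}
    for labels, value in series:
        topic = labels.get("topic", "")
        if topic:
            per_topic[topic] = value
        else:
            combined = value
    return combined, per_topic
```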

Metric	Description
`kafka_server_BrokerTopicMetrics_BytesInPerSec_Count`	Byte in (from the clients) rate per topic. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_BytesOutPerSec_Count`	Byte out (to the clients) rate per topic. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_BytesRejectedPerSec_Count`	Rejected byte rate per topic due to the record batch size being greater than the `max.message.bytes` configuration. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_FailedFetchRequestsPerSec_Count`	Failed fetch request (from clients or followers) rate per topic. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_FailedProduceRequestsPerSec_Count`	Failed produce request rate per topic. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_FetchMessageConversionsPerSec_Count`	Message format conversion rate for fetch requests per topic. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_MessagesInPerSec_Count`	Incoming message rate per topic. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_ProduceMessageConversionsPerSec_Count`	Message format conversion rate for produce requests per topic. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_ReassignmentBytesInPerSec_Count`	Incoming byte rate of reassignment traffic.
`kafka_server_BrokerTopicMetrics_ReassignmentBytesOutPerSec_Count`	Outgoing byte rate of reassignment traffic.
`kafka_server_BrokerTopicMetrics_ReplicationBytesInPerSec_Count`	Byte in (from other brokers) rate per topic. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_ReplicationBytesOutPerSec_Count`	Byte out (to other brokers) rate per topic. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_TotalFetchRequestsPerSec_Count`	Fetch request (from clients or followers) rate per topic. Omitting `topic=(...)` will yield the all-topic rate
`kafka_server_BrokerTopicMetrics_TotalProduceRequestsPerSec_Count`	Total number of produce requests per second. This metric is collected per host and not per topic
`kafka_server_DelayedOperationPurgatory_NumDelayedOperations_Value`	Number of delayed operations in purgatory.
`kafka_server_DelayedOperationPurgatory_PurgatorySize_Value`	Size of the purgatory queue.
`kafka_server_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent_OneMinuteRate`	Average idle percentage of request handlers over the last minute
`kafka_server_KafkaServer_BrokerState_Value`	State of the broker
`kafka_server_ReplicaManager_IsrExpandsPerSec_Count`	Total number of ISR expansions (cumulative count)
`kafka_server_ReplicaManager_IsrShrinksPerSec_Count`	Total number of ISR shrinks (cumulative count)
`kafka_server_ReplicaManager_LeaderCount_Value`	Number of leader replicas
`kafka_server_ReplicaManager_PartitionCount_Value`	Number of partitions
`kafka_server_ReplicaManager_UnderMinIsrPartitionCount_Value`	Number of partitions under the minimum ISR
`kafka_server_ReplicaManager_UnderReplicatedPartitions_Value`	Number of under-replicated partitions
`kafka_server_group_coordinator_metrics_group_completed_rebalance_count`	Number of completed group rebalances
`kafka_server_group_coordinator_metrics_group_completed_rebalance_rate`	Rate of completed group rebalances
`kafka_server_group_coordinator_metrics_offset_commit_count`	Number of offset commits
`kafka_server_group_coordinator_metrics_offset_commit_rate`	Rate of offset commits
`kafka_server_group_coordinator_metrics_offset_deletion_count`	Number of offset deletions
`kafka_server_group_coordinator_metrics_offset_deletion_rate`	Rate of offset deletions
`kafka_server_group_coordinator_metrics_offset_expiration_count`	Number of offset expirations
`kafka_server_group_coordinator_metrics_offset_expiration_rate`	Rate of offset expirations

Tiered storage metrics

Aiven for Apache Kafka includes several metrics to monitor the performance and health of your Apache Kafka broker's tiered storage operations. Access these metrics through Prometheus to gain insights into various aspects of tiered storage, including data copying, fetching, deleting, and their associated lags and errors.

Metric	Description
`kafka_server_BrokerTopicMetrics_RemoteCopyBytesPerSec_Count`	Total number of bytes copied to remote storage
`kafka_server_BrokerTopicMetrics_RemoteCopyRequestsPerSec_Count`	Total number of copy requests to remote storage
`kafka_server_BrokerTopicMetrics_RemoteCopyErrorsPerSec_Count`	Total number of errors encountered during remote copy
`kafka_server_BrokerTopicMetrics_RemoteCopyLagBytes_Value`	Number of bytes in non-active segments eligible for tiering that are not yet uploaded to remote storage
`kafka_server_BrokerTopicMetrics_RemoteCopyLagSegments_Value`	Number of non-active segments eligible for tiering that are not yet uploaded to remote storage
`kafka_server_BrokerTopicMetrics_RemoteFetchBytesPerSec_Count`	Total number of bytes fetched from remote storage
`kafka_server_BrokerTopicMetrics_RemoteFetchRequestsPerSec_Count`	Total number of fetch requests to remote storage
`kafka_server_BrokerTopicMetrics_RemoteFetchErrorsPerSec_Count`	Total number of errors encountered during remote fetch
`kafka_server_BrokerTopicMetrics_RemoteDeleteRequestsPerSec_Count`	Total number of delete requests to remote storage
`kafka_server_BrokerTopicMetrics_RemoteDeleteErrorsPerSec_Count`	Total number of errors encountered during remote delete
`kafka_server_BrokerTopicMetrics_RemoteDeleteLagBytes_Value`	Number of bytes in non-active segments marked for deletion but not yet deleted from remote storage
`kafka_server_BrokerTopicMetrics_RemoteDeleteLagSegments_Value`	Number of non-active segments marked for deletion but not yet deleted from remote storage
