Prometheus system metrics

Learn how to check what metrics are available for monitoring your service using Prometheus, and find out which of the available metrics are particularly worth monitoring and why.

About the Prometheus integration

The Prometheus integration allows you to monitor your Aiven services and understand their resource usage. With this integration, you can also track some non-service-specific metrics that may be worth monitoring.

To start monitoring metrics with Prometheus, configure the Prometheus integration and set up the Prometheus server.

Get a list of available service metrics

To discover the metrics available for your services, make an HTTP GET request to your Prometheus service endpoint.

  1. Once your Prometheus integration is configured, collect the following Prometheus service details from Aiven Console > the Overview page of your service > the Connection information section > the Prometheus tab:

    • Prometheus URL
    • Username
    • Password
  2. Make a request to get a snapshot of your metrics, replacing the placeholders in the following code with the values for your service:

    curl -k --user USERNAME:PASSWORD PROMETHEUS_URL/metrics

The resulting output is a full list of the metrics available for your service.
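The same request can be made from a script, which makes it easier to reduce the snapshot to a list of distinct metric names. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format. The function names `fetch_metrics` and `metric_names` and the `verify_tls` parameter are hypothetical and introduced here for illustration; `verify_tls=False` mirrors the `-k` flag of the curl command.

```python
import base64
import re
import ssl
import urllib.request


def fetch_metrics(prometheus_url, username, password, verify_tls=True):
    """Fetch the raw metrics snapshot using HTTP basic authentication."""
    request = urllib.request.Request(prometheus_url.rstrip("/") + "/metrics")
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    context = ssl.create_default_context()
    if not verify_tls:
        # Equivalent to curl -k: skip certificate verification.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        return response.read().decode("utf-8")


def metric_names(exposition_text):
    """Return the sorted, de-duplicated metric names found in the snapshot."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return sorted(names)
```

For example, `metric_names(fetch_metrics(PROMETHEUS_URL, USERNAME, PASSWORD))` would return one entry per metric, regardless of how many label combinations each metric has.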

Metrics

CPU usage

CPU usage metrics help you determine whether the CPU is constantly maxed out. For a high-level view of CPU usage on a single-CPU service, you can use the following:

    100 - cpu_usage_idle{cpu="cpu-total"}

note

A process with a nice value larger than 0 is categorized as cpu_usage_nice, which is not included in cpu_usage_user.

tip

It can be useful to monitor cpu_usage_iowait{cpu="cpu-total"}. A high value indicates that the service node is working on something I/O intensive. For example, if cpu_usage_iowait{cpu="cpu-total"} equals 40, the CPU is idle waiting for disk or network I/O operations 40% of the time.
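The arithmetic behind these two checks can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the 30% iowait threshold is an arbitrary assumption, not a recommended value.

```python
def cpu_busy_percent(cpu_usage_idle):
    """Return the non-idle CPU percentage, that is, 100 - cpu_usage_idle."""
    return 100.0 - cpu_usage_idle


def is_io_bound(cpu_usage_iowait, threshold=30.0):
    """Return True when the iowait percentage reaches the threshold.

    The default threshold is an assumption chosen for illustration.
    """
    return cpu_usage_iowait >= threshold


# Values as they might be read from the cpu="cpu-total" series.
print(cpu_busy_percent(35.0))  # 65.0
print(is_io_bound(40.0))       # True
```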

Some important CPU-related metrics you can collect and monitor are generated by the Telegraf plugin. They are as follows:

Metric                Description
cpu_usage_idle        Percentage of time the CPU is idle
cpu_usage_system      Percentage of time the CPU spends running kernel code
cpu_usage_user        Percentage of time the CPU is in the user-space program with a nice value <= 0
cpu_usage_nice        Percentage of time the CPU is in the user-space program with a nice value > 0
cpu_usage_iowait      Percentage of time that the CPU is idle when the system has pending disk I/O operations
cpu_usage_steal       Percentage of time waiting for the hypervisor to give CPU cycles to the VM
cpu_usage_irq         Percentage of time the system is handling interrupts
cpu_usage_softirq     Percentage of time the system is handling software interrupts
cpu_usage_guest       Percentage of time the CPU is running for a guest OS
cpu_usage_guest_nice  Percentage of time the CPU is running for a guest OS with a low priority

Disk usage

Monitoring disk usage helps ensure that applications or processes don't fail due to insufficient disk storage.

tip

Consider monitoring disk_used_percent and disk_free.

The following table lists some important disk usage metrics you can collect and monitor:

Metric             Description
disk_free          Free space on the service disk
disk_used          Used space on the disk, for example, 8.0e+9 (8,000,000,000 bytes)
disk_total         Total space on the disk (free and used)
disk_used_percent  Percentage of the disk space used, equal to disk_used / disk_total * 100, for example, 80 (80% service disk usage)
disk_inodes_free   Number of index nodes available on the service disk
disk_inodes_used   Number of index nodes used on the service disk
disk_inodes_total  Total number of index nodes on the service disk
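As a worked example of the disk_used_percent formula, the sketch below applies the same ratio to both bytes and index nodes. The helper name `used_percent` and the sample figures are hypothetical; only the formula comes from the table above.

```python
def used_percent(used, total):
    """Return used / total * 100, as in the disk_used_percent definition."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


# 8,000,000,000 bytes used on a 10,000,000,000-byte disk.
print(used_percent(8.0e9, 1.0e10))    # 80.0

# The same ratio applied to disk_inodes_used and disk_inodes_total.
print(used_percent(250_000, 1_000_000))  # 25.0
```

Checking the inode ratio alongside the byte ratio can be worthwhile because a disk can run out of index nodes while free space remains.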

Memory usage

Metrics for monitoring the memory consumption are essential to ensure the performance of your service.

tip

Consider monitoring mem_available (in bytes) or mem_available_percent. These metrics report the estimated amount of memory available to applications without swapping.

Network usage

Monitoring the network provides visibility of your network and an understanding of the network utilization and traffic, allowing you to act immediately in case of network issues.

tip

It may be worth monitoring the number of established TCP sessions, which is reported by the netstat_tcp_established metric.
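To read the current value of a single metric such as netstat_tcp_established from a metrics snapshot, a small parser is enough. The following is a minimal sketch, assuming the snapshot uses the standard Prometheus text exposition format. The function name `sample_values` and the `host` label in the sample are hypothetical.

```python
import re

# Metric name, optional {label="value",...} block, then the sample value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def sample_values(exposition_text, metric_name):
    """Return (labels, value) pairs for every sample of metric_name."""
    results = []
    for line in exposition_text.splitlines():
        match = _SAMPLE.match(line.strip())
        # Comment lines start with "#" and never match the name pattern.
        if not match or match.group(1) != metric_name:
            continue
        labels = dict(_LABEL.findall(match.group(2) or ""))
        results.append((labels, float(match.group(3))))
    return results


snapshot = (
    'netstat_tcp_established{host="node-1"} 42\n'
    'netstat_tcp_listen{host="node-1"} 7\n'
)
for labels, value in sample_values(snapshot, "netstat_tcp_established"):
    print(labels, value)
```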