[UPDATE] A consumer group dashboard is now available, learn more: (Part II) - Consumer groups.
Generally, operations folks and developers want to understand how a particular software system behaves under load, how to better diagnose issues when they arise, a warning system to alert before the system runs out of critical resources, and to plan for capacity to handle future growth.
Displaying Apache Kafka telemetry is something that would be tremendously helpful to that end, but it is something that vanilla Kafka lacks: enter Aiven Kafka dashboards and telemetry.
Using InfluxDB and Grafana, Aiven customers can seamlessly collect and visualize Aiven Kafka and system resource telemetry. Our solution goes one step further by providing granular dashboards for Aiven Kafka instances by default.
Simply put, an InfluxDB instance receives Aiven Kafka telemetry data and sends it to Grafana for visualization.
With better visibility, ops folks and developers benefit in a number of ways, including improvements in:
- Understanding of peak usage bottlenecks
- Performance and bottlenecks debugging
- Awareness of system resource utilization and trends
- Oversight of critical systems to prevent resource exhaustion that causes service outage
- Capacity planning to handle future growth
Setting up an Aiven Kafka dashboards integration is simple and only takes a few moments. First, let's take an in-depth look at what dashboards are included by default, how to set up custom dashboards and alerting, and how to query InfluxDB(feel free to jump to the section of your choice):
- Built-in Aiven Kafka dashboards
- How to set up custom dashboards in 10 steps
- How to set up alerting in two steps
- How to query InfluxDB for Kafka telemetry using InfluxQL
Built-in Aiven Kafka dashboards
It's easy to think of dashboards as belonging to three separate groups:
- Kafka specific,
- System level resources (CPU, RAM, Network, Disk, etc) related, and
- Belonging to the Java Virtual Machine (JVM) on which Kafka runs
Below are examples of each group:
1. Kafka dashboards
Kafka-specific metrics about brokers, message activity, etc.
2. System dashboards
The usual host-level resource metrics, such as CPU, Memory, Network and disk usage.
3. JVM dashboards
Internal metrics from the JVM that runs the Kafka processes.
Example System Setup
To highlight the usefulness of Aiven Kafka dashboards, we ran a Kafka load generator for an Aiven Kafka service with the following setup:
- Using a Startup-2 service plan, i.e. three Kafka brokers running on separate VMs.
- Replication factor of two and two partitions for each topic which included producers and consumers.
Examining Kafka specific dashboards
Kafka dashboards are a must-have if you wish to better understand the behavior of the Kafka system in relation to the applications using it.
Ideally, you want to have headroom to handle peak throughput.
The Kafka dashboards should generally be used in conjunction with system-level dashboards that show CPU utilization, available memory, as well as network and disk throughput.
Kafka messages in represents the incoming messages per second sent into the Kafka brokers.
It is tremendously useful to understand the min, max, and mean throughput of messages that your application is sending through Kafka.
These numbers could correlate to other external metrics such as the number of users the system is serving, the number of IoT devices being managed by your application, etc., and improve your planning for future anticipated capacity.
Fetch requests and Kafka failed fetch requests represent the number of requests per second and the number of failed requests per second, respectively.
Under normal circumstances, the failed fetch requests should be zero.
However, if it exceeds zero, it could be a sign that requests are dropping due to internal buffers running out of resources.
Somewhat similar to fetch requests, Requests represents the number of requests per second.
Kafka log cleaner represents the elapsed time since the last time the log cleaner ran. The log cleaner is responsible for Kafka log compaction which includes removing messages with a guarantee of keeping at least the last message for each primary key.
It is good to keep an eye out for log compactions that run for an extended time as it may indicate over-utilization of the CPU or insufficient disk IOPS for the compaction task.
Kafka bytes in and bytes out represent the number of incoming and outgoing bytes per second. These metrics help you understand peak bytes in/out for the Kafka system.
Consequently, they provide a better estimate of current and future network bandwidth requirements. For example, if your Kafka system is maxing out at a particular level and you are looking to reach higher levels, it may help to move to a faster network.
Note: use this metric with Kafka messages in per second and available memory metrics to draw appropriate conclusions.
As the name implies, Kafka under-replicated partitions represents the number of under-replicated partitions in Kafka.
The number should typically be zero, and in cases where it is not, it means Kafka is doing work to reach the specific replication factor for some topic partitions. Note that it is normal to see under-replication partitions; for example, when replication factors have been adjusted by the user, or when the cluster is being migrated, upgraded, or when maintenance updates are being applied.
If the number of under-replicated partitions exceeds zero for an extended period of time, it indicates that replicas are continuing to be out of sync. This can be an indication of a problem, such as one of the brokers may have been on hardware that has failed. Problems such as this are resolved by Aiven's self-healing automation that will launch new VMs to replace any failed ones and will automatically get them fully in sync with the cluster.
Still, ops folks may want to keep a close eye on this metric. Further, this is a good candidate for alerting. We will show how to setup up alerts in Grafana below.
Kafka delayed operations and Kafka purgatory size display the number of requests that are waiting to be fulfilled and the size of the holding pen for requests, respectively.
Kafka ISR expands and shrinks display the number of times the in-sync replicas have expanded after a failed broker is fully caught-up, and the number of times the in-sync replicas have shrunk due to a broker going down, getting stuck or just falling behind in replication. As demonstrated in the dashboard, there were neither expands nor shrinks during our load exercise.
Kafka handler threads idle and network threads idle display the average fraction of time the request handler threads are idle, and the average fraction of time the network threads are idle.
During heavy load, the threads will be idling much less than what we see here during our load exercise. If the network threads are super busy, it could mean that there are far too many messages coming into the system, and may indicate adding more brokers to distribute the load.
Leader election rate & Unclean leader elections display the number of clean leader election attempts per second and unclean leader elections.
Leader elections are triggered when the broker that is leader for a partition becomes unavailable. The rest of the in-sync replica brokers will decide which of the brokers will become the new leader for the partition. Unclean leader election is not enabled by default in Aiven an can only occur when there is no available in-sync replica candidate for the new leader and Aiven Operations overrides the leadership manually.
How to set up custom dashboards in 10 steps
Custom dashboards are easy to create in Grafana and we will demonstrate the mechanics involved in creating them within this section.
Note: anytime you save the dashboard that you are working on, you must use a name that does not begin with "Aiven".
If you don't, your dashboards may get overwritten during upgrades because Aiven ships a set of built-in dashboards beginning with the name "Aiven".
- Select Home from the search box beside the Grafana icon in the top-left corner of the browser window.
- Click on the blue New Dashboard button to the right.
- Select Graph for the new dashboard widget.
- Under the General tab, give the graph a title.
- In the Metrics tab, select the datasource for your project next to Datasource.
Under Query A,
- For FROM clause, select
defaultfor database and
- For WHERE clause, pick
project = __yourproject__;
service = __yourservice__,
- For SELECT, pick
field(Value), followed by
- For "FORMAT AS" and "ALIAS BY" you can leave the values to their defaults
- For FROM clause, select
- In the Axes tab, pick the settings as shown above.
- In the Legend tab, pick the settings as shown above.
- In the Display tab, pick the settings as shown above.
- Save the dashboard as "MyCustomDashboard".
You should now be able to visualize the data from your Kafka brokers.
Note: it is important to set the time window for your observed data in the top-right corner of the Grafana window. As a starting point, a setting of "last 15 minutes" should suffice.
Now that your dashboard is set up, let's look at how to create custom alerts. Alerting is a must-have for peace of mind.
- Select "MyCustomDashboard" that you created earlier, then click the title, "Kafka under-replicated partitions" graph and select edit.
- In the Alert tab for the graph, and then under the Alert config subsection, pick the entries shown below.
And that's it! When the condition that you set is triggered, Grafana will automatically create alerts.
To view a history of triggered events, look in the State history section within the Alert tab. It'll help you ensure that there are no issues with partition replication.
Alerts can be sent via different notification mechanisms to your favorite incident management service (PagerDuty, VictorOps, OpsGenie, etc.) or to your chat service (Slack, HipChat). More details on how to setup notification channels can be found in the Grafana documentation.
You can also create useful alerts for other situations, e.g. running low on available memory, running high on CPU utilization, etc.
Alerting provides tremendous value because it acts as an early-warning system that helps you protect your Kafka cluster from exhausting system resources and becoming a service affecting outage. Therefore, we highly recommend taking advantage of this feature.
How to query InfluxDB for Kafka telemetry using InfluxQL
InfluxDB is a high performance time-series database that lends itself to storing and reporting of telemetry data from sensors and internet-of-things devices in an efficient and performant manner.
As noted above, Aiven Kafka telemetry measurements are stored in an InfluxDB instance for reporting and analysis purposes.
In InfluxDB, metrics are stored in a "measurement", which is a rough equivalent of a relational database table. For example, Aiven Kafka metrics for Kafka under-replicated partitions are stored in a measurement called
All measurement values (rows) have a timestamp, optional key-value tags that are useful in limiting your queries to a subset of the entire dataset (for example, to a specific target host) and one or more metrics values.
Let's walk through the necessary steps to query Kafka measurements that are stored in the InfluxDB using its InfluxQL query language.
Note: It is assumed that you have installed InfluxDB CLI that is bundled with the InfluxDB distribution for your operating system.
If you haven't, you can find the instructions for installation and getting started here**
Connecting to the InfluxDB instance
First and foremost, here is the command line you would use to connect to the InfluxDB instance storing your Kafka telemetry.
influx -host yourinfluxdb-yourproject.aivencloud.com \ -port yourinstanceport \ -database 'defaultdb' \ -username 'avnadmin' \ -password 'yourpassword'\ -ssl -unsafeSsl
Note: substitute the appropriate hostname/project, port and password parameters for the above command.
Our Getting started with Aiven InfluxDB help article has more examples on how to connect to the InfluxDB service using various clients.
Once you have successfully connected to the InfluxDB instance, you can then run some queries to retrieve Kafka telemetry measurements. Let's start with determining what retention policies exist in the database.
In InfluxDB, a retention policy automatically manages the lifetime of data points in measurements and comes with a default retention policy.
Aiven Kafka dashboards integrations add additional retention policy for Aiven Kafka measurements.
Additionally, you can manage retention policies via
SHOW RETENTION POLICIES,
CREATE RETENTION POLICY,
ALTER RETENTION POLICY, and
DROP RETENTION POLICY.
To determine retention policies
> SHOW RETENTION POLICIES;
To determine telemetry measurements
If you want to see all of the Aiven Kafka telemetry measurements stored in the InfluxDB instance, use the InfluxQL command below.
> SHOW MEASUREMENTS;
To retrieve Kafka under-replicated partitions
To retrieve Kafka under-replicated partitions from InfluxDB, use the InfluxQL command below.
SELECT max("Value") FROM "kafka.server:ReplicaManager.UnderReplicatedPartitions" WHERE "project" = 'yourproject' AND "service" = 'yourservice' AND time >= now() - 15m GROUP BY time(1m);
Note that InfluxQL uses different types of quoting for fields and values. String values are quoted
with single-quotes (
'example') where as retention policies, measurements and column names are quoted
with double-quotes (
The query above finds the maximum value of under-replicated partitions for
yourservice in the last 15 minutes and groups them by 1 minute intervals.
The result of the above query should look something like,
name: kafka.server:ReplicaManager.UnderReplicatedPartitions time max ---- --- 1520196600000000000 0 1520196660000000000 0 1520196720000000000 0 1520196780000000000 0 1520196840000000000 0 1520196900000000000 0 1520196960000000000 0
Similarly, to determine the total fetch requests per second measurement, use the following InfluxQL command,
SELECT non_negative_derivative(mean("Count"), 1s) FROM "kafka.server:BrokerTopicMetrics.TotalFetchRequestsPerSec" WHERE "project" = 'yourproject' AND "service" = 'yourservice' AND time >= now() - 15m GROUP BY time(30s), "host";
The result of the query above should be something like,
tags: host=kafka-telemetry-15 time non_negative_derivative ---- ----------------------- 1520197410000000000 157.93333333333334 1520197440000000000 159.73333333333332 1520197470000000000 159.66666666666666 1520197500000000000 160 1520197530000000000 158.93333333333334 1520197560000000000 157.33333333333334 1520197590000000000 160 1520197620000000000 159
As you can see, querying the underlying InfluxDB database that stores the Kafka telemetry measurements from your Aiven Kafka cluster is easy.
Aiven Kafka dashboards are a powerful tool for debugging and troubleshooting issues, understanding system behavior during load/peak hours and trends, alerting to avert service outage, and capacity planning.
Even better, it is simple to setup. We've also introduced Kafka consumer group-level dashboard. Learn more about it by reading the latest post in the Kafka dashboards series, Gain visibility with Aiven Kafka dashboards (Part II) - Consumer groups.
We'll continue to improve the dashboards and plan on introducing Kafka topic related, and producer specific telemetry and a set of dashboards in upcoming ititerations of Aiven Kafka dashboards to provide even more visibility into your Aiven Kafka. So, stay tuned!