Earlier this year we introduced Aiven for Apache Kafka Dashboards, greatly increasing the visibility Aiven customers have into their Kafka services. Now, similar insights are available for all Aiven PostgreSQL services.
Check out this help article to see the graphs that are included in the Aiven for PostgreSQL dashboard by default.
In this post, we'll cover why access to graphing dashboards is important, four tasks that are ideal for data visualization, and how to set up your Aiven dashboards service integration.
Why graphing dashboards are important
When operating any non-trivial software, you will eventually need to investigate anomalies that occur within it. For example, the software might be running slower than you’d expect, or some subset of its functionality isn’t working correctly.
When this occurs, good visibility into system internals can turn a tedious debugging nightmare into five minutes of perusing data visualizations to uncover the issue. Even better, you could avoid a problem altogether because the data reveals an upcoming issue, maybe even via an alert.
In short, if you want to make your debugging more efficient and possibly avoid issues altogether, you should be using graphing dashboards. With that time, you can focus more resources on adding value to your application.
Four tasks ideal for data visualization
1. Capacity planning
Long-term graphs that show how the utilization of basic resources (CPU, disk, memory, network, and so on) changes over time make it fairly easy to extrapolate where usage will likely go in the future.
Without them, reliable predictions are hard to make: if your input is only a handful of infrequently collected, isolated data points, random variance can easily skew it.
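As a sketch of how such extrapolation works, here is a minimal least-squares fit over daily disk-usage samples; the sample values and the 350 GB capacity are hypothetical, and in practice you would pull the data points from your metrics store (e.g. InfluxDB):

```python
# Sketch: extrapolating disk usage from regularly collected samples.

def days_until_full(samples_gb, capacity_gb):
    """Least-squares linear fit over daily samples; returns the estimated
    number of days until the fitted line reaches capacity, or None if
    usage is flat or shrinking (nothing to extrapolate)."""
    n = len(samples_gb)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples_gb) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_gb))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Day index at which the fitted line crosses capacity, minus "today"
    return (capacity_gb - intercept) / slope - (n - 1)

# 14 hypothetical daily samples growing ~5 GB/day toward a 350 GB disk
usage = [100 + 5 * day for day in range(14)]
print(round(days_until_full(usage, 350)))  # 37 days of headroom left
```

A dashboard makes the same trend visible at a glance; the value of the frequent, regular collection is that the fit is not dominated by any single noisy sample.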
2. Optimizing resource usage
A graph showing steady disk space growth might indicate the need to upgrade to a larger plan in less than a month. But increased disk usage of over 20 gigabytes in a single day might instead indicate a software anomaly.
You can also take a more granular look at your database, such as your table and index sizes, to see if anything is out of place as well.
3. Automating service state monitoring
With many services, the primary alert is getting a complaint from an end-user. Relying on your end-users to raise issues is a risky gamble because you never know at what point they will decide to look for alternatives.
While end-users are a valuable resource for feedback, it is less costly in the long run to rely on alerts that are triggered when some metric goes beyond the expected range: this is easy to set up when you have relevant, visualized data available.
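The logic behind such an alert can be sketched in a few lines; the metric values and limits below are hypothetical, and a real Grafana alert would evaluate a query against InfluxDB rather than a Python list:

```python
# Sketch of a threshold check: fire only when a metric stays outside its
# expected range for several consecutive evaluation points, which filters
# out one-off spikes. Names and limits are illustrative.

def should_alert(points, upper_limit, consecutive=3):
    """True if the last `consecutive` points all exceed the limit."""
    recent = points[-consecutive:]
    return len(recent) == consecutive and all(p > upper_limit for p in recent)

iowait = [2, 3, 2, 40, 3, 45, 47, 52]  # hypothetical per-minute iowait %
print(should_alert(iowait, upper_limit=30))  # True: three high points in a row
```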
There are a number of reasons why database-related operations can slow down or stop working. Luckily, many are easy to monitor so that you're notified proactively:
High iowait
There are several reasons why iowait might rise: missing indexes resulting in heavy sequential scans, large joins requiring temporary files, excessive reads and updates compared to server specs, too little memory compared to the active dataset size, and so on. But regardless of the root cause, continuously high iowait will eventually cause queries to slow down.
Deadlocks
Deadlocks are caused by different transactions trying to modify the same records in a different order so that none of them can make progress. PostgreSQL will eventually abort some of the transactions so that the others can proceed.
While retrying on the client side can keep the application working correctly, issues will ultimately become visible to end-users when deadlocks start occurring too often. Using the SELECT … FOR UPDATE construct to avoid deadlocks is effective in simple cases.
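The lock-ordering idea that makes deadlocks avoidable can be sketched outside of SQL as well. Below is a minimal, hypothetical Python illustration, with threads standing in for transactions and `threading.Lock` objects standing in for row locks; the fix is the same in both worlds, namely always acquiring locks in one agreed-upon order:

```python
# Sketch: two workers repeatedly touching the same two "rows". Because
# both always lock in ascending row-id order, neither can end up holding
# one lock while waiting forever for the other.
import threading

row_locks = {1: threading.Lock(), 2: threading.Lock()}

def transfer(src, dst):
    """Lock both rows in sorted order, then do the (elided) updates."""
    first, second = sorted((src, dst))
    with row_locks[first]:
        with row_locks[second]:
            pass  # ... update both rows here ...

t1 = threading.Thread(target=lambda: [transfer(1, 2) for _ in range(1000)])
t2 = threading.Thread(target=lambda: [transfer(2, 1) for _ in range(1000)])
t1.start(); t2.start(); t1.join(); t2.join()
print("no deadlock")
```

Acquiring the two locks in the order the caller happens to pass them (src first, then dst) would let the two threads deadlock; sorting first removes that possibility, just as locking rows in a consistent order does in SQL.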
High number of connections
PostgreSQL forks a new process for each connection, and each of those may use up to work_mem of memory. Consequently, each connection is quite heavy, and there is a hard limit on the number of allowed connections.
After you reach the maximum, you can't establish new connections until old ones are closed, possibly making an application slow down or stop working. Connection pooling increases the number of connections clients can make, but it doesn’t increase the number of backend connections.
Therefore, a large number of clients making queries that take a non-trivial amount of time will eventually cause new queries to slow down, because they cannot be started immediately.
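As a rough, hypothetical back-of-the-envelope calculation of why backend connections are heavy (note that a single complex query can in fact use several multiples of work_mem, so this is if anything an underestimate):

```python
# Sketch: worst-case memory if every backend connection uses its full
# work_mem allowance on top of shared_buffers. All figures hypothetical.

def worst_case_mb(max_connections, work_mem_mb, shared_buffers_mb):
    return shared_buffers_mb + max_connections * work_mem_mb

# 200 direct backend connections at work_mem = 16MB, 1GB shared_buffers
print(worst_case_mb(200, 16, 1024))  # 4224 MB
```

This is why a pooler helps on the client side but cannot raise the backend limit: the per-backend memory cost is still there for every real server process.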
Long running transactions
Because of PostgreSQL's MVCC model, old row versions cannot be garbage collected as long as a transaction that might still need to see them remains open. This prevents vacuum from doing its job, causing your database size to grow and queries to slow down.
For a busy database, a transaction that remains open for an hour or two, even if completely idle and not causing any load itself, might badly deteriorate the database server performance.
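A sketch of the check behind a "long running transactions" alert: in PostgreSQL, the per-session transaction start time is available from pg_stat_activity (the xact_start column); the session rows below are hypothetical stand-ins for those query results:

```python
# Sketch: flag sessions whose transaction has been open longer than a
# cutoff, whether or not the session is doing any work.
from datetime import datetime, timedelta

def long_transactions(sessions, now, max_age=timedelta(hours=1)):
    """Return pids of sessions with a transaction open longer than max_age."""
    return [s["pid"] for s in sessions
            if s["xact_start"] is not None and now - s["xact_start"] > max_age]

now = datetime(2020, 1, 1, 12, 0)
sessions = [
    {"pid": 101, "xact_start": now - timedelta(minutes=5)},  # fine
    {"pid": 102, "xact_start": now - timedelta(hours=2)},    # idle but old
    {"pid": 103, "xact_start": None},                        # no open txn
]
print(long_transactions(sessions, now))  # [102]
```

Note that pid 102 is flagged even if it is completely idle; as described above, an idle-but-open transaction is exactly the case that quietly blocks vacuum.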
The above list comprises things that are often interesting from a monitoring point of view. These and many others are available as graphs in the Aiven for PostgreSQL dashboard by default, and alerts can be added from the Grafana UI by editing the desired graph. That brings us to point #4.
4. Investigating application update anomalies
When the number of application users increases, a new set of problems may become visible. Increases in the number of queries and the amount of data stored in the database are obvious results of increased use.
If everything is working well, simply scaling up the database server specs may be enough to cope with the increased load. But very often, issues that were already present in the code begin to surface.
With increased queries, and more often than not an increase in clients that need to connect to the backend, issues such as deadlocks or running out of connections may start appearing even if the server specs are increased in proportion to the increased load.
Another set of problems is related to the growth of database size, which may lead to issues that are not in proportion to the load and specs. For example, a missing index might not be a problem when the database is small.
In fact, doing a full sequential table scan might well be the best possible strategy for getting results from a fairly small table that easily fits into memory, and the PostgreSQL query planner might opt to do that even if an index was available.
However, when the table size grows, the server might not be able to switch to a better plan due to the missing index, and CPU usage and possible disk IO will start growing considerably as a result.
Another example: joins that produce large temporary data sets may not be noticeable while the tables involved are small, but start causing problems as they grow.
Specifically, as long as the results fit into work memory, there is no additional disk usage. But once the tables grow enough, the server is forced to write temporary results to disk, which may cause a dramatic increase in disk IO and, through that, slower server responses.
It might be possible to optimize the queries to avoid large joins or it might make sense to increase work memory for a specific user account that is used to make the heavy queries so that the results can be kept in memory.
Tip: Temporary file creations and bytes written to temporary files are visible in their own graph, making it easy to see when this problem occurs. The Aiven web console and API provide log access to help further investigate which queries are causing temporary files to be created.
For even more capability, you can use the Aiven OpenSearch + OpenSearch Dashboards service integration to improve search capabilities. For instance, searching for “temporary file” would return the relevant entries in the log files. Next, let's look at setting up the dashboard integration.
How to set up your dashboard integration
The Aiven dashboard functionality makes use of three separate Aiven hosted services: the Aiven PostgreSQL (or Kafka) service you want to get metrics from, Aiven InfluxDB for storing the metrics data points, and Aiven Grafana for showing graphs and raising alerts based on the data points in InfluxDB.
If you don’t already have the other services running in your project you’ll need to create those first or have them automatically created while enabling integration between the different services.
We have a simple getting started guide, Getting Started with Service Integrations, to help you get up and running in minutes.
We introduced some scenarios in which good visibility into what is going on with your database server can save money and time by making otherwise difficult tasks easy, ultimately improving the end-user experience.
While hardly a comprehensive list, we hopefully demonstrated why it is important to make sure you can monitor your system and try to set up automatic alerting when values go beyond the expected ranges.
If you’re already using Datadog for your monitoring, Aiven also supports integrating with it, though the data points available via the Datadog integration are somewhat more limited. Also, keep in mind that you can freely create new dashboards in Aiven Grafana, so if you're a heavy user you’re not in any way bound to the default dashboard.
As always, do let us know if this functionality is lacking features you think it should have! In the meantime, subscribe to our blog and changelog RSS feeds, or follow us on Twitter or LinkedIn to stay up-to-date.