Sep 24, 2024
Cost Effective Data Retention with Aiven for ClickHouse® Tiered Storage
Tiered storage is now generally available in Aiven for ClickHouse®. Combine the speed of ClickHouse with the cost optimization of object storage for a powerful data warehouse solution.
ClickHouse® combines a columnar design, industry-leading compression, and blazing-fast queries to make it one of the most performant data warehouse solutions available. Aiven for ClickHouse® takes this a step further with tiered storage, now generally available in the Aiven console. Enabling tiered storage lets you optimize costs by using object storage in tandem with your existing SSD storage, for the best of both worlds.
Tiered storage overview
Tiered storage can be enabled on a per-table basis either from within the Aiven console, or on the command line with the ClickHouse client. Once enabled, data will be distributed between SSDs and object storage based on your defined data threshold.
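On the command line, enabling tiered storage for a table comes down to switching the table's storage policy. The statements below are a minimal sketch, not a definitive recipe: the table name default.app_logs is hypothetical, and the policy name 'tiered' is an assumption that is worth confirming on your own service with the first query before running the second.

```sql
-- List the storage policies, volumes, and disks available on the service.
SELECT policy_name, volume_name, disks
FROM system.storage_policies;

-- Enable tiered storage on an existing table.
-- Assumptions: default.app_logs is a hypothetical table, and 'tiered'
-- is assumed to be a policy name returned by the query above.
ALTER TABLE default.app_logs
    MODIFY SETTING storage_policy = 'tiered';
```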
By default, once an Aiven for ClickHouse instance's SSDs hit 80% capacity, data will be automatically moved to object storage. This behavior can be further refined by creating an explicit TTL, where all data older than a user-defined interval is moved to object storage on a recurring basis. Data can also be manually moved to object storage as needed for any one-off use cases.
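As a rough sketch of the TTL and manual options described above, the statements below use standard ClickHouse syntax. The table default.app_logs, its event_time column, the 30-day interval, the partition value, and the volume name 'tiered' are all assumptions for illustration; check the volume names on your service before running anything similar.

```sql
-- Move data older than 30 days to object storage on a recurring basis.
-- Assumes event_time is a Date or DateTime column.
ALTER TABLE default.app_logs
    MODIFY TTL event_time + INTERVAL 30 DAY TO VOLUME 'tiered';

-- One-off move of a single partition to object storage.
-- Assumes the table is partitioned by toYYYYMM(event_time).
ALTER TABLE default.app_logs
    MOVE PARTITION 202401 TO VOLUME 'tiered';
```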
At any point you can check the distribution of data in your Aiven for ClickHouse tables from within the Aiven console. You can also query this data via the ClickHouse client like so:
SELECT
    database,
    table,
    disk_name,
    formatReadableSize(sum(data_compressed_bytes)) AS total_size,
    count(*) AS parts_count,
    formatReadableSize(min(data_compressed_bytes)) AS min_part_size,
    formatReadableSize(median(data_compressed_bytes)) AS median_part_size,
    formatReadableSize(max(data_compressed_bytes)) AS max_part_size
FROM system.parts
GROUP BY
    database,
    table,
    disk_name
ORDER BY
    database ASC,
    table ASC,
    disk_name ASC
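The result contains one row per combination of database, table, and disk, showing the total compressed size stored on that disk, the number of parts, and the minimum, median, and maximum part size.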
Example use case: logs
So, when would you want to use tiered storage in your environment? Any time you are willing to trade some query performance on older data for lower storage costs. For example: logs.
Aiven for ClickHouse is already a great option for logs. There's an OpenTelemetry ClickHouse exporter, for example, that can be combined with the ClickHouse plugin in Aiven for Grafana to visualize queries and create logging dashboards very quickly. The trick becomes keeping that logging solution cost effective.
Often an organization will have a retention policy that requires it to keep logs for months or years, but most of the queries and analytics it runs only apply to the most recent data in the table. As such, keeping the entirety of those logs on SSDs gets very costly with diminishing returns, as the vast majority of the stored data is rarely accessed.
The solution in this case would be to determine roughly how far back in time the average query is interested in, and define that as a TTL for triggering tiered storage. Any data older than that TTL is stored on object storage and still accessible, if slightly less performant, and the latest data is kept on SSDs so that investigating emergent issues is as fast as possible.
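For a logs table, that approach might look like the following sketch. Everything specific here is hypothetical: the table and column names, the 30-day window for recent data, the one-year retention period, and the 'tiered' policy and volume names, which are assumed rather than confirmed for your service.

```sql
-- Hypothetical logs table: recent data on SSDs, older data on object storage.
CREATE TABLE default.service_logs
(
    event_time DateTime,
    service    LowCardinality(String),
    severity   LowCardinality(String),
    message    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (service, event_time)
-- First rule: move data older than 30 days to the assumed 'tiered' volume.
-- Second rule: delete rows once they pass the assumed one-year retention.
TTL event_time + INTERVAL 30 DAY TO VOLUME 'tiered',
    event_time + INTERVAL 1 YEAR DELETE
SETTINGS storage_policy = 'tiered';
```

ClickHouse applies move rules to whole data parts, so a part moves to object storage once all of its rows are older than the interval. The 30-day figure is only a placeholder for whatever window your typical queries actually cover.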
Get started with tiered storage
You can learn more about tiered storage in the Aiven for ClickHouse documentation.
If you have any questions about configuring tiered storage on your services, don't hesitate to reach out to support@aiven.io.