Aug 28, 2024
Unleash Savings and Scalability with Aiven for Apache Kafka®
Tiered Storage reduces storage costs, increases operational flexibility, and provides unlimited data retention scalability
Aiven announces the general availability of Tiered Storage in Aiven for Apache Kafka®. Tiered storage enables Apache Kafka® to seamlessly move older, less frequently accessed data to a lower cost storage tier within Kafka. This enables infinite data retention while substantially reducing storage costs and improving performance and operational efficiency. Tiered storage in Aiven for Apache Kafka is production-ready for AWS, Google Cloud, Azure, and soon to be available for Bring Your Own Cloud (BYOC) deployments for these cloud providers.
The popularity of Apache Kafka for building real-time data pipelines and event-driven architectures continues to accelerate. Larger datasets are being stored for longer durations, bringing growing challenges of scalability, performance, operational management, and especially cost-efficiency.
Traditionally, Kafka stored all data on fast local disks and defaulted to a short 3- to 7-day retention period to mitigate the high cost of storage. As data accumulated, customers either purchased more storage, or moved data out of Kafka to less expensive datastores like Hadoop Distributed File System (HDFS) or other databases, data lakes, or data warehouses for long-term retention and analysis. But this created a fragmented data access experience for consumers that needed to access both fresh and historical data-–maintaining queries to both Kafka and other datastores is complex and cumbersome.
Tiered storage addresses this challenge where storage is no longer the bottleneck or budget buster in scaling Kafka. Log segments on the Kafka broker node remain on local disk for fast access, and cold data is automatically moved within Kafka to lower cost cloud object storage like Amazon S3, Google Cloud Storage, or Azure Blob Storage. Using tiered storage reduces the size of broker node storage, which eliminates the need for large node-to-node data replication operations and makes upgrades and migrations smoother and more performant.
Benefits of Apache Kafka tiered storage
Tiered storage optimizes costs, simplifies scaling, boosts performance, and improves operational efficiency. These benefits make Apache Kafka a more versatile and cost-effective solution for both real-time and long-term data storage. For use cases involving large volumes of historical data, the cost savings can be very substantial.
- Unlimited storage: Data can be retained indefinitely without space constraints.
- Optimized costs: Storage costs are significantly reduced by moving older data from costly local disks in the broker node to a less expensive storage tier.
- Greater elasticity: Decoupling compute and storage enables storage to expand independently, without adding new broker nodes and copying the entire dataset from old to new nodes. Also, compute resources can be adjusted based on demand, leading to overall more efficient resource allocation and utilization.
- Improved Performance: Consumers querying fresh data on local disk experience no performance impact from those reading data in tiered storage. Data read from tiered storage is retained in a cache for faster subsequent queries on the same data.
- Simplified Management: With less data managed locally by the broker, node-to-node replications, upgrades, migrations, and rebalancing are faster and smoother, reducing the operational load on both the system and Kafka administrators.
When and why to use tiered storage
Understanding when and why to use tiered storage in Aiven for Apache Kafka helps you maximize its benefits, particularly around cost savings and system performance.
Top scenarios for using tiered storage include:
- Long-term data retention: Many organizations require large-scale data storage for extended periods for regulatory compliance, audits, or historical data analysis. For example, financial institutions retain auditable records for seven years. Tiered storage makes it possible to keep data accessible within Kafka for as long as required at a very reasonable cost.
- Immutable backup: When failures occur in downstream consumers like PostgreSQL, OpenSearch, or ClickHouse, Kafka tiered storage contains the full data history to quickly restore the data to its current state.
- High-speed data ingestion: Tiered storage can manage unpredictable or sudden data spikes by supplementing local disks with cloud storage to ensure optimal performance. Retailers and ecommerce websites may see a huge, unexpected spike following a popular promotion, and Kafka with tiered storage can easily handle the sudden increase in incoming data.
- Unlock new use cases: By eliminating traditional storage limitations, organizations gain the flexibility to support a wide range of applications and workflows, including scenarios where Apache Kafka was once deemed impractical or too expensive. For example, training AI/ML models with real-time and historical data from a single source of truth, Kafka.
How it works
Tiered storage operates seamlessly for both Apache Kafka producers and consumers. When tiered storage is enabled, data produced to a topic is initially stored on the local disk of the Kafka broker. Data is transferred to remote storage asynchronously, as configured by the Kafka admin, and does not interfere with the producer activity. When consumers fetch records stored in remote storage, there is a slight delay introduced when the Apache Kafka broker gets the records from remote storage. However, the broker downloads and caches these records locally, enabling faster access in subsequent retrievals.
Pricing & Availability
Tiered storage pricing starts from $0.09 per GB/month, depending on the precise object storage system used. Costs are determined by the amount of remote storage used, measured in GB/hour, and will be confirmed when enabling tiered storage in the Aiven console. For a detailed explanation of tiered storage pricing, please check the documentation. Tiered storage supports cloud storage services from all three hyperscalers – Amazon S3, Google Cloud Storage, and Azure Blob Storage.
Please contact Aiven Support to request that tiered storage be made available to you at the project level.
Read more about Tiered Storage in our blog article Introducing Tiered Storage for Aiven for Apache Kafka®: Unlocking Improved Cost-Efficiency and Scalability and see how to start using it in A guide to Apache Kafka® tiered storage with Aiven and Terraform.
Subscribe to the Aiven newsletter
All things open source, plus our product updates and news in a monthly newsletter.
Related resources
Dec 7, 2023
Tiered Storage for Apache Kafka® delivers an improved cost-to-performance ratio, increased operational flexibility, and greater scalability for customers
Feb 20, 2024
A guide to what tiered storage is, and how you can start using it with Aiven for Apache Kafka® and Terraform. We’ll set up a cluster, load the data and observe the metrics.
Dec 7, 2022
When making Apache Kafka the backbone of your data infra, get an end-to-end solution. Find out more about Aiven’s open source ecosystem around Apache Kafka®.