Jul 31, 2025
Understanding Apache Kafka® Performance: Diskless Topics Deep Dive
Diskless Kafka topics store data directly to object storage for massive cost savings and unlimited retention, but require higher latency tolerance.
Jorge Quilcate
Software engineer passionate about open source and event-driven distributed systems. Currently working as a Software Engineer at Aiven and an open-source contributor to Apache Kafka.
Hugh Evans
Developer advocate and community manager with a particular interest in data and AI.
TL;DR
Diskless topics reward high-throughput workloads with large batches but can struggle with low-throughput patterns.
Note: This analysis is based on testing with Diskless Kafka 4.0.0-rc15. Diskless topics are available for you to start experimenting with via the Inkless fork but the feature is still in development, and performance characteristics may change significantly as the technology matures.
If you're:
- Operating Apache Kafka® clusters and evaluating storage options
- A platform engineer considering object storage-backed streaming
- Generally curious about next-generation Kafka architectures
This post is for you!
We've all been there: stakeholders want sub-second end-to-end latency AND guaranteed durability with infinite retention. Traditional Kafka makes you choose between fast writes with potential data loss and rock-solid durability with higher latencies.
Diskless Kafka topics represent a different approach by storing data directly to object stores like S3 while maintaining strong consistency guarantees. But as with any new technology, understanding the performance characteristics is crucial before making architectural decisions.
By storing data directly to object storage, Diskless Kafka topics deliver extreme nine-nines durability and can reduce the total cost of ownership of Kafka by 80%, but they come with the tradeoff of additional latency.
How Diskless Topics Work
Diskless topics flip traditional Kafka architecture on its head. Instead of writing to local disk and replicating across brokers, they write directly to object storage and use a batch coordinator (currently PostgreSQL®) for metadata. The design focuses on strong-durability use cases aligned with the producer setting acks=all, where writes are only acknowledged once data is stored durably across multiple machines.
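To make that concrete, here is a minimal producer sketch aimed at a diskless topic. Only acks=all follows from the design described above; the bootstrap address, topic name, and the assumption that the topic has already been created with diskless storage enabled are placeholders for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DisklessProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Diskless topics target strong durability: the broker only acknowledges
        // once the batch has been stored durably in object storage.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "clickstream-diskless" is a hypothetical topic assumed to be
            // configured as a diskless topic on the cluster.
            producer.send(new ProducerRecord<>("clickstream-diskless", "user-42", "page_view"));
            producer.flush();
        }
    }
}
```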
The theoretical benefits are compelling:
- Infinite retention without disk management complexity
- Strong durability with data immediately persisted to object storage
- Simplified operations with reduced local storage requirements
But there's a fundamental tradeoff: Diskless topics put a floor on latency, because every write must wait for batching to complete and for object storage to persist the messages.
Where Those Milliseconds Go
Understanding Diskless performance starts with understanding where time gets spent. Every write request follows this path:
- WAL buffering wait time: 0-250ms (the linger time, default 250ms or until the 4-6MB batch size is reached, whichever comes first)
- Object storage upload: typically 120-200ms average to S3, with worst case writes taking 400-500ms
- Metadata commit: 10-20ms with low partition counts, up to 200ms with 1000+ batched partitions
The WAL buffering wait time is perhaps the most critical component to understand. WAL stands for Write-Ahead Log, a common database pattern where changes are first written to a sequential log before being applied to the main data store. In Diskless topics, incoming messages are buffered in WAL files until one of two conditions is met: either the buffer reaches the configured size (typically 4-6MB in our testing), or the configurable linger time passes since the first message in the buffer arrived (250ms by default; this is a latency vs. throughput tradeoff that can be tuned based on your requirements).
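Here is a simplified sketch of that rotation rule. The class and field names are illustrative, not the actual Inkless implementation; only the two triggers (the size threshold and the linger timeout) and the default values come from the behaviour described above.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative model of the WAL buffer rotation rule, not the real implementation. */
class WalBuffer {
    // Assumed defaults matching the figures discussed in this post.
    private static final long MAX_BUFFER_BYTES = 6 * 1024 * 1024; // ~6MB size threshold
    private static final Duration LINGER = Duration.ofMillis(250); // default linger time

    private long bufferedBytes = 0;
    private Instant firstMessageAt = null;

    void append(int messageBytes) {
        if (firstMessageAt == null) {
            firstMessageAt = Instant.now();
        }
        bufferedBytes += messageBytes;
    }

    /** The buffer is rotated (uploaded to object storage) when either trigger fires. */
    boolean shouldRotate(Instant now) {
        if (firstMessageAt == null) {
            return false; // nothing buffered yet
        }
        boolean sizeReached = bufferedBytes >= MAX_BUFFER_BYTES;
        boolean lingerExpired = Duration.between(firstMessageAt, now).compareTo(LINGER) >= 0;
        return sizeReached || lingerExpired;
    }
}
```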
This means your latency depends heavily on your throughput patterns. High-throughput workloads that quickly fill the buffer experience minimal wait time. Low-throughput workloads that trickle in messages will always hit the linger timeout, adding up to a quarter-second to every request. There's no middle ground here: you're either filling buffers quickly or waiting for the timer.
Combined server-side latencies:
- Worst case (low throughput): 250ms wait + 200ms upload + 200ms commit = 650ms
- Best case (high throughput): 0ms wait + 200ms upload + 10ms commit = 210ms
Add producer-side latencies (batching time, queue time), and you reach the expected 500ms P50, 2s P99 end-to-end latencies that we observed in our performance testing below.
Performance Testing Results
The performance characteristics of Diskless topics become clear through systematic testing. Using Kafka performance tools and the OpenMessaging Benchmark (OMB) framework with AWS S3 as object storage and PostgreSQL® as the batch coordinator, several key patterns emerge.
Single Producer Challenges
Starting with the simplest scenario—a single producer writing to a single partition with default settings—performance limitations become immediately apparent. Standard producer configuration (16KB batch size, 0ms linger, 5 max in-flight requests) creates a fundamentally problematic pattern for diskless architecture.
The root issue stems from the mismatch between Kafka's default batching behavior and S3's performance characteristics. Small 16KB batches result in frequent, tiny uploads to S3—precisely the opposite of what object storage optimizes for. Compounding this problem, while the broker can queue up to 5 requests, it processes S3 uploads sequentially. Each upload must complete before the next begins, causing queue times to accumulate rapidly. Under these conditions, producers achieved only ~3 requests per second while queue times grew until requests began timing out.
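For reference, the default-style settings behind this scenario look like this in producer-properties form (values taken from the defaults mentioned above):

batch.size=16384
linger.ms=0
max.in.flight.requests.per.connection=5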
Thread Pool Implementation
We've addressed some of these bottlenecks by introducing a thread pool for S3 uploads, giving each broker an upper bound on concurrent uploads. However, S3 PUT performance still requires deeper investigation to fully understand and resolve the remaining performance constraints.
Our observations reveal that batches exceeding 8MB trigger significant high-percentile latency spikes—some reaching 5 seconds. The root cause remains unclear and could involve throttling, retries, or timeout behaviors. Currently, the only reliable workaround is reducing buffer sizes: 4MB batches deliver predictable 100-300ms performance, while 6MB batches perform well but occasionally spike to ~1 second at high percentiles.
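A minimal sketch of the thread-pool idea is shown below, using a fixed-size pool to cap concurrent uploads per broker. The class names, pool size, and upload call are placeholders for illustration, not the actual Inkless code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative sketch: bound the number of concurrent WAL uploads per broker. */
class WalUploader {
    // Upper bound on concurrent S3 uploads for this broker (the value is an assumption).
    private final ExecutorService uploadPool = Executors.newFixedThreadPool(8);

    /** Submit a buffered WAL segment for upload instead of processing uploads one at a time. */
    Future<?> uploadAsync(byte[] walSegment) {
        return uploadPool.submit(() -> putToObjectStorage(walSegment));
    }

    private void putToObjectStorage(byte[] walSegment) {
        // Placeholder for the actual S3 PUT of the WAL segment.
    }
}
```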
Scaling with Partitions
Single partitions hit performance walls quickly. Even with optimal tuning, you're limited by batch size constraints and the fundamental need to wait for either buffer fills or timeout conditions. Adding more partitions unlocks better performance by allowing producers to distribute load and build larger requests.
Test Infrastructure
Our partition scaling tests used a single broker configuration to isolate partition-level behavior from multi-broker complexity. The test environment consisted of:
- Single Kafka broker: 4 vCPU, 16GB RAM
- Single producer: Writing across all partitions in round-robin fashion
- PostgreSQL batch coordinator: 2 vCPU, 8GB RAM initially
With 12 partitions and the single producer configured with max.request.size=8000000 (8MB), we achieved 24MB/sec throughput. This translates to approximately 3 requests of 8MB each being processed per second, a significant improvement over single-partition performance.
Persistent Timing Challenges
However, broker logs revealed that even with this optimization, timing challenges persist:
Rotating active file after PT0.250654595S with buffer size 5999250 bytes
Rotating active file after PT0.000692899S with buffer size 6999125 bytes
Rotating active file after PT0.000685513S with buffer size 6999125 bytes
Rotating active file after PT0.250632278S with buffer size 5999250 bytes
The pattern shows that some requests still hit the 250ms timeout instead of the size threshold, creating P99 latency spikes. Even with 12 partitions allowing for larger request building, the fundamental batching behavior still creates variability in response times. Requests that build to size quickly (0.0006s rotation time) perform well, while those that timeout (0.25s rotation time) drive up high-percentile latencies.
This behavior persists even with optimized configurations, highlighting that partition scaling improves average performance but doesn't eliminate the inherent latency variability of the diskless architecture in single-producer deployments.
Multiple Producer Approaches
The most predictable performance came from scaling producers rather than over-optimizing single instances. Testing with 90 producers across 90 topics with 64 partitions each delivered consistent results.
With minimal tuning (simply setting linger.ms=100 and max.in.flight.requests.per.connection=1), this approach provided steady throughput without the latency spikes seen in single-producer scenarios. Each producer could build full 1MB requests naturally, distributing load evenly across the system.
The Batch Coordinator Bottleneck
The batch coordinator plays a critical role in diskless topics by managing metadata for each batch written to object storage. On every write operation, the coordinator must record batch locations and partition offsets, and coordinate consistency across the distributed system. As partition counts increase, the coordinator processes correspondingly more metadata operations, since each partition generates its own batch metadata that must be atomically committed.
This explains why PostgreSQL performance becomes critical as the number of partitions increases. With 500+ partitions, commit times jump from 10-20ms to 100-200ms as the coordinator handles significantly more concurrent metadata operations. With thousands of partitions, the coordinator becomes a clear bottleneck requiring substantial scaling.
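As an illustration of what the coordinator has to persist on every write, here is a hypothetical shape of the per-batch metadata. The field names are assumptions based on the description above, not the actual Inkless schema.

```java
/** Hypothetical per-batch metadata committed by the batch coordinator.
 *  Field names are illustrative, not the actual Inkless schema. */
record BatchMetadata(
        String topic,
        int partition,
        String objectKey,   // WAL file in object storage holding this batch
        long byteOffset,    // where the batch starts inside that file
        int byteLength,     // how many bytes the batch spans
        long baseOffset,    // first Kafka offset assigned to the batch
        int recordCount) {
}
```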
Optimization Opportunities
Our testing revealed that moving from 2 to 4 CPUs on the PostgreSQL instance was essential when scaling to 90 producers with thousands of partitions. However, this represents baseline scaling rather than optimized performance. Initial analysis suggests significant optimization opportunities remain in our PostgreSQL configuration and query patterns—areas we're actively investigating to reduce coordinator overhead.
The coordinator workload has specific characteristics that may benefit from PostgreSQL tuning around batch operations, connection pooling, and write-heavy workloads. As we optimize these configurations, we expect to see meaningful improvements in high partition count scenarios.
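As a starting point, these are the kinds of postgresql.conf parameters that typically matter for write-heavy, batch-oriented workloads. The values below are illustrative defaults to experiment with, not recommendations derived from our testing.

# Illustrative starting points, not tested recommendations
shared_buffers = 4GB                    # give the coordinator's working set room in memory
max_connections = 200                   # pair with client-side connection pooling
checkpoint_completion_target = 0.9      # spread checkpoint I/O for write-heavy loads
wal_buffers = 64MB                      # absorb bursts of metadata commits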
The takeaway here is that multiple smaller producers often outperform heavily optimized single producers in Diskless scenarios. This pattern also holds true for traditional Kafka, but the effect is more pronounced with Diskless topics due to the way batching and object storage uploads interact.
Current Recommendations
For production deployments, the batch coordinator isn't just a metadata store—it's a critical component that directly impacts overall system performance and requires appropriate sizing:
- Low partition counts (< 500): Standard PostgreSQL configurations are typically sufficient
- High partition counts (500+): Plan for dedicated PostgreSQL scaling and optimization
- Very high partition counts (1000+): Expect coordinator optimization to be an ongoing operational focus
We're continuing to refine coordinator performance, and future releases should see improvements in this area as the technology matures.
Read Performance Characteristics
Read latency follows a different pattern:
- Find batches: look up the paths and byte ranges of the WAL files that hold the requested data (~10ms)
- Fetch batches: retrieve the data from those files in object storage (~100ms on average)
These operations are constrained by the fetch.max.bytes and max.partition.fetch.bytes settings. Finding batches has been optimized to query all batches from the same request simultaneously, so larger requests improve throughput.
We are still developing support for fetch.min.bytes and fetch.max.wait.ms to make it easier to optimize fetch batching.
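For example, a consumer that wants larger fetch responses per round trip might raise these limits. The values here are illustrative, not tuned recommendations from our testing (52428800 is the standard client default for fetch.max.bytes; the per-partition limit is simply raised to 8MB for the example):

fetch.max.bytes=52428800
max.partition.fetch.bytes=8388608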
Quick Tuning Guide
All Diskless Producers
Essential settings that you should apply to every diskless producer configuration:
linger.ms=100
max.in.flight.requests.per.connection=1
Single Producer Setup
For single producer scenarios, add these additional settings:
batch.size=1000000
max.request.size=8000000
Partition Strategy: Use 8-12 partitions to optimize performance while avoiding coordinator bottlenecks.
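Putting the single-producer guidance together, a configuration might look like the sketch below. The bootstrap address is a placeholder, and acks=all reflects the diskless durability model described earlier; the remaining values come straight from the settings above.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class TunedDisklessProducer {
    public static KafkaProducer<byte[], byte[]> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // placeholder address
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());
        props.put("acks", "all"); // diskless durability model
        // Essential settings for every diskless producer
        props.put("linger.ms", "100");
        props.put("max.in.flight.requests.per.connection", "1");
        // Additional settings for the single-producer scenario
        props.put("batch.size", "1000000");       // ~1MB batches
        props.put("max.request.size", "8000000"); // ~8MB requests
        return new KafkaProducer<>(props);
    }
}
```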
Multiple Producer Setup
Use multiple smaller producers with minimal configuration instead of over-tuning single instances. In this case, the essential settings above (linger.ms=100 and max.in.flight.requests.per.connection=1) are sufficient; no further producer tuning is required.
Partition Strategy: Use 64+ partitions per producer and scale PostgreSQL CPU (2→4 cores) when exceeding 500 total partitions.
The Bottom Line
Diskless topics represent a significant evolution in Kafka architecture, solving real problems around cost effective durability and retention that many organizations face. The technology delivers on its core promises but introduces performance characteristics that require different approaches to producer tuning and system design.
The architecture works within its constraints, but those constraints are meaningful. Organizations that can work within the latency profile and have use cases benefiting from the durability model will find compelling advantages. However, Diskless topics aren't drop-in replacements for traditional topics—they're solving different problems with different tradeoffs.
Understanding these tradeoffs upfront enables better architectural decisions and sets appropriate performance expectations for both engineering teams and stakeholders.
The streaming storage landscape continues evolving rapidly. As these technologies mature, sharing real-world performance data and operational experiences helps the entire community make better informed decisions about when and how to adopt new approaches.