Oct 2, 2025
OpsHelm goes multi-cloud with Aiven Diskless BYOC, cuts costs by 78% over MSK
OpsHelm cuts streaming costs by 5x migrating from MSK and NATS to Aiven Diskless Kafka in under a month
Filip Yonov
Head of Streaming Services
TL;DR
In under a month, OpsHelm, the continuous, enriched changelog for cloud infrastructure, migrated its streaming backbone from MSK and NATS to Aiven Diskless Kafka (BYOC on AWS). The switch eliminated cross-cloud networking fees, collapsed multiple storage layers into one, and cut total streaming costs by 5x (from >$50,000/year to <$10,000/year), while giving the team a single logical event bus that stretches across multiple regions and accounts.
Case Study
Think of OpsHelm as the "git log + grep" for cloud infrastructure. The platform continuously gathers raw change events from every major cloud provider, normalises them, enriches each event with context such as resource tags, policy owners, and live cost, then streams the resulting changelog to:
- Incident response tools – so responders see who changed what seconds before an alert fired
- Security engines – to flag mis‑configurations the moment they land
- FinOps dashboards – to track spend‑impacting changes in near‑real‑time
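The normalise-then-enrich step described above can be sketched roughly as follows. This is an illustrative sketch only, not OpsHelm's code: the `ChangeEvent` schema, the raw field names, and the `team` tag used to look up an owner are all assumptions made for the example, and cost enrichment is left out for brevity.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Provider-neutral change record (hypothetical schema)."""
    provider: str
    resource_id: str
    action: str
    actor: str
    tags: Dict[str, str] = field(default_factory=dict)
    owner: Optional[str] = None


def normalise(provider: str, raw: Dict[str, Any]) -> ChangeEvent:
    """Map a provider-specific payload onto the shared schema.

    The raw field names are simplified placeholders, not real provider formats.
    """
    if provider == "aws":
        return ChangeEvent("aws", raw["resourceArn"], raw["eventName"], raw["userIdentity"])
    if provider == "gcp":
        return ChangeEvent("gcp", raw["resourceName"], raw["methodName"], raw["principalEmail"])
    raise ValueError(f"unsupported provider: {provider}")


def enrich(
    event: ChangeEvent,
    tags_by_resource: Dict[str, Dict[str, str]],
    owner_by_team: Dict[str, str],
) -> ChangeEvent:
    """Attach resource tags and an owner looked up via the (assumed) 'team' tag."""
    tags = tags_by_resource.get(event.resource_id, {})
    owner = owner_by_team.get(tags.get("team", ""))
    return replace(event, tags=tags, owner=owner)
```

Keeping the event immutable and returning a new copy from `enrich` means each stage of the pipeline can be tested on its own.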
Like many other companies, OpsHelm powers all of its infrastructure with real‑time data streaming. The platform processes hundreds of TBs of compressed cloud change events daily, and streaming infrastructure is built into the OpsHelm product itself. This isn't a background admin pipeline; in essence, OpsHelm is a giant log.
Periodic inventory scans miss transient resources and can take hours on large estates. OpsHelm's value proposition fully depends on sub‑second ingestion and fan‑out to keep every downstream system in sync.
Phase 1 – AWS MSK (managed Kafka) - The Expensive Start
OpsHelm launched on Apache Kafka via MSK because the team had already onboarded to AWS. While the cluster handled early volumes, costs quickly spiraled out of control. More events meant larger brokers, higher cross‑AZ traffic, and expensive EBS volumes. Multi‑cloud operations required bridging data into Azure and GCP with costly connectors and cross‑cloud traffic fees. Total annual spend: >$50,000/year, with a 3x increase projected for the subsequent quarters.
Phase 2 – NATS + JetStream (for a lighter & cheaper MSK)
Hoping to escape AWS's cost spiral, OpsHelm re‑platformed onto NATS + JetStream:
Expectation | Reality |
---|---|
Smaller, cheaper servers | Monthly costs remained high once storage & egress were factored in. Internet egress is charged at >9¢ a GB |
Simple, in‑memory core | Multiple incidents: publisher back‑pressure issues, constantly stalling consumers, and frequent data loss on replica fail‑over |
Horizontal elasticity | Painful scaling; subject shard juggling during usage spikes across hundreds of topics |
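To see how a per-GB egress charge adds up, here is a small back-of-the-envelope sketch. The 9¢/GB default is the lower bound quoted in the table above; the function names and any traffic volume you pass in are hypothetical and are not OpsHelm's figures.

```python
def egress_cost_usd(gb_transferred: float, usd_per_gb: float = 0.09) -> float:
    """Cost of moving `gb_transferred` GB out to the internet at a flat rate."""
    if gb_transferred < 0 or usd_per_gb < 0:
        raise ValueError("volume and rate must be non-negative")
    return gb_transferred * usd_per_gb


def monthly_egress_cost_usd(gb_per_day: float, days: int = 30, usd_per_gb: float = 0.09) -> float:
    """Same flat-rate model, scaled from a daily volume to a billing month."""
    return egress_cost_usd(gb_per_day * days, usd_per_gb)
```

Because the charge scales linearly with bytes moved, any growth in event volume flows straight through to the bill.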
The third data loss incident made the team hit pause: JetStream's leader failover caused message sequence resets and incomplete data replay, leaving gaps in their critical audit trail. These data loss events were unacceptable for a compliance-focused platform that needed to guarantee complete change visibility. They needed a backbone that was cost‑predictable, durable at petabyte scale, and natively compatible with both Kafka and REST.
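Gaps and sequence resets of the kind described here can be checked for mechanically. The sketch below is a generic illustration, assuming each message carries a monotonically increasing integer sequence number; the function names are hypothetical and this is not a JetStream or Kafka API.

```python
from typing import Iterable, List, Tuple


def find_sequence_gaps(sequences: Iterable[int]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) ranges of sequence numbers that never arrived."""
    gaps: List[Tuple[int, int]] = []
    previous = None
    for seq in sorted(set(sequences)):
        if previous is not None and seq > previous + 1:
            gaps.append((previous + 1, seq - 1))
        previous = seq
    return gaps


def find_sequence_resets(sequences: Iterable[int]) -> List[int]:
    """Return positions (in arrival order) where the sequence went backwards or repeated."""
    resets: List[int] = []
    previous = None
    for index, seq in enumerate(sequences):
        if previous is not None and seq <= previous:
            resets.append(index)
        previous = seq
    return resets
```

Running a check like this against a replayed stream is one way to confirm that an audit trail is complete before trusting it.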
Phase 3 – The Search for "The Final Bus"
The team knew they needed to return to Apache Kafka, but the question was how to get there without recreating the operational complexity and the cost curve they were trying to escape. Traditional managed Kafka offerings either locked them into proprietary APIs or still required significant cluster tuning and capacity planning.
Aiven's BYOC approach offered the best of both worlds: enterprise-grade Kafka management running in OpsHelm's own AWS infrastructure. This meant no vendor lock-in, full data sovereignty, and transparent cost allocation, while eliminating the operational burden of managing brokers, monitoring, and scaling decisions that had consumed engineering resources with both their MSK and NATS deployments.
The pain points that forced the change
- Run‑away cost – egress, NAT Gateway, and dual storage (local SSD plus S3) kept the monthly bill at 100% above projections.
- Reliability incidents – message drops during replica lag, sequence resets after node restarts, and elusive JetStream bugs eroded trust. System uptime struggled to reach 99%.
- Ceiling on scale – bursts from new Azure subscriptions regularly saturated JetStream mirrors across 100+ topics, requiring manual re‑partitioning.
How Diskless Kafka (BYOC on AWS) fixes all three at scale
OpsHelm's data patterns aligned well with Diskless Kafka's strengths, but the team needed a partner who could deliver production-grade streaming without the operational overhead. Enter Aiven's BYOC model: managed Kafka that runs in OpsHelm's own AWS account, giving them the control and cost transparency they needed while eliminating the day-to-day cluster management that was eating engineering cycles.
The plan was straightforward: leverage Aiven's managed Kafka expertise while using Diskless topics specifically for cost optimization and scale. OpsHelm's workload combines high overall throughput with acceptable latency tolerance, strong durability requirements for audit trails, and variable traffic patterns, all of which made it an ideal candidate for the Diskless approach. Working closely with Aiven's team, they mapped out how to consolidate their sprawling streaming architecture into a single consolidated Diskless cluster. Most importantly, they needed to eliminate the data loss incidents that had plagued their NATS deployment.
Challenge | Diskless approach | Result |
---|---|---|
Cost | No EBS volumes or cross-AZ networking fees; storage writes directly to shared S3 | 5x cost reduction (>$50K to <$10K annually) |
Reliability | Immutable object storage + end‑to‑end checksums; Brokers auto‑heal and rehydrate from S3. | Zero data‑loss incidents since cut‑over vs NATS |
Scalability | Leaderless brokers; partition‑balancing handled by the control plane; virtual clusters isolate tenants. | 10× headroom without re‑architecture |
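The table credits part of the reliability gain to end‑to‑end checksums. The general idea, independent of how Diskless Kafka implements it internally, is that the writer records a checksum alongside the payload and the reader recomputes it before trusting the data. A minimal sketch, using `zlib.crc32` from the Python standard library as a stand-in and hypothetical `seal`/`verify` helper names:

```python
import zlib
from typing import Tuple


def seal(payload: bytes) -> Tuple[bytes, int]:
    """Writer side: return the payload together with its CRC-32 checksum."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, checksum: int) -> bool:
    """Reader side: recompute the checksum and compare it with the stored one."""
    return zlib.crc32(payload) == checksum
```

A mismatch tells the reader the bytes changed somewhere between write and read, so corruption surfaces as an error instead of silently entering the changelog.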
Migration snapshot
Working with Aiven meant that the migration could be methodical rather than heroic. Our team provided guidance on the consolidation strategy: going from 100+ loosely organized NATS topics to 30 well-architected streams that better matched OpsHelm's actual data flows. The BYOC deployment meant OpsHelm retained full control over their infrastructure while getting expert operational support.
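One way to picture a consolidation like this is a small routing table that maps many fine-grained source subjects onto a few streams, keeping the original subject as the record key. The sketch below is purely illustrative: the prefixes, stream names, and `route` helper are invented for the example and do not reflect OpsHelm's actual topic layout.

```python
from typing import Dict, Tuple

# Hypothetical routing table: subject prefix -> consolidated stream name.
ROUTES: Dict[str, str] = {
    "aws.": "changes.aws",
    "azure.": "changes.azure",
    "gcp.": "changes.gcp",
    "aws.iam.": "changes.identity",  # a more specific prefix overrides "aws."
}


def route(subject: str, routes: Dict[str, str] = ROUTES) -> Tuple[str, str]:
    """Return (stream, key) for a subject using longest-prefix matching.

    The original subject is kept as the key so that records from the same
    source subject hash to the same partition and keep their relative order.
    """
    matches = [prefix for prefix in routes if subject.startswith(prefix)]
    if not matches:
        raise KeyError(f"no route for subject: {subject}")
    return routes[max(matches, key=len)], subject
```

Keeping the mapping in data rather than code makes it easy to review which legacy subjects land in which stream before any traffic moves.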
Reference Architecture
Old: MSK clusters plus Data Firehose delivery streams feeding per-cloud JetStream clusters, which managed 100+ topics across 5 c6in.2xlarge instances.
New: Diskless Brokers in every VPC write/consume via a shared S3 bucket; one logical cluster spans all clouds.
Timeline
- Dual‑write bridge (2 weeks)
- Consumer cut‑over (1 week)
- Producer switch‑off & JetStream decom (1 week)
No downtime, no replay gaps. The migration involved 3 engineers over 2 months, leveraging Aiven's BYOC deployment model.
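The dual‑write phase of a migration like this can be sketched as a thin wrapper that publishes every record to both systems. This is a simplified illustration under stated assumptions, not the bridge OpsHelm ran: the `DualWriteBridge` class and the `Sink` callable type are hypothetical, and the sketch assumes the legacy system stays authoritative until consumers cut over.

```python
from typing import Callable, List, Tuple

# A sink is any callable that accepts (key, value); in practice it would wrap
# a real client's publish/produce call. Hypothetical type for this sketch.
Sink = Callable[[bytes, bytes], None]


class DualWriteBridge:
    """Publish each record to the legacy sink first, then mirror it to the target."""

    def __init__(self, legacy: Sink, target: Sink) -> None:
        self.legacy = legacy
        self.target = target
        self.failed: List[Tuple[bytes, str]] = []  # records to re-send later

    def publish(self, key: bytes, value: bytes) -> None:
        # Assumption: legacy is the source of truth, so its errors propagate.
        self.legacy(key, value)
        try:
            self.target(key, value)
        except Exception as exc:  # the mirror must never break production traffic
            self.failed.append((key, repr(exc)))
```

Comparing what each side received, and draining `failed`, gives a concrete signal for when it is safe to move consumers over.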
Results & what's next
- 5x cost reduction: from >$50,000/year to <$10,000/year
- Sub‑second visibility across all regions, though the latency profile shifted to 500ms-2s for durability guarantees
- Zero manual ops: Brokers autoscale with traffic patterns
- Streamlined architecture: consolidated from 100+ topics to 30 optimized streams
- Multi-cloud flexibility: single logical cluster across VPCs without networking fees
Up next on the roadmap are Glacier tiering for infinite retention, customer‑hosted Brokers, and enhanced drift detection.
This architecture is significantly more cost-effective than OpsHelm's MSK solution because the team no longer pays for disks or cross-AZ and cross-cloud networking fees. The migration delivered on its core promises: predictable costs, strong durability, and simplified multi-cloud operations. The latency trade-off (500ms-2s end-to-end) proved acceptable for OpsHelm's audit and compliance use case, where data accuracy and cost efficiency outweigh sub-second delivery requirements.
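For readers who want to translate between an "Nx" reduction and a percentage saved, the arithmetic is simple. The sketch below uses hypothetical helper names and is meant to be fed the rounded bounds quoted in this post (>$50,000 before, <$10,000 after), not exact billing data.

```python
def reduction_factor(before: float, after: float) -> float:
    """How many times smaller the new annual cost is than the old one."""
    if before <= 0 or after <= 0:
        raise ValueError("costs must be positive")
    return before / after


def percent_saved(before: float, after: float) -> float:
    """Share of the old annual cost that no longer has to be paid, in percent."""
    if before <= 0 or after < 0:
        raise ValueError("before must be positive and after non-negative")
    return 100 * (before - after) / before
```

With those bounds, `reduction_factor(50_000, 10_000)` gives a factor of 5 and `percent_saved(50_000, 10_000)` gives 80 percent; since the true "before" is above and the true "after" is below the bounds, the real saving is at least that large.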
Get started
Ready to shrink your streaming bill and eliminate cross‑cloud NAT fees?