Oct 2, 2025

OpsHelm goes multi-cloud with Aiven Diskless BYOC, cuts costs by 78% over MSK

OpsHelm cuts streaming costs by 5x migrating from MSK and NATS to Aiven Diskless Kafka in under a month

Filip Yonov
Head of Streaming Services

TL;DR

In under a month, OpsHelm, the continuous, enriched changelog for cloud infrastructure, migrated its streaming backbone from MSK and NATS to Aiven Diskless Kafka (BYOC on AWS). The switch eliminated cross-cloud networking fees, collapsed multiple storage layers into one, and cut total streaming costs by 5x (from >$50,000/year to <$10,000/year), while giving the team a single logical event bus that stretches across multiple regions and accounts.

Case Study

Think of OpsHelm as the "git log + grep" for cloud infrastructure. The platform continuously gathers raw change events from every major cloud provider, normalises them, enriches each event with context such as resource tags, policy owners, and live cost, then streams the resulting changelog to:

Incident response tools – so responders see who changed what seconds before an alert fired
Security engines – to flag mis‑configurations the moment they land
FinOps dashboards – to track spend‑impacting changes in near‑real‑time

Like many other companies, OpsHelm powers all of its infrastructure with real‑time data streaming. But because the platform processes hundreds of TBs of compressed cloud change events daily, streaming infrastructure is built into the OpsHelm product itself. This isn't a background admin pipeline; in essence, OpsHelm is a giant log.

Periodic inventory scans miss transient resources and can take hours on large estates. OpsHelm's value proposition fully depends on sub‑second ingestion and fan‑out to keep every downstream system in sync.

Phase 1 – AWS MSK (managed Kafka) - The Expensive Start

OpsHelm launched on Apache Kafka via MSK because the team had already onboarded to AWS. While the cluster handled early volumes, costs quickly spiraled out of control. More events meant larger brokers, higher cross‑AZ traffic, and expensive EBS volumes. Multi‑cloud operations required bridging data into BYOC with costly connectors and cross‑cloud traffic fees. Total spend: >$50,000/year, with a 3x increase projected for the subsequent quarters.

Phase 2 – NATS + JetStream (for a lighter & cheaper MSK)

Hoping to escape AWS's cost spiral, OpsHelm re‑platformed onto NATS + JetStream:

Expectation	Reality
Smaller, cheaper servers	Monthly costs remained high once storage & egress were factored in. Internet egress is charged at >9¢ per GB.
Simple, in‑memory core	Multiple incidents: publisher back‑pressure issues, constantly stalling consumers, and frequent data loss on replica fail‑over
Horizontal elasticity	Painful scaling; subject shard juggling during usage spikes across hundreds of topics
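
To see why egress alone can keep a bill high, here is a rough back-of-the-envelope sketch. The $0.09/GB rate is the lower bound quoted in the table above; the monthly volume is a purely hypothetical figure chosen for illustration, not an OpsHelm number.

```python
def monthly_egress_cost(gb_per_month: float, usd_per_gb: float = 0.09) -> float:
    """Estimate the monthly internet egress charge in USD."""
    if gb_per_month < 0 or usd_per_gb < 0:
        raise ValueError("volume and rate must be non-negative")
    return gb_per_month * usd_per_gb


# Hypothetical volume: 10,000 GB leaving the cloud per month.
cost = monthly_egress_cost(10_000)
print(f"${cost:,.2f} per month, ${cost * 12:,.2f} per year")
```

Because the charge scales linearly with volume, any growth in cross-cloud traffic shows up directly on the bill, regardless of how small the servers are.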

After the third data loss incident, in which JetStream's leader failover caused message sequence resets and incomplete data replay and left gaps in the critical audit trail, the team hit pause. Such data loss was unacceptable for a compliance-focused platform that needed to guarantee complete change visibility. They needed a backbone that was cost‑predictable, durable at petabyte scale, and natively compatible with both Kafka and REST.

Phase 3 – The Search for "The Final Bus"

The team knew they needed to return to Apache Kafka, but the question was how to get there without recreating the operational complexity and the cost-curve they were trying to escape. Traditional managed Kafka offerings either locked them into proprietary APIs or still required significant cluster tuning and capacity planning.

Aiven's BYOC approach offered the best of both worlds: enterprise-grade Kafka management running in OpsHelm's own AWS infrastructure. This meant no vendor lock-in, full data sovereignty, and transparent cost allocation, all while eliminating the operational burden of managing brokers, monitoring, and scaling decisions that had consumed engineering resources with both their MSK and NATS deployments.

The pain points that forced the change

Runaway cost – egress, NAT Gateway, and dual storage (local SSD plus S3) kept the monthly bill 100% above projections.

Reliability incidents – message drops during replica lag, sequence resets after node restarts, and elusive JetStream bugs eroded trust. System uptime struggled to reach 99%.

Ceiling on scale – bursts from new Azure subscriptions regularly saturated JetStream mirrors across 100+ topics, requiring manual re‑partitioning.

How Diskless Kafka (BYOC on AWS) fixes all three at scale

OpsHelm's data patterns aligned well with Diskless Kafka's strengths, but the team needed a partner who could deliver production-grade streaming without the operational overhead. Enter Aiven's BYOC model: managed Kafka that runs in OpsHelm's own AWS account, giving them the control and cost transparency they needed while eliminating the day-to-day cluster management that was eating engineering cycles.

The plan was straightforward: leverage Aiven's managed Kafka expertise while using Diskless topics specifically for cost optimization and scale. OpsHelm's workload characteristics (high overall throughput with acceptable latency tolerance, strong durability requirements for audit trails, and variable traffic patterns) made it an ideal candidate for the Diskless approach. Working closely with Aiven's team, they mapped out how to consolidate their sprawling streaming architecture into a single diskless cluster. Most importantly, they needed to eliminate the data loss incidents that had plagued their NATS deployment.

Challenge	Diskless approach	Result
Cost	No EBS volumes or cross-AZ networking fees; storage writes directly to shared S3	5x cost reduction (>$50K to <$10K annually)
Reliability	Immutable object storage + end‑to‑end checksums; Brokers auto‑heal and rehydrate from S3.	Zero data‑loss incidents since cut‑over vs NATS
Scalability	Leaderless brokers; partition‑balancing handled by the control plane; virtual clusters isolate tenants.	10× headroom without re‑architecture
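
The reliability row rests on a simple idea: a checksum is computed when data is written and re-checked when it is read back, so corruption or truncation is detected rather than silently served. This article does not describe the exact checksum scheme Diskless Kafka uses, so the following is only a generic sketch of the technique using SHA-256; the function names `seal` and `verify` and the sample payload are hypothetical.

```python
import hashlib
import hmac
from typing import Tuple


def seal(payload: bytes) -> Tuple[bytes, str]:
    """Return the payload together with a digest computed at write time."""
    return payload, hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, digest: str) -> bool:
    """Recompute the digest at read time and compare it to the stored one."""
    actual = hashlib.sha256(payload).hexdigest()
    return hmac.compare_digest(actual, digest)


batch, digest = seal(b'{"resource": "example", "action": "update"}')
assert verify(batch, digest)           # intact data passes
assert not verify(batch[:-1], digest)  # truncated data is rejected
```

Pairing a check like this with immutable objects means a stored batch either matches the digest it was written with or is flagged, which is the property an audit trail needs.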

Migration snapshot

Working with Aiven meant that the migration could be methodical rather than heroic. Our team provided guidance on the consolidation strategy: going from 100+ loosely organized NATS topics to 30 well-architected streams that better matched OpsHelm's actual data flows. The BYOC deployment meant OpsHelm retained full control over their infrastructure while getting expert operational support.
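
Consolidating topics largely comes down to a routing rule that maps many fine-grained subjects onto fewer streams. The sketch below assumes a hypothetical naming scheme (`<cloud>.<service>.<region>.<kind>`) and hypothetical stream names; OpsHelm's actual subjects and mapping are not described in this article.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def consolidated_stream(subject: str) -> str:
    """Map a fine-grained subject to a coarser stream keyed by kind and cloud."""
    parts = subject.split(".")
    if len(parts) != 4:
        raise ValueError(f"unexpected subject format: {subject!r}")
    cloud, _service, _region, kind = parts
    return f"{kind}.{cloud}"


def plan(subjects: Iterable[str]) -> Dict[str, List[str]]:
    """Group legacy subjects under the stream that will replace them."""
    mapping: Dict[str, List[str]] = defaultdict(list)
    for subject in subjects:
        mapping[consolidated_stream(subject)].append(subject)
    return dict(mapping)


# Hypothetical legacy subjects, for illustration only.
legacy = [
    "aws.ec2.us-east-1.changes",
    "aws.s3.eu-west-1.changes",
    "azure.vm.westeurope.changes",
    "aws.iam.global.audit",
]
for stream, sources in sorted(plan(legacy).items()):
    print(stream, "<-", sources)
```

In a scheme like this, the dimensions dropped from the stream name (service and region here) would typically travel in the record key or headers so consumers can still filter on them.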

Reference Architecture

Old: MSK clusters + Data Firehose delivery streams to JetStream clusters per cloud managing 100+ topics across 5 c6in.2xlarge instances.

New: Diskless Brokers in every VPC write/consume via a shared S3 bucket; one logical cluster spans all clouds.

Timeline

Dual‑write bridge (2 weeks)
Consumer cut‑over (1 week)
Producer switch‑off & JetStream decom (1 week)

No downtime, no replay gaps. The migration involved 3 engineers over 2 months, leveraging Aiven's BYOC deployment model.
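
The dual-write phase in the timeline above can be sketched as follows. This is a minimal illustration of the general pattern, not OpsHelm's or Aiven's actual bridge: `InMemorySink` is a hypothetical stand-in for real NATS and Kafka producer clients, and `DualWriter` and `replay_gaps` are names invented for this example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

Event = Tuple[str, bytes]  # (topic, payload)


@dataclass
class InMemorySink:
    """Hypothetical stand-in for a real producer client."""
    events: List[Event] = field(default_factory=list)

    def publish(self, topic: str, payload: bytes) -> None:
        self.events.append((topic, payload))


@dataclass
class DualWriter:
    """Publish every event to the old bus first, then mirror it to the new one."""
    old: InMemorySink
    new: InMemorySink
    pending: List[Event] = field(default_factory=list)

    def publish(self, topic: str, payload: bytes) -> None:
        self.old.publish(topic, payload)  # old bus stays the source of truth
        try:
            self.new.publish(topic, payload)
        except Exception:
            # Keep the event so it can be retried instead of silently dropped.
            self.pending.append((topic, payload))


def replay_gaps(old: InMemorySink, new: InMemorySink) -> List[Event]:
    """Events present on the old bus but missing from the new one."""
    missing = Counter(old.events) - Counter(new.events)
    return list(missing.elements())


old_bus, new_bus = InMemorySink(), InMemorySink()
writer = DualWriter(old_bus, new_bus)
for i in range(3):
    writer.publish("changes.aws", f"event-{i}".encode())

# Consumers would only be cut over once no gaps remain.
assert replay_gaps(old_bus, new_bus) == []
```

Keeping the old bus authoritative until the gap check comes back empty is what lets consumers move first and producers switch off last, in the order the timeline lists.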

Results & what's next

5x cost reduction from >$50,000/year to <$10,000/year

Sub‑second to low‑second visibility across all regions – the latency profile shifted to 500ms-2s in exchange for durability guarantees

Zero manual ops – Brokers autoscale with traffic patterns

Streamlined architecture – consolidated from 100+ topics to 30 optimized streams

Multi-cloud flexibility – single logical cluster across VPCs without networking fees

Up next on the roadmap are Glacier tiering for infinite retention, customer‑hosted Brokers, and enhanced drift detection.

This architecture is significantly more cost-effective than OpsHelm's MSK solution because it avoids paying for broker disks and cross-AZ networking fees. The migration delivered on its core promises: predictable costs, strong durability, and simplified multi-cloud operations. The latency trade-off (500ms-2s end-to-end) proved acceptable for OpsHelm's audit and compliance use case, where data accuracy and cost efficiency outweigh sub-second delivery requirements.

Get started

Ready to shrink your streaming bill and eliminate cross‑cloud NAT fees? Book a demo
