MirrorMaker 2.0 is a robust data replication utility for Apache Kafka that increases the resilience of Kafka-centric architectures by allowing users to easily and reliably copy data from one cluster to another. It accomplishes this by acting as a consumer and producer for multiple Kafka clusters.
Although Kafka MirrorMaker acted as the default data replication tool for years, it had several limitations that prevented it from being used for business-critical applications. The Kafka community analyzed those limitations, and KIP-382: MirrorMaker 2.0 was proposed in 2019.
We’ll look at the most common use cases for a replication workflow, the specific limitations of the original MirrorMaker, and how MirrorMaker 2.0 leverages the Kafka Connect framework to resolve those issues.
Why replicate your Kafka cluster data?
Data replication between your Kafka clusters adds a layer of flexibility, performance, and reliability to your core data infrastructure, and is especially valuable for companies handling large data volumes. There are many situations where you might want to replicate data among Kafka clusters.
1. Disaster recovery
The best understood and most important of these is disaster recovery. Many businesses now rely on Kafka as a cornerstone of their data infrastructure. Despite Kafka’s maturity and reliability, and even with managed Kafka services from trusted providers, disasters can happen that lead to temporary data unavailability or loss.
Kafka users naturally want to mitigate the risks connected to such incidents. The best way to achieve this is to keep a copy of your data in another Kafka cluster in a different data center. If something goes wrong, you can redirect clients to the standby cluster on the fly, with little or no service interruption.
2. Cloud migration
More and more companies are migrating their Kafka clusters from on-prem installations to the cloud — or between cloud regions or providers. Tools that support cloud migration of data services are valuable because they give you more control over your data. Data replication between Kafka clusters is an excellent choice for low-downtime Kafka cloud migration.
3. Geographic proximity
For many global businesses, it’s common to produce and consume data in geographically distributed locations. Replication lets you keep data close to where it is produced and consumed, which helps ensure minimal latency, lower network costs, and optimal throughput.
4. Data isolation
For legal, compliance, or performance reasons, you may need to isolate some data sets in a separate Kafka cluster. For instance, you can limit the retention period of a topic you’re writing to in one cluster and mirror it to another cluster with longer retention, located in a region where it is compliant to read from.
5. Data analytics
Data analytics often requires aggregating data from geographically distributed Kafka clusters into a single cluster, which can then broadcast that data to other clusters and/or data systems. This is yet another common need for data replication.
Kafka MirrorMaker — an incomplete solution
Considering the above use cases, the original MirrorMaker was impractical to use for data replication in production environments, where continuity is imperative. Here are its top limitations:
- No support for consumer migration between clusters, making disaster recovery difficult. Most importantly, record offsets are not preserved or mapped across clusters.
- Topic partitioning is not mirrored and preservation of record partitions is not guaranteed during replication.
- Topic configurations and ACLs are not replicated.
- Throughput and scalability are significantly limited.
- No support for replication topologies beyond two-cluster active-passive replication: each MirrorMaker instance handles only a single pair of source and target clusters.
Due to these and other shortcomings of the original MirrorMaker design, many companies developed their own Kafka replication tools, such as Brooklin by LinkedIn, uReplicator by Uber, and Confluent Replicator.
Kafka MirrorMaker 2.0 - new and improved
MirrorMaker 2.0 is the open-source alternative, improving on the following:
- Offset mappings between clusters are now preserved, and tooling for nearly transparent consumer migration between clusters is provided. This is the key feature enabling reliable disaster recovery.
- Topic configurations (including partitioning) and ACLs are synchronized from source to target clusters, eliminating the need to do this with external tools.
- Record partitions are preserved during replication, which is critical when records are partitioned by their semantics rather than randomly.
- A single MirrorMaker cluster can now run multiple replication flows, and a built-in mechanism prevents replication cycles. Together, these make it easy to set up complex replication topologies such as active-active and chain replication.
- MirrorMaker performance, reliability, and scalability have been significantly improved by leveraging the well-established Kafka Connect framework.
- Better support for monitoring and operations is offered.
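To make the multi-flow setup concrete, MirrorMaker 2.0 is driven by a single properties file passed to the `connect-mirror-maker.sh` script that ships with Apache Kafka 2.4+. The sketch below is illustrative: the cluster aliases, bootstrap addresses, and topic pattern are placeholders, not values from this article.

```properties
# mm2.properties — minimal sketch; aliases and addresses are placeholders
clusters = primary, backup

primary.bootstrap.servers = primary-kafka:9092
backup.bootstrap.servers = backup-kafka:9092

# Enable a one-way (active-passive) flow from primary to backup,
# replicating all topics.
primary->backup.enabled = true
primary->backup.topics = .*

# For active-active replication, enable the reverse flow as well.
# MirrorMaker 2.0 avoids replication cycles by prefixing replicated
# topics with the source cluster alias (e.g. "orders" from primary
# appears as "primary.orders" on backup).
# backup->primary.enabled = true
# backup->primary.topics = .*
```

You would then start replication with `./bin/connect-mirror-maker.sh mm2.properties`; adding further clusters and `source->target` flows to the same file extends the topology without running extra MirrorMaker instances.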
MirrorMaker 2.0 was added to Apache Kafka in release 2.4.
The future of MirrorMaker
Not everything from the original MirrorMaker 2.0 design has been implemented yet. For example, a cross-cluster exactly-once delivery guarantee is a long-awaited feature planned for a future release.
MirrorMaker 2.0 is a much-needed improvement to Kafka replication that should give administrators greater peace of mind and give companies more control over their data. The Kafka community will continue to develop the Kafka toolset to provide more value to Kafka as a data streaming solution.
We recently announced support for MirrorMaker 2.0 as a service, so you can try it out for free.