Aiven Blog

Jun 22, 2021

Introducing Aiven for PostgreSQL® change data capture

(Updated) How does PostgreSQL capture changes in data? Read this post for a look at the methods available.

John Hammink

Developer Advocate at Aiven

Francesco Tisiot

Field CTO & Head of Developer Experience at Aiven

Auri Poso

Technical Copywriter at Aiven

Change happens all the time. Even in your data. (Yes, it’s shocking, isn’t it?) So what happens when PostgreSQL data changes within a table and requires prompt action?

Most of the time, data’s journey through your company’s systems starts and ends with its initial store. Updates to this canonical store are often done in bulk; in fact, if we’re talking about ETL (Extract, Transform, and Load), these updates may run only periodically (for example, daily or weekly) and, as such, are always behind the present moment.
What if the information is needed in a downstream system immediately, so that action can be taken in a timely fashion?

A few years ago, periodic bulk updates were the only option for copying an entire database. Fortunately, change happens and progress occurs, and you can now duplicate only the changes during a synchronization operation.

Transferring only the delta (that is, data that has changed) is much more efficient than mirroring or cloning an entire table. What’s more, this change can readily be rolled back to restore the datastore to an earlier state.
In this article, we’ll take a look at how Aiven for PostgreSQL captures changed data.

Built-in change data capture

Change data capture (CDC) refers to a set of software design patterns used to determine (and track) the data that has changed. Actions can then be performed on only the changed data, resulting in great savings in resources and time.
PostgreSQL has built-in functionality for this in the form of a write-ahead log (WAL) that records all the changes that are made. WALs are typically used for data recovery by replaying them to get back to a particular state.

The WAL, as it turns out, is the actual source of truth about the state of a PostgreSQL instance. You can think of the PostgreSQL tables as a queryable cache of the log.

Solving some of the problems: CDC via logical decoding

In PostgreSQL, logical decoding refers to the process of decoding the contents of the WAL into an application-specific form, such as SQL statements.
This lets PostgreSQL keep track of all database changes simply by decoding the WAL, and it supports multiple logical decoding output plugins that can emit the data in different formats. WAL-based CDC is also very efficient: it eliminates the need for double writes by reusing the WAL that PostgreSQL already writes for crash recovery.

By using logical decoding for CDC, you can track all DML changes (`INSERT`, `UPDATE`, `DELETE`) and read only a specific subset of them. In fact, a unit of change is simply a committed row of data.

CDC via logical decoding can easily be turned on by setting the following parameters in `postgresql.conf`:

wal_level = logical
max_replication_slots = 10  # at least one
max_wal_senders = 10        # at least one

Then create a specific replication role:

CREATE ROLE foo REPLICATION LOGIN;
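
With those pieces in place, you can try out logical decoding straight from SQL. The following is a minimal sketch, assuming the wal2json output plugin is installed on the server and using made-up slot and table names: it creates a slot, makes a change, and reads the decoded changes back.

-- create a logical replication slot that uses the wal2json output plugin
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'wal2json');

-- generate a change (table and values are only for illustration)
INSERT INTO mytable (id, description) VALUES (1, 'hello');

-- peek at the pending changes without consuming them...
SELECT * FROM pg_logical_slot_peek_changes('cdc_slot', NULL, NULL);

-- ...or consume them, which advances the slot
SELECT * FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL);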

This sort of approach is supported by many PostgreSQL drivers (including JDBC and Python’s psycopg2), and its use extends beyond CDC to replication and auditing.
That being said, it has some notable shortcomings. For one thing, CDC via logical decoding cannot replicate DDL changes such as table column additions. There are workarounds for this, like setting up event triggers to have your replication system run the associated DDL, but PostgreSQL doesn’t natively support this yet.

In addition, the available output plugins do not support all data types. Failures aren’t handled gracefully either: if a node is lost, its replication slots are not transferred over to the standby during failover.

A final limitation is that the changes in a given stream concern only a single logical database, but you can have as many parallel streams going on as you wish.

Let’s look at some variants of this approach.

CDC with pg_recvlogical and wal2json

Developed by Euler Taveira de Oliveira and hosted on GitHub, wal2json works by decoding logical changes into JSON format. While limited to the data types that JSON normally handles, wal2json is supported by Aiven. It also works with Debezium, the Apache Kafka-based CDC tooling we’ll discuss later.

pg_recvlogical is a command-line tool for receiving a stream of logical changes. When pg_recvlogical is configured together with wal2json, all logical changes can be received and written to a file. This is great for simple use cases, but there is one flaw: a single receiver writes all the changes to a single file. If that one file is lost, you lose your entire log of changes.
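
As an illustration only (the database, slot and file names below are placeholders), a typical pg_recvlogical session creates a slot that uses wal2json and then streams every change into a local file:

# create a replication slot that decodes changes with wal2json
$ pg_recvlogical -d mydb --slot cdc_slot --create-slot --plugin wal2json

# stream every change from that slot into a single local file
$ pg_recvlogical -d mydb --slot cdc_slot --start -f changes.json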

Feeding changes in data directly to an application

The simplest CDC approach is probably to push the stream of changes straight to the target application that needs to read it. All we need to do is write all PostgreSQL logical changes directly to the application itself. The data can be transformed in transit, and it’s possible to act on it in real time.

The upsides of this approach are:

  1. Minimal performance impact on the master database.
  2. No changes required to the application or the database tables.

For many, this direct log-based CDC is the approach of choice.

However, having a single receiver of data can represent a vulnerability: if the node hosting your application goes down, so does your capacity to process that change data. Even more critically, when multiple applications need to read the same changes, we could hit problems such as resource locks or performance hits resulting from simultaneous reads of a single WAL. What then?

CDC using a streaming platform

Instead of having several applications directly consume data from PostgreSQL, it’s better to decouple the change ingestion and delivery phases by using a streaming platform to broker the change data. We are, of course, talking about Apache Kafka.

As mentioned above, when every consuming application connects to the PostgreSQL instance separately, there’s likely to be a performance hit on the PostgreSQL instance. Furthermore, there may be issues when multiple applications attempt to access PostgreSQL’s WAL.

Contrast that with using Apache Kafka to serve data to the consumer applications. In this scenario, Apache Kafka producers send changes to Kafka, where decoupled consumer applications can consume those changes at their own pace.
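
As a rough sketch of the consuming side, any Kafka consumer can then read the change events at its own pace; here the stock console consumer is pointed at a hypothetical change topic (the broker address and topic name are placeholders):

$ kafka-console-consumer.sh \
    --bootstrap-server kafka.example.com:9092 \
    --topic pg_changes.public.mytable \
    --from-beginning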

This solves the single-point-of-failure problem of the previous approaches. In general, a streaming platform allows an arbitrary number of readers to access the same data and enjoy easy post-processing. Also, adding more consumers will not add any load on the source PostgreSQL database, enabling this solution to scale as needed.

But everything comes at a cost. You’ve now added another distributed system to the mix, which will increase the complexity of your overall architecture. Be sure that you require this sort of functionality before going this route.

Apache Kafka is reliable, scalable and meant for streaming data. And it works even better for CDC when specifically enhanced with CDC in mind.

Introducing Debezium

Debezium is such an enhancement. It’s an abstraction layer, built atop Kafka, that allows your applications or messaging services to respond directly to inserts, updates, and deletes (DML) from your PostgreSQL instance. Debezium acts like a monitoring daemon, scanning the target database for row-level changes and then streaming them in the order they were committed.

Debezium’s connectors are built on Kafka Connect and cover a variety of databases, including MySQL, MongoDB, PostgreSQL, Oracle, SQL Server and Cassandra.
Debezium uses logical replication to replicate a stream of changes to a Kafka topic. It can run custom transformation code on the data as it is received, and supports PostgreSQL’s native pgoutput plugin, the protobuf output plugin, or wal2json, which we discussed earlier.
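
To make this concrete, a Debezium PostgreSQL source connector is typically registered through the Kafka Connect REST API. The following is only a sketch: the hostnames, credentials, server name and table list are placeholders, and the exact property names can vary between Debezium versions.

$ curl -X POST http://connect.example.com:8083/connectors \
    -H "Content-Type: application/json" \
    -d '{
      "name": "pg-cdc-connector",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "wal2json",
        "database.hostname": "pg.example.com",
        "database.port": "5432",
        "database.user": "foo",
        "database.password": "secret",
        "database.dbname": "mydb",
        "database.server.name": "pg_changes",
        "table.include.list": "public.mytable"
      }
    }'

With a configuration along these lines, changes to public.mytable are written to a topic named after the server name, schema and table (pg_changes.public.mytable in this example).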

Debezium frequently forms the basis for Kafka-centric architectures. In such cases you don’t need to know or care ahead of time how, when or by whom the data will be consumed. Once the data is in Kafka we can rely on Kafka Connect connectors to distribute it to multiple downstream systems or stores. And, of course, it works with PostgreSQL changes in real time.

Some Debezium and PostgreSQL gotchas

It’s important to note that when a PostgreSQL master failover occurs, the replication slot disappears, so its state needs to be recreated on the new master. Also, if your topics are not pre-created, they will get the cleanup policy DELETE rather than COMPACT, which affects how the change data is retained.
Furthermore, if you want to propagate both UPDATE and DELETE changes, you’ll need to set REPLICA IDENTITY FULL on the tables involved.
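
This is a per-table setting; for example (the table name is just a placeholder):

ALTER TABLE public.mytable REPLICA IDENTITY FULL;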

Wrapping up

Today, changed data can easily be captured in PostgreSQL using logical decoding and direct replication, but the approach isn’t perfect: we outlined its limits, especially in cases where multiple downstream applications require the same changes. For a more robust solution, you can use Apache Kafka and Debezium to decouple the capture of PostgreSQL database changes from their replication to a huge variety (and number) of downstream systems.

Next steps

Your next step could be to check out Aiven for PostgreSQL.

If you're not using Aiven services yet, go ahead and sign up now for your free trial at https://console.aiven.io/signup!

In the meantime, make sure you follow our changelog and blog RSS feeds or our LinkedIn and Twitter accounts to stay up-to-date with product and feature-related news.

Sources

  1. Change data capture pipeline from PostgreSQL to Kafka

  2. CDC for a Brave New World: Hannu Valtonen presents at PostgresConf US 2018

  3. Debezium.io

