Jan 13, 2022
Future PostgreSQL: Improvement to the Replication Protocol
Check out the latest improvement to PostgreSQL replication to which our OSPO team contributed to help solve reliability problems.
Today's post is about a new PostgreSQL feature (coming in PostgreSQL 15)
developed by Aiven to solve a reliability problem we sometimes run into with
replication. The READ_REPLICATION_SLOT
command makes a physical replication
client, particularly pg_receivewal
, more reliable.
At Aiven's OSPO office, we dedicate our time to the open source products Aiven
offers or develops, and we were very happy to collaborate on this feature.
PostgreSQL replication surprises
This story starts with our own PgHoard, a
PITR
backup tool for PostgreSQL. PgHoard offers several methods to archive the WAL
(Write Ahead Log), including
pg_receivewal
,
a small application shipping with PostgreSQL which connects to a PostgreSQL
cluster using the physical replication protocol to stream WAL as they are
produced, optionally keeping track of the position on the server using a
replication slot.
We noticed that we could occasionally lose some WAL when PgHoard is restarted
on another machine. This happens because pg_receivewal
determines its start
position like this:
- Look at what segments are present in the archive folder, and resume from the
latest archived one. - If the archive folder is empty, start from the current server LSN position.
The problem is, in our case, since pg_receivewal
was started on another
server, it didn't have access to the archive folder directly so it resumed from
the current LSN (Log Sequence Number) position.
You can observe this in action by launching pg_receivewal
, stopping it after
some WAL have been archived, then relaunching it with its directory empty:
❯ sudo -u postgres pg_receivewal -D /tmp/wals -v -s pg_receivewal_slot pg_receivewal: starting log streaming at 47/E5000000 (timeline 1) pg_receivewal: finished segment at 47/E6000000 (timeline 1) pg_receivewal: received interrupt signal, exiting pg_receivewal: not renaming "0000000100000047000000E6.partial", segment is not complete pg_receivewal: received interrupt signal, exiting ❯ sudo rm /tmp/wals/* -rf # Wait for some WAL to be generated on the server, and relaunch pg_receivewal ❯ sudo -u postgres pg_receivewal -D /tmp/wals -v -s pg_receivewal_slot pg_receivewal: starting log streaming at 47/E9000000 (timeline 1)
We notice here that the segments 0000000100000047000000E7
and
0000000100000047000000E8
have not been archived at all, jumping from
47/E6000000
to 47/E9000000
What should happen here is resuming from the replication slot's restart_lsn
, which
is kept on the server.
READ_REPLICATION_SLOT
command
New feature: Prior to version 15 of PostgreSQL, the only way to know a replication slot's position
was to query the pg_replication_slots
view, which means regular SQL queries need
to be issued. This is fine for logical replication, as logical replication
connection are bound to a database and are allowed to perform SQL queries. But in
the case of physical replication connections, we do not have this possibility
That meant that a physical replication client had no way to know the state of
it's replication slot, except by opening a separate, non-replication connection,
which is not something we can do with the pg_receivewal
application.
The READ_REPLICATION_SLOT
command aims to fulfill this gap, by allowing a
replication connection to read the current restart_lsn
and timeline associated
to a replication slot:
❯ psql postgres -d "replication=1" psql (15devel) Type "help" for help. postgres=# READ_REPLICATION_SLOT slot1; slot_type | restart_lsn | restart_tli -----------+-------------+------------- physical | 0/1501000 | 1 (1 row)
With that new command, pg_receivewal
can now determine its starting LSN
slightly differently:
- Look at what segments are present in the archive folder, and resume from the
latest archived one. - If the archive folder is empty and a replication slot is used,
issue aREAD_REPLICATION_SLOT
command to read the state from the server. - Otherwise, start streaming from the server's current LSN.
This new command is currently only supported for physical replication slots,
and returns limited information but that could evolve in future if needed.
Already the feature will help with the "surprises" we encountered when using
pg_receivewal
, and will also benefit other replication clients by making them
more reliable. It has been accepted into the project and will be available in
the version 15 release.
Next steps
- Learn more about Aiven for PostgreSQL, or sign up for our free trial.
- Check out the PgHoard project on GitHub.
- Find out what else our OSPO (Open Source Program Office) is up to.
- Follow us on Twitter to keep up with more posts like this.
Stay updated with Aiven
Subscribe for the latest news and insights on open source, Aiven offerings, and more.