Using Kafka Connect JDBC Source: a PostgreSQL® example

Find out how to use Apache Kafka® Connect to update an old app-to-db design to use up-to-date tech tools without disrupting the original solution.

If we go back a few years, the typical data pipeline was an app creating events and pushing them to a backend database. Data was then propagated to downstream applications via dedicated ETL flows at regular intervals, usually daily.

These days, Apache Kafka® has become the default data platform. Apps write events to Kafka, which then distributes them in near-real-time to downstream sinks like databases or cloud storage. Apache Kafka® Connect, a framework to stream data into and out of Apache Kafka, represents a further optimisation that makes the ingestion and propagation of events just a matter of configuration file settings.

What if we're facing an old app-to-database design? How can we bring it up-to-date and include Kafka in the game? Instead of batch exporting from the database at night, we can add Kafka to the existing system. Kafka Connect lets us integrate with an existing system and make use of more modern tech tools, without disrupting the original solution.

One way to do this is to use the Kafka Connect JDBC Connector. This post will walk you through an example of sourcing data from an existing table in PostgreSQL® and populating a Kafka topic with only the changed rows. This is a great approach for many use cases. But when no additional query load on the source system is allowed, you could also make use of change data capture solutions based on tools like Debezium. As we'll see later, Aiven provides Kafka Connect as a managed service for both options, so you can start your connectors without the hassle of managing a dedicated cluster.

This blog post provides an example of the Kafka Connect JDBC Source based on a PostgreSQL database. A more detailed explanation of the connector is provided in our documentation: Create a JDBC source connector from PostgreSQL® to Apache Kafka®.

In our example, we first create a PostgreSQL database to act as backend data storage for our imaginary application. Then we create a Kafka cluster with Kafka Connect and show how any new or modified row in PostgreSQL appears in a Kafka topic.

Creating the PostgreSQL Source system

We'll create the whole setup using the Aiven Command Line Interface. Follow the instructions in its documentation to install the avn command and log in.

Once you've logged in to the Aiven client, we can create a PostgreSQL database with the following avn command in our terminal:

Loading code...

This command creates a PostgreSQL database (that's what -t pg does) named pg-football in the google-europe-west3 region. The selected plan, which determines the amount of resources available and the associated billing, is business-4.

The create command returns immediately: Aiven has received the request and started creating the instance. We can wait for the database to be ready with the following command:

Loading code...

The wait command can be executed against any Aiven instance, and returns only when the service is in the RUNNING state.

Time to Scout Football Players

Now let's create our playground: we are a football scouting agency, checking players all over the world, and our app pushes the relevant data to a PostgreSQL table. Let's log in to PostgreSQL from the terminal:

Loading code...

Our agency doesn't do a great job at scouting: all we are able to capture is the player's name, nationality and an is_retired flag showing their activity status.
We create a simple football_players table containing the above information together with two control columns:

created_at keeping the record's creation time
modified_at for the row's last modification time

These two columns will later be used by the Kafka Connect connector to select the recently changed rows.
Now it's time to create the table from the PostgreSQL client:

Loading code...

The created_at field will work as expected immediately, thanks to the DEFAULT NOW() definition.
The modified_at column, on the other hand, requires a bit more tuning to be usable. We'll need to create a trigger that sets it to the current timestamp whenever a row is updated. The following SQL can be executed from the PostgreSQL client:

Loading code...

The first statement creates the change_modified_at function that will later be used by the modified_at_updates trigger.
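The same pattern can be tried out locally without a PostgreSQL instance. The sketch below is an analogue, not the PostgreSQL SQL from this post: it uses Python's built-in sqlite3 module, an assumed column layout for football_players (types and primary key are guesses), and a SQLite trigger standing in for the change_modified_at function and the modified_at_updates trigger. The player name is made up.

```python
import sqlite3

# SQLite stand-in for the PostgreSQL table. Column types and the primary key
# are assumptions made for this illustration only.
SCHEMA = """
CREATE TABLE football_players (
    name        TEXT PRIMARY KEY,
    nationality TEXT,
    is_retired  INTEGER DEFAULT 0,
    created_at  TEXT DEFAULT (strftime('%Y-%m-%d %H:%M:%f', 'now')),
    modified_at TEXT
);

-- SQLite has no trigger functions, so the trigger body updates the row itself.
-- Listing the data columns keeps the trigger from reacting to its own update.
CREATE TRIGGER modified_at_updates
AFTER UPDATE OF name, nationality, is_retired ON football_players
FOR EACH ROW
BEGIN
    UPDATE football_players
       SET modified_at = strftime('%Y-%m-%d %H:%M:%f', 'now')
     WHERE name = NEW.name;
END;
"""

SELECT_TIMESTAMPS = "SELECT created_at, modified_at FROM football_players"


def demo():
    """Insert one row, update it, and return the timestamps before and after."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO football_players (name, nationality) VALUES (?, ?)",
        ("Example Player", "Italy"),
    )
    before = conn.execute(SELECT_TIMESTAMPS).fetchone()
    conn.execute(
        "UPDATE football_players SET is_retired = 1 WHERE name = ?",
        ("Example Player",),
    )
    after = conn.execute(SELECT_TIMESTAMPS).fetchone()
    conn.close()
    return before, after


if __name__ == "__main__":
    print(demo())
```

After the insert, created_at is filled in and modified_at is still empty; after the update, modified_at is set while created_at is unchanged. That is the behaviour the connector relies on later.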

Football Scouting App at Work

We can now simulate our football scouting app behaviour by manually inserting three rows in the football_players table from the PostgreSQL client with

Loading code...

We can verify that the created_at column is successfully populated in PostgreSQL with

Loading code...

Which will output

Loading code...

Perfect, the app is working when inserting new rows. If only we could have an update to an existing row...

Well, this was somewhat expected: Juventus FC went out of the Champions League and needed new energy in midfield. We can update the relevant row with

Loading code...

We can check that the modified_at is correctly working by issuing the same select * from football_players; statement in the PostgreSQL client and checking the following output

Loading code...

OK, we've recreated the original setup: our football scouting app is correctly storing data in the football_players table. In the old days, the extraction of that data was delegated to an ETL flow running overnight and pushing it to the downstream applications. Now, as per our original aim, we want to include Apache Kafka in the game, so... let's do it!

Creating a Kafka environment

As stated initially, our goal is to base our data pipeline on Apache Kafka without having to change the existing setup. We don't have a Kafka environment available right now, but we can easily create one using Aiven's CLI from the terminal with the following avn command

Loading code...

The command creates an Apache Kafka instance (-t kafka) in google-europe-west3 with the business-4 plan.
Additionally it enables the topic auto-creation (-c kafka.auto_create_topics_enable=true) so our applications can create topics on the fly without forcing us to create them beforehand.
Finally, it enables Kafka Connect (-c kafka_connect=true) on the same Kafka instance. We can use the avn wait command mentioned above to pause until the Kafka cluster is in RUNNING state.

Note that for Kafka instances on startup plans, you'll need to create a standalone Kafka Connect instance. For production systems, we recommend using standalone Kafka Connect anyway, following the separation of concerns principle.

Connecting the dots

The basic building blocks are ready: our source system, represented by the pg-football PostgreSQL database with the football_players table, and the kafka-football Apache Kafka instance are both running. It's now time to connect the two, creating a new event in Kafka every time a row is inserted or modified in PostgreSQL. That can be achieved by creating a Kafka Connect JDBC source connector.

Create a JSON configuration file

Start by creating a JSON configuration file like the following:

Loading code...

Where the important parameters are:

name: the name of the Kafka Connect connector, in our case pg-timestamp-source
connection.url: the connection URL pointing to the PostgreSQL database, in the form of jdbc:postgresql://<HOSTNAME>:<PORT>/<DATABASE>?<ADDITIONAL_PARAMETERS>. We can build it from the dbname, host and port output of the following avn command

Loading code...

connection.user and connection.password: PostgreSQL credentials. The default avnadmin credentials are available as the user and password output of the avn command above
table.whitelist: list of tables to source from PostgreSQL, in our case football_players
mode: Kafka Connect JDBC mode. The main modes are bulk, incrementing and timestamp (a combined timestamp+incrementing mode also exists). For this post we'll use timestamp. For a more detailed description of modes, please refer to the help article
timestamp.column.name: list of timestamp column names. The value for this setting should be modified_at,created_at, since modified_at will contain the most recent update timestamp and, when it is null, the connector falls back to the created_at column.
poll.interval.ms: time between database polls
topic.prefix: prefix for topic, the full topic name will be a concatenation of topic.prefix and the PostgreSQL table name.
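To make the parameter list concrete, here is a minimal Python sketch that assembles such a configuration and prints it as JSON. It is not the exact file from this post. The connector.class value, the sslmode=require URL parameter and the poll.interval.ms value are assumptions for illustration, and the helper names build_jdbc_url and build_connector_config are hypothetical. The connector name, table name, timestamp columns and topic prefix are the ones used in this post.

```python
import json


def build_jdbc_url(host, port, dbname, params="sslmode=require"):
    """Build a JDBC URL in the form described above.

    The default `params` value is an assumption, not taken from the post.
    """
    return f"jdbc:postgresql://{host}:{port}/{dbname}?{params}"


def build_connector_config(host, port, dbname, user, password):
    """Return the connector settings as a dict ready to be dumped to JSON."""
    return {
        "name": "pg-timestamp-source",
        # Assumed class name for the JDBC source connector; check your platform.
        "connector.class": "io.aiven.connect.jdbc.JdbcSourceConnector",
        "connection.url": build_jdbc_url(host, port, dbname),
        "connection.user": user,
        "connection.password": password,
        "table.whitelist": "football_players",
        "mode": "timestamp",
        # modified_at first, created_at as the fallback when modified_at is null.
        "timestamp.column.name": "modified_at,created_at",
        # Illustrative polling interval (2 seconds); pick your own value.
        "poll.interval.ms": "2000",
        "topic.prefix": "pg_source_",
    }


if __name__ == "__main__":
    config = build_connector_config(
        "<HOSTNAME>", "<PORT>", "<DATABASE>", "avnadmin", "<PASSWORD>"
    )
    print(json.dumps(config, indent=4))
```

With these settings, rows from football_players end up in a topic named pg_source_football_players, the concatenation of topic.prefix and the table name.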

Start the JDBC connector

After storing the above JSON in a file named kafka_jdbc_config.json, we can now start the Kafka Connect JDBC connector in our terminal with the following command:

Loading code...

We can verify the status of the Kafka Connect connector with the following avn command:

Loading code...

Note that the last parameter pg-timestamp-source in the avn command above refers to the Kafka Connect connector name defined in the name setting of the kafka_jdbc_config.json configuration file. If all settings are correct, the above command will show our healthy connector in the RUNNING state:

Loading code...

Check the data in Kafka with kcat

The data should now have landed in Apache Kafka. How can we check it?
We can use kcat, a handy command-line utility.

Once kcat is installed, we'll need to set up the connection to our Kafka environment.

By default, Aiven enables SSL certificate-based authentication. The certificates are available from the Aiven console for manual download. With the Aiven CLI, you can avoid the clicking by running the following avn command in the terminal:

Loading code...

These commands create a kafkacerts folder (if it doesn't already exist) and download into it the ca.pem, service.cert and service.key SSL certificates required to connect.

The last missing piece of information that kcat needs is where to find our Kafka instance in terms of hostname and port. This information can be displayed in our terminal with the following avn command

Loading code...

Once we've collected the required info, we can create a kcat.config file with the following entries

Loading code...

Remember to substitute the <KAFKA_SERVICE_URI> with the output of the avn service get command mentioned above.
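As a rough illustration of what such a file can contain, the Python sketch below renders a kcat.config using standard librdkafka SSL property names and the certificate file names downloaded into kafkacerts. The exact entries of the original file are not reproduced here, so treat the property selection as an assumption. The helper name render_kcat_config is hypothetical.

```python
def render_kcat_config(kafka_service_uri, cert_dir="kafkacerts"):
    """Return kcat.config contents for SSL certificate-based authentication.

    `kafka_service_uri` is the host:port of the Kafka service;
    `cert_dir` is the folder holding ca.pem, service.cert and service.key.
    """
    lines = [
        f"bootstrap.servers={kafka_service_uri}",
        "security.protocol=ssl",
        f"ssl.key.location={cert_dir}/service.key",
        f"ssl.certificate.location={cert_dir}/service.cert",
        f"ssl.ca.location={cert_dir}/ca.pem",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Replace the placeholder with the real service URI before using the file.
    with open("kcat.config", "w", encoding="utf-8") as handle:
        handle.write(render_kcat_config("<KAFKA_SERVICE_URI>"))
```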

Now we are ready to read the topic from Kafka by pasting the following command in our terminal:

Loading code...

Note that we are using kcat in consumer mode (flag -C) reading from the topic pg_source_football_players which is the concatenation of the topic.prefix setting in Kafka Connect and the name of our football_players PostgreSQL table.

As expected, since the connector is working, kcat will output the three messages present in the Kafka topic matching the three rows in the football_players PostgreSQL table

Loading code...

Updating the listings

Now, let's see if our football scouts around the world can fetch some news for us.

Wow, we found a new talent named Enzo Gorlami, and Cristiano Rolando officially retired today from professional football (please be aware this post does not reflect football reality). Let's push both pieces of news to PostgreSQL:

Loading code...

We can verify that the data is correctly stored in the database:

Loading code...

And in kcat we receive the following two updates:

Loading code...

Wrapping up

This blog post showed how to easily integrate PostgreSQL and Apache Kafka with a fully managed, config-file-driven Kafka Connect JDBC connector. We used a timestamp-based approach to retrieve the rows changed since the previous poll and push them to a Kafka topic, at the cost of some additional query load on the source database.
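The idea behind timestamp-based polling can be modelled in a few lines of Python. This is a simplified sketch of the concept, not the connector's actual implementation: rows are plain dictionaries, the effective timestamp is modified_at with a fallback to created_at, and each poll returns the rows newer than the last timestamp seen. The function names and the first two player names are hypothetical.

```python
from datetime import datetime, timedelta


def effective_timestamp(row, columns=("modified_at", "created_at")):
    """Return the first non-null timestamp among the configured columns."""
    for column in columns:
        if row.get(column) is not None:
            return row[column]
    return None


def poll(rows, last_seen):
    """Return rows changed after `last_seen`, oldest first, plus the new offset."""
    changed = []
    for row in rows:
        ts = effective_timestamp(row)
        if ts is not None and ts > last_seen:
            changed.append(row)
    changed.sort(key=effective_timestamp)
    new_offset = effective_timestamp(changed[-1]) if changed else last_seen
    return changed, new_offset


def demo():
    """Simulate an initial load followed by one update and one insert."""
    t0 = datetime(2021, 1, 1, 12, 0, 0)
    rows = [
        {"name": "Player A", "created_at": t0, "modified_at": None},
        {"name": "Player B", "created_at": t0 + timedelta(seconds=1), "modified_at": None},
        {"name": "Cristiano Rolando", "created_at": t0 + timedelta(seconds=2), "modified_at": None},
    ]
    first, offset = poll(rows, datetime.min)

    # One existing row is updated and one new row is inserted.
    rows[2]["modified_at"] = t0 + timedelta(minutes=5)
    rows.append({"name": "Enzo Gorlami", "created_at": t0 + timedelta(minutes=6), "modified_at": None})
    second, offset = poll(rows, offset)

    return [r["name"] for r in first], [r["name"] for r in second]


if __name__ == "__main__":
    print(demo())
```

The first poll returns all three rows, while the second returns only the updated and the newly inserted row. Note that every poll is a query against the source table, which is where the extra load comes from, and that deleted rows are invisible to this approach.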

An alternative method is represented by Change Data Capture solutions like Debezium, which, in the case of PostgreSQL, reads changes directly from WAL files, avoiding any additional query load on the source database. A guide on how to set up CDC for Aiven PostgreSQL is provided in Create a Debezium source connector from PostgreSQL® to Apache Kafka®.

Not using Aiven for PostgreSQL® and our other services yet? Sign up now for your free trial at https://console.aiven.io/signup!

