Kafka Connector and Source Database Table Mapping

Hello,

I am currently exploring data transfer using Kafka and have encountered some challenges and decisions that I’d like to discuss with the community.

Configuration:

  • We are using Kafka for data transfer from AWS Aurora to Google BigQuery.
  • Our source database is PostgreSQL, and we are utilizing the PostgreSQL Source Connector for Kafka.

Challenges:

  • Due to our database schema, we cannot merge tables into a single connector: primary key names are inconsistent, timestamp column names vary from table to table, and so on.
  • If we map source DB tables to Kafka connectors one-to-one, we end up maintaining at least as many database connections as there are tables, in line with the number of connector tasks. This inevitably consumes a portion of our database resources (memory). A sketch of this setup is shown below.
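
For illustration, a per-table connector in this pattern looks roughly like the sketch below. It assumes Aiven's fork of the JDBC source connector (io.aiven.connect.jdbc.JdbcSourceConnector); the hostname, table, and column names are placeholders.

```properties
# One connector per table: this configuration is repeated for every table,
# each copy with its own key/timestamp column names, and each copy holding
# its own JDBC connection to the database.
name=src-orders
connector.class=io.aiven.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://aurora-endpoint.example.com:5432/appdb
connection.user=connect_user
connection.password=********
mode=timestamp+incrementing
table.whitelist=orders
# These two values differ from table to table, which is what prevents grouping.
incrementing.column.name=order_id
timestamp.column.name=updated_at
topic.prefix=pg_
tasks.max=1
```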

Questions for Discussion:

  1. Are there established practices in which tables and connectors are operated on a one-to-one basis?
  2. What are the pros and cons of creating connectors for each table versus grouping them?

Any insights or experiences shared would be greatly appreciated!

I’m sharing your question with our Connectors’ expert!

Hello!

The PostgreSQL Source Connector (based on JDBC) has the limitation that you describe, where all tables must have a somewhat consistent structure in order to be sourced together. This is a design choice in the connector, and working around it directly would involve reshaping your data, perhaps through database views to expose a consistent table shape to the connector.
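
To make the view-based workaround concrete, here is a rough sketch, again assuming Aiven's fork of the JDBC source connector and hypothetical view and column names: each underlying table is wrapped in a view that exposes a standardized key column and timestamp column, so a single connector configuration can poll several of them.

```properties
# Hypothetical views orders_cdc and customers_cdc each expose the columns
# "id" and "updated_at", regardless of how the underlying tables name them.
name=src-standardized-views
connector.class=io.aiven.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://aurora-endpoint.example.com:5432/appdb
connection.user=connect_user
connection.password=********
mode=timestamp+incrementing
table.whitelist=orders_cdc,customers_cdc
# Source from views rather than base tables (the default table type is TABLE).
table.types=VIEW
# One shared pair of column names, made possible by the views.
incrementing.column.name=id
timestamp.column.name=updated_at
topic.prefix=pg_
tasks.max=2
```

When there are more tables (or views) than tasks.max, several tables share a task and its JDBC connection, which is one way to cap the connection count.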

The Debezium for PostgreSQL Source Connector does not have this limitation, and is intended to get the most value out of a single CDC connection to the database, reading changes from multiple tables together, regardless of structure. It is a completely different implementation of the source connector, and has different requirements, so check it out to see if it works in your context.
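
For comparison, a minimal Debezium PostgreSQL source configuration could look like the sketch below. Hostnames and table names are placeholders, and exact property names vary by Debezium version (for example, 2.x uses topic.prefix where 1.x used database.server.name).

```properties
name=src-debezium-pg
connector.class=io.debezium.connector.postgresql.PostgresConnector
# CDC reads from a logical replication slot, which lives on the primary/writer.
database.hostname=aurora-writer.example.com
database.port=5432
database.user=debezium_user
database.password=********
database.dbname=appdb
plugin.name=pgoutput
slot.name=connect_cdc_slot
# One replication connection, many tables, no shared-structure requirement.
table.include.list=public.orders,public.customers,public.invoices
# Debezium 2.x; older versions use database.server.name instead.
topic.prefix=pg
tasks.max=1
```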

To answer your questions directly:

  1. When tables and connectors are on a 1:1 basis, you may consider setting up a database read-replica just for the connectors. This would lower the resource impact on other database users at the cost of an additional database node.

  2. If you have a connector per table, each configuration is simpler because it only handles one shape of data. When multiple tables flow through one connector, you will need to use Predicates (Apache Kafka) to select which transformations apply to each record, and the Debezium connector has a more involved way of specifying the keys for multiple tables in a single configuration value (Debezium connector for PostgreSQL :: Debezium Documentation); see the sketch after this list. This also extends to some tuning parameters (such as producer batch size and linger.ms), which can only be specified per connector, not per table.
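
As an illustration of the extra configuration a grouped connector needs, the fragment below combines Debezium's per-table key specification with a predicate-scoped transformation. It assumes the Debezium sketch above (topic.prefix=pg); the table and column names are hypothetical, and the unwrap step uses Debezium's ExtractNewRecordState SMT to flatten the change-event envelope before the field-level transform runs.

```properties
# Debezium's message.key.columns lists per-table keys in one value, in the form
# <schema>.<table>:<column>[,<column>];<schema>.<table>:<column>...
message.key.columns=public.orders:order_id;public.customers:customer_id

# Flatten Debezium's change-event envelope, then convert the orders table's
# timestamp column -- but only for records on the orders topic, selected via
# the built-in TopicNameMatches predicate.
transforms=unwrap,convertOrderTs
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.convertOrderTs.type=org.apache.kafka.connect.transforms.TimestampConverter$Value
transforms.convertOrderTs.field=updated_at
transforms.convertOrderTs.target.type=Timestamp
transforms.convertOrderTs.predicate=isOrders

predicates=isOrders
predicates.isOrders.type=org.apache.kafka.connect.transforms.predicates.TopicNameMatches
# Java regex; this matches the topic pg.public.orders.
predicates.isOrders.pattern=pg.public.orders
```

A per-table connector needs none of this, because each configuration only ever handles one table's shape.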

There may also be a tradeoff in scalability and efficiency: the Debezium connectors are limited to a single, very high-throughput CDC connection, while the JDBC implementation distributes the load across multiple lower-throughput JDBC connections.

And most obviously, there is a fixed cost for each producer, task, and JDBC connection; merging connectors amortizes those costs and makes batching more efficient.

Hi, @gharris1727

Thank you for your reply!

Based on your answer, I now have two more questions:

  1. As I understand it, when implementing CDC using Debezium, the connectors connect to the writer instance of the database cluster. Is this correct? If so, considering our system’s workload, selecting a CDC solution seems challenging. Therefore, I’m considering developing a data pipeline using a JDBC connector instead.

  2. You mentioned that “there’s a fixed cost for producers, tasks, and JDBC connections.” However, I understand that costs are incurred only for the Kafka service cluster and Kafka connector service, according to Aiven’s plan. Could you please clarify what you mean by “cost”? Are you referring to server resources?

I would appreciate your further help!

Best,

Hey @Kazuki_Maeda!

I’m glad the above made sense.

  1. Yes, this is correct. The official Debezium documentation says this:

Logical decoding replication slots are supported on only primary servers. When there is a cluster of PostgreSQL servers, the connector can run on only the active primary server.
Debezium connector for PostgreSQL :: Debezium Documentation

If you don’t have access to the primary, or can’t afford the extra load on the primary, then you may have to get by with the JDBC connector targeting a read replica.

  2. I meant server resource requirements, not monetary costs. A JDBC connector solution will ultimately be less efficient than the CDC solution, and may or may not require a larger Connect service, depending on your workload. It may or may not be more effective to resize the database instance instead, and you'd need to do the comparison yourself to make sure you're getting the most value out of the connectors.
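
In practice, pointing the JDBC connector at a replica just means using the cluster's reader endpoint in the connection URL. For Aurora PostgreSQL that could look something like the following (the endpoint name is a placeholder):

```properties
# Aurora exposes a separate reader endpoint (cluster-ro-...) for its read replicas;
# the JDBC source connector can poll that endpoint instead of the writer.
connection.url=jdbc:postgresql://mycluster.cluster-ro-abc123xyz.eu-west-1.rds.amazonaws.com:5432/appdb
```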

I’m happy to help!
Greg

Hi, @gharris1727

Thank you for your detailed explanation!
I will consider our options based on the information you have provided!

Best,

I encountered an issue where certain operations became impossible after creating a certain number of connectors, each paired one-to-one with a table. The problems I experienced included:

  • Inability to check connectors from the web console (requests timed out)
  • Failure to execute 'list connector' in the CLI
  • Inability to perform operations with Terraform (plan/apply)

I found that I had to reduce the number of connectors to about 20 to alleviate the issue.

It seems that the Kafka Connect service still has surplus resources available.

I would appreciate assistance in resolving and improving this issue!

@Kazuki_Maeda
I’m sorry to hear that you’re having issues with the many-connector solution.

The REST API (which all of the operations you listed depend on) relies on the availability of the per-worker “Herder Thread” to process REST requests. If that thread is busy doing something else (such as processing rebalances, or starting and stopping connectors and tasks) or is experiencing some error condition (like an unresponsive Kafka cluster), it may not have a chance to service REST requests.

This thread is a shared resource that isn't captured by memory or CPU metrics, and it can limit the scalability of your service. You may consider increasing the number of workers in your service to see if you can add more connectors, even if the memory or CPU budget per worker seems adequate.

It is also possible that an excessive timeout in your connector configurations is causing them to be slow to start or stop, which may cause the herder thread to block. If you don't see anything obvious, you may need to reproduce the issue and investigate further.
