Apache Kafka® is the backbone of modern data platforms, allowing data to flow to where it is needed. Kafka Connect is the magic that integrates Apache Kafka with a wide selection of technologies, as data sources or sinks, by defining a few lines in a JSON configuration file.
Sometimes, Apache Kafka® Connect might appear as dark magic: its plethora of connectors with partially overlapping functionality, the inconsistent configuration parameters and the vague error messages might give the feeling of a hidden art behind the tool. Therefore, especially if you’re new to this field, some Kafka Connect configuration tips can take you from being willing to sacrifice a cockroach to the Connect gods to a perfectly working and reliable streaming pipeline.
Tip #0: The basic rule
Before jumping to the tips, we need to share the fundamental law of becoming an Apache Kafka® Connect magician: WE NEED TO READ THE MANUAL!
Apache Kafka® Connect is solving a complex integration problem, and doing a great job in taking part of the complexity away. Still, the space is huge, with a great variety of technologies and several partially overlapping connectors solving similar integration problems.
Our first duty to become the next Kafka Connect Houdini is to browse for information, understand which connectors exist for the integration problem we’re trying to solve and read their instructions carefully to evaluate their usage for our case.
Now, it’s officially time to start with the tips!
Tip #1: Prepare the data landing ground
Like magicians use their hats to store rabbits for their tricks, we need to prepare a soft cushion for our data to land properly. Whether we’re sourcing or sinking data from Apache Kafka®, we should pre-create all the data structures needed to receive the data.
Most of the time, we will be offered some shortcuts in the form of settings such as auto.create or auto.evolve, delegating the creation or alteration of the target topic or table to Apache Kafka® Connect. However, by doing so we lose control of these artifacts, which could cause problems in the downstream data pipelines. For instance, Kafka Connect will generate a topic with the default partition count, or tables without the partitioning we have in mind. Therefore the suggestion is: read the documentation carefully and pre-create the necessary landing spots accordingly.
If, after reading the documentation, we’re still unsure where the data will land, then we can create a test environment, enable auto.create and take note of which artifacts are created so we can properly define them in production.
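As a rough sketch of this tip, the snippet below builds a JDBC sink configuration where automatic table creation and evolution are explicitly switched off, so the connector can only write to a table we prepared ourselves. The connector class and the auto.create / auto.evolve parameter names follow the Aiven JDBC sink connector; the connector name, topic, connection URL, credentials and key column are hypothetical placeholders.

```python
import json

# Hypothetical JDBC sink settings: host, database, topic and key column
# are placeholders, not values from a real deployment.
jdbc_sink_config = {
    "name": "orders-jdbc-sink",
    "connector.class": "io.aiven.connect.jdbc.JdbcSinkConnector",
    "topics": "orders",
    "connection.url": "jdbc:postgresql://db.example.com:5432/shop",
    "connection.user": "connect_user",
    "connection.password": "change-me",
    # Do not let the connector create or alter the target table:
    # it must already exist with the layout we designed.
    "auto.create": "false",
    "auto.evolve": "false",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "order_id",
}


def uses_precreated_table(config):
    """Return True when the connector may neither create nor alter tables."""
    return (
        config.get("auto.create") == "false"
        and config.get("auto.evolve") == "false"
    )


print(json.dumps(jdbc_sink_config, indent=2))
```

In a test environment we could flip auto.create to "true", inspect the table the connector generates, and then turn that definition into a reviewed script for production.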
Tip #2: Evaluate the benefits, limits and risks of the various connectors
Like magicians needing to learn all the spells to perform tricks, we need to gather similar knowledge of the Apache Kafka® Connect space. As mentioned above, Kafka Connect is an amazing and wide space, with different connectors solving similar problems in slightly different ways.
Part of working successfully with Kafka Connect is understanding which connectors can solve the integration problem we are facing, along with their benefits, their technical and licensing limits, and the related risks. Once we have a clear map of the options, we can choose the best one for our needs.
A practical example: to source database data into Apache Kafka® we have two choices, a polling mechanism based on JDBC queries or the Debezium push mechanism. Both seem valid alternatives, but when we start pushing the boundaries the JDBC solution shows its limits. Knowing the limits of a solution will help us make a better choice.
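To make one of those limits concrete, here is a toy simulation (plain Python, no database and no Kafka involved) of timestamp-based polling compared with reading an ordered change log. The table, the function names and the timestamps are all invented for illustration; the only assumption is the general behaviour of query-based polling, which sees the current state of a row rather than every change made to it.

```python
table = {}       # current table state, keyed by row id
change_log = []  # every change in order, as a log-based reader would see it


def apply_change(op, row_id, status=None, ts=0):
    """Apply a change to the table and record it in the change log."""
    change_log.append((op, row_id, status))
    if op == "delete":
        table.pop(row_id, None)
    else:
        table[row_id] = {"id": row_id, "status": status, "updated_at": ts}


def poll(last_seen_ts):
    """Timestamp-style polling: return rows modified after last_seen_ts."""
    rows = [row for row in table.values() if row["updated_at"] > last_seen_ts]
    return sorted(rows, key=lambda row: row["id"])


apply_change("insert", 1, "new", ts=1)
apply_change("insert", 2, "new", ts=2)
first_poll = poll(0)      # both rows are picked up

apply_change("update", 1, "paid", ts=3)
apply_change("update", 1, "shipped", ts=4)
apply_change("delete", 2)
second_poll = poll(2)     # only row 1, already in its final state

print(second_poll)
print(change_log)
```

In this sketch the second poll never sees the intermediate "paid" status or the deletion of row 2, while the change log contains all five events. Whether that matters depends on the use case, which is exactly why the options need to be evaluated upfront.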
Tip #3: Check the pre-requisites
Magicians need to check they have all the listed ingredients for their potions. To build a successful connector, we need to take the same care in validating that all the prerequisites are satisfied!
First of all, Apache Kafka® Connect is Java-based, therefore to run a specific connector we need to put all the required JAR dependencies in exactly the right folder (check out the case for Twitter). This is quite a task by itself and is where using managed platforms like Aiven for Apache Kafka® Connect can help by removing the friction.
Once dependencies are sorted, we still need to properly test that everything we need is there:
Check the network paths: can we ping the database? Is Google Cloud Storage accessible from the Kafka Connect cluster?
Evaluate the credentials and privileges: can the user log in? Can it read from the table or write to the S3 bucket?
Validate that required objects are in place: is the target S3 bucket already in place? What about the database replication needed by Debezium?
Ensuring we have all the bits and pieces in place before starting the connector will provide a smoother experience. The last thing we want is to lose two hours of time checking the connector configuration when the problem is that there’s no network connectivity between our endpoints.
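The network part of this checklist can be scripted. The helper below is a minimal sketch that uses only the Python standard library to test whether a TCP connection to a given host and port can be opened; the function name is ours. Note that it checks a TCP port rather than sending an ICMP ping, which is usually closer to what a connector actually needs, and that it should be run from a machine in the same network as the Kafka Connect cluster to be meaningful.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

For example, can_reach("db.example.com", 5432) would tell us whether a (hypothetical) PostgreSQL endpoint is reachable before we spend time on the connector configuration. Credentials and privileges still need a separate check with the client of the target technology.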
Special mention: data formats
Data formats are a commonly overlooked topic but, if not handled properly, they can have huge implications for the downstream pipeline.
When using the default configurations, most of the source connectors push data to Apache Kafka® topics in JSON format. This approach works for the majority of data pipelines but there’s an exception: the lack of a properly defined schema means that we’ll not be able to sink data to technologies, like relational databases, where understanding the data structure is required. We’ll face the error “No fields found using key and value schemas for table” and, as of today, there are no workarounds to make such a connector work.
The tip is to use data formats that specify schemas whenever possible. From a connector configuration standpoint, this means adding a few lines of configuration (check the value.converter in the Debezium example) to make use of tools like Karapace to store key and value schemas.
Once we have the schema properly defined during data ingestion in Apache Kafka®, we can use the same schema registry functionality to let the sink connector understand the data shape and push it to any of the downstream technologies, whether they require a schema or not.
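As an illustration, the helper below generates the converter properties for both key and value so that schemas are stored in a schema registry such as Karapace. It assumes the Avro converter class io.confluent.connect.avro.AvroConverter is installed on the Kafka Connect cluster; the function name, registry URL and credentials are hypothetical.

```python
def avro_converter_settings(schema_registry_url, user_info=None):
    """Build key/value converter properties that store schemas in a registry."""
    settings = {}
    for side in ("key", "value"):
        prefix = f"{side}.converter"
        settings[prefix] = "io.confluent.connect.avro.AvroConverter"
        settings[f"{prefix}.schema.registry.url"] = schema_registry_url
        if user_info is not None:
            # Basic authentication against the registry, as "user:password".
            settings[f"{prefix}.basic.auth.credentials.source"] = "USER_INFO"
            settings[f"{prefix}.schema.registry.basic.auth.user.info"] = user_info
    return settings


# Hypothetical registry endpoint and credentials.
converter_config = avro_converter_settings(
    "https://registry.example.com:443", user_info="registry_user:change-me"
)
```

The resulting dictionary can be merged into both the source and the sink connector configuration, so that the two sides agree on where schemas live.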
Tip #4: Reshape the message payload
Apache Kafka® Connect provides the magical ability to change the data shape while sourcing/sinking. The power is given by Single Message Transforms (SMTs), enabling us to reshuffle the data in several different ways including:
Filtering: pass only a subset of the incoming dataset
Routing: send different events to separate destinations based on particular field values
Defining Keys: define the set of fields to be used as event key (this will be discussed more in-depth later)
Masking: obfuscate or remove a field, useful for PII (Personally Identifiable Information) data
SMTs are a very powerful Swiss Army knife to customize the shape of the data during the integration phase.
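A connector declares its SMTs as a comma-separated list of aliases in the transforms property, plus a transforms.<alias>.type entry and the parameters for each alias. The sketch below flattens an ordered chain into that shape. The three transform classes ship with Apache Kafka; the helper name, the aliases, the field names and the topic pattern are made up for the example.

```python
def smt_properties(transforms):
    """Flatten an ordered list of (alias, class_name, params) into properties."""
    props = {"transforms": ",".join(alias for alias, _, _ in transforms)}
    for alias, class_name, params in transforms:
        props[f"transforms.{alias}.type"] = class_name
        for key, value in params.items():
            props[f"transforms.{alias}.{key}"] = value
    return props


smt_config = smt_properties([
    # Masking: blank out a PII field in the message value.
    ("maskEmail", "org.apache.kafka.connect.transforms.MaskField$Value",
     {"fields": "email"}),
    # Defining keys: build the message key from a value field.
    ("setKey", "org.apache.kafka.connect.transforms.ValueToKey",
     {"fields": "order_id"}),
    # Routing: rename the destination topic with a regular expression.
    ("route", "org.apache.kafka.connect.transforms.RegexRouter",
     {"regex": "shop\\.(.*)", "replacement": "clean_$1"}),
])
```

Transforms are applied in the order in which the aliases are listed, so the order of the chain matters.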
Tip #5: Define the keys to drive data partitioning and lookups
Keys are used in Apache Kafka® to define partitioning, and in the target system to perform lookups. When defined properly, keys drive performance benefits both when sourcing (parallel writes to partitions) and sinking (e.g. partition identification in database tables).
It is therefore very important to analyze and accurately define the keys, both for performance and for correctness (ordering in Apache Kafka® is guaranteed only within a partition). In her blog, Olena dives deep into how to balance your data across partitions and the tradeoffs you might encounter when selecting the best partitioning strategy.
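The link between keys, partitions and ordering can be shown with a few lines of Python. This is only a conceptual sketch: it uses zlib.crc32 as a stand-in for the hash applied by Kafka's default partitioner (which is a different function), and the keys, events and partition count are invented.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (stand-in for Kafka's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("customer-1", "created"),
    ("customer-2", "created"),
    ("customer-1", "paid"),
]

# Group events by the partition their key maps to, preserving arrival order.
placement = {}
for key, action in events:
    placement.setdefault(partition_for(key, 6), []).append((key, action))

print(placement)
```

Because the same key always maps to the same partition, all the events of customer-1 stay in order relative to each other. Events with different keys may land in different partitions, where no relative ordering is guaranteed.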
Tip #6: Increase the connector’s robustness
To strengthen our Apache Kafka® Connect magic powers, we need to make our connectors more robust and less vulnerable to errors. Testing the performance, understanding bottlenecks, and continuously monitoring and improving our pipelines are a few of the “evergreen” suggestions that can also be applied in this space. There are also a couple of detailed tips that could possibly save us from specific failures.
Reduce the amount of data in flight
Almost every connector allows the definition of the data collection/push frequency. Sinking data to a target environment once per day means that the connector needs to hold the whole daily dataset; if the data volume becomes too big, the connector could crash, forcing us to restart the process and therefore adding delay to the pipeline. Writing less data more often can mitigate the risk; we need to find the right balance between frequency and “batch” size.
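A quick back-of-the-envelope calculation shows how much the frequency matters. The numbers below are invented (a steady 500 records per second), and the function is just arithmetic, not a connector API.

```python
def records_held(records_per_second, flush_interval_seconds):
    """Estimate how many records accumulate between two flushes."""
    return records_per_second * flush_interval_seconds


daily = records_held(500, 24 * 60 * 60)   # one flush per day
five_minutes = records_held(500, 5 * 60)  # one flush every five minutes

print(daily, five_minutes, daily // five_minutes)
```

With these assumptions a daily flush has to hold 43,200,000 records, while a five-minute flush holds 150,000, which is 288 times less data at risk if the connector crashes. The exact parameter controlling the frequency differs from connector to connector, so it needs to be looked up in the connector's documentation.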
Parallelize the load to increase performance and reduce risk
Kafka Connect has the concept of tasks that can be used to parallelize the load.
As an example, if we need to source 15 tables from a database, having a single connector with a single task to carry all the load could be a dangerous choice. Instead, we could distribute the data ingestion, either by defining one connector with 15 tasks, or 15 different connectors with a single task each, depending on our needs.
Parallelizing the work helps both in increasing performance and in reducing risks related to a single point of failure.
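The two layouts can be sketched as follows for a JDBC source. The tasks.max property is a standard Kafka Connect setting and acts as an upper bound on the number of tasks; table.whitelist and the other parameters follow the Aiven JDBC source connector. The helper names, connector names, table names and connection URL are hypothetical.

```python
def one_connector_many_tasks(tables, base_config):
    """One connector covering all tables, with up to one task per table."""
    return {
        **base_config,
        "name": "jdbc-source-all",
        "table.whitelist": ",".join(tables),
        "tasks.max": str(len(tables)),
    }


def one_connector_per_table(tables, base_config):
    """One single-task connector per table, each failing independently."""
    return [
        {
            **base_config,
            "name": f"jdbc-source-{table}",
            "table.whitelist": table,
            "tasks.max": "1",
        }
        for table in tables
    ]


# Hypothetical base settings shared by both layouts.
base_config = {
    "connector.class": "io.aiven.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db.example.com:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "shop.",
}
tables = [f"table_{i:02d}" for i in range(1, 16)]

single = one_connector_many_tasks(tables, base_config)
many = one_connector_per_table(tables, base_config)
```

The first layout is easier to manage, while the second lets us pause, restart or tune the ingestion of each table on its own.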
Tip #7: Know how to debug
Nobody becomes an Apache Kafka® Connect magician without mistakes; we’ll hardly nail every setting on the first attempt. Our work is to understand what is wrong and fix it. A couple of general hints to have a successful experience:
Check the logs: logs are the source of truth. The error description contained in the logs can help us understand the nature of the problem and give us hints on where to look.
Browse for a solution: the internet will provide plenty of fixes to a specific error message. We need to take time and understand carefully if the suggested fix applies to our situation. We can’t assume a solution is valid only because the error message posted is similar.
Since Kafka is a relatively new tech, we could also face a situation where nobody seems to have had the same error before us. However, in that case, a second look at the logs and a read of the connector code might take us to a solution… that’s the beauty of Open Source!
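Besides the logs, the Kafka Connect REST API exposes the state of a connector and of its tasks at GET /connectors/<name>/status, including the stack trace of a failed task. The sketch below parses such a response and lists the failed tasks. The sample payload is illustrative: it follows the shape of the status response, but the connector name, worker addresses and trace text are placeholders.

```python
import json


def failed_tasks(status):
    """Return (task id, first line of trace) for every task in FAILED state."""
    failures = []
    for task in status.get("tasks", []):
        if task.get("state") == "FAILED":
            trace_lines = task.get("trace", "").splitlines()
            failures.append((task["id"], trace_lines[0] if trace_lines else ""))
    return failures


# Illustrative response body; names, addresses and trace are placeholders.
sample_status = json.loads("""
{
  "name": "orders-sink",
  "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
  "tasks": [
    {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    {"id": 1, "state": "FAILED", "worker_id": "10.0.0.1:8083",
     "trace": "<first line of the stack trace>\\n<remaining lines>"}
  ],
  "type": "sink"
}
""")

print(failed_tasks(sample_status))
```

Note that the connector itself can be reported as RUNNING while one of its tasks is FAILED, so it pays to check the tasks and not only the top-level state.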
The error tolerance and automatic restart
The error tolerance (the errors.tolerance parameter) defines how many parsing mistakes we can accept before having the connector fail. The two options are none, where Kafka Connect will crash on the first error, and all, where the connector will continue running even if none of the messages can be understood.
A middle ground is represented by the dead letter queue, a topic where we will receive all the erroring messages. The dead letter queue is a great way to make the connector more robust to single message errors but, if used, needs to be actively monitored. The last thing we want is to discover, one year later, 200,000 unparsed messages in our orders topic because of a silly formatting error.
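For a sink connector, the tolerance and the dead letter queue are configured with the errors.* properties of Kafka Connect. The helper below is a small sketch that returns those properties as a dictionary; the function name, the topic name and the default replication factor are our own choices.

```python
def dead_letter_queue_settings(dlq_topic, replication_factor=3):
    """Tolerate bad records and route them to a dead letter queue topic."""
    return {
        # Keep running when a record cannot be converted or transformed.
        "errors.tolerance": "all",
        # Failed records are written to this topic instead of being dropped.
        "errors.deadletterqueue.topic.name": dlq_topic,
        "errors.deadletterqueue.topic.replication.factor": str(replication_factor),
        # Add headers describing why and where each record failed.
        "errors.deadletterqueue.context.headers.enable": "true",
        # Also report failures in the Kafka Connect logs.
        "errors.log.enable": "true",
    }


dlq_config = dead_letter_queue_settings("orders-sink-dlq")
```

The dead letter queue topic is a regular topic, so following tip #1 it is worth pre-creating it and setting up an alert on its message count.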
Related to the tolerance, another useful parameter is automatic restart, allowing us to try reanimating a crashed connector. Turning it on can be a good way to rescue our connector from transient errors but will not save us in cases where our configuration is just wrong.
Tip #8: Keep the evolution trace
The concept of a “spell book” translates really well to Apache Kafka® Connect and approaching the connector configuration with an iterative method can take us a long way.
First, we need to read the set of configurations available, then analyze what parameters are necessary and keep our connector configs as simple as possible to build a minimal integration example that can be evolved over time.
Linked with the above, it’s also worth spending time on properly setting up a version control system for the configuration and automating the deployment as much as possible. This approach will save us time when needing to revert non-working changes and reduce the risk of human errors during deployment. Apache Kafka® Connect REST APIs or tools like the Aiven Client, Aiven Terraform Provider or kcctl can help in the automation process.
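As a sketch of what such automation could look like with the Kafka Connect REST API, the function below prepares a PUT /connectors/<name>/config request, which creates the connector if it doesn't exist and updates it otherwise. It uses only the Python standard library. The function name, the Connect URL and the configuration are hypothetical, and in a real pipeline the configuration would be loaded from a file kept under version control.

```python
import json
import urllib.request


def build_deploy_request(connect_url, name, config):
    """Prepare the REST call that creates or updates a connector."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors/{name}/config",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical minimal configuration and Kafka Connect endpoint.
minimal_config = {
    "connector.class": "io.aiven.connect.jdbc.JdbcSinkConnector",
    "topics": "orders",
    "connection.url": "jdbc:postgresql://db.example.com:5432/shop",
}
request = build_deploy_request("http://localhost:8083/", "orders-sink", minimal_config)
```

Passing the request to urllib.request.urlopen would send it to the cluster; that step is left out here so the sketch has no side effects. Since the same call handles both creation and update, reverting a broken change is a matter of redeploying the previous version of the file.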
Making the integration magic
Understanding Apache Kafka® Connect’s dark magic can be overwhelming at first sight, but by browsing its ecosystem, reading the various connectors’ documentation, and following the above tips we can start using our new wizardry to build fast, scalable, and resilient streaming data pipelines.
Some more resources to get you started:
Aiven for Apache Kafka® Connect: don’t lose time setting up the Kafka Connect cluster, focus on creating the integration instead
Twitter example: check out how to self-host a Kafka Connect cluster to run any connector you need, while still benefitting from a managed Apache Kafka® cluster