Aiven Blog

Nov 3, 2021

Optimizing data streaming pipelines: a panel recap

Apache Flink and Apache Kafka go together like foo and bar. Read or listen to panelists Olena Babenko, Francesco Tisiot, and Gyula Fora explain their take.

Ana Vasiliuk

Ana Vasiliuk

|RSS Feed

Community Manager at Aiven

Aiven hosted a panel discussion at Flink Forward Global 2021, the conference on all things Apache Flink and stream processing. Being part of the conference for the first time was exhilarating and we really enjoyed interacting with the Flink community.

Our panelists, Olena Babenko (Aiven), Francesco Tisiot (Aiven), and Gyula Fora (Cloudera), sat down to discuss their favourite things about Flink SQL and how it’s a good fit in your data streaming pipeline with Kafka. We certainly got a lot of great questions from the audience and useful insights.

The panel recording is available on the Flink Forward YouTube channel and at the end of this post, and we can’t wait for you to watch it! In the meantime, read on for a recap and some tips from our experts.

According to Francesco Tisiot, as we’re moving away from batch processing and going into streaming, we need a platform that is performant and scalable, and proven to be successful. Apache Kafka is the perfect match in this case with its beautiful features like Kafka Connect, allowing you to connect your Kafka instance to other systems. However, it often acts as a messenger, taking data from one place and pushing it to the other.

If you want to level up your game and perform analytics, Flink is a very powerful tool for that. It understands the architecture of Kafka (i.e. topics, partitions) and optimises the workload across those parameters.

Secondly, Flink supports a vast range of data platforms, both for input and output. This makes it a good choice for your data pipelines, no matter where your data sits. The combination of Flink and Kafka is truly powerful and takes your data streaming to the next level.

Working with Kafka and Flink services, we’re often faced with the issue of skewed data. For example, if you have 10 partitions in your Kafka service and only 1 partition has 5GB and the rest have 2MB. Usually this happens because of a mismatch between the node keys. If there is a mismatch in requests, your Kafka and Flink service performance will suffer when they try to process the data.

According to Olena Babenko, you can avoid that by adjusting your metrics, such as maximum number of message bytes and patch sizes, to minimise the impact of the data overload. Additionally, if you are the manager of the Kafka service, you can review your partitioning mechanism to ensure an even distribution of messages.

Our panelist had diverse opinions on this topic. As a Flink expert and long-term committer, Gyula Fóra mentioned that Flink has greatly improved with each release. There is definitely no better time to join Flink than now, with a lot of new feature additions, great connectors, and cloud support. Kudos to the Flink community for working hard on adding new features at an incredible pace!

However, there are some things that could be improved on the SQL side of things. For example, good operational support for SQL queries could come in handy, or a way of guaranteeing savepoint compatibility between SQL jobs. Additionally, as in Francesco’s case, telling a user exactly where errors in the SQL statements are could make it a better experience, especially for new Flink users.

The SQL client, although it has been improved a lot, is still a work in progress. Especially if we look at the SQL CLI, you want to make sure that, when you run real-world applications, logging and all other operations are configured correctly and set up for production. Most companies have a deployment stack around this and a regular Table API program is a better fit than the SQL CLI.

Another small issue with SQL applications is the difficulty to write a unit test on it, because it’s hard to make it isolated and fast. Olena suggests one tip would be to mock as many sources of your data as possible (e.g. data fakers) to try and perform in-memory testing. In the worst case scenario, you can also try using a file system or Dockerised tools as sources.

Our panelists agree that Flink is becoming more and more mainstream. In most cases, it has pretty much all you need for stream processing. In some cases, the developer experience in the company makes it impractical to get a new tool, e.g. when you already have a strong team with extensive Kafka experience. However, the downside of using platform-specific tools like kSQL is that you are forced to use Kafka as both source and sink. Flink, on the other hand, doesn’t have such a limitation and works over a broad tech ecosystem with its wide range of connectors.

Without a doubt, you should check out Flink documentation – there is a lot of great content that will help you understand Flink’s capabilities. Secondly, we recommend trying the SQL client as a great way to get started with Flink. Thirdly, if you are looking for a streaming SQL interface, we're launching our Apache Flink beta program soon, and we can’t wait for you to try it out!

Wrapping up

Not using Aiven services yet? Sign up now for your free trial at https://console.aiven.io/signup!

In the meantime, make sure you follow our changelog and blog RSS feeds or our LinkedIn and Twitter accounts to stay up-to-date with product and feature-related news.


Related resources