At the center of every company are two things: data and transformation pipelines. Aggregations, filtering, masking ... so many things are needed to change raw input into useful data assets.
Aiven for Apache Flink® fulfils the need by providing a SQL interface to define streaming and batch data transformation pipelines on top of data hosted in a variety of technologies like Apache Kafka®, PostgreSQL® or OpenSearch®.
A lot has happened since the first beta release of Aiven for Apache Flink. Our fully-managed service for Apache Flink is now in general availability and we've learned a lot about how Flink is used, as well as about stability and lifecycle management!
We sat down with Filip Yonov, Director of Product Management for data streaming services at Aiven, to discuss the journey and where Aiven for Apache Flink is heading next.
From the origins to today, the story of Aiven for Apache Flink
Aiven doesn't just provide data pipelines; we're a data company, too, and we adopt open source technologies for internal purposes. Once we've gained enough experience with a tool and come to trust it, we have the option of packaging it and offering it as a service to our clients, who can then benefit from the functionality without the management hassle. This started with PostgreSQL, which first solved our database needs and then became one of our most widespread data solutions. The process has been repeated since, for example with Apache Kafka, which we use to serve data across technologies in streaming mode.
When we needed data transformation, we looked for tools that could unify pipeline definitions across technologies. We needed our Apache Kafka data to be processed in near-real time to provide timely analytics and notifications, using data from our PostgreSQL database for enrichment. Apache Flink stood out as a clear leader in this space, since it can define pipelines across a wide range of backend technologies. Moreover, the comprehensive set of SQL features, the resilient and scalable architecture, and the unique and supportive community backing the project make Apache Flink a solid choice for critical production workloads.
Since the first release of the Aiven for Apache Flink beta, we've benefited from valuable feedback from customers and internal stakeholders to improve the service and its developer experience. In the following paragraphs we dive deeper into the recent improvements.
An abstraction layer on top of Apache Flink SQL
The first Aiven for Apache Flink product provided a direct mapping to the Apache Flink SQL client: you were able to define tables (sources and sinks of data) and jobs (transformation pipelines). However, while this mirrors the workflow in the SQL client, it becomes misleading when the table definitions evolve over time.
A running Apache Flink job contains the source, sink and transformation SQL as defined at the time of deployment. If a source or sink table definition changes afterwards, the job doesn't pick up the update until it is stopped and restarted. This subtle difference meant that the metadata stored in the first version of Aiven for Apache Flink could be out of sync with what was executed on the cluster, causing confusion or even errors when jobs were restarted.
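To make the out-of-sync problem concrete, here is a minimal Python sketch. It is not how Aiven or Flink store anything internally: `RunningJob`, `deploy` and `out_of_sync` are hypothetical names, and the table definitions are plain strings. It only shows that a job which copies its definitions at deployment time will not see later edits to the stored metadata.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    """Hypothetical model of a deployed job: it carries its own copy of the SQL."""
    table_sql: dict   # table name -> CREATE TABLE text captured at deploy time
    statement: str    # transformation SQL captured at deploy time


def deploy(stored_tables: dict, statement: str) -> RunningJob:
    # Copy the stored definitions so that later edits do not reach the job.
    return RunningJob(table_sql=dict(stored_tables), statement=statement)


def out_of_sync(stored_tables: dict, job: RunningJob) -> list:
    # Names of tables whose stored definition differs from what the job runs.
    return sorted(
        name for name, sql in stored_tables.items()
        if job.table_sql.get(name) != sql
    )


stored = {"orders": "CREATE TABLE orders (id INT)"}
job = deploy(stored, "INSERT INTO sink SELECT id FROM orders")

# The stored definition is edited while the job keeps running.
stored["orders"] = "CREATE TABLE orders (id INT, amount DOUBLE)"
drifted = out_of_sync(stored, job)
```

After the edit, `drifted` lists `orders`: the stored metadata and the running job no longer agree until the job is stopped and restarted.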
We fixed the issue by adding the definition of an Application: a layer of abstraction that includes:
- Source and sink table definitions
- SQL transformation definitions
- Deployment parameters
By working at the Application level, you'll know what your current definitions are at any point in time.
Also, Aiven added a versioning system. You can now navigate between versions, track enhancements, explore new features and roll back changes if things don't look right.
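The Application and versioning idea can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the class and method names are hypothetical, deployment parameters are left out for brevity, and modelling a rollback as "create a new version that copies an old one" is a design choice of this sketch, not a description of the service's behavior.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ApplicationVersion:
    """Immutable bundle of everything a deployment needs (hypothetical model)."""
    version: int
    sources: tuple    # CREATE TABLE statements for source tables
    sinks: tuple      # CREATE TABLE statements for sink tables
    statement: str    # transformation SQL


@dataclass
class Application:
    name: str
    versions: list = field(default_factory=list)

    def create_version(self, sources, sinks, statement) -> ApplicationVersion:
        # Versions are numbered from 1 and never modified after creation.
        new = ApplicationVersion(
            len(self.versions) + 1, tuple(sources), tuple(sinks), statement
        )
        self.versions.append(new)
        return new

    def get(self, version: int) -> ApplicationVersion:
        return self.versions[version - 1]

    def roll_back_to(self, version: int) -> ApplicationVersion:
        # Assumption: a rollback re-publishes an old version as the newest one,
        # so the history stays append-only.
        old = self.get(version)
        return self.create_version(old.sources, old.sinks, old.statement)


app = Application("order-alerts")
app.create_version(["CREATE TABLE orders (id INT)"],
                   ["CREATE TABLE alerts (id INT)"],
                   "INSERT INTO alerts SELECT id FROM orders")
app.create_version(["CREATE TABLE orders (id INT, amount DOUBLE)"],
                   ["CREATE TABLE alerts (id INT)"],
                   "INSERT INTO alerts SELECT id FROM orders WHERE amount > 100")
restored = app.roll_back_to(1)
```

Because each version is frozen, whatever is deployed from version 2 keeps the definitions of version 2, and `restored` becomes version 3 with the same content as version 1.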
Smoother pipeline definition
Writing Flink SQL is not an easy job: defining tables, integrations and metadata parameters can sometimes be a headache. Here are ways in which Aiven for Apache Flink enhances the pipeline definition:
- Simplified integration: forget about difficulties in integration, networking, SSL certificates with data sources and sinks. You can read from and write to Aiven for Apache Kafka, Aiven for PostgreSQL® and Aiven for OpenSearch® by creating an integration in the Aiven Console with a few clicks.
- Pure table SQL definition: in a previous iteration of the table definition, we separated the column definition (that is, which columns are contained in a table) from the metadata syntax (that is, which connector, topic, consumer group to use). However, based on user feedback, we realized that this didn't provide any additional benefit to the experience. Therefore we reverted the table definition to pure SQL, and now it's fully compatible with the Apache Flink SQL Client.
- SQL autocomplete: since both the full table definition and the transformation definition are now in SQL, we added an autocomplete feature that lets developers fast-track their work.
- Flink SQL pipeline output preview: the new pipeline preview feature lets you check that your table definition fetches the correct data and that your SQL transformation performs the right manipulation, without pushing a single record to the target environment. The interactive queries show you the output of a table definition or transformation SQL directly in the Aiven Console.
- Improved OpenSearch and Slack connectors: data doesn't only live in PostgreSQL and Apache Kafka. With the enhanced set of connection options, you can now stream the result of a data pipeline directly to an OpenSearch index or notify people in Slack.
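As an illustration of the "pure table SQL definition" point above, the sketch below shows a single Flink SQL `CREATE TABLE` statement that holds both the columns and the connector metadata. The table, topic, consumer group and server address are made-up placeholders, and on Aiven the integration would normally supply the connection details. The statement is kept in a Python string, and the small `connector_options` helper is hypothetical; it only demonstrates that the metadata can be read back from the same statement.

```python
import re

# Assumed example: columns and connector options live in one statement.
ORDERS_SOURCE = """
CREATE TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DOUBLE,
    order_time TIMESTAMP(3),
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'properties.bootstrap.servers' = 'kafka.example.com:9092',
    'topic' = 'orders',
    'properties.group.id' = 'orders-reader',
    'scan.startup.mode' = 'earliest-offset',
    'value.format' = 'json'
)
"""


def connector_options(ddl: str) -> dict:
    """Return the 'key' = 'value' pairs from the trailing WITH (...) clause."""
    with_clause = ddl[ddl.upper().rindex("WITH ("):]
    return dict(re.findall(r"'([^']+)'\s*=\s*'([^']*)'", with_clause))


options = connector_options(ORDERS_SOURCE)
```

Here `options["connector"]` is `kafka` and `options["topic"]` is `orders`, while the column list sits in the same statement, so nothing has to be kept in sync between two separate definitions.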
Faster definition, reuse and an iterative approach
Clearing ambiguities isn't the only benefit of the Application concept. By storing table and transformation SQL definitions and versioning them, Aiven for Apache Flink becomes the perfect playground to iterate fast, test new deployments, and roll back code changes.
Even more, once a correct prototype is created, the new UI allows you to quickly copy table definitions across applications. This avoids the "empty canvas syndrome" and cuts down the time needed to develop new applications.
Tight control on deployments
Every data pipeline definition is different, and every execution can vary as well. By adding a "Deployment" stage, you can now define application execution parameters, such as the number of parallel tasks. You can also stop deployments and store their state, using Flink savepoints. You can then reuse savepoints to start an application execution from an existing state, generated by the same application, even if it was on a different version. The compatibility of the savepoint is assessed automatically. If there's a positive match, the savepoint is used to start the application from a known point in time.
You can store up to 5 savepoints per application, so you can easily 'time travel' in your Flink application by choosing a new deployment "starting point" based on your needs.
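The savepoint workflow can be sketched as follows. Only the limit of 5 comes from the text above; everything else is an assumption made for illustration. In particular, dropping the oldest savepoint when the limit is reached, the `SavepointStore` class, and the caller-supplied `is_compatible` check are hypothetical, since the post does not describe how the service evicts savepoints or assesses compatibility.

```python
from collections import deque
from dataclasses import dataclass

MAX_SAVEPOINTS = 5  # limit stated in the post


@dataclass(frozen=True)
class Savepoint:
    savepoint_id: str
    application_version: int  # version that was running when the state was saved


class SavepointStore:
    """Hypothetical per-application store; assumes the oldest entry is dropped."""

    def __init__(self, limit: int = MAX_SAVEPOINTS):
        self._items = deque(maxlen=limit)

    def add(self, savepoint: Savepoint) -> None:
        # A deque with maxlen discards the oldest item once it is full.
        self._items.append(savepoint)

    def available(self) -> list:
        return list(self._items)


def pick_starting_point(store: SavepointStore, is_compatible):
    """Return the newest savepoint accepted by is_compatible, or None."""
    for savepoint in reversed(store.available()):
        if is_compatible(savepoint):
            return savepoint
    return None  # no match: the deployment would start without saved state


store = SavepointStore()
for i in range(1, 8):
    store.add(Savepoint(f"sp-{i}", application_version=i))

# Stand-in compatibility rule, purely for the example.
chosen = pick_starting_point(store, lambda sp: sp.application_version % 2 == 0)
```

With seven savepoints added, only `sp-3` to `sp-7` remain, and `chosen` is `sp-6`, the newest one that passes the stand-in compatibility rule.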
Our beta customers and internal stakeholders helped us to make Aiven for Apache Flink more robust. The native Apache Flink checkpoint and savepoint system has been enhanced to automate the backup of pipeline state and allow graceful shutdown and startup. On top of this, you can now correctly size your service depending on the workload and change it whenever needed. Additionally, if your data needs to be migrated between clouds or regions, your pipeline definition can follow the data, to minimize any latency.
The future of Aiven for Apache Flink
The enhanced developer experience provided by Aiven for Apache Flink is only a first step in a long journey. Aiven is fully committed to optimizing the service further. It only makes sense, as Flink is clearly becoming the de facto standard in the data processing space.
Aiven will continue to enhance the service by adding new integration and connection capabilities, and integrating all upcoming Flink SQL features. Our plans will extend to accommodate new use cases and needs from existing and new customers.
Try the new developer experience of Aiven for Apache Flink directly from the Aiven Console