Dec 2, 2021
Easy real time streaming insights
Your streaming data is a valuable asset. Find out why you should be hoarding and analyzing it, and how to start building that pipeline.
Let’s talk about streaming. Not the “over the top” streaming services that deliver shows and movies to your laptop and TV, but the big services that move data between repositories.
There is a whole group of technologies for dealing with massive amounts of data for real-time, low-latency use cases. This mature ecosystem makes it easy to get started and expand your use case with a low-risk investment.
Get it working, and keep it working while failing fast.
Let’s take a look at a growing industry trend and get some insight into whether building a streaming pipeline is right for your business.
Why are streaming insights important?
What can your data do?
- Real time monitoring/scaling
- Anomaly detection
- Shopping cart abandonment for ecommerce
- Security and automation
Streaming technologies are growing in popularity because they promise to increase business agility. Your business can gain a competitive advantage from making decisions faster and enabling collaboration between business units. Streaming architectures allow your teams to iterate faster, enable more teams with more data, and create collaboration between departments.
Many of these tools are open source, and are becoming more popular in the industry. Both are factors that get you access to a wide and widening talent pool to drive innovation.
Later in this post, we will talk about the technologies; for now, remember that streaming data comes from a “source”, and eventually lands in a database, data lake, or data warehouse after some processing. Each streaming technology has its own ecosystem of integration libraries to support certain use cases. Some are better fits for big data and ETL/ELT, while others are better for machine learning and artificial intelligence.
Many businesses are sitting on top of untapped mountains of data gold. With streaming technologies, businesses can start putting that potential to work in their decision-making. It isn’t magic: given the right investment, it just means that you can put the right data in front of the right people faster.
When should you leverage streaming insights?
Streaming insights can likely help transform your business if any of the following are true:
- You have a wealth of data
- You are using database technologies that support Change Data Capture
- Increasing team collaboration results in increased business agility
- Strict data governance policies add friction to internal processes
- Your dataset is constantly changing and you don’t fully understand what data you have access to, or its potential for adding business value
A business or technology leader should always consider the ROI for any initiative and work to de-risk the investment. Here are some tips for using the Pareto principle (the 80/20 rule) to find the low-hanging fruit and set your team up for success with fail-fast iterations and clear milestones.
First confirm that you have the data, and what it is. Consult with your data team to get an idea of your data retention, data volume, and other potential data sources, i.e. what are you deleting or just not tracking? This gives you a good idea of what is available and how much historical data is stored.
Second, check your tech stack to see which databases your application is using. Databases like MySQL, Postgres, Cassandra, Oracle, MongoDB, and SQL Server have out-of-the-box integrations with tools like Apache Kafka and Kafka Connect. You can leverage Change Data Capture (CDC) to “listen” to database changes and convert them to event streams. This way, legacy applications can become modern real-time streaming applications with hardly any development effort.
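As a rough illustration of how little code CDC can require, here is a minimal sketch that builds a connector registration request for the Kafka Connect REST API. It assumes the Debezium PostgreSQL connector is installed on the Connect cluster (this post does not name a specific connector). The host names, credentials, table names, and the two helper functions are placeholders invented for the example.

```python
import json
import urllib.request


def build_cdc_connector(name, host, database, user, password, tables):
    """Build a Kafka Connect payload for a Debezium PostgreSQL source.

    Debezium is an assumption here; other CDC connectors use different keys.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": host,
            "database.port": "5432",
            "database.user": user,
            "database.password": password,
            "database.dbname": database,
            # Debezium 2.x key; 1.x releases use database.server.name instead.
            "topic.prefix": name,
            "table.include.list": ",".join(tables),
        },
    }


def build_register_request(connect_url, payload):
    """Prepare (but do not send) the POST /connectors request."""
    return urllib.request.Request(
        url=connect_url.rstrip("/") + "/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
payload = build_cdc_connector(
    name="shop",
    host="db.example.internal",
    database="shop",
    user="cdc_user",
    password="change-me",
    tables=["public.orders", "public.customers"],
)
request = build_register_request("http://localhost:8083", payload)
# urllib.request.urlopen(request) would submit it to a running Connect worker.
```

Once a connector like this is running, each insert, update, or delete on the listed tables shows up as an event on a Kafka topic, without changes to the application that writes to the database.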
Third, consider your use cases and take a look at common use cases for each framework. For example, you might be able to find case studies related to similar use cases running machine learning on Spark, or advanced stream processing on Flink. We talk about this more in the next sections, but it will help inform which tools can be used once your data is in Kafka.
How can you implement and leverage streaming insights?
Many technologies provide similar functionality, but they are not all the same. When it comes to scale, the contributor community, or the ecosystem of integrations with other enterprise grade technologies, they are all different.
Through trial by fire, the industry tends to pick a leading technology for particular use cases. This is true for language frameworks, databases, and even primary cloud providers. Kafka has been at the forefront of streaming data technology for nearly a decade. We will discuss how Kafka solves challenges in streaming applications and reduces the friction of innovation.
Apache Kafka originated at LinkedIn and has grown up to become the industry standard for high-throughput, low-latency applications. Open source contributors have built an ecosystem of tooling around it to simplify its operation and usage. It is largely based on a Pub-Sub model, facilitating asynchronous processing of events and decoupled scaling of producers and consumers. This is oversimplified, but the point is that Kafka is the foundation of a streaming architecture. It is the pipe that allows data to flow.
Apache Kafka Connect defines a common API (Application Programming Interface) for technologies to connect to Kafka as producers or consumers. This means you can integrate tools and technologies you are already using with Kafka with minimal effort. Many times, with functionality like CDC, the implementation actually requires zero engineering effort. Aiven has worked with customers to convert legacy systems to event-driven architecture in a matter of days with the help of Kafka Connect and CDC.
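To make the Pub-Sub idea concrete, here is a toy in-memory sketch. It is not Kafka's API and leaves out partitions, brokers, and persistence; the `Topic` class and the group names are invented for illustration. It only shows how an append-only log with a separate read position per consumer group lets producers and consumers work at their own pace.

```python
from collections import defaultdict


class Topic:
    """Toy topic: an append-only log plus one read offset per consumer group."""

    def __init__(self):
        self.log = []
        self.offsets = defaultdict(int)

    def publish(self, event):
        # Producers only append; they never wait for consumers.
        self.log.append(event)

    def poll(self, group, max_events=10):
        # Each group reads from its own offset, independent of other groups.
        start = self.offsets[group]
        batch = self.log[start:start + max_events]
        self.offsets[group] = start + len(batch)
        return batch


orders = Topic()
for order_id in range(5):
    orders.publish({"order_id": order_id})

billing_batch = orders.poll("billing", max_events=2)  # slow consumer: 2 events
analytics_batch = orders.poll("analytics")            # fast consumer: all 5
billing_rest = orders.poll("billing")                 # billing resumes at offset 2
```

Because the log keeps the events and each group tracks its own offset, adding a new consumer later does not require any change on the producer side.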
While Kafka Connect is a simple way to get your data into and out of Kafka, you will likely also want to do more complex processing like joining, windowing, transformations, or filtering. This can be accomplished with low-level Kafka libraries or Kafka Streams clients, but that takes engineering effort. The low-cost alternatives are configuration- and integration-based frameworks. Confluent’s proprietary ksqlDB (formerly KSQL) allows powerful stream processing of data in Kafka. However, other open source technologies provide more powerful processing plus a wider set of integration points.
Flink is one technology that provides complex stream processing functionality through a SQL-like syntax for filtering, joining, and transforming data in Kafka, as well as in other systems like Postgres, Elasticsearch, and OpenSearch. This can be extended to other data integrations like Hadoop and Cassandra. But the real power goes beyond the configuration-based use cases. You get custom jobs and a built-in job manager. Flink can be anything from a data science experimentation lab, to a robust ETL tool suite, to a cutting-edge machine learning platform, all within the tool ecosystem that is right for your use case and business.
Any time we, Aiven’s Solution Architects, talk to customers and prospects, we push them towards more mature cloud native deployments. It’s not always easy. Sometimes this includes difficult conversations. You NEED to have an answer for the following points, even if that answer is “Not right now; put it on the backlog.”
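For readers new to these terms, the sketch below shows filtering and windowing on a small list of events in plain Python. It is a simplified illustration, not how Kafka Streams or any SQL engine implements them. The event fields (`ts`, `key`, `action`) and the function name are assumptions made up for the example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds, predicate=lambda e: True):
    """Count events per (window_start, key), keeping only events that pass the filter."""
    counts = Counter()
    for event in events:
        if not predicate(event):
            continue  # filtering: drop events we do not care about
        # windowing: round the timestamp down to the start of its fixed-size window
        window_start = event["ts"] - (event["ts"] % window_seconds)
        counts[(window_start, event["key"])] += 1
    return dict(counts)


# Hypothetical events; ts is seconds since the start of the stream.
events = [
    {"ts": 0, "key": "cart", "action": "add"},
    {"ts": 30, "key": "cart", "action": "add"},
    {"ts": 45, "key": "cart", "action": "view"},
    {"ts": 61, "key": "cart", "action": "add"},
    {"ts": 75, "key": "checkout", "action": "add"},
]
result = tumbling_window_counts(events, 60, lambda e: e["action"] == "add")
```

A stream processor does the same kind of work continuously on unbounded data, and also has to deal with late or out-of-order events, which this sketch ignores.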
- What are your growth projections and how does that scale for performance? How does that scale financially? Big data is expensive.
- What are your performance and latency requirements? “High throughput,” “low latency,” and “low cost” … pick two, or get creative. Make sure you have well defined SLAs and budgets.
- What is your plan for security? The easiest time to get security and best practices in place is before rolling it out. Retrofitting an enterprise deployment of Kafka with a microservice architecture is not an easy or inexpensive task. Security should be a top priority.
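For the growth and cost questions above, a back-of-the-envelope calculation is often enough to start the conversation. The sketch below is only that: it ignores compression, indexes, and headroom, and every input value and function name is a made-up assumption that you should replace with your own measurements.

```python
def estimate_storage_gb(events_per_second, avg_event_bytes, retention_days, replication_factor):
    """Rough retained-data estimate: rate x size x retention x replicas, in GB."""
    retention_seconds = retention_days * 24 * 60 * 60
    raw_bytes = events_per_second * avg_event_bytes * retention_seconds
    return raw_bytes * replication_factor / 1e9


def project_rate(initial_rate, monthly_growth, months):
    """Compound a starting event rate by a fixed monthly growth fraction."""
    return initial_rate * (1 + monthly_growth) ** months


# Hypothetical inputs: 1,000 events/s of 1,000 bytes, 7 days retention, 3 replicas.
storage_today = estimate_storage_gb(1_000, 1_000, 7, 3)

# The same estimate after 12 months of an assumed 10% monthly growth.
rate_next_year = project_rate(1_000, 0.10, 12)
storage_next_year = estimate_storage_gb(rate_next_year, 1_000, 7, 3)
```

Running numbers like these against your SLAs and budget makes it easier to see which of “high throughput,” “low latency,” and “low cost” has to give.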
What’s next?
Make sure that you set your team up for success using the tips above to find the low hanging fruit and start iterating quickly. It is easy to set up integrations with your existing system and start exploring your data. The ecosystem of tools allows for quick time to value with minimal upfront investment, allowing you to set clear milestones and fail fast.
This is the first step down a path with many forks; Aiven has taken this journey countless times, watching companies leverage open source streaming technologies to drive growth.
Wrapping up
Your next step could be to check out Aiven for Apache Kafka and Aiven for Apache Flink.
If you're not using Aiven services yet, go ahead and sign up now for your free trial at https://console.aiven.io/signup!
In the meantime, make sure you follow our changelog and blog RSS feeds or our LinkedIn and Twitter accounts to stay up-to-date with product and feature-related news.