How to sessionize events with Apache Flink®
Learn how to use Apache Flink® to gather individual events into sessions: collections of events from a single user over a given time span.
Why sessions?
Businesses no longer need to gather customer feedback by stopping shoppers outside a retail store, but they still seek answers to familiar questions:
- How did this user find us?
- What actions did they take?
- Which actions lead to revenue?
- How do different channels affect purchasing patterns?
The challenge with online business is that you can’t directly ask customers these questions. Instead, you have to observe the answers using data. To achieve this, businesses implement user tracking technologies like Snowplow to monitor user clicks and track their online activity. However, analyzing individual events in isolation has little value: a single 'checkout click' carries no context on its own. To really understand user behavior, you need to enrich the 'checkout click' event with details such as:
- What was the page from which the customer clicked the checkout button?
- How many items were in their cart?
- How much time did they spend on the website prior to checkout?
We can answer these questions by defining a user session, and analyzing the data in that session.
What are sessions?
A session is a collection of events from a single user with a start and end time. Sessions are meaningful entities useful for analytics or real-time applications such as monitoring or fraud prevention. Analyzing sessions provides valuable metrics that will help you understand your user journey, improve your product and ultimately drive growth.
When aggregating raw event data into sessions, it's typical to define a session window: the maximum allowed gap between consecutive events for them to count as part of the same session. For instance, if you define a 30-minute session window, then 30 minutes of inactivity ends the session and any later actions belong to a new session: clicks at 10:00 and 10:15 fall into one session, while a click at 11:00 starts the next one. Because of this, the length of a session is never fixed. A suitable session window duration depends on your product, but the window is typically highly correlated with the session duration. For example, a mobile payment app session is generally very short, while for a cloud service provider a session can last a full working day. Many event-tracking products on the market use 30 minutes as a default session window for web-based products. Understanding your product and users helps you define a suitable session window duration.
How to sessionize?
Deciding where to sessionize events has implications for your data pipelines. The most common approach is to sessionize in post-processing, once the data has landed in the storage system. The reason is that, to create complete sessions covering all parts of your business, you first have to combine data from different sources like ads, websites, and mobile apps. Another approach is to use the built-in tools from your event-tracking provider. This is common for web-based products and for companies that don't have many data resources or are still building out their data setup. The last approach is to sessionize mid-stream, using a stream processing framework that transforms data as it arrives. This is common in engineering-driven companies that want flexibility and room to grow.
Modern technologies now let you organize events at different stages of the data journey, which helps with many different situations. Apache Kafka® is a good example of this kind of technology. It provides a standard way for your organization to work with data, promoting consistency and allowing for real-time event processing. In the Kafka world, there are tools like Apache Flink® that can handle events in real-time. Combining Kafka with Flink is a great way to capture, process (like creating sessions), and put your data into the storage system or other applications quickly.
Limits of sessionizing at the head or tail of the pipeline
Sessionizing events on the client side has a significant drawback: the analyst manipulating the data has minimal control over the sessionizing logic, such as defining and experimenting with various session window durations. Client-side sessionization also fails to capture user journeys comprehensively. While suitable for simple web-based online shops, it falls short for products with multiple touchpoints (e.g., APIs, website, mobile, desktop apps). On the positive side, this approach is easy to set up, requiring a simple switch in your web tracking provider.
In the data warehouse, capturing sessions typically involves SQL scripts with window functions processing vast amounts of data. The downside of this method is its resource-intensive nature: because a session has no fixed duration, the query has to scan a wide range of history, which makes it expensive and slow. Furthermore, it cannot react to sessions in real time. The upside is that the analyst retains complete control over the transformations and can experiment with different parameters. Lastly, in complex data landscapes, the data warehouse may be the only place where the organization can access data that holistically covers the business.
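For illustration, here is a minimal sketch of how such a warehouse query often looks: a new session starts whenever the gap to the user's previous event exceeds the window, and a running sum over those markers assigns session identifiers. It assumes a hypothetical `events` table with `user_id` and `event_timestamp` columns, a 30-minute window, and standard SQL interval syntax, which varies between warehouses:

```sql
-- Sketch: gaps-and-islands sessionization in a data warehouse.
-- Assumes a hypothetical `events` table; interval syntax differs per warehouse dialect.
WITH flagged AS (
  SELECT
    user_id,
    event_timestamp,
    -- 1 when this event starts a new session (first event, or gap > 30 minutes)
    CASE
      WHEN LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp) IS NULL
        OR event_timestamp > LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp)
                             + INTERVAL '30' MINUTE
      THEN 1 ELSE 0
    END AS is_new_session
  FROM events
)
SELECT
  user_id,
  event_timestamp,
  -- Running sum of the markers yields a per-user session identifier
  SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS session_id
FROM flagged;
```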
Solution: real-time transformation
A better approach for sessionizing events is to run a stream processing framework on top of your event stream, for example using Apache Flink to process and transform events stored in Apache Kafka. Processing events in real time with a stream processing framework like Flink generally offers the following benefits over the alternatives described above:
More performant and reliable
Stream-processing frameworks like Flink are purpose-built for unbounded data streams. Flink only needs to keep state for the duration of the session window, and it exposes key parameters that control event-time handling and fault tolerance, such as watermarks and checkpointing. This results in higher performance compared to a post-processing approach, which has no equivalent controls and must repeatedly scan historical data. Moreover, Flink is designed for distributed processing, enabling it to scale to workloads of all sizes.
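To make this concrete, the sketch below shows the kind of parameter involved: a watermark declaration that tolerates events arriving up to 5 seconds out of order, which bounds how long Flink must keep a session window open. The table name, the datagen connector and the 5-second delay are illustrative choices, not requirements:

```sql
-- Sketch: a hypothetical source table whose watermark lags event time by
-- 5 seconds, so events up to 5 seconds late still join the correct session.
CREATE TABLE clicks (
  user_id INT,
  event_timestamp TIMESTAMP(3),
  WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '5' SECOND
) WITH (
  'connector' = 'datagen'
);
```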
Transformation in real-time
Flink connects directly to your Kafka instance, enabling real-time data processing. Processed data can be written to a new topic and consumed by your data warehouse through a sink connector, or directly by a downstream application (e.g., a machine learning model).
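As a sketch, a Kafka-backed table for the sessionized output might look like the following; the topic name, broker address, and format are placeholders for your own environment:

```sql
-- Hypothetical Kafka-backed table: 'topic', bootstrap servers and 'format'
-- are placeholders, not values from this article's setup.
CREATE TABLE sessions_out (
  user_id INT,
  session_start TIMESTAMP(3),
  session_end TIMESTAMP(3),
  request_count BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'user-sessions',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);
```

An INSERT INTO statement targeting this table publishes the sessionized results to the user-sessions topic, where any downstream consumer can pick them up.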
Quick iteration with a rich SQL environment
Flink provides an SQL API, allowing you to write SQL queries that specify how to manipulate the data. This is advantageous for data analysts who already possess these skills, facilitating technology adoption. Additionally, Flink offers a comprehensive set of operators and functions that prove useful when dealing with sessions, including event-time-based joins, window-based computations, and stateful processing.
Sessionizing with Apache Flink
Here is an example of how to sessionize an event stream using Aiven for Apache Flink.
Create a landing table. To build sessions, your events need a column that uniquely identifies the user and a timestamp for when the event occurred. In this example we use the datagen connector to create mock data for testing the logic:
```sql
-- Generate 10 mock events per second, with user_id between 0 and 100
CREATE TABLE landing_table (
  user_id INT,
  event_timestamp TIMESTAMP(3),
  WATERMARK FOR event_timestamp AS event_timestamp
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '10',
  'fields.user_id.min' = '0',
  'fields.user_id.max' = '100'
);
```
Create a sink table using the blackhole connector, which simply discards every record it receives and is handy for testing.
```sql
CREATE TABLE sink_table (
  user_id INT,
  session_start TIMESTAMP(3),
  session_end TIMESTAMP(3),
  request_count BIGINT
) WITH (
  'connector' = 'blackhole'
);
```
Create a job that sessionizes the events. For the sake of the example, a 10-second session window is used here.
```sql
-- One output row per user per session: session bounds plus the number of events
INSERT INTO sink_table
SELECT
  user_id,
  SESSION_START(event_timestamp, INTERVAL '10' SECOND) AS session_start,
  SESSION_ROWTIME(event_timestamp, INTERVAL '10' SECOND) AS session_end,
  COUNT(*) AS request_count
FROM landing_table
GROUP BY
  user_id,
  SESSION(event_timestamp, INTERVAL '10' SECOND);
```
The next step
The next step is passing this information to your application. For example, you might want to calculate the average session duration over a sliding window to identify potential degradation of your service in real time.
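As a sketch of such a query, assume the sessionized output lands in a table called `user_sessions` (with the same columns as the sink table above) that declares a watermark on `session_end`. A hopping window can then report the average session duration over the last ten minutes, updated every minute:

```sql
-- Sketch: average session duration (in seconds) over a 10-minute window that
-- slides every minute. Assumes a hypothetical `user_sessions` table with a
-- watermark on `session_end`, so it can serve as the event-time attribute.
SELECT
  HOP_START(session_end, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE) AS window_start,
  HOP_END(session_end, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE) AS window_end,
  AVG(TIMESTAMPDIFF(SECOND, session_start, session_end)) AS avg_session_duration_s
FROM user_sessions
GROUP BY HOP(session_end, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE);
```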
Flink writes the output of your SQL job to a new topic. Applications can either read the data directly from this topic or use one of the ready-made connectors for these use cases. A common approach is to store the output in your data lake or data warehouse for long-term storage.
For a machine learning application, you might first transform the incoming data using Flink, then use the HTTP sink connector to send requests to an endpoint where your model is hosted, allowing you to serve predictions in real time for every incoming record. For larger machine learning workloads that don’t require real-time data, you might store the data in object storage using one of the many connectors (e.g. the Google Cloud Storage sink) and then run predictions in batches, triggered outside of Kafka.
Conclusion
Understanding how your product is used is essential for your business to grow. Tracked events by themselves aren’t meaningful until you aggregate them at a level that makes sense to the business; that's why we aggregate individual events into sessions. A session is a collection of events from a single user, with a start and end time. Analyzing sessions gives you valuable metrics that help you understand your user journeys, improve your product and ultimately drive growth.
Event sessionization can happen on the client-side device itself, in the data warehouse, or in real-time using stream-processing tools like Apache Flink. Opting for real-time sessionization offers notable advantages compared to the other methods: enhanced performance and reliability, real-time transformation of data, and a feature-rich SQL environment for swift iteration.
To get started with real-time event sessionization, explore Apache Kafka and Apache Flink with a free trial of Aiven.