Time series databases are one of the more recent specialized database models to emerge since the introduction of the relational database management system (RDBMS), and they have been steadily growing in popularity since 2015. According to DB-Engines, they have outpaced all other database models by a wide margin over the past two years.
As the name implies, a time series database (TSDB) makes it possible to efficiently and continuously add, process, and track massive quantities of real-time data with lightning speed and precision. While other database models have been used for these kinds of workloads in the past, TSDBs use algorithms and an architecture designed specifically for the unique needs of those workloads.
In this piece, we’ll take a deeper look at time series databases, including the unique needs of the workloads they’re built for, their benefits, common use cases, and the most popular TSDBs in use today.
Key attributes of time series analysis
1. Trend
Trend refers to the general direction in which a set of data is moving over time. A trend can be upward (an uptrend) or downward (a downtrend). For example, graduation rates and median household income are often tracked for their long-term trends.
2. Seasonality
Seasonality refers to predictable patterns that repeat at regular intervals, typically within a fixed period such as a year. One example is the number of vacationers in a specific area, which tends to peak during specific parts of the year (e.g., the summer for a beach town and the winter for a ski town). Another example of seasonality is television viewership, which can spike during certain sports seasons, such as baseball or football.
3. Cycles
Cycles occur when a series rises and falls in a pattern that can’t be measured seasonally. While seasonal patterns follow the calendar, cycles do not. For example, a cycle may describe temperature fluctuations over a decade or a multi-year business cycle.
4. Irregular fluctuation
Irregular fluctuation happens when data anomalies occur due to random events. In other words, the data do not follow a particular pattern. For example, irregular fluctuations occurred at the start of the COVID-19 pandemic, when safety concerns led to food and medical supply shortages.
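To make trend and seasonality more concrete, here is a minimal Python sketch that separates the two in a small series. The visitor numbers, function names, and the choice of a simple trailing moving average are all assumptions made for illustration; real analyses typically use more robust decomposition methods.

```python
from statistics import mean

def moving_average(values, window):
    """Trailing moving average; averaging over a full season smooths it out and exposes the trend."""
    return [mean(values[i - window + 1 : i + 1]) for i in range(window - 1, len(values))]

def seasonal_profile(values, period):
    """Average value at each position in the season (e.g. each quarter of the year)."""
    return [mean(values[i::period]) for i in range(period)]

# Hypothetical quarterly visitor counts (in thousands) for a beach town over three years:
# a summer peak in Q3 every year, on top of steady growth.
visitors = [10, 20, 50, 15,
            14, 24, 54, 19,
            18, 28, 58, 23]

trend = moving_average(visitors, window=4)       # one full year per window
profile = seasonal_profile(visitors, period=4)   # typical level per quarter

print(trend)    # rises steadily: the upward trend
print(profile)  # third entry is the largest: the summer peak
```

The moving average climbs from window to window even though individual quarters jump around, while the per-quarter profile shows the peak that repeats every year.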
A note on stationary time series
A time series can be stationary or non-stationary. It is considered stationary when its statistical properties, such as its mean, variance, and covariance, stay constant over time, which means it has no trend or seasonal effects. It is non-stationary when those properties fluctuate over time, as they do in series that include random walks, deterministic trends, cycles, etc.
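As a rough illustration of the difference, the Python sketch below compares the mean of the first and second half of a series before and after differencing it (replacing each value with its change from the previous one). The series is made up, and comparing half-means is only an informal heuristic assumed here for clarity; formal checks such as the augmented Dickey-Fuller test are what you would use in practice.

```python
from statistics import mean

def difference(values):
    """First difference: the change between consecutive observations."""
    return [b - a for a, b in zip(values, values[1:])]

def half_means(values):
    """Mean of the first and second half; a large gap hints that the mean drifts over time."""
    mid = len(values) // 2
    return mean(values[:mid]), mean(values[mid:])

# Hypothetical series: a deterministic upward trend plus a small alternating wobble.
series = [2 * t + (1 if t % 2 else -1) for t in range(20)]

print(half_means(series))              # the two halves have very different means
print(half_means(difference(series)))  # after differencing, the means are close
```

Differencing removes the trend here, which is why it is a common first step before applying models that assume a stationary series.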
Now that we know the key attributes of time series, let’s flesh out exactly what we mean when we talk about TSDBs: what they are and what has encouraged their emergence.
What is a Time Series Database?
A time series database stores data as pairs of time(s) and value(s). Storing data this way makes it easy to analyze a time series, that is, a sequence of points recorded in order over time. A TSDB can handle concurrent series, measuring many different variables or metrics in parallel.
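The data model can be pictured with a toy in-memory store in Python. This is a sketch only: the class name, method names, and metric names are hypothetical, and a real TSDB adds persistence, indexing, compression, and much more.

```python
from bisect import insort
from collections import defaultdict

class TinyTSDB:
    """Toy in-memory store: each named series is a time-ordered list of (timestamp, value) pairs."""

    def __init__(self):
        self._series = defaultdict(list)

    def write(self, name, timestamp, value):
        # insort keeps each series sorted by timestamp, even if points arrive out of order.
        insort(self._series[name], (timestamp, value))

    def read(self, name):
        # Return a copy so callers cannot disturb the stored ordering.
        return list(self._series[name])

db = TinyTSDB()
db.write("cpu_percent", 1700000000, 41.5)
db.write("cpu_percent", 1700000020, 47.0)
db.write("cpu_percent", 1700000010, 43.2)  # late arrival, still stored in time order
db.write("mem_percent", 1700000000, 63.0)  # a second series tracked in parallel

print(db.read("cpu_percent"))
```

Each metric lives in its own series, and every point is just a timestamp paired with a value, which is what makes time-based queries cheap.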
Early time series databases were primarily used for processing volatile financial data and streamlining securities trading. However, the world’s changed a lot since they were first introduced and many new use cases have emerged as technology has continued to evolve.
For example, the internet of things (IoT) and its sensors, which constantly collect and stream data, underlie a number of modern workloads such as powering industrial applications, predicting sales demand, analyzing temperature readings, and providing medical information from wearable devices. As you can imagine, the volume of data produced is staggering...
Well, it can be staggering for more traditional databases whose requirements and design decisions restrict their suitability for such workloads. Luckily, there are time series databases, and they are getting better and better at dealing with the ever-growing demands of this data type.
What are the benefits of a time series database?
There’s a reason more and more developers and organizations are using time series databases. They deliver a number of benefits, which we’ll now briefly explore.
Benefit: More accurate and meaningful time series measurement
The most obvious benefit is that a time series database makes it easy to measure how datasets change over time. With a time series database in place, you can view past and present data side by side, along with forecasts built from them, for reporting that is more accurate and meaningful.
Benefit: Resource-efficient data storage
By its very nature, time series data can require massive amounts of storage, which can be difficult to manage and very expensive. Time series databases include tooling that makes it possible to aggregate data into predetermined time periods (downsampling), eliminate data streams that are no longer needed (retention), and apply compression algorithms that optimize storage.
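These three storage techniques can be sketched in a few lines of Python. The function names and sample readings are invented for illustration, and the delta encoding shown is a deliberately simple stand-in for the more elaborate compression schemes production TSDBs use.

```python
from collections import defaultdict
from statistics import mean

def downsample(points, bucket_seconds):
    """Aggregate (timestamp, value) points into fixed-size buckets, keeping the mean of each."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)  # key = start of the bucket
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]

def apply_retention(points, now, max_age_seconds):
    """Keep only points that are at most max_age_seconds old."""
    return [(ts, v) for ts, v in points if now - ts <= max_age_seconds]

def delta_encode(timestamps):
    """Store the first timestamp, then only the gaps; regular intervals become small repeated numbers."""
    if not timestamps:
        return []
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

# Hypothetical sensor readings taken every 20 seconds.
raw = [(960, 20.0), (980, 22.0), (1000, 24.0), (1020, 30.0), (1040, 32.0)]

per_minute = downsample(raw, bucket_seconds=60)
recent = apply_retention(raw, now=1040, max_age_seconds=30)
deltas = delta_encode([ts for ts, _ in raw])

print(per_minute)  # five raw points reduced to two one-minute averages
print(recent)      # only the points from the last 30 seconds
print(deltas)      # first timestamp followed by the gaps between points
```

Downsampling trades detail for space, retention bounds how far back the data goes, and encoding gaps instead of full timestamps exploits the regular spacing that is typical of time series.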
Benefit: Lightning-fast data queries
A TSDB can also make it easy to query and retrieve data based on specific periods. For example, imagine someone who doesn’t remember the title of a book they recently read but knows they read it three months ago. A time series database can help them figure out what the book was without having to use a bunch of wildcard searches. Because the data is organized by time, you can quickly find information based on timeframe, enabling rapid retrieval.
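The book example can be sketched in Python as a range lookup over data kept in time order. The reading log, titles, and day numbers are made up, and binary search over a sorted list is an assumption chosen for simplicity rather than a description of how any particular TSDB indexes its data.

```python
from bisect import bisect_left, bisect_right

# Hypothetical reading log, kept sorted by day number (days since the log began).
days = [3, 17, 42, 88, 95, 130, 171]
titles = ["Dune", "Emma", "Ulysses", "Beloved", "Hamlet", "Walden", "Dracula"]

def read_between(start_day, end_day):
    """Titles finished between start_day and end_day, inclusive."""
    lo = bisect_left(days, start_day)   # index of the first entry at or after start_day
    hi = bisect_right(days, end_day)    # one past the last entry at or before end_day
    return titles[lo:hi]

# "About three months ago", if today is day 180: look at days 80 to 100.
print(read_between(80, 100))  # ['Beloved', 'Hamlet']
```

Because the entries are already ordered by time, the lookup jumps straight to the relevant window instead of scanning every record or matching on text.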
Common time series database use cases
1. Accessing IoT data
Most IoT deployments, like connected water, energy, and temperature meters, require constant data collection and reporting at regular intervals. A TSDB stores these readings as timestamped data points, making it possible to identify seasonal patterns, average usage, and inefficiencies. For example, a pH meter connected to a TSDB might alert a technician tasked with maintaining a specific pH level that a certain vat of water is becoming too acidic. IoT endpoints also collect massive amounts of data, requiring highly scalable time series databases.
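The pH scenario might look something like the following Python sketch, which raises an alert when the rolling average of recent readings drops below a limit. The threshold of 6.5, the three-reading window, the function name, and the readings themselves are all assumptions for the sake of the example.

```python
from collections import deque
from statistics import mean

def ph_alerts(readings, low=6.5, window=3):
    """Yield (timestamp, rolling_mean) whenever the rolling mean pH drops below `low`."""
    recent = deque(maxlen=window)  # holds only the latest `window` readings
    for ts, ph in readings:
        recent.append(ph)
        if len(recent) == window:  # wait until a full window is available
            avg = mean(recent)
            if avg < low:
                yield ts, round(avg, 2)

# Hypothetical (seconds, pH) readings from a vat that is slowly turning acidic.
readings = [(0, 7.0), (60, 6.9), (120, 6.7), (180, 6.4), (240, 6.2), (300, 6.0)]

alerts = list(ph_alerts(readings))
print(alerts)  # alerts start once the rolling average falls below 6.5
```

Averaging over a window rather than reacting to a single reading is a design choice here: it avoids paging the technician over one noisy measurement.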
2. Monitoring web services, applications and infrastructure
Companies can also use time series databases to measure the performance of their applications and web properties. For example, the open source monitoring system Prometheus is built on a time series database that enables developers to keep tabs on performance trends over time. This makes it easy to detect when problems are occurring, which in turn allows them to plan maintenance and rapidly respond to incidents to sustain an optimal user experience.
Some web and mobile applications store in-app events (such as a button click, playing a video, or sharing some content) in a TSDB. These events allow them to map a user’s journey, identify frustrations or performance bottlenecks, and streamline more complex processes.
3. Understanding financial trends
Using time series data to accurately predict financial trends is very difficult. However, a TSDB can provide a wealth of contextual data to help analysts. Take the stock market as an example: a sudden increase in airline stock may coincide with holiday travel, or an executive leadership purge may spook investors, causing a stock to temporarily tumble. Time series databases make it easy to cross-reference data, providing a richer, clearer picture.
4. Processing self-driving car data
Self-driving cars typically collect about 4,000 GB of data per day, which is beyond the scope of what a typical relational database can process. Time series databases enable faster data ingestion and queries, as well as stronger data compression. As a result, they are ideal for processing massive volumes of real-time data that can be used to improve the safety of self-driving cars.
5. Sales forecasting
Retail stores are continuously challenged to predict future sales in order to accurately stock their shelves with products. Thanks to time series databases, retailers can use statistical models in conjunction with historical data and cross-reference it with consumer behavior trends to predict future patterns and make informed decisions about which products to keep in stock and when.
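As a very simple illustration of forecasting from historical data, the Python sketch below predicts each future quarter as the same quarter one year earlier plus the average year-over-year growth. The sales figures and function name are hypothetical, and this seasonal-naive-plus-growth approach is assumed here only because it is easy to follow; retailers generally rely on richer statistical models.

```python
from statistics import mean

def seasonal_naive_forecast(history, period, horizon):
    """Forecast each step as the value one season earlier plus the average season-over-season growth."""
    if len(history) < 2 * period:
        raise ValueError("need at least two full seasons of history")
    # Average change between each point and the same point one season before.
    growth = mean(history[i] - history[i - period] for i in range(period, len(history)))
    forecast = []
    for _ in range(horizon):
        # Look one season back; for long horizons this may itself be a forecast value.
        source = history + forecast
        forecast.append(source[len(source) - period] + growth)
    return forecast

# Hypothetical quarterly bicycle sales for two years, peaking in spring and summer.
sales = [100, 180, 220, 120,
         110, 190, 230, 130]

print(seasonal_naive_forecast(sales, period=4, horizon=4))  # [120, 200, 240, 140]
```

The forecast keeps the seasonal shape of the previous year and shifts it by the observed growth, which is often a reasonable baseline to compare more sophisticated models against.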
For instance, retailers are now using forecasting to plan ahead and restock bicycles, which are in short supply due to the pandemic. Retailers are using data to predict when new products will become available again, what the demand will be like, and what alternative transportation options consumers are buying in lieu of bicycles (e.g., trikes and rollerblades).
Top time series databases in use
Now, let’s take a quick look at some of the most popular time series databases in use today.
Created by InfluxData and licensed under an open core model, InfluxDB is synonymous with time series. It features an industry-standard protocol for writing data and is optimized for fast retrieval of time series data points. However, its licensing poses problems for those who want to use it in production at scale because the open source version does not support clustering.
In fact, our underlying metrics and monitoring infrastructure was originally built on top of InfluxDB, but we started to hit its limitations when our node count, and the associated production of data points, began to grow exponentially. That is why we moved to M3 (read more below).
Prometheus is an open source, metrics-based monitoring system, originally built at SoundCloud, that uses a single-node time series database to help developers keep track of their applications’ health. It uses a multi-dimensional data model and its own query language, PromQL, for slicing and dicing time series data.
Although it is completely open source and capable of ingesting high volumes of data at tremendous rates, Prometheus is only available in a single-node setup, so you will eventually run into scalability issues similar to those of InfluxDB, which is why M3 was developed (more on that below).
TimescaleDB is an extension for PostgreSQL. Its main appeal is that it builds on an industry-standard RDBMS and uses SQL, a query language most developers are already familiar with.
When it was first introduced, TimescaleDB was highly available but not distributable because PostgreSQL’s setup is primary/secondary: you could scale it vertically (i.e., increase the power of the machines) but not horizontally. However, TimescaleDB 2.0 introduced the distributed hypertable, which creatively overcomes PostgreSQL’s architecture and takes advantage of it at the same time.
The brainchild of Uber’s engineering team, M3 was developed to track the locations of all drivers in real time. In fact, as of 2018, M3 was handling 8.5 billion points per second. Interestingly, Uber originally developed it as a distributable time series datastore for Prometheus. Today, it’s actually made up of three components, so it’s better to think of it as a platform than just a database.
It’s an open-source database, aggregation service, and query engine all rolled into one platform, and it is compatible with industry-standard data ingestion protocols including Prometheus, InfluxDB, and Graphite. This made it easy for us to transition from InfluxDB and implement M3 as the backbone of our metrics and monitoring infrastructure, which underlies the fully automated, self-healing functionality that is the bread and butter of the Aiven data cloud.
As you can see, the use cases for time series databases are growing. But we’re just getting started when it comes to realizing their full promise. As more capable solutions such as M3 become available, we will see a virtuous circle form where new use cases are unlocked.
Even better, managed services will help fuel the expansion of their use cases because users will be able to focus on the problem space instead of standing up and managing the infrastructure. To learn more about the latest evolution in time series databases, make sure you read this article.