Choosing a time series database (TSDB) has a lot in common with choosing a generic database, but with a few crucial differences. These differences are mainly due to the types of applications that TSDBs are typically used for and the kinds of data they store. IoT applications, for example, are a very typical use case for a TSDB. They use data in a very different way from what vanilla databases were designed for: the volumes are typically much larger and the data structure simpler. The data is also processed differently downstream.
This post will introduce you to the technical considerations involved in choosing the right time series database:
- The impact of data type
- Scaling and clustering
- Reading speed requirements
- System footprint size
- Logging and monitoring
(Tip: If you’re new to the world of time series databases, you might like to start by reading our article An introduction to time series databases.)
1. The impact of data type
Even though we’re only discussing time series databases here, it doesn’t mean that all data will be similar. For example, data sent for analysis from IoT devices has to be processed differently from financial information used for predictions.
M3 is a great fit for IoT or metrics data at large scale. Its current implementation doesn’t allow backfilling large amounts of historical data, though: out-of-order writes are limited to a single compressed time series window. If you need to backfill old data at larger scale, for example to maintain continuity, TimescaleDB or InfluxDB would be a better choice.
For data with high cardinality, you could do worse than go with TimescaleDB. However, remember that M3 was designed specifically for querying high-cardinality data.
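To make "high cardinality" concrete: in a time series database, cardinality is usually counted as the number of distinct series, where a series is a metric name plus one unique combination of tag values. The following is a minimal Python sketch of that count. The point layout, tag names and sample data are hypothetical and not taken from any of the products mentioned here.

```python
def series_cardinality(points):
    """Count distinct series, where a series is a metric name plus its tag set."""
    series = {
        (p["metric"], frozenset(p["tags"].items()))
        for p in points
    }
    return len(series)


# Hypothetical sample: one metric, tagged by device and region.
points = [
    {"metric": "temperature", "tags": {"device": "d1", "region": "eu"}, "value": 21.5},
    {"metric": "temperature", "tags": {"device": "d2", "region": "eu"}, "value": 19.0},
    {"metric": "temperature", "tags": {"device": "d1", "region": "eu"}, "value": 21.7},
    {"metric": "temperature", "tags": {"device": "d1", "region": "us"}, "value": 25.1},
]

print(series_cardinality(points))
```

Every new tag value multiplies the number of possible combinations, which is why per-device or per-user tags drive cardinality up so quickly.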
Time precision is another factor you may need to consider. It refers to the smallest delay that you can configure in a given time unit. For example, if your delay is set to 100ps, you can specify 10ps as the precision applied when measuring that delay; in other words, precision determines how many decimal places to use relative to the specified time unit. Your best bet is to select a system where you can tune the precision to match changing needs. Here again we’re going to point to M3.
Time series databases can get absolutely huge, and amounts of data impact both storage size and performance speeds. For one thing, a great number of devices and systems may be sending data to the database, and the number will fluctuate over time. They also send their data very frequently, and that frequency may also vary, by time or by device. The storage system you select must be able to handle great volumes and frequent entries without the danger of dropping data points.
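A quick back-of-the-envelope calculation shows how fast the volume adds up. The sketch below is in Python and rests on assumptions of ours: every device reports at one fixed interval, and each raw point takes 16 bytes (an 8-byte timestamp plus an 8-byte float). Real storage formats and real fleets will differ.

```python
def points_per_day(devices, interval_seconds):
    """Data points written per day if every device reports at a fixed interval."""
    seconds_per_day = 24 * 60 * 60
    return devices * (seconds_per_day // interval_seconds)


def raw_bytes_per_day(devices, interval_seconds, bytes_per_point=16):
    """Uncompressed size per day; 16 bytes assumes an 8-byte timestamp plus an 8-byte float."""
    return points_per_day(devices, interval_seconds) * bytes_per_point


# Hypothetical fleet: 500 devices reporting every 10 seconds.
print(points_per_day(500, 10))
print(raw_bytes_per_day(500, 10))
```

Under these assumptions, even a modest fleet writes millions of points per day before tags, indexes or replication are taken into account.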
The best way to save on space is to compress or downsample your data. Let’s take a brief look at how various products do it.
Downsampling reduces the resolution of stored data, which makes it a lossy technique. Common practice is to have separate storage bins with different downsampling ratios. For example, you might store your incoming data at full resolution for seven days, then downsample it by 50% for 30 days’ storage. Finally, you’d downsample it by a further 50% for long-term storage.
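As a rough illustration of what downsampling by 50% means, here is a minimal Python sketch that averages points into fixed-width time buckets. Averaging is only one possible aggregation (others are min, max, last and so on), and the function name and sample data are hypothetical; none of the databases discussed here necessarily implement it this way.

```python
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Returns one (bucket_start, mean_value) pair per non-empty bucket, in time order.
    """
    buckets = {}
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical raw data: one reading every 10 seconds.
raw = [(0, 1.0), (10, 3.0), (20, 5.0), (30, 7.0)]

# Halve the resolution: 20-second buckets instead of 10-second points.
print(downsample(raw, 20))
```

With this sample the four raw points become two, `[(0, 2.0), (20, 6.0)]`. The original readings can’t be recovered from the averages, which is the trade-off you accept in exchange for the saved space.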
Most databases, including Graphite, InfluxDB, and Prometheus, offer a basic inbuilt aggregator service and a configuration utility. M3 offers the M3 Aggregator that leverages etcd to provide cost-effective and reliable stream-based downsampling. Aiven not only offers M3 as a Service, but Aggregator as well.
Compression algorithms encode data by using fewer bits than the original representation. Unlike downsampling, algorithms can achieve lossless compression. The compression rate then depends on how much redundancy exists in the data being encoded.
InfluxDB uses half a dozen algorithms depending on the type of data being processed: timestamps, floats, integers, Booleans, or strings. Its handling of floats is based on Facebook’s Gorilla database. Gorilla’s influence also shows in its delta-of-delta handling of timestamps, which Prometheus shares. In fact, these two databases have very similar implementations, but InfluxDB has a larger selection of algorithms. M3 uses two algorithms: the highly compressing M3TSZ and the flexible Protocol Buffers (the latter is Google’s language- and platform-neutral mechanism for serializing data).
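The idea behind delta-of-delta encoding can be sketched in a few lines of Python. This is a simplified illustration, not the actual code of Gorilla, InfluxDB or Prometheus: it shows only the arithmetic and leaves out the variable-length bit packing that real implementations apply on top. The function names are our own.

```python
def delta_of_delta_encode(timestamps):
    """Encode timestamps as: first value, first delta, then deltas of deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + dods


def delta_of_delta_decode(encoded):
    """Reverse the encoding by accumulating deltas of deltas back into timestamps."""
    if len(encoded) < 2:
        return list(encoded)
    timestamps = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Hypothetical timestamps: a 10-second interval with one late sample.
ts = [1000, 1010, 1020, 1030, 1041, 1051]
encoded = delta_of_delta_encode(ts)
print(encoded)
print(delta_of_delta_decode(encoded) == ts)
```

For this sample the encoded form is `[1000, 10, 0, 0, 1, -1]`. When devices report at regular intervals, most values after the first two are zero or close to it, so they can be stored in very few bits, and the round trip is lossless.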
To hedge your bets most effectively, choose a platform with flexible packing options, both lossless and lossy. This way you can switch as needed. If you have the budget, don’t get locked into a platform with a single compression option only to find out the hard way when it’s too late.
3. Scaling and clustering
Let’s say you need a cluster, either because you have high performance requirements or because you need high availability. In an IoT use case, you might start with a database receiving data from a few hundred devices, which adds up to a lot of data points. What’s more, if your business booms, soon you may need a database able to handle data from thousands of devices. This is where the scale and re-scalability of operations become very relevant.
Vertical scaling tends not to be an issue for most modern time series databases and is mostly limited by hardware resources. As for horizontal scaling, it’s worth noting that M3 is eminently scalable in all its components.
InfluxDB’s open source version doesn’t support clustering, but its proprietary version does, and it comes with a great basic cluster management utility. Its strengths are that it is solid, dependable and versatile, and it offers good options for hands-on management.
The strength of TimescaleDB is that it supports clustering out of the box via PostgreSQL’s primary/secondary setup. Although clustered, it does pose some limitations for those who want to scale their clusters horizontally. Thankfully, their 2.0 release aims to overcome this with the distributed hypertable (https://blog.timescale.com/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/).
Prometheus can be clustered and although it doesn’t have high availability, it does have federation so you can kind of fake it. However, for large setups its configuration becomes a clunky affair.
But the cherry on top of this fruit salad of options is M3, which supports full clustering with advanced hands-off management features. It provides strong multi-cluster consistency management through etcd. You can update cluster configuration in real time, manage the placement of its distributed or sharded tiers, govern leader selection and so on. As a result, clustering in M3 is a very low-maintenance option: just set it up once and it’ll continue to do its job. At the same time, it integrates well with Prometheus to provide those sweet sweet monitoring goodies.
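If sharding and placement are new to you, the general idea can be sketched as follows: each series is hashed to a shard, and a placement table maps each shard to a node. This Python sketch is a generic illustration under our own assumptions (CRC32 hashing, round-robin assignment, invented node names). It is not M3’s actual placement algorithm, which also handles replication and rebalancing.

```python
import zlib


def shard_for(series_id, num_shards):
    """Map a series ID to a shard with a stable hash (CRC32 here, chosen only for simplicity)."""
    return zlib.crc32(series_id.encode("utf-8")) % num_shards


def build_placement(nodes, num_shards):
    """Assign shards to nodes round-robin; returns {shard: node}."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
placement = build_placement(nodes, num_shards=8)

series_id = "temperature{device=d1}"  # hypothetical series ID
shard = shard_for(series_id, 8)
print(series_id, "-> shard", shard, "->", placement[shard])
```

Because the hash is stable, a given series always lands on the same shard. Moving load between nodes then means changing the placement table rather than rehashing the data, and it is that table which a coordination service such as etcd would keep consistent across the cluster.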
4. Reading speed requirements
The other main impact of large amounts of data is the speed at which you can retrieve it when necessary. If you use your data for real-time analytics, machine learning or AI, you’re going to need a database where systems downstream can read data quickly and efficiently.
The larger the database and the more granular the data, the longer it takes to retrieve a certain piece of information. One good way to improve read access speed is to use aggregation and compression as discussed earlier.
Indexing also improves reading speed. TimescaleDB uses composite indexing to achieve remarkable savings on read times. InfluxDB enhances read operations with its Time Series Index (TSI), but isn’t quite up to speed yet, as it were. Prometheus uses LevelDB for indexes, offering fair performance.
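To show what a composite index buys you, here is a small Python sketch that uses the standard library’s SQLite as a stand-in. It is not TimescaleDB, and the table, column and index names are hypothetical. The point is only that an index on (device, time) lets the database jump straight to one device’s time range instead of scanning every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, ts INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(f"d{d}", t, float(t)) for d in range(3) for t in range(100)],
)

# Composite index: device first, then timestamp.
conn.execute("CREATE INDEX idx_device_ts ON readings (device_id, ts)")

query = "SELECT ts, value FROM readings WHERE device_id = ? AND ts BETWEEN ? AND ?"
params = ("d1", 10, 19)

# Ask SQLite how it plans to run the query; the last column holds the description.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

rows = conn.execute(query, params).fetchall()
print(len(rows))
```

The query plan should mention `idx_device_ts`, which indicates that the lookup is served by the index rather than by a full table scan. The column order matters: putting the device first suits queries that filter on one device and then on a time range.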
Databases that run in-memory automatically compress source data, which reduces seek time during query processing. But then, just about all time series databases run in-memory.
5. System footprint size
Depending on how and where your data collection takes place, it may make sense to invest in a database with fewer features and a smaller footprint.
For example, for a mobile analysis device (by mobile we mean something that moves, such as a movable weather station or the sensor rig on a mining tool, and not necessarily a handset), you probably want a small database to ride with the hardware. That hardware collects and processes raw data, then sends only the aggregated or analysed data back to the main office. InfluxDB has a nice, modest footprint and integrates well with logging and other systems.
But if footprint size is not an issue, well, the world’s your crustacean of choice, and you have many more options open.
6. Logging and monitoring the database
If your cluster setup isn’t complex and/or your database administrator enjoys fixing and tinkering, Prometheus offers strong monitoring capabilities and can quickly alert you to issues. Prometheus has its own monitoring stack that comes pre-configured with a set of alerts and Grafana dashboards.
InfluxDB and M3 are also easy to plug into Prometheus or Grafana for logging and visualisation.
There’s no single easy answer to “which is the best time series database”. In fact, you’re lucky if you can arrive at a single easy answer to “which is the best for my needs”. The considerations we’ve listed in this article can get you most of the way to a decision, but we can’t make the selection for you.
Oh, and one more thing, or three...
+3 big questions
After all of that, there are still three foundational questions to ask yourself and your team:
Chances are your organization already has policies in place for all of that. You could however take the opportunity to revisit those policies and procedures. Things change, both your company’s business landscape and the choices on offer.
Whatever you choose, you’re also likely to want to re-evaluate your position once the business has ramped up. This is a good point to remind you that open source systems help you to do that; and managed open source systems, such as... well, such as Aiven to take an example completely at random, can help you store your data in such a way that you are always free to change your mind without losing anything.