April 13, 2018

Turn Aiven PostgreSQL into a time series powerhouse with TimescaleDB

Time series data from IoT tends to be voluminous, so the demands on a database to store that data efficiently, for both ingest and query, are high. TimescaleDB is an open-source, scalable SQL engine that meets these demands quite effectively.

Packaged as a PostgreSQL extension, TimescaleDB gives users of Aiven PostgreSQL 10.3 and newer the stability and power of PostgreSQL while storing their applications’ time series data in an engine purpose-built to handle it.

In this post, we’ll give a general overview of time series data, explain how TimescaleDB turns PostgreSQL into a time series database, demonstrate how to set it up and perform various tasks with it, and finish by visualizing that data in Grafana.

We are surrounded by time series data

As human beings, we produce data as we move and operate through time. Therefore, everything we do in our business and personal lives creates time series data. For a more academic definition of time series data, we can look to Wikipedia:

A series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time.

IoT is an obvious example of time series data in practice. Everything from mobile devices, home appliances, thermostats and security cameras to automobiles can continuously collect, store, use, and transmit data.

The data, along with the time at which it is collected and recorded, constitutes time series data.

Other examples of time series data include the rate of vehicle flow at an intersection and telemetry from an aircraft engine during flight. Ultimately, we are surrounded by opportunities to collect time series data.

So how does TimescaleDB’s architecture update PostgreSQL’s capabilities?

TimescaleDB: create a table, add a column

Essentially, you create a table in TimescaleDB just as you would in vanilla PostgreSQL. The only requirement is that the table have a timestamp (or date) column. (You can get a more in-depth overview of TimescaleDB’s Architecture here.)

Once this table is created, the built-in helper function create_hypertable() performs all the tasks necessary to turn it into a hypertable. The hypertable holds the metadata about its child objects: the partitions and key spaces into which the data is divided, materialized as chunks.

Data ingested into a time series table ultimately ends up in chunks, its final resting place.
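Conceptually, routing a row to its chunk is just a matter of finding the time slice that contains its timestamp. The following Python sketch illustrates the idea; the chunk_key helper and its epoch alignment are illustrative assumptions, not TimescaleDB’s actual routing code.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def chunk_key(ts: datetime, chunk_interval: timedelta) -> datetime:
    """Return the start of the epoch-aligned time slice containing ts."""
    epoch = datetime(1970, 1, 1)
    n = int((ts - epoch) / chunk_interval)
    return epoch + n * chunk_interval

# Route a few timestamps into day-sized "chunks".
chunks = defaultdict(list)
for ts in [datetime(2014, 2, 13, 11, 30),
           datetime(2014, 2, 13, 23, 55),
           datetime(2014, 2, 14, 0, 5)]:
    chunks[chunk_key(ts, timedelta(days=1))].append(ts)

print(sorted(chunks))  # two day-sized chunks: Feb 13 and Feb 14
```

Because every row lands in exactly one slice, dropping a whole slice later is a cheap metadata operation rather than a row-by-row delete.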

Chunks can be managed (resized, dropped, etc) from the command line. Dropping chunks to purge old data is much more efficient than the usual alternatives. But, the true elegance of TimescaleDB lies in the fact that it is a PostgreSQL extension.

As PostgreSQL goes, so does TimescaleDB

The neat thing about TimescaleDB being a PostgreSQL extension is that backup and streaming replication automatically apply to it: pg_basebackup will back up data stored in TimescaleDB. Similarly, restore and point-in-time recovery also apply automatically.

The same goes for synchronous and asynchronous streaming replication. Therefore, Aiven PostgreSQL Business and Premium plans with standby instances provide read scalability and high availability, and do so for TimescaleDB as well.

The above is key, especially in an environment where popular time series databases have chosen the open-core model, making features that are necessary for production environments only available under proprietary licenses. Now that there is an alternative, let’s check it out.

A TimescaleDB use case in Aiven PostgreSQL

Let’s walk through a use case of TimescaleDB within Aiven PostgreSQL by examining vehicle traffic data for Aarhus, Denmark from February to June 2014. Follow this link to download the zipped archive that contains the actual data.

The raw data is a series of timestamped traffic measurements recorded in five-minute intervals. The measurements include everything from vehicle count to average speed, and are displayed as below:

status,avgMeasuredTime,avgSpeed,extID,medianMeasuredTime,TIMESTAMP,vehicleCount,_id,REPORT_ID
OK,66,56,668,66,2014-02-13T11:30:00,7,190000,158324
OK,69,53,668,69,2014-02-13T11:35:00,5,190449,158324
OK,69,53,668,69,2014-02-13T11:40:00,6,190898,158324
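To make the field mapping concrete, here is a small Python sketch that parses the sample rows and maps the camelCase CSV headers onto the snake_case column names we’ll use in the table schema. It is illustrative only; the actual load later in this post uses psql’s \COPY.

```python
import csv
import io
from datetime import datetime

SAMPLE = """\
status,avgMeasuredTime,avgSpeed,extID,medianMeasuredTime,TIMESTAMP,vehicleCount,_id,REPORT_ID
OK,66,56,668,66,2014-02-13T11:30:00,7,190000,158324
OK,69,53,668,69,2014-02-13T11:35:00,5,190449,158324
"""

def parse_rows(text):
    """Map each CSV record onto the snake_case column names of the table."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "status": row["status"],
            "avg_measured_time": int(row["avgMeasuredTime"]),
            "avg_speed": int(row["avgSpeed"]),
            "time": datetime.fromisoformat(row["TIMESTAMP"]),
            "vehicle_count": int(row["vehicleCount"]),
        }

rows = list(parse_rows(SAMPLE))
print(rows[0]["vehicle_count"], rows[0]["avg_speed"])  # 7 56
```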

To demonstrate the capabilities of TimescaleDB, we’ll perform 6 tasks for this use case:

  1. Create the schema in TimescaleDB
  2. Ingest data
  3. Query data
  4. Purge old data
  5. Get database resource information
  6. Visualize the data in Grafana

1. Creating the schema in TimescaleDB

1.1. Let’s set up an Aiven PostgreSQL Startup-4 plan and connect to it using psql.

psql "postgres://avnadmin:pw@host:port/defaultdb?sslmode=require"

1.2. Create the TimescaleDB extension and the table vehicle_traffic to store the traffic data.

CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

CREATE TABLE vehicle_traffic (
  status                TEXT not null,
  avg_measured_time     INT not null,
  avg_speed             INT not null,
  ext_id                INT not null,
  median_measured_time  INT not null,
  time                  TIMESTAMP not null,
  vehicle_count         INT not null,
  id                    INT not null,
  report_id             INT not null
);

1.3. Create the corresponding hypertable in TimescaleDB.

Let’s create chunks to store 1 day’s worth of data (this is specified in the chunk_time_interval parameter). The interval maps well to the traffic data we will be ingesting and analyzing.

SELECT create_hypertable('vehicle_traffic', 'time',
    chunk_time_interval => interval '1 day');

2. Ingesting data

2.1. Now, let’s ingest data into the table.

Since the data is split into multiple files, I used a bash script to load each CSV file using psql.

#!/bin/bash
for i in *.csv; do
    echo "item: $i"
    time psql \
        "postgres://avnadmin:password@host:port/defaultdb?sslmode=require" \
        -c "\COPY vehicle_traffic FROM '$i' CSV HEADER"
done

2.1.a. After data is ingested into the table, the partition and chunk information becomes evident.

defaultdb=> \d+ vehicle_traffic
Indexes:
    "vehicle_traffic_time_idx" btree ("time" DESC)
Child tables: _timescaledb_internal._hyper_67_100_chunk,
              _timescaledb_internal._hyper_67_101_chunk,
              _timescaledb_internal._hyper_67_102_chunk,
...
              _timescaledb_internal._hyper_67_215_chunk,
              _timescaledb_internal._hyper_67_216_chunk

2.1.b. The query plan for a simple count(*) of the records in the table after all the data is ingested looks like this:

defaultdb=> explain select count(*) from vehicle_traffic;

 Finalize Aggregate  (cost=227717.58..227717.59 rows=1 width=8)
   ->  Gather  (cost=227717.47..227717.58 rows=1 width=8)
         Workers Planned: 1
         ->  Partial Aggregate  (cost=226717.47..226717.48 rows=1 width=8)
               ->  Append  (cost=0.00..206762.98 rows=7981797 width=0)
                     ->  Parallel Seq Scan on vehicle_traffic  (cost=0.00..0.00 rows=1 width=0)
                     ->  Parallel Seq Scan on _hyper_67_100_chunk  (cost=0.00..948.12 rows=36612 width=0)
…
                     ->  Parallel Seq Scan on _hyper_67_215_chunk  (cost=0.00..1480.92 rows=57192 width=0)
                     ->  Parallel Seq Scan on _hyper_67_216_chunk  (cost=0.00..253.96 rows=9796 width=0)

*As you can see, the query has to scan each and every chunk to produce the count.

3. Querying data

3.1. The optimal query in TimescaleDB hits a single chunk within a partition.

With our data and corresponding chunk configuration (remember, we configured a chunk to contain a day’s worth of data), the most efficient query would be one that includes a specific date in the WHERE clause. Here’s an example of such a query and its corresponding query plan.

defaultdb=> SELECT  date_trunc('day', time) as day,
                    avg(avg_speed), sum(vehicle_count)
                FROM vehicle_traffic
                WHERE time = '2014-02-28'
                GROUP BY day ORDER BY day;

          day           |         avg         | sum
------------------------+---------------------+-----
 2014-02-28 00:00:00+00 | 46.7289377289377289 |  32
(1 row)

defaultdb=> EXPLAIN SELECT  date_trunc('day', time) as day,
                            avg(avg_speed), sum(vehicle_count)
                        FROM vehicle_traffic
                        WHERE time = '2014-02-28'
                        GROUP BY day ORDER BY day;

 Sort  (cost=370.26..370.76 rows=200 width=48)
   Sort Key: (date_trunc('day'::text, vehicle_traffic."time"))
   ->  HashAggregate  (cost=359.62..362.62 rows=200 width=48)
         Group Key: date_trunc('day'::text, vehicle_traffic."time")
         ->  Result  (cost=0.00..356.57 rows=407 width=16)
               ->  Append  (cost=0.00..351.48 rows=407 width=16)
                     ->  Seq Scan on vehicle_traffic  (cost=0.00..0.00 rows=1 width=16)
                           Filter: ("time" = '2014-02-28 00:00:00+00'::timestamp with time zone)
                     ->  Index Scan using _hyper_67_115_chunk_vehicle_traffic_time_idx on _hyper_67_115_chunk  (cost=0.42..351.48 rows=406 width=16)
                           Index Cond: ("time" = '2014-02-28 00:00:00+00'::timestamp with time zone)

*As can be seen above, one (and only one) chunk is ever considered for the query.

3.2. The next most efficient query is a range scan on the date/time column.

For range scans on time, multiple chunks will be considered for the query. But this is where PostgreSQL’s parallel query scan comes into play and provides big benefits, as shown below.

defaultdb=> SELECT  date_trunc('day', time) as day,
                    avg(avg_speed), sum(vehicle_count)
                FROM vehicle_traffic
                WHERE time >= '2014-02-1' AND time < '2014-03-1'
                GROUP BY day ORDER BY day;

          day           |         avg         |  sum
------------------------+---------------------+--------
 2014-02-13 00:00:00+00 | 41.7338751124245150 | 235854
 2014-02-14 00:00:00+00 | 42.5012742058532310 | 432441
...
 2014-02-27 00:00:00+00 | 42.7180470031571745 | 509792
 2014-02-28 00:00:00+00 | 42.6757349907385461 | 460238
(16 rows)

defaultdb=> explain SELECT  date_trunc('day', time) as day,
                            avg(avg_speed), sum(vehicle_count)
                        FROM vehicle_traffic
                        WHERE time >= '2014-02-1' AND time < '2014-03-1'
                        GROUP BY day ORDER BY day;

 Finalize GroupAggregate  (cost=58285.77..58290.77 rows=200 width=48)
   Group Key: (date_trunc('day'::text, vehicle_traffic."time"))
   ->  Sort  (cost=58285.77..58286.27 rows=200 width=48)
         Sort Key: (date_trunc('day'::text, vehicle_traffic."time"))
         ->  Gather  (cost=58255.62..58278.12 rows=200 width=48)
               Workers Planned: 1
               ->  Partial HashAggregate  (cost=57255.62..57258.12 rows=200 width=48)
                     Group Key: date_trunc('day'::text, vehicle_traffic."time")
                     ->  Result  (cost=0.00..48822.70 rows=1124389 width=16)
                           ->  Append  (cost=0.00..34767.84 rows=1124389 width=16)
                                 ->  Parallel Seq Scan on vehicle_traffic  (cost=0.00..0.00 rows=1 width=16)
                                       Filter: (("time" >= '2014-02-01 00:00:00+00'::timestamp with time zone) AND ("time" < '2014-03-01 00:00:00+00'::timestamp with time zone))
                                 ->  Parallel Seq Scan on _hyper_67_100_chunk  (cost=0.00..1135.07 rows=36738 width=16)
                                       Filter: (("time" >= '2014-02-01 00:00:00+00'::timestamp with time zone) AND ("time" < '2014-03-01 00:00:00+00'::timestamp with time zone))

…
                                 ->  Parallel Seq Scan on _hyper_67_114_chunk  (cost=0.00..2207.42 rows=71361 width=16)
                                       Filter: (("time" >= '2014-02-01 00:00:00+00'::timestamp with time zone) AND ("time" < '2014-03-01 00:00:00+00'::timestamp with time zone))
                                 ->  Parallel Seq Scan on _hyper_67_115_chunk  (cost=0.00..2075.41 rows=67161 width=16)
                                       Filter: (("time" >= '2014-02-01 00:00:00+00'::timestamp with time zone) AND ("time" < '2014-03-01 00:00:00+00'::timestamp with time zone))

3.3. The least optimal query scans all chunks.

But even in this scenario, PostgreSQL’s parallel query scan capability has significant benefits over the purely sequential scans that are common in some other database technologies.

defaultdb=> explain SELECT  date_trunc('day', time) as day,
                            avg(avg_speed), sum(vehicle_count)
                        FROM vehicle_traffic
                        GROUP BY day ORDER BY day;

 Finalize GroupAggregate  (cost=367720.50..367725.50 rows=200 width=48)
   Group Key: (date_trunc('day'::text, vehicle_traffic."time"))
   ->  Sort  (cost=367720.50..367721.00 rows=200 width=48)
         Sort Key: (date_trunc('day'::text, vehicle_traffic."time"))
         ->  Gather  (cost=367690.35..367712.85 rows=200 width=48)
               Workers Planned: 1
               ->  Partial HashAggregate  (cost=366690.35..366692.85 rows=200 width=48)
                     Group Key: date_trunc('day'::text, vehicle_traffic."time")
                     ->  Result  (cost=0.00..306779.27 rows=7988145 width=16)
                           ->  Append  (cost=0.00..206927.45 rows=7988145 width=16)
                                 ->  Parallel Seq Scan on vehicle_traffic  (cost=0.00..0.00 rows=1 width=16)
                                 ->  Parallel Seq Scan on _hyper_67_100_chunk  (cost=0.00..951.38 rows=36738 width=16)
…
                                 ->  Parallel Seq Scan on _hyper_67_215_chunk  (cost=0.00..1480.92 rows=57192 width=16)
                                 ->  Parallel Seq Scan on _hyper_67_216_chunk  (cost=0.00..253.96 rows=9796 width=16)

3.4. TimescaleDB also has other built-in time functions that are more flexible and convenient to use. For example, time_bucket() supports arbitrary time intervals, whereas PostgreSQL’s built-in date_trunc() is limited to fixed units such as hour or day.

defaultdb=> SELECT  time_bucket('1 hour', time) as one_hr,
                    avg(avg_speed), sum(vehicle_count)
                FROM vehicle_traffic
                WHERE avg_speed > 50
                GROUP BY one_hr ORDER BY one_hr;

         one_hr         |         avg         |  sum
------------------------+---------------------+-------
 2014-02-13 11:00:00+00 | 65.1486310299869622 |  7055
 2014-02-13 12:00:00+00 | 64.8786089238845144 | 15565
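Conceptually, time_bucket() floors each timestamp to the start of a fixed-width interval of your choosing. The Python sketch below mimics that behavior with epoch-aligned buckets for simplicity; TimescaleDB’s actual alignment rules for some intervals differ, so treat this as an illustration rather than a reimplementation.

```python
from datetime import datetime, timedelta

def time_bucket(interval: timedelta, ts: datetime) -> datetime:
    """Floor ts to the start of its interval-wide, epoch-aligned bucket."""
    epoch = datetime(1970, 1, 1)
    seconds = (ts - epoch).total_seconds()
    bucket_start = seconds - (seconds % interval.total_seconds())
    return epoch + timedelta(seconds=bucket_start)

print(time_bucket(timedelta(hours=1), datetime(2014, 2, 13, 11, 35)))
# 2014-02-13 11:00:00
print(time_bucket(timedelta(minutes=15), datetime(2014, 2, 13, 11, 35)))
# 2014-02-13 11:30:00
```

The second call shows the flexibility date_trunc() lacks: a 15-minute bucket is just as easy as an hourly one.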

4. Purging old data

When dealing with time series data, it is convenient to have a mechanism for purging old data without impacting incoming data. Options like DELETE FROM table_foo WHERE time < 'date' are problematic, since they may involve intrusive locking and therefore hurt ingest performance.

In TimescaleDB, you can drop a chunk that contains old data, and the best part is, it does not impact ongoing ingest. Here’s an example, where we drop chunks containing data older than a specific date.

defaultdb=> SELECT drop_chunks('2014-02-20 12:00:00-5'::timestamp,
                               'vehicle_traffic');

5. Getting information about TimescaleDB resources

You can also obtain information about internal resource usage by using other built-in TimescaleDB functions and information_schema tables.

An example of one such function is hypertable_relation_size() (and its prettified equivalent), which produces the on-disk size of hypertables and indexes.

defaultdb=> SELECT * FROM hypertable_relation_size_pretty('vehicle_traffic');
 table_size | index_size | toast_size | total_size
------------+------------+------------+------------
 995 MB     | 395 MB     | 936 kB     | 1391 MB

*Please see the API reference for more information.

6. Visualizing your TimescaleDB data in Grafana

6.1. If you haven’t already, create a Grafana instance.

6.2. Log in to Grafana by clicking the host link on the Aiven Grafana service page in the Aiven console. Use the username avnadmin and the password shown on the service page.

6.3. Create a TimescaleDB PostgreSQL datasource in Grafana by selecting Data Sources.

[Image: step one of creating a TimescaleDB datasource in the Grafana UI]

6.3.1. Enter the values shown and click Add.

NOTE: Substitute the respective values from your PostgreSQL service for host:port and password.

[Image: step two of creating a TimescaleDB datasource in the Grafana UI]

6.4. Create a new dashboard by clicking New dashboard.

[Image: step one of creating a new graphing dashboard for your TimescaleDB datasource]

6.4.1. Select Graph.

[Image: step two of creating a new graphing dashboard for your TimescaleDB datasource]

6.4.2. Click the Panel Title to display the editing option and click Edit.

[Image: step three of creating a new graphing dashboard for your TimescaleDB datasource]

6.4.3. Enter a suitable title under the General tab.

[Image: step four of creating a new dashboard for your TimescaleDB datasource]

6.5. Under the Metrics tab, select “timescaledb” as the data source and use the query pictured below.

[Image: configuring your TimescaleDB datasource]

6.5.1. Click on the clock symbol in the top right corner to select your time window. For this example, select a time window between Feb-1 and Jun-9, 2014. Click Apply.

[Image: selecting the date range for your TimescaleDB graph]

6.5.2. You should have a dashboard similar to what’s shown below.

NOTE: You can add other graphs and dashboards to visualize the data you have in the database.

[Image: example of a TimescaleDB datasource graph]

Start with your plan, then apply TimescaleDB

When starting out with TimescaleDB, it is best to understand your time series data, its retention period, and, most importantly, the kinds of queries you will run against it. These will help you determine the chunk interval for the data, which closely correlates with ingest performance, query performance, and the ability to easily purge old data.
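As a rough sizing aid, you can estimate how many rows each chunk will hold from your sensor count and sampling rate. The Python sketch below uses hypothetical figures, not a recommendation; plug in your own workload’s numbers.

```python
from datetime import timedelta

def rows_per_chunk(sensors: int, sample_period: timedelta,
                   chunk_interval: timedelta) -> int:
    """Estimate rows per chunk from sensor count and sampling rate."""
    samples = chunk_interval.total_seconds() / sample_period.total_seconds()
    return int(sensors * samples)

# Hypothetical: 449 sensors reporting every 5 minutes, 1-day chunks.
print(rows_per_chunk(449, timedelta(minutes=5), timedelta(days=1)))  # 129312
```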

TimescaleDB scales very effectively to store time series data compared to vanilla PostgreSQL. This, along with PostgreSQL features such as backup and high availability, renders it a powerful open-source alternative to other time series databases.

The example shown in this post will help you get started quickly with TimescaleDB in Aiven PostgreSQL. To take advantage of its ability to be deployed as a high availability time series database, enable it within a Business-4 or higher Aiven PostgreSQL plan!

*See the references below for additional information on TimescaleDB and don’t forget to follow us on Facebook and Twitter, or join the Aiven blog’s RSS feed to stay up-to-date.

References
[1] TimescaleDB data model
[2] Comparing TimescaleDB to vanilla PostgreSQL
[3] Migrating data from vanilla PostgreSQL to TimescaleDB
[4] Hyper tables
[5] Best practices-1; Best practices-2
[6] Data retention
[7] Full list of built-in functions (API)

Start your free 30-day trial today

Test the whole platform for 30 days with no ifs, ands, or buts.