Speed is key to the success of our products, applications and solutions. Instead of guessing your next step, you can let data guide you to complex business decisions faster. Today we want to introduce you to a solution designed to get to the bottom of storing and analyzing big data: ClickHouse.
ClickHouse is a highly scalable open source database management system (DBMS) that uses a column-oriented structure. It's designed for online analytical processing (OLAP) and is highly performant, returning processed results in a fraction of a second. This makes it ideal for applications working with massive structured data sets: data analytics, complex data reports and data science computations.
ClickHouse is most praised for its exceptionally high performance. That performance is the sum of many factors:
- Column-oriented data storage
- Data compression
- The vector computation engine
- Approximate calculations
- The use of physical sparse indices
But performance isn't the only benefit of ClickHouse. ClickHouse is more than a database: it's a sophisticated database management system that supports distributed query processing, partitioning, data replication and sharding. It's a highly scalable and reliable system capable of handling terabytes of data.
In fact, ClickHouse is designed to write huge amounts of data while simultaneously processing a large number of read requests. And you can conveniently work with it using a declarative SQL-like query language.
ClickHouse has a booming development community and continues to be actively developed and improved. You can look at the changelog and the road map to see the latest features and future plans. Even with the system's fast growth, every new feature is evaluated performance-wise to make sure it doesn't degrade the speed of the system. Many of ClickHouse's biggest existing features are aimed specifically at enhancing its performance and efficiency.
As a truly columnar database, ClickHouse stores the values of the same column physically next to each other with no extra data attached to each value. This matters because even a seemingly insignificant piece of extra data (the length of a string, for example), when attached to hundreds of millions of items in a column, substantially affects the speed of compression, decompression and reads.
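To make the difference concrete, here is a minimal sketch in plain Python (not ClickHouse internals; the table and values are invented) contrasting the two layouts of the same data:

```python
# The same three-column table laid out two ways.
rows = [
    (1, "DE", 9.99),
    (2, "FI", 4.50),
    (3, "DE", 7.25),
]

# Row-oriented: each record is stored (and read) as a whole.
row_store = rows

# Column-oriented: values of the same column sit next to each other.
column_store = {
    "id":      [1, 2, 3],
    "country": ["DE", "FI", "DE"],
    "price":   [9.99, 4.50, 7.25],
}

# A query touching only one column reads a single contiguous array...
avg_price = sum(column_store["price"]) / len(column_store["price"])

# ...while a row store has to walk every full record for the same values.
avg_price_rows = sum(r[2] for r in row_store) / len(row_store)
```

At the scale of three rows the layouts are interchangeable; at hundreds of millions of rows, reading one contiguous column instead of every full record is what makes analytical scans fast.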
To achieve the desired performance, ClickHouse uses data compression. This includes general-purpose compression, as well as a number of specialised codecs targeting the different types of data stored in separate columns.
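A rough way to see why columnar storage compresses so well, sketched with Python's general-purpose zlib (ClickHouse actually uses LZ4/ZSTD plus specialised codecs, and the data here is synthetic): a homogeneous column of similar values tends to compress much better than the same data interleaved row by row.

```python
import zlib

# 100,000 synthetic events: a counter, a country code from a small set, a price.
n = 100_000
countries = ["DE", "FI", "SE", "US"]
table = [(i, countries[i % 4], f"{(i % 50) / 2:.2f}") for i in range(n)]

# Row-oriented serialisation: fields of different types interleaved.
row_bytes = b"".join(f"{i},{c},{p};".encode() for i, c, p in table)

# Column-oriented serialisation: each column is a homogeneous run of values.
col_bytes = b"".join(
    ",".join(str(v) for v in col).encode()
    for col in zip(*table)
)

row_compressed = len(zlib.compress(row_bytes))
col_compressed = len(zlib.compress(col_bytes))
# The columnar layout typically compresses noticeably better,
# because repetitive same-type values sit next to each other.
```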
ClickHouse supports distributed query processing with data stored across different shards. Large queries are parallelized across multiple cores, using the resources they need.
ClickHouse supports SQL syntax similar to ANSI SQL. However, it is not identical, so a migration from another SQL-compatible system might require translations.
During data processing, ClickHouse works with chunks of columns (so-called vectors), and operations are performed on arrays of items rather than on individual values.
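A toy sketch of the idea in Python: operators consume whole blocks of column values at a time, amortising per-call overhead over thousands of values (ClickHouse's real engine works on typed arrays with SIMD; the function names and block size here are illustrative):

```python
# Process data in fixed-size blocks of column values ("vectors"),
# the way a vectorised engine passes chunks between operators.
BLOCK_SIZE = 8192

def blocks(column, size=BLOCK_SIZE):
    """Yield the column in contiguous fixed-size chunks."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def vectorised_sum(column):
    # Each call handles a whole vector of values, not a single value,
    # so dispatch overhead is paid once per block instead of once per row.
    return sum(sum(block) for block in blocks(column))

prices = [float(i % 100) for i in range(100_000)]
total = vectorised_sum(prices)
```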
ClickHouse updates tables continuously, without relying on locks when adding new data.
ClickHouse keeps data physically sorted by primary key. Secondary indices (also called "data skipping indices") indicate in advance which data won't match the filtering criteria and should be skipped (hence the name).
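A minimal sketch of how a sparse primary index works on physically sorted data: instead of indexing every row, keep one "mark" per granule (ClickHouse's default granule is 8192 rows) and binary-search the marks to decide which granules need to be read at all. The numbers and function names below are invented for illustration:

```python
import bisect

GRANULE = 8192

# Data is kept physically sorted by primary key.
keys = list(range(1_000_000))  # pretend these are sorted primary-key values

# Sparse index: one mark per granule, not one entry per row.
marks = keys[::GRANULE]

def granules_for_range(lo, hi):
    """Return the granule numbers that may contain keys in [lo, hi]."""
    first = max(bisect.bisect_right(marks, lo) - 1, 0)
    last = bisect.bisect_right(marks, hi) - 1
    return list(range(first, last + 1))

# A narrow lookup touches a single granule; the rest are skipped entirely.
print(granules_for_range(50_000, 50_100))  # → [6]
```

The flip side of this design is visible here too: finding one exact row still means scanning a whole granule, which is why ClickHouse is a poor fit for point lookups.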
To gain a further performance boost, you can run complex queries on a sample of the data, finding a trade-off between accuracy and performance. This is relevant, for example, for complex data science calculations.
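The accuracy-for-speed trade-off can be sketched in a few lines of Python (ClickHouse does this server-side with its sampling support; here we simply subsample a list, with a fixed seed so the sketch is reproducible):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# One million synthetic measurements.
values = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

exact_avg = sum(values) / len(values)

# Estimate the same average from a 1% sample: ~100x less data to scan.
sample = random.sample(values, len(values) // 100)
approx_avg = sum(sample) / len(sample)

# The estimate lands very close to the exact answer,
# at a fraction of the work.
```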
While ClickHouse can be an excellent choice for many scenarios, it's important to keep in mind its architectural characteristics. Because ClickHouse is pretty unique, it's easy to make mistakes that lead to sub-optimal performance. That's why it is important to understand what stands behind this DBMS and how it functions.
Let's start by looking at its most distinctive feature: the column-oriented structure of its storage.
To understand better where the need for the column-oriented approach is coming from and why ClickHouse uses it, let's take a closer look at two different types of systems: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). In particular, pay attention to granularity with which they manipulate the data and to the types of operations that are prevalent in these systems.
OLTP applications perform small but very frequent operations to insert, update and select a modest number of rows. In this type of application we traditionally use a row-oriented approach, as it is the most effective way to work with entire individual rows.
OLAP systems are a completely different matter: operations do not target single rows. Instead, we work with hundreds of thousands or even millions of records at a time, relying on grouping and aggregation mechanisms. Data in OLAP systems represents events and rarely needs to be updated. And, importantly, usually only a fraction of the fields needs to be retrieved and processed at a time. This makes it very inefficient to read complete rows, as row-oriented systems do.
The bottom line is: in OLTP applications, records are stored for easy updates of individual rows, while in OLAP systems, data is stored primarily for fast reads and analysis of massive chunks of data.
Therefore, a row-oriented DBMS cannot effectively handle the analytical processing of data volumes typical of OLAP applications.
Column-oriented systems were designed to solve OLAP challenges. In truly columnar databases, the data is physically grouped and stored by columns. This minimizes disk access and improves performance, because processing a specific query only requires reading a fraction of the data. Since each column contains data of the same type, it can use effective compression mechanisms.
Additionally, the columnar approach allows adding or removing columns with no performance overhead, since it simply means creating or deleting files. In contrast, adding a new column in a row-oriented database would require updating the data in every row.
Understanding the difference between OLAP and OLTP systems, and the distinction between row and columnar approaches, is key when deciding whether to use ClickHouse. In the next section we'll look into how this relates to specific system requirements, and what you should pay attention to when making a decision to adopt ClickHouse.
If used correctly and in suitable scenarios, ClickHouse is a powerful, scalable and fast solution that outperforms its competitors. ClickHouse is made for OLAP applications, and includes a number of optimizations to read data and process complex requests at high speeds.
You'll get the most out of ClickHouse if:
- you work with enormous volumes of data (measured in terabytes) continuously written and read;
- you have tables with a massive number of columns (ClickHouse loves large numbers of columns!), but column values are reasonably short;
- your data is well-structured and not yet aggregated;
- you insert data in large batches of thousands of rows (a million is a good number);
- the vast majority of operations are reads with aggregations;
- for reads, you process a large number of rows, but a fairly low number of columns;
- you don't need to modify data later;
- you don't need to retrieve specific rows;
- you don't need transactions.
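The "insert in large batches" guideline above can be sketched as a simple client-side buffer (the class and names are invented for illustration; a real client library would send each flushed batch as one INSERT statement):

```python
class BatchBuffer:
    """Accumulate rows and hand them off in large batches,
    instead of issuing one tiny insert per row."""

    def __init__(self, flush_rows, send):
        self.flush_rows = flush_rows  # e.g. tens of thousands of rows
        self.send = send              # callback performing the actual INSERT
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.flush_rows:
            self.flush()

    def flush(self):
        if self.rows:
            self.send(self.rows)
            self.rows = []

batches = []
buf = BatchBuffer(flush_rows=10_000, send=batches.append)
for i in range(25_000):
    buf.add((i, "event"))
buf.flush()  # flush the remaining tail on shutdown
print([len(b) for b in batches])  # → [10000, 10000, 5000]
```

Three bulk inserts instead of 25,000 tiny ones is exactly the write pattern ClickHouse is built for.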
For example, Yandex uses over 500 servers with 25 million records coming in each day. Another company that uses ClickHouse, Bloomberg, has over a hundred servers and accepts approximately a trillion new records each day (as of 2018).
ClickHouse is designed to be fast. However, the optimisations that make ClickHouse the perfect solution for OLAP applications make it suboptimal for other types of projects.
Do not use ClickHouse for OLTP. ClickHouse expects data to remain immutable. Even though it is technically possible to remove big chunks of data from the ClickHouse database, it is not fast. ClickHouse simply isn't designed for data modifications. It's also inefficient at finding and retrieving single rows by keys, due to sparse indexing. Lastly, ClickHouse does not fully support ACID transactions.
ClickHouse is not a key-value DBMS. It is also not designed to be a file storage.
It's not a document-oriented database, either. ClickHouse uses a pre-defined schema that needs to be specified during table creation. The better the schema, the more effective and performant the queries.
I hope this has you intrigued about ClickHouse and its superpowers, and maybe wondering how you can start using it on your own. ClickHouse is an open source project and you can follow its documentation to build it yourself.
However, we know that setting up and maintaining a ClickHouse cluster can be quite a challenge. Ensuring proper data replication, fault tolerance and stability takes plenty of time and energy. That's why Aiven has decided to offer Aiven for ClickHouse, which provides you with the benefits of ClickHouse without the headache.
With Aiven for ClickHouse, you can focus on the product you are building, and we'll keep the underlying infrastructure running so smoothly that you can totally forget about it.
So, how can you set up Aiven for ClickHouse?
Select Aiven for ClickHouse in the Aiven Console when creating a new service. (Read detailed instructions in our Getting Started documentation).
To dive deeper and understand how indexing and data processing work in ClickHouse, check out this Indexing article.
To get the latest news about Aiven and our services, plus a bit of extra around all things open source, subscribe to our monthly newsletter! Daily news about Aiven is available on our LinkedIn and Twitter feeds.
If you just want to find out about our service updates, follow our changelog.
Struggling with the management of your data infrastructure? Sign up now for your free trial at https://console.aiven.io/signup!