Alexander does product marketing for databases at Aiven. Throughout his career he's worked across marketing, sales, analytics, data engineering and more.
For ClickHouse to achieve the speed it's known for, it has historically relied on a trigger-based and incremental architecture for materialized views.
This architecture is known to be very efficient but lacks flexibility, such as when working with data that needs to be backfilled rather than append-only.
The Problem with Incremental Materialized Views
ClickHouse Materialized Views have historically functioned like triggers, firing only when new data is inserted into the source table. They then transform that single block of data and append it to the target table. They don’t capture historical data and don’t maintain a link to the source table.
This created three major problems:
No updates or deletes: If you updated a record in your source table, the Incremental MV would not reflect this change, as it only triggers on new inserts.
Complex re-calculations: If you changed your aggregation logic, you couldn't just "refresh" the view. You had to drop the view, then create a new one and manually backfill the data.
Orchestration overhead: To get around this, you had to write external scripts, such as using Airflow, Dagster, or Cron, to periodically drop tables and run INSERT INTO ... SELECT queries just to get a clean snapshot of the data.
The Refreshing Solution
The introduction of Refreshable Materialized Views lets you control the update frequency of your views directly in ClickHouse, without the need for external orchestration tools. While this is not a complete replacement for complex orchestration, it can replace many simple cron jobs and even some dbt pipelines.
Replacing cron jobs
The magic lies in the REFRESH clause.
Loading code...
When using the REFRESH clause, ClickHouse will automatically re-run the query, replacing the old data with the latest results from the source. Anything you can represent with cron, you can do in ClickHouse, just with a different syntax.
You can also manually refresh, modify, and audit the schedules.
Loading code...
Replacing dbt pipelines
Taking this a step further, we can use the DEPENDS ON clause to synchronize refreshes of different tables. This is similar to the ref function in dbt.
Loading code...
If you are familiar with how dbt works, this might seem straightforward, but let’s go through what is actually happening here.
By saying that customers depends on stg_customers ClickHouse will not trigger the refresh of customers until the refresh of stg_customers completes successfully.
If we wanted, we could also chain multiple views to create a dependency graph, or in other words, a Directed Acyclical Graph (DAG) that transforms raw data into something useful.
If the source view fails to refresh, the dependent view will skip its refresh to avoid processing inconsistent or stale data.
ClickHouse does not replace the full power of dbt, but it gives you the ability to choose what kind of complexity you want. By making it simple to create simple cron jobs or pipelines, you can choose if the overhead of a Python framework is worth the extra complexity, or if all you need is a simple REFRESH.
ClickHouse gives you the ability to choose
While Refreshable Materialized Views are a great addition, they do not replace Incremental Materialized Views, which are generally more efficient. ClickHouse now gives you more flexibility to choose what you need.
You can now do both the real-time streaming ClickHouse is known for and the traditional batch workloads you might have run in other data warehouses.
You might even be able to simplify your operations by replacing simple cron jobs and dbt pipelines by allowing ClickHouse to orchestrate them with the REFRESH and DEPENDS ON clauses.
-- Can be used to periodically pull data from S3CREATE MATERIALIZED VIEW refreshing_summary
REFRESH EVERY 1HOURTO refreshed_target
ASSELECT...FROM s3('s3://sample-bucket/*.parquet')-- More complex example:-- 20th day of every month at 11:30 pmREFRESH EVERY 2 WEEK OFFSET19DAY23HOUR30MINUTE
-- Start a manual refreshSYSTEM REFRESH VIEW refreshing_summary;-- Change the refresh frequencyALTERTABLE refreshing_summary
MODIFY REFRESH EVERY 45MINUTE;-- Audit view refreshesSELECTdatabase,view,status, last_success_time, last_refresh_time, next_refresh_time
FROM system.view_refreshes;
-- Using the dbt ref functionSELECT...FROM {{ ref('stg_customers') }}
-- Using the ClickHouse DEPENDS ON clauseCREATE MATERIALIZED VIEW customers
REFRESH EVERY 1DAYDEPENDS ON stg_customers
ASSELECT...FROM stg_customers