ClickHouse® Materialized Views: Patterns and Pitfalls
This blog post is part 3 of the ClickHouse Deep Dive series. The previous instalment, on data modeling, briefly introduced materialized views alongside other core ClickHouse topics such as columnar storage, table engines and vectorized execution. This post is the full deep dive into materialized views, one of ClickHouse's core capabilities and differentiators.
If you've used materialized views in PostgreSQL, you need to unlearn something before reading this. A Postgres MV is a cached query result that refreshes on a schedule. ClickHouse MVs are nothing like that: they're triggers. Every INSERT fires them, they transform the incoming rows with their SELECT, and they write the output to a separate target table. There's no cache and no refresh. The core difference is that an MV runs on the write path, not the read path.
I keep coming back to that distinction because it explains basically every mistake newcomers make with them. At ObsessionDB, we've built INSERT-time MVs on tables that ingest tens of millions of rows per second. Some of these MVs have been rock-solid for over a year. Others caused incidents that took days to untangle. The difference came down to whether we understood we were building pipelines or query optimisations.
This post is about INSERT-time MVs only. ClickHouse also has refreshable MVs that are closer to the Postgres model. Different beast, future post.
What materialized views actually do on INSERT
A block of rows arrives at the source table. Whether it came in through the HTTP API, from another MV or via an external connector doesn't matter. The MV's SELECT runs against that block and only that block. The results get written to the target table, and the INSERT completes.
That "only that block" part causes all the confusion. SELECT count(*) FROM source_table inside an MV doesn't return the total table count. It returns the number of rows that were in this particular INSERT batch. GROUP BY aggregates within the block. Self-JOINs see only the current block on both sides. Window functions stop at the block boundary. ORDER BY in the MV's SELECT does nothing because the target table has its own sort order.
So your aggregations are partial. Every INSERT block produces a partial result. What happens to those partial results depends on which engine the target table uses, and this is where the design decisions start.
SummingMergeTree is the simplest option. Rows with the same ORDER BY key get their numeric columns added together during background merges. Two blocks produce rows for (2026-02-20, /api/users); merges sum them. No special syntax. The catch is that it can only sum numeric columns, nothing else, but it's simple and efficient.
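A minimal sketch of the pattern, using hypothetical `requests`/`daily_hits` tables rather than the ones used later in this post:

```sql
-- Hypothetical source/target pair to illustrate SummingMergeTree
CREATE TABLE requests (timestamp DateTime, endpoint String)
ENGINE = MergeTree ORDER BY (endpoint, timestamp);

CREATE TABLE daily_hits (day Date, endpoint String, hits UInt64)
ENGINE = SummingMergeTree()
ORDER BY (day, endpoint);

CREATE MATERIALIZED VIEW daily_hits_mv TO daily_hits AS
SELECT toDate(timestamp) AS day, endpoint, count() AS hits
FROM requests
GROUP BY day, endpoint;

-- Two INSERT batches for the same endpoint produce two partial rows in
-- daily_hits; background merges sum them into one. Until that merge runs,
-- aggregate at read time:
SELECT day, endpoint, sum(hits) AS hits
FROM daily_hits
GROUP BY day, endpoint;
```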
ReplacingMergeTree is a dedup engine. It keeps only the latest row per ORDER BY key, optionally by a version column. An MV feeding a ReplacingMergeTree is basically a dedup pipeline. We use this for CDC patterns constantly.
AggregatingMergeTree is the best-known and most relevant pattern. It pre-aggregates through an intermediate-state mechanism, and it's worth spending a minute on because the -State/-Merge pattern is key to understanding ClickHouse.
Instead of writing final numbers, the MV writes serialised aggregation states to the target table: sumState(), uniqState(), quantileState(). These return binary blobs, not numbers, and are useless when read with a plain SELECT * from the target table. The target stores them in AggregateFunction columns, and during background merges ClickHouse combines partial states from different blocks into more complete states. When you query, the -Merge counterparts (sumMerge(), uniqMerge(), quantileMerge()) finalise the states and return the actual values.
Four patterns we run in production
These all run at ObsessionDB, ingesting tens of millions of rows per second. I'm listing them in order of how often we reach for them. There are many more patterns you can use, but the objective of these examples is to illustrate the flexibility of the tool.
Pre-aggregation
We monitor API performance at ObsessionDB. Billions of raw request rows, and we need hourly endpoint stats that load in milliseconds. The source table, the target, and the MV:
```sql
CREATE TABLE api_requests (
    timestamp DateTime,
    endpoint String,
    user_id UInt64,
    response_time_ms UInt32,
    status_code UInt16
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (endpoint, timestamp);

CREATE TABLE hourly_endpoint_stats (
    hour DateTime,
    endpoint String,
    request_count AggregateFunction(sum, UInt64),
    unique_users AggregateFunction(uniq, UInt64),
    p99_response AggregateFunction(quantile(0.99), UInt32)
)
ENGINE = AggregatingMergeTree()
ORDER BY (endpoint, hour);

CREATE MATERIALIZED VIEW hourly_endpoint_stats_mv TO hourly_endpoint_stats AS
SELECT
    toStartOfHour(timestamp) AS hour,
    endpoint,
    sumState(toUInt64(1)) AS request_count,
    uniqState(user_id) AS unique_users,
    quantileState(0.99)(response_time_ms) AS p99_response
FROM api_requests
GROUP BY endpoint, hour;
```
Querying uses -Merge to get the final state:
```sql
SELECT
    hour,
    endpoint,
    sumMerge(request_count) AS request_count,
    uniqMerge(unique_users) AS unique_users,
    quantileMerge(0.99)(p99_response) AS p99_response
FROM hourly_endpoint_stats
WHERE endpoint = '/api/v1/users'
GROUP BY endpoint, hour
ORDER BY hour DESC;
```
The target table is roughly 30,000 times smaller than the source. Altinity benchmarked this pattern and got 900x query speedups. Our numbers are comparable.
Reshaping at ingest
Sometimes the raw table has the right data but not in the shape you want to query. Bucketing timestamps, classifying values, extracting fields. Paying that cost at query time means every analyst pays it on every query. Paying it at ingest means you pay once. We abuse this, and it works beautifully with the pre-aggregation pattern.
```sql
CREATE TABLE raw_requests (
    timestamp DateTime,
    request_id String,
    user_id UInt64,
    response_time_ms UInt32,
    status_code UInt16,
    path String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp);

CREATE TABLE requests_enriched (
    timestamp DateTime,
    hour_bucket DateTime,
    request_id String,
    user_id UInt64,
    response_time_ms UInt32,
    speed_category LowCardinality(String),
    status_category LowCardinality(String),
    api_version LowCardinality(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, hour_bucket);

CREATE MATERIALIZED VIEW enriched_requests_mv TO requests_enriched AS
SELECT
    timestamp,
    toStartOfHour(timestamp) AS hour_bucket,
    request_id,
    user_id,
    response_time_ms,
    multiIf(response_time_ms > 1000, 'slow',
            response_time_ms > 200, 'normal', 'fast') AS speed_category,
    multiIf(status_code >= 500, 'server_error',
            status_code >= 400, 'client_error', 'success') AS status_category,
    extract(path, '^/api/(v[0-9]+)/') AS api_version
FROM raw_requests;
```
Analysts query the enriched table and never touch the raw one:
```sql
SELECT
    hour_bucket,
    speed_category,
    count() AS requests,
    avg(response_time_ms) AS avg_response
FROM requests_enriched
WHERE api_version = 'v2' AND speed_category = 'slow'
GROUP BY hour_bucket, speed_category
ORDER BY hour_bucket DESC;
```
If you need geo enrichment or user metadata lookups, dictGet with in-memory dictionaries works well here. Not JOINs. The "What went wrong" section below explains why.
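A hedged sketch of what that looks like, assuming a hypothetical `user_meta` dictionary backed by a small `users` metadata table (names are illustrative, not our production schema):

```sql
-- Dictionary over a small metadata table, kept in memory and refreshed
-- periodically
CREATE DICTIONARY user_meta (
    user_id UInt64,
    plan String
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(TABLE 'users'))
LAYOUT(FLAT())
LIFETIME(MIN 300 MAX 600);

-- Inside an MV SELECT, enrichment becomes an in-memory lookup instead of
-- a JOIN that re-reads the dimension table on every INSERT:
SELECT
    timestamp,
    user_id,
    dictGet('user_meta', 'plan', user_id) AS plan
FROM raw_requests;
```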
Deduplication
CDC from Postgres, webhook retries, API replays: duplicates are a fact of life in event pipelines, and we got tired of handling them at query time. Sometimes it's easier to handle them in the pipeline than at ingestion, or, when correctness is critical, to enforce dedup on both sides.
```sql
CREATE TABLE raw_webhooks (
    received_at DateTime,
    event_id String,
    entity_id UInt64,
    payload String,
    updated_at DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(received_at)
ORDER BY (entity_id, received_at);

CREATE TABLE entity_latest (
    entity_id UInt64,
    payload String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY entity_id;

CREATE MATERIALIZED VIEW entity_latest_mv TO entity_latest AS
SELECT entity_id, payload, updated_at
FROM raw_webhooks;
```
Merges deduplicate rows with the same entity_id, keeping the latest updated_at. Between merges, both rows exist. Use FINAL when you need guaranteed dedup at query time. On large tables, FINAL has a cost. The argMax pattern avoids it by resolving at query time instead. It's especially useful when combined with TTL.
```sql
-- FINAL clause
SELECT entity_id, payload, updated_at
FROM entity_latest FINAL
WHERE entity_id = 123;

-- argMax GROUP BY
SELECT
    entity_id,
    argMax(payload, updated_at) AS payload,
    max(updated_at) AS updated_at
FROM entity_latest
WHERE entity_id = 123
GROUP BY entity_id;
```
Routing
This one is almost too simple to write about, but it's been one of the most useful patterns in practice. One source table, several MVs, each with a WHERE clause, sending data to a different destination with its own retention and sort order.
```sql
CREATE TABLE all_events (
    timestamp DateTime,
    event_type String,
    level String,
    is_audit UInt8,
    payload String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (event_type, timestamp);

CREATE TABLE error_events (timestamp DateTime, level String, payload String)
ENGINE = MergeTree() ORDER BY timestamp TTL timestamp + INTERVAL 90 DAY;

CREATE TABLE metric_events (timestamp DateTime, payload String)
ENGINE = MergeTree() ORDER BY timestamp TTL timestamp + INTERVAL 365 DAY;

CREATE TABLE audit_log (timestamp DateTime, payload String)
ENGINE = MergeTree() ORDER BY timestamp; -- no TTL, keep forever

CREATE MATERIALIZED VIEW errors_mv TO error_events AS
SELECT timestamp, level, payload FROM all_events WHERE level = 'error';

CREATE MATERIALIZED VIEW metrics_mv TO metric_events AS
SELECT timestamp, payload FROM all_events WHERE event_type = 'metric';

CREATE MATERIALIZED VIEW audit_mv TO audit_log AS
SELECT timestamp, payload FROM all_events WHERE is_audit = 1;
```
No aggregation, no transformation. Just filtering. Writes are amplified, and data is duplicated, but each destination has its own sort order and retention policy. Then, each table can be queried directly:
```sql
-- Last 24h of errors
SELECT timestamp, level, payload
FROM error_events
WHERE timestamp > now() - INTERVAL 1 DAY
ORDER BY timestamp DESC;

-- Full audit trail for a specific window
SELECT timestamp, payload
FROM audit_log
WHERE timestamp BETWEEN '2026-01-01' AND '2026-02-01'
ORDER BY timestamp;
```
What went wrong
I want to be specific about these because the ClickHouse docs explain what MVs do, but not what happens when they go sideways. Everything below is something we hit in production in our previous ventures.
JOINs in MVs are expensive and asymmetric. We had an MV joining against a 500-million-row dimension table. Functionally correct, but it fired on every INSERT, thousands of times per minute, re-reading the dimension table each time, and INSERT latency went from 50ms to 2 seconds. We swapped it for dictGet.
The root cause: the MV query sees only the ingested block from the source table, but all the data from any other table in the JOIN. Joining against a dimension table works. Self-JOINing the source table produces empty results for any row whose match was in a previous batch, because both sides of the JOIN only see the current block. Nobody tells you this upfront.
Write amplification compounds. 5 MVs on one source table means 6 writes per INSERT. Each creates parts, and parts need merging. The INSERT doesn't complete until every MV finishes. We ran 7 MVs on our main events table, and during peak ingest, the merge backlog grew without bound. Consolidated to 3 by combining aggregation logic into wider target tables. Backlog stabilised within an hour. The parallel_view_processing setting runs MVs concurrently instead of sequentially, but it doesn't reduce total work, it just rearranges timing.
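The setting is applied per INSERT (or in a settings profile); a sketch against the routing example's `all_events` table:

```sql
-- Run the attached MVs concurrently instead of one after another.
-- Total work is unchanged; only the INSERT's wall-clock latency can improve.
INSERT INTO all_events
SETTINGS parallel_view_processing = 1
VALUES (now(), 'metric', 'info', 0, '{}');
```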
Cascading MVs hide bugs. The source feeds MV A, MV B reads from A's target, and B cascades into MV C. We had this. MV A had a bad WHERE clause that silently dropped 15% of rows for three days. MV C aggregated the corrupted data, and nobody got paged; a customer noticed their dashboard numbers were off. Debugging meant tracing data through every hop. Deep chains do work, though. We have run some at depth 8 in production. But every hop multiplies the debugging surface, and I'll come back to this in the rules.
Schema changes have a gap. The TO clause lets you do ALTER TABLE mv_name MODIFY QUERY to change the SELECT. But if you need to change the target table's schema, engine, ORDER BY, or column types, it's drop-recreate-backfill. Between the drop and the recreate, data goes unprocessed. We've handled this in three ways:
- Backfill after the swap
- Canary upgrade (new MV + new target alongside old, verify, then swap)
- Pausing ingestion and ack downtime
No matter which one is chosen, the backfill must use the same -State functions as the MV. Forgetting this and inserting raw values into an AggregatingMergeTree target produces data corruption. It's also really likely that the reingestion needs to run in batches, something like:
```sql
INSERT INTO target SELECT ... FROM source WHERE timestamp BETWEEN [gap_start] AND [gap_end]
```
Error handling is pick your poison. A runtime error in the MV's SELECT (a type cast, a division by zero) fails the INSERT by default, but the failure can be partial: blocks already written to the target are preserved, and only subsequent blocks are lost. Set materialized_views_ignore_errors = 1 and the source write succeeds, but failing MV rows disappear silently. There's no dead-letter queue either way. We use one or the other depending on the use case.
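A sketch of the second option, using the dedup example's `raw_webhooks` table (the values are placeholders):

```sql
-- With the setting on, the source write succeeds even if an MV errors;
-- failing MV blocks are dropped silently, so pair this with row-count
-- monitoring between source and targets.
INSERT INTO raw_webhooks
SETTINGS materialized_views_ignore_errors = 1
VALUES (now(), 'evt-1', 42, '{}', now());
```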
Source dedup doesn't propagate to MVs. Say your source table is ReplacingMergeTree and your MV does sumState(). The MV fires on every INSERT, including duplicates. Source deduplicates during merges, but by then, the MV already wrote the duplicate. Duplicates get counted twice. The source table looks right after the merges. The MV target is permanently wrong. You can use ReplacingMergeTree on the target too, or argMaxState to keep only the latest version, or design the MV to be idempotent. None of these solutions are clean. It's a tradeoff you have to think through before writing the CREATE statement.
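A hedged sketch of the argMaxState option against the dedup example's `raw_webhooks` source (target and view names are illustrative):

```sql
-- One aggregation state per entity; replayed duplicates collapse to the
-- row with the highest updated_at instead of being double-counted.
CREATE TABLE entity_state (
    entity_id UInt64,
    latest_payload AggregateFunction(argMax, String, DateTime)
)
ENGINE = AggregatingMergeTree()
ORDER BY entity_id;

CREATE MATERIALIZED VIEW entity_state_mv TO entity_state AS
SELECT
    entity_id,
    argMaxState(payload, updated_at) AS latest_payload
FROM raw_webhooks
GROUP BY entity_id;

-- Reads finalise with -Merge; inserting the same webhook twice changes nothing.
SELECT entity_id, argMaxMerge(latest_payload) AS payload
FROM entity_state
GROUP BY entity_id;
```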
When to use projections instead
Projections store an alternative physical sort order (and optionally aggregation) of the same data, inside the same table. The optimiser picks them automatically.
| | Materialized Views | Projections |
|---|---|---|
| Storage | Separate target table | Inline with parent table |
| Aggregation | Full (AggregatingMergeTree) | Limited |
| Transformations | Any SELECT (JOINs, ARRAY JOIN, dictGet) | Simple GROUP BY only |
| Schema changes | Drop + recreate + backfill | ALTER TABLE ADD/DROP PROJECTION |
| Query routing | Explicit (query the target table) | Automatic (optimizer chooses) |
| Write amplification | Separate parts and merges | Merged with parent |
| Backfill | Manual INSERT INTO ... SELECT | ALTER TABLE MATERIALIZE PROJECTION |
| Monitoring | Easy (separate table, direct queries) | Hard (hidden inside table, need EXPLAIN) |
Projections for simple aggregations. MVs for everything else. We use projections on raw tables and MVs for all pre-aggregation, and that split has worked. Projections hit their limits faster than you'd expect, particularly around uniqState/quantileState patterns and routing to different engines.
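For comparison, a minimal projection on the `api_requests` table from the first pattern (a sketch, not our exact production DDL):

```sql
-- Alternative aggregation stored inside the parent table's parts
ALTER TABLE api_requests ADD PROJECTION hourly_by_status (
    SELECT
        toStartOfHour(timestamp) AS hour,
        status_code,
        count()
    GROUP BY hour, status_code
);

-- Build the projection for parts that already exist
ALTER TABLE api_requests MATERIALIZE PROJECTION hourly_by_status;

-- No separate table to query: the optimiser uses the projection automatically
SELECT toStartOfHour(timestamp) AS hour, status_code, count()
FROM api_requests
GROUP BY hour, status_code;
```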
What we monitor
Row count comparison between source and MV targets over the same time window. If they drift, data is being lost or dropped. A SELECT count() comparison running every five minutes catches incidents before they escalate.
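A hedged sketch of such a drift check, using the reshaping pattern's tables (the one-hour window is illustrative):

```sql
-- The enrichment MV writes one row per source row, so counts over the
-- same window should match; any gap means the MV dropped or errored.
SELECT
    (SELECT count() FROM raw_requests
     WHERE timestamp > now() - INTERVAL 1 HOUR) AS source_rows,
    (SELECT count() FROM requests_enriched
     WHERE timestamp > now() - INTERVAL 1 HOUR) AS target_rows,
    source_rows - target_rows AS drift;
```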
system.parts on each MV target. Active part count should plateau, not climb. If it's climbing, the merges can't keep up.
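Something like the following, with the target table names swapped in for your own:

```sql
-- Active part counts per MV target; a steadily climbing number means
-- background merges are falling behind ingest.
SELECT table, count() AS active_parts
FROM system.parts
WHERE active AND table IN ('hourly_endpoint_stats', 'requests_enriched')
GROUP BY table
ORDER BY active_parts DESC;
```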
INSERT latency, measured end-to-end, including all MVs.
system.query_log filtered for MV-related query IDs. That's where errors hide when materialized_views_ignore_errors is on.
The rules
These came from incidents, not from docs.
- Cascading MVs work, but they have a cost. We run chains at depth-8 in production. But a bug at depth 2 silently corrupts everything downstream, and you won't know until someone notices the numbers look off. If you go deep, invest in per-hop monitoring. When you can, prefer two independent MVs on the source over a chain.
- Budget the write amplification. Three MVs on a high-throughput table is already a lot. Seven was too many for us.
- `dictGet` for enrichment, not JOINs.
- Monitor like a pipeline. Row counts, latency, error rates. All of it.
- Get the schema right upfront. Changing it means drop-recreate-backfill, and you don't want to go through that pain.
- `-State`/`-Merge` on AggregatingMergeTree over SummingMergeTree. Summing is simpler, but the moment you need `uniq` or `quantile`, you're rewriting everything.
- Write and test the backfill query the same day you create the MV. You will need it, and you don't want to be figuring out the right `-State` functions at 2am during an incident.
Materialized views are the feature I rely on most in ClickHouse. Also, the one that's caused the worst incidents. The difference between these outcomes is whether you treat them as pipelines or as query shortcuts.
Continue Reading
Originally written for ObsessionDB.
ClickHouse is a registered trademark of ClickHouse, Inc. https://clickhouse.com