Data Modelling in ClickHouse® DB: Engines, Tables & MVs
In the last ClickHouse post, I explored the internals, including columnar storage, MergeTree's handling of parts and granules, and the role of the sparse index. I recommend starting there if you are not familiar with ClickHouse fundamentals.
If you are still here, that means you understand the engine basics, so let's talk about data modelling in ClickHouse.
ClickHouse has several engine families, each one containing its own set of engines. This article focuses on the MergeTree family, which covers almost every use case. I'll cover the others in a future post.
The docs cover each one on its own page. What nobody tells you is how to choose between them, and how your engine choice, ORDER BY, and PARTITION BY interact with each other. I had to figure this out the hard way, running ClickHouse at petabyte scale for analytics at Numia and ObsessionDB. This post is the guide I wish someone had handed me three years ago.
The mental model that makes everything click
Every engine in the MergeTree family shares the same storage layer. Same parts on disk, same granules, same sparse indexes, same compression. All of it is identical.
The only thing that changes is what happens during background merges. That's a big deal, and it doesn't happen synchronously: merges run in the background when the server has spare capacity, i.e. when CPU and I/O aren't tied up by ingestion or by reads from external queries.
MergeTree just consolidates parts. ReplacingMergeTree deduplicates rows that share the same ORDER BY key during merges, keeping the latest version. SummingMergeTree sums numeric columns. AggregatingMergeTree combines partial aggregate states. VersionedCollapsingMergeTree cancels out old state with +1/-1 sign pairs.
That's it. Same storage, different merge behaviour. This means that picking an engine is essentially answering one question: how does your data change over time?
| Your data pattern | Engine | Merge behaviour |
|---|---|---|
| Append-only (logs, events, clicks) | MergeTree | Consolidate parts. Nothing special. |
| Rows get logically updated (CDC, profiles) | ReplacingMergeTree | Deduplicate by ORDER BY key, keep latest version |
| Needs pre-aggregated sums/counts | SummingMergeTree | Sum numeric columns with matching keys |
| Needs pre-aggregated anything (avg, uniq) | AggregatingMergeTree | Combine partial aggregate states |
| Tracks state changes (any insert order) | VersionedCollapsingMergeTree | Cancel old state via sign column, matched by version |
I skipped CollapsingMergeTree because it requires strict insertion order, which means in practice you would almost always use VersionedCollapsingMergeTree instead. The other missing one is GraphiteMergeTree, which is interesting only if you run Graphite. SummingMergeTree is in the table above but doesn't get its own section because it's AggregatingMergeTree's simpler sibling: it auto-sums numeric columns on merge, with no State/Merge syntax, but it's limited to sums. Cloudflare uses it for HTTP analytics at 6M requests/second, so it works at serious scale. If you need avg, uniq, or anything else, go straight to AggregatingMergeTree.
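For reference, here's a minimal SummingMergeTree sketch. The table and column names are made up, loosely in the spirit of the Cloudflare use case:

```sql
-- Hypothetical rollup: request counts and bytes per path, per hour.
CREATE TABLE http_stats
(
    hour DateTime,
    path String,
    requests UInt64,
    bytes UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (path, hour);

-- Rows sharing (path, hour) get their numeric columns summed on merge.
-- Merges are eventual, so always read with an explicit aggregation:
SELECT path, hour, sum(requests) AS requests, sum(bytes) AS bytes
FROM http_stats
GROUP BY path, hour;
```

No AggregateFunction columns, no State/Merge functions: plain values in, plain sums out.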
Here's the thing most people don't expect: 99% of production tables use a plain MergeTree, a ReplacingMergeTree, or an AggregatingMergeTree. Three engines cover almost every use case.
Materialised views
ClickHouse MVs aren't the periodic batch refreshes you know from PostgreSQL. They're AFTER INSERT triggers. Every time rows land in a source table, the MV's SELECT runs on just those new rows (careful: the query sees only the freshly inserted block, not the rest of the table) and writes the results to a target table. Zero disk reads on the source. Only the INSERT RAM buffer.
The typical pattern: raw events land in a source table, an MV aggregates on the fly, and the results go into an AggregatingMergeTree or SummingMergeTree target. You define the rollup once, and every insert maintains it. I'll show a worked example in the AggregatingMergeTree section below.
ORDER BY and PARTITION BY
You can pick the perfect engine and still have terrible performance if your ORDER BY is wrong. I've seen it happen. It happened to me.
Data is physically sorted on disk by ORDER BY. The sparse index has one entry every 8,192 rows (default value, can be adjusted), and it depends on the sort order. If your first ORDER BY column is a timestamp (billions of unique values), the sparse index can barely skip anything when you filter by customer_id. Your "fast analytics database" ends up scanning gigabytes it didn't need to touch.
CloudQuery published a great post-mortem about exactly this. They sorted by timestamp first. Customer queries scanned 10x more data than necessary. They moved the customer_id before the timestamp, and the problem disappeared. Same cluster, same data.
The rule: order by the columns your queries filter on, lowest cardinality first, highest cardinality last.
ORDER BY (tenant_id, event_type, toStartOfHour(timestamp), timestamp)
Now, in plain MergeTree, that's all ORDER BY does. Performance optimisation. But the specialised engines add a second job.
In ReplacingMergeTree, ORDER BY defines which rows are considered duplicates. Same ORDER BY values = same row, and the merge keeps the latest version. In AggregatingMergeTree, ORDER BY defines the aggregation key. Rows with matching ORDER BY values get their aggregate states combined during merges.
This creates a tension. Say you have a ReplacingMergeTree with rows identified by (tenant_id, entity_type, entity_id). Queries almost always filter by tenant_id and sometimes entity_type, but never by entity_id. You need entity_id in ORDER BY for dedup to work, but it's wasting space in the sparse index.
Solution: make PRIMARY KEY a prefix of ORDER BY.
CREATE TABLE entities (
tenant_id UInt32,
entity_type LowCardinality(String),
entity_id String,
data String,
updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY (tenant_id, entity_type, entity_id)
PRIMARY KEY (tenant_id, entity_type);
Data gets sorted by all three columns on disk (good for dedup). Only the first two get into the sparse index (good for RAM and query performance). Most people don't even know that PRIMARY KEY and ORDER BY can be different in ClickHouse.
ClickHouse never merges parts across partition boundaries. That sentence sounds boring, but it has teeth.
For ReplacingMergeTree: old version of a row in the January partition, updated version in the February partition? They never get deduplicated. Your row exists twice forever. So partition by something stable, like creation month, not updated_at.
Same for AggregatingMergeTree. Partial aggregate states in different partitions stay separate during merges. Your queries need to account for this (GROUP BY with -Merge functions handles it, but you need to know it's there).
Aim for 1-300 GB per partition for MergeTree, 400 MB-40 GB for the specialised engines where merges need to work harder. One common ClickHouse mistake I see in production is PARTITION BY customer_id on a table with millions of customers. Instant death spiral.
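To make the sizing advice concrete, here's the shape I'd reach for by default. A sketch with made-up names; monthly granularity assumes a month of your data lands in the GB range:

```sql
-- Monthly partitions on an append-only events table.
CREATE TABLE events
(
    tenant_id UInt32,
    event_type LowCardinality(String),
    timestamp DateTime,
    payload String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (tenant_id, event_type, timestamp);

-- Anti-pattern: PARTITION BY tenant_id with millions of tenants creates
-- millions of tiny partitions that merges can never consolidate.
```

If a month is too big, step down to toStartOfWeek or toDate; if it's too small, use toYear.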
ReplacingMergeTree for people coming from PostgreSQL
In PostgreSQL, you write UPDATE users SET name = 'Alice' WHERE id = 1, and it's done. Atomic. Consistent. You query the table and see the new value.
ClickHouse doesn't work that way.
You insert a new row with the updated values. At some point, potentially minutes or even hours later, ClickHouse runs a background merge and deduplicates: for the same ORDER BY key, it keeps the row with the higher version column. Until that merge happens, both the old and new rows are visible. This isn't a bug; it's the architecture by design. ClickHouse chose write speed over read consistency, and when analysing billions of rows, that tradeoff makes sense.
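Here's a sketch of that behaviour with a hypothetical users table:

```sql
CREATE TABLE users
(
    id UInt64,
    name String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id;

INSERT INTO users VALUES (1, 'Alicia', '2024-01-01 00:00:00');
INSERT INTO users VALUES (1, 'Alice',  '2024-02-01 00:00:00');

-- Until a background merge runs, SELECT * FROM users WHERE id = 1
-- can return BOTH rows. After the merge, only 'Alice' survives,
-- because its updated_at version is higher.
```

You can force the merge with OPTIMIZE TABLE users FINAL to see the dedup happen, but never rely on that in production query paths.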
But your queries have to deal with it.
-- FINAL forces dedup at query time
SELECT * FROM entities FINAL WHERE tenant_id = 1;
-- Or do it yourself
SELECT entity_id, argMax(data, updated_at) AS data
FROM entities WHERE tenant_id = 1
GROUP BY entity_id;
FINAL used to be painfully slow (a 10x penalty was common), but since v22.6 it runs multi-threaded. Separately, setting do_not_merge_across_partitions_select_final = 1 reduced overhead by 7x in Altinity's tests. We use FINAL in production now, but benchmark it against the argMax approach on your own workload before committing to either.
Two more things to remember. Dedup only happens within a partition (I mentioned this above, but it's worth repeating for ReplacingMergeTree specifically because this is where it bites hardest). And ORDER BY columns can't be changed through replacement. They ARE the row identity. Want to change a user's ID? Cancel the old row, insert a new one.
Since v23.3 there's an is_deleted column too: ENGINE = ReplacingMergeTree(updated_at, is_deleted). Clean soft deletes without hacks.
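The pattern looks roughly like this. A sketch with made-up names; exactly when deleted rows disappear depends on your version and merge settings:

```sql
CREATE TABLE entities_v2
(
    entity_id String,
    data String,
    updated_at DateTime,
    is_deleted UInt8
)
ENGINE = ReplacingMergeTree(updated_at, is_deleted)
ORDER BY entity_id;

-- Soft delete: insert a newer version flagged with is_deleted = 1.
INSERT INTO entities_v2 VALUES ('abc', '', now(), 1);

-- In recent versions, FINAL filters out rows whose latest version is
-- flagged deleted; merges eventually remove them physically
-- (OPTIMIZE ... FINAL CLEANUP forces it).
SELECT * FROM entities_v2 FINAL;
```

No sign columns, no tombstone hacks: the delete is just another versioned insert.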
I'll go much deeper on ReplacingMergeTree in a dedicated post in two weeks. CDC pipelines from PostgreSQL, version column strategies, FINAL optimisation. There's enough material for its own article.
AggregatingMergeTree
This one I want to explain through the problem it solves.
At Numia, we have a dashboard that shows daily transaction counts and unique wallets per blockchain. The raw blockchain_events table has billions of rows. Computing count(*) and uniq(wallet_address) grouped by day and chain on every page load? Not an option.
AggregatingMergeTree lets you store partial aggregate states instead of raw values. When parts merge, ClickHouse combines those states. You pay the aggregation cost once at write time, then reads are cheap.
CREATE TABLE daily_chain_stats (
day Date,
chain_id UInt16,
tx_count AggregateFunction(sum, UInt64),
unique_wallets AggregateFunction(uniq, String)
)
ENGINE = AggregatingMergeTree()
ORDER BY (chain_id, day);
The column type AggregateFunction(sum, UInt64) is strange if you haven't seen it before. It doesn't store a number. It stores an intermediate sum state, a blob that knows how to merge with other blobs of the same type. You write with sumState(), you read with sumMerge(). If you SELECT without the -Merge function, you get binary garbage. Not an error. Just nonsense.
For simple stuff (sum, max, min, any) there's SimpleAggregateFunction, which stores plain values and queries normally. Skip the State/Merge dance when you can.
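A quick sketch of the SimpleAggregateFunction variant, with hypothetical columns in the same spirit as the stats table above:

```sql
CREATE TABLE daily_chain_maxima
(
    day Date,
    chain_id UInt16,
    max_gas SimpleAggregateFunction(max, UInt64),
    total_fees SimpleAggregateFunction(sum, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (chain_id, day);

-- No maxState()/maxMerge() dance: insert plain values, and read with
-- ordinary aggregate functions plus a GROUP BY (merges are eventual,
-- so rows with the same key may not be combined yet).
SELECT chain_id, day, max(max_gas) AS max_gas, sum(total_fees) AS total_fees
FROM daily_chain_maxima
GROUP BY chain_id, day;
```

The tradeoff: it only works for functions whose partial state is just a plain value (sum, max, min, any and friends), which is exactly why uniq needs the full AggregateFunction machinery.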
Now, you're not going to manually insert aggregate states. That's where materialised views come in:
CREATE MATERIALIZED VIEW daily_chain_stats_mv TO daily_chain_stats AS
SELECT
toDate(block_timestamp) AS day,
chain_id,
sumState(toUInt64(1)) AS tx_count,
uniqState(wallet_address) AS unique_wallets
FROM blockchain_events
GROUP BY chain_id, day;
-- Reading the pre-aggregated data in the resulting table
SELECT day, chain_id,
sumMerge(tx_count) AS tx_count,
uniqMerge(unique_wallets) AS unique_wallets
FROM daily_chain_stats
GROUP BY chain_id, day;
Altinity benchmarked this pattern and measured a 900x query speedup with the target table 30,000x smaller than the source. Their specific workload, but we see similar ratios.
ClickHouse data modelling mistakes I've seen (and made)
The worst one: a team had an MV doing GROUP BY aggregation writing into a plain MergeTree target table. The MV correctly grouped each INSERT batch, but MergeTree's merge behaviour just concatenates parts. No aggregation. So the target table accumulated duplicate keys that never got combined, and nobody noticed for weeks because the numbers were "close enough" until they weren't. If your MV aggregates, the target needs to be SummingMergeTree or AggregatingMergeTree.
PARTITION BY toYYYYMM(updated_at) on ReplacingMergeTree. Classic. Row gets updated in February, updated_at crosses the month boundary, old version stays in January forever. Partition by something that doesn't change.
ORDER BY that serves the engine but ignores queries. ORDER BY (entity_id) for ReplacingMergeTree dedup, but every query filters by tenant_id. Dedup works perfectly; queries scan the whole table. Put your query filters first in ORDER BY and the engine's identity columns after. You can even add a data skipping index if you can't change the ORDER BY, but you need to do something.
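If changing the ORDER BY isn't an option, a data skipping index is one fallback. A sketch: idx_tenant is a name I made up, and whether a bloom filter actually skips granules depends on how tenant_id values are distributed across them:

```sql
-- Skip index on a column that isn't a useful prefix of ORDER BY.
ALTER TABLE entities
    ADD INDEX idx_tenant tenant_id TYPE bloom_filter GRANULARITY 4;

-- The index only applies to newly written parts;
-- backfill the existing ones explicitly:
ALTER TABLE entities MATERIALIZE INDEX idx_tenant;
```

Check with EXPLAIN indexes = 1 that the index actually prunes granules for your queries; a skip index on randomly scattered values can end up reading everything anyway.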
People also forget that merges in the MergeTree family are eventual. If you SELECT * FROM daily_stats expecting a clean result, you'll see multiple rows per key with partial states that haven't been merged yet. Always query with GROUP BY and -Merge functions.
Cloud-native and Replicated MergeTrees
Everything above about engine selection and table design applies regardless of how you deploy ClickHouse. What changes is the operational layer.
In self-hosted clusters, every engine has a Replicated variant: ReplicatedMergeTree, ReplicatedReplacingMergeTree, ReplicatedAggregatingMergeTree. They use ClickHouse Keeper (or ZooKeeper in older setups) to coordinate replication across nodes. The data modelling stays identical. You prefix the engine name with Replicated and add two parameters for the Keeper path and replica name.
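The DDL change is mechanical. Something like this, using the conventional {shard} and {replica} macros (the Keeper path shown is just the usual convention, not a requirement):

```sql
CREATE TABLE events_local
(
    tenant_id UInt32,
    timestamp DateTime,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (tenant_id, timestamp);
```

Everything about ORDER BY, PARTITION BY, and engine choice carries over unchanged; only the two replication parameters are new.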
Where it gets expensive is operations. A typical production cluster runs N shards with X replicas each, plus a distributed table on top to route queries. One logical table becomes N*X physical tables plus the routing layer. Keeper health, replication queue lag, shard weights, and failover. It works at scale, and companies run it in production. But the surface area is huge. If you are interested, you can learn more in this blog post I wrote a few weeks ago.
SharedMergeTree replaces all of this. Data goes to shared object storage, compute is stateless, you write ENGINE = MergeTree(), and it works. No Keeper. No replica management. No shard topology. Same data modelling decisions, less operational noise.
For the most cost-effective and simple setup, check out ObsessionDB. It's a managed ClickHouse service that uses shared object storage to provide a serverless ClickHouse experience.
Continue Reading
Originally written for obsessionDB. Read the original article here.
ClickHouse is a registered trademark of ClickHouse, Inc. https://clickhouse.com