CollapsingMergeTree: Current State from Event Streams
The last post covered ReplacingMergeTree, where you can insert a new row, and the background merges clear the old state. Simple writes, expensive reads with FINAL or argMax on every query. This post is part of the ClickHouse Deep Dive series and covers the alternative.
Because there's another way: CollapsingMergeTree. Instead of asking ClickHouse to figure out which row is the latest during merges, you decide at ingestion time by cancelling the old value and inserting a new row. Merges still happen, but the extra information simplifies them. This is not free: the ingestion pipeline has to know the previous state. In exchange, read-time complexity all but disappears.
The problem is the same, calculating the current state in an append-only engine, but the approach is the opposite. The choice comes down to two questions: where do you want the complexity to live (ingestion vs queries), and whether the specific patterns this engine allows are worth the overhead.
The mental model: double-entry bookkeeping
CollapsingMergeTree works like an accounting ledger. Every INSERT is a ledger entry. A state row (Sign = +1) is a credit. A canceled row (Sign = -1) is a debit. During merges, debits cancel credits. Only the net state survives.
To handle this logic, the Sign column is the only engine-specific concept:
- Type: `Int8` (it must support negative values, so not `UInt8`)
- Values: exactly `1` or `-1`; nothing else is valid
- ClickHouse doesn't validate Sign values at `INSERT` time. A `0` or a `2` silently corrupts the merge logic. However, you can enable validation with the `add_implicit_sign_column_constraint_for_collapsing_engine = 1` setting
Here's the table definition:
CREATE TABLE user_activity (
user_id UInt64,
event_time DateTime,
page_views Int32,
duration Int32,
Sign Int8
) ENGINE = CollapsingMergeTree(Sign)
ORDER BY user_id;
Except for the Sign column, everything else is standard MergeTree. Note the signed integer types on page_views and duration to properly handle the row cancellations.
How it works: INSERT, UPDATE, DELETE
All three operations are INSERTs; there's no UPDATE, no DELETE. Just rows with Sign values.
To create, just insert a single state row:
INSERT INTO user_activity VALUES (123, now(), 5, 146, 1);
To update, two inserts are necessary. A cancel row that duplicates all fields of the old state, and a new state row:
INSERT INTO user_activity VALUES
(123, now(), 5, 146, -1), -- cancel old state
(123, now(), 6, 185, 1); -- new state
This is the critical requirement: your application has to be aware of the previous state to cancel it. The cancel row must replicate all sorting key column values exactly. This is the fundamental operational cost of CollapsingMergeTree, and it's the reason most people reach for ReplacingMergeTree first.
To delete, insert only a cancel row:
INSERT INTO user_activity VALUES (123, now(), 6, 185, -1);
As with every other MergeTree engine, merges do the heavy lifting. Collapsing happens only during merges, never at INSERT time. Between merges, both the old state and the cancel row coexist and are visible to every query. And remember that merges only happen within a partition; ClickHouse never cross-checks rows in other physical locations.
The best optimization tip for this table is to always insert the cancel row and the new state row in the same INSERT statement. Same INSERT means same data part, which means they collapse during the next merge of that part. Separate INSERTs create separate parts, and those only collapse when ClickHouse merges those parts together, on a much less predictable timeline.
Query patterns that always work
This is the payoff. This is why you'd choose CollapsingMergeTree over ReplacingMergeTree.
Between merges, a naive SELECT * returns everything: old state, cancel row, new state. Three rows for one logical entity.
But a simple query pattern handles this:
SELECT
user_id,
sum(page_views * Sign) AS page_views,
sum(duration * Sign) AS duration
FROM user_activity
GROUP BY user_id
HAVING sum(Sign) > 0;
For other operations, some query changes are also necessary:
| What you want | What you write |
|---|---|
| Count | sum(Sign) instead of count(*) |
| Sum | sum(column * Sign) instead of sum(column) |
| Average | sum(column * Sign) / sum(Sign) |
| Filter deleted | HAVING sum(Sign) > 0 |
This pattern follows standard GROUP BY with all ClickHouse optimizations: parallel execution, PREWHERE, and index skipping. No special query-time merge logic.
FINAL also works (SELECT * FROM user_activity FINAL), but it carries the same overhead as ReplacingMergeTree FINAL: merge-on-read, disables PREWHERE by default. Multi-threaded, but still adds latency. The whole point of CollapsingMergeTree is that you don't need it.
ReplacingMergeTree vs CollapsingMergeTree
In general, the main tradeoff is about where the complexity lives. With CollapsingMergeTree, ingestion does the heavy lifting; with ReplacingMergeTree, the complexity lands on every query.
The following table contains the main differences.
| Aspect | ReplacingMergeTree | CollapsingMergeTree |
|---|---|---|
| Update pattern | Just insert new version | Cancel old + insert new |
| State knowledge | No need to know previous state | Must know previous state |
| Write volume | 1 row per update | 2 rows per update |
| Query correctness | Requires FINAL or argMax | sum(col * Sign), always correct |
| MV compatibility | Only uniq()-like aggregations | Full sum/count support |
| Sorting key changes | Cannot change ORDER BY columns | Cancel old key, insert new key |
| Ease of use | Simple | Harder to get right |
In practice, there are only two advantages that can justify the overhead of maintaining a stateful ingestion pipeline, so let's look at each.
The sorting key advantage
With ReplacingMergeTree, ORDER BY columns are immutable. They are the deduplication key. If you change them, the engine treats the row as a completely new entity. To "update" a sorting key column, you need mutations (expensive) or manual delete + re-insert.
CollapsingMergeTree handles this naturally. Cancel the old row with the old sorting key values, insert a new row with the new values. Both are regular INSERTs.
Say a user changes their region, which is part of ORDER BY:
-- Cancel old state
INSERT INTO users VALUES ('user_123', 'US-East', 100, -1);
-- New state with a different region
INSERT INTO users VALUES ('user_123', 'EU-West', 100, 1);
During the merge, the canceled row collapses the old US-East entry. The EU-West entry persists. ReplacingMergeTree cannot do this without mutations.
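Assuming the `users` table is shaped like the INSERTs above (`user_id`, `region`, `balance`, `Sign`; the column names are assumptions), the standard Sign-aware query confirms the move without waiting for a merge:

```sql
SELECT
    user_id,
    region,
    sum(balance * Sign) AS balance
FROM users
GROUP BY user_id, region
HAVING sum(Sign) > 0;
-- Only the ('user_123', 'EU-West') row survives; the US-East group nets to zero
```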
The materialized view advantage
A SummingMergeTree MV on top of CollapsingMergeTree naturally handles counts and sums:
CREATE MATERIALIZED VIEW active_users
ENGINE = SummingMergeTree()
ORDER BY (region)
AS SELECT
region,
sum(Sign) AS user_count
FROM users
GROUP BY region;
When a user is added (Sign=1), user_count increments. When cancelled (Sign=-1), it decrements. The SummingMergeTree target sums these during merge. Always correct.
ReplacingMergeTree MVs can't do this. They see every INSERT as a new row. A materialized view counting rows would count every version of every row. Only uniq()-like functions, which naturally deduplicate, produce correct results.
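As a sketch of the failure mode (the `users_replacing` source table and view name are hypothetical), a counting MV over a ReplacingMergeTree source drifts upward with every version:

```sql
-- Hypothetical ReplacingMergeTree source: each update is just another INSERT
CREATE MATERIALIZED VIEW user_counts_broken
ENGINE = SummingMergeTree()
ORDER BY region
AS SELECT
    region,
    count() AS user_count   -- counts versions, not users
FROM users_replacing
GROUP BY region;

-- Insert a user, then update them twice: user_count for that region reads 3, not 1.
-- The MV saw three INSERTs; deduplication in the source happens too late to help.
```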
If you're building real-time dashboards that need running counts or revenue totals, this can become the deciding factor.
Decision-making guide
Usually, you can use CollapsingMergeTree when:
- Your pipeline has access to the previous state (CDC with before/after images)
- You need MV-compatible counts and sums
- Sorting key columns can change
- You want to avoid `FINAL` overhead entirely
- High delete volume (native support vs `is_deleted` cleanup)
Conversely, ReplacingMergeTree is a better option when:
- Your pipeline only knows the current state (no before/after)
- Simplicity matters more than query performance
- Low update frequency, so 2x write amplification isn't justified
- You don't need count/sum MVs
The question isn't which engine is better. It's what your pipeline looks like. If you can access the previous state without adding complexity, CollapsingMergeTree is the natural fit. If your source only emits the current state, constructing cancel rows means querying ClickHouse on every write. That's usually not worth it.
In the last post's case study, we chose ReplacingMergeTree because the exchange didn't emit before/after images; it only emitted the current state. Building a CollapsingMergeTree pipeline would have meant looking up the old row on every write, which added latency and complexity we didn't need.
VersionedCollapsingMergeTree: the production engine
Plain CollapsingMergeTree has a problem. The collapsing algorithm depends on state rows appearing before their corresponding cancel rows in sorted order. When data arrives out of order (parallel ingestion threads, a distributed producer, network retries), the pairing breaks. Rows that should collapse remain uncollapsed indefinitely.
This isn't a theoretical concern; it's what happens in almost every real production system.
VersionedCollapsingMergeTree fixes it with a Version column:
CREATE TABLE order_events (
order_id UInt64,
status LowCardinality(String),
amount Decimal64(2),
updated_at DateTime64(3),
Sign Int8,
Version UInt64
) ENGINE = VersionedCollapsingMergeTree(Sign, Version)
ORDER BY (status, order_id);
The pairing mechanism is different:
- `CollapsingMergeTree` pairs rows by adjacent position in sorted order: same sorting key, opposite Sign, next to each other in the stream. Order matters.
- `VersionedCollapsingMergeTree` pairs rows by primary key + version + opposite Sign. A cancel row with Version=3 only pairs with a state row with Version=3. They can arrive in any order.
The INSERT pattern just adds the new column:
-- Initial state (Version 1)
INSERT INTO order_events VALUES (42, 'pending', 99.99, now(), 1, 1);
-- Update: cancel Version 1, insert Version 2
INSERT INTO order_events VALUES
(42, 'pending', 99.99, now(), -1, 1), -- cancel (same Version)
(42, 'shipped', 99.99, now(), 1, 2); -- new state (higher Version)
The cancel row uses the same Version as the row being cancelled. The new state row uses a higher Version. This is how the engine knows which cancel matches which state.
The version column has some requirements; the value must increase over time. Timestamps, auto-incrementing integers, or source database LSN (log sequence number) all work. Random UUIDs do not. If the Version column isn't in the primary key, ClickHouse adds it implicitly as the last field.
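When the source doesn't hand you an LSN, a millisecond timestamp is a common stand-in. A sketch (note the caveat: two versions of the same key within the same millisecond would collide and break pairing):

```sql
-- Derive Version from the event timestamp at insert time
INSERT INTO order_events VALUES
    (42, 'pending', 99.99, now(), 1, toUnixTimestamp64Milli(now64(3)));
```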
Query patterns are identical to CollapsingMergeTree. sum(col * Sign) with HAVING sum(Sign) > 0. The Version column is used internally by the engine for pairing but doesn't typically appear in analytical queries.
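For the `order_events` table above, the same pattern yields current totals:

```sql
SELECT
    order_id,
    sum(amount * Sign) AS amount
FROM order_events
GROUP BY order_id
HAVING sum(Sign) > 0;
-- One row per live order, correct even before the cancel pairs collapse
```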
The main rule is that if you're considering CollapsingMergeTree, use VersionedCollapsingMergeTree. Plain Collapsing only works for single-threaded ingestion from a strictly ordered source. In practice, that's rarely the case and would require a migration if more threads are added.
Production patterns
Two patterns that show up most in production.
CDC from Postgres
CDC tools like Debezium emit before/after images from Postgres WAL. Each UPDATE produces both the old row state and the new row state. This maps directly to CollapsingMergeTree:
- `UPDATE`s: old state = cancel row (Sign = -1), new state = state row (Sign = 1)
- `DELETE`s: cancel row only
- `INSERT`s: state row only
CREATE TABLE replicated_users (
id UInt64,
name String,
email String,
region LowCardinality(String),
balance Decimal64(2),
updated_at DateTime64(3),
Sign Int8,
Version UInt64 -- source DB's transaction LSN
) ENGINE = VersionedCollapsingMergeTree(Sign, Version)
PARTITION BY id % 16  -- entity-stable: a row and its cancel always share a partition
ORDER BY (region, id);
The source database's LSN is the natural Version column. No lookup needed: the pipeline already has both states.
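As a sketch with made-up values, a single Debezium-style UPDATE event (before image, after image, and the WAL LSN) becomes one two-row INSERT:

```sql
-- before image: balance 50.00, written at LSN 9000; after image: balance 75.00 at LSN 9001
INSERT INTO replicated_users VALUES
    (1, 'Alice', 'a@example.com', 'EU-West', 50.00, '2024-01-15 10:00:00', -1, 9000), -- cancel: before image, old LSN
    (1, 'Alice', 'a@example.com', 'EU-West', 75.00, '2024-01-15 10:05:00',  1, 9001); -- state: after image, new LSN
```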
Real-time dashboards with SummingMergeTree MVs
CollapsingMergeTree pairs naturally with SummingMergeTree materialized views. Every canceled row subtracts, every state row adds. The MV stays correct without FINAL, without argMax, without any query-time deduplication.
Using the same user_activity table, a SummingMergeTree MV can track total page views and active user counts per day:
CREATE MATERIALIZED VIEW daily_activity
ENGINE = SummingMergeTree()
ORDER BY date
AS SELECT
toDate(event_time) AS date,
sum(page_views * Sign) AS total_page_views,
sum(Sign) AS active_users
FROM user_activity
GROUP BY date;
When a user's activity is updated, the cancel row decrements the old page views and the active user count. The new state row adds the updated values back. The dashboard query is a plain GROUP BY:
SELECT
date,
sum(active_users) AS active_users,
sum(total_page_views) AS total_page_views
FROM daily_activity
GROUP BY date;
Correct at all times, regardless of merge state.
Common pitfalls
Every one of these comes up in production. Some are obvious. Some are quiet enough to go unnoticed for months.
Missing cancel rows. This is obvious, but a silent killer. If your app inserts a new state without cancelling the old one, both persist. sum(page_views * Sign) double-counts. No error, no warning. It accumulates over time. You notice when the numbers look wrong, and by then, you have a bunch of corrupted aggregations.
-- Forgot the cancel row:
INSERT INTO user_activity VALUES (123, now(), 6, 185, 1);
-- Now two state rows exist:
-- (123, ..., 5, 146, 1) old state, never cancelled
-- (123, ..., 6, 185, 1) new state
-- sum(page_views * Sign) = 5 + 6 = 11 (should be 6)
Cancel row doesn't match sorting key. The cancel row must exactly replicate all ORDER BY column values from the original state. Different sorting keys mean different logical rows. They never collapse.
-- State: ORDER BY (user_id, event_date)
INSERT INTO events VALUES (123, '2024-01-01', 100, 1);
-- Wrong cancel (different event_date):
INSERT INTO events VALUES (123, '2024-01-02', 100, -1);
-- These NEVER collapse
Cross-partition cancellation. State row in the January partition, cancel row in the February partition. They will never collapse. They live in different parts of the storage, and ClickHouse will never merge them. Design partitioning so all state changes for the same entity land in the same partition.
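One way to guarantee that alignment (a sketch, not the only option): partition on something derived from the entity key rather than on a timestamp that changes between versions. The table name here is hypothetical:

```sql
CREATE TABLE user_activity_bucketed (
    user_id UInt64,
    event_time DateTime,
    page_views Int32,
    duration Int32,
    Sign Int8
) ENGINE = CollapsingMergeTree(Sign)
PARTITION BY user_id % 16   -- every version of a user lands in the same partition
ORDER BY user_id;
```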
UInt columns with Sign multiplication. This is why I used Int32 instead of UInt32 in the table definition earlier. sum(UInt32_column * Sign) on a cancel row produces an unsigned integer underflow: a very large number instead of a negative one. Use signed types for any column multiplied by Sign.
Using count(*) or sum(column) directly. Between merges, these return inflated results. Always sum(Sign) for counts, sum(column * Sign) for sums. There is no shortcut.
Expecting instant cancellation. Collapsing happens during background merges, not at INSERT time. Between inserts, both rows exist. OPTIMIZE TABLE ... FINAL forces a merge but is expensive. Design your queries to work regardless of the merge state, and this stops being a problem.
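If you do need collapsed data on disk right now (tests, one-off backfills), the forced merge looks like this:

```sql
OPTIMIZE TABLE user_activity FINAL;
-- Rewrites all parts; expensive on large tables, so don't build workflows around it
```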
These pitfalls are all avoidable. The cost is discipline: your write pipeline must be meticulous about cancel rows, sorting key values, and partition alignment. That's the CollapsingMergeTree contract. Correctness at write time, freedom at query time.
Wrapping up
ReplacingMergeTree and CollapsingMergeTree solve the same problem. Replacing is simpler to write, harder to query. Collapsing is harder to write, trivial to query. Your pipeline characteristics determine the right choice: whether you have access to the previous state, whether you need MV-compatible aggregations, and whether sorting keys change.
If your CDC tool provides before-and-after images, the mapping is direct. If your source only knows the current state, stick with ReplacingMergeTree and accept the FINAL overhead.
Next in the series, we'll look at projections and how they change the performance equation for both engines.
Originally written for obsessionDB.
ClickHouse is a registered trademark of ClickHouse, Inc. https://clickhouse.com