Current State with ClickHouse® ReplacingMergeTree
An exchange produces a continuous stream of tick update events: every liquidity change, every reserve adjustment. Batches hit 25MB and arrive less than a second apart, and the data must reach traders in real time.
We need one thing from this firehose: the current state exposed via API. What ticks are active right now, and what are their reserves?
We built this using ReplacingMergeTree and is_deleted, served it through Cloudflare Durable Objects. This post assumes you are familiar with how the engine works. If that's not the case, the ReplacingMergeTree deep dive covers it end-to-end. You can also start from the beginning of the series.
The problem
If you're not from the trading world, a tick is a price point in a trading pair. It holds reserves, the funds available for trading at that price. Traders check ticks to know what prices exist and how much liquidity sits at each one. All the active ticks together make up the order book. Investopedia has a good primer if this is unfamiliar.
Ticks change constantly. A tick gets created when someone provides liquidity. Its reserves update every time a trade passes through it. Reserves drop to zero, and the tick is dead.
The volume is relentless. Thousands of updates per batch, batches arriving sub-second, data piling up. Millions of tick updates are emitted daily, but only a few thousand ticks per pair are active at the same time. Without a way to clean up dead ticks, queries scan too much data. And you can't afford to miss a single batch, since one gap means some ticks never get cancelled, and the state is wrong from that point forward.
Traders don't care about any of this. They need the order book in real-time to make informed decisions.
Everything is append-only. There's no UPDATE statement anywhere in the stack. In Postgres, you'd upsert. INSERT ... ON CONFLICT DO UPDATE. Straightforward enough at low volume. But at this throughput, Postgres becomes a bottleneck. Every UPDATE creates a new tuple version, and the old one is kept until VACUUM cleans it up. With thousands of tick updates per second, autovacuum can't keep pace. Write amplification hits 3-5x. The table bloats. The endpoints break.
We needed something that could swallow a 25MB batch of events in under a second and answer "what's the current state?" without choking. ClickHouse was the answer.
Ingestion pipeline
ClickHouse is not built to handle retries, and missing data is not an option here. A single lost event translates into tick cancellations that never arrive, meaning stale state is served to traders. So we put Google Cloud Pub/Sub in front. Raw events land there first, one message per batch. If the worker fails to process a batch, the message is redelivered; batches that keep failing end up in a dead-letter queue so nothing is ever silently dropped.
The worker picks up each message, extracts tick update events, and writes them into a MergeTree source table first. We keep the raw data queryable by default. When something is off in the tick state, we go back to the source events and debug from there.
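A sketch of that worker, with the caveat that the table name (`tick_events_raw`) and event shape here are illustrative, not the production schema: parsing is kept separate from I/O, the raw events are written over ClickHouse's HTTP interface as JSONEachRow, and any failure throws so the Pub/Sub message stays unacked and gets redelivered.

```typescript
// Hypothetical event shape; the real schema has more fields.
interface TickEvent {
  asset_0: string
  asset_1: string
  tick_index: number
  reserves: string
  updated_at: string
}

// Pure parsing step, separated from I/O so it is easy to test.
export function parseBatch(raw: string): TickEvent[] {
  const events: TickEvent[] = JSON.parse(raw)
  if (!Array.isArray(events)) throw new Error("batch is not an array")
  return events
}

// Insert via ClickHouse's HTTP interface using JSONEachRow.
// Throwing leaves the Pub/Sub message unacked, so it is
// redelivered and, after repeated failures, dead-lettered.
export async function handleMessage(
  raw: string,
  clickhouseUrl: string,
): Promise<void> {
  const events = parseBatch(raw)
  const body = events.map((e) => JSON.stringify(e)).join("\n")
  const res = await fetch(
    `${clickhouseUrl}/?query=${encodeURIComponent(
      "INSERT INTO tick_events_raw FORMAT JSONEachRow",
    )}`,
    { method: "POST", body },
  )
  if (!res.ok) throw new Error(`insert failed: ${res.status}`)
}
```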
From here, materialized views take over. They fire on every INSERT to the source table, transforming and normalizing the data into the ReplacingMergeTree tick_state table. Duplicate inserts are fine. The engine handles deduplication.
The schema
ReplacingMergeTree allows you to insert new versions of a row and let ClickHouse keep only the latest one. No UPDATE, no locking, just append and let background merges deduplicate. It has two relevant params. The first, version, identifies which row is the newest for a given key. Add a second one, is_deleted, and the engine can also physically remove rows you mark as dead. Skip it, and dead rows accumulate until the table balloons and queries take several seconds.
CREATE TABLE tick_state (
asset_0 String,
asset_1 String,
asset_in String,
tick_index Int64,
fee UInt64,
updated_at DateTime64(6),
reserves UInt256,
reserves_zero UInt8,
ingestion_timestamp DateTime64(3),
INDEX idx_ingestion_ts ingestion_timestamp TYPE minmax GRANULARITY 1
) ENGINE = ReplacingMergeTree(updated_at, reserves_zero)
ORDER BY (asset_0, asset_1, asset_in, tick_index, fee)
SETTINGS
index_granularity = 1024,
min_age_to_force_merge_seconds = 21600;
index_granularity = 1024 gives finer granularity than the default 8192. The active dataset is small, and smaller granules mean less data scanned per point query.
min_age_to_force_merge_seconds = 21600 forces merges on parts older than 6 hours, which is what triggers the physical cleanup of is_deleted rows. The 6-hour window is not arbitrary. Delete markers need to be stored long enough for late-arriving data to land. If a batch is ingested out of order and the delete marker for that tick is already gone, FINAL has nothing to resolve against, and the stale row becomes visible. Six hours gives us room for retries and reprocessing without risking ghost ticks.

Without this setting, dead ticks accumulate until ClickHouse decides to merge on its own schedule. Cleanup also requires enabling allow_experimental_replacing_merge_with_cleanup at the server level. The feature is still marked experimental, but it works.
ORDER BY does double duty here. Those five columns uniquely identify a tick, and they're also the deduplication key. When ReplacingMergeTree notices two rows with the same ORDER BY values, it keeps the one with the higher updated_at and deletes the other.
reserves_zero is the is_deleted column (second argument to ReplacingMergeTree). When a tick's reserves drop to zero, we set reserves_zero = 1. ClickHouse treats that as a delete signal and physically removes the row during cleanup merges.
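To make those semantics concrete, here is a small simulation of what a cleanup merge does, written in TypeScript rather than inside ClickHouse, with the row shape simplified: group rows by the ORDER BY key, keep the highest updated_at per key, then physically drop rows flagged reserves_zero = 1.

```typescript
interface Row {
  asset_0: string
  asset_1: string
  asset_in: string
  tick_index: number
  fee: number
  updated_at: number // simplified to a number for this sketch
  reserves: string
  reserves_zero: 0 | 1
}

// Mimics a ReplacingMergeTree cleanup merge: the latest version
// wins per ORDER BY key, and is_deleted rows are dropped entirely.
export function collapse(rows: Row[]): Row[] {
  const latest = new Map<string, Row>()
  for (const r of rows) {
    const key = [r.asset_0, r.asset_1, r.asset_in, r.tick_index, r.fee].join("|")
    const prev = latest.get(key)
    if (!prev || r.updated_at > prev.updated_at) latest.set(key, r)
  }
  return [...latest.values()].filter((r) => r.reserves_zero === 0)
}
```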
Millions of daily tick state changes. A few thousand active ticks. Same table, same queries. That's what is_deleted buys you.
State compute
FINAL performs deduplication at query time. For each ORDER BY key, it keeps the row with the highest updated_at and discards the rest. Since v22.6, FINAL is multi-threaded. The old performance penalty that made people avoid it is mostly gone. The previous post covers how this works under the hood.
SELECT *
FROM tick_state FINAL
WHERE asset_0 = {asset0:String}
AND asset_1 = {asset1:String}
AND reserves_zero = 0
FINAL deduplicates, WHERE reserves_zero = 0 filters out dead ticks. That's it. No GROUP BY, no aggregation functions. Works even before background merges have cleaned up the is_deleted rows, because FINAL resolves duplicates at query time regardless of merge state.
Delta queries
Full-state queries scan all ticks for a trading pair. Fine for initial load. Wasteful when you're polling every second, and most ticks haven't changed.
We add an ingestion_timestamp filter:
SELECT *
FROM tick_state FINAL
WHERE asset_0 = {asset0:String}
AND asset_1 = {asset1:String}
AND ingestion_timestamp > parseDateTimeBestEffort({lastTimestamp:String})
Why ingestion_timestamp and not updated_at? Because ingestion_timestamp reflects when the data arrived at our system, not when the event happened at the source. That's what the polling layer cares about: give me everything you ingested since I last asked.
Without an index on ingestion_timestamp, this query was unusable. The column isn't part of the ORDER BY, so ClickHouse had no way to skip granules and scanned everything. We added a minmax index, and delta queries dropped to under 200ms.
Clients hit the endpoint, fetch full state, then subscribe to a WebSocket stream for deltas. Data is polled every second using the query above. A full refresh is triggered every 3 minutes as a safety net. Belt and suspenders.
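The consuming side of those delta batches can be sketched like this (the key and field names are illustrative): keep a watermark of the highest ingestion_timestamp applied so far, upsert live ticks into an in-memory map, and remove ticks whose update carries reserves_zero = 1.

```typescript
interface TickDelta {
  key: string // e.g. asset_0|asset_1|asset_in|tick_index|fee
  reserves: string
  reserves_zero: 0 | 1
  ingestion_timestamp: number
}

export interface DeltaState {
  ticks: Map<string, TickDelta>
  watermark: number // highest ingestion_timestamp applied so far
}

// Apply one delta batch: upsert live ticks, delete dead ones,
// and advance the watermark used in the next delta query.
export function applyDelta(state: DeltaState, batch: TickDelta[]): DeltaState {
  for (const d of batch) {
    if (d.reserves_zero === 1) state.ticks.delete(d.key)
    else state.ticks.set(d.key, d)
    if (d.ingestion_timestamp > state.watermark) {
      state.watermark = d.ingestion_timestamp
    }
  }
  return state
}
```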
Serving it: Cloudflare Durable Objects
ClickHouse handles storage and deduplication. But it's not a real-time delivery system. Traders need live updates over WebSocket, and you don't want hundreds of connected clients each polling ClickHouse directly.
We put Cloudflare Durable Objects in front, which is a stateful serverless instance: single-threaded, globally unique, in-memory state that persists for the instance's lifecycle. One instance per trading pair.
Each DO holds the current tick state in memory. When the first user connects, it fetches the full history. From this point on, it polls ClickHouse with delta queries every second and forwards them through every open connection. As a safety net, it runs a full refresh every 3 minutes, computes the difference between the old state and the new one, and broadcasts only changes to connected clients.
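The full-refresh path needs a diff between the old snapshot and the new one so only changes go over the wire. A minimal generic version, under the assumption that ticks are keyed by their identity and compared with a caller-supplied equality check:

```typescript
interface Diff<T> {
  upserts: [string, T][] // ticks that are new or whose payload changed
  removals: string[] // ticks present before but gone in the new snapshot
}

// Compare two snapshots keyed by tick identity and emit only what
// changed; `equal` decides whether a tick's payload actually moved.
export function diffState<T>(
  oldState: Map<string, T>,
  newState: Map<string, T>,
  equal: (a: T, b: T) => boolean,
): Diff<T> {
  const diff: Diff<T> = { upserts: [], removals: [] }
  for (const [key, value] of newState) {
    const prev = oldState.get(key)
    if (prev === undefined || !equal(prev, value)) diff.upserts.push([key, value])
  }
  for (const key of oldState.keys()) {
    if (!newState.has(key)) diff.removals.push(key)
  }
  return diff
}
```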
The part I like most about this setup: in-flight fetch deduplication. If 50 clients connect to the same trading pair at the same instant, only one ClickHouse query fires. Everyone else awaits the same promise:
// Serve from cache; clone so each caller gets a readable body
if (this.cached) return this.cached.clone()
if (!this.inflight) {
  this.inflight = this.fetchFromClickHouse()
    .then((response) => {
      this.cached = response
      return response
    })
    .finally(() => {
      // Clear the slot even on failure, or every later request
      // would await a permanently rejected promise
      this.inflight = undefined
    })
}
// Clone here too: a Response body can only be consumed once
return (await this.inflight).clone()
Per-key singleton routing (built into DOs) ensures all requests for the same pair hit the same instance. No thundering herd.
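That routing uses the standard Durable Objects API: idFromName deterministically maps a string to a DO id, so every request for the same pair lands on the same instance. A minimal Worker sketch, where the binding name TICK_STATE and the URL shape are assumptions, with structural types declared locally so the snippet stands alone:

```typescript
// Minimal structural types so the sketch is self-contained; in a
// real Worker these come from @cloudflare/workers-types.
interface DurableObjectStub { fetch(req: Request): Promise<Response> }
interface DurableObjectId {}
interface DurableObjectNamespace {
  idFromName(name: string): DurableObjectId
  get(id: DurableObjectId): DurableObjectStub
}
interface Env { TICK_STATE: DurableObjectNamespace }

// Normalize the pair so "ETH"/"USDC" and "eth"/"usdc" hit the same DO.
export function pairName(asset0: string, asset1: string): string {
  return `${asset0.toLowerCase()}:${asset1.toLowerCase()}`
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url)
    // Assumed URL shape: /pairs/:asset0/:asset1/ws
    const [, , asset0, asset1] = url.pathname.split("/")
    if (!asset0 || !asset1) return new Response("bad pair", { status: 400 })
    // Same name, same id, same instance: per-key singleton routing.
    const id = env.TICK_STATE.idFromName(pairName(asset0, asset1))
    return env.TICK_STATE.get(id).fetch(request)
  },
}
```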
We looked at Redis Pub/Sub early on. DOs won because they bundle state and WebSocket support in one primitive, with request deduplication built in. Redis would have needed us to wire up state management, pub/sub, and connection handling separately. DOs also cost nothing when idle, which matters because some trading pairs are quiet most of the time.
One thing to watch: code deployments disconnect all WebSockets. Clients need reconnection logic. We knew this going in and built for it.
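Reconnection can be as simple as exponential backoff with full jitter; the delay schedule is the part worth getting right. A sketch, with illustrative constants:

```typescript
// Exponential backoff with full jitter: the cap doubles per attempt
// up to maxMs, and the actual delay is a random fraction of the cap.
export function backoffMs(
  attempt: number, // 0-based reconnect attempt counter
  baseMs = 500,
  maxMs = 30_000,
  random: () => number = Math.random,
): number {
  const cap = Math.min(maxMs, baseMs * 2 ** attempt)
  return Math.floor(random() * cap)
}

export function connectWithRetry(
  makeSocket: () => WebSocket,
  onOpen: (ws: WebSocket) => void,
  attempt = 0,
): void {
  const ws = makeSocket()
  ws.addEventListener("open", () => onOpen(ws))
  // Reconnect on close; this covers deploys dropping every socket.
  ws.addEventListener("close", () => {
    setTimeout(
      () => connectWithRetry(makeSocket, onOpen, attempt + 1),
      backoffMs(attempt),
    )
  })
}
```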
The full pipeline
Pub/Sub receives raw batches; a worker parses tick update events and writes them to a MergeTree source table. Failed inserts go to the DLQ for retry. Materialized views transform the data into the ReplacingMergeTree tick_state table on every insert.
Background merges collapse duplicates and keep the row with the latest updated_at. is_deleted rows are physically removed by forced merges once their parts are 6 hours old. OPTIMIZE TABLE ... FINAL CLEANUP is there if we ever need immediate removal, but we've never used it in production.
FINAL queries serve current state, delta queries filtered by ingestion_timestamp feed the streaming layer. Durable Objects hold state in memory per trading pair, poll for deltas every second, and broadcast changes via WebSocket.
25MB batches to live trader-facing data in under a second. Millions of tick updates, a few thousand active ticks. Runs without anyone touching it.
This isn't a novel architecture. Each piece is simple on its own. The work was in making them fit together under production load. The payoff is that the 2am test holds: any engineer on the team can debug any layer without bothering anyone else. That's the point.
Continue Reading
Originally written for obsessionDB. Read the original article here.
ClickHouse is a registered trademark of ClickHouse, Inc. https://clickhouse.com