What Decoupled Storage Changes When You Build on ClickHouse® DB
In the last post in this series, I left you with a loose end. The dictionaries piece ended on a line about how ObsessionDB stores data on object storage, except for dictionaries, which still live in RAM on every node.
This is the next post. And it's bigger than dictionaries.
Here is a table from one of our test clusters:
CREATE TABLE hits_10b
(
WatchID Int64,
EventTime DateTime,
EventDate Date,
CounterID Int32,
UserID Int64,
URL String,
-- ...105 columns total
)
ENGINE = SharedMergeTree
ORDER BY (CounterID, EventDate, UserID, EventTime, WatchID)
SETTINGS storage_policy = 's3'
Ten billion rows, one table, and fifty-four gigabytes stored once. If you have run ClickHouse® at scale on your own hardware, that definition should bother you a little. Where is the _local table? Where is the Distributed table sitting on top of it? Where is the ON CLUSTER? What did we shard on?
None of it is necessary on our infra. That missing machinery is what this post is about.
We have spent this series going from ClickHouse fundamentals to its table engines, its materialized views, and its query patterns. This is the finale, and it does one thing: it re-reads everything you already learned through a single change. Storage and compute are no longer on the same machine. I am going to walk through what it changes for you, the person writing the schemas.
Decoupled storage and compute: the one change everything hangs on
Classic ClickHouse deployments couple storage and compute. Each node owns its data on local disk, and replicas keep their own full copies. Coordination runs through ClickHouse Keeper (or the older ZooKeeper). If you want the table to survive a node dying, you replicate it, and now you have two copies of the bytes and a protocol keeping them in sync.
Decoupled storage and compute breaks that apart. One copy of the data lives in object storage, and stateless compute nodes read from it. Separating storage from compute enables each to scale independently. In ClickHouse Cloud the engine that does this is SharedMergeTree. We built our own version of the same idea, for reasons I will get to at the end.
I am not going to re-explain how it works under the hood. We already wrote those posts. If you want the why, read The hard truth about self-hosting ClickHouse at scale, which walks through the replication math that pushed us off our own clusters. If you want to know how queries stay fast when the data lives on S3, read how we built a stateless distributed cache. This post assumes the architecture and asks a narrower question: what is different when you build on it?
You stop writing replication into your schema
Here is what a sharded, replicated table looks like on self-hosted ClickHouse. You write the storage table:
CREATE TABLE hits_local ON CLUSTER my_cluster
(/* columns */)
ENGINE = ReplicatedMergeTree('/clickhouse/{shard}/hits', '{replica}')
ORDER BY (CounterID, EventDate);
Then you write the table you actually query, which fans requests out across shards:
CREATE TABLE hits ON CLUSTER my_cluster AS hits_local
ENGINE = Distributed(my_cluster, default, hits_local, rand());
Two objects for one logical table. A sharding expression you had to pick (rand() here, often a hash of some column, and the wrong choice gives you skew you will be fixing for months). Every schema change goes out ON CLUSTER and propagates through Keeper, and when one shard applies the change and another doesn't, you get to debug that at an unpleasant hour.
The decoupled version is the CREATE TABLE I opened with. ENGINE = SharedMergeTree. No second table. No ON CLUSTER. No sharding key, because there are no shards to spread across. One copy of the data, one definition, one ALTER when you need to change it.
The thing I want you to notice is that this is not an ops improvement you read about in a release note. It changes what you type, migrations stop being distributed-systems problems, and you write the table the way you would on a laptop, and it runs across a pool of stateless nodes.
Compute-compute separation: a pool per workload
On a coupled cluster, adding capacity means moving data. New shard? Copy a chunk of your dataset onto it and rebalance. At twenty-five terabytes that is measured in days, and you plan around it.
Decoupled, adding a node is a metadata operation. The new node reads the same object storage everyone else reads, so it shows up, registers itself, and starts serving. There is no data to copy because the data was never tied to the node. New nodes come up in seconds. I still find that slightly unreasonable, in a good way, after years of treating capacity changes as projects.
That speed is not the interesting part, though. The interesting part is what it lets you build.
This is the part I would actually design around, and it has a name: compute-compute separation. You split compute into separate pools that all read the same single copy of the data, and you size each pool for the job it does. The real-time API serving layer gets its own nodes, small and always warm, tuned for low-latency lookups. Heavy analytical and ad-hoc queries get a different pool that can burst wide and then shrink back. Ingest gets its own. A new expensive endpoint, or a nightly backfill that used to keep you up, can get dedicated compute that nobody else feels.
They all see the same data. No copies, no second system, no pipeline shuffling rows between an "analytics" cluster and a "serving" cluster. And the failure every ClickHouse builder has hit, where one heavy GROUP BY over ten billion rows starves the path answering your users, stops being a tuning exercise. You hand that query its own compute and leave the serving pool alone.
And the storage stays single. Our ten-billion-row table is fifty-four gigabytes in object storage, full stop. On a three-replica ReplicatedMergeTree setup that is roughly a hundred and sixty gigabytes of the same data, because each replica holds its own copy. Storage scales by adding bytes. Compute scales by adding nodes. They stopped being the same dial.
Everything you learned still works
When people hear "the storage engine is different," they assume their hard-won ClickHouse knowledge is now suspect. It isn't. Decoupling is a change to where bytes live and how compute attaches to them, but it is not a change to the query engine.
ORDER BY and the sparse primary index work exactly as before. Compression codecs, the same. The merge semantics of MergeTree, the same. And every specialized engine you spent this series learning behaves identically: ReplacingMergeTree still deduplicates on merge, AggregatingMergeTree still folds rows into aggregate states, CollapsingMergeTree still cancels pairs, materialized views still fire on insert, projections still give you an alternate sort order. If you internalized the query-optimization post, all of it transfers. None of those mechanics know or care that the parts they operate on happen to live on S3.
What changes is operational, and it changes in your favor. Parts exist once, in shared storage, instead of N times across replicas. Background merges run against that shared copy. Our ten-billion-row table sits in a single active part right now: the merges happened, collapsed thousands of inserts down to one part, and we never scheduled or babysat any of it.
Decoupling moved your data, the column parts, into object storage. It did not move the in-memory layer that keeps ClickHouse fast, and it was never meant to. The sparse primary index built from your sorting key still loads into RAM on the node running the query, exactly as it does self-hosted, and so do the mark and uncompressed caches that sit on the read path. Dictionaries are part of that same layer, just the most visible part: every compute node holds a full in-memory copy of each dictionary, refreshed the way a single self-hosted server would. The reason is latency. dictGet exists for sub-millisecond enrichment, and a round trip to S3 on every lookup would defeat the point, so the dictionary stays in RAM and we eat the cost on every node. That is the teaser I left you with, answered: decoupling changed where your data lives, not where your hot, in-memory structures live.
One objection before someone raises it. Open-source ClickHouse can already put MergeTree data on an s3 disk, and yes, that is real. But that is data sitting on object storage with each replica still owning its own state and coordination. It is not stateless compute over shared storage. Storing parts on S3 and decoupling compute from storage are different things, and the difference is the entire post.
The new cost you have to think about
I would be lying if I told you decoupling is free. It moves work off your plate and puts one new thing on it.
The good news first. Storing once and compressing well adds up. That ten-billion-row table is 5.71 terabytes uncompressed and 54 gigabytes on disk, which is about a hundred-to-one on this dataset (the hits data compresses unusually well, so do not take the exact ratio to the bank). You pay for object storage, which is cheap and durable, and you pay for it once.
Here is the new thing. When data lives on object storage, the first read of a cold part pays for a trip across the network, and the network is now your bottleneck instead of the disk. Our cache absorbs most of that, and it is distributed across every node.
That gap is the thing you now reason about that you did not before. Which data needs to stay hot. What a cold query costs the first time someone runs it. How the cache behaves under your access pattern. The mechanism that keeps hot data hot, our distributed cache, is its own post and I will not repeat it here. As a builder, the habit you pick up is thinking about warmth as a property of your workload.
And the honest limit, this trade is not worth it for everyone. If your dataset is small, static, and fits comfortably on a single node, decoupling adds a cold-read penalty (think five to fifteen milliseconds on a cold query) to buy you elasticity you are not going to use. The model earns its keep at scale, on mixed real-time and analytical workloads, on data that grows and load that spikes. Below that, a plain single-node MergeTree is the right answer and I will tell you so.
What you stop paying is the part people forget to count. No ZooKeeper ensemble to keep alive. No shard rebalancing at 3am. No replica storage multiplying your bill by your replication factor. That is real money and real weekends.
Where this lands
You can probably see where the series was pointing. The cloud-native version of MergeTree, the one this whole funnel has been walking toward, is what we build at ObsessionDB: decoupled storage and compute, the benefits of the SharedMergeTree model, built by people who ran ClickHouse at near-petabyte scale for Numia and got tired of the replication tax.
I want to be straight about the landscape, because overclaiming here is easy and wrong. Decoupling is not our invention and I am not going to pretend it is. What we argue is narrower: we give you the same architecture without ClickHouse Cloud's pricing model or lock-in, hosted in the EU, with the controls operators actually want, at a better price for the performance. On a ten-billion-row benchmark we measured 35% faster queries and up to 9x better performance per dollar than ClickHouse Cloud. Those numbers have their own post; I am citing them, not re-running them here.
If you are running ClickHouse at scale and the replication math in this series felt familiar, we do workload assessments at ObsessionDB. Bring your actual queries and we will show you the side-by-side.
We started this series at ClickHouse fundamentals. We are ending it at how to build on the engine once storage and compute come apart. The short version: decoupling deletes a category of work you used to do (sharding, replication, coordination), adds exactly one habit (think about what stays hot), and leaves everything you learned about modeling and querying completely intact. That is a good trade. Most of the time.
Continue Reading
Originally written for obsessionDB. Read the original article here.
ClickHouse is a registered trademark of ClickHouse, Inc. https://clickhouse.com