ClickHouse® Database: What It Is, How It Works, and When It Falls Apart
Tesla has ingested over a quadrillion rows into ClickHouse. Cloudflare processes 11 million rows per second. Netflix pushes roughly 5 petabytes of logs through it daily.
These numbers are real. The companies have no reason to exaggerate.
At Numia, we run ClickHouse at 100 billion+ rows serving 10 million requests per day. I've spent enough time debugging queries, optimizing schemas, and learning the hard way what works and what doesn't. This post is what I wish someone had explained before I started.
Most "what is ClickHouse" articles stop at the marketing pitch. This one goes deeper. By the end, you'll understand WHY ClickHouse is architected the way it is, WHEN to reach for it, and WHERE it falls apart.
Why ClickHouse Exists
In 2009, Yandex had a problem. Their web analytics platform, Yandex.Metrica, was processing 200 million events per day. By 2016, that number hit 25 billion events daily. The challenge wasn't storage, it was real-time analytics on non-aggregated data.
Traditional solutions failed at this scale. OLAP cubes required pre-aggregation, which killed query flexibility. Row-based databases like PostgreSQL couldn't scan billions of rows fast enough for interactive dashboards.
The numbers that broke everything: 374 servers, 20.3 trillion rows, 2 petabytes compressed. Report load times averaged 26 seconds. After migrating to ClickHouse: 0.8 seconds. Same reports, 32x faster. Not from better hardware, but from better architecture.
Two decisions made this possible:
- Columnar storage: Read only the data your query needs
- Vectorized execution: Process data in batches, not one row at a time
The rest of ClickHouse's design follows from these two ideas.
Columnar Storage
The difference between row-based and columnar storage is the single biggest factor in query performance.
Consider a simple table:
CREATE TABLE events (
timestamp DateTime,
user_id UInt64,
event_type String
)
Row-based storage (PostgreSQL, MySQL):
[timestamp1, user_id1, event_type1], [timestamp2, user_id2, event_type2]...
Every row lives together. Reading one column means reading them all.
Columnar storage (ClickHouse):
[timestamp1, timestamp2...], [user_id1, user_id2...], [event_type1, event_type2...]
Each column is stored separately. Reading one column means reading only that column.
This matters because of how analytics queries work. "How many events per day?" only needs the timestamp column. Row-based databases read 100% of the data. Columnar reads ~33% (one of three columns). A table with 50 columns where your query needs 3? Row-based reads everything. Columnar reads 6%.
The flip side: fetching a single complete row is worse. A row-store reads one contiguous block. A column-store reads from every column file separately, one access per column. With 50 columns, that's 50 reads instead of 1. This is why columnar databases are terrible for SELECT * WHERE id = 123 workloads.
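To make the contrast concrete, here is a sketch of both query shapes against the events table above (row counts and the filtered id are illustrative):
-- Columnar-friendly: touches only the timestamp column, roughly a third of the table's bytes
SELECT toDate(timestamp) AS day, count() AS events
FROM events
GROUP BY day;
-- Columnar-hostile: SELECT * on one row still touches every column file
SELECT * FROM events WHERE user_id = 123;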
Engines And MergeTree
MergeTree is where ClickHouse does most of its work. The name is literal: data arrives in parts, and a background process merges them.
The hierarchy matters:
- Table: Collection of parts
- Part: Immutable data chunk created on each INSERT
- Granule: Smallest unit ClickHouse reads (default: 8,192 rows)
- Mark: Pointer to a granule's location
Unlike traditional databases that index every row (B-tree), ClickHouse uses a sparse index: one entry per granule, not one entry per row. Each entry is called a mark, and it points to where a granule starts on disk.
Take a table with 8.87 million rows:
- 8.87M / 8,192 = 1,083 granules
- Primary index stores one mark per granule
- Total index size: 96.93 KB in memory
The entire primary index fits in RAM. Always. This is by design. When you query, ClickHouse binary searches over these marks to find which granules might contain your data. For 1,083 marks, that's ~10 steps. Then it reads at most 8,192 rows per matching granule.
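You can verify this on a live system. As a sketch (substitute your own table name), system.parts exposes how much memory each part's primary index occupies:
-- Sum the in-memory primary index size across all active parts of a table
SELECT table,
       sum(rows) AS rows,
       formatReadableSize(sum(primary_key_bytes_in_memory)) AS primary_index_in_ram
FROM system.parts
WHERE active AND table = 'events'
GROUP BY table;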
Creating a MergeTree table:
CREATE TABLE events (
timestamp DateTime,
user_id UInt64,
event_type LowCardinality(String),
payload String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (event_type, user_id, timestamp);
Two decisions here matter:
PARTITION BY controls physical separation. Monthly partitioning means January data lives in different files than February. Queries filtering by month skip entire partitions.
ORDER BY controls sorting within each part.
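A sketch of how the two work together (values are illustrative): the timestamp range limits the scan to the 202401 partition, and the event_type filter lets the sparse index skip granules inside it.
SELECT count()
FROM events
WHERE timestamp >= toDateTime('2024-01-01 00:00:00')
  AND timestamp <  toDateTime('2024-02-01 00:00:00')
  AND event_type = 'click';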
Data flows like this:
- INSERT creates a new part (immutable)
- Background thread merges small parts into larger ones
- Merges can trigger computation for specialized engines
The most relevant variants:
| Engine | Purpose |
|---|---|
| MergeTree | Base engine |
| ReplacingMergeTree | Deduplicates during merge |
| AggregatingMergeTree | Pre-aggregates during merge |
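As an example of the family, here is a sketch of a ReplacingMergeTree table (the schema is made up for illustration). Rows sharing the same ORDER BY key collapse to the latest version, but only when a merge happens:
CREATE TABLE user_profiles (
    user_id UInt64,
    email String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;
-- Merges are asynchronous, so duplicates can still be visible;
-- FINAL forces deduplication at query time, at a cost
SELECT * FROM user_profiles FINAL WHERE user_id = 42;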
Why ORDER BY Matters
The sparse index can only help if the data is sorted in a way that groups relevant rows into the same granules. That's what ORDER BY does. It controls how rows are sorted within each part, which directly controls how many granules ClickHouse can skip.
Consider:
SELECT * FROM events
WHERE event_type = 'click' AND user_id = 12345;
With ORDER BY (event_type, user_id, timestamp):
- Binary search finds the event_type = 'click' granules
- Within those, finds the user_id = 12345 granules
- Scans maybe 8,192 rows
With ORDER BY (timestamp, event_type, user_id):
- event_type is only second in the sort key, and the leading column (timestamp) isn't filtered at all
- The index can't narrow the scan
- Scans potentially millions of rows
Same query, same data, different ORDER BY. The difference can be 100x in query time.
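You don't have to guess which case you're in. EXPLAIN indexes = 1 reports how many granules survive the primary-key filter (output shape varies by version; the counts in the comment are illustrative):
EXPLAIN indexes = 1
SELECT * FROM events
WHERE event_type = 'click' AND user_id = 12345;
-- With a good ORDER BY, the PrimaryKey section reports something like
-- Granules: 2/1083 instead of a full 1083/1083 scan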
The real rule: query patterns first, then cardinality. Your ORDER BY should start with the columns that appear most often in your WHERE clauses. A column that never gets filtered on is useless in the index, no matter its cardinality.
Once you've identified which columns you actually query by, order them from low cardinality to high cardinality. Why? A low-cardinality column at the front eliminates large chunks of granules in one step, and the columns after it remain well-sorted inside each of those chunks, so they can keep narrowing the scan. If event_type has 10 unique values and user_id has 1 million, putting event_type first eliminates ~90% of granules immediately and still lets the index home in on user_id within the rest. Reverse the order and a filter on event_type alone gets no help from the index: the 10 event types are scattered across every granule, so ClickHouse has to scan them all.
For example, if you always filter by tenant_id and timestamp but rarely by event_type, your ORDER BY should be (tenant_id, timestamp) even if event_type has lower cardinality. An index that doesn't match your queries is just wasted sorting.
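As a sketch, that decision looks like this in DDL (table and column names are hypothetical):
CREATE TABLE tenant_events (
    tenant_id UInt32,
    timestamp DateTime,
    event_type LowCardinality(String),
    payload String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (tenant_id, timestamp);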
Compression benefits from good ordering too. The documentation shows 3:1 compression with poor ordering and 39:1 with optimal ordering. Same data, 13x better compression from column order alone.
Vectorized Execution
Traditional databases process row by row:
for each row:
apply filter
compute aggregation
ClickHouse processes in batches:
for each batch of 65,536 rows:
apply filter to entire batch (SIMD)
compute aggregation on batch (SIMD)
SIMD means Single Instruction, Multiple Data. One CPU instruction processes multiple values simultaneously. Modern CPUs support AVX2 (4 doubles at once) and AVX-512 (8 doubles at once).
ClickHouse has multiple SIMD implementations per operator and auto-selects based on your CPU at runtime. It has 30+ precompiled hash table variants for different data patterns.
In practice, this means 10-100x faster aggregations compared to row-at-a-time processing. It's why ClickHouse aggregates billions of rows in seconds while PostgreSQL takes minutes on the same data.
Compression
Columnar storage has another advantage: similar values sit next to each other, and similar values compress well. Timestamps next to timestamps achieve 10:1 to 30:1 compression. On regularly spaced sequences, ClickHouse can hit 800:1 using DoubleDelta encoding (more on that below).
Every column stored on disk gets compressed. The smaller it is on disk, the less I/O your query needs. ClickHouse applies compression in two layers:
- A specialized codec transforms the data to make it more compressible
- A general compression algorithm actually shrinks it
First you fold everything neatly, then you vacuum-seal the bag.
General Compression
These algorithms work on any data. You choose one based on a simple tradeoff:
| Codec | Speed | Compression |
|---|---|---|
| LZ4 | Faster decompression | Good (2-3x reduction) |
| ZSTD | Slightly slower | ~30% better than LZ4 |
Both decompress faster than most SSDs can read, so in practice you rarely notice the speed difference. LZ4 is the default and a safe choice. Use ZSTD when storage cost matters more than marginal CPU savings.
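To see what you're actually getting, per-column compression ratios are exposed in system.columns (a sketch; substitute your own table name):
SELECT name,
       formatReadableSize(data_compressed_bytes) AS on_disk,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed,
       round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'events'
ORDER BY data_compressed_bytes DESC;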
Specialized Codecs
These don't compress data themselves. They transform it into a form that general compression can handle much better. Each one targets a specific pattern:
Delta stores the difference between consecutive values instead of the values themselves.
Original: [1000, 1001, 1002, 1003]
After Delta: [1000, 1, 1, 1]
A column full of small numbers like 1, 1, 1 compresses much better than 1000, 1001, 1002. Good for timestamps and any monotonically increasing column.
DoubleDelta goes one step further and stores the delta of the deltas.
Original: [1000, 1005, 1010, 1015]
After Delta: [1000, 5, 5, 5]
After DoubleDelta: [1000, 5, 0, 0, 0]
When the interval between values is constant (like a sensor reporting every 5 seconds), the second delta is all zeros. This can hit 800:1 compression on regular time-series data. Good for counters and regular-interval sequences.
Gorilla uses XOR encoding, designed for floating-point numbers. It XORs each value with the previous one. When consecutive floats are similar, the result is mostly zeros.
Original: [36.6, 36.7, 36.6, 36.8]
After XOR: [36.6, (mostly zeros), (mostly zeros), (mostly zeros)]
Good for metrics and measurements where values don't jump around much between rows.
Putting It Together
You pair one specialized codec with one general codec per column, choosing based on the data type:
CREATE TABLE metrics (
timestamp DateTime CODEC(Delta, ZSTD),
value Float64 CODEC(Gorilla, LZ4),
counter UInt64 CODEC(DoubleDelta, ZSTD)
)
If you're unsure, the defaults (LZ4, no specialized codec) are solid. Add specialized codecs when you understand your data patterns and want to squeeze out better compression.
When ClickHouse Shines
ClickHouse is at its best when you're scanning a lot of data and aggregating it down. Log analytics, time-series, event tracking, real-time dashboards. Cloudflare processes 11M rows/sec through it. PostHog built their entire analytics product on top of it. Timestamps sort naturally and compress well with Delta encoding, which makes time-series workloads particularly fast.
The query patterns that work well are aggregations (COUNT, SUM, AVG) over millions of rows, time-range filters like WHERE timestamp > now() - INTERVAL 7 DAY, and GROUP BY on low-cardinality columns like country or device type.
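A typical query in this sweet spot looks something like this (against the events table from earlier):
SELECT toStartOfHour(timestamp) AS hour,
       event_type,
       count() AS events
FROM events
WHERE timestamp > now() - INTERVAL 7 DAY
GROUP BY hour, event_type
ORDER BY hour;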
Some production numbers:
| Company | Scale |
|---|---|
| Tesla | 1 quadrillion+ rows |
| Netflix | ~5 PB logs/day |
| Cloudflare | 11M rows/sec, 47 Gbps |
| Yandex | 20.3 trillion rows (2014) |
When ClickHouse Falls Apart
ClickHouse is not a general-purpose database. I've hit each of these walls in production.
There are no ACID transactions and no row-level locking. If you need transactional guarantees, use PostgreSQL.
Point lookups on columns that aren't in the ORDER BY are painful:
-- Slow if user_id isn't first in ORDER BY
SELECT * FROM events WHERE user_id = 12345;
The sparse index can't skip granules efficiently, so you end up scanning everything.
Updates don't exist in the way you'd expect. ClickHouse doesn't update in place. You can use ReplacingMergeTree, but merge timing isn't guaranteed. CollapsingMergeTree works but requires explicit delete/insert with sign columns. None of it is as clean as UPDATE ... WHERE.
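For reference, a sketch of the CollapsingMergeTree pattern (hypothetical table): every change is a pair of inserts, and merges cancel the +1/-1 pairs out.
CREATE TABLE balances (
    user_id UInt64,
    balance Int64,
    sign Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY user_id;
INSERT INTO balances VALUES (42, 100, 1);   -- original state
INSERT INTO balances VALUES (42, 100, -1);  -- cancel the old row
INSERT INTO balances VALUES (42, 250, 1);   -- insert the new state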
JOINs on high-cardinality columns are another pain point:
-- Millions of unique join keys = pain
SELECT * FROM events e JOIN users u ON e.user_id = u.id;
The right side of the JOIN gets loaded into memory. Large tables cause OOM. The workaround is to denormalize your data or use dictionaries.
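The dictionary route looks roughly like this (a sketch; the dictionary name, source, and layout are assumptions, and it only fits when the right side reduces to key-to-attribute lookups):
CREATE DICTIONARY users_dict (
    id UInt64,
    country String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'users'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);
-- Lookup instead of JOIN: no second table hashed per query
SELECT e.*, dictGet('users_dict', 'country', e.user_id) AS country
FROM events AS e;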
And finally, small frequent inserts. Each INSERT creates a part on disk. Thousands of small inserts means thousands of parts, which creates merge overhead. Batch your inserts (1,000-100,000 rows at a time).
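In practice that means accumulating rows client-side and flushing them in one statement, something like the following (three rows shown; real batches would be thousands):
-- One INSERT = one part on disk, regardless of how many rows it carries
INSERT INTO events (timestamp, user_id, event_type, payload) VALUES
    ('2024-01-01 00:00:00', 1, 'click', '{}'),
    ('2024-01-01 00:00:01', 2, 'view',  '{}'),
    ('2024-01-01 00:00:02', 3, 'click', '{}');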
The Mental Model
ClickHouse is fast because it reads only the columns you need, keeps the entire primary index in RAM (96 KB for 8.87 million rows), processes rows in SIMD batches, and compresses everything 10-30x so there's less to read from disk in the first place.
It breaks when you need transactions, fast lookups on non-indexed columns, frequent updates, JOINs on millions of unique keys, or high-throughput single-row inserts.
That's the trade-off. ClickHouse gives up write flexibility and transactional guarantees in exchange for read speed on analytical queries. If that matches your workload, nothing else comes close. If it doesn't, you'll fight the system at every turn.
Originally written for obsessionDB.