ClickHouse® Query Optimization: What Makes It Fast
390x. That's the difference between the same query on the same data with two different column orderings in the sorting key. Not a different engine, not a different schema. Just column order.
This post is Part 8 of the ClickHouse Deep Dive series. The prior seven taught structure: table design, engines, materialization. This one is about operation. Your tables exist, your queries are running, and some of them are slow.
Three earlier posts from the series come up repeatedly. How ClickHouse works covers the granule and sparse-index mechanics that every diagnostic in this post builds on. Data modeling covers the ORDER BY and partition decisions that set the ceiling on what tuning can fix. Projections are one of the optimization levers, folded into the toolkit below. The engine-specific posts show up inline where they're relevant.
We'll use one table throughout every optimization, so the examples are easier to follow:
```sql
CREATE TABLE app_events (
    tenant_id UInt32,
    sequence_id UInt64,
    event_id String,
    event_type LowCardinality(String),
    user_id String,
    amount_cents UInt64,
    timestamp DateTime
) ENGINE = ReplacingMergeTree()
PRIMARY KEY (tenant_id, event_type)
ORDER BY (tenant_id, event_type, user_id, event_id)
PARTITION BY toYYYYMM(timestamp);
```
Sorting key, primary key, and partition pruning
The first thing to check. Always.
ORDER BY determines physical layout on disk. PRIMARY KEY determines what goes into the sparse index that ClickHouse loads into RAM for granule lookup. By default, they're identical, but they don't have to be. In our table, ORDER BY has four columns, but PRIMARY KEY has two: tenant_id, event_type. The extra columns give us sort order and deduplication for ReplacingMergeTree, but we only need the first two for granule elimination. Fewer PRIMARY KEY columns means a smaller index in RAM. At billions of rows across hundreds of tables, that matters.
Column ordering within your key is where the real gains hide. ClickHouse's documentation shows swapping column order changed rows scanned from 7.92 million to 20,320. Same data, 390x fewer rows, just from putting the low-cardinality column first. One team reported 18x faster queries and 90% fewer granules scanned from reordering alone.
The rule: put the columns you filter on most often first, ordered by ascending cardinality.
You can check how the primary index is used (the primary key generates one) by prefixing the query with EXPLAIN indexes = 1:
```sql
EXPLAIN indexes = 1
SELECT count()
FROM app_events
WHERE tenant_id = 1 AND event_type = 'purchase';
```
Granules: 42/15000 means you're reading 0.3%. Granules: 15000/15000 means full scan. The first post covered sparse index mechanics. This is the applied version.
Partition pruning is the second check. PARTITION BY toYYYYMM(timestamp) stores each month separately. A query with WHERE timestamp >= '2026-01-01' AND timestamp < '2026-02-01' reads one partition instead of all twelve. Over-partitioning (millions of customer_ids) creates merge pressure. Under-partitioning gives no pruning. The guideline from the data modeling post: 1-300 GB per partition.
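You can verify pruning the same way: alongside the granule counts, EXPLAIN indexes = 1 lists partition-key and minmax entries with their own parts-selected ratios. A sketch against our table:

```sql
-- The output should show only the January 2026 partition's parts
-- surviving the partition-key stage; the rest are pruned before any read.
EXPLAIN indexes = 1
SELECT count()
FROM app_events
WHERE timestamp >= '2026-01-01' AND timestamp < '2026-02-01';
```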
PREWHERE
ClickHouse has a read-path optimization that can cut your I/O by 95%. It's enabled by default. Most people have never heard of it. This is especially relevant when storage and compute are decoupled, since I/O is more expensive on S3 than on SSD. If you are using ObsessionDB, keep this in mind.
PREWHERE reads filter columns first, evaluates them granule by granule, and only loads the remaining columns for rows that pass. If you're selecting 50 columns but filtering on 1, PREWHERE avoids reading the other 49 for rows that don't match.
The official docs show 23.36 MB read without PREWHERE, 6.74 MB with. 3.5x less data. And we have seen even more extreme cases, up to 20x faster.
Since v23.2, optimize_move_to_prewhere is on by default. ClickHouse moves selective WHERE conditions to PREWHERE, prioritizing smaller columns. You can outperform the automatic optimization by writing PREWHERE explicitly when you know selectivity better than the size heuristic:
```sql
SELECT *
FROM app_events
PREWHERE tenant_id = 1
WHERE user_id = 'usr_7a250d5630b4';
```
tenant_id is 4 bytes and filters out most rows. The wider user_id string only gets read for survivors.
One correctness gotcha: PREWHERE runs before FINAL by default. With ReplacingMergeTree, this can filter out the "winning" row and give you wrong results. Since v25.12, apply_prewhere_after_final=1 fixes the correctness issue, but it defeats the point of PREWHERE: once the filter runs after the merge, you're no longer saving the compute that PREWHERE existed to save. Use it when correctness matters more than the scan cost.
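A sketch of the correctness-first variant on our ReplacingMergeTree table, assuming a version recent enough to have the setting named above:

```sql
-- Correct but slower: deduplicate first, filter after, so PREWHERE can
-- no longer drop a granule that contained the winning row version.
SELECT *
FROM app_events FINAL
PREWHERE event_type = 'purchase'
WHERE tenant_id = 1
SETTINGS apply_prewhere_after_final = 1;
```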
Data skipping indexes
Your primary key is your primary index. Data skipping indexes are secondary indexes: per-granule metadata that lets ClickHouse skip granules that can't match your filter.
Two things apply to every variant below. First, the shape is always ADD INDEX name column TYPE <variant> followed by MATERIALIZE INDEX name. The first statement registers the index so new inserts populate it; the second backfills existing parts. Skip the materialize step, and the index only covers rows written after you added it, so historical queries see no speedup.
Second, all skip indexes need the indexed column's values to cluster on disk so rows in each granule share a narrow range. The primary key gives you that automatically for columns in the sort prefix; monotonically-increasing columns (like a timestamp on an append-only workload) get it from insertion order. Without that clustering, values scatter across granules and the index skips nothing. You've paid the overhead for zero benefit.
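The two-statement shape looks like this; idx_example is a placeholder name and minmax a placeholder variant, but the pattern is identical for every index type below:

```sql
ALTER TABLE app_events
    ADD INDEX idx_example timestamp TYPE minmax GRANULARITY 4;

-- Backfill existing parts; skip this and only rows inserted after
-- ADD INDEX are covered by the index.
ALTER TABLE app_events MATERIALIZE INDEX idx_example;
```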
bloom_filter
For high-cardinality columns not in ORDER BY, where equality and IN filters dominate. Our schema doesn't have a natural candidate (every high-cardinality column is already in the sort prefix), so imagine we added a session_id String column to track sessions independently of user activity:
```sql
ALTER TABLE app_events
ADD INDEX idx_session_id session_id TYPE bloom_filter(0.01) GRANULARITY 4;
```
Sizing matters more than people expect. A bloom filter hashes values into a bit array: bigger array, fewer false positives, so "make it bigger" looks like a free lever, but it isn't. ClickHouse reads the filter every time it checks a granule, so an oversized filter turns a cheap lookup into real I/O. The sweet spot lands around 8 KB per granule at the default 8,192 rows per granule. That's enough bits to keep false positives rare and small enough that loading the filter costs less than the granules it lets you skip. In one test, a 262 KB filter ran 17x slower than the tuned 8 KB version (4.5 vs 78 queries per second). The filter was doing its job; reading it just cost more than the work it was saving.
Insert overhead is real. A single bloom filter slows inserts by ~45%. Each additional hash function adds ~8%. Operators that work: =, IN, has(). Operators that don't: !=, NOT LIKE.
minmax
minmax stores two values per granule: the min and max of the indexed column in that block. On a range query, ClickHouse compares the filter bounds against each granule's pair and skips any granule whose range doesn't overlap. No hashing, no probabilistic math, a handful of bytes of metadata per granule.
Write overhead is essentially zero, so the decision is easy: if you range-filter a column and the column clusters with physical layout, add it. sequence_id in our table increases monotonically as events arrive, so each part tends to hold a contiguous range of values, and WHERE sequence_id BETWEEN 18000000 AND 18100000 prunes parts outside that window. Same story for timestamp on most append-only workloads. On values that don't cluster (UUIDs, hashes, anything shuffled), minmax stores wide per-granule ranges and skips nothing.
```sql
ALTER TABLE app_events
ADD INDEX idx_seq sequence_id TYPE minmax GRANULARITY 4;
```
set
set stores the distinct values that appear in each granule block, capped at the N you pass as the argument. On a query, ClickHouse checks whether your filter value is in that small per-granule set and skips the granule when it isn't. Unlike bloom_filter, the check is exact: a granule is pruned only when the value provably isn't there, no false positives to chase.
That makes it the right tool for columns with modest cardinality that you query often but that aren't in the sorting key: status codes, regions, tenant tiers, plan names. Picking N is where people trip. The cap is per-granule, not global: if any granule sees more than N distinct values, the index gives up on that granule and stops skipping. Five status codes globally but eight distinct values in a busy granule means set(16) is safer than set(8). Write overhead is modest (storing a small deduped set per granule), so the penalty for sizing N generously is small.
Assuming we extended our schema with a status LowCardinality(String) column:
```sql
ALTER TABLE app_events
ADD INDEX idx_status status TYPE set(64) GRANULARITY 4;
```
text
text is the modern replacement for ngrambf_v1, GA since ClickHouse 26.2, and is already available on ObsessionDB. Instead of hashing substrings into a probabilistic filter, it builds a proper inverted index: tokenize the column once at write time, then store a mapping from each token to the rows that contain it. Lookups are deterministic (no false positives, no wasted granule reads) and work at row granularity instead of granule granularity, so selective searches hit much less data.
You pick the tokenizer to match how you query: splitByNonAlpha for natural-language text, splitByString for structured payloads with known delimiters, ngrams(N) if you want n-gram behavior without the bloom-filter math, asciiCJK for CJK text. Assuming we stored a raw message String payload alongside the structured columns:
```sql
ALTER TABLE app_events
ADD INDEX idx_message message TYPE text(tokenizer = splitByNonAlpha);
```
The index powers hasToken, hasAnyTokens, hasAllTokens (the recommended entry points), plus LIKE, match, startsWith, and endsWith. For any new text search workload on a recent ClickHouse version, start here and only fall back to ngrambf_v1 if you're pinned to an older build.
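Assuming the idx_message index above is in place, a lookup served by the inverted index could look like:

```sql
-- hasToken matches whole tokens produced by the tokenizer, so this
-- finds rows whose message contains the token 'timeout' exactly.
SELECT count()
FROM app_events
WHERE hasToken(message, 'timeout');
```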
FINAL: It's not slow anymore
If someone told you to avoid FINAL, they were right in 2022. The advice hasn't kept up with the code.
| Version | What improved |
|---|---|
| 20.5 | Parallel FINAL introduced |
| 23.5 | Lower memory consumption |
| 24.1 | enable_vertical_final default ON. Parallel column reads |
| 25.6 | Skip indexes work with FINAL |
| 25.12 | apply_prewhere_after_final setting (see PREWHERE section) |
| 26.2 | Automatic cross-partition merge decision |
The setting that matters most is do_not_merge_across_partitions_select_final. By default, FINAL treats every part in the table as a potential source of duplicates and merges them into one deduplicated stream. When your data model guarantees that a row and any of its duplicates always live in the same partition (an updated event stays in its month, a user state update stays in its tenant's partition), that cross-partition work is pure overhead. Turning the setting on runs FINAL per partition and unions the results, which parallelizes better and skips the bookkeeping for impossible cross-partition duplicates. Since v26.2, ClickHouse makes this call automatically based on the partitioning scheme.
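On versions before 26.2's automatic decision, you opt in per query. A sketch, valid only under the assumption that duplicates in our table never cross month boundaries:

```sql
-- FINAL runs independently per partition and the results are unioned,
-- skipping the cross-partition deduplication bookkeeping.
SELECT event_id, amount_cents
FROM app_events FINAL
WHERE tenant_id = 1 AND event_type = 'purchase'
SETTINGS do_not_merge_across_partitions_select_final = 1;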
Column types in your ORDER BY also affect FINAL speed directly. To deduplicate, FINAL has to identify rows with matching ORDER BY tuples, which means reading every ORDER BY column from every part and comparing values. Bigger types compound: more bytes per row, more I/O, slower comparisons. When you have a choice at schema design time (UInt64 vs UUID, UInt32 vs Int64, LowCardinality vs plain String), the FINAL path is one place the savings show up.
argMax can still be the better alternative in some cases; we introduced the pattern in the data modeling post. Which one wins depends on filter selectivity, and the following table gives a quick summary. For the full picture, I'd recommend reading the original article.
| Rows surviving filter | argMax | FINAL | Winner |
|---|---|---|---|
| ~92% (broad query) | 1006s | 511s | FINAL |
| ~71% | 30s | 8s | FINAL |
| ~10% (selective) | 89s | 109s | argMax |
| Point lookup | 4s | 14s | argMax |
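For reference, the argMax side of that comparison on our table. This is a sketch that assumes the latest timestamp marks the current version of each row:

```sql
-- For each ORDER BY key, keep the field value from the newest version
-- instead of asking FINAL to merge duplicates at read time.
SELECT
    event_id,
    argMax(amount_cents, timestamp) AS amount_cents
FROM app_events
WHERE tenant_id = 1 AND event_type = 'purchase'
GROUP BY tenant_id, event_type, user_id, event_id;
```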
JOIN algorithms
Six algorithms. The default (hash) is fine until it OOMs, and many people don't know they can pick a different one. The options:
| Algorithm | Speed | Memory | When to use |
|---|---|---|---|
| hash | Fast | High | Default. Right table fits in memory |
| parallel_hash | Fastest | Highest | Large right table, multi-core |
| full_sorting_merge | Competitive | 60% less | Tables sorted by join key |
| grace_hash | Moderate | Tunable | Right table too large for memory |
| partial_merge | Slowest | Lowest | Memory-constrained |
| direct | 32x faster | Low | Dictionary or Join engine |
The one to know is full_sorting_merge. The default hash join builds an in-memory hash table from the right table, then streams the left side through it looking for matches. That works well when the right table fits comfortably in memory, but the build phase is single-threaded and memory scales with the right table's size, so billion-row joins tend to stall there. full_sorting_merge takes a different approach: it sorts both inputs by the join key, then streams them together and matches rows in lockstep. Sort-merge parallelizes across cores and keeps memory bounded by the merge buffer instead of the full right table. On a 1B-row join, it outpaced hash join while using 60% less memory.
The real win comes when the inputs are already sorted by the join key. If a table's ORDER BY starts with that key, ClickHouse can skip the sort on that side entirely (the planner doesn't always pick this up, so verify with EXPLAIN PIPELINE). You can often engineer the condition. Assuming we also have a users dim table keyed on id, our app_events is ordered by (tenant_id, event_type, user_id, event_id), so a query that fixes the leading columns with WHERE tenant_id = 1 AND event_type = 'purchase' lands in a scan range that's already user_id-ordered, and full_sorting_merge can run without re-sorting either side:
```sql
SET join_algorithm = 'full_sorting_merge';

SELECT e.tenant_id, e.event_type, u.name
FROM app_events e
JOIN users u ON e.user_id = u.id
WHERE e.tenant_id = 1 AND e.event_type = 'purchase';
```
Two rules beyond algorithm choice. For hash-based JOINs (the default and parallel_hash), the right table gets loaded into memory, so always put the smaller side on the right. And for dimension lookups where the join key is unique in the right table, ANY JOIN returns the first match and skips checking for additional ones; it's semantically equivalent to the default ALL JOIN when the key is unique, but cheaper.
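Assuming the same users dim table with unique ids, the ANY variant looks like:

```sql
-- ANY JOIN stops at the first match per left row; equivalent to the
-- default ALL JOIN here because u.id is unique, but cheaper.
SELECT e.event_id, u.name
FROM app_events e
ANY LEFT JOIN users u ON e.user_id = u.id
WHERE e.tenant_id = 1;
```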
Before you reach for a JOIN at all, consider the hierarchy. The higher an option sits on this list, the cheaper it is:
- dictGet: O(1) in-memory lookups against a dictionary. 6.6x faster than the equivalent JOIN for dimension data, and the MV post covered the replacement pattern.
- IN subquery: when you only need to filter the left table, not fetch columns from the right.
- Filtered subqueries as JOIN input: if you must JOIN, minimize what enters the hash table by filtering the right side first. JOIN (SELECT ... FROM users WHERE ...) beats JOIN users ... WHERE users.x = ... in many cases because the planner can't always push the filter down itself.
- Raw JOIN: last resort, when the query genuinely needs columns from both sides and no cheaper path fits.
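As a sketch of the top of that hierarchy, assuming a hypothetical users_dict dictionary holding the dimension data (the exact lookup call depends on the dictionary's key layout):

```sql
-- O(1) in-memory lookup per row: no hash table build, no JOIN at all.
SELECT
    event_id,
    dictGet('users_dict', 'name', user_id) AS user_name
FROM app_events
WHERE tenant_id = 1 AND event_type = 'purchase';
```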
Projections and MVs
Both have dedicated posts, but given their importance in optimizing queries we couldn't leave them out of this post.
Projections give a table a second physical sort order without making you maintain a second table or rewrite queries. You declare the projection, ClickHouse stores a sorted copy of the projected columns inside the same parts, and the optimizer picks whichever ordering makes the current query cheapest. Our app_events is ordered by (tenant_id, event_type, user_id, event_id), which is great for per-tenant event-type scans but slow for a user-timeline query like WHERE user_id = ? ORDER BY timestamp DESC. A projection sorted by (user_id, timestamp) closes that gap with zero application changes. The cost is storage and write amplification: each projection adds a sorted copy of its columns and gets populated on every insert, so reach for one only when a query pattern is frequent enough to earn the extra bytes.
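A sketch of that projection on our table; user_timeline is a name we're inventing here:

```sql
ALTER TABLE app_events
    ADD PROJECTION user_timeline (
        SELECT * ORDER BY user_id, timestamp
    );

-- Backfill existing parts, same idea as MATERIALIZE INDEX.
ALTER TABLE app_events MATERIALIZE PROJECTION user_timeline;
```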
Materialized views move the work from query time to insert time. An MV is a query that runs against every block inserted into a source table, with the result written to a separate target table. If your dashboards aggregate the same raw data over and over (daily revenue by tenant, purchase count per user, error rate per endpoint), an MV computes those rollups once on the way in, and read queries hit the pre-aggregated target instead of scanning raw events. For our app_events table, an MV that sums amount_cents per tenant per day turns "this month's revenue for tenant X" from a billion-row scan into a lookup over a handful of rows. The same pattern works for transformations like parsing, enrichment, or denormalization. The tradeoff is write throughput: every MV attached to a table runs synchronously on inserts, and the cost stacks.
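A sketch of that rollup, assuming a SummingMergeTree target table (the names are ours):

```sql
CREATE TABLE daily_revenue (
    tenant_id UInt32,
    day Date,
    revenue_cents UInt64
) ENGINE = SummingMergeTree()
ORDER BY (tenant_id, day);

-- Runs against every block inserted into app_events and writes partial
-- sums to daily_revenue; dashboard reads hit the small table instead.
CREATE MATERIALIZED VIEW daily_revenue_mv TO daily_revenue AS
SELECT
    tenant_id,
    toDate(timestamp) AS day,
    sum(amount_cents) AS revenue_cents
FROM app_events
GROUP BY tenant_id, day;
```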
The diagnostic toolkit
Before you can fix a slow query, you need to know which one is slow and where the time is going. Three surfaces cover most cases.
Find expensive queries. system.query_log records every executed query with duration, memory, rows read, and more. Ranking by frequency times duration surfaces the queries that actually cost you, not just the worst-case outliers.
```sql
SELECT
    normalized_query_hash,
    count() AS executions,
    avg(query_duration_ms) AS avg_ms,
    count() * avg(query_duration_ms) AS total_ms,
    any(query) AS sample
FROM system.query_log
WHERE type = 'QueryFinish' AND event_date >= today() - 7
GROUP BY normalized_query_hash
ORDER BY total_ms DESC
LIMIT 20;
```
A 50ms query running 100K times a day costs more than a 10s query running twice, and the first is the one worth optimizing.
Understand the plan. EXPLAIN indexes = 1 shows how many granules the primary key eliminates. Granules: 42/15000 means you're reading 0.3% of the table, which is healthy. Granules: 15000/15000 means the query bypassed the index entirely: a sorting-key problem, a filter that doesn't match the ORDER BY prefix, or a type mismatch forcing an implicit cast. EXPLAIN PIPELINE shows the actual processors and thread counts. MergeTreeThread x 8 means eight parallel readers; MergeTreeThread x 1 means something is serializing the query. EXPLAIN SYNTAX shows the query after the optimizer's rewrites, which is useful when behavior surprises you.
Monitor part health. system.parts tells you whether the table itself is healthy. Thousands of active parts per partition means the merge pipeline is behind and you're heading straight for the optimize_read_in_order memory trap in the next section. A healthy table sits at tens to low hundreds of active parts per partition.
```sql
SELECT partition, count() AS parts, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'app_events' AND active
GROUP BY partition
ORDER BY parts DESC;
```
The workflow: rank system.query_log by total_ms, pick the top offender, run EXPLAIN indexes = 1. If the granule ratio is bad, the problem is the sorting key or partition design, not the query text. If granules look fine but the query is still slow, EXPLAIN PIPELINE to check parallelism. Change one variable at a time, re-run with SET enable_filesystem_cache = 0 to measure cold performance, and verify with a fresh query_log comparison.
Anti-patterns
This is a quick list of the most common anti-patterns. Each one is small on its own and expensive in aggregate.
SELECT *. Columnar storage means every column in your SELECT list gets fetched from disk. SELECT * on a 100-column table to use 3 is 33x more I/O than naming the three. Always list the columns you actually need.
LowCardinality on high-cardinality columns. LowCardinality stores a per-part dictionary of distinct values. Under ~10K distinct values, it's a major win (smaller storage, faster comparisons), but above ~100K, the per-part dictionary overflows (default cap 8,192) and the wrapper overhead starts outweighing the compression gain. Run uniqExact(col) on a sample before applying.
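The check is a one-liner; on a huge table, narrow the WHERE to a recent slice rather than scanning everything:

```sql
-- Well under ~10K distinct values: LowCardinality is a likely win.
SELECT uniqExact(event_type) AS distinct_values
FROM app_events
WHERE timestamp >= now() - INTERVAL 7 DAY;
```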
optimize_read_in_order memory trap. This setting parallelizes across parts, not threads, so part count drives memory, not max_threads. Each stream allocates ~8 MB per column: 1,803 parts × 3 columns = 42 GB in one reported case. Watch your active part count and either reduce parts or disable the setting when it grows past a few hundred per partition.
ORDER BY in subqueries. The outer query re-sorts anyway, so sorting the inner result is pure wasted CPU. Remove it unless paired with LIMIT, where the ORDER BY actually bounds what flows outward.
GROUP BY column order. When your GROUP BY prefix matches the ORDER BY prefix, ClickHouse stream-aggregates group by group without building a hash table. Misalign them and you force hash aggregation, which holds every group in memory at once. Match the prefix when you can, especially on high-cardinality GROUP BYs.
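With our sorting key, the aligned version can aggregate in streaming order; optimize_aggregation_in_order makes that choice explicit:

```sql
-- GROUP BY matches the ORDER BY prefix (tenant_id, event_type), so
-- groups arrive contiguously and no hash table has to be built.
SELECT tenant_id, event_type, count() AS events
FROM app_events
GROUP BY tenant_id, event_type
SETTINGS optimize_aggregation_in_order = 1;
```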
Type bloat. UInt64 when UInt16 fits wastes 4x per row, and it compounds: FINAL reads every ORDER BY column to deduplicate, so oversized sort-key types slow that path too. Nullable adds a UInt8 marker per row. Pick the smallest type that fits the actual value range you expect over the table's lifetime.
JOINs between mismatched types. Joining LowCardinality against plain String, or UInt64 against Int64, pays an implicit cast per row. In hash joins, the cost is modest; in full_sorting_merge it can break the sort-skip path because the cast materializes a new column before the merge. Keep join-key types identical across both tables.
Wrapping up
Eight posts in. The first seven built the foundation: how ClickHouse works, how to model data for it, and when to reach for each specialized engine or materialization. This one is the flip side: what to do when the schema is already locked in, and the query is still too slow. The toolkit, by now: sorting key design, PREWHERE, skip indexes, the modern FINAL, JOIN algorithms, projections, materialized views, a diagnostic workflow for finding what's actually slow, and the anti-patterns that quietly undo everything else.
These levers work on any ClickHouse deployment, including ObsessionDB.
Continue Reading
Originally written for obsessionDB. Read the original article here.
ClickHouse is a registered trademark of ClickHouse, Inc. https://clickhouse.com