The Lookup Kit: ClickHouse® Dictionaries and JOINs

You have data in two tables, and you need to combine them at query time. In OLTP-land you JOIN without a second thought. In ClickHouse, that reflex is wrong about 80% of the time.

ClickHouse gives you different tools for combinatorial work, and the wrong choice costs about 20×. We measured it on our own cluster this week: 11 seconds with dictGet, 225 seconds with the equivalent JOIN, on the same 10-billion-row enrichment. The full numbers are in the benchmark section below.

This post is a deep dive into dictGet (the right default), an operational nuance on JOINs (the escape hatch), and the move that beats both when the enrichment is stable: resolving it once at ingest so the query-time lookup disappears. At the end, a one-page decision tree that you can save.

Why OLAP makes joining expensive

ClickHouse stores data on disk, sorted by ORDER BY, with a sparse index in RAM. We covered the mechanics in the first post in this series. Joining two tables breaks that sorted access pattern. At least one side has to be reorganized in memory, usually as a hash table, and sometimes as a sorted stream. That reorganization is what costs you.

The tools each solve a different shape of the problem:

Dimension lookup with low-cardinality reference data that updates slowly: a dictionary, via dictGet.
Two large analytical tables combined ad-hoc: a JOIN, with the right algorithm.
A stable lookup you read constantly while the dim barely moves: bake it in at ingest, either upstream in your pipeline or with dictGet inside a materialized view, and the query-time lookup disappears entirely.

The cheaper alternatives hierarchy from the query-optimization post puts dictGet at the top, JOINs at the bottom, but in this case, we are going deeper.

Dictionaries

A dictionary is a lookup primitive, and you don't query it like a table most of the time; you call dictGet from inside another query. The dict lives in RAM on the server, refreshed from an external source on a schedule, and serves O(1) lookups instead of the JOIN's hash-table-build-then-scan.

The materialized-views post had the canonical war story: an MV joined a 500-million-row dimension table on every insert, and INSERT latency went from 50ms to 2 seconds. The fix was replacing the JOIN with dictGet. That story is still the best one-paragraph summary of why this tool exists. Read it here if you missed it.

Sources

You can create a dictionary with PostgreSQL, MySQL, another ClickHouse table, an HTTP endpoint, an executable, or a file on disk. The source is decoupled from how ClickHouse stores the dict in memory. The most common production case is "PostgreSQL holds the truth, ClickHouse holds the lookup":

CREATE DICTIONARY product_metadata (
    sku UInt64,
    name String,
    category UInt8,
    base_price Decimal(10, 2)
) PRIMARY KEY sku
SOURCE(POSTGRESQL(
    host 'pg.internal' port 5432
    user 'reader' password 'secret'
    db 'inventory' table 'products'
    invalidate_query 'SELECT max(updated_at) FROM products'
))
LIFETIME(MIN 60 MAX 300)
LAYOUT(HASHED());

For same-cluster dictionaries that read from another ClickHouse table, use SOURCE(CLICKHOUSE(...)) and pass explicit USER and PASSWORD. Managed clusters (including ours) do not let dicts authenticate implicitly. We hit that one during the benchmark setup for this post; the dict failed to load with a misleading "AUTHENTICATION_FAILED" until we added the credentials inline.

Layouts: pick by key shape and memory budget

The layout is the decision that determines memory usage and lookup speed. ClickHouse has more layouts than most teams need to know about. Here are the ones you actually pick from:

Three dictionary layouts compared: flat for dense integer keys, hashed for general single keys, complex_key_hashed for composite tuple keys — Three dictionary layouts: flat for dense integer keys, hashed for general use, complex_key_hashed for composite tuple keys.

flat is the fastest layout that exists. It's a plain array indexed directly by the key, so a lookup is one array offset with no hashing involved. The constraint is baked into that design: keys have to be dense UInt64 values under the max_array_size cap, 500K by default and configurable higher. Sequential integer keys, an auto-increment id or a CounterID, are the sweet spot. Feed it sparse or non-integer keys, and you either burn memory on the gaps or it refuses to build.
Reach for hashed when flat doesn't fit, which is most of the time. It's a standard hash table: any single key type goes in, gets hashed to a bucket, and walks a short chain on collision. Strings, wide integer ranges, and sparse keys, all handled. You pay a little more RAM than flat and a hash computation per lookup, but it stays O(1) effectively, and you stop worrying about key shape. When you don't know which layout you need, this is the right place to start.
complex_key_hashed is the same hash table with a composite key. Identical memory profile, except the key is a tuple of columns instead of a single one. Whenever your real key is "these two or three columns together," this is the layout.

Those three cover the large majority of dictionaries you'll build. ClickHouse ships about a dozen more for narrower jobs: range_hashed for time-versioned lookups like currency rates by day, ip_trie for CIDR and GeoIP matching, cache and ssd_cache for dicts too large to keep fully in RAM, and complex_key_hashed_array for keys carrying many attributes. If your key shape doesn't match the three above, skim the layouts documentation before committing.

Refresh strategies

Three mechanisms, each with a tradeoff.

LIFETIME(MIN n MAX m) refreshes the dict at a random interval between n and m seconds. The randomization prevents thundering herd across replicas refreshing the same source at the same instant. Use LIFETIME(0) to disable auto-refresh entirely.
invalidate_query is a SQL query whose result ClickHouse treats as a watermark with massive load reduction for slow-moving dictionaries. It compares against the previous result and the dict only refreshes when the watermark moves. Pattern:

SOURCE(POSTGRESQL(... invalidate_query 'SELECT max(updated_at) FROM products' ...))

update_field is the incremental option. ClickHouse appends WHERE update_field >= last_update_time to the source query and pulls only changed rows. Worth its weight in gold for large dictionaries where a full reload is operationally painful.

The atomicity guarantee is worth knowing. Queries see either the old version or the new version of a dict, never a mix. Failed refresh keeps the previous snapshot live and only logs the error. Monitor system.dictionaries.last_exception because nothing else will tell you a refresh broke.

The `dictGet` family

The function you call most of the time is dictGet. The variants exist for a reason:

dictGet(name, attr, key) is the base call: it returns the attribute value for the key. On a miss, it falls back to the default declared in the dictionary's attribute definition (or the type's zero value if none is set).
dictGetOrDefault(name, attr, key, default) returns a typed fallback for missing keys. Prefer this when missing keys are normal in your data; it keeps NULL handling out of downstream queries.
dictGetOrNull returns NULL on miss. Right choice when missing should be distinguishable from a legitimate empty/zero value (v21.4+).
dictHas is a yes/no check, cheaper than dictGet when you only need to know whether a key exists.
dictGetAll returns all matching attribute values for regexp_tree dictionaries that may match multiple patterns (v23.5+).
dictGetKeys is the new reverse lookup: value to keys. Available in v25.12+. Useful when a JOIN-style operation is really "find all keys whose attribute matches this value."
The hierarchical family (dictGetHierarchy, dictGetChildren, dictGetDescendants, dictIsIn) handles parent-child dictionaries.

Type-specific variants exist too: dictGetUInt64, dictGetString, dictGetIPv4, etc. They return typed scalars without dynamic-type wrapping, which matters on hot paths. The polymorphic dictGet is more readable; the typed ones are faster. Pick based on whether you're optimizing or expressing.

SELECT
    transaction_id,
    dictGetOrDefault(
        'address_labels',
        'label',
        (chain_id, from_address),
        'unknown'
    ) AS sender_label
FROM transfers
WHERE block_date = today();

Dictionaries in ObsessionDB

In self-hosted ClickHouse, in-memory dicts live in RAM on every replica that may serve the query. ObsessionDB's architecture is decoupled, with data on S3 and compute on stateless nodes, but dictionaries are the deliberate exception.

Decoupled architecture with data on S3 and stateless compute, where dictionaries are the deliberate exception held in RAM on every compute node — Facts cross the storage/compute boundary on every query. Dictionaries are the deliberate exception, held in RAM on every node.

Each compute node holds a full in-memory copy of every dict, refreshed exactly the way a self-hosted single-node instance would do it. S3 round-trips would defeat the entire point of dictGet (sub-millisecond enrichment), so we accept the RAM cost on every node and keep its behavior identical to vanilla ClickHouse.

For the article you're reading, this is what it means in practice: a 6,506-row dim takes 1.62 MiB of RAM and loads in 27 milliseconds on every node. The numbers below come from running dictGet on a stateless compute node reading 40 GB from S3-backed storage.

dictGet vs JOIN, on our own cluster

We ran the comparison this week. ClickHouse 26.3.9.1 LTS. ObsessionDB SharedMergeTree. The standard clickbench hits_10b table (10 billion rows, 54 GiB) joined against a 6,506-row synthetic counter dim. Filesystem cache disabled per query for cold-vs-cold fairness. Output discarded via outer aggregation, so client-side overhead doesn't pollute the numbers. Three runs each, log_comment markers, results pulled from system.query_log.

Approach	Wall time	Memory
Baseline (no enrichment)	11.0 s	22 MiB
`dictGet` (HASHED)	11.5 s	22 MiB
`dictGet` (FLAT)	11.3 s	22 MiB
JOIN `parallel_hash` (default 26.3)	~225 s	52 MiB
JOIN `hash` (single-threaded)	141–212 s	40 MiB

The headline reads itself: dictGet is statistically indistinguishable from the no-enrichment baseline. The lookup happens for every row of a 10-billion-row scan and adds zero measurable overhead. The equivalent JOIN takes about 20× longer with 2.4× the memory.

Now the part I didn't expect. Single-threaded hash was faster than parallel_hash on this workload, and parallel_hash is the default in 26.3. The reason: with a 6,506-row right side, the parallel hash-table-build coordination overhead doesn't pay off. The single-threaded build finishes faster than the parallel orchestration. The 24.12 default flip from hash to parallel_hash is the right call for most dim sizes, but small-dim workloads benefit from explicitly setting join_algorithm='hash'. Always benchmark against your actual data shape before trusting defaults.

How does the gap evolve as the dim grows? We ran the comparison at 1K, 6.5K, 10K, 100K, and 1M dim sizes:

Query time vs dim size on a 10-billion-row enrichment: the dictGet curve stays flat while the JOIN curve sits 10 to 20 times higher — Query time vs dim size on a 10B-row enrichment: dictGet stays in the 10–23s band while the JOIN sits 10–20× higher.

The dictGet curve stays in the 10–23 second band as dim size scales 1K→1M; the JOIN curve sits in the 100–225 second band. The variance on the JOIN side at the 6,506-row dim, between the original 225s and a later 105s run, is real and cluster-load-dependent. Even at the favorable end, JOIN is ~10× slower.

When dictGet doesn't win

The 20× headline is true at scale on the right shape of problem. There's a shape where it isn't, and we have internal benchmark data on it from a quantile pipeline we ran a while back.

Threshold table: ~500 rows. Fact table: full table scan. Tested four combinations of percentile function and lookup strategy:

Approach	Pipeline latency
`quantile` + window-function JOIN	21 ms (winner)
`quantileTDigest` + dictionary lookup	49 ms
`quantileTDigest` + window-function JOIN	93 ms
`quantile` + dictionary lookup	436 ms

Why the JOIN won: with a 500-row right side and a window function on the left, ClickHouse compiles the join plan once. Hash the small threshold table once, probe in lockstep, done. dictGet is a per-row function call with no bulk optimization. On dim tables small enough that JOIN's hash-table build is essentially free, JOIN beats dictGet because dictGet's per-row overhead doesn't get amortized.

Rule of thumb: dicts win when the dim is big enough that the per-row lookup is cheap relative to the JOIN's hash-build cost. On dim tables of a few hundred rows accessed via window functions, the math flips.

JOINs: today's behavior

The query-optimization post had the full algorithm matrix for JOINs in ClickHouse. Read that for the comparison table; we'll keep this section to the operational nuance.

The right-table-in-memory rule is the key one, a hash JOIN puts the right side into a hash table. For this reason, always put the smaller table on the right. The IMDB benchmark from ClickHouse's own blog post puts a number on it: 1M actors on the right runs in ~0.8s with 716 MB; 100M roles on the right takes 5+ seconds and 8.95 GB. For the same JOIN and same result, that's 6× slowdown purely from table order.

Regarding algorithm selection, parallel_hash is the default, and that's the right one for most dim sizes (more than ~5K rows). For small dims, see the surprise from the benchmark section above: explicit join_algorithm='hash' can beat the default by ~30%. For tables physically sorted on the join key, full_sorting_merge skips the sort entirely and runs at hash-join speed with much less memory. For right sides that don't fit in RAM, grace_hash spills to disk in tunable buckets. partial_merge is the memory-of-last-resort.

There are some extra specific learnings that can impact your joins depending on the specific cluster conditions. For example, in environments where duplicates can happen ANY versus ALL is the silent Cartesian killer. If your right side has K duplicates per join key, ALL multiplies the left's row count by K as any default database, but ANY can mitigate this with a low effort fix:

-- Table B has 3 rows for id=1
SELECT a.id, b.value FROM A INNER JOIN B ON a.id = b.id;
-- Returns 3 rows. Silent multiplication.

SELECT a.id, b.value FROM A INNER ANY JOIN B ON a.id = b.id;
-- Returns 1 row. What you wanted.

The production failure mode is depressingly common: a dim table accumulates duplicates over time (slowly changing dimensions, late inserts, missed deduplication), and a JOIN that ran fine for months suddenly OOMs or returns a 10× row count. Switch to ANY to mitigate quickly, but deduplicate the dim or model it as a dictionary to fix.

GLOBAL JOIN on distributed setups is also something to keep in mind: the cost model is right_size × num_shards bytes broadcast on every query. Worth it when the right side is genuinely small (low-MB) and changes too often to replicate.

Enrich at ingest, not at query

The fastest lookup at read time is the one you already did at write time.

Everything above assumes you resolve the lookup when the query runs. dictGet makes that cheap, but cheap isn't free, and you pay it on every query that touches the column. When the enrichment is stable, a label, a category, or a denormalized attribute, there's a better move: resolve it once, on the way in, and store the result. Read-time queries see a plain column with nothing left to resolve.

There are two places to do it. The first is upstream, before the data ever reaches ClickHouse. If your producer, stream processor, or ETL job already has the reference data on hand, attach the label there and let the rows arrive enriched. ClickHouse stores what it's given and never looks anything up. This is often the simplest path when you're already running something like Kafka with a consumer, or Flink, in front of the database.

The second is in-database, when the reference data is best kept as a ClickHouse dictionary. Put dictGet in a materialized view: rows land in the raw table, the MV fires, the lookup runs once per row at insert, and the resolved value gets written into the target table.

CREATE MATERIALIZED VIEW transfers_enriched_mv TO transfers_enriched AS
SELECT
    transaction_id,
    chain_id,
    from_address,
    dictGetOrDefault(
        'address_labels', 'label',
        (chain_id, from_address), 'unknown'
    ) AS sender_label
FROM transfers_raw;

Either way, the tradeoff is that data is frozen. Query-time dictGet always reflects the current dictionary. A value enriched at ingest reflects the reference data as it was when the row landed. Relabel a user tomorrow, and every row already written keeps yesterday's label until you backfill. So the rule splits cleanly:

Enrich at ingest when the historical value is the correct one, or close enough that staleness doesn't bite. Transaction labels, geo lookup at event time, and the category a product had when it was sold.
Keep query-time dictGet when you need the always-current answer. Live ownership, current tier, anything where "what is it now" beats "what was it then."

The decision tree

The takeaway artifact is what you should learn from this article.

Need to combine data from two sources?
│
├── Reference data (lookup), updates infrequently, fits in RAM
│   └── DICTIONARY
│       ├── Dense UInt64 keys < 500K        →  flat
│       ├── Composite keys                  →  complex_key_hashed
│       ├── Time-ranged (rates, ownership)  →  range_hashed
│       ├── CIDR / IP ranges                →  ip_trie
│       └── Many attributes per key         →  complex_key_hashed_array
│
├── Reference data doesn't fit in RAM
│   └── DICTIONARY with cache or ssd_cache
│
├── Two analytical tables, ad-hoc, both large
│   ├── Right is Dict / RocksDB / Join engine, ANY ok →  direct
│   ├── Right fits in RAM, dim ≥ ~5K rows             →  parallel_hash
│   ├── Right fits in RAM, dim < ~5K rows             →  hash (less coordination)
│   ├── Both share physical order on join key         →  full_sorting_merge
│   ├── Right doesn't fit                             →  grace_hash
│   └── Distributed, right on every shard             →  GLOBAL JOIN
│
└── Stable lookup, read constantly, dim barely changes
    └── Enrich at INGEST  →  pre-enrich upstream, or dictGet in an MV
                             (value frozen into the row, zero query-time lookup)

Next time you reach for LEFT JOIN, pause and run this tree first. The default (LEFT JOIN against a hash-built right side) is the right call maybe 20% of the time. The rest of the time, dicts win.

What's next

dictGet is the right default for enrichment at query time. JOIN is the escape hatch when the data shape doesn't fit a dictionary. And when the value is stable, the cheapest lookup of all is the one you resolve once at ingest and never run again.

These levers work on any ClickHouse deployment, including ObsessionDB.

You have data in two tables, and you need to combine them at query time. In OLTP-land you JOIN without a second thought. In ClickHouse, that reflex is wrong about 80% of the time.

Why OLAP makes joining expensive

The tools each solve a different shape of the problem:

Dimension lookup with low-cardinality reference data that updates slowly: a dictionary, via dictGet.
Two large analytical tables combined ad-hoc: a JOIN, with the right algorithm.
A stable lookup you read constantly while the dim barely moves: bake it in at ingest, either upstream in your pipeline or with dictGet inside a materialized view, and the query-time lookup disappears entirely.

The cheaper alternatives hierarchy from the query-optimization post puts dictGet at the top, JOINs at the bottom, but in this case, we are going deeper.

Dictionaries

Sources

CREATE DICTIONARY product_metadata (
    sku UInt64,
    name String,
    category UInt8,
    base_price Decimal(10, 2)
) PRIMARY KEY sku
SOURCE(POSTGRESQL(
    host 'pg.internal' port 5432
    user 'reader' password 'secret'
    db 'inventory' table 'products'
    invalidate_query 'SELECT max(updated_at) FROM products'
))
LIFETIME(MIN 60 MAX 300)
LAYOUT(HASHED());

Layouts: pick by key shape and memory budget

The layout is the decision that determines memory usage and lookup speed. ClickHouse has more layouts than most teams need to know about. Here are the ones you actually pick from:

flat is the fastest layout that exists. It's a plain array indexed directly by the key, so a lookup is one array offset with no hashing involved. The constraint is baked into that design: keys have to be dense UInt64 values under the max_array_size cap, 500K by default and configurable higher. Sequential integer keys, an auto-increment id or a CounterID, are the sweet spot. Feed it sparse or non-integer keys, and you either burn memory on the gaps or it refuses to build.
Reach for hashed when flat doesn't fit, which is most of the time. It's a standard hash table: any single key type goes in, gets hashed to a bucket, and walks a short chain on collision. Strings, wide integer ranges, and sparse keys, all handled. You pay a little more RAM than flat and a hash computation per lookup, but it stays O(1) effectively, and you stop worrying about key shape. When you don't know which layout you need, this is the right place to start.
complex_key_hashed is the same hash table with a composite key. Identical memory profile, except the key is a tuple of columns instead of a single one. Whenever your real key is "these two or three columns together," this is the layout.

Refresh strategies

Three mechanisms, each with a tradeoff.

LIFETIME(MIN n MAX m) refreshes the dict at a random interval between n and m seconds. The randomization prevents thundering herd across replicas refreshing the same source at the same instant. Use LIFETIME(0) to disable auto-refresh entirely.
invalidate_query is a SQL query whose result ClickHouse treats as a watermark with massive load reduction for slow-moving dictionaries. It compares against the previous result and the dict only refreshes when the watermark moves. Pattern:

SOURCE(POSTGRESQL(... invalidate_query 'SELECT max(updated_at) FROM products' ...))

update_field is the incremental option. ClickHouse appends WHERE update_field >= last_update_time to the source query and pulls only changed rows. Worth its weight in gold for large dictionaries where a full reload is operationally painful.

The `dictGet` family

The function you call most of the time is dictGet. The variants exist for a reason:

dictGet(name, attr, key) is the base call: it returns the attribute value for the key. On a miss, it falls back to the default declared in the dictionary's attribute definition (or the type's zero value if none is set).
dictGetOrDefault(name, attr, key, default) returns a typed fallback for missing keys. Prefer this when missing keys are normal in your data; it keeps NULL handling out of downstream queries.
dictGetOrNull returns NULL on miss. Right choice when missing should be distinguishable from a legitimate empty/zero value (v21.4+).
dictHas is a yes/no check, cheaper than dictGet when you only need to know whether a key exists.
dictGetAll returns all matching attribute values for regexp_tree dictionaries that may match multiple patterns (v23.5+).
dictGetKeys is the new reverse lookup: value to keys. Available in v25.12+. Useful when a JOIN-style operation is really "find all keys whose attribute matches this value."
The hierarchical family (dictGetHierarchy, dictGetChildren, dictGetDescendants, dictIsIn) handles parent-child dictionaries.

SELECT
    transaction_id,
    dictGetOrDefault(
        'address_labels',
        'label',
        (chain_id, from_address),
        'unknown'
    ) AS sender_label
FROM transfers
WHERE block_date = today();

Dictionaries in ObsessionDB

dictGet vs JOIN, on our own cluster

Approach	Wall time	Memory
Baseline (no enrichment)	11.0 s	22 MiB
`dictGet` (HASHED)	11.5 s	22 MiB
`dictGet` (FLAT)	11.3 s	22 MiB
JOIN `parallel_hash` (default 26.3)	~225 s	52 MiB
JOIN `hash` (single-threaded)	141–212 s	40 MiB

How does the gap evolve as the dim grows? We ran the comparison at 1K, 6.5K, 10K, 100K, and 1M dim sizes:

When dictGet doesn't win

The 20× headline is true at scale on the right shape of problem. There's a shape where it isn't, and we have internal benchmark data on it from a quantile pipeline we ran a while back.

Threshold table: ~500 rows. Fact table: full table scan. Tested four combinations of percentile function and lookup strategy:

Approach	Pipeline latency
`quantile` + window-function JOIN	21 ms (winner)
`quantileTDigest` + dictionary lookup	49 ms
`quantileTDigest` + window-function JOIN	93 ms
`quantile` + dictionary lookup	436 ms

JOINs: today's behavior

The query-optimization post had the full algorithm matrix for JOINs in ClickHouse. Read that for the comparison table; we'll keep this section to the operational nuance.

-- Table B has 3 rows for id=1
SELECT a.id, b.value FROM A INNER JOIN B ON a.id = b.id;
-- Returns 3 rows. Silent multiplication.

SELECT a.id, b.value FROM A INNER ANY JOIN B ON a.id = b.id;
-- Returns 1 row. What you wanted.

Enrich at ingest, not at query

The fastest lookup at read time is the one you already did at write time.

CREATE MATERIALIZED VIEW transfers_enriched_mv TO transfers_enriched AS
SELECT
    transaction_id,
    chain_id,
    from_address,
    dictGetOrDefault(
        'address_labels', 'label',
        (chain_id, from_address), 'unknown'
    ) AS sender_label
FROM transfers_raw;

Enrich at ingest when the historical value is the correct one, or close enough that staleness doesn't bite. Transaction labels, geo lookup at event time, and the category a product had when it was sold.
Keep query-time dictGet when you need the always-current answer. Live ownership, current tier, anything where "what is it now" beats "what was it then."

The decision tree

The takeaway artifact is what you should learn from this article.

Need to combine data from two sources?
│
├── Reference data (lookup), updates infrequently, fits in RAM
│   └── DICTIONARY
│       ├── Dense UInt64 keys < 500K        →  flat
│       ├── Composite keys                  →  complex_key_hashed
│       ├── Time-ranged (rates, ownership)  →  range_hashed
│       ├── CIDR / IP ranges                →  ip_trie
│       └── Many attributes per key         →  complex_key_hashed_array
│
├── Reference data doesn't fit in RAM
│   └── DICTIONARY with cache or ssd_cache
│
├── Two analytical tables, ad-hoc, both large
│   ├── Right is Dict / RocksDB / Join engine, ANY ok →  direct
│   ├── Right fits in RAM, dim ≥ ~5K rows             →  parallel_hash
│   ├── Right fits in RAM, dim < ~5K rows             →  hash (less coordination)
│   ├── Both share physical order on join key         →  full_sorting_merge
│   ├── Right doesn't fit                             →  grace_hash
│   └── Distributed, right on every shard             →  GLOBAL JOIN
│
└── Stable lookup, read constantly, dim barely changes
    └── Enrich at INGEST  →  pre-enrich upstream, or dictGet in an MV
                             (value frozen into the row, zero query-time lookup)

Next time you reach for LEFT JOIN, pause and run this tree first. The default (LEFT JOIN against a hash-built right side) is the right call maybe 20% of the time. The rest of the time, dicts win.

What's next

These levers work on any ClickHouse deployment, including ObsessionDB.

The Lookup Kit: ClickHouse® Dictionaries and JOINs

Why OLAP makes joining expensive

Dictionaries

Sources

Layouts: pick by key shape and memory budget

Refresh strategies

The `dictGet` family

Dictionaries in ObsessionDB

dictGet vs JOIN, on our own cluster

When dictGet doesn't win

JOINs: today's behavior

Enrich at ingest, not at query

The decision tree

What's next

Continue Reading

The Lookup Kit: ClickHouse® Dictionaries and JOINs

Why OLAP makes joining expensive

Dictionaries

Sources

Layouts: pick by key shape and memory budget

Refresh strategies

The `dictGet` family

Dictionaries in ObsessionDB

dictGet vs JOIN, on our own cluster

When dictGet doesn't win

JOINs: today's behavior

Enrich at ingest, not at query

The decision tree

What's next

Continue Reading

Why OLAP makes joining expensive

Dictionaries

Sources

Layouts: pick by key shape and memory budget

Refresh strategies

The dictGet family

Dictionaries in ObsessionDB

dictGet vs JOIN, on our own cluster

When dictGet doesn't win

JOINs: today's behavior

Enrich at ingest, not at query

The decision tree

What's next

Continue Reading

Why OLAP makes joining expensive

Dictionaries

Sources

Layouts: pick by key shape and memory budget

Refresh strategies

The dictGet family

Dictionaries in ObsessionDB

dictGet vs JOIN, on our own cluster

When dictGet doesn't win

JOINs: today's behavior

Enrich at ingest, not at query

The decision tree

What's next

Continue Reading

The `dictGet` family

The `dictGet` family