AI Skills for ClickHouse® Database: Optimization Patterns Proven at 100B+ Rows
AI coding assistants write excellent SQL. They also write queries that will destroy your production database.
Ask Claude or Cursor to create a ClickHouse schema and you'll get something syntactically perfect: correct types, reasonable column names, an ENGINE clause that compiles. What you won't get is a schema optimized for your actual query patterns. The agent doesn't know that putting timestamp first in your ORDER BY will make user-specific queries 10x slower. It doesn't know that SELECT * in ClickHouse is a different kind of mistake than in PostgreSQL.
This is the AI slop problem for databases: outputs that look right, compile fine, and blow up your bill in production.
At ObsessionDB, we've been running ClickHouse at 100 billion+ rows and 10 million requests per day. We've seen queries that work in development but time out on real data, schemas that seem reasonable until you hit a billion rows, and use cases that simply don't fit ClickHouse at all. The patterns that separate working ClickHouse from production ClickHouse aren't in the documentation's first pages.
So we extracted those patterns into open source skills that teach your AI assistant the optimization rules that matter.
Why Skills Beat Documentation
ClickHouse has good documentation. The problem is volume.
An AI agent pointed at raw docs behaves like a new hire with access to Confluence: it can find information, but it doesn't know what matters. Ask it about PREWHERE and it'll cite the entire page. Ask about schema design and it'll summarize the general principles. What it won't do is tell you that ORDER BY column selection is the single decision most likely to make or break your queries.
This is the fundamental limitation. Agents are reactive. They answer the question you ask, not the question you should have asked. When you say "create a ClickHouse schema for events," the agent doesn't think "this person probably filters by user_id, so that should go first in ORDER BY." It doesn't know your query patterns. It doesn't know what production traffic looks like. So it makes a reasonable choice that happens to be wrong.
Documentation doesn't solve this because documentation is organized by feature, not by priority. The ORDER BY page explains what ORDER BY does. It doesn't say "this is the #1 thing you'll get wrong and here's the exact rule to follow." That knowledge lives in the heads of engineers who've debugged these problems, and it stays there until someone writes it down in a format agents can use.
Skills work differently. Instead of a manual to search, they provide tiered patterns delivered when the agent needs them. CRITICAL patterns that prevent 10-100x performance regressions come first. HIGH impact optimizations for production workloads follow. Everything has numbered priorities so agents know what to fix first.
When your agent loads the Schema Design skill, it gets the exact rules for ORDER BY column selection: the decision framework, not a link to documentation. The agent stops guessing. It asks the right questions or double-checks the pipeline before making the decision.
The Three Skill Categories
We organized ClickHouse optimization into three skills, ordered by impact:
| Skill | What It Covers |
|---|---|
| Schema Design | Table and pipeline design, engine selection, partitioning |
| Query Optimization | Full table scans, index usage, PREWHERE, memory blowouts |
| Materialized Views | Pre-aggregation, engine deep dives, refresh best practices |
Schema Design is where most performance is won or lost. ORDER BY column selection is one of the most common mistakes, followed by partition strategy and engine selection. Get this wrong and no amount of query tuning will save you.
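To make the partitioning point concrete, here is an illustrative sketch (the table and column names are invented for this example, not taken from the skill). Coarse monthly partitions keep the number of parts manageable, while a time-based TTL lets old data age out cheaply:

```sql
-- Illustrative table; names are assumptions for the example
CREATE TABLE logs (
    ts DateTime,
    service LowCardinality(String),
    message String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)   -- coarse monthly partitions, not per-day
ORDER BY (service, ts)      -- filter column first, then time
TTL ts + INTERVAL 90 DAY;   -- expired rows are removed during merges
```

Over-partitioning (for example, by day or by hour) is the common failure mode here: too many small parts slow down both inserts and merges.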
Query Optimization handles runtime patterns: PREWHERE vs WHERE, column selection, why SELECT * is particularly expensive in columnar databases, and join strategies that don't kill your cluster.
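As a sketch of the PREWHERE pattern (the query below is illustrative, assuming the events table from the example later in this post): PREWHERE reads only the filter column first, discards non-matching granules, and only then reads the remaining columns for surviving rows.

```sql
-- Illustrative query against a hypothetical events table
SELECT user_id, timestamp
FROM events
PREWHERE event_type = 'purchase'               -- cheap, selective filter read first
WHERE timestamp >= now() - INTERVAL 1 DAY;     -- remaining filter applied after
-- Note: explicit columns, never SELECT * -- in a columnar store,
-- every extra column is a separate file read.
```

ClickHouse can move WHERE conditions to PREWHERE automatically, but making the selective condition explicit documents the intent and avoids relying on the optimizer's heuristics.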
Materialized Views enable real-time dashboards without query-time aggregation. The skill covers the most common patterns, refresh strategies, and the gotchas that cause data inconsistencies.
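One common pattern the skill covers looks roughly like this (a minimal sketch; table and view names are invented for illustration). A SummingMergeTree target table plus a materialized view maintains daily counts incrementally on insert:

```sql
-- Illustrative pre-aggregation: daily event counts
CREATE TABLE daily_events (
    day Date,
    event_type String,
    cnt UInt64
) ENGINE = SummingMergeTree()
ORDER BY (day, event_type);

CREATE MATERIALIZED VIEW daily_events_mv TO daily_events AS
SELECT toDate(timestamp) AS day, event_type, count() AS cnt
FROM events
GROUP BY day, event_type;
```

One of the gotchas mentioned above lives right here: rows are only collapsed at merge time, so dashboard queries must still use sum(cnt) with GROUP BY rather than reading cnt directly.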
Example: ORDER BY Column Order Matters
As a basic example, let's imagine a table that will be queried by user_id. Here's the kind of schema agents often generate:
```sql
-- INCORRECT: timestamp first
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id);
```
This looks reasonable. Events have timestamps, ordering by them seems natural.
However, anyone who has used ClickHouse knows this is wrong. Here's the correct version:
```sql
-- CORRECT: filter column first
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);
```
Why this matters: ClickHouse stores data sorted by the ORDER BY columns. When you query WHERE user_id = 123, the database uses that ordering to jump directly to the relevant data, like an index lookup.
With ORDER BY (timestamp, user_id), ClickHouse has to scan all data to find rows for a specific user. The timestamp ordering is useless for user-specific queries, which are typically a big chunk of production traffic.
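A typical user-scoped query makes the difference visible (the query below is illustrative). With ORDER BY (user_id, timestamp), ClickHouse's sparse primary index skips every granule that cannot contain user 123; with timestamp first, it has to scan them all.

```sql
-- User-scoped query: fast with ORDER BY (user_id, timestamp),
-- a near-full scan with ORDER BY (timestamp, user_id)
SELECT event_type, count()
FROM events
WHERE user_id = 123
  AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY event_type;
```

You can verify the effect yourself: EXPLAIN indexes = 1 on the query shows how many granules the primary key pruned versus the table total.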
The impact? In production, we've seen queries go from 20 seconds to 100 milliseconds by fixing column order. A dev with ClickHouse experience spots this immediately, but models make this mistake consistently. We got tired of reviewing the same issue.
The Schema Design skill teaches agents to ask: "What columns will queries filter by most often?" and put those first. A simple rule that prevents the single most common ClickHouse performance disaster.
Why Agents Keep Making the Same Mistakes
Every agent conversation starts fresh. Your AI assistant doesn't know that last week you spent four hours debugging an ORDER BY issue. It doesn't know your team requires LowCardinality to cut costs for string columns under 10,000 distinct values. It doesn't know that you've already established patterns for partition keys or that your dashboards depend on specific materialized view refresh intervals.
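The LowCardinality convention mentioned above is the kind of team rule a skill can encode. As a hedged sketch (assuming an event_type column with few distinct values), the migration is a one-line type change that switches the column to dictionary encoding:

```sql
-- Hypothetical migration: event_type has well under 10,000 distinct values
ALTER TABLE events
    MODIFY COLUMN event_type LowCardinality(String);
```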
Without guidance, the agent falls back on what it knows from training data: generic patterns that compile, general advice from documentation, and reasonable guesses based on column names. This is how you end up with technically correct schemas that miss your actual use case.
The agent isn't stupid. It's uninformed. Given the right context, it makes good decisions. Given no context, it makes safe, generic decisions that often turn out to be wrong for your specific situation.
Skills fix this. When an agent loads the ClickHouse skills, it knows what your senior engineers know. The schema design skill doesn't just list rules; it tells the agent to ask "what columns will queries filter by most often?" before making ORDER BY decisions. The query optimization skill explains why SELECT * is worse in ClickHouse than in row-oriented databases. The materialized views skill covers the merge timing gotchas that cause stale data.
Install the lesson once, and every future session benefits. The agent stops making the same mistakes because someone taught it what mistakes to avoid.
The skills work across Claude Code, Cursor, and any agent that loads them.
Getting Started
Installation takes one command:
```shell
npx skills add obsessiondb/clickhouse-skills
```
The skills load automatically when you're working on ClickHouse schemas or queries. No configuration needed.
What's Next
The skills are open source. If you've learned ClickHouse lessons the hard way, contribute them or file issues for patterns we missed.
The goal is making ClickHouse expertise accessible to every developer, whether they're writing queries directly or through an AI assistant.
Star the GitHub repo, try the skills, let us know what hurts.
Originally written for obsessionDB. Read the original article here.