Trading Automation
Data Engineering · Quantitative Analysis · ClickHouse · PostgreSQL · Airflow
I trade my own money. Not much, but enough that I care about the outcomes. What started as a spreadsheet tracking a few positions turned into a proper data engineering project when I got tired of manually copying prices and running backtests in Jupyter notebooks that broke every time I reopened them. Now I have PostgreSQL for live operations, ClickHouse for historical analysis, and Airflow making sure everything runs while I sleep.
Why Build This
Trading research generates a lot of data. Price feeds, order books, sentiment signals, macro indicators. Most retail traders lose this data or stuff it into spreadsheets that fall apart at scale. I was one of them for years.
The breaking point was backtesting. I had a strategy idea, pulled historical data into a notebook, ran a backtest, got promising results. Two weeks later I wanted to test a variation and couldn't reproduce my own work. Different data ranges, slightly different parameters, outputs that didn't match. I'd been fooling myself with sloppy methodology.
I needed infrastructure. Real pipelines that run reliably. Historical data I could trust. Backtests that were reproducible. If I was going to risk money on systematic strategies, I wanted to treat it like engineering, not gambling.
The Architecture
Two databases, one scheduler. Sounds simple because it is. The complexity is in the data flows, not the stack.
PostgreSQL handles operational state. Positions, orders, account balances. Anything that needs transactional guarantees and low-latency access. When a signal fires and I need to check current exposure before sizing a trade, that query hits Postgres.
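A minimal sketch of that pre-trade gate, with hypothetical table and column names (my actual schema differs): the SQL aggregates open exposure in Postgres, and a plain function decides whether a new order fits under the cap.

```python
# Hypothetical query against the operational Postgres database.
EXPOSURE_SQL = """
SELECT COALESCE(SUM(quantity * avg_price), 0) AS gross_exposure
FROM positions
WHERE status = 'open';
"""

def can_add_position(gross_exposure: float, order_value: float,
                     max_exposure: float) -> bool:
    """Reject any order that would push gross exposure past the cap."""
    return gross_exposure + order_value <= max_exposure
```

The point is that this check is one indexed transactional query, exactly what Postgres is good at.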
ClickHouse handles analytical queries. Five years of price history. Tick data. Backtest results. Billions of rows that would crush Postgres but that ClickHouse scans in milliseconds. When I want to know how a strategy performed during the 2022 drawdown or compare Sharpe ratios across 20 variations, that query hits ClickHouse.
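The Sharpe comparison is one aggregation in ClickHouse. A sketch, assuming a hypothetical `daily_returns` table (strategy_id, day, ret), with the equivalent Python formula alongside for reference:

```python
import statistics

# Hypothetical ClickHouse query: annualized Sharpe per strategy variation.
SHARPE_SQL = """
SELECT strategy_id,
       avg(ret) / stddevSamp(ret) * sqrt(252) AS sharpe
FROM daily_returns
GROUP BY strategy_id
ORDER BY sharpe DESC;
"""

def sharpe(daily_returns: list[float], periods: int = 252) -> float:
    """Annualized Sharpe ratio; risk-free rate assumed zero."""
    return (statistics.mean(daily_returns)
            / statistics.stdev(daily_returns)) * periods ** 0.5
```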
Airflow orchestrates everything. Ingestion jobs pull from exchange APIs. Transformation jobs compute technical indicators. Report generation runs overnight. I have 30+ DAGs that coordinate the whole system. When something breaks, I get an alert instead of discovering stale data three days later.
How Data Flows
Websocket connections stream prices continuously. That data lands in Postgres first for operational use, then is batch-loaded into ClickHouse hourly for historical analysis. The separation matters because analytical queries would wreck Postgres performance, and I need Postgres responsive for trading.
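The hourly batch step is watermark-based: each run takes only rows newer than the last load and records where it stopped. A simplified sketch of that logic (row shape and field names are illustrative):

```python
# Hypothetical watermark filter for the Postgres -> ClickHouse batch load.
def rows_since(rows: list[dict], watermark: float) -> tuple[list[dict], float]:
    """Return rows newer than the watermark, plus the watermark for next run."""
    fresh = [r for r in rows if r["ts"] > watermark]
    new_mark = max((r["ts"] for r in fresh), default=watermark)
    return fresh, new_mark
```

Rerunning a load with the same watermark is harmless, which is what makes the hourly job safe to retry.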
Raw data transforms into indicators, features, signals. Moving averages, volatility measures, regime classifications. These live as materialized views in ClickHouse that refresh on schedule. I can query the derived data without recomputing it every time.
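In production these indicators live in ClickHouse; the computation itself is nothing exotic. A simple moving average, shown here as equivalent Python rather than the actual view definition:

```python
import statistics

def sma(prices: list[float], window: int) -> list[float]:
    """Simple moving average: one value per complete window."""
    return [statistics.mean(prices[i - window:i])
            for i in range(window, len(prices) + 1)]
```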
Backtest results get stored too. Every strategy variation I test goes into a results table with the parameters used, the performance metrics, and timestamps. When I revisit an idea months later, I can see exactly what I already tried.
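A sketch of what that results table might look like; the DDL and column names here are illustrative, not my exact schema. Parameters go in as sorted JSON so identical configurations always serialize identically:

```python
import json
import time

# Hypothetical ClickHouse results table.
RESULTS_DDL = """
CREATE TABLE IF NOT EXISTS backtest_results (
    strategy String,
    params   String,    -- JSON-encoded parameters
    sharpe   Float64,
    max_dd   Float64,
    run_at   DateTime
) ENGINE = MergeTree ORDER BY (strategy, run_at);
"""

def result_row(strategy: str, params: dict,
               sharpe: float, max_dd: float) -> dict:
    """One row per backtest run, with parameters serialized deterministically."""
    return {
        "strategy": strategy,
        "params": json.dumps(params, sort_keys=True),
        "sharpe": sharpe,
        "max_dd": max_dd,
        "run_at": int(time.time()),
    }
```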
The Signal Engine
I trade discretionary strategies, but I wanted to systematize the research. The signal engine codifies my ideas into testable logic. Define entry conditions, exit conditions, position sizing rules. Run against historical data. Store results.
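The core of the engine is smaller than it sounds. A stripped-down sketch, assuming single-position, full-allocation trades (my real version handles sizing and costs):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Strategy:
    name: str
    entry_rule: Callable[[list[float]], bool]          # price history -> enter?
    exit_rule: Callable[[list[float], float], bool]    # history, entry -> exit?

def backtest(strategy: Strategy, prices: list[float]) -> list[float]:
    """Walk the price series bar by bar; return per-trade returns."""
    trades: list[float] = []
    entry_price = None
    for i in range(1, len(prices) + 1):
        window = prices[:i]
        if entry_price is None:
            if strategy.entry_rule(window):
                entry_price = window[-1]
        elif strategy.exit_rule(window, entry_price):
            trades.append(window[-1] / entry_price - 1.0)
            entry_price = None
    return trades

# Illustrative variation: buy a dip, sell a recovery.
dip_buyer = Strategy(
    name="dip",
    entry_rule=lambda w: w[-1] <= 95,
    exit_rule=lambda w, entry: w[-1] >= 105,
)
trades = backtest(dip_buyer, [100, 95, 98, 106, 100])
```

Every variation is just a new pair of rule functions, which is what makes testing 20 of them cheap.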
I've tested 20+ strategy variations through this framework. Most didn't work. That's the point. Finding out a strategy fails in backtest is cheap. Finding out it fails with real money is expensive. The system exists to fail fast on bad ideas.
Risk management is built in. Position sizing based on volatility. Maximum exposure limits. Correlation checks to avoid concentration. The guardrails run automatically so I can't talk myself into oversized bets when I'm feeling confident.
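The sizing rule is the usual volatility-targeting arithmetic; the parameter values below are illustrative defaults, not my live settings:

```python
def position_size(equity: float, price: float, daily_vol: float,
                  risk_per_trade: float = 0.01,
                  max_weight: float = 0.10) -> int:
    """Shares such that a one-sigma daily move loses ~risk_per_trade of
    equity, capped so no position exceeds max_weight of the account."""
    if daily_vol <= 0:
        return 0
    risk_sized = (equity * risk_per_trade) / (price * daily_vol)
    capped = (equity * max_weight) / price
    return int(min(risk_sized, capped))
```

Low-volatility names hit the exposure cap; high-volatility names get cut down by the risk term. Either way the formula decides, not my mood.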
AI for Research, Not Trading
I integrated the OpenAI API for research assistance, not for generating trade signals. Sentiment analysis on headlines. Summarizing FOMC minutes. Pattern recognition on charts that might be worth investigating manually. The AI suggests, I decide.
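The headline-scoring step looks roughly like this. The prompt wording, score scale, and model name are assumptions for illustration; `client` is an `openai.OpenAI()` instance:

```python
def build_prompt(headlines: list[str]) -> str:
    """Assemble a one-shot sentiment prompt over a batch of headlines."""
    lines = "\n".join(f"- {h}" for h in headlines)
    return ("Rate the market sentiment of each headline from -1 (bearish) "
            f"to 1 (bullish). Reply with one number per line.\n{lines}")

def score_headlines(client, headlines: list[str]) -> list[float]:
    """Return one sentiment score per headline."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_prompt(headlines)}],
    )
    return [float(x) for x in resp.choices[0].message.content.split()]
```

The scores get logged next to the date, not traded on; they are a filter for what I read, nothing more.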
Honestly, this part is experimental. I'm not convinced LLMs understand markets better than noise. But they're useful for processing text I don't have time to read, and occasionally they surface something I would have missed.
Streamlit Dashboards
Three dashboards for different purposes. Portfolio view shows current positions, PnL, risk metrics. Research view queries ClickHouse for historical analysis and strategy comparison. Pipeline monitor tracks Airflow health and data freshness.
I built these in Streamlit because I wanted something functional in hours, not days. They're not pretty. They work.
What Actually Changed
The platform saves me an hour a day minimum. No more manual data collection. No more broken notebooks. No more wondering if the data is fresh.
More importantly, my research quality improved. Reproducible backtests. Versioned strategy parameters. Historical results I can query. When I think a strategy worked, I can verify it worked against actual data, not a vague memory of running something similar once.
ClickHouse returns sub-second results on queries that would take minutes in Postgres. That changes how I explore ideas. I can ask questions interactively instead of waiting for batch jobs.
What I Learned
Building this taught me data engineering at a level client work never did. When it's your own system, your own data, your own money at stake, you care about every detail. Schema design decisions that seemed academic become concrete when you're the one paying for query performance.
OLAP vs OLTP isn't just a concept. I felt the difference when analytical queries started impacting operational latency. Separating the databases wasn't premature optimization; it was survival.
Self-hosted infrastructure teaches you operational overhead. Airflow needs monitoring. ClickHouse needs disk space management. Postgres needs backup verification. The "free" option costs time instead of money.
Most trading ideas don't work. Having infrastructure to test them quickly matters more than having clever ideas. The edge isn't in the signal; it's in testing more signals faster with better methodology. That's an engineering problem, not a finance problem.
I'm still figuring out if any of my strategies actually have edge. But at least now I'll know for sure when they don't.