DEX Anomaly Detection
Anomaly Detection · Deep Learning · DeFi · Real-Time
Protocols kept asking us the same thing: how do we know if trading on our DEX is real? At Numia we were already indexing every transaction on Osmosis, up to two million a day. We had the data. What we didn't have was a way to separate genuine trading from wash trading, whale manipulation, and coordinated bot activity. I built an unsupervised ML pipeline to answer that question. No labels, no ground truth, just raw transaction data fed through an ensemble built around an autoencoder that learned what "normal" looks like and flagged everything that wasn't. The system caught whale accumulation patterns before price rallies, and a 0.909 Silhouette score confirmed the model was finding real structure in the data.
The Problem
DEX trading is opaque by design. Anyone can spin up wallets and trade against themselves to inflate volume numbers. Protocols making governance decisions based on trading metrics and token projects evaluating where to list all need to trust that the numbers are real. And at Numia, as the data layer those clients depend on, we had to be able to answer that question ourselves.
Osmosis was the natural starting point. We were already indexing the full Cosmos ecosystem, and Osmosis had the highest volume. Over two million daily transactions at peak, unusually clean data thanks to the Cosmos SDK architecture, and sub-second latency requirements because alerts that arrive after the move are useless.
There's no ground truth in anomaly detection for trading. You can't label historical transactions as "anomalous" or "normal" because that's exactly what you're trying to figure out. Supervised learning was off the table from day one. So the question became: can unsupervised models separate meaningful patterns from noise, and if they do, do those patterns actually correlate with real market events?
The Product
I tested five approaches: dense autoencoder, One-Class SVM, Isolation Forest, K-Means, and statistical baselines. The autoencoder won decisively, but I kept the ensemble because each model catches different kinds of patterns.
The autoencoder idea is simple: train a neural network to compress and reconstruct normal transactions. When something unusual comes through, the reconstruction error spikes. I spent weeks tuning the architecture. Too much capacity and it reconstructs everything faithfully, anomalies included, so nothing gets flagged. Too little and it can't learn normal patterns well enough to notice deviations. The sweet spot ended up being a 5-layer encoder with dropout, and the 0.909 Silhouette score proved it worked on real data.
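For a concrete picture, here's a minimal sketch of that kind of dense autoencoder in Keras. The layer sizes, dropout rate, and 99th-percentile cutoff are illustrative rather than the production configuration, and `X_train` / `X_live` stand in for standardized feature matrices.

```python
# Minimal sketch of a dense autoencoder scored by reconstruction error.
# Layer sizes, dropout rate, and the percentile threshold are illustrative.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder(n_features: int) -> tf.keras.Model:
    inputs = layers.Input(shape=(n_features,))
    # Encoder: progressively compress, with dropout so it can't just learn the identity map
    x = layers.Dense(64, activation="relu")(inputs)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Dropout(0.2)(x)
    bottleneck = layers.Dense(8, activation="relu")(x)
    # Decoder mirrors the encoder back out to the input dimension
    x = layers.Dense(32, activation="relu")(bottleneck)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(n_features, activation="linear")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

def reconstruction_scores(model: tf.keras.Model, X: np.ndarray) -> np.ndarray:
    """Per-transaction anomaly score = mean squared reconstruction error."""
    recon = model.predict(X, verbose=0)
    return np.mean((X - recon) ** 2, axis=1)

# Train on (mostly) normal traffic, then flag the tail of the error distribution.
# X_train, X_live = standardized feature matrices (hypothetical placeholders)
# model = build_autoencoder(X_train.shape[1])
# model.fit(X_train, X_train, epochs=20, batch_size=256, validation_split=0.1)
# threshold = np.percentile(reconstruction_scores(model, X_train), 99)
# flagged = reconstruction_scores(model, X_live) > threshold
```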
One-Class SVM and Isolation Forest pick up what the autoencoder misses. OCSVM draws a boundary around "normal" in feature space. Isolation Forest finds data points that are unusually easy to separate from the crowd. Different math, complementary results. I fixed the outlier ratio at 5% across all experiments. Arbitrary, sure, but consistent comparison mattered more than squeezing out an extra percent from each model individually.
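Here's a minimal sketch of those two detectors with the fixed 5% outlier ratio, using scikit-learn. The either-detector-flags-it combination is an assumption for illustration, not the production ensemble logic.

```python
# Sketch of the complementary detectors with a fixed 5% outlier ratio.
# The simple union combination below is an illustrative assumption.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

OUTLIER_RATIO = 0.05  # held constant across experiments for comparability

def fit_detectors(X_train: np.ndarray):
    ocsvm = OneClassSVM(kernel="rbf", nu=OUTLIER_RATIO, gamma="scale").fit(X_train)
    iforest = IsolationForest(
        contamination=OUTLIER_RATIO, n_estimators=200, random_state=42
    ).fit(X_train)
    return ocsvm, iforest

def ensemble_flags(ocsvm, iforest, X: np.ndarray) -> np.ndarray:
    """Flag a transaction if either detector calls it an outlier (-1)."""
    svm_out = ocsvm.predict(X) == -1
    forest_out = iforest.predict(X) == -1
    return svm_out | forest_out

# Usage (hypothetical feature matrices):
# ocsvm, iforest = fit_detectors(X_train)
# flags = ensemble_flags(ocsvm, iforest, X_live)
```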
Raw transaction data on its own doesn't tell you much. The features that actually mattered:
- Gas patterns: bots and urgent trades show distinctive behavior, and a wallet suddenly paying 10x its normal gas is worth watching.
- Wallet clustering: one wallet is noise, twenty wallets doing the same thing at the same time is signal.
- Cross-chain activity: IBC transfers into Osmosis often arrive right before large trades.
- Temporal patterns: retail trades during US market hours, while institutional activity shows up at odd times.
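A rough sketch of how those feature families could be computed with pandas. The column names (`wallet`, `gas_price`, `pool_id`, `is_ibc_inbound`, `ts`) and window sizes are hypothetical placeholders for the real schema.

```python
# Illustrative per-transaction feature engineering; column names are placeholders.
import pandas as pd

def engineer_features(txs: pd.DataFrame) -> pd.DataFrame:
    txs = txs.sort_values("ts").copy()

    # Gas patterns: how far is this trade's gas price from the wallet's own norm?
    txs["gas_ratio"] = txs["gas_price"] / txs.groupby("wallet")["gas_price"].transform("median")

    # Wallet clustering: how many distinct wallets hit the same pool in the same minute?
    txs["minute"] = txs["ts"].dt.floor("min")
    txs["concurrent_wallets"] = txs.groupby(["pool_id", "minute"])["wallet"].transform("nunique")

    # Cross-chain activity: did this wallet receive an IBC transfer in its last few actions?
    txs["recent_ibc_inflow"] = (
        txs.groupby("wallet")["is_ibc_inbound"]
        .transform(lambda s: s.astype(float).rolling(10, min_periods=1).max())
    )

    # Temporal patterns: flag activity outside rough US market hours (UTC).
    txs["off_hours"] = (~txs["ts"].dt.hour.between(13, 21)).astype(int)

    return txs[["gas_ratio", "concurrent_wallets", "recent_ibc_inflow", "off_hours"]]
```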
A Silhouette score of 0.909 meant the model found a real boundary in the data, not random noise. Mann-Whitney U tests compared detected anomalies against normal transactions on volatility, volume, and price impact, and the differences were statistically significant on every metric. Detected whale accumulation patterns preceded price rallies; not every time, but often enough that protocol teams could act on it.
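A sketch of what that validation step looks like in code, assuming SciPy and scikit-learn: the silhouette score on the anomaly/normal labels, then a Mann-Whitney U test per market metric. The variable names are illustrative.

```python
# Sketch of the statistical validation: cluster-separation quality via the
# Silhouette score, plus Mann-Whitney U tests comparing flagged vs. normal
# transactions on each market metric. Variable names are illustrative.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import silhouette_score

def validate(X: np.ndarray, flags: np.ndarray, metrics: dict[str, np.ndarray]) -> None:
    # How cleanly do the anomaly/normal labels separate the feature space?
    sil = silhouette_score(X, flags.astype(int))
    print(f"Silhouette score: {sil:.3f}")

    # Are flagged transactions distributionally different on real market metrics?
    for name, values in metrics.items():
        stat, p = mannwhitneyu(values[flags], values[~flags], alternative="two-sided")
        print(f"{name}: U={stat:.0f}, p={p:.2e}")

# Usage (hypothetical arrays):
# validate(X_live, flags, {
#     "volatility": volatility,
#     "volume": volume,
#     "price_impact": price_impact,
# })
```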
The Architecture
I built the full pipeline on GCP, on top of our existing Numia infrastructure, so this ran in production from the start.
Sub-second latency from blockchain confirmation to alert delivery. Pub/Sub ingests transactions from blockchain nodes with at-least-once delivery, which meant the feature engineering had to be idempotent. Cloud Functions handle stateless prediction (cold starts were a real headache until I set minimum instances on the critical paths). Firestore powers real-time alerts to the monitoring dashboard. BigQuery handles historical storage for training, plugging into the data warehouse we already had running. The model retrains weekly on the latest data.
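To illustrate the idempotency point, here's a sketch of a Pub/Sub-triggered Cloud Function that keys each alert by transaction hash, so a redelivered message overwrites the same Firestore document instead of duplicating the alert. The collection name, field names, threshold, and placeholder scoring function are assumptions, not the production code.

```python
# Sketch of the stateless scoring path: a Pub/Sub-triggered Cloud Function that
# scores a transaction and writes an alert keyed by tx hash, so at-least-once
# delivery stays idempotent. Names and the scoring stub are illustrative.
import base64
import json

import functions_framework
from google.cloud import firestore

db = firestore.Client()
ALERT_THRESHOLD = 0.95  # illustrative cutoff on the anomaly score

def score_transaction(tx: dict) -> float:
    """Placeholder for the real model call (e.g. the autoencoder's
    reconstruction error on the engineered features)."""
    return 0.0

@functions_framework.cloud_event
def handle_tx(cloud_event):
    # Pub/Sub wraps the payload in base64 inside the CloudEvent body.
    tx = json.loads(base64.b64decode(cloud_event.data["message"]["data"]))

    score = score_transaction(tx)
    if score > ALERT_THRESHOLD:
        # Keyed by tx hash: at-least-once redelivery overwrites the same
        # document instead of creating a duplicate alert (idempotent write).
        db.collection("anomaly_alerts").document(tx["tx_hash"]).set({
            "score": score,
            "block_height": tx.get("block_height"),
            "detected_at": firestore.SERVER_TIMESTAMP,
        })
```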
Market behavior drifts. A model trained on January data starts getting stale by March. Vertex AI manages training runs and model versions, with every model getting a unique ID and evaluation metrics tracked over time. New models roll out via blue-green deployments without downtime, and if the new version performs worse, automatic rollback kicks in. Drift monitoring tracks feature distributions and prediction confidence. When the model starts losing certainty, it triggers retraining automatically.
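A sketch of what a drift check like that can look like: a two-sample KS test per feature against the training snapshot, plus a floor on mean prediction confidence. The thresholds and the retraining hook are illustrative assumptions.

```python
# Sketch of the drift check: compare live feature distributions against the
# training snapshot (KS test) and watch mean prediction confidence.
# Thresholds and the retraining hook are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

KS_PVALUE_FLOOR = 0.01   # treat a feature as drifted if p drops below this
CONFIDENCE_FLOOR = 0.6   # retrain if mean prediction confidence sags below this

def detect_drift(train_features: np.ndarray,
                 live_features: np.ndarray,
                 live_confidence: np.ndarray) -> bool:
    """Return True when the model should be retrained."""
    drifted_cols = 0
    for col in range(train_features.shape[1]):
        _, p = ks_2samp(train_features[:, col], live_features[:, col])
        if p < KS_PVALUE_FLOOR:
            drifted_cols += 1

    losing_certainty = live_confidence.mean() < CONFIDENCE_FLOOR
    return drifted_cols > 0 or losing_certainty

# In the weekly job:
# if detect_drift(X_train_snapshot, X_last_week, confidences):
#     trigger_training_pipeline()   # hypothetical hook into the Vertex AI job
```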
Results
The system worked in production and the numbers held up under scrutiny.
- Model accuracy: 0.909 Silhouette score on the autoencoder, with Mann-Whitney tests confirming anomalies were statistically distinct from normal transactions across volatility, volume, and price impact.
- Predictive value: The system flagged whale accumulation the day before a 15% price move on multiple occasions. Not a crystal ball, but consistent enough for protocol teams to act on.
- Scale: 2M+ daily transactions analyzed with sub-second alert latency, running on our existing GCP infrastructure.
- Portability: The methodology transfers to any DEX with transaction-level data. We designed it on Osmosis, but the core approach works on Uniswap, Curve, or PancakeSwap. Feature engineering changes per chain, the detection logic doesn't.
What I Learned
Unsupervised models can find real structure in unlabeled blockchain data. The 0.909 Silhouette score was good validation, but the part that actually convinced me was watching the system flag whale accumulation before price moves in real time. Or seeing coordinated trading emerge across dozens of wallets that looked completely unrelated on paper.
Past a certain accuracy threshold, latency matters more than precision. We could have squeezed out slightly better detection by running heavier models, but a signal that arrives two seconds after a block confirms beats a perfect signal two minutes late. The sub-second pipeline turned out to be the most important engineering decision of the whole project.
Building this on our own production infrastructure meant it was useful from day one. Running it through proper statistical validation made it trustworthy. I've applied that same combination to every project since.