Virtual SOC
Data Engineering · Cybersecurity · Real-Time · 100M+ Events/Day · ELK · Airflow
This was the company's flagship data product. A virtual SOC that ingested security logs from enterprise clients, detected threats in real time, and automated the response. I owned the data infrastructure end-to-end: from log ingestion to analyst dashboards to automated playbooks. When I left, we were processing 100M+ events daily across 6 enterprise clients with sub-minute detection times.
The Problem
Security operations centers have a volume problem. A single enterprise might generate 100 million events per day. Firewalls, DNS, endpoints, cloud services, applications: everything produces logs. Somewhere in that flood of data, maybe 10 events actually indicate a threat. The rest is noise.
Traditional SOC analysts face an impossible task. They're drowning in alerts, most of which are false positives. They burn out triaging noise instead of investigating real threats. Worse, actual incidents get missed because analysts are fatigued.
Our clients needed a system that could ingest everything, find the signal in the noise, and let analysts focus on what matters. They also needed it to work reliably, because an hour of missed logs might contain the early signs of a breach.
The Architecture
I built a pipeline with three layers: ingestion, detection, and response. Each layer was designed for reliability first, speed second.
Ingestion Layer
Beats agents collected logs from 100+ data sources per client. Everything from Windows event logs to firewall traffic to cloud audit trails. Logstash normalized these into a common schema and pushed them to Elasticsearch.
The hard part wasn't handling volume since Elasticsearch can ingest a lot. The hard part was normalization. Every vendor has their own log format. A "failed login" looks different in Active Directory, AWS, and Okta. I wrote and maintained custom parsers for 100+ formats so that downstream detection rules could work consistently.
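To illustrate the normalization problem, here is a minimal sketch of what mapping vendor-specific "failed login" events into a common schema looks like. The field names and schema below are illustrative (loosely ECS-shaped), not the actual production parsers:

```python
# Illustrative sketch: normalize vendor-specific "failed login" events
# into a common schema. Field names are hypothetical examples.

def normalize_failed_login(raw: dict, source: str) -> dict:
    """Map a vendor-specific failed-login event to a common schema."""
    if source == "active_directory":
        # Windows Security log fields (Event ID 4625 = failed logon)
        return {
            "event.action": "failed_login",
            "user.name": raw.get("TargetUserName"),
            "source.ip": raw.get("IpAddress"),
            "@timestamp": raw.get("TimeCreated"),
        }
    if source == "aws_cloudtrail":
        return {
            "event.action": "failed_login",
            "user.name": raw.get("userIdentity", {}).get("userName"),
            "source.ip": raw.get("sourceIPAddress"),
            "@timestamp": raw.get("eventTime"),
        }
    if source == "okta":
        return {
            "event.action": "failed_login",
            "user.name": raw.get("actor", {}).get("alternateId"),
            "source.ip": raw.get("client", {}).get("ipAddress"),
            "@timestamp": raw.get("published"),
        }
    raise ValueError(f"no parser for source: {source}")
```

Once everything speaks the same schema, a single detection rule on `event.action == "failed_login"` covers every source, which is the whole point of the normalization layer.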
We ran an Elasticsearch cluster holding 2 PB+ of data with 90-day retention. Indexing optimizations improved query performance by 60% over the initial setup. This mattered because analysts needed instant search across months of data when investigating incidents.
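The kind of tuning involved looks roughly like the index settings below. These values are illustrative, not the actual cluster configuration, which varied per client:

```python
# Illustrative Elasticsearch index settings for high-volume log indices.
# Values are examples, not the production configuration.
index_settings = {
    "settings": {
        "number_of_shards": 6,             # spread indexing load across nodes
        "number_of_replicas": 1,           # survive a single node failure
        "refresh_interval": "30s",         # trade search freshness for indexing throughput
        "index.sort.field": "@timestamp",  # time-sorted segments speed up range queries
        "index.sort.order": "desc",
    },
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "source.ip": {"type": "ip"},
            "event.action": {"type": "keyword"},  # keyword, not text: exact-match filters
        }
    },
}
```

The biggest wins for log workloads typically come from relaxing `refresh_interval` (logs don't need to be searchable within one second of arrival) and mapping fields as `keyword` so detection rules can filter without full-text analysis overhead.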
Detection Layer
Apache Airflow orchestrated the detection logic. I designed 50+ DAGs that ran continuously: ingestion checks, detection rules, correlation jobs, and alerting pipelines.
The detection rules fell into two categories. First, known-bad patterns: specific indicators of compromise, malicious IPs, file hashes matching known malware. These are easy. Second, behavioral anomalies: a user logging in from two countries within an hour, a service account suddenly accessing files it never touched before, or complex event sequences matching known attack patterns. These are harder and required careful tuning to avoid false positives.
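As a concrete example of the second category, here is a minimal sketch of an "impossible travel" rule. The event shape and window are illustrative; the production rules ran as Airflow tasks over Elasticsearch queries:

```python
from datetime import datetime, timedelta

def impossible_travel(logins: list, window: timedelta = timedelta(hours=1)) -> list:
    """Flag same-user logins from different countries within the window.

    Each login is a dict like {"user": str, "country": str, "ts": datetime}.
    This shape is a hypothetical simplification of the normalized schema.
    """
    alerts = []
    last_seen = {}  # user -> most recent login event
    for event in sorted(logins, key=lambda e: e["ts"]):
        prev = last_seen.get(event["user"])
        if (prev is not None
                and prev["country"] != event["country"]
                and event["ts"] - prev["ts"] <= window):
            alerts.append((event["user"], prev["country"], event["country"]))
        last_seen[event["user"]] = event
    return alerts
```

The naive version above is exactly the kind of rule that needs tuning: VPN exits, mobile carriers, and shared accounts all produce legitimate "impossible" travel, which is why thresholds and allowlists matter as much as the core logic.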
Every alert got enriched with context before reaching an analyst. Not just "suspicious login detected" but "User X logged in from Brazil, their normal location is Madrid, they accessed 15 sensitive files in the last hour, here's their login history for the past month." Context makes the difference between a 30-minute investigation and a 30-second dismiss.
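Mechanically, enrichment meant joining each alert against a precomputed per-user baseline before it reached the queue. A minimal sketch, with a hypothetical profile shape:

```python
def enrich_alert(alert: dict, profile: dict) -> dict:
    """Attach baseline context so an analyst can triage in seconds.

    `profile` is a precomputed per-user baseline; its fields here
    are illustrative, not the actual production schema.
    """
    enriched = dict(alert)
    enriched["context"] = {
        "usual_country": profile.get("usual_country"),
        "is_unusual_location": alert.get("country") != profile.get("usual_country"),
        "sensitive_files_last_hour": profile.get("sensitive_file_count_1h", 0),
        "login_history_30d": profile.get("login_history", []),
    }
    return enriched
```

The design choice is to do this join at alert time, not at triage time: the analyst opens the alert and the context is already there, rather than pivoting through three dashboards to assemble it by hand.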
Response Layer
Siemplify (now Google SecOps) handled automated response. I built 50+ playbooks that could take action without human intervention for well-understood scenarios. Block an IP at the firewall. Disable a compromised account. Isolate an infected endpoint.
The key constraint was audit traceability. Every automated action got logged with full context: what triggered it, what data was considered, what action was taken. Clients in regulated industries needed this for compliance. More importantly, it let us review and improve playbooks over time.
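The shape of that constraint is easy to sketch: every action goes through one function that writes the audit record before returning. This is a hypothetical stub, not the Siemplify playbook code; in production the action would call the firewall or identity provider API:

```python
from datetime import datetime, timezone

AUDIT_LOG = []  # in production: an append-only, tamper-evident store

def run_playbook_action(trigger: dict, action: str, target: str) -> dict:
    """Execute a response action and record a full audit entry.

    `trigger` carries the alert that fired the playbook, so every
    automated action can be traced back to its cause.
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "trigger": trigger,   # what fired the playbook and why
        "action": action,     # e.g. "block_ip", "disable_account"
        "target": target,     # what the action was applied to
        "status": "executed",
    }
    # The real implementation would dispatch to the firewall/IdP here.
    AUDIT_LOG.append(record)
    return record
```

Funneling every action through a single audited path is what makes post-hoc playbook review possible: you can replay what the system did and decide whether it should have.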
Automated response reduced manual interventions by 60%. Not because automation replaced analysts, but because it handled the routine cases: the clearly malicious IPs and the obviously compromised accounts. This freed analysts to focus on ambiguous situations that needed human judgment.
What Made It Work
Technical architecture was table stakes. Every security vendor has a SOC stack and some automation. What made our platform work was tuning and the ability to adapt to the needs of each client.
Alert tuning is an ongoing conversation with analysts. We'd deploy a detection rule, watch the false positive rate, adjust thresholds, repeat. Some rules took months to get right. A rule that fires 100 times a day with 95% false positives is worse than useless: it trains analysts to ignore it, which means they'll miss the 5% that matter.
We aimed for actionable alerts. If an alert fires, the analyst should be able to decide within a minute whether it needs investigation. That meant aggressive context enrichment and careful threshold tuning. Better to miss some edge cases than to overwhelm analysts with noise.
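The tuning loop itself was driven by simple per-rule metrics. A sketch of the kind of check we ran, with illustrative numbers (in production the true/false dispositions came from analyst feedback):

```python
def rules_needing_retune(stats: dict, max_fp_rate: float = 0.5) -> list:
    """Flag rules whose false positive rate exceeds the threshold.

    `stats` maps rule name -> (times_fired, confirmed_true_positives).
    The 0.5 cutoff is an illustrative default, not a production value.
    """
    flagged = []
    for rule, (fires, true_positives) in stats.items():
        if fires == 0:
            continue  # a rule that never fires can't be evaluated here
        fp_rate = (fires - true_positives) / fires
        if fp_rate > max_fp_rate:
            flagged.append((rule, round(fp_rate, 2)))
    return flagged
```

A rule firing 100 times with 5 confirmed true positives has a 95% false positive rate; that is the kind of rule that gets pulled back for threshold work before it trains analysts to ignore it.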
The other thing that made it work was reliability. We designed for 99.99% uptime because the cost of downtime in security isn't just inconvenience, it's potential blind spots during an active attack. Redundant pipelines, graceful degradation, automated failover. When something broke, the system kept ingesting data while we fixed it.
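The graceful-degradation principle reduces to a simple pattern: when the primary output is down, buffer locally rather than drop events. A minimal sketch with injected callables (hypothetical interfaces; the real pipeline used Logstash's persistent queues and redundant outputs):

```python
def ship_events(events, primary_send, spool_write) -> int:
    """Forward events to the primary output, spooling on failure.

    `primary_send` and `spool_write` are injected callables; spooled
    events are replayed once the primary recovers. Returns the number
    of events that had to be spooled.
    """
    spooled = 0
    for event in events:
        try:
            primary_send(event)
        except ConnectionError:
            spool_write(event)  # degrade gracefully: buffer locally, never drop
            spooled += 1
    return spooled
```

The invariant is that ingestion never stops: an outage downstream costs latency, not data, which is exactly the property you need when the missing hour of logs might contain the start of a breach.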
Results
By the time I left:
- 100M+ events processed daily with consistent performance
- 6 enterprise clients running on the platform
- Real-time detection for critical single-step threats
- Sub-minute detection times for critical multi-step threats
- 60% reduction in manual analyst workload through automation
- 99.99% pipeline uptime over 12 months
- Zero security breaches detected during the operating period
Analysts could do security work instead of drowning in noise. Clients got better coverage with fewer people. We onboarded new clients without linearly scaling the team. The platform ran entirely on open source tools, which kept costs low and made it easier to customize for each client.
What I Learned
The technical work was the easy part. Building a pipeline that handles 100M events is a solved problem. The hard part was making it useful.
Useful means analysts trust the alerts. Trust comes from low false positive rates, which come from months of tuning, not clever algorithms. I learned to treat detection engineering as an iterative process, not a one-time implementation.
Automation in security isn't about replacing humans. It's about respecting their time. Every false positive you eliminate, every routine case you automate, gives an analyst hours back to do actual security work. The goal isn't zero human involvement, but human involvement on the cases that need human judgment.
I also learned to design for failure. Security systems break at the worst possible times. The infrastructure that seems reliable during normal operations will be tested during an active attack, probably at 3am. Redundancy, graceful degradation, and automated failover are the difference between detecting a breach and missing it.
Beyond the technical, this project taught me to push back. Clients would request features that sounded reasonable but would have degraded detection quality. I learned to say no, explain why, and look for feasible alternatives. Sometimes I was wrong and they convinced me. But the willingness to have that conversation, to defend technical decisions to non-technical people, turned out to be just as important as building the system itself.