Analytics Dashboards
Full-Stack · Data Analytics · 50M+ Events/Day · Splunk
I built a monitoring platform that we deployed across three client types: banking apps, critical infrastructure, and government applications. Same core architecture, different dashboards tuned to what each client actually cared about. By the end we were processing 50M+ events daily with real-time analytics, crash detection, and 80% accuracy on usage forecasting.
The Problem
All three client types had the same issue: they were reactive. Support teams learned about problems from user complaints. Engineers discovered crashes hours after they happened. Capacity planning was guesswork.
The banking clients had millions of mobile app users but no visibility into crash patterns. The infrastructure teams monitored critical systems but couldn't correlate metrics with real-world impact. The government apps served citizens during peak periods (tax season, administrative deadlines) but couldn't predict load spikes.
The Architecture
Simpler than it sounds. Splunk's SDK plugged directly into client apps (mobile, web, infrastructure agents) and streamed events straight into Splunk. No custom ingestion pipeline. The SDK handled batching, retry logic, and offline queuing. We configured what to capture and Splunk did the rest.
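The exact SDK internals aren't shown here, but the capture-side behavior (batching, retry, offline queuing) can be sketched in a few lines. Everything below is illustrative: the class and field names are hypothetical, and `send` stands in for whatever transport delivers events to Splunk.

```python
import time
from collections import deque

class EventBatcher:
    """Minimal client-side sketch: queue events, flush in batches,
    retry on failure, and keep unsent events queued (offline behavior)."""

    def __init__(self, send, batch_size=50, max_retries=3):
        self.send = send              # callable(list[dict]) -> bool
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.queue = deque()

    def capture(self, event_type, **fields):
        self.queue.append({"event": event_type, "time": time.time(), **fields})
        if len(self.queue) >= self.batch_size:
            self.flush()

    def flush(self):
        while self.queue:
            batch = [self.queue.popleft()
                     for _ in range(min(self.batch_size, len(self.queue)))]
            for _ in range(self.max_retries):
                if self.send(batch):
                    break
            else:
                # All retries failed: requeue in order and stop.
                # The events go out on the next flush (offline queuing).
                self.queue.extendleft(reversed(batch))
                return
```

The point of the sketch is the failure path: events are never dropped on a bad network, just requeued for the next flush.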
Splunk was the entire backend. Database, query engine, ML platform. Clients purchased the licenses; we did everything else. We maintained the data warehouse, optimized queries, built pipelines, configured alerts, trained models. Full managed service. They got enterprise monitoring without hiring a data team, we got deep expertise across multiple production environments.
Flask was the application layer. The backend handled authentication and role-based access, acting as a security layer between users and Splunk. Clients never touched Splunk directly. Every query went through our API, which let us enforce permissions and audit access. The frontend was Flask templates.
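A minimal sketch of that security layer, assuming a role header for brevity (the real system used sessions and audit logging). The role names and index mapping are illustrative, not the actual configuration:

```python
from functools import wraps
from flask import Flask, abort, g, jsonify, request

app = Flask(__name__)

# Hypothetical role -> allowed Splunk index mapping
ROLE_INDEXES = {
    "support": {"app_crashes"},
    "engineering": {"app_crashes", "app_errors", "deployments"},
}

def require_role(*roles):
    """Reject requests whose role isn't in the allowed set."""
    def decorator(view):
        @wraps(view)
        def wrapped(*args, **kwargs):
            role = request.headers.get("X-Role")  # stand-in for real auth
            if role not in roles:
                abort(403)
            g.role = role
            return view(*args, **kwargs)
        return wrapped
    return decorator

@app.route("/api/query/<index>")
@require_role("support", "engineering")
def query(index):
    # Enforce per-role index access before anything touches Splunk
    if index not in ROLE_INDEXES[g.role]:
        abort(403)
    # ...run the Splunk search here, audit the access, return results...
    return jsonify({"index": index, "role": g.role})
```

Because every query funnels through one endpoint, permission checks and audit logging live in exactly one place.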
React handled the dynamic parts: animated charts, real-time updates, and interactive filters. We embedded React components into Flask templates rather than building a separate SPA. This kept the stack simple enough for a team without dedicated frontend developers.
Forecasting and anomaly detection ran in Splunk's ML Toolkit. Models lived alongside the data. No ETL to a separate platform, no batch jobs shuffling data around. Query Splunk, get predictions back.
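The actual searches aren't reproduced here, but the shape of a forecast query is simple: aggregate into a time series, then forecast in the same pipeline. This sketch builds an SPL string using Splunk's built-in `predict` command (MLTK's `fit`/`apply` commands work the same way, inline in the search); the index name is a placeholder.

```python
def forecast_query(index, span="1d", horizon=7):
    """Build an SPL search that turns raw events into a daily time
    series and forecasts `horizon` future points, all inside Splunk."""
    return (
        f"search index={index} "
        f"| timechart span={span} count AS events "
        f"| predict events future_timespan={horizon}"
    )
```

One string, one round trip: no export step, no separate model server.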
Everything ran in Docker with Compose. Health checks, automatic restarts, zero-downtime deployments. The platform had to be more reliable than the systems it monitored.
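A Compose fragment gives the flavor; service names, port, and endpoint are illustrative, not the production config:

```yaml
services:
  dashboard:
    build: .
    restart: unless-stopped        # automatic restarts on failure
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:5000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```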
Client-Specific Dashboards
The core infrastructure was the same, but dashboards were completely different. Each client had different questions.
Banking Apps
Banks cared about user experience and fraud signals. Dashboards showed real-time active users, crash rates by app version, transaction patterns. Support got a crash clustering interface that grouped similar issues by impact. Product managers got behavioral funnels showing where users dropped off.
The ML component predicted daily active users at ~80% accuracy. Sounds like a vanity metric, but it drove real decisions. When the model predicted a spike from an app store feature or marketing push, ops pre-scaled infrastructure. When it predicted a dip, they scheduled maintenance.
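"~80% accuracy" for a forecast usually means an error metric like MAPE, reported as 1 minus the error. The evaluation method isn't stated in the source, so treat this as one plausible reading; the traffic numbers below are made up for illustration.

```python
def mape(actual, predicted):
    """Mean absolute percentage error over paired observations."""
    assert len(actual) == len(predicted) and all(a != 0 for a in actual)
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Illustrative daily-active-user figures, not real client data
actual    = [120_000, 125_000, 118_000, 130_000]
predicted = [110_000, 130_000, 121_000, 118_000]
accuracy = 1 - mape(actual, predicted)
```

Reporting accuracy this way keeps it comparable across clients with very different traffic volumes.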
Critical Infrastructure
Here it was about correlation. When database latency spiked, what happened downstream? When a deployment went out, did error rates change? The dashboard connected metrics across systems to answer those questions.
Alerting was tighter. Banking apps could tolerate a few minutes of degradation. Infrastructure couldn't. We tuned thresholds lower and added escalation paths. Unacknowledged alert after 5 minutes? Goes to backup. Still nothing? Pages management.
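The escalation path above reduces to a small routing function. The 5-minute backup threshold is from the text; the 15-minute management threshold is an assumption for the sketch:

```python
def escalation_target(minutes_unacked, acked=False):
    """Route an alert along the escalation path: primary on-call first,
    backup after 5 unacknowledged minutes, management after 15 (assumed)."""
    if acked:
        return "primary"
    if minutes_unacked >= 15:
        return "management"
    if minutes_unacked >= 5:
        return "backup"
    return "primary"
```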
Government Apps
These had a distinct pattern: long quiet periods, then extreme spikes. Tax deadlines. Administrative filing periods. Public announcements. The dashboard focused on capacity forecasting and geographic distribution.
We built views by region because different autonomous communities had different calendars. The forecasting model learned seasonal patterns so ops could request capacity before peaks instead of scrambling during them.
Detection and Alerting
Raw metrics aren't useful on their own. Teams don't need to know the error rate is 0.3%. They need to know whether that's normal. I built detection rules in Splunk's ML Toolkit that learned baselines and alerted on deviations.
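The core idea of a learned baseline fits in a few lines. In production this ran as MLTK detection rules inside Splunk; this is just the statistical shape of it, with a conventional 3-sigma threshold as an assumption:

```python
from statistics import mean, stdev

def is_anomalous(history, current, threshold=3.0):
    """Flag the current value when it deviates more than `threshold`
    standard deviations from the historical baseline."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold
```

The same 0.3% error rate is fine for a service that usually sits at 0.3%, and an incident for one that usually sits at 0.03%.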
Role-specific alerting took work. Support needed crash notifications with user context. Engineering needed stack traces and deployment correlation. Management needed summaries, not noise. We built different views for each.
Every alert included context. Not "crash spike detected" but "crash rate up 3x in last 15 minutes, iOS 17.2 users, started after deployment abc123, 847 users impacted." That context turns a 30-minute investigation into a 30-second decision.
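Assembling that context is just string templating once the fields (deploy correlation, affected segment, user count) are queried alongside the metric. A sketch, with hypothetical field names:

```python
def enrich_alert(metric, multiplier, window_min, segment, deploy_id, users):
    """Turn a bare threshold breach into a contextual alert message."""
    return (
        f"{metric} up {multiplier}x in last {window_min} minutes, "
        f"{segment} users, started after deployment {deploy_id}, "
        f"{users} users impacted"
    )
```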
Results
Across all three client types:
- 50M+ daily events with consistent performance
- Crash detection went from hours to under 1 minute
- 70% fewer user-reported incidents (we caught them first)
- 3x faster resolution with enriched alerts
- ~80% accuracy on usage forecasting
- Sub-second queries across 90 days of data
- 99.9% uptime
Banking clients expanded to their full product suite. Infrastructure teams plugged it into their incident workflow. Government clients used the forecasts to justify budget for additional capacity.
What I Learned
The technical work was straightforward. Flask, Splunk, Docker. What made it useful was understanding what each client actually needed to see.
Engineers want stack traces. Support wants user impact. Management wants trends. Same data, different questions. I learned to start dashboards by asking "what decision does this help you make?" not "what data do you want?"
The other lesson was constraints. Every client wanted custom features. Every client had limited budget. The discipline was finding the 20% of customization that delivered 80% of value. Common infrastructure, client-specific views. That's how a small team serves multiple clients.
Forecasting taught me that accuracy matters less than honesty about confidence. An 80% prediction with clear intervals beats a 90% prediction presented as certain. People need to know when to trust the model and when to use judgment.