App Monitoring

Application monitoring tool that consolidates metrics, logs, and traces, providing real-time alerting, product and engineering dashboards, and roughly 3× faster MTTR.

Built with Flask and React, using Splunk as the database and its ML Toolkit for forecasting, with Docker for deployment.

Full-Stack Development · Business Intelligence · Time Series · Big Data

Crash detection < 1 min · 70% fewer incidents · 99.9% uptime · 100M+ events/day · 80% prediction accuracy

Flask · React · Splunk · Docker

Duration

Jan 2021 – Jun 2021

Introduction

Delivered a production monitoring platform for a Spanish bank that detects crashes in under a minute, analyzes real-time usage, and forecasts peak hours across 100M+ daily events. Using Splunk as both the time-series store and the ML platform kept the stack simple while enabling sub-second queries during investigations.

The Challenge

The bank’s mobile app served millions of users but lacked visibility into crashes and behavior patterns. Support teams were reactive, learning about issues from customer complaints. We needed to track interactions, detect crashes immediately, predict peak usage hours, and provide actionable insights, all within the bank’s existing Splunk stack.

Solution & Approach

I built an end-to-end monitoring solution integrated with the bank’s stack:

Backend Architecture (Flask)

RESTful API ingesting mobile app events and crash reports
Processing pipeline normalizing iOS and Android logs
Integration layer connecting to Splunk for storage and retrieval
Dockerized microservices ensuring consistent deployments across environments
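
To illustrate the normalization step, here is a minimal sketch of mapping platform-specific payloads onto one common schema before they are forwarded to storage. The payload keys (eventName, bundleVersion, ts_ms, and so on) and the AppEvent schema are hypothetical; the real field names used in the project are not documented here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AppEvent:
    """Common schema shared by both platforms (field names are hypothetical)."""
    platform: str
    event_type: str
    app_version: str
    timestamp: str  # ISO 8601, UTC


def normalize_event(raw: dict) -> AppEvent:
    """Map a platform-specific payload onto the common schema."""
    platform = raw["platform"].lower()
    if platform == "ios":
        # Assumed iOS payload: epoch seconds, camelCase keys.
        ts = datetime.fromtimestamp(raw["timestamp"], tz=timezone.utc)
        return AppEvent("ios", raw["eventName"], raw["bundleVersion"], ts.isoformat())
    if platform == "android":
        # Assumed Android payload: epoch milliseconds, snake_case keys.
        ts = datetime.fromtimestamp(raw["ts_ms"] / 1000, tz=timezone.utc)
        return AppEvent("android", raw["event"], raw["version_name"], ts.isoformat())
    raise ValueError(f"unsupported platform: {raw['platform']!r}")
```

Normalizing at ingestion time means downstream searches and dashboards only need to know one event shape, regardless of which client produced the event.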

Frontend Dashboard (React)

Real-time views of active users, crash rates, and performance
Interactive charts showing behavioral patterns by time and location
Crash analysis interface clustering similar issues for efficient debugging
Mobile-responsive design for on-the-go support teams

Analytics with Splunk

Splunk as the primary database for events and metrics
Time-series forecasting with Splunk ML Toolkit for usage estimates
Automated searches detecting crash spikes and unusual patterns
Role-specific dashboards for support, engineering, and management
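
The spike-detection idea can be sketched outside Splunk as a simple baseline comparison. This is not the scheduled search used in the project; it is an illustrative Python version that assumes per-minute crash counts and a z-score threshold of 3, both of which are assumptions.

```python
from statistics import mean, pstdev


def is_crash_spike(history: list[int], current: int, z_threshold: float = 3.0) -> bool:
    """Return True if `current` is more than `z_threshold` standard
    deviations above the mean of the recent `history` counts."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat baseline: treat any increase as a spike.
        return current > baseline
    return (current - baseline) / spread > z_threshold
```

For example, with a recent history of 2 to 3 crashes per minute, a minute with 20 crashes is flagged while a minute with 3 is not.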

Monitoring & Automation

Real-time alerts for crash rate thresholds via email and Slack
Automated daily reports on user behavior and app health
Docker Compose orchestrating services with health checks
CI/CD pipeline for zero-downtime deployments
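
A threshold alert of the kind listed above can be sketched as follows. The 2% threshold and the message wording are assumptions made for illustration; the function only builds the JSON body that a Slack incoming webhook accepts (an object with a "text" field) and does not send anything.

```python
import json
from typing import Optional


def crash_rate(crashes: int, sessions: int) -> float:
    """Fraction of sessions that ended in a crash (0.0 when there are no sessions)."""
    return crashes / sessions if sessions else 0.0


def build_alert(crashes: int, sessions: int, threshold: float = 0.02) -> Optional[str]:
    """Return a Slack-style JSON payload if the crash rate exceeds the
    threshold, otherwise None. The threshold value is hypothetical."""
    rate = crash_rate(crashes, sessions)
    if rate <= threshold:
        return None
    text = (
        f"Crash rate {rate:.1%} exceeds threshold {threshold:.1%} "
        f"({crashes} crashes / {sessions} sessions)"
    )
    return json.dumps({"text": text})
```

Keeping payload construction separate from delivery makes the alert logic easy to test and lets the same message feed both email and Slack.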

Results & Impact

Teams moved from reactive to proactive: crash detection dropped from hours to under a minute, user-reported incidents fell by ~70%, and MTTR improved ~3× with richer, role-specific alerting context. Sub-second queries over 100M+ events and ~80% forecasting accuracy improved incident response and capacity planning.

Operational Impact

Crash detection time reduced from hours to under a minute
~70% fewer user-reported incidents, as issues were detected proactively
3× faster resolution with detailed support data
Accurate peak-hour prediction for capacity planning

Technical Achievements

100M+ daily events processed via Flask backend
Sub-second query performance in Splunk despite data volume
99.9% uptime with Dockerized services and health checks
~80% accuracy forecasting DAU with ML
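
The metric behind the ~80% figure is not specified. One common way to express forecast accuracy for daily active users is 1 minus the mean absolute percentage error (MAPE); the sketch below assumes that definition purely for illustration.

```python
def forecast_accuracy(actual: list[float], predicted: list[float]) -> float:
    """Return 1 - MAPE for paired actual/predicted values.
    Days with an actual value of 0 are skipped to avoid division by zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted) if a != 0]
    if not errors:
        raise ValueError("no non-zero actual values to score against")
    mape = sum(errors) / len(errors)
    return 1.0 - mape
```

Under this definition, forecasts that miss the actual DAU by 20% on average would score 0.8.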

The project demonstrated how modern web technologies, combined with enterprise tools like Splunk, can deliver powerful monitoring. The bank continued expanding the tool to cover its full digital product suite.