App Monitoring

Application monitoring tool that consolidates metrics, logs, and traces, with real-time alerting, product and engineering dashboards, and roughly 3× faster MTTR.

Built with Flask and React, using Splunk as the database and its ML Toolkit for forecasting, with Docker for deployment.

Full-Stack Development · Business Intelligence · Time Series · Big Data
Crash detection < 1 min · 70% fewer incidents · 99.9% uptime · 100M+ events/day · 80% prediction accuracy
Flask · React · Splunk · Docker

Introduction

Delivered a production monitoring platform for a Spanish bank that detects crashes in under a minute, analyzes real-time usage, and forecasts peak hours across 100M+ daily events. Using Splunk as both the time-series store and the ML platform kept the stack simple while keeping investigative queries sub-second.

The Challenge

The bank’s mobile app served millions of users but offered no visibility into crashes or usage patterns. Support teams were reactive, learning about issues from customer complaints. We needed to track interactions, detect crashes immediately, predict peak usage hours, and provide actionable insights, all within the bank’s existing Splunk stack.

Solution & Approach

I built an end-to-end monitoring solution integrated with the bank’s stack:

Backend Architecture (Flask)

  • RESTful API ingesting mobile app events and crash reports
  • Processing pipeline normalizing iOS and Android logs
  • Integration layer connecting to Splunk for storage and retrieval
  • Dockerized microservices ensuring consistent deployments across environments
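
The normalization step in the pipeline above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the per-platform field names, the timestamp formats, and the `mobile:event` sourcetype are hypothetical, and the source does not say how events were delivered to Splunk. The envelope shape (`time`, `sourcetype`, `event`) follows Splunk's HTTP Event Collector format, which is one common ingestion route.

```python
from datetime import datetime, timezone

# Hypothetical field names per platform; the real payload schemas are not given.
FIELD_MAP = {
    "ios": {"user": "userId", "ts": "timestamp", "name": "eventName", "version": "appVersion"},
    "android": {"user": "user_id", "ts": "ts_ms", "name": "event", "version": "app_version"},
}


def to_epoch(platform, value):
    """Convert a platform timestamp to epoch seconds (UTC)."""
    if platform == "android":
        # Assumption: Android clients send epoch milliseconds.
        return value / 1000.0
    # Assumption: iOS clients send ISO 8601 strings; naive values are treated as UTC.
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def normalize(platform, raw):
    """Map a raw iOS or Android payload onto one shared schema."""
    fields = FIELD_MAP[platform]
    return {
        "platform": platform,
        "user_id": str(raw[fields["user"]]),
        "time": to_epoch(platform, raw[fields["ts"]]),
        "event": raw[fields["name"]],
        "app_version": raw[fields["version"]],
    }


def to_hec_event(event, sourcetype="mobile:event"):
    """Wrap a normalized event in an HTTP Event Collector style envelope."""
    return {"time": event["time"], "sourcetype": sourcetype, "event": event}
```

With a shared schema, a single Splunk search can cover both platforms instead of branching on field names. In a Flask service, a function like `normalize` would typically be called inside the ingestion route before the event is forwarded.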

Frontend Dashboard (React)

  • Real-time views of active users, crash rates, and performance
  • Interactive charts showing behavioral patterns by time and location
  • Crash analysis interface clustering similar issues for efficient debugging
  • Mobile-responsive design for on-the-go support teams
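
The crash clustering behind the analysis interface is not described in detail. One common approach, sketched here as an assumption rather than the project's actual method, is to fingerprint each crash from its exception type and top stack frames after stripping volatile details such as line numbers and memory addresses. The crash record keys (`id`, `type`, `stack`) are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(exception_type, stack_trace, top_frames=3):
    """Build a stable key from the exception type and the top stack frames."""
    frames = []
    for line in stack_trace.splitlines():
        line = line.strip()
        if not line:
            continue
        # Remove details that vary between otherwise identical crashes.
        line = re.sub(r"0x[0-9a-fA-F]+", "<addr>", line)
        line = re.sub(r":\d+", ":<n>", line)
        frames.append(line)
    key = exception_type + "|" + "|".join(frames[:top_frames])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def cluster(crashes):
    """Group crash ids by fingerprint; each crash is a dict with id, type, stack."""
    groups = defaultdict(list)
    for crash in crashes:
        groups[fingerprint(crash["type"], crash["stack"])].append(crash["id"])
    return dict(groups)
```

Two crashes that differ only in line numbers (for example after a minor release shifts the code) land in the same group, so the interface can show one issue with a count rather than many near-duplicates.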

Analytics with Splunk

  • Splunk as the primary database for events and metrics
  • Time-series forecasting with Splunk ML Toolkit for usage estimates
  • Automated searches detecting crash spikes and unusual patterns
  • Role-specific dashboards for support, engineering, and management
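
To make the peak-hour forecasting idea concrete, here is a small stand-in written in plain Python. It is not the Splunk ML Toolkit and does not reproduce whichever algorithm the project used; it is a seasonal baseline that averages past usage per hour of day and picks the busiest hours. The scoring function (1 minus mean absolute percentage error) is likewise only one possible definition of forecast accuracy, since the source does not say how its ~80% figure was measured.

```python
from collections import defaultdict
from statistics import mean


def hourly_profile(history):
    """Average active users per hour of day.

    history: iterable of (hour_of_day, active_users) observations.
    """
    buckets = defaultdict(list)
    for hour, users in history:
        buckets[hour].append(users)
    return {hour: mean(values) for hour, values in buckets.items()}


def predict_peak_hours(history, top_n=2):
    """Return the top_n hours with the highest average usage, in clock order."""
    profile = hourly_profile(history)
    ranked = sorted(profile, key=lambda hour: profile[hour], reverse=True)
    return sorted(ranked[:top_n])


def mape_accuracy(actual, predicted):
    """Score a forecast as 1 - MAPE (assumes no actual value is zero)."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 1.0 - mean(errors)
```

A baseline like this is useful mainly as a reference point: a trained forecasting model is only worth its complexity if it beats the per-hour average on held-out days.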

Monitoring & Automation

  • Real-time alerts for crash rate thresholds via email and Slack
  • Automated daily reports on user behavior and app health
  • Docker Compose orchestrating services with health checks
  • CI/CD pipeline for zero-downtime deployments
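
A crash-rate threshold alert of the kind listed above can be sketched as follows. The 2% threshold, the window dictionary keys, and the message wording are hypothetical. The returned dictionary uses the simple `{"text": ...}` payload that Slack incoming webhooks accept; actually sending it, and the email channel, are left out.

```python
def crash_rate(crashes, sessions):
    """Fraction of sessions that ended in a crash; 0.0 when there is no traffic."""
    return crashes / sessions if sessions else 0.0


def build_alert(window, threshold=0.02):
    """Return a Slack-style payload if the window breaches the threshold, else None.

    window: dict with platform, version, minutes, crashes, sessions
    (hypothetical keys chosen for this sketch).
    """
    rate = crash_rate(window["crashes"], window["sessions"])
    if rate < threshold:
        return None
    text = (
        f"Crash rate {rate:.1%} on {window['platform']} {window['version']} "
        f"exceeds {threshold:.1%} (last {window['minutes']} min)"
    )
    return {"text": text}
```

Including the platform, app version, and time window in the message gives the on-call engineer enough context to start investigating without opening a dashboard first.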

Results & Impact

Teams moved from reactive to proactive: crash detection dropped from hours to under a minute, user-reported incidents fell by ~70%, and MTTR improved ~3× with richer, role-specific alerting context. Sub-second queries over 100M+ events and ~80% forecasting accuracy improved incident response and capacity planning.

Operational Impact

  • Crash detection time reduced from hours to under a minute
  • 70% fewer user-reported incidents (proactively detected)
  • 3× faster resolution with detailed support data
  • Accurate peak-hour prediction for capacity planning

Technical Achievements

  • 100M+ daily events processed via Flask backend
  • Sub-second query performance in Splunk despite data volume
  • 99.9% uptime with Dockerized services, health checks, and zero-downtime deployments
  • ~80% accuracy forecasting daily active users (DAU) with the Splunk ML Toolkit

The project showed that a conventional web stack (Flask and React) layered on an enterprise platform like Splunk can deliver real-time monitoring at this scale. The bank went on to expand the tool to cover its full digital product suite.