# Key Performance Indicators (KPIs) for Calminer

## Overview

This document defines the key performance indicators (KPIs) that Calminer tracks to ensure system scalability, reliability, and optimal user experience as specified in FR-006.

## KPI Categories

### Application Performance Metrics

#### Response Time

- **Metric**: HTTP request duration (95th percentile)
- **Target**: < 500ms for API endpoints, < 2s for UI pages
- **Collection**: Automatic via MetricsMiddleware
- **Alert Threshold**: > 1s (API), > 5s (UI)

#### Error Rate

- **Metric**: HTTP error responses (4xx/5xx) as percentage of total requests
- **Target**: < 1% overall, < 0.1% for 5xx errors
- **Collection**: Automatic via MetricsMiddleware
- **Alert Threshold**: > 5% (4xx), > 0.5% (5xx)

#### Throughput

- **Metric**: Requests per second (RPS)
- **Target**: > 100 RPS sustained
- **Collection**: Automatic via MetricsMiddleware
- **Alert Threshold**: < 10 RPS sustained

### Data Processing Metrics

#### Import/Export Duration

- **Metric**: Time to complete import/export operations
- **Target**: < 30s for small datasets (< 10k rows), < 5min for large datasets
- **Collection**: Via monitoring.metrics.observe_import/observe_export
- **Alert Threshold**: > 10min for any operation

#### Data Volume

- **Metric**: Rows processed per operation
- **Target**: Support up to 100k rows per import/export
- **Collection**: Via import/export service instrumentation
- **Alert Threshold**: Operations failing on > 10k rows

### System Resource Metrics

#### Database Connections

- **Metric**: Active database connections
- **Target**: < 80% of max connections
- **Collection**: Prometheus gauge (DB_CONNECTIONS)
- **Alert Threshold**: > 90% of max connections

#### Memory Usage

- **Metric**: Application memory consumption
- **Target**: < 512MB per worker
- **Collection**: Container metrics (Kubernetes/Docker)
- **Alert Threshold**: > 1GB per worker

#### CPU Usage

- **Metric**: Application CPU utilization
- **Target**: < 70% sustained
- **Collection**: Container metrics (Kubernetes/Docker)
- **Alert Threshold**: > 85% sustained

### User Experience Metrics

#### Concurrent Users

- **Metric**: Active user sessions
- **Target**: Support 100+ concurrent users
- **Collection**: Session tracking via AuthSessionMiddleware
- **Alert Threshold**: > 200 concurrent users (capacity planning)

#### Session Duration

- **Metric**: Average user session length
- **Target**: 10-30 minutes typical
- **Collection**: Session tracking
- **Alert Threshold**: < 1 minute average (usability issue)

### Business Metrics

#### Project/Scenario Operations

- **Metric**: Projects/scenarios created per hour
- **Target**: 50+ operations per hour
- **Collection**: Repository operation logging
- **Alert Threshold**: < 5 operations per hour (adoption issue)

#### Simulation Performance

- **Metric**: Monte Carlo simulation completion time
- **Target**: < 10s for typical scenarios
- **Collection**: Simulation service instrumentation
- **Alert Threshold**: > 60s for any simulation

## Monitoring Implementation

### Data Collection

- **HTTP Metrics**: Automatic collection via MetricsMiddleware
- **Business Metrics**: Service-level instrumentation
- **System Metrics**: Container orchestration (Kubernetes)
- **Storage**: performance_metrics table + Prometheus

### Alerting

- **Response Time**: P95 > 1s for 5 minutes
- **Error Rate**: > 5% for 10 minutes
- **Resource Usage**: > 90% for 15 minutes
- **Data Processing**: Failures > 3 in 1 hour

### Dashboards

- **Real-time**: Current performance via /metrics endpoint
- **Historical**: Aggregated metrics via /performance endpoint
- **Health**: Detailed health checks via /health endpoint

## Scaling Guidelines

### Horizontal Scaling Triggers

- CPU > 70% sustained
- Memory > 80% sustained
- RPS > 80% of target

### Vertical Scaling Triggers

- Memory > 90% sustained
- Database connections > 80%

### Auto-scaling Configuration

- Min replicas: 2
- Max replicas: 10
- Scale up: CPU > 70% for 5 minutes
- Scale down: CPU < 30% for 10 minutes
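
The auto-scaling rules above can be read as a small decision function. The following is a minimal sketch for illustration only, not Calminer's actual scaling logic: the function name `desired_replicas`, the input format (one average CPU percentage per minute, oldest first), and the step size of one replica per decision are all assumptions. In a Kubernetes deployment this decision would normally be delegated to the platform's autoscaler rather than implemented in application code.

```python
from typing import Sequence

# Values taken from the Auto-scaling Configuration list above.
MIN_REPLICAS = 2
MAX_REPLICAS = 10
SCALE_UP_CPU = 70.0       # percent
SCALE_DOWN_CPU = 30.0     # percent
SCALE_UP_WINDOW = 5       # minutes
SCALE_DOWN_WINDOW = 10    # minutes


def desired_replicas(current: int, cpu_per_minute: Sequence[float]) -> int:
    """Return the replica count after applying the scaling rules once.

    Assumption: cpu_per_minute holds one average CPU percentage per minute,
    oldest first. Assumption: each decision changes the count by one replica.
    """
    up_window = cpu_per_minute[-SCALE_UP_WINDOW:]
    down_window = cpu_per_minute[-SCALE_DOWN_WINDOW:]

    target = current
    # Scale up only when a full 5-minute window is entirely above 70%.
    if len(up_window) == SCALE_UP_WINDOW and all(c > SCALE_UP_CPU for c in up_window):
        target = current + 1
    # Scale down only when a full 10-minute window is entirely below 30%.
    elif len(down_window) == SCALE_DOWN_WINDOW and all(
        c < SCALE_DOWN_CPU for c in down_window
    ):
        target = current - 1

    # Never leave the configured replica bounds.
    return max(MIN_REPLICAS, min(MAX_REPLICAS, target))
```

Requiring a full window before acting means a short CPU spike or a freshly started deployment with few samples does not trigger a scaling event, which matches the "for 5 minutes" and "for 10 minutes" wording of the rules.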