# Key Performance Indicators (KPIs) for Calminer
## Overview
This document defines the key performance indicators (KPIs) that Calminer tracks to ensure system scalability, reliability, and optimal user experience as specified in FR-006.
## KPI Categories
### Application Performance Metrics
#### Response Time
- **Metric**: HTTP request duration (95th percentile)
- **Target**: < 500ms for API endpoints, < 2s for UI pages
- **Collection**: Automatic via MetricsMiddleware
- **Alert Threshold**: > 1s (API), > 5s (UI)
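
As a rough illustration of how the Response Time KPI could be evaluated, the sketch below computes a 95th percentile with the nearest-rank method and compares it with the target and alert values listed above. The function names `p95` and `classify` are hypothetical and are not part of MetricsMiddleware; the percentile method actually used in Calminer may differ.

```python
import math

# Thresholds copied from the Response Time KPI (seconds).
TARGET_S = {"api": 0.5, "ui": 2.0}
ALERT_S = {"api": 1.0, "ui": 5.0}


def p95(durations: list[float]) -> float:
    """Return the 95th percentile using the nearest-rank method."""
    if not durations:
        raise ValueError("no samples")
    ordered = sorted(durations)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def classify(durations: list[float], kind: str) -> str:
    """Hypothetical helper: 'ok', 'warn' (target missed) or 'alert'."""
    value = p95(durations)
    if value > ALERT_S[kind]:
        return "alert"
    if value >= TARGET_S[kind]:
        return "warn"
    return "ok"
```

With 100 samples, the nearest-rank P95 is the 95th smallest value, so up to five slow outliers do not move the metric.
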
#### Error Rate
- **Metric**: HTTP error responses (4xx/5xx) as percentage of total requests
- **Target**: < 1% overall, < 0.1% for 5xx errors
- **Collection**: Automatic via MetricsMiddleware
- **Alert Threshold**: > 5% (4xx), > 0.5% (5xx)
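
The error-rate arithmetic can be sketched as follows. This is a minimal example under the assumption that status codes are available as a plain list of integers; `error_rates` and `breached_alerts` are illustrative names, not Calminer functions.

```python
from collections import Counter

# Alert thresholds copied from the Error Rate KPI (percent of all requests).
ALERT_4XX_PCT = 5.0
ALERT_5XX_PCT = 0.5


def error_rates(status_codes: list[int]) -> dict[str, float]:
    """Return 4xx, 5xx and combined error rates as percentages."""
    total = len(status_codes)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0, "overall": 0.0}
    classes = Counter(code // 100 for code in status_codes)
    return {
        "4xx": 100.0 * classes[4] / total,
        "5xx": 100.0 * classes[5] / total,
        "overall": 100.0 * (classes[4] + classes[5]) / total,
    }


def breached_alerts(rates: dict[str, float]) -> list[str]:
    """Hypothetical helper: list the error classes above their alert threshold."""
    alerts = []
    if rates["4xx"] > ALERT_4XX_PCT:
        alerts.append("4xx")
    if rates["5xx"] > ALERT_5XX_PCT:
        alerts.append("5xx")
    return alerts
```
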
#### Throughput
- **Metric**: Requests per second (RPS)
- **Target**: > 100 RPS sustained
- **Collection**: Automatic via MetricsMiddleware
- **Alert Threshold**: < 10 RPS sustained
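
One way to measure sustained RPS is a trailing time window over request timestamps. The sketch below assumes a 60-second window, which this document does not specify, and the class name `ThroughputWindow` is hypothetical.

```python
from collections import deque


class ThroughputWindow:
    """Track request timestamps and report RPS over a trailing window."""

    def __init__(self, window_s: float = 60.0) -> None:
        # Assumption: the window length is illustrative, not a Calminer setting.
        self.window_s = window_s
        self._stamps: deque[float] = deque()

    def record(self, now: float) -> None:
        """Register one request observed at time `now` (seconds)."""
        self._stamps.append(now)

    def rps(self, now: float) -> float:
        """Drop timestamps older than the window and return requests per second."""
        cutoff = now - self.window_s
        while self._stamps and self._stamps[0] <= cutoff:
            self._stamps.popleft()
        return len(self._stamps) / self.window_s
```

Timestamps are passed in explicitly, so the class can be driven by `time.monotonic()` in production and by fixed values in tests.
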
### Data Processing Metrics
#### Import/Export Duration
- **Metric**: Time to complete import/export operations
- **Target**: < 30s for small datasets (< 10k rows), < 5min for large datasets (10k rows or more)
- **Collection**: Via monitoring.metrics.observe_import/observe_export
- **Alert Threshold**: > 10min for any operation
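
The duration targets above can be checked with a small timing wrapper. The signatures of `monitoring.metrics.observe_import` and `observe_export` are not shown in this document, so the sketch uses a stand-in `record` callback; `timed_operation` and `evaluate_duration` are hypothetical names, and treating 10k rows as the small/large boundary is an assumption.

```python
import time
from contextlib import contextmanager

# Values copied from the Import/Export Duration KPI.
SMALL_DATASET_ROWS = 10_000
TARGET_SMALL_S = 30.0
TARGET_LARGE_S = 300.0
ALERT_S = 600.0


def evaluate_duration(rows: int, seconds: float) -> str:
    """Return 'ok', 'slow' (target missed) or 'alert' for one operation."""
    if seconds > ALERT_S:
        return "alert"
    target = TARGET_SMALL_S if rows < SMALL_DATASET_ROWS else TARGET_LARGE_S
    return "ok" if seconds < target else "slow"


@contextmanager
def timed_operation(record):
    """Time the wrapped block and pass the elapsed seconds to `record`.

    `record` is a stand-in for the real metrics hook, whose signature
    is not documented here.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        record(time.perf_counter() - start)
```

Usage might look like `with timed_operation(samples.append): run_import(...)`, where `run_import` is whatever callable performs the operation.
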
#### Data Volume
- **Metric**: Rows processed per operation
- **Target**: Support up to 100k rows per import/export
- **Collection**: Via import/export service instrumentation
- **Alert Threshold**: Any operation that fails on a dataset of > 10k rows
### System Resource Metrics
#### Database Connections
- **Metric**: Active database connections
- **Target**: < 80% of max connections
- **Collection**: Prometheus gauge (DB_CONNECTIONS)
- **Alert Threshold**: > 90% of max connections
#### Memory Usage
- **Metric**: Application memory consumption
- **Target**: < 512MB per worker
- **Collection**: Container metrics (Kubernetes/Docker)
- **Alert Threshold**: > 1GB per worker
#### CPU Usage
- **Metric**: Application CPU utilization
- **Target**: < 70% sustained
- **Collection**: Container metrics (Kubernetes/Docker)
- **Alert Threshold**: > 85% sustained
### User Experience Metrics
#### Concurrent Users
- **Metric**: Active user sessions
- **Target**: Support 100+ concurrent users
- **Collection**: Session tracking via AuthSessionMiddleware
- **Alert Threshold**: > 200 concurrent users (capacity planning)
#### Session Duration
- **Metric**: Average user session length
- **Target**: 10-30 minutes typical
- **Collection**: Session tracking
- **Alert Threshold**: < 1 minute average (usability issue)
### Business Metrics
#### Project/Scenario Operations
- **Metric**: Projects/scenarios created per hour
- **Target**: 50+ operations per hour
- **Collection**: Repository operation logging
- **Alert Threshold**: < 5 operations per hour (adoption issue)
#### Simulation Performance
- **Metric**: Monte Carlo simulation completion time
- **Target**: < 10s for typical scenarios
- **Collection**: Simulation service instrumentation
- **Alert Threshold**: > 60s for any simulation
## Monitoring Implementation
### Data Collection
- **HTTP Metrics**: Automatic collection via MetricsMiddleware
- **Business Metrics**: Service-level instrumentation
- **System Metrics**: Container metrics (Kubernetes/Docker)
- **Storage**: performance_metrics table + Prometheus
### Alerting
- **Response Time**: P95 > 1s for 5 minutes
- **Error Rate**: > 5% for 10 minutes
- **Resource Usage**: > 90% for 15 minutes
- **Data Processing**: Failures > 3 in 1 hour
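
Rules of the form "above a threshold for N minutes" need to track when a breach started. The sketch below shows one way to do this; `SustainedRule` is a hypothetical class, not an existing Calminer or Prometheus component, and it assumes the alert resets as soon as a single sample falls back to or below the threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedRule:
    """Fire only when a value stays above `threshold` for `hold_s` seconds."""

    threshold: float
    hold_s: float
    breach_started: Optional[float] = None

    def update(self, value: float, now: float) -> bool:
        """Feed one sample taken at time `now`; return True if the alert fires."""
        if value <= self.threshold:
            # Assumption: one healthy sample clears the pending breach.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.hold_s


# The Response Time rule listed above: P95 > 1s for 5 minutes.
response_time_rule = SustainedRule(threshold=1.0, hold_s=5 * 60)
```
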
### Dashboards
- **Real-time**: Current performance via /metrics endpoint
- **Historical**: Aggregated metrics via /performance endpoint
- **Health**: Detailed health checks via /health endpoint
## Scaling Guidelines
### Horizontal Scaling Triggers
- CPU > 70% sustained
- Memory > 80% sustained
- RPS > 80% of the throughput target (100 RPS)
### Vertical Scaling Triggers
- Memory > 90% sustained
- Database connections > 80% of max connections
### Auto-scaling Configuration
- Min replicas: 2
- Max replicas: 10
- Scale up: CPU > 70% for 5 minutes
- Scale down: CPU < 30% for 10 minutes
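
The auto-scaling configuration above can be read as a simple decision rule. The sketch below is an illustration only: it assumes the replica count changes by one per decision, which this document does not state, and `desired_replicas` is a hypothetical function rather than the orchestrator's actual algorithm.

```python
# Values copied from the Auto-scaling Configuration list.
MIN_REPLICAS = 2
MAX_REPLICAS = 10
SCALE_UP_CPU_PCT = 70.0
SCALE_UP_HOLD_S = 5 * 60
SCALE_DOWN_CPU_PCT = 30.0
SCALE_DOWN_HOLD_S = 10 * 60


def desired_replicas(current: int, cpu_pct: float, sustained_s: float) -> int:
    """Return the replica count after one scaling decision.

    `sustained_s` is how long CPU has stayed at the observed level.
    Assumption: scale by a single replica per decision.
    """
    target = current
    if cpu_pct > SCALE_UP_CPU_PCT and sustained_s >= SCALE_UP_HOLD_S:
        target = current + 1
    elif cpu_pct < SCALE_DOWN_CPU_PCT and sustained_s >= SCALE_DOWN_HOLD_S:
        target = current - 1
    # Clamp to the configured replica bounds.
    return max(MIN_REPLICAS, min(MAX_REPLICAS, target))
```

The gap between the 70% scale-up and 30% scale-down thresholds, together with the longer scale-down hold time, reduces the chance of replicas oscillating under fluctuating load.
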