144 lines
4.0 KiB
Markdown
144 lines
4.0 KiB
Markdown
# Key Performance Indicators (KPIs) for Calminer
|
|
|
|
## Overview
|
|
|
|
This document defines the key performance indicators (KPIs) that Calminer tracks to ensure system scalability, reliability, and optimal user experience as specified in FR-006.
|
|
|
|
## KPI Categories
|
|
|
|
### Application Performance Metrics
|
|
|
|
#### Response Time
|
|
|
|
- **Metric**: HTTP request duration (95th percentile)
|
|
- **Target**: < 500ms for API endpoints, < 2s for UI pages
|
|
- **Collection**: Automatic via MetricsMiddleware
|
|
- **Alert Threshold**: > 1s (API), > 5s (UI)
|
|
|
|
#### Error Rate
|
|
|
|
- **Metric**: HTTP error responses (4xx/5xx) as percentage of total requests
|
|
- **Target**: < 1% overall, < 0.1% for 5xx errors
|
|
- **Collection**: Automatic via MetricsMiddleware
|
|
- **Alert Threshold**: > 5% (4xx), > 0.5% (5xx)
|
|
|
|
#### Throughput
|
|
|
|
- **Metric**: Requests per second (RPS)
|
|
- **Target**: > 100 RPS sustained
|
|
- **Collection**: Automatic via MetricsMiddleware
|
|
- **Alert Threshold**: < 10 RPS sustained
|
|
|
|
### Data Processing Metrics
|
|
|
|
#### Import/Export Duration
|
|
|
|
- **Metric**: Time to complete import/export operations
|
|
- **Target**: < 30s for small datasets (< 10k rows), < 5min for large datasets
|
|
- **Collection**: Via monitoring.metrics.observe_import/observe_export
|
|
- **Alert Threshold**: > 10min for any operation
|
|
|
|
#### Data Volume
|
|
|
|
- **Metric**: Rows processed per operation
|
|
- **Target**: Support up to 100k rows per import/export
|
|
- **Collection**: Via import/export service instrumentation
|
|
- **Alert Threshold**: Operations failing on > 10k rows
|
|
|
|
### System Resource Metrics
|
|
|
|
#### Database Connections
|
|
|
|
- **Metric**: Active database connections
|
|
- **Target**: < 80% of max connections
|
|
- **Collection**: Prometheus gauge (DB_CONNECTIONS)
|
|
- **Alert Threshold**: > 90% of max connections
|
|
|
|
#### Memory Usage
|
|
|
|
- **Metric**: Application memory consumption
|
|
- **Target**: < 512MB per worker
|
|
- **Collection**: Container metrics (Kubernetes/Docker)
|
|
- **Alert Threshold**: > 1GB per worker
|
|
|
|
#### CPU Usage
|
|
|
|
- **Metric**: Application CPU utilization
|
|
- **Target**: < 70% sustained
|
|
- **Collection**: Container metrics (Kubernetes/Docker)
|
|
- **Alert Threshold**: > 85% sustained
|
|
|
|
### User Experience Metrics
|
|
|
|
#### Concurrent Users
|
|
|
|
- **Metric**: Active user sessions
|
|
- **Target**: Support 100+ concurrent users
|
|
- **Collection**: Session tracking via AuthSessionMiddleware
|
|
- **Alert Threshold**: > 200 concurrent users (capacity planning)
|
|
|
|
#### Session Duration
|
|
|
|
- **Metric**: Average user session length
|
|
- **Target**: 10-30 minutes typical
|
|
- **Collection**: Session tracking
|
|
- **Alert Threshold**: < 1 minute average (usability issue)
|
|
|
|
### Business Metrics
|
|
|
|
#### Project/Scenario Operations
|
|
|
|
- **Metric**: Projects/scenarios created per hour
|
|
- **Target**: 50+ operations per hour
|
|
- **Collection**: Repository operation logging
|
|
- **Alert Threshold**: < 5 operations per hour (adoption issue)
|
|
|
|
#### Simulation Performance
|
|
|
|
- **Metric**: Monte Carlo simulation completion time
|
|
- **Target**: < 10s for typical scenarios
|
|
- **Collection**: Simulation service instrumentation
|
|
- **Alert Threshold**: > 60s for any simulation
|
|
|
|
## Monitoring Implementation
|
|
|
|
### Data Collection
|
|
|
|
- **HTTP Metrics**: Automatic collection via MetricsMiddleware
|
|
- **Business Metrics**: Service-level instrumentation
|
|
- **System Metrics**: Container orchestration (Kubernetes)
|
|
- **Storage**: performance_metrics table + Prometheus
|
|
|
|
### Alerting
|
|
|
|
- **Response Time**: P95 > 1s for 5 minutes
|
|
- **Error Rate**: > 5% for 10 minutes
|
|
- **Resource Usage**: > 90% for 15 minutes
|
|
- **Data Processing**: Failures > 3 in 1 hour
|
|
|
|
### Dashboards
|
|
|
|
- **Real-time**: Current performance via /metrics endpoint
|
|
- **Historical**: Aggregated metrics via /performance endpoint
|
|
- **Health**: Detailed health checks via /health endpoint
|
|
|
|
## Scaling Guidelines
|
|
|
|
### Horizontal Scaling Triggers
|
|
|
|
- CPU > 70% sustained
|
|
- Memory > 80% sustained
|
|
- RPS > 80% of target
|
|
|
|
### Vertical Scaling Triggers
|
|
|
|
- Memory > 90% sustained
|
|
- Database connections > 80%
|
|
|
|
### Auto-scaling Configuration
|
|
|
|
- Min replicas: 2
|
|
- Max replicas: 10
|
|
- Scale up: CPU > 70% for 5 minutes
|
|
- Scale down: CPU < 30% for 10 minutes
|