Key Performance Indicators (KPIs) for Calminer
Overview
This document defines the key performance indicators (KPIs) that Calminer tracks to ensure system scalability, reliability, and optimal user experience as specified in FR-006.
KPI Categories
Application Performance Metrics
Response Time
- Metric: HTTP request duration (95th percentile)
- Target: < 500ms for API endpoints, < 2s for UI pages
- Collection: Automatic via MetricsMiddleware
- Alert Threshold: > 1s (API), > 5s (UI)
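As a rough illustration of how the 95th-percentile check could be evaluated, the sketch below computes a nearest-rank P95 over a batch of request durations and compares it with the target and alert thresholds listed above. The function names and the batch-based approach are assumptions made for illustration; they do not describe how MetricsMiddleware actually aggregates data.

```python
import math

# Thresholds in seconds, copied from the Response Time KPI above.
TARGET = {"api": 0.5, "ui": 2.0}
ALERT = {"api": 1.0, "ui": 5.0}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer in 1..100)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Nearest-rank method: smallest value with at least pct% of data at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify_p95(durations, kind):
    """Return 'ok', 'above_target', or 'alert' for durations of one kind ('api' or 'ui')."""
    p95 = percentile(durations, 95)
    if p95 > ALERT[kind]:
        return "alert"
    if p95 >= TARGET[kind]:
        return "above_target"
    return "ok"
```

Nearest-rank is only one of several percentile definitions; a histogram-based estimate, as Prometheus produces, would give slightly different values.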
Error Rate
- Metric: HTTP error responses (4xx/5xx) as percentage of total requests
- Target: < 1% overall, < 0.1% for 5xx errors
- Collection: Automatic via MetricsMiddleware
- Alert Threshold: > 5% (4xx), > 0.5% (5xx)
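The error-rate arithmetic can be sketched as follows. This is a minimal example that assumes a plain list of HTTP status codes as input; the helper names are hypothetical and the thresholds are the alert values stated above.

```python
def error_rates(status_codes):
    """Return 4xx, 5xx, and overall error rates as percentages of all requests."""
    total = len(status_codes)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0, "overall": 0.0}
    client_errors = sum(1 for s in status_codes if 400 <= s < 500)
    server_errors = sum(1 for s in status_codes if 500 <= s < 600)
    return {
        "4xx": 100 * client_errors / total,
        "5xx": 100 * server_errors / total,
        "overall": 100 * (client_errors + server_errors) / total,
    }


def error_alerts(rates):
    """List the error classes whose rate exceeds the alert thresholds (5% / 0.5%)."""
    alerts = []
    if rates["4xx"] > 5.0:
        alerts.append("4xx")
    if rates["5xx"] > 0.5:
        alerts.append("5xx")
    return alerts
```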
Throughput
- Metric: Requests per second (RPS)
- Target: > 100 RPS sustained
- Collection: Automatic via MetricsMiddleware
- Alert Threshold: < 10 RPS sustained
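One possible way to derive a sustained RPS figure is a sliding time window over request timestamps. The class below is a hypothetical sketch, not part of Calminer; it assumes timestamps are recorded in non-decreasing order and that the window length (60 seconds by default) is an acceptable definition of "sustained".

```python
from collections import deque


class RpsWindow:
    """Average requests per second over a trailing time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.times = deque()

    def record(self, timestamp):
        """Register one request observed at the given time (in seconds)."""
        self.times.append(timestamp)

    def rps(self, now):
        """Drop requests at or before (now - window), then average the rest."""
        cutoff = now - self.window
        while self.times and self.times[0] <= cutoff:
            self.times.popleft()
        return len(self.times) / self.window
```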
Data Processing Metrics
Import/Export Duration
- Metric: Time to complete import/export operations
- Target: < 30s for small datasets (< 10k rows), < 5min for large datasets
- Collection: Via monitoring.metrics.observe_import/observe_export
- Alert Threshold: > 10min for any operation
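The signatures of `observe_import` and `observe_export` are not given here, so the following sketch uses a generic, hypothetical `observe(duration_seconds, status)` callback to show how an operation's duration might be measured and reported whether it succeeds or fails.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_operation(observe, clock=time.perf_counter):
    """Time the wrapped block and report (duration_seconds, status) to `observe`.

    `observe` is a hypothetical callback standing in for the real metrics hook.
    """
    start = clock()
    status = "success"
    try:
        yield
    except Exception:
        status = "failure"
        raise  # Re-raise so callers still see the original error.
    finally:
        observe(clock() - start, status)
```

A caller would wrap the import or export call in `with timed_operation(callback):` and let the callback forward the measurement to the metrics layer.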
Data Volume
- Metric: Rows processed per operation
- Target: Support up to 100k rows per import/export
- Collection: Via import/export service instrumentation
- Alert Threshold: Any import/export failing on a dataset of more than 10k rows
System Resource Metrics
Database Connections
- Metric: Active database connections
- Target: < 80% of max connections
- Collection: Prometheus gauge (DB_CONNECTIONS)
- Alert Threshold: > 90% of max connections
Memory Usage
- Metric: Application memory consumption
- Target: < 512MB per worker
- Collection: Container metrics (Kubernetes/Docker)
- Alert Threshold: > 1GB per worker
CPU Usage
- Metric: Application CPU utilization
- Target: < 70% sustained
- Collection: Container metrics (Kubernetes/Docker)
- Alert Threshold: > 85% sustained
User Experience Metrics
Concurrent Users
- Metric: Active user sessions
- Target: Support 100+ concurrent users
- Collection: Session tracking via AuthSessionMiddleware
- Alert Threshold: > 200 concurrent users (capacity planning)
Session Duration
- Metric: Average user session length
- Target: 10-30 minutes typical
- Collection: Session tracking
- Alert Threshold: < 1 minute average (usability issue)
Business Metrics
Project/Scenario Operations
- Metric: Projects/scenarios created per hour
- Target: 50+ operations per hour
- Collection: Repository operation logging
- Alert Threshold: < 5 operations per hour (adoption issue)
Simulation Performance
- Metric: Monte Carlo simulation completion time
- Target: < 10s for typical scenarios
- Collection: Simulation service instrumentation
- Alert Threshold: > 60s for any simulation
Monitoring Implementation
Data Collection
- HTTP Metrics: Automatic collection via MetricsMiddleware
- Business Metrics: Service-level instrumentation
- System Metrics: Container metrics (Kubernetes/Docker)
- Storage: performance_metrics table + Prometheus
Alerting
- Response Time: P95 > 1s for 5 minutes
- Error Rate: > 5% for 10 minutes
- Resource Usage: > 90% for 15 minutes
- Data Processing: Failures > 3 in 1 hour
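Rules of the form "above X for N minutes" require the condition to hold continuously, not just at a single sample. The sketch below shows one way to evaluate such a rule over timestamped samples; the function name and the sample format are assumptions for illustration. In a Prometheus setup this is typically expressed with a `for:` clause in the alerting rule instead.

```python
def breached_for(samples, threshold, duration_seconds):
    """Return True if the latest run of samples above `threshold` spans
    at least `duration_seconds`.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by timestamp.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # First sample of the current breach run.
        else:
            breach_start = None  # Any sample at or below threshold resets the run.
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_seconds
```

For the response-time rule above, this would be called with `threshold=1.0` and `duration_seconds=300` on a series of P95 samples.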
Dashboards
- Real-time: Current performance via /metrics endpoint
- Historical: Aggregated metrics via /performance endpoint
- Health: Detailed health checks via /health endpoint
Scaling Guidelines
Horizontal Scaling Triggers
- CPU > 70% sustained
- Memory > 80% sustained
- RPS > 80% of the throughput target (100 RPS)
Vertical Scaling Triggers
- Memory > 90% sustained
- Database connections > 80% of max connections
Auto-scaling Configuration
- Min replicas: 2
- Max replicas: 10
- Scale up: CPU > 70% for 5 minutes
- Scale down: CPU < 30% for 10 minutes
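The auto-scaling settings above can be read as a simple decision rule. The sketch below is an illustrative assumption, not the actual autoscaler: it takes one CPU reading per minute, changes the replica count by one step at a time, and clamps the result to the min/max replica bounds. A Kubernetes HorizontalPodAutoscaler computes its target differently, so treat this only as a way to reason about the thresholds.

```python
def desired_replicas(current, cpu_samples, min_replicas=2, max_replicas=10,
                     up_threshold=70.0, up_minutes=5,
                     down_threshold=30.0, down_minutes=10):
    """Return the next replica count given per-minute CPU % readings (oldest first)."""
    recent_up = cpu_samples[-up_minutes:]
    recent_down = cpu_samples[-down_minutes:]
    # Scale up: every reading in the last `up_minutes` is above the threshold.
    if len(recent_up) == up_minutes and all(c > up_threshold for c in recent_up):
        return min(current + 1, max_replicas)
    # Scale down: every reading in the last `down_minutes` is below the threshold.
    if len(recent_down) == down_minutes and all(c < down_threshold for c in recent_down):
        return max(current - 1, min_replicas)
    return current
```

The longer scale-down window mirrors the configuration above and helps avoid removing replicas during brief lulls.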