Key Performance Indicators (KPIs) for Calminer

Overview

This document defines the key performance indicators (KPIs) that Calminer tracks to ensure system scalability, reliability, and a responsive user experience, as specified in FR-006.

KPI Categories

Application Performance Metrics

Response Time

  • Metric: HTTP request duration (95th percentile)
  • Target: < 500ms for API endpoints, < 2s for UI pages
  • Collection: Automatic via MetricsMiddleware
  • Alert Threshold: > 1s (API), > 5s (UI)

Error Rate

  • Metric: HTTP error responses (4xx/5xx) as percentage of total requests
  • Target: < 1% overall, < 0.1% for 5xx errors
  • Collection: Automatic via MetricsMiddleware
  • Alert Threshold: > 5% (4xx), > 0.5% (5xx)

Throughput

  • Metric: Requests per second (RPS)
  • Target: > 100 RPS sustained
  • Collection: Automatic via MetricsMiddleware
  • Alert Threshold: < 10 RPS sustained
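
Calminer's actual MetricsMiddleware is not shown here; as a rough sketch of what it records, a request-timing wrapper might look like the following. This is a pure-stdlib, WSGI-style illustration — in practice the durations and status counts would feed Prometheus histograms and counters, and the wrapper name and shape are assumptions.

```python
import math
import time
from collections import defaultdict


class MetricsRecorder:
    """In-process stand-in for the metrics store behind a middleware like
    MetricsMiddleware (illustrative only)."""

    def __init__(self):
        self.durations = []                    # per-request latency in seconds
        self.status_counts = defaultdict(int)  # keyed by status class: 2, 4, 5

    def observe(self, status: int, duration: float) -> None:
        self.durations.append(duration)
        self.status_counts[status // 100] += 1

    def p95(self) -> float:
        """95th-percentile latency (nearest-rank method)."""
        ordered = sorted(self.durations)
        return ordered[math.ceil(0.95 * len(ordered)) - 1]

    def error_rate(self) -> float:
        """4xx/5xx responses as a fraction of all requests."""
        total = sum(self.status_counts.values())
        errors = self.status_counts[4] + self.status_counts[5]
        return errors / total if total else 0.0


def metrics_middleware(app, recorder):
    """WSGI-style wrapper: times each request and records its status class."""
    def wrapped(environ, start_response):
        seen = {}

        def capturing_start_response(status, headers):
            seen["code"] = int(status.split()[0])  # "200 OK" -> 200
            return start_response(status, headers)

        start = time.perf_counter()
        body = app(environ, capturing_start_response)
        recorder.observe(seen["code"], time.perf_counter() - start)
        return body

    return wrapped
```

The same three KPIs fall out of one recorder: `p95()` covers response time, `error_rate()` covers error rate, and `len(recorder.durations)` over a time window gives throughput.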

Data Processing Metrics

Import/Export Duration

  • Metric: Time to complete import/export operations
  • Target: < 30s for small datasets (< 10k rows), < 5min for large datasets (up to 100k rows)
  • Collection: Via monitoring.metrics.observe_import/observe_export
  • Alert Threshold: > 10min for any operation

Data Volume

  • Metric: Rows processed per operation
  • Target: Support up to 100k rows per import/export
  • Collection: Via import/export service instrumentation
  • Alert Threshold: Operations failing on > 10k rows
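
The `monitoring.metrics.observe_import/observe_export` hooks themselves are not reproduced here; a minimal sketch of what such instrumentation might wrap — timing an operation and recording its row count together — could look like this (names and the tuple-based store are assumptions, not the real API):

```python
import time
from contextlib import contextmanager

# Stand-in for Prometheus histograms: (operation, rows, seconds) tuples.
observations = []


@contextmanager
def observe_operation(operation: str, rows: int):
    """Time an import/export run and record its duration with the row count."""
    start = time.perf_counter()
    try:
        yield
    finally:
        observations.append((operation, rows, time.perf_counter() - start))


def breaches_duration_alert(seconds: float, limit_s: float = 600.0) -> bool:
    """Alert rule above: any import/export slower than 10 minutes."""
    return seconds > limit_s
```

Call sites would wrap the real work, e.g. `with observe_operation("import", rows=9500): run_import()`, so duration and volume are always recorded together even when the operation raises.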

System Resource Metrics

Database Connections

  • Metric: Active database connections
  • Target: < 80% of max connections
  • Collection: Prometheus gauge (DB_CONNECTIONS)
  • Alert Threshold: > 90% of max connections
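
As a sketch of the target/alert split for this gauge, the checks below mirror the 80%/90% thresholds above against a pool maximum. The class is an illustrative stand-in for the DB_CONNECTIONS Prometheus gauge, not its real interface:

```python
class ConnectionGauge:
    """Tracks active DB connections against the pool maximum."""

    def __init__(self, max_connections: int):
        self.max_connections = max_connections
        self.active = 0

    def set(self, value: int) -> None:
        self.active = value

    def utilization(self) -> float:
        return self.active / self.max_connections

    def over_target(self) -> bool:
        """KPI target breached: at or above 80% of max connections."""
        return self.utilization() >= 0.80

    def should_alert(self) -> bool:
        """Alert threshold breached: above 90% of max connections."""
        return self.utilization() > 0.90
```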

Memory Usage

  • Metric: Application memory consumption
  • Target: < 512MB per worker
  • Collection: Container metrics (Kubernetes/Docker)
  • Alert Threshold: > 1GB per worker

CPU Usage

  • Metric: Application CPU utilization
  • Target: < 70% sustained
  • Collection: Container metrics (Kubernetes/Docker)
  • Alert Threshold: > 85% sustained

User Experience Metrics

Concurrent Users

  • Metric: Active user sessions
  • Target: Support 100+ concurrent users
  • Collection: Session tracking via AuthSessionMiddleware
  • Alert Threshold: > 200 concurrent users (capacity planning)

Session Duration

  • Metric: Average user session length
  • Target: 10-30 minutes typical
  • Collection: Session tracking
  • Alert Threshold: < 1 minute average (usability issue)

Business Metrics

Project/Scenario Operations

  • Metric: Projects/scenarios created per hour
  • Target: 50+ operations per hour
  • Collection: Repository operation logging
  • Alert Threshold: < 5 operations per hour (adoption issue)

Simulation Performance

  • Metric: Monte Carlo simulation completion time
  • Target: < 10s for typical scenarios
  • Collection: Simulation service instrumentation
  • Alert Threshold: > 60s for any simulation
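
The simulation service's real instrumentation is not shown here; a minimal sketch is a decorator that reports wall-clock duration alongside each run's result, checked against the targets above. The Monte Carlo workload below is a toy (pi estimation), standing in for Calminer's actual scenario simulations:

```python
import random
import time

SIM_TARGET_S = 10.0  # target from above: typical scenario completes in < 10s
SIM_ALERT_S = 60.0   # alert threshold from above: any run over 60s


def timed(fn):
    """Wrap a simulation entry point so each run reports its duration."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        return result, time.perf_counter() - start
    return wrapper


@timed
def monte_carlo_pi(samples: int) -> float:
    """Toy workload: estimate pi from random points in the unit square."""
    inside = sum(
        1 for _ in range(samples)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples
```

Returning `(result, elapsed)` lets the caller both use the simulation output and push the duration to the metrics store in one place.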

Monitoring Implementation

Data Collection

  • HTTP Metrics: Automatic collection via MetricsMiddleware
  • Business Metrics: Service-level instrumentation
  • System Metrics: Container orchestration (Kubernetes)
  • Storage: performance_metrics table + Prometheus

Alerting

  • Response Time: P95 > 1s for 5 minutes
  • Error Rate: > 5% for 10 minutes
  • Resource Usage: > 90% for 15 minutes
  • Data Processing: Failures > 3 in 1 hour
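
All four rules share the same "above threshold for a sustained window" shape. A minimal sketch of that evaluation — assuming one sample per scrape interval, which is an assumption about how Calminer's alerting samples metrics — is a fixed-size window that fires only when every sample breaches:

```python
from collections import deque


class SustainedAlert:
    """Fires only when a metric stays above its threshold for a full window,
    matching rules like "P95 > 1s for 5 minutes" above."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # most recent `window` samples
        self.window = window

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the alert should fire now."""
        self.samples.append(value)
        return (len(self.samples) == self.window
                and all(v > self.threshold for v in self.samples))
```

A single spike therefore never pages anyone: one in-range sample resets the condition, which is the point of the "for N minutes" qualifier.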

Dashboards

  • Real-time: Current performance via /metrics endpoint
  • Historical: Aggregated metrics via /performance endpoint
  • Health: Detailed health checks via /health endpoint
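
For reference, what `/metrics` typically serves is the Prometheus text exposition format. The renderer below is a simplified sketch (real exporters also emit `# HELP` lines, labels, and non-gauge types):

```python
def render_metrics(metrics: dict) -> str:
    """Render a flat name -> value mapping in Prometheus text format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```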

Scaling Guidelines

Horizontal Scaling Triggers

  • CPU > 70% sustained
  • Memory > 80% sustained
  • RPS > 80% of target

Vertical Scaling Triggers

  • Memory > 90% sustained
  • Database connections > 80%

Auto-scaling Configuration

  • Min replicas: 2
  • Max replicas: 10
  • Scale up: CPU > 70% for 5 minutes
  • Scale down: CPU < 30% for 10 minutes
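
The configuration above maps directly onto the Kubernetes HPA scaling formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A worked sketch:

```python
import math

MIN_REPLICAS, MAX_REPLICAS = 2, 10  # bounds from the configuration above
TARGET_CPU = 0.70                   # scale-up trigger from above


def desired_replicas(current: int, cpu_utilization: float) -> int:
    """Kubernetes HPA formula: ceil(current * observed/target), clamped."""
    desired = math.ceil(current * cpu_utilization / TARGET_CPU)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, desired))
```

For example, 4 replicas at 90% CPU yields ceil(4 × 0.90 / 0.70) = 6 replicas. Note that the real HPA controller additionally applies stabilization windows before scaling down, which is how the "CPU < 30% for 10 minutes" rule above avoids flapping.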