
Monitoring and Metrics (Prometheus)

Architecture Overview

  • Metrics emitted via prometheus_client in-process registry (maricusco.utils.metrics).
  • Metrics endpoint at /metrics (FastAPI router in maricusco/api/metrics.py); returns 503 when disabled, 500 on rendering errors.
  • Prometheus server (docker-compose prometheus service) scrapes app:8000/metrics every 15s by default.
  • Grafana can be pointed at Prometheus for dashboards and alerts.
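
For orientation, the /metrics route in maricusco/api/metrics.py amounts to something like the sketch below. This is illustrative, not the actual implementation: get_registry() is the accessor used in the code examples later in this document, while _metrics_enabled() stands in for the real MARICUSCO_METRICS_ENABLED settings lookup.

import os

from fastapi import APIRouter, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

from maricusco.utils.metrics import get_registry

router = APIRouter()

def _metrics_enabled() -> bool:
    # Illustrative stand-in for the real settings lookup.
    return os.environ.get("MARICUSCO_METRICS_ENABLED", "false").lower() == "true"

@router.get("/metrics")
def metrics() -> Response:
    if not _metrics_enabled():
        return Response(status_code=503)           # metrics disabled
    try:
        payload = generate_latest(get_registry())  # render the in-process registry
    except Exception:
        return Response(status_code=500)           # rendering error
    return Response(content=payload, media_type=CONTENT_TYPE_LATEST)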

Health Check Endpoint

  • Endpoint: FastAPI /health (configurable via MARICUSCO_HEALTHCHECK_PATH, default /health).
  • Status codes: 200 when all required dependencies are healthy; 503 when any required dependency (PostgreSQL, Redis) is unhealthy. Optional dependencies and vendor checks are opt-in.
  • Response schema:
    {
      "status": "healthy" | "unhealthy",
      "timestamp": "2025-12-09T10:48:54.066885Z",
      "dependencies": {
        "database": "healthy" | "unhealthy",
        "redis": "healthy" | "unhealthy",
        "data_vendors": "healthy" | "unhealthy" | "not_checked"
      },
      "errors": { "<dependency>": "<message>" }
    }
    
  • Configuration:
    • Enable/disable: MARICUSCO_HEALTHCHECK_ENABLED (default true).
    • Dependency lists: required dependencies default to database and redis; optional dependencies via MARICUSCO_HEALTHCHECK_OPTIONAL_DEPENDENCIES; vendor checks are toggled with MARICUSCO_HEALTHCHECK_VENDORS_ENABLED.
    • Timeouts: MARICUSCO_HEALTHCHECK_TIMEOUT_DEFAULT plus per-dependency overrides (..._DATABASE, ..._REDIS, ..._VENDORS), typically 1–2s.
    • Cache TTL: MARICUSCO_HEALTHCHECK_CACHE_TTL_SECONDS (default 0 = no cache).
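
A quick way to exercise the endpoint is a small standard-library probe like the sketch below (the URL assumes the default port and path; adjust if MARICUSCO_HEALTHCHECK_PATH or the port differ). It treats any non-200 response as unhealthy and prints the per-dependency status from the JSON schema above.

import json
import urllib.error
import urllib.request

def probe_health(url: str = "http://localhost:8000/health") -> bool:
    # True only when the service answers 200 with status "healthy".
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            body = json.loads(resp.read())
    except urllib.error.HTTPError as exc:   # 503 when a required dependency is unhealthy
        body = json.loads(exc.read() or b"{}")
        print("unhealthy:", body.get("errors", {}))
        return False
    except OSError as exc:                   # connection refused, DNS failure, timeout, ...
        print("unreachable:", exc)
        return False
    print(body.get("dependencies", {}))
    return body.get("status") == "healthy"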

Metric Types and Names

  • Counters:
    • maricusco_api_calls_total — API calls by vendor/agent/operation.
    • maricusco_api_errors_total — API failures.
    • maricusco_api_rate_limit_hits_total — rate-limit events.
    • maricusco_events_total — domain events.
    • maricusco_errors_by_type_total — errors by exception type.
    • maricusco_retry_attempts_total — retry attempts.
    • maricusco_cache_hits_total, maricusco_cache_misses_total — cache outcomes.
  • Gauges:
    • maricusco_active_agents — active agent executions.
    • maricusco_queue_size — queue depth (wire to your queue if used).
    • maricusco_current_value — arbitrary numeric values (balances, positions, etc.).
  • Histograms:
    • maricusco_latency_seconds — external/internal step latency.
    • maricusco_execution_time_seconds — agent/orchestration execution time.
    • maricusco_request_duration_seconds — API/vendor request duration.

Label set (shared): vendor, agent_type, operation. Some metrics also add error_type.
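
As a small illustration of that label set, the snippet below bumps a counter and records one histogram observation using the helpers that also appear in the Code Examples section; the vendor/operation values are illustrative.

from maricusco.utils.metrics import events_total, execution_time_seconds, label_values

# Shared label set: vendor, agent_type, operation.
labels = label_values(vendor="alpha_vantage", agent_type="data", operation="fetch_prices")

events_total.labels(*labels).inc()                    # counter: only ever increases
execution_time_seconds.labels(*labels).observe(0.42)  # histogram: one observation, in seconds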

Queue and Cache Instrumentation

  • Cache client: maricusco.data.cache.CacheClient uses Redis when available, falls back to in-memory. get_cache_client() returns a singleton.
  • Cache hits/misses: record_cache_hit(cache_name) / record_cache_miss(cache_name) used inside the cache client; labels default to vendor=redis|local, agent_type=cache, operation=<cache_name>.
  • Cache latency: cache client records latency via latency_seconds histogram for get/set operations.
  • Queue depth: maricusco.orchestration.queue_metrics.report_queue_depth(queue_name, depth) emits maricusco_queue_size; defaults vendor=internal, agent_type=queue.
  • Queue latency: record_queue_latency(queue_name, duration_seconds) records enqueue/dequeue latency via latency_seconds.
  • Current state: no message queue is configured; helpers are ready for future queue integration and can be called from any queue implementation.
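
If a queue is added later, the helpers above can be wired in roughly as in the sketch below; the signatures are the ones listed here, everything else (the InstrumentedQueue wrapper around a plain deque) is illustrative.

import time
from collections import deque

from maricusco.orchestration.queue_metrics import record_queue_latency, report_queue_depth

class InstrumentedQueue:
    """Toy in-memory queue that reports depth and enqueue/dequeue latency."""

    def __init__(self, name: str):
        self.name = name
        self._items: deque = deque()

    def put(self, item) -> None:
        start = time.perf_counter()
        self._items.append(item)
        record_queue_latency(self.name, time.perf_counter() - start)
        report_queue_depth(self.name, len(self._items))  # emits maricusco_queue_size

    def get(self):
        start = time.perf_counter()
        item = self._items.popleft()
        record_queue_latency(self.name, time.perf_counter() - start)
        report_queue_depth(self.name, len(self._items))
        return item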

Error Rate Over Time

  • Use PromQL rate over counters rather than a separate gauge:
    sum by (vendor) (rate(maricusco_api_errors_total[5m]))
  • Error ratio vs calls (per vendor):
    sum by (vendor) (rate(maricusco_api_errors_total[5m])) / sum by (vendor) (rate(maricusco_api_calls_total[5m]))

Configuration

  • Enable/disable: MARICUSCO_METRICS_ENABLED (default false).
  • Endpoint path: MARICUSCO_METRICS_PATH (default /metrics).
  • Endpoint port: MARICUSCO_METRICS_PORT (default 8000).
  • Sampling: MARICUSCO_METRICS_SAMPLING_RATE (default 1.0).
  • Constant labels: MARICUSCO_METRICS_LABELS (comma-separated key=value pairs).
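
For local runs these can be collected in .env; the values below simply mirror the defaults listed above, except that metrics are switched on, and the constant labels are illustrative.

MARICUSCO_METRICS_ENABLED=true
MARICUSCO_METRICS_PATH=/metrics
MARICUSCO_METRICS_PORT=8000
MARICUSCO_METRICS_SAMPLING_RATE=1.0
MARICUSCO_METRICS_LABELS=env=local,service=maricusco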

Prometheus Server Setup

  • Compose service prometheus (prom/prometheus v3.8.0, pinned in docker-compose.yml).
  • Config: docker/prometheus/prometheus.yml
  • scrape_interval: 15s, metrics_path: /metrics, target app:8000.
  • Retention: --storage.tsdb.retention.time=15d (compose command).
  • Rule files placeholder: /etc/prometheus/rules/*.yml (add via bind mount).
  • Run: docker-compose up -d prometheus
  • UI: http://localhost:9090
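
Put together, docker/prometheus/prometheus.yml amounts to roughly the following (a sketch reconstructed from the settings above, not a verbatim copy of the file; the job name is illustrative):

global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: maricusco
    metrics_path: /metrics
    static_configs:
      - targets: ["app:8000"]

# Retention is set on the compose command line: --storage.tsdb.retention.time=15d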

Grafana Setup and Configuration

Docker Compose Service

  • Service: grafana in docker-compose.yml (image grafana/grafana:12.3.0, pinned version).
  • Port: 3000 (configurable via GRAFANA_PORT environment variable, default 3000).
  • Access URL: http://localhost:3000
  • Network: Connected to maricusco-network for internal communication with Prometheus.
  • Dependencies: Depends on Prometheus service (depends_on: prometheus).
  • Persistent storage: grafana_data volume for dashboard and user data persistence.

Environment Variables

  • GF_SECURITY_ADMIN_USER: Grafana admin username (defaults to admin if not set).
  • GF_SECURITY_ADMIN_PASSWORD: Grafana admin password (defaults to admin if not set).
  • Important: for non-local/production environments, set GF_SECURITY_ADMIN_PASSWORD to a secure password via the .env file.
  • For local development, the default admin / admin credentials are convenient, but they are not secure for production.
  • Note: Grafana uses the GF_ prefix for all of its configuration environment variables.

Data Source Provisioning

  • Prometheus data source is automatically provisioned via docker/grafana/provisioning/datasources/prometheus.yml.
  • Data source name: "Prometheus"
  • URL: http://prometheus:9090 (internal Docker network)
  • Access mode: Server (proxy)
  • Is default: Yes
  • No manual configuration required; data source is available immediately after Grafana starts.
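
Reconstructed from the fields above, the provisioning file looks roughly like this (a sketch, not a verbatim copy):

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy            # "Server" access mode
    url: http://prometheus:9090
    isDefault: true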

Dashboard Provisioning

  • Dashboards are automatically provisioned from docker/grafana/provisioning/dashboards/.
  • Dashboard provider configuration: docker/grafana/provisioning/dashboards/dashboard.yml
  • Folder: "Maricusco" (dashboards organized in this folder in Grafana UI)
  • Auto-discovery: Grafana scans for .json files every 10 seconds
  • All dashboard JSON files in the provisioning directory are automatically loaded

Pre-Provisioned Dashboards

1. Maricusco Overview (maricusco-overview.json)

Comprehensive overview of system metrics:

  • P95 Request Duration: time series showing 95th percentile request duration.
    PromQL: histogram_quantile(0.95, rate(maricusco_request_duration_seconds_bucket[5m]))
  • API Error Rate: time series showing error rate by vendor/operation.
    PromQL: rate(maricusco_api_errors_total[5m])
  • Rate Limit Hits: time series showing rate limit hits by vendor/operation.
    PromQL: rate(maricusco_api_rate_limit_hits_total[5m])
  • Active Agents: gauge panel showing the current active agent count.
    PromQL: maricusco_active_agents
  • Cache Hit/Miss Ratio: time series showing the overall cache hit/miss ratio (0–1 scale).
    PromQL: sum(rate(maricusco_cache_hits_total[5m])) / (sum(rate(maricusco_cache_hits_total[5m])) + sum(rate(maricusco_cache_misses_total[5m])))

2. Maricusco Vendor/Operation Latency (maricusco-vendor-latency.json)

Detailed latency breakdowns:

  • Request Duration by Vendor (P95): time series grouped by vendor.
    PromQL: histogram_quantile(0.95, sum by (vendor, le) (rate(maricusco_request_duration_seconds_bucket[5m])))
  • Request Duration by Operation (P95): time series grouped by operation.
    PromQL: histogram_quantile(0.95, sum by (operation, le) (rate(maricusco_request_duration_seconds_bucket[5m])))
  • Latency by Vendor/Operation (P95): table view showing P95 latency for each vendor/operation combination.
    PromQL: histogram_quantile(0.95, sum by (vendor, operation, le) (rate(maricusco_request_duration_seconds_bucket[5m])))

3. Maricusco Cache Performance (maricusco-cache.json)

Cache-specific metrics:

  • Cache Hit Rate Over Time: time series showing cache hits by cache name (operation label).
    PromQL: rate(maricusco_cache_hits_total[5m])
  • Cache Miss Rate Over Time: time series showing cache misses by cache name.
    PromQL: rate(maricusco_cache_misses_total[5m])
  • Cache Hit/Miss Ratio: time series showing the overall ratio (0–1 scale).
    PromQL: sum(rate(maricusco_cache_hits_total[5m])) / (sum(rate(maricusco_cache_hits_total[5m])) + sum(rate(maricusco_cache_misses_total[5m])))
  • Cache Operations by Cache Name: table showing total operations per second by cache name (operation label).
    PromQL: sum(rate(maricusco_cache_hits_total[5m]) + rate(maricusco_cache_misses_total[5m])) by (operation)

Accessing Grafana

  1. Start services: docker-compose up -d grafana (or docker-compose up -d for all services)
  2. Open browser: navigate to http://localhost:3000
  3. Login:
    • Local development: default credentials are admin / admin (when GF_SECURITY_ADMIN_PASSWORD is not set in .env)
    • Production/non-local: use the credentials from the .env file (GF_SECURITY_ADMIN_USER / GF_SECURITY_ADMIN_PASSWORD)
  4. Navigate to dashboards: click "Dashboards" → "Maricusco" folder to see the pre-provisioned dashboards
  5. Verify data: ensure MARICUSCO_METRICS_ENABLED=true for metrics to appear in dashboards

PromQL Queries Reference (Used in Dashboards)

  • P95 request duration: histogram_quantile(0.95, rate(maricusco_request_duration_seconds_bucket[5m]))
  • P95 by vendor: histogram_quantile(0.95, sum by (vendor, le) (rate(maricusco_request_duration_seconds_bucket[5m])))
  • P95 by operation: histogram_quantile(0.95, sum by (operation, le) (rate(maricusco_request_duration_seconds_bucket[5m])))
  • P95 by vendor/operation: histogram_quantile(0.95, sum by (vendor, operation, le) (rate(maricusco_request_duration_seconds_bucket[5m])))
  • API error rate: rate(maricusco_api_errors_total[5m])
  • Rate limit hits: rate(maricusco_api_rate_limit_hits_total[5m])
  • Active agents: maricusco_active_agents
  • Cache hit rate: rate(maricusco_cache_hits_total[5m])
  • Cache miss rate: rate(maricusco_cache_misses_total[5m])
  • Cache hit/miss ratio: sum(rate(maricusco_cache_hits_total[5m])) / (sum(rate(maricusco_cache_hits_total[5m])) + sum(rate(maricusco_cache_misses_total[5m])))
  • Cache operations by name: sum(rate(maricusco_cache_hits_total[5m]) + rate(maricusco_cache_misses_total[5m])) by (operation)

Alerting Rules (optional)

  • Example rules file: docker/prometheus/rules/alerts.yml (mounted read-only).
  • Sample alerts:
  • HighErrorRatio: error/call ratio > 5% over 5m per vendor.
  • HighRequestLatencyP95: p95 request duration > 2s over 5m.
  • MetricsMissing: absence of maricusco_api_calls_total series for 10m.
  • Enable by ensuring the compose volume mount for /etc/prometheus/rules is present; restart or reload Prometheus (e.g., docker-compose restart prometheus) to pick up rule changes unless config auto-reload is enabled.
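
A sketch of what docker/prometheus/rules/alerts.yml could contain, using the expressions from this document; thresholds and durations follow the bullets above, while labels and annotations are illustrative:

groups:
  - name: maricusco
    rules:
      - alert: HighErrorRatio
        expr: |
          sum by (vendor) (rate(maricusco_api_errors_total[5m]))
            / sum by (vendor) (rate(maricusco_api_calls_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error ratio above 5% for vendor {{ $labels.vendor }}"

      - alert: HighRequestLatencyP95
        expr: histogram_quantile(0.95, sum by (le) (rate(maricusco_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 request duration above 2s"

      - alert: MetricsMissing
        expr: absent(maricusco_api_calls_total)
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No maricusco_api_calls_total series for 10 minutes"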

Monitoring Integration

  • The /health endpoint returns lightweight JSON and can be probed separately (e.g., by an uptime checker or the blackbox exporter); the primary Prometheus scrape target remains /metrics.
  • Alert either on probes of the health endpoint (HTTP status via the blackbox exporter, or dependency fields via a JSON exporter) or on metrics that signal downstream failures.
  • Container health status propagates to orchestrators (Docker healthcheck) and can be visualized via docker-compose ps or platform dashboards.

Code Examples

Instrumenting a function

from maricusco.utils.metrics import execution_time_seconds, label_values, record_latency, events_total

def fetch_prices(symbol: str):
    labels = label_values(vendor="alpha_vantage", agent_type="data", operation="fetch_prices")
    with record_latency(execution_time_seconds, labels):
        data = _call_api(symbol)
    events_total.labels(*labels).inc()
    return data

Custom metric definition

from prometheus_client import Gauge
from maricusco.utils.metrics import get_registry

custom_gauge = Gauge(
    "maricusco_custom_health",
    "Service health score",
    ["component"],
    registry=get_registry(),
)

def update_health(component: str, value: float):
    custom_gauge.labels(component).set(value)
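
For example (illustrative values):

update_health("database", 1.0)  # 1.0 = healthy
update_health("redis", 0.0)     # 0.0 = unhealthy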

Using decorators for automatic timing/counts

from maricusco.utils.metrics import count_calls, track_latency, request_duration_seconds, label_values

labels = label_values("alpha_vantage", "data", "get_news")

@count_calls(counter=request_duration_seconds, labels=labels)  # increments histogram count
@track_latency(histogram=request_duration_seconds, labels=labels)
def get_news():
    ...

PromQL Examples

  • 95th percentile request latency (5m):
    histogram_quantile(0.95, rate(maricusco_request_duration_seconds_bucket[5m]))
  • Error rate per vendor (5m):
    sum by (vendor) (rate(maricusco_api_errors_total[5m]))
  • Error ratio (per vendor, 5m):
    sum by (vendor) (rate(maricusco_api_errors_total[5m])) / sum by (vendor) (rate(maricusco_api_calls_total[5m]))
  • Rate limit hits (5m):
    rate(maricusco_api_rate_limit_hits_total[5m])
  • Active agents:
    maricusco_active_agents