Monitoring and Metrics (Prometheus)

Structured Logging (JSON + Context)

  • Framework: structlog with stdlib handlers; JSON in production, colored console in development.
  • Context propagation via contextvars: correlation_id, session_id, ticker, user, agent_name, tool/vendor metadata.
  • Rotation: size-based by default (MARICUSCO_LOG_MAX_BYTES=10485760, i.e. 10 MB; MARICUSCO_LOG_BACKUP_COUNT=5). File logging is off by default; enable it with MARICUSCO_LOG_FILE_ENABLED=true.
  • Docker parity: docker-compose.yml uses json-file with max-size: 10m, max-file: 5 to match app rotation.
  • Retention: optional cleanup of rotated files older than MARICUSCO_LOG_RETENTION_DAYS (days).
  • Formats:
    • JSON (prod): machine-readable for aggregation.
    • Human-readable (dev): MARICUSCO_LOG_FORMAT=human-readable.
  • Example (JSON):
    {
      "event": "agent_execution_start",
      "logger": "maricusco.orchestration.trading_graph",
      "level": "info",
      "timestamp": "2025-12-10T12:30:00.000Z",
      "correlation_id": "corr_abc123",
      "session_id": "sess_789",
      "ticker": "AAPL",
      "user": "user-abc"
    }
    
  • Config (env):
    • MARICUSCO_LOG_LEVEL (default: INFO)
    • MARICUSCO_LOG_FORMAT (json | human-readable, default: json)
    • MARICUSCO_LOG_CONSOLE (default: true)
    • MARICUSCO_LOG_FILE_ENABLED (default: false)
    • MARICUSCO_LOG_FILE (path; default: maricusco/logs/maricusco.log)
    • MARICUSCO_LOG_MAX_BYTES (default: 10485760 = 10MB)
    • MARICUSCO_LOG_BACKUP_COUNT (default: 5)
    • MARICUSCO_LOG_RETENTION_DAYS (default: 30; remove rotated files older than N days)
    • MARICUSCO_LOG_ROTATION (size | time, default: size)
  • Usage examples:
    from maricusco.utils.logging import get_logger, log_context, bind_context
    
    log = get_logger(__name__)
    
    # Ad-hoc scope
    with log_context(correlation_id="abc", session_id="sess-1", ticker="AAPL"):
        log.info("analysis_start", agent="technical")
    
    # Persistent binding
    bind_context(agent_name="trader")
    log.info("decision_ready", ticker="AAPL", action="HOLD")
    
  • Querying JSON logs (example):
    jq 'select(.correlation_id=="corr_abc123") | {event, level, ticker, session_id}' maricusco/logs/maricusco.log
    
  • Best practices:
    • Do not log secrets or payloads containing API keys.
    • Prefer structured fields over free-form strings.
    • Include correlation/session/ticker for traceability.
    • Keep file logging disabled in ephemeral/containerized environments unless needed.
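
To make the context propagation described above concrete, here is a minimal stdlib-only sketch of the idea. It is not the maricusco.utils.logging implementation (which uses structlog): the names _log_context, JsonContextFormatter, and render are hypothetical, and log_context here is only a simplified stand-in for the helper of the same name.

```python
import contextvars
import json
import logging
from contextlib import contextmanager

# Hypothetical stand-in for the context that the real logging module manages.
_log_context = contextvars.ContextVar("log_context", default={})


@contextmanager
def log_context(**fields):
    """Merge fields into the current context and restore the old one on exit."""
    token = _log_context.set({**_log_context.get(), **fields})
    try:
        yield
    finally:
        _log_context.reset(token)


class JsonContextFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the bound context."""

    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "logger": record.name,
            "level": record.levelname.lower(),
            **_log_context.get(),
        }
        return json.dumps(payload)


def render(event, logger_name="demo"):
    """Format one info-level event the way a handler would emit it."""
    record = logging.LogRecord(
        logger_name, logging.INFO, "<sketch>", 0, event, None, None
    )
    return JsonContextFormatter().format(record)
```

Because the context lives in a ContextVar, fields bound inside a with block appear on every event logged there and disappear once the block exits, which is what keeps correlation_id and ticker from leaking between requests.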

Architecture Overview

  • Metrics emitted via prometheus_client in-process registry (maricusco.utils.metrics).
  • Metrics endpoint at /metrics (FastAPI router in maricusco/api/metrics.py); returns 503 when disabled, 500 on rendering errors.
  • Prometheus server (docker-compose prometheus service) scrapes app:8000/metrics every 15s by default.
  • Grafana can be pointed at Prometheus for dashboards and alerts.
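
The status-code behaviour of the metrics endpoint can be sketched independently of FastAPI. This is an illustration under assumptions, not the code in maricusco/api/metrics.py: metrics_response and its render argument are hypothetical, and in the real router render would presumably wrap prometheus_client's exposition output.

```python
from typing import Callable, Tuple


def metrics_response(enabled: bool, render: Callable[[], bytes]) -> Tuple[int, bytes]:
    """Return (status_code, body) for a metrics scrape.

    503 when metrics are disabled, 500 when rendering fails, 200 otherwise.
    """
    if not enabled:
        return 503, b"metrics disabled"
    try:
        return 200, render()
    except Exception:
        # Rendering errors are reported to the scraper instead of crashing.
        return 500, b"metrics rendering failed"
```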

Health Check Endpoint

  • Endpoint: FastAPI /health (configurable via MARICUSCO_HEALTHCHECK_PATH, default: /health).
  • Status codes: 200 when all required dependencies are healthy; 503 when any required dependency (PostgreSQL, Redis) is unhealthy. Optional dependencies and vendor checks are opt-in.
  • Response schema:
    {
      "status": "healthy" | "unhealthy",
      "timestamp": "2025-12-09T10:48:54.066885Z",
      "dependencies": {
        "database": "healthy" | "unhealthy",
        "redis": "healthy" | "unhealthy",
        "data_vendors": "healthy" | "unhealthy" | "not_checked"
      },
      "errors": { "<dependency>": "<message>" }
    }
    
  • Configuration:
    • Enable/disable: MARICUSCO_HEALTHCHECK_ENABLED (default: true).
    • Endpoint path: MARICUSCO_HEALTHCHECK_PATH (default: /health).
    • Dependency lists:
      • Required dependencies: defaults to database, redis; override with MARICUSCO_HEALTHCHECK_REQUIRED_DEPENDENCIES (comma-separated; set to empty string to disable all required deps in local/dev).
      • Optional dependencies: MARICUSCO_HEALTHCHECK_OPTIONAL_DEPENDENCIES (comma-separated).
      • Vendor checks: toggle with MARICUSCO_HEALTHCHECK_VENDORS_ENABLED (default: false).
    • Timeouts (seconds):
      • MARICUSCO_HEALTHCHECK_TIMEOUT_DEFAULT (default: 1.0)
      • MARICUSCO_HEALTHCHECK_TIMEOUT_DATABASE (default: 1.0)
      • MARICUSCO_HEALTHCHECK_TIMEOUT_REDIS (default: 1.0)
      • MARICUSCO_HEALTHCHECK_TIMEOUT_VENDORS (default: 1.0)
    • Cache TTL: MARICUSCO_HEALTHCHECK_CACHE_TTL_SECONDS (default: 0.0 = no cache).
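
The rule "503 when any required dependency is unhealthy" combined with per-dependency timeouts could look roughly like the sketch below. It is a simplified illustration, not the project's health check code: run_check and health_report are hypothetical names, and each check is assumed to be an async callable that raises on failure.

```python
import asyncio
from datetime import datetime, timezone


async def run_check(check, timeout):
    """Run one dependency probe and return (status, error_message_or_None)."""
    try:
        await asyncio.wait_for(check(), timeout)
        return "healthy", None
    except asyncio.TimeoutError:
        return "unhealthy", f"timed out after {timeout}s"
    except Exception as exc:
        return "unhealthy", str(exc)


async def health_report(checks, required, timeouts=None, default_timeout=1.0):
    """Probe all dependencies concurrently and build (status_code, body)."""
    timeouts = timeouts or {}
    names = list(checks)
    results = await asyncio.gather(
        *(run_check(checks[n], timeouts.get(n, default_timeout)) for n in names)
    )
    dependencies, errors = {}, {}
    for name, (status, error) in zip(names, results):
        dependencies[name] = status
        if error is not None:
            errors[name] = error
    # Only required dependencies decide the overall status; a required
    # dependency without a registered check counts as unhealthy here.
    healthy = all(dependencies.get(n) == "healthy" for n in required)
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "dependencies": dependencies,
        "errors": errors,
    }
    return (200 if healthy else 503), body
```

An optional dependency that fails or times out still shows up as unhealthy in the body, but it does not change the status code.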

Metric Types and Names

  • Counters:
    • maricusco_api_calls_total: API calls by vendor/agent/operation.
    • maricusco_api_errors_total: API failures.
    • maricusco_api_rate_limit_hits_total: rate-limit events.
    • maricusco_events_total: domain events.
    • maricusco_errors_by_type_total: errors by exception type.
    • maricusco_retry_attempts_total: retry attempts.
    • maricusco_cache_hits_total, maricusco_cache_misses_total: cache outcomes.
  • Gauges:
    • maricusco_active_agents: active agent executions.
    • maricusco_queue_size: queue depth (wire to your queue if used).
    • maricusco_current_value: arbitrary numeric values (balances, positions, etc.).
  • Histograms:
    • maricusco_latency_seconds: external/internal step latency.
    • maricusco_execution_time_seconds: agent/orchestration execution time.
    • maricusco_request_duration_seconds: API/vendor request duration.

Label set (shared): vendor, agent_type, operation. Some metrics also add error_type.

Queue and Cache Instrumentation

  • Cache client: maricusco.data.cache.CacheClient uses Redis when available, falls back to in-memory. get_cache_client() returns a singleton.
  • Cache hits/misses: record_cache_hit(cache_name) / record_cache_miss(cache_name) used inside the cache client; labels default to vendor=redis|local, agent_type=cache, operation=<cache_name>.
  • Cache latency: cache client records latency via latency_seconds histogram for get/set operations.
  • Queue depth: maricusco.orchestration.queue_metrics.report_queue_depth(queue_name, depth) emits maricusco_queue_size; defaults vendor=internal, agent_type=queue.
  • Queue latency: record_queue_latency(queue_name, duration_seconds) records enqueue/dequeue latency via latency_seconds.
  • Current state: no message queue is configured; helpers are ready for future queue integration and can be called from any queue implementation.
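
As an illustration of the hit/miss labelling described above, the following sketch counts cache outcomes in plain dictionaries instead of Prometheus counters. InstrumentedCache is a hypothetical stand-in, not maricusco.data.cache.CacheClient; only the label convention (vendor=redis|local, agent_type=cache, operation=<cache_name>) is taken from the text.

```python
from collections import Counter


class InstrumentedCache:
    """In-memory stand-in for a cache client that counts hits and misses."""

    def __init__(self, cache_name, backend="local"):
        self.cache_name = cache_name
        self.backend = backend  # "redis" or "local" in the documented scheme
        self._store = {}
        self.hits = Counter()
        self.misses = Counter()

    def _labels(self):
        # (vendor, agent_type, operation), mirroring the shared label set.
        return (self.backend, "cache", self.cache_name)

    def get(self, key):
        if key in self._store:
            self.hits[self._labels()] += 1
            return self._store[key]
        self.misses[self._labels()] += 1
        return None

    def set(self, key, value):
        self._store[key] = value

    def hit_ratio(self):
        """Hits divided by total lookups, or None before any lookup."""
        hits = self.hits[self._labels()]
        total = hits + self.misses[self._labels()]
        return hits / total if total else None
```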

Error Rate Over Time

  • Use PromQL rate over counters rather than a separate gauge:
    sum by (vendor) (rate(maricusco_api_errors_total[5m]))
  • Error ratio vs calls (per vendor):
    sum by (vendor) (rate(maricusco_api_errors_total[5m])) / sum by (vendor) (rate(maricusco_api_calls_total[5m]))
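
The arithmetic behind these two queries can be worked through by hand. The sketch below is a simplification of what rate() does (Prometheus also extrapolates to the window edges), and per_second_rate and error_ratio are hypothetical helper names used only for illustration.

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second increase of a counter between two samples."""
    increase = last - first
    if increase < 0:
        # A drop means the counter was reset (e.g. a restart); count from zero.
        increase = last
    return increase / window_seconds


def error_ratio(error_rate, call_rate):
    """Errors per call; None when there were no calls (PromQL would give NaN)."""
    return error_rate / call_rate if call_rate > 0 else None
```

For example, if a vendor's error counter goes from 120 to 150 and its call counter from 3000 to 3600 over a 5-minute (300 s) window, the error rate is 0.1 errors/s, the call rate is 2 calls/s, and the error ratio is 0.05 (5%).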

Configuration

  • Enable/disable: MARICUSCO_METRICS_ENABLED (default: false).
  • Endpoint path: MARICUSCO_METRICS_PATH (default: /metrics).
  • Endpoint port: MARICUSCO_METRICS_PORT (default: 8000).
  • Sampling rate: MARICUSCO_METRICS_SAMPLING_RATE (default: 1.0).
  • Constant labels: MARICUSCO_METRICS_LABELS (comma-separated key=value pairs).
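
A rough sketch of how these settings might be read is shown below. It is not the project's configuration loader: parse_labels, load_metrics_config, and should_sample are hypothetical, the accepted spelling of boolean values ("true") is an assumption, and the interpretation of the sampling rate as a per-observation probability is likewise assumed.

```python
import os
import random


def parse_labels(raw):
    """Parse 'key=value,key=value' into a dict, ignoring empty segments."""
    labels = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {pair!r}")
        labels[key.strip()] = value.strip()
    return labels


def load_metrics_config(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        # Assumption: only the literal string "true" (any case) enables metrics.
        "enabled": env.get("MARICUSCO_METRICS_ENABLED", "false").strip().lower() == "true",
        "path": env.get("MARICUSCO_METRICS_PATH", "/metrics"),
        "port": int(env.get("MARICUSCO_METRICS_PORT", "8000")),
        "sampling_rate": float(env.get("MARICUSCO_METRICS_SAMPLING_RATE", "1.0")),
        "labels": parse_labels(env.get("MARICUSCO_METRICS_LABELS", "")),
    }


def should_sample(sampling_rate, rng=random.random):
    """Record an observation with probability sampling_rate (1.0 = always)."""
    return rng() < sampling_rate
```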

Prometheus Server Setup

  • Compose service prometheus in docker-compose.yml.
  • Config: docker/prometheus/prometheus.yml
  • scrape_interval: 15s, metrics_path: /metrics, target app:8000.
  • Retention: --storage.tsdb.retention.time=15d (compose command).
  • Rule files placeholder: /etc/prometheus/rules/*.yml (add via bind mount).
  • Run: docker-compose up -d prometheus
  • UI: http://localhost:9090

Grafana Setup and Configuration

Docker Compose Service

  • Service: grafana in docker-compose.yml.
  • Port: 3000 (configurable via GRAFANA_PORT environment variable, default 3000).
  • Access URL: http://localhost:3000
  • Network: Connected to maricusco-network for internal communication with Prometheus.
  • Dependencies: Depends on Prometheus service (depends_on: prometheus).
  • Persistent storage: grafana_data volume for dashboard and user data persistence.

Environment Variables

  • GF_SECURITY_ADMIN_USER: Grafana admin username (defaults to admin if not set).
  • GF_SECURITY_ADMIN_PASSWORD: Grafana admin password. If unset, it defaults to admin, which is convenient for local development but not secure.
  • Important: for non-local/production environments, set GF_SECURITY_ADMIN_PASSWORD to a secure password via the .env file.
  • Note: Grafana uses GF_ prefix for all configuration environment variables.

Data Source Provisioning

  • Prometheus data source is automatically provisioned via docker/grafana/provisioning/datasources/prometheus.yml.
  • Data source name: "Prometheus"
  • URL: http://prometheus:9090 (internal Docker network)
  • Access mode: Server (proxy)
  • Is default: Yes
  • No manual configuration required; data source is available immediately after Grafana starts.

Dashboard Provisioning

  • Dashboards are automatically provisioned from docker/grafana/provisioning/dashboards/.
  • Dashboard provider configuration: docker/grafana/provisioning/dashboards/dashboard.yml
  • Folder: "Maricusco" (dashboards organized in this folder in Grafana UI)
  • Auto-discovery: Grafana scans for .json files every 10 seconds
  • All dashboard JSON files in the provisioning directory are automatically loaded

Pre-Provisioned Dashboards

1. Maricusco Overview (maricusco-overview.json)

Comprehensive overview of system metrics:

  • P95 Request Duration: time series showing 95th percentile request duration.
    • PromQL: histogram_quantile(0.95, rate(maricusco_request_duration_seconds_bucket[5m]))
  • API Error Rate: time series showing error rate by vendor/operation.
    • PromQL: rate(maricusco_api_errors_total[5m])
  • Rate Limit Hits: time series showing rate limit hits by vendor/operation.
    • PromQL: rate(maricusco_api_rate_limit_hits_total[5m])
  • Active Agents: gauge panel showing the current active agents count.
    • PromQL: maricusco_active_agents
  • Cache Hit/Miss Ratio: time series showing the overall cache hit ratio (0-1 scale).
    • PromQL: sum(rate(maricusco_cache_hits_total[5m])) / (sum(rate(maricusco_cache_hits_total[5m])) + sum(rate(maricusco_cache_misses_total[5m])))

2. Maricusco Vendor/Operation Latency (maricusco-vendor-latency.json)

Detailed latency breakdowns:

  • Request Duration by Vendor (P95): time series grouped by vendor.
    • PromQL: histogram_quantile(0.95, sum by (le, vendor) (rate(maricusco_request_duration_seconds_bucket[5m])))
  • Request Duration by Operation (P95): time series grouped by operation.
    • PromQL: histogram_quantile(0.95, sum by (le, operation) (rate(maricusco_request_duration_seconds_bucket[5m])))
  • Latency by Vendor/Operation (P95): table view showing P95 latency for each vendor/operation combination.
    • PromQL: histogram_quantile(0.95, sum by (le, vendor, operation) (rate(maricusco_request_duration_seconds_bucket[5m])))

Note: histogram_quantile is a function, not an aggregation, so it cannot take a by clause; the grouping is done by aggregating the bucket rates with sum by (le, ...) first, keeping the le label.

3. Maricusco Cache Performance (maricusco-cache.json)

Cache-specific metrics:

  • Cache Hit Rate Over Time: time series showing cache hits by cache name (operation label).
    • PromQL: rate(maricusco_cache_hits_total[5m])
  • Cache Miss Rate Over Time: time series showing cache misses by cache name.
    • PromQL: rate(maricusco_cache_misses_total[5m])
  • Cache Hit/Miss Ratio: time series showing the overall hit ratio (0-1 scale).
    • PromQL: sum(rate(maricusco_cache_hits_total[5m])) / (sum(rate(maricusco_cache_hits_total[5m])) + sum(rate(maricusco_cache_misses_total[5m])))
  • Cache Operations by Cache Name: table showing total operations per second by cache name (operation label).
    • PromQL: sum(rate(maricusco_cache_hits_total[5m]) + rate(maricusco_cache_misses_total[5m])) by (operation)

Accessing Grafana

  1. Start services: docker-compose up -d grafana (or docker-compose up -d for all services)
  2. Open browser: Navigate to http://localhost:3000
  3. Login:
    • Local development: Default credentials are admin / admin (when GF_SECURITY_ADMIN_PASSWORD is not set in .env)
    • Production/non-local: Use credentials from .env file (GF_SECURITY_ADMIN_USER / GF_SECURITY_ADMIN_PASSWORD)
  4. Navigate to dashboards: Click "Dashboards" → "Maricusco" folder to see pre-provisioned dashboards
  5. Verify data: Ensure MARICUSCO_METRICS_ENABLED=true for metrics to appear in dashboards

PromQL Queries Reference (Used in Dashboards)

  • P95 request duration: histogram_quantile(0.95, rate(maricusco_request_duration_seconds_bucket[5m]))
  • P95 by vendor: histogram_quantile(0.95, sum by (le, vendor) (rate(maricusco_request_duration_seconds_bucket[5m])))
  • P95 by operation: histogram_quantile(0.95, sum by (le, operation) (rate(maricusco_request_duration_seconds_bucket[5m])))
  • P95 by vendor/operation: histogram_quantile(0.95, sum by (le, vendor, operation) (rate(maricusco_request_duration_seconds_bucket[5m])))
  • API error rate: rate(maricusco_api_errors_total[5m])
  • Rate limit hits: rate(maricusco_api_rate_limit_hits_total[5m])
  • Active agents: maricusco_active_agents
  • Cache hit rate: rate(maricusco_cache_hits_total[5m])
  • Cache miss rate: rate(maricusco_cache_misses_total[5m])
  • Cache hit/miss ratio: sum(rate(maricusco_cache_hits_total[5m])) / (sum(rate(maricusco_cache_hits_total[5m])) + sum(rate(maricusco_cache_misses_total[5m])))
  • Cache operations by name: sum(rate(maricusco_cache_hits_total[5m]) + rate(maricusco_cache_misses_total[5m])) by (operation)
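
To show what the P95 queries compute, here is a simplified sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. It is an approximation for illustration only: the function operates on plain counts rather than per-second bucket rates, it requires 0 < q <= 1, and the bucket boundaries in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket,
    like the le-labelled series of a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

With buckets [(0.5, 60), (1.0, 90), (2.0, 98), (inf, 100)], the 95th observation falls in the 1.0 to 2.0 bucket, five eighths of the way through it, so the estimate is 1.625 seconds. The result is only as precise as the bucket boundaries allow.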

Alerting Rules (optional)

  • Example rules file: docker/prometheus/rules/alerts.yml (mounted read-only).
  • Sample alerts:
    • HighErrorRatio: error/call ratio > 5% over 5m per vendor.
    • HighRequestLatencyP95: p95 request duration > 2s over 5m.
    • MetricsMissing: absence of maricusco_api_calls_total series for 10m.
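  • Enable by ensuring the compose volume mount for /etc/prometheus/rules is present. Prometheus does not watch rule files for changes; it picks them up on a configuration reload (SIGHUP, or POST /-/reload when started with --web.enable-lifecycle) or on restart.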

Monitoring Integration

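  • The primary scrape target remains /metrics. The /health response is lightweight JSON rather than the Prometheus exposition format, so Prometheus cannot scrape it directly; monitor it through a probe or exporter if desired.
  • Alert on the /health status code (e.g., blackbox exporter HTTP probe), on dependency status fields extracted from the response (e.g., json_exporter with JSONPath), or on metrics signaling downstream failures.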
  • Container health status propagates to orchestrators (Docker healthcheck) and can be visualized via docker-compose ps or platform dashboards.

Code Examples

Instrumenting a function

from maricusco.utils.metrics import execution_time_seconds, label_values, record_latency, events_total

def fetch_prices(symbol: str):
    labels = label_values(vendor="alpha_vantage", agent_type="data", operation="fetch_prices")
    with record_latency(execution_time_seconds, labels):
        data = _call_api(symbol)
    events_total.labels(*labels).inc()
    return data

Custom metric definition

from prometheus_client import Gauge
from maricusco.utils.metrics import get_registry

custom_gauge = Gauge(
    "maricusco_custom_health",
    "Service health score",
    ["component"],
    registry=get_registry(),
)

def update_health(component: str, value: float):
    custom_gauge.labels(component).set(value)

Using decorators for automatic timing/counts

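from maricusco.utils.metrics import api_calls_total, count_calls, track_latency, request_duration_seconds, label_values

labels = label_values("alpha_vantage", "data", "get_news")

# Assumes the maricusco_api_calls_total counter is exported as api_calls_total;
# a histogram is not a counter, so it should not be passed as counter=.
@count_calls(counter=api_calls_total, labels=labels)  # increments the call counter
@track_latency(histogram=request_duration_seconds, labels=labels)  # observes call duration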
def get_news():
    ...

PromQL Examples

  • 95th percentile request latency (5m):
    histogram_quantile(0.95, rate(maricusco_request_duration_seconds_bucket[5m]))
  • Error rate per vendor (5m):
    sum by (vendor) (rate(maricusco_api_errors_total[5m]))
  • Error ratio (per vendor, 5m):
    sum by (vendor) (rate(maricusco_api_errors_total[5m])) / sum by (vendor) (rate(maricusco_api_calls_total[5m]))
  • Rate limit hits (5m):
    rate(maricusco_api_rate_limit_hits_total[5m])
  • Active agents:
    maricusco_active_agents