Monitoring and Metrics (Prometheus)

Structured Logging (JSON + Context)

  • Framework: structlog with stdlib handlers; JSON in production, colored console in development.
  • Context propagation via contextvars: correlation_id, session_id, ticker, user, agent_name, tool/vendor metadata.
  • Rotation: size-based by default (REDHOUND_LOG_MAX_BYTES, default 10 MB; REDHOUND_LOG_BACKUP_COUNT, default 5). File logging is off by default; enable with REDHOUND_LOG_FILE_ENABLED=true.
  • Docker parity: docker-compose.yml uses json-file with max-size: 10m, max-file: 5 to match app rotation.
  • Retention: optional cleanup of rotated files older than REDHOUND_LOG_RETENTION_DAYS (days).
  • Formats:
    • JSON (prod): machine-readable for aggregation.
    • Human-readable (dev): REDHOUND_LOG_FORMAT=human-readable.
  • Example (JSON):
    {
      "event": "agent_execution_start",
      "logger": "redhound.orchestration.trading_graph",
      "level": "info",
      "timestamp": "2025-12-10T12:30:00.000Z",
      "correlation_id": "corr_abc123",
      "session_id": "sess_789",
      "ticker": "AAPL",
      "user": "user-abc"
    }
    
  • Config (env):
    • REDHOUND_LOG_LEVEL (default: INFO)
    • REDHOUND_LOG_FORMAT (json | human-readable, default: json)
    • REDHOUND_LOG_CONSOLE (default: true)
    • REDHOUND_LOG_FILE_ENABLED (default: false)
    • REDHOUND_LOG_FILE (path; default: redhound/logs/redhound.log)
    • REDHOUND_LOG_MAX_BYTES (default: 10485760 = 10MB)
    • REDHOUND_LOG_BACKUP_COUNT (default: 5)
    • REDHOUND_LOG_RETENTION_DAYS (default: 30; remove rotated files older than N days)
    • REDHOUND_LOG_ROTATION (size | time, default: size)
  • Usage examples:
    from backend.utils.logging import get_logger, log_context, bind_context
    
    log = get_logger(__name__)
    
    # Ad-hoc scope
    with log_context(correlation_id="abc", session_id="sess-1", ticker="AAPL"):
        log.info("analysis_start", agent="technical")
    
    # Persistent binding
    bind_context(agent_name="trader")
    log.info("decision_ready", ticker="AAPL", action="HOLD")
    
  • Querying JSON logs (example):
    jq 'select(.correlation_id=="corr_abc123") | {event, level, ticker, session_id}' redhound/logs/redhound.log
    
  • Best practices:
    • Do not log secrets or payloads containing API keys (see the redaction sketch below).
    • Prefer structured fields over free-form strings.
    • Include correlation/session/ticker fields for traceability.
    • Keep file logging disabled in ephemeral/containerized environments unless needed.
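
  To enforce the no-secrets rule mechanically, here is a minimal sketch of a structlog processor that redacts sensitive keys. The processor name and key list are illustrative, not part of the codebase; only the (logger, method_name, event_dict) processor signature is standard structlog API.

    # Hypothetical redaction processor; add it to your structlog processor chain.
    SENSITIVE_KEYS = {"api_key", "authorization", "password", "secret", "token"}

    def redact_secrets(logger, method_name, event_dict):
        """Replace values of sensitive keys before the event is rendered."""
        for key in list(event_dict):
            if key.lower() in SENSITIVE_KEYS:
                event_dict[key] = "[REDACTED]"
        return event_dict

    # e.g. structlog.configure(processors=[redact_secrets, structlog.processors.JSONRenderer()])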

Architecture Overview

  • Metrics are emitted via an in-process prometheus_client registry (redhound.utils.metrics).
  • Metrics endpoint at /metrics (FastAPI router in redhound/api/metrics.py); returns 503 when disabled, 500 on rendering errors. A sketch of such a route follows this list.
  • Prometheus server (docker-compose prometheus service) scrapes app:8000/metrics every 15s by default.
  • Grafana can be pointed at Prometheus for dashboards and alerts.
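
A minimal sketch of what a /metrics route can look like with FastAPI and prometheus_client. The route body below is illustrative, not the actual redhound/api/metrics.py handler:

# Illustrative /metrics route; METRICS_ENABLED stands in for REDHOUND_METRICS_ENABLED.
from fastapi import APIRouter, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

router = APIRouter()
METRICS_ENABLED = True

@router.get("/metrics")
def metrics() -> Response:
    if not METRICS_ENABLED:
        return Response(status_code=503)  # endpoint disabled
    try:
        payload = generate_latest()  # render the default registry
    except Exception:
        return Response(status_code=500)  # rendering failed
    return Response(content=payload, media_type=CONTENT_TYPE_LATEST)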

Health Check Endpoint

  • Endpoint: FastAPI /health (configurable via REDHOUND_HEALTHCHECK_PATH, default: /health).
  • Status codes: 200 when all required dependencies are healthy; 503 when any required dependency (PostgreSQL, Redis) is unhealthy. Optional dependencies and vendor checks are opt-in.
  • Response schema:
    {
      "status": "healthy" | "unhealthy",
      "timestamp": "2025-12-09T10:48:54.066885Z",
      "dependencies": {
        "database": "healthy" | "unhealthy",
        "redis": "healthy" | "unhealthy",
        "data_vendors": "healthy" | "unhealthy" | "not_checked"
      },
      "errors": { "<dependency>": "<message>" }
    }
    
  • Configuration:
    • Enable/disable: REDHOUND_HEALTHCHECK_ENABLED (default: true).
    • Endpoint path: REDHOUND_HEALTHCHECK_PATH (default: /health).
    • Dependency lists:
      • Required dependencies: defaults to database, redis; override with REDHOUND_HEALTHCHECK_REQUIRED_DEPENDENCIES (comma-separated; set to an empty string to disable all required deps in local/dev).
      • Optional dependencies: REDHOUND_HEALTHCHECK_OPTIONAL_DEPENDENCIES (comma-separated).
      • Vendor checks: toggle with REDHOUND_HEALTHCHECK_VENDORS_ENABLED (default: false).
    • Timeouts:
      • REDHOUND_HEALTHCHECK_TIMEOUT_DEFAULT (default: 1.0)
      • REDHOUND_HEALTHCHECK_TIMEOUT_DATABASE (default: 1.0)
      • REDHOUND_HEALTHCHECK_TIMEOUT_REDIS (default: 1.0)
      • REDHOUND_HEALTHCHECK_TIMEOUT_VENDORS (default: 1.0)
    • Cache TTL: REDHOUND_HEALTHCHECK_CACHE_TTL_SECONDS (default: 0.0 = no cache).
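
To verify the endpoint end to end, here is a minimal client-side probe, assuming the service listens on localhost:8000 with the default /health path:

# Standard-library /health probe; host, port, and path are assumptions.
import json
import urllib.error
import urllib.request

URL = "http://localhost:8000/health"
try:
    with urllib.request.urlopen(URL, timeout=2.0) as resp:
        status, body = resp.status, json.load(resp)
except urllib.error.HTTPError as exc:  # a 503 response raises HTTPError
    status, body = exc.code, json.load(exc)

print(status, body["status"], body.get("dependencies"))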

Metric Types and Names

  • Counters:
    • redhound_api_calls_total — API calls by vendor/agent/operation.
    • redhound_api_errors_total — API failures.
    • redhound_api_rate_limit_hits_total — rate-limit events.
    • redhound_events_total — domain events.
    • redhound_errors_by_type_total — errors by exception type.
    • redhound_retry_attempts_total — retry attempts.
    • redhound_cache_hits_total, redhound_cache_misses_total — cache outcomes.
  • Gauges:
    • redhound_active_agents — active agent executions.
    • redhound_queue_size — queue depth (wire to your queue if used).
    • redhound_current_value — arbitrary numeric values (balances, positions, etc.).
  • Histograms:
    • redhound_latency_seconds — external/internal step latency.
    • redhound_execution_time_seconds — agent/orchestration execution time.
    • redhound_request_duration_seconds — API/vendor request duration.

Label set (shared): vendor, agent_type, operation. Some metrics also add error_type.

Queue and Cache Instrumentation

  • Cache client: redhound.data.cache.CacheClient uses Redis when available, falls back to in-memory. get_cache_client() returns a singleton.
  • Cache hits/misses: record_cache_hit(cache_name) / record_cache_miss(cache_name) used inside the cache client; labels default to vendor=redis|local, agent_type=cache, operation=<cache_name>.
  • Cache latency: cache client records latency via latency_seconds histogram for get/set operations.
  • Queue depth: redhound.orchestration.queue_metrics.report_queue_depth(queue_name, depth) emits redhound_queue_size; defaults vendor=internal, agent_type=queue.
  • Queue latency: record_queue_latency(queue_name, duration_seconds) records enqueue/dequeue latency via latency_seconds.
  • Current state: no message queue is configured; the helpers are ready for future queue integration and can be called from any queue implementation, as in the sketch below.
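
A sketch of wiring these helpers into a hypothetical queue wrapper. The SimpleQueue class is illustrative; only the helper names and signatures come from the bullets above:

# Hypothetical in-memory queue instrumented with the redhound helpers.
import time
from collections import deque

from redhound.orchestration.queue_metrics import record_queue_latency, report_queue_depth

class SimpleQueue:
    def __init__(self, name: str):
        self.name = name
        self._items: deque = deque()

    def enqueue(self, item) -> None:
        start = time.perf_counter()
        self._items.append(item)
        record_queue_latency(self.name, time.perf_counter() - start)  # enqueue latency
        report_queue_depth(self.name, len(self._items))  # emits redhound_queue_size

    def dequeue(self):
        start = time.perf_counter()
        item = self._items.popleft()
        record_queue_latency(self.name, time.perf_counter() - start)  # dequeue latency
        report_queue_depth(self.name, len(self._items))
        return item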

Error Rate Over Time

  • Use PromQL rate over counters rather than a separate gauge:
    sum by (vendor) (rate(redhound_api_errors_total[5m]))
  • Error ratio vs calls (per vendor):
    sum by (vendor) (rate(redhound_api_errors_total[5m])) / sum by (vendor) (rate(redhound_api_calls_total[5m]))

Configuration

  • Enable/disable: REDHOUND_METRICS_ENABLED (default: false).
  • Endpoint path: REDHOUND_METRICS_PATH (default: /metrics).
  • Endpoint port: REDHOUND_METRICS_PORT (default: 8000).
  • Sampling rate: REDHOUND_METRICS_SAMPLING_RATE (default: 1.0).
  • Constant labels: REDHOUND_METRICS_LABELS (comma-separated key=value pairs).
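
A sketch of how a sampling rate and constant labels are conventionally applied. The helper names below are illustrative, not redhound API; only the environment variable names come from the list above:

# Illustrative sampling/label handling; not the actual redhound implementation.
import os
import random

SAMPLING_RATE = float(os.getenv("REDHOUND_METRICS_SAMPLING_RATE", "1.0"))

def parse_constant_labels(raw: str) -> dict:
    """Parse 'env=prod,region=eu' into {'env': 'prod', 'region': 'eu'}."""
    pairs = (item.split("=", 1) for item in raw.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}

def maybe_observe(histogram, value: float) -> None:
    """Record only a SAMPLING_RATE fraction of observations."""
    if random.random() < SAMPLING_RATE:
        histogram.observe(value)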

Prometheus Server Setup

  • Compose service prometheus in docker-compose.yml.
  • Config: docker/prometheus/prometheus.yml
    • scrape_interval: 15s, metrics_path: /metrics, target app:8000.
    • Rule files placeholder: /etc/prometheus/rules/*.yml (add via bind mount).
  • Retention: --storage.tsdb.retention.time=15d (compose command).
  • Run: docker-compose up -d prometheus
  • UI: http://localhost:9090

Grafana Setup and Configuration

Docker Compose Service

  • Service: grafana in docker-compose.yml.
  • Port: 3000 by default (configurable via the GRAFANA_PORT environment variable).
  • Access URL: http://localhost:3000
  • Network: Connected to redhound-network for internal communication with Prometheus.
  • Dependencies: Depends on Prometheus service (depends_on: prometheus).
  • Persistent storage: grafana_data volume for dashboard and user data persistence.

Environment Variables

  • GF_SECURITY_ADMIN_USER: Grafana admin username (defaults to admin if not set).
  • GF_SECURITY_ADMIN_PASSWORD: Grafana admin password (defaults to admin if not set).
  • Important: for non-local/production environments, set GF_SECURITY_ADMIN_PASSWORD to a secure password via the .env file.
  • For local development, the default admin / admin credentials are convenient, but they are not secure for production.
  • Note: Grafana uses the GF_ prefix for all of its configuration environment variables.

Data Source Provisioning

  • Prometheus data source is automatically provisioned via docker/grafana/provisioning/datasources/prometheus.yml.
  • Data source name: "Prometheus"
  • URL: http://prometheus:9090 (internal Docker network)
  • Access mode: Server (proxy)
  • Is default: Yes
  • No manual configuration required; data source is available immediately after Grafana starts.

Dashboard Provisioning

  • Dashboards are automatically provisioned from docker/grafana/provisioning/dashboards/.
  • Dashboard provider configuration: docker/grafana/provisioning/dashboards/dashboard.yml
  • Folder: "Redhound" (dashboards organized in this folder in Grafana UI)
  • Auto-discovery: Grafana scans for .json files every 10 seconds
  • All dashboard JSON files in the provisioning directory are automatically loaded

Pre-Provisioned Dashboards

1. Redhound Overview (redhound-overview.json)

Comprehensive overview of system metrics:

  • P95 Request Duration: time series showing 95th-percentile request duration.
    PromQL: histogram_quantile(0.95, sum by (le) (rate(redhound_request_duration_seconds_bucket[5m])))
  • API Error Rate: time series showing error rate by vendor/operation.
    PromQL: rate(redhound_api_errors_total[5m])
  • Rate Limit Hits: time series showing rate-limit hits by vendor/operation.
    PromQL: rate(redhound_api_rate_limit_hits_total[5m])
  • Active Agents: gauge panel showing the current active-agent count.
    PromQL: redhound_active_agents
  • Cache Hit/Miss Ratio: time series showing the overall cache hit/miss ratio (0-1 scale).
    PromQL: sum(rate(redhound_cache_hits_total[5m])) / (sum(rate(redhound_cache_hits_total[5m])) + sum(rate(redhound_cache_misses_total[5m])))

2. Redhound Vendor/Operation Latency (redhound-vendor-latency.json)

Detailed latency breakdowns:

  • Request Duration by Vendor (P95): time series grouped by vendor.
    PromQL: histogram_quantile(0.95, sum by (le, vendor) (rate(redhound_request_duration_seconds_bucket[5m])))
  • Request Duration by Operation (P95): time series grouped by operation.
    PromQL: histogram_quantile(0.95, sum by (le, operation) (rate(redhound_request_duration_seconds_bucket[5m])))
  • Latency by Vendor/Operation (P95): table view showing P95 latency for each vendor/operation combination.
    PromQL: histogram_quantile(0.95, sum by (le, vendor, operation) (rate(redhound_request_duration_seconds_bucket[5m])))

3. Redhound Cache Performance (redhound-cache.json)

Cache-specific metrics:

  • Cache Hit Rate Over Time: time series showing cache hits by cache name (operation label).
    PromQL: rate(redhound_cache_hits_total[5m])
  • Cache Miss Rate Over Time: time series showing cache misses by cache name.
    PromQL: rate(redhound_cache_misses_total[5m])
  • Cache Hit/Miss Ratio: time series showing the overall ratio (0-1 scale).
    PromQL: sum(rate(redhound_cache_hits_total[5m])) / (sum(rate(redhound_cache_hits_total[5m])) + sum(rate(redhound_cache_misses_total[5m])))
  • Cache Operations by Cache Name: table showing total operations per second by cache name (operation label).
    PromQL: sum(rate(redhound_cache_hits_total[5m]) + rate(redhound_cache_misses_total[5m])) by (operation)

Accessing Grafana

  1. Start services: docker-compose up -d grafana (or docker-compose up -d for all services).
  2. Open browser: navigate to http://localhost:3000.
  3. Login:
     • Local development: default credentials are admin / admin (when GF_SECURITY_ADMIN_PASSWORD is not set in .env).
     • Production/non-local: use credentials from the .env file (GF_SECURITY_ADMIN_USER / GF_SECURITY_ADMIN_PASSWORD).
  4. Navigate to dashboards: click "Dashboards" → the "Redhound" folder to see the pre-provisioned dashboards.
  5. Verify data: ensure REDHOUND_METRICS_ENABLED=true for metrics to appear in dashboards.

PromQL Queries Reference (Used in Dashboards)

  • P95 request duration: histogram_quantile(0.95, sum by (le) (rate(redhound_request_duration_seconds_bucket[5m])))
  • P95 by vendor: histogram_quantile(0.95, sum by (le, vendor) (rate(redhound_request_duration_seconds_bucket[5m])))
  • P95 by operation: histogram_quantile(0.95, sum by (le, operation) (rate(redhound_request_duration_seconds_bucket[5m])))
  • P95 by vendor/operation: histogram_quantile(0.95, sum by (le, vendor, operation) (rate(redhound_request_duration_seconds_bucket[5m])))
  • API error rate: rate(redhound_api_errors_total[5m])
  • Rate limit hits: rate(redhound_api_rate_limit_hits_total[5m])
  • Active agents: redhound_active_agents
  • Cache hit rate: rate(redhound_cache_hits_total[5m])
  • Cache miss rate: rate(redhound_cache_misses_total[5m])
  • Cache hit/miss ratio: sum(rate(redhound_cache_hits_total[5m])) / (sum(rate(redhound_cache_hits_total[5m])) + sum(rate(redhound_cache_misses_total[5m])))
  • Cache operations by name: sum(rate(redhound_cache_hits_total[5m]) + rate(redhound_cache_misses_total[5m])) by (operation)

Alerting Rules

  • Alert rules file: docker/prometheus/rules/alerts.yml (mounted read-only).
  • Configured alerts:
    • HighErrorRatio: error/call ratio > 5% over 5m per vendor (severity: warning).
    • HighRequestLatencyP95: p95 request duration > 2s over 5m (severity: warning).
    • MetricsMissing: absence of the redhound_api_calls_total series for 10m (severity: critical).
  • Prometheus evaluates rules every 15s (evaluation_interval).
  • Alerts are sent to Alertmanager for routing and notification.
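  • For reference, HighErrorRatio presumably combines the per-vendor error-ratio query from "Error Rate Over Time" with the 5% threshold:
    sum by (vendor) (rate(redhound_api_errors_total[5m])) / sum by (vendor) (rate(redhound_api_calls_total[5m])) > 0.05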

Alertmanager Configuration and Alert Routing

Overview

Alertmanager receives alerts from Prometheus and routes them to notification channels (Slack). It provides alert grouping, deduplication, silencing, and routing based on severity.

Architecture

Prometheus (evaluates alert rules every 15s)
  → Alertmanager (groups, routes, and deduplicates alerts)
  → Slack (real-time team notifications)

Docker Compose Service

  • Service: alertmanager in docker-compose.yml
  • Image: prom/alertmanager:v0.30.1 (latest stable, released Jan 2026, compatible with Prometheus v3.9.1)
  • Port: 9093 (configurable via ALERTMANAGER_PORT environment variable)
  • Access URL: http://localhost:9093
  • Network: Connected to redhound-network for internal communication with Prometheus
  • Dependencies: Depends on Prometheus service
  • Persistent storage: alertmanager_data volume for alert state persistence

Configuration Files

  • Main config: docker/alertmanager/alertmanager.yml (webhook URL substituted at container start from env)
  • Setup guide: docker/alertmanager/README.md (comprehensive setup and troubleshooting)

Alert Routing Strategy

Severity-Based Routing:

  • Critical alerts (severity: critical):
    • Sent immediately with no grouping delay (group_wait: 0s).
    • Repeated every 2 hours if not resolved (repeat_interval: 2h).
    • Examples: MetricsMissing.
  • Warning alerts (severity: warning):
    • Grouped for 30 seconds before sending (group_wait: 30s).
    • Subsequent grouped alerts sent after 5 minutes (group_interval: 5m).
    • Repeated every 4 hours if not resolved (repeat_interval: 4h).
    • Examples: HighErrorRatio, HighRequestLatencyP95.

Alert Grouping:

  • Alerts are grouped by alertname and severity.
  • Grouping reduces notification spam when multiple similar alerts fire.
  • Example: 3 HighErrorRatio alerts for different vendors → 1 grouped Slack message.

Inhibition Rules:

  • If MetricsMissing (critical) is firing, all other alerts are suppressed.
  • This prevents alert storms when metrics collection fails.

Slack Integration

Setup Process:

  1. Create a Slack Incoming Webhook at https://api.slack.com/messaging/webhooks for the alerts channel and copy the webhook URL.
  2. Set SLACK_ALERTMANAGER_WEBHOOK_URL in your .env file (same directory as docker-compose.yml). Do not commit .env or the URL to git. CI/CD uses GitHub Secrets (SLACK_BOT_TOKEN, SLACK_CHANNEL_ID), not .env; Alertmanager uses this separate webhook for the alerts channel.
  3. Build and start Alertmanager: docker-compose up -d --build alertmanager. The container substitutes the webhook URL into the config at startup; the config file in the repo contains only the placeholder.

Slack Message Format:

  • Firing alerts: 🔴 indicator with severity, alert name, summary, description, and start time.
  • Resolved alerts: ✅ indicator with resolution time and duration.
  • Links: direct links to the Alertmanager UI and Prometheus.
  • Color coding: red (critical), yellow (warning), green (resolved).

Example Slack Message:

🔴 [CRITICAL] MetricsMissing

Status: Firing (1 alert)

---
Alert: MetricsMissing
Severity: critical
Summary: Metrics stream missing
Description: No redhound_api_calls_total series scraped for 10 minutes.
Started: 2026-02-01 10:05:00 UTC

View in Alertmanager | View in Prometheus

Alertmanager UI

Access: http://localhost:9093

Features:

  • View active alerts (currently firing).
  • View silenced alerts (manually suppressed).
  • View alert history.
  • Silence alerts (temporarily suppress notifications).
  • View configuration status.

Silencing Alerts:

  1. Navigate to http://localhost:9093.
  2. Click on the alert to silence.
  3. Click the "Silence" button.
  4. Set a duration and reason.
  5. Click "Create".

Testing Alerts

Test 1: Verify Alertmanager is Running

# Check service status
docker-compose ps alertmanager

# Check logs
docker-compose logs -f alertmanager

# Access UI
open http://localhost:9093

Test 2: Send Test Alert via the Alertmanager API

curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "instance": "test"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing Alertmanager and Slack integration"
    }
  }]'

Test 3: Trigger Real Alert

# Stop app to trigger MetricsMissing alert (after 10 minutes)
docker-compose stop app

# Wait 10+ minutes for alert to fire
# Check Alertmanager UI: http://localhost:9093
# Check Slack for notification

# Restart app to resolve alert
docker-compose start app

Troubleshooting

Alertmanager not starting:

# Check logs for configuration errors
docker-compose logs alertmanager

# Verify configuration syntax
docker-compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml

Alerts not appearing in Alertmanager:

# Verify Prometheus is sending alerts
# Check Prometheus config
docker-compose exec prometheus cat /etc/prometheus/prometheus.yml | grep -A 5 alerting

# Check Prometheus alerts page
open http://localhost:9090/alerts

# Check Prometheus logs
docker-compose logs prometheus | grep alertmanager

Alerts not being sent to Slack:

# Verify webhook URL is set (SLACK_ALERTMANAGER_WEBHOOK_URL or SLACK_WEBHOOK_URL)
docker-compose exec alertmanager env | grep SLACK_

# Test webhook URL manually
curl -X POST YOUR_WEBHOOK_URL_HERE \
  -H 'Content-Type: application/json' \
  -d '{"text": "Test message"}'

# Check Alertmanager logs for errors
docker-compose logs alertmanager | grep -i error

Command-Line Management (amtool)

Installation:

go install github.com/prometheus/alertmanager/cmd/amtool@latest

Common Commands:

# View active alerts
amtool alert query --alertmanager.url=http://localhost:9093

# Silence an alert
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="Your Name" \
  --comment="Maintenance window" \
  --duration=2h \
  alertname=HighErrorRatio

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence
amtool silence expire --alertmanager.url=http://localhost:9093 SILENCE_ID

Security Considerations

Webhook URL Security:

  • Never commit webhook URLs to git.
  • Store the webhook URL only in your local .env; it is substituted into alertmanager.yml at container start.
  • For production, consider using Docker secrets or environment-variable substitution.

Access Control:

  • The Alertmanager UI has no authentication by default.
  • In production, use a reverse proxy with authentication (e.g., nginx with basic auth).

Advanced Configuration

Multiple Slack Channels:

receivers:
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'YOUR_CRITICAL_WEBHOOK_URL'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'YOUR_WARNINGS_WEBHOOK_URL'

route:
  receiver: 'slack-warnings'  # default receiver (required at the route root)
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'

    - match:
        severity: warning
      receiver: 'slack-warnings'

Email Notifications:

receivers:
  - name: 'slack-and-email'
    slack_configs:
      - api_url: 'YOUR_WEBHOOK_URL'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'

Monitoring Integration

  • Prometheus can scrape /health separately if desired (lightweight JSON); primary scrape remains /metrics.
  • Use alerting rules on dependency status fields in health responses (e.g., blackbox exporter with JSONPath) or on metrics signaling downstream failures.
  • Container health status propagates to orchestrators (Docker healthcheck) and can be visualized via docker-compose ps or platform dashboards.

Code Examples

Instrumenting a function

from backend.utils.metrics import execution_time_seconds, label_values, record_latency, events_total

def fetch_prices(symbol: str):
    labels = label_values(vendor="alpha_vantage", agent_type="data", operation="fetch_prices")
    with record_latency(execution_time_seconds, labels):
        data = _call_api(symbol)  # placeholder for the actual vendor call
    events_total.labels(*labels).inc()
    return data

Custom metric definition

from prometheus_client import Gauge
from backend.utils.metrics import get_registry

custom_gauge = Gauge(
    "redhound_custom_health",
    "Service health score",
    ["component"],
    registry=get_registry(),
)

def update_health(component: str, value: float):
    custom_gauge.labels(component).set(value)
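
Usage (the component name is hypothetical): update_health("redis", 1.0) sets redhound_custom_health{component="redis"} to 1.0.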

Using decorators for automatic timing/counts

from backend.utils.metrics import count_calls, track_latency, request_duration_seconds, label_values

labels = label_values("alpha_vantage", "data", "get_news")

@count_calls(counter=request_duration_seconds, labels=labels)  # increments histogram count
@track_latency(histogram=request_duration_seconds, labels=labels)
def get_news():
    ...

PromQL Examples

  • 95th percentile request latency (5m):
    histogram_quantile(0.95, sum by (le) (rate(redhound_request_duration_seconds_bucket[5m])))
  • Error rate per vendor (5m):
    sum by (vendor) (rate(redhound_api_errors_total[5m]))
  • Error ratio (per vendor, 5m):
    sum by (vendor) (rate(redhound_api_errors_total[5m])) / sum by (vendor) (rate(redhound_api_calls_total[5m]))
  • Rate limit hits (5m):
    rate(redhound_api_rate_limit_hits_total[5m])
  • Active agents:
    redhound_active_agents