Monitoring and Metrics (Prometheus)

Structured Logging (JSON + Context)

  • Framework: structlog with stdlib handlers; JSON in production, colored console in development.
  • Context propagation via contextvars: correlation_id, session_id, ticker, user, agent_name, tool/vendor metadata.
  • Rotation: size-based by default (REDHOUND_LOG_MAX_BYTES, default 10 MB; REDHOUND_LOG_BACKUP_COUNT, default 5). File logging is off by default; enable with REDHOUND_LOG_FILE_ENABLED=true.
  • Docker parity: docker-compose.yml uses json-file with max-size: 10m, max-file: 5 to match app rotation.
  • Retention: optional cleanup of rotated files older than REDHOUND_LOG_RETENTION_DAYS (days).
  • Formats:
    • JSON (prod): machine-readable for aggregation.
    • Human-readable (dev): REDHOUND_LOG_FORMAT=human-readable.
  • Example (JSON):
    {
      "event": "agent_execution_start",
      "logger": "redhound.orchestration.trading_graph",
      "level": "info",
      "timestamp": "2025-12-10T12:30:00.000Z",
      "correlation_id": "corr_abc123",
      "session_id": "sess_789",
      "ticker": "AAPL",
      "user": "user-abc"
    }
    
  • Config (env):
    • REDHOUND_LOG_LEVEL (default: INFO)
    • REDHOUND_LOG_FORMAT (json | human-readable, default: json)
    • REDHOUND_LOG_CONSOLE (default: true)
    • REDHOUND_LOG_FILE_ENABLED (default: false)
    • REDHOUND_LOG_FILE (path; default: redhound/logs/redhound.log)
    • REDHOUND_LOG_MAX_BYTES (default: 10485760 = 10MB)
    • REDHOUND_LOG_BACKUP_COUNT (default: 5)
    • REDHOUND_LOG_RETENTION_DAYS (default: 30; remove rotated files older than N days)
    • REDHOUND_LOG_ROTATION (size | time, default: size)
  • Usage examples:
    from backend.utils.logging import get_logger, log_context, bind_context
    
    log = get_logger(__name__)
    
    # Ad-hoc scope
    with log_context(correlation_id="abc", session_id="sess-1", ticker="AAPL"):
        log.info("analysis_start", agent="technical")
    
    # Persistent binding
    bind_context(agent_name="trader")
    log.info("decision_ready", ticker="AAPL", action="HOLD")
    
  • Querying JSON logs (example):
    jq 'select(.correlation_id=="corr_abc123") | {event, level, ticker, session_id}' redhound/logs/redhound.log
    
  • Best practices:
    • Do not log secrets or payloads containing API keys (see the redaction sketch below).
    • Prefer structured fields over free-form strings.
    • Include correlation/session/ticker fields for traceability.
    • Keep file logging disabled in ephemeral/containerized environments unless needed.
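
  To enforce the no-secrets rule mechanically, here is a minimal sketch of a structlog processor that redacts sensitive keys. The processor name and key list are illustrative, not part of the codebase; only the (logger, method_name, event_dict) processor signature is standard structlog API.

    # Hypothetical redaction processor; add it to your structlog processor chain.
    SENSITIVE_KEYS = {"api_key", "authorization", "password", "secret", "token"}

    def redact_secrets(logger, method_name, event_dict):
        """Replace values of sensitive keys before the event is rendered."""
        for key in list(event_dict):
            if key.lower() in SENSITIVE_KEYS:
                event_dict[key] = "[REDACTED]"
        return event_dict

    # e.g. structlog.configure(processors=[redact_secrets, structlog.processors.JSONRenderer()])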

Architecture Overview

  • Metrics are emitted via an in-process prometheus_client registry (redhound.utils.metrics).
  • Metrics endpoint at /metrics (FastAPI router in redhound/api/metrics.py); returns 503 when disabled, 500 on rendering errors. A sketch of such a route follows this list.
  • Prometheus server (docker-compose prometheus service) scrapes app:8000/metrics every 15s by default.
  • Grafana can be pointed at Prometheus for dashboards and alerts.
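
A minimal sketch of what a /metrics route can look like with FastAPI and prometheus_client. The route body below is illustrative, not the actual redhound/api/metrics.py handler:

# Illustrative /metrics route; METRICS_ENABLED stands in for REDHOUND_METRICS_ENABLED.
from fastapi import APIRouter, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

router = APIRouter()
METRICS_ENABLED = True

@router.get("/metrics")
def metrics() -> Response:
    if not METRICS_ENABLED:
        return Response(status_code=503)  # endpoint disabled
    try:
        payload = generate_latest()  # render the default registry
    except Exception:
        return Response(status_code=500)  # rendering failed
    return Response(content=payload, media_type=CONTENT_TYPE_LATEST)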

Health Check Endpoint

  • Endpoint: FastAPI /health (configurable via REDHOUND_HEALTHCHECK_PATH, default: /health).
  • Status codes: 200 when all required dependencies are healthy; 503 when any required dependency (PostgreSQL, Redis) is unhealthy. Optional dependencies and vendor checks are opt-in.
  • Response schema:
    {
      "status": "healthy" | "unhealthy",
      "timestamp": "2025-12-09T10:48:54.066885Z",
      "dependencies": {
        "database": "healthy" | "unhealthy",
        "redis": "healthy" | "unhealthy",
        "data_vendors": "healthy" | "unhealthy" | "not_checked"
      },
      "errors": { "<dependency>": "<message>" }
    }
    
  • Configuration:
    • Enable/disable: REDHOUND_HEALTHCHECK_ENABLED (default: true).
    • Endpoint path: REDHOUND_HEALTHCHECK_PATH (default: /health).
    • Dependency lists:
      • Required dependencies: defaults to database, redis; override with REDHOUND_HEALTHCHECK_REQUIRED_DEPENDENCIES (comma-separated; set to an empty string to disable all required deps in local/dev).
      • Optional dependencies: REDHOUND_HEALTHCHECK_OPTIONAL_DEPENDENCIES (comma-separated).
      • Vendor checks: toggle with REDHOUND_HEALTHCHECK_VENDORS_ENABLED (default: false).
    • Timeouts:
      • REDHOUND_HEALTHCHECK_TIMEOUT_DEFAULT (default: 1.0)
      • REDHOUND_HEALTHCHECK_TIMEOUT_DATABASE (default: 1.0)
      • REDHOUND_HEALTHCHECK_TIMEOUT_REDIS (default: 1.0)
      • REDHOUND_HEALTHCHECK_TIMEOUT_VENDORS (default: 1.0)
    • Cache TTL: REDHOUND_HEALTHCHECK_CACHE_TTL_SECONDS (default: 0.0 = no cache).
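
To verify the endpoint end to end, here is a minimal client-side probe, assuming the service listens on localhost:8000 with the default /health path:

# Standard-library /health probe; host, port, and path are assumptions.
import json
import urllib.error
import urllib.request

URL = "http://localhost:8000/health"
try:
    with urllib.request.urlopen(URL, timeout=2.0) as resp:
        status, body = resp.status, json.load(resp)
except urllib.error.HTTPError as exc:  # a 503 response raises HTTPError
    status, body = exc.code, json.load(exc)

print(status, body["status"], body.get("dependencies"))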

Metric Types and Names

  • Counters:
    • redhound_api_calls_total — API calls by vendor/agent/operation.
    • redhound_api_errors_total — API failures.
    • redhound_api_rate_limit_hits_total — rate-limit events.
    • redhound_events_total — domain events.
    • redhound_errors_by_type_total — errors by exception type.
    • redhound_retry_attempts_total — retry attempts.
    • redhound_cache_hits_total, redhound_cache_misses_total — cache outcomes.
  • Gauges:
    • redhound_active_agents — active agent executions.
    • redhound_queue_size — queue depth (wire to your queue if used).
    • redhound_current_value — arbitrary numeric values (balances, positions, etc.).
  • Histograms:
    • redhound_latency_seconds — external/internal step latency.
    • redhound_execution_time_seconds — agent/orchestration execution time.
    • redhound_request_duration_seconds — API/vendor request duration.

Label set (shared): vendor, agent_type, operation. Some metrics also add error_type.

Queue and Cache Instrumentation

  • Cache client: redhound.data.cache.CacheClient uses Redis when available, falls back to in-memory. get_cache_client() returns a singleton.
  • Cache hits/misses: record_cache_hit(cache_name) / record_cache_miss(cache_name) used inside the cache client; labels default to vendor=redis|local, agent_type=cache, operation=<cache_name>.
  • Cache latency: cache client records latency via latency_seconds histogram for get/set operations.
  • Queue depth: redhound.orchestration.queue_metrics.report_queue_depth(queue_name, depth) emits redhound_queue_size; defaults vendor=internal, agent_type=queue.
  • Queue latency: record_queue_latency(queue_name, duration_seconds) records enqueue/dequeue latency via latency_seconds.
  • Current state: no message queue is configured; the helpers are ready for future queue integration and can be called from any queue implementation, as in the sketch below.
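
A sketch of wiring these helpers into a hypothetical queue wrapper. The SimpleQueue class is illustrative; only the helper names and signatures come from the bullets above:

# Hypothetical in-memory queue instrumented with the redhound helpers.
import time
from collections import deque

from redhound.orchestration.queue_metrics import record_queue_latency, report_queue_depth

class SimpleQueue:
    def __init__(self, name: str):
        self.name = name
        self._items: deque = deque()

    def enqueue(self, item) -> None:
        start = time.perf_counter()
        self._items.append(item)
        record_queue_latency(self.name, time.perf_counter() - start)  # enqueue latency
        report_queue_depth(self.name, len(self._items))  # emits redhound_queue_size

    def dequeue(self):
        start = time.perf_counter()
        item = self._items.popleft()
        record_queue_latency(self.name, time.perf_counter() - start)  # dequeue latency
        report_queue_depth(self.name, len(self._items))
        return item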

Error Rate Over Time

  • Use PromQL rate over counters rather than a separate gauge:
    sum by (vendor) (rate(redhound_api_errors_total[5m]))
  • Error ratio vs calls (per vendor):
    sum by (vendor) (rate(redhound_api_errors_total[5m])) / sum by (vendor) (rate(redhound_api_calls_total[5m]))

Configuration

  • Enable/disable: REDHOUND_METRICS_ENABLED (default: false).
  • Endpoint path: REDHOUND_METRICS_PATH (default: /metrics).
  • Endpoint port: REDHOUND_METRICS_PORT (default: 8000).
  • Sampling rate: REDHOUND_METRICS_SAMPLING_RATE (default: 1.0).
  • Constant labels: REDHOUND_METRICS_LABELS (comma-separated key=value pairs).
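
A sketch of how a sampling rate and constant labels are conventionally applied. The helper names below are illustrative, not redhound API; only the environment variable names come from the list above:

# Illustrative sampling/label handling; not the actual redhound implementation.
import os
import random

SAMPLING_RATE = float(os.getenv("REDHOUND_METRICS_SAMPLING_RATE", "1.0"))

def parse_constant_labels(raw: str) -> dict:
    """Parse 'env=prod,region=eu' into {'env': 'prod', 'region': 'eu'}."""
    pairs = (item.split("=", 1) for item in raw.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}

def maybe_observe(histogram, value: float) -> None:
    """Record only a SAMPLING_RATE fraction of observations."""
    if random.random() < SAMPLING_RATE:
        histogram.observe(value)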

Prometheus Server Setup

  • Compose service prometheus in docker-compose.yml.
  • Config: docker/prometheus/prometheus.yml
    • scrape_interval: 15s, metrics_path: /metrics, target app:8000.
    • Rule files placeholder: /etc/prometheus/rules/*.yml (add via bind mount).
  • Retention: --storage.tsdb.retention.time=15d (compose command).
  • Run: docker-compose up -d prometheus
  • UI: http://localhost:9090

Grafana Setup and Configuration

Docker Compose Service

  • Service: grafana in docker-compose.yml.
  • Port: 3000 by default (configurable via the GRAFANA_PORT environment variable).
  • Access URL: http://localhost:3000
  • Network: Connected to redhound-network for internal communication with Prometheus.
  • Dependencies: Depends on Prometheus service (depends_on: prometheus).
  • Persistent storage: grafana_data volume for dashboard and user data persistence.

Environment Variables

  • GF_SECURITY_ADMIN_USER: Grafana admin username (defaults to admin if not set).
  • GF_SECURITY_ADMIN_PASSWORD: Grafana admin password (defaults to admin if not set).
  • Important: for non-local/production environments, set GF_SECURITY_ADMIN_PASSWORD to a secure password via the .env file.
  • For local development, the default admin / admin credentials are convenient, but they are not secure for production.
  • Note: Grafana uses the GF_ prefix for all of its configuration environment variables.

Data Source Provisioning

  • Prometheus data source is automatically provisioned via docker/grafana/provisioning/datasources/prometheus.yml.
  • Data source name: "Prometheus"
  • URL: http://prometheus:9090 (internal Docker network)
  • Access mode: Server (proxy)
  • Is default: Yes
  • No manual configuration required; data source is available immediately after Grafana starts.

Dashboard Provisioning

  • Dashboards are automatically provisioned from docker/grafana/provisioning/dashboards/.
  • Dashboard provider configuration: docker/grafana/provisioning/dashboards/dashboard.yml
  • Folder: "Redhound" (dashboards organized in this folder in Grafana UI)
  • Auto-discovery: Grafana scans for .json files every 10 seconds
  • All dashboard JSON files in the provisioning directory are automatically loaded

Pre-Provisioned Dashboards

1. Redhound Overview (redhound-overview.json)

Comprehensive overview of system metrics:

  • P95 Request Duration: time series showing 95th-percentile request duration.
    PromQL: histogram_quantile(0.95, sum by (le) (rate(redhound_request_duration_seconds_bucket[5m])))
  • API Error Rate: time series showing error rate by vendor/operation.
    PromQL: rate(redhound_api_errors_total[5m])
  • Rate Limit Hits: time series showing rate-limit hits by vendor/operation.
    PromQL: rate(redhound_api_rate_limit_hits_total[5m])
  • Active Agents: gauge panel showing the current active-agent count.
    PromQL: redhound_active_agents
  • Cache Hit/Miss Ratio: time series showing the overall cache hit/miss ratio (0-1 scale).
    PromQL: sum(rate(redhound_cache_hits_total[5m])) / (sum(rate(redhound_cache_hits_total[5m])) + sum(rate(redhound_cache_misses_total[5m])))

2. Redhound Vendor/Operation Latency (redhound-vendor-latency.json)

Detailed latency breakdowns:

  • Request Duration by Vendor (P95): time series grouped by vendor.
    PromQL: histogram_quantile(0.95, sum by (le, vendor) (rate(redhound_request_duration_seconds_bucket[5m])))
  • Request Duration by Operation (P95): time series grouped by operation.
    PromQL: histogram_quantile(0.95, sum by (le, operation) (rate(redhound_request_duration_seconds_bucket[5m])))
  • Latency by Vendor/Operation (P95): table view showing P95 latency for each vendor/operation combination.
    PromQL: histogram_quantile(0.95, sum by (le, vendor, operation) (rate(redhound_request_duration_seconds_bucket[5m])))

3. Redhound Cache Performance (redhound-cache.json)

Cache-specific metrics:

  • Cache Hit Rate Over Time: time series showing cache hits by cache name (operation label).
    PromQL: rate(redhound_cache_hits_total[5m])
  • Cache Miss Rate Over Time: time series showing cache misses by cache name.
    PromQL: rate(redhound_cache_misses_total[5m])
  • Cache Hit/Miss Ratio: time series showing the overall ratio (0-1 scale).
    PromQL: sum(rate(redhound_cache_hits_total[5m])) / (sum(rate(redhound_cache_hits_total[5m])) + sum(rate(redhound_cache_misses_total[5m])))
  • Cache Operations by Cache Name: table showing total operations per second by cache name (operation label).
    PromQL: sum(rate(redhound_cache_hits_total[5m]) + rate(redhound_cache_misses_total[5m])) by (operation)

Accessing Grafana

  1. Start services: docker-compose up -d grafana (or docker-compose up -d for all services).
  2. Open browser: navigate to http://localhost:3000.
  3. Login:
     • Local development: default credentials are admin / admin (when GF_SECURITY_ADMIN_PASSWORD is not set in .env).
     • Production/non-local: use credentials from the .env file (GF_SECURITY_ADMIN_USER / GF_SECURITY_ADMIN_PASSWORD).
  4. Navigate to dashboards: click "Dashboards" → the "Redhound" folder to see the pre-provisioned dashboards.
  5. Verify data: ensure REDHOUND_METRICS_ENABLED=true for metrics to appear in dashboards.

PromQL Queries Reference (Used in Dashboards)

  • P95 request duration: histogram_quantile(0.95, sum by (le) (rate(redhound_request_duration_seconds_bucket[5m])))
  • P95 by vendor: histogram_quantile(0.95, sum by (le, vendor) (rate(redhound_request_duration_seconds_bucket[5m])))
  • P95 by operation: histogram_quantile(0.95, sum by (le, operation) (rate(redhound_request_duration_seconds_bucket[5m])))
  • P95 by vendor/operation: histogram_quantile(0.95, sum by (le, vendor, operation) (rate(redhound_request_duration_seconds_bucket[5m])))
  • API error rate: rate(redhound_api_errors_total[5m])
  • Rate limit hits: rate(redhound_api_rate_limit_hits_total[5m])
  • Active agents: redhound_active_agents
  • Cache hit rate: rate(redhound_cache_hits_total[5m])
  • Cache miss rate: rate(redhound_cache_misses_total[5m])
  • Cache hit/miss ratio: sum(rate(redhound_cache_hits_total[5m])) / (sum(rate(redhound_cache_hits_total[5m])) + sum(rate(redhound_cache_misses_total[5m])))
  • Cache operations by name: sum(rate(redhound_cache_hits_total[5m]) + rate(redhound_cache_misses_total[5m])) by (operation)

Alerting Rules

  • Alert rules file: docker/prometheus/rules/alerts.yml (mounted read-only).
  • Configured alerts:
    • HighErrorRatio: error/call ratio > 5% over 5m per vendor (severity: warning).
    • HighRequestLatencyP95: p95 request duration > 2s over 5m (severity: warning).
    • MetricsMissing: absence of the redhound_api_calls_total series for 10m (severity: critical).
  • Prometheus evaluates rules every 15s (evaluation_interval).
  • Alerts are sent to Alertmanager for routing and notification.
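  • For reference, HighErrorRatio presumably combines the per-vendor error-ratio query from "Error Rate Over Time" with the 5% threshold:
    sum by (vendor) (rate(redhound_api_errors_total[5m])) / sum by (vendor) (rate(redhound_api_calls_total[5m])) > 0.05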

Alertmanager Configuration and Alert Routing

Overview

Alertmanager receives alerts from Prometheus and routes them to notification channels (Slack). It provides alert grouping, deduplication, silencing, and routing based on severity.

Architecture

Prometheus (evaluates alert rules every 15s)
  → Alertmanager (groups, routes, and deduplicates alerts)
  → Slack (real-time team notifications)

Docker Compose Service

  • Service: alertmanager in docker-compose.yml
  • Image: prom/alertmanager:v0.30.1 (latest stable, released Jan 2026, compatible with Prometheus v3.9.1)
  • Port: 9093 (configurable via ALERTMANAGER_PORT environment variable)
  • Access URL: http://localhost:9093
  • Network: Connected to redhound-network for internal communication with Prometheus
  • Dependencies: Depends on Prometheus service
  • Persistent storage: alertmanager_data volume for alert state persistence

Configuration Files

  • Main config: docker/alertmanager/alertmanager.yml (webhook URL substituted at container start from env)
  • Setup guide: docker/alertmanager/README.md (comprehensive setup and troubleshooting)

Alert Routing Strategy

Severity-Based Routing:

  • Critical alerts (severity: critical):
    • Sent immediately with no grouping delay (group_wait: 0s).
    • Repeated every 2 hours if not resolved (repeat_interval: 2h).
    • Examples: MetricsMissing.
  • Warning alerts (severity: warning):
    • Grouped for 30 seconds before sending (group_wait: 30s).
    • Subsequent grouped alerts sent after 5 minutes (group_interval: 5m).
    • Repeated every 4 hours if not resolved (repeat_interval: 4h).
    • Examples: HighErrorRatio, HighRequestLatencyP95.

Alert Grouping:

  • Alerts are grouped by alertname and severity.
  • Grouping reduces notification spam when multiple similar alerts fire.
  • Example: 3 HighErrorRatio alerts for different vendors → 1 grouped Slack message.

Inhibition Rules:

  • If MetricsMissing (critical) is firing, all other alerts are suppressed.
  • This prevents alert storms when metrics collection fails.

Slack Integration

Setup Process:

  1. Create a Slack Incoming Webhook at https://api.slack.com/messaging/webhooks for the alerts channel and copy the webhook URL.
  2. Set SLACK_ALERTMANAGER_WEBHOOK_URL in your .env file (same directory as docker-compose.yml). Do not commit .env or the URL to git. CI/CD uses GitHub Secrets (SLACK_BOT_TOKEN, SLACK_CHANNEL_ID), not .env; Alertmanager uses this separate webhook for the alerts channel.
  3. Build and start Alertmanager: docker-compose up -d --build alertmanager. The container substitutes the webhook URL into the config at startup; the config file in the repo contains only the placeholder.

Slack Message Format:

  • Firing alerts: 🔴 indicator with severity, alert name, summary, description, and start time.
  • Resolved alerts: ✅ indicator with resolution time and duration.
  • Links: direct links to the Alertmanager UI and Prometheus.
  • Color coding: red (critical), yellow (warning), green (resolved).

Example Slack Message:

🔴 [CRITICAL] MetricsMissing

Status: Firing (1 alert)

---
Alert: MetricsMissing
Severity: critical
Summary: Metrics stream missing
Description: No redhound_api_calls_total series scraped for 10 minutes.
Started: 2026-02-01 10:05:00 UTC

View in Alertmanager | View in Prometheus

Alertmanager UI

Access: http://localhost:9093

Features:

  • View active alerts (currently firing).
  • View silenced alerts (manually suppressed).
  • View alert history.
  • Silence alerts (temporarily suppress notifications).
  • View configuration status.

Silencing Alerts:

  1. Navigate to http://localhost:9093.
  2. Click on the alert to silence.
  3. Click the "Silence" button.
  4. Set a duration and reason.
  5. Click "Create".

Testing Alerts

Test 1: Verify Alertmanager is Running

# Check service status
docker-compose ps alertmanager

# Check logs
docker-compose logs -f alertmanager

# Access UI
open http://localhost:9093

Test 2: Send Test Alert via the Alertmanager API

curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "instance": "test"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing Alertmanager and Slack integration"
    }
  }]'

Test 3: Trigger Real Alert

# Stop app to trigger MetricsMissing alert (after 10 minutes)
docker-compose stop app

# Wait 10+ minutes for alert to fire
# Check Alertmanager UI: http://localhost:9093
# Check Slack for notification

# Restart app to resolve alert
docker-compose start app

Troubleshooting

Alertmanager not starting:

# Check logs for configuration errors
docker-compose logs alertmanager

# Verify configuration syntax
docker-compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml

Alerts not appearing in Alertmanager:

# Verify Prometheus is sending alerts
# Check Prometheus config
docker-compose exec prometheus cat /etc/prometheus/prometheus.yml | grep -A 5 alerting

# Check Prometheus alerts page
open http://localhost:9090/alerts

# Check Prometheus logs
docker-compose logs prometheus | grep alertmanager

Alerts not being sent to Slack:

# Verify webhook URL is set (SLACK_ALERTMANAGER_WEBHOOK_URL or SLACK_WEBHOOK_URL)
docker-compose exec alertmanager env | grep SLACK_

# Test webhook URL manually
curl -X POST YOUR_WEBHOOK_URL_HERE \
  -H 'Content-Type: application/json' \
  -d '{"text": "Test message"}'

# Check Alertmanager logs for errors
docker-compose logs alertmanager | grep -i error

Command-Line Management (amtool)

Installation:

go install github.com/prometheus/alertmanager/cmd/amtool@latest

Common Commands:

# View active alerts
amtool alert query --alertmanager.url=http://localhost:9093

# Silence an alert
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="Your Name" \
  --comment="Maintenance window" \
  --duration=2h \
  alertname=HighErrorRatio

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence
amtool silence expire --alertmanager.url=http://localhost:9093 SILENCE_ID

Security Considerations

Webhook URL Security:

  • Never commit webhook URLs to git.
  • Store the webhook URL only in your local .env; it is substituted into alertmanager.yml at container start.
  • For production, consider using Docker secrets or environment-variable substitution.

Access Control:

  • The Alertmanager UI has no authentication by default.
  • In production, use a reverse proxy with authentication (e.g., nginx with basic auth).

Advanced Configuration

Multiple Slack Channels:

receivers:
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'YOUR_CRITICAL_WEBHOOK_URL'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'YOUR_WARNINGS_WEBHOOK_URL'

route:
  receiver: 'slack-warnings'  # default receiver (required at the route root)
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'

    - match:
        severity: warning
      receiver: 'slack-warnings'

Email Notifications:

receivers:
  - name: 'slack-and-email'
    slack_configs:
      - api_url: 'YOUR_WEBHOOK_URL'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'

Monitoring Integration

  • Prometheus can scrape /health separately if desired (lightweight JSON); primary scrape remains /metrics.
  • Use alerting rules on dependency status fields in health responses (e.g., blackbox exporter with JSONPath) or on metrics signaling downstream failures.
  • Container health status propagates to orchestrators (Docker healthcheck) and can be visualized via docker-compose ps or platform dashboards.

Code Examples

Instrumenting a function

from backend.utils.metrics import execution_time_seconds, label_values, record_latency, events_total

def fetch_prices(symbol: str):
    labels = label_values(vendor="alpha_vantage", agent_type="data", operation="fetch_prices")
    with record_latency(execution_time_seconds, labels):
        data = _call_api(symbol)  # placeholder for the actual vendor call
    events_total.labels(*labels).inc()
    return data

Custom metric definition

from prometheus_client import Gauge
from backend.utils.metrics import get_registry

custom_gauge = Gauge(
    "redhound_custom_health",
    "Service health score",
    ["component"],
    registry=get_registry(),
)

def update_health(component: str, value: float):
    custom_gauge.labels(component).set(value)
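
Usage (the component name is hypothetical): update_health("redis", 1.0) sets redhound_custom_health{component="redis"} to 1.0.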

Using decorators for automatic timing/counts

from backend.utils.metrics import count_calls, track_latency, request_duration_seconds, label_values

labels = label_values("alpha_vantage", "data", "get_news")

@count_calls(counter=request_duration_seconds, labels=labels)  # increments histogram count
@track_latency(histogram=request_duration_seconds, labels=labels)
def get_news():
    ...

PromQL Examples

  • 95th percentile request latency (5m):
    histogram_quantile(0.95, sum by (le) (rate(redhound_request_duration_seconds_bucket[5m])))
  • Error rate per vendor (5m):
    sum by (vendor) (rate(redhound_api_errors_total[5m]))
  • Error ratio (per vendor, 5m):
    sum by (vendor) (rate(redhound_api_errors_total[5m])) / sum by (vendor) (rate(redhound_api_calls_total[5m]))
  • Rate limit hits (5m):
    rate(redhound_api_rate_limit_hits_total[5m])
  • Active agents:
    redhound_active_agents