Monitoring and Metrics (Prometheus)¶
Structured Logging (JSON + Context)¶
- Framework: `structlog` with stdlib handlers; JSON in production, colored console in development.
- Context propagation via `contextvars`: `correlation_id`, `session_id`, `ticker`, `user`, `agent_name`, tool/vendor metadata.
- Rotation: size-based by default (`REDHOUND_LOG_MAX_BYTES=10MB`, `REDHOUND_LOG_BACKUP_COUNT=5`). File logging is off by default; enable with `REDHOUND_LOG_FILE_ENABLED=true`.
- Docker parity: `docker-compose.yml` uses `json-file` with `max-size: 10m`, `max-file: 5` to match app rotation.
- Retention: optional cleanup of rotated files older than `REDHOUND_LOG_RETENTION_DAYS` (days).
- Formats:
  - JSON (prod): machine-readable for aggregation.
  - Human-readable (dev): `REDHOUND_LOG_FORMAT=human-readable`.
- Example (JSON):
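A representative JSON log line might look like this (the field names follow the context keys listed above; the exact schema emitted by `structlog` may differ):

```json
{
  "timestamp": "2026-02-01T10:05:00.123Z",
  "level": "info",
  "event": "analysis_start",
  "logger": "backend.agents.technical",
  "correlation_id": "abc",
  "session_id": "sess-1",
  "ticker": "AAPL",
  "agent_name": "technical"
}
```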
- Config (env):
  - `REDHOUND_LOG_LEVEL` (default: `INFO`)
  - `REDHOUND_LOG_FORMAT` (`json` | `human-readable`, default: `json`)
  - `REDHOUND_LOG_CONSOLE` (default: `true`)
  - `REDHOUND_LOG_FILE_ENABLED` (default: `false`)
  - `REDHOUND_LOG_FILE` (path; default: `redhound/logs/redhound.log`)
  - `REDHOUND_LOG_MAX_BYTES` (default: `10485760` = 10MB)
  - `REDHOUND_LOG_BACKUP_COUNT` (default: `5`)
  - `REDHOUND_LOG_RETENTION_DAYS` (default: `30`; remove rotated files older than N days)
  - `REDHOUND_LOG_ROTATION` (`size` | `time`, default: `size`)
- Usage examples:
```python
from backend.utils.logging import get_logger, log_context, bind_context

log = get_logger(__name__)

# Ad-hoc scope
with log_context(correlation_id="abc", session_id="sess-1", ticker="AAPL"):
    log.info("analysis_start", agent="technical")

# Persistent binding
bind_context(agent_name="trader")
log.info("decision_ready", ticker="AAPL", action="HOLD")
```

- Querying JSON logs (example):
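With one JSON object per line, rotated logs can be filtered with `jq` or a few lines of Python. A minimal sketch (the field names below are illustrative, not the exact log schema):

```python
import json

# Hypothetical log lines in the JSON format described above;
# field names (timestamp, level, event, correlation_id) are illustrative.
raw_lines = [
    '{"timestamp": "2026-02-01T10:05:00Z", "level": "info", '
    '"event": "analysis_start", "correlation_id": "abc", "ticker": "AAPL"}',
    '{"timestamp": "2026-02-01T10:05:01Z", "level": "error", '
    '"event": "vendor_timeout", "correlation_id": "xyz", "ticker": "MSFT"}',
]

def filter_logs(lines, **fields):
    """Return parsed records whose fields match all given key/value pairs."""
    records = (json.loads(line) for line in lines)
    return [r for r in records if all(r.get(k) == v for k, v in fields.items())]

matches = filter_logs(raw_lines, correlation_id="abc")
print([r["event"] for r in matches])  # → ['analysis_start']
```

The same filter works for any structured field (`ticker`, `session_id`, `level`), which is the payoff of structured fields over free-form strings.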
- Best practices:
- Do not log secrets or payloads containing API keys.
- Prefer structured fields over free-form strings.
- Include correlation/session/ticker for traceability.
- Keep file logging disabled in ephemeral/containerized environments unless needed.
Architecture Overview¶
- Metrics emitted via a `prometheus_client` in-process registry (`redhound.utils.metrics`).
- Metrics endpoint at `/metrics` (FastAPI router in `redhound/api/metrics.py`); returns 503 when disabled, 500 on rendering errors.
- Prometheus server (docker-compose `prometheus` service) scrapes `app:8000/metrics` every 15s by default.
- Grafana can be pointed at Prometheus for dashboards and alerts.
Health Check Endpoint¶
- Endpoint: FastAPI `/health` (configurable via `REDHOUND_HEALTHCHECK_PATH`, default: `/health`).
- Status codes: 200 when all required dependencies are healthy; 503 when any required dependency (PostgreSQL, Redis) is unhealthy. Optional dependencies and vendor checks are opt-in.
- Response schema:
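The FastAPI router defines the exact schema; based on the dependency checks above, a healthy response has roughly this shape (field names are illustrative assumptions, not the verbatim schema):

```json
{
  "status": "healthy",
  "dependencies": {
    "database": {"status": "healthy", "required": true},
    "redis": {"status": "healthy", "required": true}
  }
}
```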
- Configuration:
  - Enable/disable: `REDHOUND_HEALTHCHECK_ENABLED` (default: `true`).
  - Endpoint path: `REDHOUND_HEALTHCHECK_PATH` (default: `/health`).
  - Dependency lists:
    - Required dependencies: defaults to `database`, `redis`; override with `REDHOUND_HEALTHCHECK_REQUIRED_DEPENDENCIES` (comma-separated; set to an empty string to disable all required deps in local/dev).
    - Optional dependencies: `REDHOUND_HEALTHCHECK_OPTIONAL_DEPENDENCIES` (comma-separated).
    - Vendor checks: toggle with `REDHOUND_HEALTHCHECK_VENDORS_ENABLED` (default: `false`).
  - Timeouts:
    - `REDHOUND_HEALTHCHECK_TIMEOUT_DEFAULT` (default: `1.0`)
    - `REDHOUND_HEALTHCHECK_TIMEOUT_DATABASE` (default: `1.0`)
    - `REDHOUND_HEALTHCHECK_TIMEOUT_REDIS` (default: `1.0`)
    - `REDHOUND_HEALTHCHECK_TIMEOUT_VENDORS` (default: `1.0`)
  - Cache TTL: `REDHOUND_HEALTHCHECK_CACHE_TTL_SECONDS` (default: `0.0` = no cache).
Metric Types and Names¶
- Counters:
  - `redhound_api_calls_total` — API calls by vendor/agent/operation.
  - `redhound_api_errors_total` — API failures.
  - `redhound_api_rate_limit_hits_total` — rate-limit events.
  - `redhound_events_total` — domain events.
  - `redhound_errors_by_type_total` — errors by exception type.
  - `redhound_retry_attempts_total` — retry attempts.
  - `redhound_cache_hits_total`, `redhound_cache_misses_total` — cache outcomes.
- Gauges:
  - `redhound_active_agents` — active agent executions.
  - `redhound_queue_size` — queue depth (wire to your queue if used).
  - `redhound_current_value` — arbitrary numeric values (balances, positions, etc.).
- Histograms:
  - `redhound_latency_seconds` — external/internal step latency.
  - `redhound_execution_time_seconds` — agent/orchestration execution time.
  - `redhound_request_duration_seconds` — API/vendor request duration.
Label set (shared): `vendor`, `agent_type`, `operation`. Some metrics also add `error_type`.
Queue and Cache Instrumentation¶
- Cache client: `redhound.data.cache.CacheClient` uses Redis when available and falls back to in-memory; `get_cache_client()` returns a singleton.
- Cache hits/misses: `record_cache_hit(cache_name)` / `record_cache_miss(cache_name)` are called inside the cache client; labels default to `vendor=redis|local`, `agent_type=cache`, `operation=<cache_name>`.
- Cache latency: the cache client records get/set latency via the `latency_seconds` histogram.
- Queue depth: `redhound.orchestration.queue_metrics.report_queue_depth(queue_name, depth)` emits `redhound_queue_size`; defaults: `vendor=internal`, `agent_type=queue`.
- Queue latency: `record_queue_latency(queue_name, duration_seconds)` records enqueue/dequeue latency via `latency_seconds`.
- Current state: no message queue is configured; the helpers are ready for future queue integration and can be called from any queue implementation.
Error Rate Over Time¶
- Use a PromQL `rate()` over counters rather than a separate gauge:
  `sum by (vendor) (rate(redhound_api_errors_total[5m]))`
- Error ratio vs calls (per vendor):
  `sum by (vendor) (rate(redhound_api_errors_total[5m])) / sum by (vendor) (rate(redhound_api_calls_total[5m]))`
Configuration¶
- Enable/disable: `REDHOUND_METRICS_ENABLED` (default: `false`).
- Endpoint path: `REDHOUND_METRICS_PATH` (default: `/metrics`).
- Endpoint port: `REDHOUND_METRICS_PORT` (default: `8000`).
- Sampling rate: `REDHOUND_METRICS_SAMPLING_RATE` (default: `1.0`).
- Constant labels: `REDHOUND_METRICS_LABELS` (comma-separated `key=value` pairs).
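Taken together, a `.env` fragment enabling metrics might look like this (the label values are placeholders):

```
REDHOUND_METRICS_ENABLED=true
REDHOUND_METRICS_PATH=/metrics
REDHOUND_METRICS_SAMPLING_RATE=1.0
REDHOUND_METRICS_LABELS=env=prod,service=redhound
```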
Prometheus Server Setup¶
- Compose service: `prometheus` in `docker-compose.yml`.
- Config: `docker/prometheus/prometheus.yml` — `scrape_interval: 15s`, `metrics_path: /metrics`, target `app:8000`.
- Retention: `--storage.tsdb.retention.time=15d` (compose command).
- Rule files placeholder: `/etc/prometheus/rules/*.yml` (add via bind mount).
- Run: `docker-compose up -d prometheus`
- UI: `http://localhost:9090`
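The settings above imply a scrape config of roughly this shape; this is a sketch, not the verbatim contents of `docker/prometheus/prometheus.yml` (the job name is an assumption):

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: redhound          # assumed name
    metrics_path: /metrics
    static_configs:
      - targets: ['app:8000']
```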
Grafana Setup and Configuration¶
Docker Compose Service¶
- Service: `grafana` in `docker-compose.yml`.
- Port: `3000` (configurable via the `GRAFANA_PORT` environment variable, default `3000`).
- Access URL: `http://localhost:3000`
- Network: connected to `redhound-network` for internal communication with Prometheus.
- Dependencies: depends on the Prometheus service (`depends_on: prometheus`).
- Persistent storage: `grafana_data` volume for dashboard and user data persistence.
Environment Variables¶
- `GF_SECURITY_ADMIN_USER`: Grafana admin username (defaults to `admin` if not set).
- `GF_SECURITY_ADMIN_PASSWORD`: Grafana admin password (defaults to `admin` for local development if not set).
- Important: for non-local/production environments, `GF_SECURITY_ADMIN_PASSWORD` must be set to a secure password via the `.env` file.
- For local development, the default credentials are `admin`/`admin` when not set (convenient for development, not secure for production).
- Note: Grafana uses the `GF_` prefix for all configuration environment variables.
Data Source Provisioning¶
- The Prometheus data source is automatically provisioned via `docker/grafana/provisioning/datasources/prometheus.yml`.
- Data source name: "Prometheus"
- URL: `http://prometheus:9090` (internal Docker network)
- Access mode: Server (proxy)
- Is default: Yes
- No manual configuration required; data source is available immediately after Grafana starts.
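For reference, a provisioning file with the settings listed above would look roughly like this (standard Grafana datasource provisioning format; a sketch, not the verbatim file):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```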
Dashboard Provisioning¶
- Dashboards are automatically provisioned from `docker/grafana/provisioning/dashboards/`.
- Dashboard provider configuration: `docker/grafana/provisioning/dashboards/dashboard.yml`
- Folder: "Redhound" (dashboards are organized in this folder in the Grafana UI)
- Auto-discovery: Grafana scans for `.json` files every 10 seconds
- All dashboard JSON files in the provisioning directory are loaded automatically
Pre-Provisioned Dashboards¶
1. Redhound Overview (redhound-overview.json)¶
Comprehensive overview of system metrics:
- P95 Request Duration: time series showing 95th-percentile request duration
  - PromQL: `histogram_quantile(0.95, rate(redhound_request_duration_seconds_bucket[5m]))`
- API Error Rate: time series showing error rate by vendor/operation
  - PromQL: `rate(redhound_api_errors_total[5m])`
- Rate Limit Hits: time series showing rate-limit hits by vendor/operation
  - PromQL: `rate(redhound_api_rate_limit_hits_total[5m])`
- Active Agents: gauge panel showing the current active-agent count
  - PromQL: `redhound_active_agents`
- Cache Hit/Miss Ratio: time series showing the overall cache hit/miss ratio (0–1 scale)
  - PromQL: `sum(rate(redhound_cache_hits_total[5m])) / (sum(rate(redhound_cache_hits_total[5m])) + sum(rate(redhound_cache_misses_total[5m])))`
2. Redhound Vendor/Operation Latency (redhound-vendor-latency.json)¶
Detailed latency breakdowns:
- Request Duration by Vendor (P95): time series grouped by vendor
  - PromQL: `histogram_quantile(0.95, sum by (le, vendor) (rate(redhound_request_duration_seconds_bucket[5m])))`
- Request Duration by Operation (P95): time series grouped by operation
  - PromQL: `histogram_quantile(0.95, sum by (le, operation) (rate(redhound_request_duration_seconds_bucket[5m])))`
- Latency by Vendor/Operation (P95): table view showing P95 latency for each vendor/operation combination
  - PromQL: `histogram_quantile(0.95, sum by (le, vendor, operation) (rate(redhound_request_duration_seconds_bucket[5m])))`
3. Redhound Cache Performance (redhound-cache.json)¶
Cache-specific metrics:
- Cache Hit Rate Over Time: time series showing cache hits by cache name (the `operation` label)
  - PromQL: `rate(redhound_cache_hits_total[5m])`
- Cache Miss Rate Over Time: time series showing cache misses by cache name
  - PromQL: `rate(redhound_cache_misses_total[5m])`
- Cache Hit/Miss Ratio: time series showing the overall ratio (0–1 scale)
  - PromQL: `sum(rate(redhound_cache_hits_total[5m])) / (sum(rate(redhound_cache_hits_total[5m])) + sum(rate(redhound_cache_misses_total[5m])))`
- Cache Operations by Cache Name: table showing total operations per second by cache name (the `operation` label)
  - PromQL: `sum by (operation) (rate(redhound_cache_hits_total[5m]) + rate(redhound_cache_misses_total[5m]))`
Accessing Grafana¶
- Start services: `docker-compose up -d grafana` (or `docker-compose up -d` for all services)
- Open browser: navigate to `http://localhost:3000`
- Login:
  - Local development: default credentials are `admin`/`admin` (when `GF_SECURITY_ADMIN_PASSWORD` is not set in `.env`)
  - Production/non-local: use credentials from the `.env` file (`GF_SECURITY_ADMIN_USER`/`GF_SECURITY_ADMIN_PASSWORD`)
- Navigate to dashboards: click "Dashboards" → "Redhound" folder to see the pre-provisioned dashboards
- Verify data: ensure `REDHOUND_METRICS_ENABLED=true` for metrics to appear in dashboards
PromQL Queries Reference (Used in Dashboards)¶
- P95 request duration: `histogram_quantile(0.95, rate(redhound_request_duration_seconds_bucket[5m]))`
- P95 by vendor: `histogram_quantile(0.95, sum by (le, vendor) (rate(redhound_request_duration_seconds_bucket[5m])))`
- P95 by operation: `histogram_quantile(0.95, sum by (le, operation) (rate(redhound_request_duration_seconds_bucket[5m])))`
- P95 by vendor/operation: `histogram_quantile(0.95, sum by (le, vendor, operation) (rate(redhound_request_duration_seconds_bucket[5m])))`
- API error rate: `rate(redhound_api_errors_total[5m])`
- Rate limit hits: `rate(redhound_api_rate_limit_hits_total[5m])`
- Active agents: `redhound_active_agents`
- Cache hit rate: `rate(redhound_cache_hits_total[5m])`
- Cache miss rate: `rate(redhound_cache_misses_total[5m])`
- Cache hit/miss ratio: `sum(rate(redhound_cache_hits_total[5m])) / (sum(rate(redhound_cache_hits_total[5m])) + sum(rate(redhound_cache_misses_total[5m])))`
- Cache operations by name: `sum by (operation) (rate(redhound_cache_hits_total[5m]) + rate(redhound_cache_misses_total[5m]))`
Alerting Rules¶
- Alert rules file: `docker/prometheus/rules/alerts.yml` (mounted read-only).
- Configured alerts:
  - `HighErrorRatio`: error/call ratio > 5% over 5m, per vendor (severity: warning).
  - `HighRequestLatencyP95`: p95 request duration > 2s over 5m (severity: warning).
  - `MetricsMissing`: absence of `redhound_api_calls_total` series for 10m (severity: critical).
- Prometheus evaluates rules every 15s (`evaluation_interval`).
- Alerts are sent to Alertmanager for routing and notification.
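The configured alerts translate into Prometheus rule definitions of roughly this shape — a sketch using the thresholds above, not the verbatim `alerts.yml` (group name and exact label sets may differ):

```yaml
groups:
  - name: redhound-alerts        # assumed group name
    rules:
      - alert: HighErrorRatio
        expr: >
          sum by (vendor) (rate(redhound_api_errors_total[5m]))
          / sum by (vendor) (rate(redhound_api_calls_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
      - alert: HighRequestLatencyP95
        expr: >
          histogram_quantile(0.95,
            sum by (le) (rate(redhound_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
      - alert: MetricsMissing
        expr: absent(redhound_api_calls_total)
        for: 10m
        labels:
          severity: critical
```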
Alertmanager Configuration and Alert Routing¶
Overview¶
Alertmanager receives alerts from Prometheus and routes them to notification channels (Slack). It provides alert grouping, deduplication, silencing, and routing based on severity.
Architecture¶
```text
Prometheus   → evaluates alert rules every 15s
     ↓
Alertmanager → groups, routes, and deduplicates alerts
     ↓
Slack        → real-time team notifications
```
Docker Compose Service¶
- Service: `alertmanager` in `docker-compose.yml`
- Image: `prom/alertmanager:v0.30.1` (latest stable, released Jan 2026; compatible with Prometheus v3.9.1)
- Port: `9093` (configurable via the `ALERTMANAGER_PORT` environment variable)
- Access URL: `http://localhost:9093`
- Network: connected to `redhound-network` for internal communication with Prometheus
- Dependencies: depends on the Prometheus service
- Persistent storage: `alertmanager_data` volume for alert state persistence
Configuration Files¶
- Main config: `docker/alertmanager/alertmanager.yml` (webhook URL substituted at container start from env)
- Setup guide: `docker/alertmanager/README.md` (comprehensive setup and troubleshooting)
Alert Routing Strategy¶
Severity-Based Routing:
- Critical alerts (`severity: critical`):
  - Sent immediately with no grouping delay (`group_wait: 0s`)
  - Repeated every 2 hours if not resolved (`repeat_interval: 2h`)
  - Examples: `MetricsMissing`
- Warning alerts (`severity: warning`):
  - Grouped for 30 seconds before sending (`group_wait: 30s`)
  - Subsequent grouped alerts sent after 5 minutes (`group_interval: 5m`)
  - Repeated every 4 hours if not resolved (`repeat_interval: 4h`)
  - Examples: `HighErrorRatio`, `HighRequestLatencyP95`
Alert Grouping:
- Alerts are grouped by alertname and severity
- Reduces notification spam when multiple similar alerts fire
- Example: 3 HighErrorRatio alerts for different vendors → 1 grouped Slack message
Inhibition Rules:
- If MetricsMissing (critical) is firing, all other alerts are suppressed
- Prevents alert storms when metrics collection fails
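The routing, grouping, and inhibition described above correspond to an `alertmanager.yml` fragment of roughly this shape (the receiver name is a placeholder; the real file lives under `docker/alertmanager/`):

```yaml
route:
  group_by: ['alertname', 'severity']
  receiver: slack-default        # placeholder receiver
  routes:
    - match:
        severity: critical
      group_wait: 0s
      repeat_interval: 2h
    - match:
        severity: warning
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h

inhibit_rules:
  # MetricsMissing suppresses all other alerts; Alertmanager never
  # lets an alert inhibit itself.
  - source_match:
      alertname: MetricsMissing
    target_match_re:
      alertname: '.*'
```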
Slack Integration¶
Setup Process:
1. Create a Slack Incoming Webhook at https://api.slack.com/messaging/webhooks for the alerts channel and copy the webhook URL.
2. Set `SLACK_ALERTMANAGER_WEBHOOK_URL` in your `.env` file (same directory as `docker-compose.yml`). Do not commit `.env` or the URL to git. CI/CD uses GitHub Secrets (`SLACK_BOT_TOKEN`, `SLACK_CHANNEL_ID`), not `.env`; Alertmanager uses this separate webhook for the alerts channel.
3. Build and start Alertmanager: `docker-compose up -d --build alertmanager`. The container substitutes the webhook URL into the config at startup; the config file in the repo contains only the placeholder.
Slack Message Format:
- Firing alerts: 🔴 indicator with severity, alert name, summary, description, start time
- Resolved alerts: ✅ indicator with resolution time and duration
- Links: direct links to the Alertmanager UI and Prometheus
- Color coding: red (critical), yellow (warning), green (resolved)
Example Slack Message:

```text
🔴 [CRITICAL] MetricsMissing
Status: Firing (1 alert)
---
Alert: MetricsMissing
Severity: critical
Summary: Metrics stream missing
Description: No redhound_api_calls_total series scraped for 10 minutes.
Started: 2026-02-01 10:05:00 UTC
View in Alertmanager | View in Prometheus
```
Alertmanager UI¶
Access: http://localhost:9093
Features:
- View active alerts (currently firing)
- View silenced alerts (manually suppressed)
- View alert history
- Silence alerts (temporarily suppress notifications)
- View configuration status
Silencing Alerts:
1. Navigate to http://localhost:9093
2. Click on the alert to silence
3. Click the "Silence" button
4. Set duration and reason
5. Click "Create"
Testing Alerts¶
Test 1: Verify Alertmanager is Running

```shell
# Check service status
docker-compose ps alertmanager

# Check logs
docker-compose logs -f alertmanager

# Access UI
open http://localhost:9093
```
Test 2: Send a Test Alert via the Alertmanager API

```shell
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "instance": "test"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Testing Alertmanager and Slack integration"
    }
  }]'
```
Test 3: Trigger a Real Alert

```shell
# Stop the app to trigger the MetricsMissing alert (fires after 10 minutes)
docker-compose stop app

# Wait 10+ minutes for the alert to fire
# Check the Alertmanager UI: http://localhost:9093
# Check Slack for the notification

# Restart the app to resolve the alert
docker-compose start app
```
Troubleshooting¶
Alertmanager not starting:

```shell
# Check logs for configuration errors
docker-compose logs alertmanager

# Verify configuration syntax
docker-compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
```
Alerts not appearing in Alertmanager:

```shell
# Verify Prometheus is configured to send alerts
docker-compose exec prometheus cat /etc/prometheus/prometheus.yml | grep -A 5 alerting

# Check the Prometheus alerts page
open http://localhost:9090/alerts

# Check Prometheus logs
docker-compose logs prometheus | grep alertmanager
```
Alerts not being sent to Slack:

```shell
# Verify the webhook URL is set (SLACK_ALERTMANAGER_WEBHOOK_URL or SLACK_WEBHOOK_URL)
docker-compose exec alertmanager env | grep SLACK_

# Test the webhook URL manually
curl -X POST YOUR_WEBHOOK_URL_HERE \
  -H 'Content-Type: application/json' \
  -d '{"text": "Test message"}'

# Check Alertmanager logs for errors
docker-compose logs alertmanager | grep -i error
```
Command-Line Management (amtool)¶
Installation:
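`amtool` ships with the Alertmanager distribution. If it is not already available locally, it can be installed with Go, or invoked inside the running container without a local install:

```
# Install locally with Go (binary lands in $GOPATH/bin or $HOME/go/bin)
go install github.com/prometheus/alertmanager/cmd/amtool@latest

# Or run it inside the container without installing anything
docker-compose exec alertmanager amtool --version
```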
Common Commands:

```shell
# View active alerts
amtool alert query --alertmanager.url=http://localhost:9093

# Silence an alert
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="Your Name" \
  --comment="Maintenance window" \
  --duration=2h \
  alertname=HighErrorRatio

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence
amtool silence expire --alertmanager.url=http://localhost:9093 SILENCE_ID
```
Security Considerations¶
Webhook URL Security:
- Never commit webhook URLs to git
- Keep the webhook URL only in your local `.env` file; it is substituted into `alertmanager.yml` at container start
- For production, consider using Docker secrets or environment-variable substitution
Access Control:
- The Alertmanager UI has no authentication by default
- In production, use a reverse proxy with authentication (e.g., nginx with basic auth)
Advanced Configuration¶
Multiple Slack Channels:

```yaml
receivers:
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'YOUR_CRITICAL_WEBHOOK_URL'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'YOUR_WARNINGS_WEBHOOK_URL'

# Child routes nest under the top-level route block
route:
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'
```
Email Notifications:

```yaml
receivers:
  - name: 'slack-and-email'
    slack_configs:
      - api_url: 'YOUR_WEBHOOK_URL'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'
```
Resources¶
- Alertmanager Documentation
- Alertmanager Configuration
- Slack Incoming Webhooks
- Alert Routing Tree Editor
- Local setup guide: `docker/alertmanager/README.md`
Monitoring Integration¶
- Prometheus can scrape `/health` separately if desired (lightweight JSON); the primary scrape remains `/metrics`.
- Use alerting rules on dependency status fields in health responses (e.g., blackbox exporter with JSONPath) or on metrics signaling downstream failures.
- Container health status propagates to orchestrators (Docker healthcheck) and can be visualized via `docker-compose ps` or platform dashboards.
Code Examples¶
Instrumenting a function¶
```python
from backend.utils.metrics import (
    events_total,
    execution_time_seconds,
    label_values,
    record_latency,
)

def fetch_prices(symbol: str):
    labels = label_values(vendor="alpha_vantage", agent_type="data", operation="fetch_prices")
    with record_latency(execution_time_seconds, labels):
        data = _call_api(symbol)
    events_total.labels(*labels).inc()
    return data
```
Custom metric definition¶
```python
from prometheus_client import Gauge
from backend.utils.metrics import get_registry

custom_gauge = Gauge(
    "redhound_custom_health",
    "Service health score",
    ["component"],
    registry=get_registry(),
)

def update_health(component: str, value: float):
    custom_gauge.labels(component).set(value)
```
Using decorators for automatic timing/counts¶
```python
from backend.utils.metrics import (
    count_calls,
    label_values,
    request_duration_seconds,
    track_latency,
)

labels = label_values("alpha_vantage", "data", "get_news")

@count_calls(counter=request_duration_seconds, labels=labels)  # increments the histogram's observation count
@track_latency(histogram=request_duration_seconds, labels=labels)
def get_news():
    ...
```
PromQL Examples¶
- 95th percentile request latency (5m): `histogram_quantile(0.95, rate(redhound_request_duration_seconds_bucket[5m]))`
- Error rate per vendor (5m): `sum by (vendor) (rate(redhound_api_errors_total[5m]))`
- Error ratio (per vendor, 5m): `sum by (vendor) (rate(redhound_api_errors_total[5m])) / sum by (vendor) (rate(redhound_api_calls_total[5m]))`
- Rate limit hits (5m): `rate(redhound_api_rate_limit_hits_total[5m])`
- Active agents: `redhound_active_agents`