Monitoring and Metrics (Prometheus)¶
Architecture Overview¶
- Metrics emitted via a `prometheus_client` in-process registry (`maricusco.utils.metrics`).
- Metrics endpoint at `/metrics` (FastAPI router in `maricusco/api/metrics.py`); returns 503 when disabled, 500 on rendering errors (a quick check of this endpoint follows the list).
- Prometheus server (docker-compose `prometheus` service) scrapes `app:8000/metrics` every 15s by default.
- Grafana can be pointed at Prometheus for dashboards and alerts.
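A quick way to sanity-check the endpoint is to fetch it directly. The sketch below assumes the API is reachable at `localhost:8000` and that the `requests` package is installed; adjust the URL for your deployment.

```python
import requests

# Scrape the metrics endpoint the same way Prometheus would (plain-text exposition format).
resp = requests.get("http://localhost:8000/metrics", timeout=5)

if resp.status_code == 200:
    samples = [line for line in resp.text.splitlines() if line.startswith("maricusco_")]
    print(f"{len(samples)} maricusco_* lines exposed")
elif resp.status_code == 503:
    print("Metrics endpoint is disabled (see MARICUSCO_METRICS_ENABLED).")
else:
    print(f"Unexpected status: {resp.status_code}")
```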
Health Check Endpoint¶
- Endpoint: FastAPI `/health` (configurable via `MARICUSCO_HEALTHCHECK_PATH`, default `/health`).
- Status codes: 200 when all required dependencies are healthy; 503 when any required dependency (PostgreSQL, Redis) is unhealthy. Optional dependencies and vendor checks are opt-in.
- Response schema: JSON body reporting overall status and per-dependency status (illustrated by the probe sketch after this list).
- Configuration:
  - Enable/disable: `MARICUSCO_HEALTHCHECK_ENABLED` (default true).
  - Dependency lists: required defaults `database`, `redis`; optional via `MARICUSCO_HEALTHCHECK_OPTIONAL_DEPENDENCIES`; vendor checks toggle `MARICUSCO_HEALTHCHECK_VENDORS_ENABLED`.
  - Timeouts: `MARICUSCO_HEALTHCHECK_TIMEOUT_DEFAULT` plus per-dependency overrides (`..._DATABASE`, `..._REDIS`, `..._VENDORS`), typically 1–2s.
  - Cache TTL: `MARICUSCO_HEALTHCHECK_CACHE_TTL_SECONDS` (default 0 = no cache).
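A minimal external probe against this endpoint might look like the following. It assumes the API runs on `localhost:8000`; the JSON field name used for parsing (`dependencies`) is an assumption for illustration, not the guaranteed schema.

```python
import requests

# 200 = all required dependencies healthy; 503 = a required dependency (PostgreSQL, Redis) is down.
resp = requests.get("http://localhost:8000/health", timeout=2)
print("overall:", "healthy" if resp.status_code == 200 else f"unhealthy ({resp.status_code})")

# Illustrative parsing only: the field names below are assumptions, not the guaranteed schema.
try:
    body = resp.json()
except ValueError:
    body = {}
for name, status in (body.get("dependencies") or {}).items():
    print(f"  {name}: {status}")
```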
Metric Types and Names¶
- Counters:
  - `maricusco_api_calls_total` — API calls by vendor/agent/operation.
  - `maricusco_api_errors_total` — API failures.
  - `maricusco_api_rate_limit_hits_total` — rate-limit events.
  - `maricusco_events_total` — domain events.
  - `maricusco_errors_by_type_total` — errors by exception type.
  - `maricusco_retry_attempts_total` — retry attempts.
  - `maricusco_cache_hits_total`, `maricusco_cache_misses_total` — cache outcomes.
- Gauges:
  - `maricusco_active_agents` — active agent executions.
  - `maricusco_queue_size` — queue depth (wire to your queue if used).
  - `maricusco_current_value` — arbitrary numeric values (balances, positions, etc.).
- Histograms:
  - `maricusco_latency_seconds` — external/internal step latency.
  - `maricusco_execution_time_seconds` — agent/orchestration execution time.
  - `maricusco_request_duration_seconds` — API/vendor request duration.

Shared label set: `vendor`, `agent_type`, `operation`. Some metrics also add `error_type`. A minimal sketch of these metric families follows below.
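Application code should go through the helpers in `maricusco.utils.metrics` (see Code Examples below); purely as a sketch of what these metric families and the shared label set look like, raw `prometheus_client` equivalents could be declared and recorded roughly like this (a standalone registry is used here for illustration):

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

registry = CollectorRegistry()
LABELS = ["vendor", "agent_type", "operation"]

# Counter: monotonically increasing total, queried with rate() in PromQL.
api_calls_total = Counter("maricusco_api_calls_total", "API calls", LABELS, registry=registry)
# Gauge: a value that can go up and down.
active_agents = Gauge("maricusco_active_agents", "Active agent executions", LABELS, registry=registry)
# Histogram: bucketed observations exposed as _bucket/_sum/_count series.
request_duration_seconds = Histogram(
    "maricusco_request_duration_seconds", "Request duration", LABELS, registry=registry
)

api_calls_total.labels("alpha_vantage", "data", "fetch_prices").inc()
active_agents.labels("internal", "agent", "run").set(3)
request_duration_seconds.labels("alpha_vantage", "data", "fetch_prices").observe(0.42)
```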
Queue and Cache Instrumentation¶
- Cache client: `maricusco.data.cache.CacheClient` uses Redis when available and falls back to in-memory; `get_cache_client()` returns a singleton.
- Cache hits/misses: `record_cache_hit(cache_name)` / `record_cache_miss(cache_name)` are called inside the cache client; labels default to `vendor=redis|local`, `agent_type=cache`, `operation=<cache_name>`.
- Cache latency: the cache client records get/set latency via the `latency_seconds` histogram.
- Queue depth: `maricusco.orchestration.queue_metrics.report_queue_depth(queue_name, depth)` emits `maricusco_queue_size`; defaults `vendor=internal`, `agent_type=queue`.
- Queue latency: `record_queue_latency(queue_name, duration_seconds)` records enqueue/dequeue latency via `latency_seconds`.
- Current state: no message queue is configured; the helpers are ready for future queue integration and can be called from any queue implementation (see the sketch after this list).
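As a sketch of how a future queue could wire in these helpers, the hypothetical in-memory queue below calls the documented functions; the import path for `record_queue_latency` is assumed to be the same `queue_metrics` module as `report_queue_depth`.

```python
import time
from collections import deque

from maricusco.orchestration.queue_metrics import record_queue_latency, report_queue_depth

class InstrumentedQueue:
    """Hypothetical in-memory queue that reports depth and enqueue/dequeue latency."""

    def __init__(self, name: str):
        self.name = name
        self._items = deque()

    def put(self, item):
        start = time.perf_counter()
        self._items.append(item)
        record_queue_latency(self.name, time.perf_counter() - start)  # -> maricusco_latency_seconds
        report_queue_depth(self.name, len(self._items))               # -> maricusco_queue_size

    def get(self):
        start = time.perf_counter()
        item = self._items.popleft()
        record_queue_latency(self.name, time.perf_counter() - start)
        report_queue_depth(self.name, len(self._items))
        return item
```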
Error Rate Over Time¶
- Use PromQL `rate()` over counters rather than a separate gauge: `sum by (vendor) (rate(maricusco_api_errors_total[5m]))`
- Error ratio vs. calls (per vendor): `sum by (vendor) (rate(maricusco_api_errors_total[5m])) / sum by (vendor) (rate(maricusco_api_calls_total[5m]))`
Configuration¶
- Enable/disable: `MARICUSCO_METRICS_ENABLED` (default false).
- Endpoint path: `MARICUSCO_METRICS_PATH` (default `/metrics`).
- Endpoint port: `MARICUSCO_METRICS_PORT` (default 8000).
- Sampling: `MARICUSCO_METRICS_SAMPLING_RATE` (default 1.0).
- Constant labels: `MARICUSCO_METRICS_LABELS` (comma-separated `key=value` pairs; a parsing sketch follows this list).
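For illustration, a constant-labels string in that `key=value,key=value` format can be parsed into a label dictionary along these lines; this is a sketch of the expected format, not the library's actual parsing code.

```python
import os

def parse_constant_labels(raw: str) -> dict[str, str]:
    """Parse a 'key=value,key=value' string into a label dict (illustrative only)."""
    labels = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        labels[key.strip()] = value.strip()
    return labels

# Example: MARICUSCO_METRICS_LABELS="env=staging,region=eu-west-1"
print(parse_constant_labels(os.getenv("MARICUSCO_METRICS_LABELS", "env=local")))
```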
Prometheus Server Setup¶
- Compose service `prometheus` (`prom/prometheus` v3.8.0, pinned in `docker-compose.yml`).
- Config: `docker/prometheus/prometheus.yml` with `scrape_interval: 15s`, `metrics_path: /metrics`, and target `app:8000`.
- Retention: `--storage.tsdb.retention.time=15d` (compose command).
- Rule files placeholder: `/etc/prometheus/rules/*.yml` (add via bind mount).
- Run: `docker-compose up -d prometheus`
- UI: `http://localhost:9090` (a programmatic verification sketch follows this list).
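Scraping can also be verified programmatically against the standard Prometheus HTTP API rather than the UI; the sketch below assumes Prometheus is reachable at `localhost:9090`.

```python
import requests

# Query the standard Prometheus HTTP API for the built-in "up" metric (1 = scrape succeeded).
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": "up"}, timeout=5)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    value = result["value"][1]
    print(f"{labels.get('job')} @ {labels.get('instance')}: up={value}")
```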
Grafana Setup and Configuration¶
Docker Compose Service¶
- Service: `grafana` in `docker-compose.yml` (image `grafana/grafana:12.3.0`, pinned version).
- Port: `3000` (configurable via the `GRAFANA_PORT` environment variable, default `3000`).
- Access URL: `http://localhost:3000`
- Network: connected to `maricusco-network` for internal communication with Prometheus.
- Dependencies: depends on the Prometheus service (`depends_on: prometheus`).
- Persistent storage: `grafana_data` volume for dashboard and user data persistence.
Environment Variables¶
- `GF_SECURITY_ADMIN_USER`: Grafana admin username (defaults to `admin` if not set).
- `GF_SECURITY_ADMIN_PASSWORD`: Grafana admin password (defaults to `admin` for local development if not set).
- Important: for non-local/production environments, `GF_SECURITY_ADMIN_PASSWORD` must be set to a secure password via the `.env` file.
- For local development, the default credentials are `admin`/`admin` when not set (convenient for development, not secure for production).
- Note: Grafana uses the `GF_` prefix for all configuration environment variables.
Data Source Provisioning¶
- The Prometheus data source is automatically provisioned via `docker/grafana/provisioning/datasources/prometheus.yml`.
- Data source name: "Prometheus"
- URL: `http://prometheus:9090` (internal Docker network)
- Access mode: Server (proxy)
- Is default: yes
- No manual configuration required; the data source is available immediately after Grafana starts.
Dashboard Provisioning¶
- Dashboards are automatically provisioned from `docker/grafana/provisioning/dashboards/`.
- Dashboard provider configuration: `docker/grafana/provisioning/dashboards/dashboard.yml`
- Folder: "Maricusco" (dashboards are organized in this folder in the Grafana UI)
- Auto-discovery: Grafana scans for `.json` files every 10 seconds.
- All dashboard JSON files in the provisioning directory are loaded automatically.
Pre-Provisioned Dashboards¶
1. Maricusco Overview (maricusco-overview.json)¶
Comprehensive overview of system metrics:
- P95 Request Duration: time series showing 95th percentile request duration
  - PromQL: `histogram_quantile(0.95, rate(maricusco_request_duration_seconds_bucket[5m]))`
- API Error Rate: time series showing error rate by vendor/operation
  - PromQL: `rate(maricusco_api_errors_total[5m])`
- Rate Limit Hits: time series showing rate limit hits by vendor/operation
  - PromQL: `rate(maricusco_api_rate_limit_hits_total[5m])`
- Active Agents: gauge panel showing the current active agent count
  - PromQL: `maricusco_active_agents`
- Cache Hit/Miss Ratio: time series showing the overall cache hit/miss ratio (0-1 scale)
  - PromQL: `sum(rate(maricusco_cache_hits_total[5m])) / (sum(rate(maricusco_cache_hits_total[5m])) + sum(rate(maricusco_cache_misses_total[5m])))`
2. Maricusco Vendor/Operation Latency (maricusco-vendor-latency.json)¶
Detailed latency breakdowns:
- Request Duration by Vendor (P95): time series grouped by vendor
  - PromQL: `histogram_quantile(0.95, sum by (le, vendor) (rate(maricusco_request_duration_seconds_bucket[5m])))`
- Request Duration by Operation (P95): time series grouped by operation
  - PromQL: `histogram_quantile(0.95, sum by (le, operation) (rate(maricusco_request_duration_seconds_bucket[5m])))`
- Latency by Vendor/Operation (P95): table view showing P95 latency for each vendor/operation combination
  - PromQL: `histogram_quantile(0.95, sum by (le, vendor, operation) (rate(maricusco_request_duration_seconds_bucket[5m])))`
3. Maricusco Cache Performance (maricusco-cache.json)¶
Cache-specific metrics:
- Cache Hit Rate Over Time: time series showing cache hits by cache name (`operation` label)
  - PromQL: `rate(maricusco_cache_hits_total[5m])`
- Cache Miss Rate Over Time: time series showing cache misses by cache name
  - PromQL: `rate(maricusco_cache_misses_total[5m])`
- Cache Hit/Miss Ratio: time series showing the overall ratio (0-1 scale)
  - PromQL: `sum(rate(maricusco_cache_hits_total[5m])) / (sum(rate(maricusco_cache_hits_total[5m])) + sum(rate(maricusco_cache_misses_total[5m])))`
- Cache Operations by Cache Name: table showing total operations per second by cache name (`operation` label)
  - PromQL: `sum(rate(maricusco_cache_hits_total[5m]) + rate(maricusco_cache_misses_total[5m])) by (operation)`
Accessing Grafana¶
- Start services: `docker-compose up -d grafana` (or `docker-compose up -d` for all services)
- Open browser: navigate to `http://localhost:3000`
- Login:
  - Local development: default credentials are `admin`/`admin` (when `GF_SECURITY_ADMIN_PASSWORD` is not set in `.env`)
  - Production/non-local: use credentials from the `.env` file (`GF_SECURITY_ADMIN_USER`/`GF_SECURITY_ADMIN_PASSWORD`)
- Navigate to dashboards: click "Dashboards" → "Maricusco" folder to see the pre-provisioned dashboards
- Verify data: ensure `MARICUSCO_METRICS_ENABLED=true` for metrics to appear in dashboards
PromQL Queries Reference (Used in Dashboards)¶
- P95 request duration: `histogram_quantile(0.95, rate(maricusco_request_duration_seconds_bucket[5m]))`
- P95 by vendor: `histogram_quantile(0.95, sum by (le, vendor) (rate(maricusco_request_duration_seconds_bucket[5m])))`
- P95 by operation: `histogram_quantile(0.95, sum by (le, operation) (rate(maricusco_request_duration_seconds_bucket[5m])))`
- P95 by vendor/operation: `histogram_quantile(0.95, sum by (le, vendor, operation) (rate(maricusco_request_duration_seconds_bucket[5m])))`
- API error rate: `rate(maricusco_api_errors_total[5m])`
- Rate limit hits: `rate(maricusco_api_rate_limit_hits_total[5m])`
- Active agents: `maricusco_active_agents`
- Cache hit rate: `rate(maricusco_cache_hits_total[5m])`
- Cache miss rate: `rate(maricusco_cache_misses_total[5m])`
- Cache hit/miss ratio: `sum(rate(maricusco_cache_hits_total[5m])) / (sum(rate(maricusco_cache_hits_total[5m])) + sum(rate(maricusco_cache_misses_total[5m])))`
- Cache operations by name: `sum(rate(maricusco_cache_hits_total[5m]) + rate(maricusco_cache_misses_total[5m])) by (operation)`
Alerting Rules (optional)¶
- Example rules file: `docker/prometheus/rules/alerts.yml` (mounted read-only).
- Sample alerts:
  - `HighErrorRatio`: error/call ratio > 5% over 5m, per vendor.
  - `HighRequestLatencyP95`: P95 request duration > 2s over 5m.
  - `MetricsMissing`: absence of the `maricusco_api_calls_total` series for 10m.
- Enable by ensuring the compose volume mount for `/etc/prometheus/rules` is present; Prometheus auto-reloads rules on change.
Monitoring Integration¶
- The `/health` endpoint can be probed separately if desired (lightweight JSON); Prometheus's primary scrape remains `/metrics`.
- Use alerting rules on dependency status fields in health responses (e.g., blackbox exporter with JSONPath) or on metrics signaling downstream failures.
- Container health status propagates to orchestrators (Docker healthcheck) and can be viewed via `docker-compose ps` or platform dashboards.
Code Examples¶
Instrumenting a function¶
```python
from maricusco.utils.metrics import execution_time_seconds, label_values, record_latency, events_total

def fetch_prices(symbol: str):
    # Shared label set: vendor, agent_type, operation.
    labels = label_values(vendor="alpha_vantage", agent_type="data", operation="fetch_prices")
    # Time the vendor call against the execution_time_seconds histogram.
    with record_latency(execution_time_seconds, labels):
        data = _call_api(symbol)
    # Count a domain event for the same label set.
    events_total.labels(*labels).inc()
    return data
```
Custom metric definition¶
```python
from prometheus_client import Gauge
from maricusco.utils.metrics import get_registry

# Register the gauge on the shared in-process registry so it is exposed at /metrics.
custom_gauge = Gauge(
    "maricusco_custom_health",
    "Service health score",
    ["component"],
    registry=get_registry(),
)

def update_health(component: str, value: float):
    custom_gauge.labels(component).set(value)
```
Using decorators for automatic timing/counts¶
```python
from maricusco.utils.metrics import count_calls, track_latency, request_duration_seconds, label_values

labels = label_values("alpha_vantage", "data", "get_news")

@count_calls(counter=request_duration_seconds, labels=labels)  # increments histogram count
@track_latency(histogram=request_duration_seconds, labels=labels)
def get_news():
    ...
```
PromQL Examples¶
- 95th percentile request latency (5m): `histogram_quantile(0.95, rate(maricusco_request_duration_seconds_bucket[5m]))`
- Error rate per vendor (5m): `sum by (vendor) (rate(maricusco_api_errors_total[5m]))`
- Error ratio (per vendor, 5m): `sum by (vendor) (rate(maricusco_api_errors_total[5m])) / sum by (vendor) (rate(maricusco_api_calls_total[5m]))`
- Rate limit hits (5m): `rate(maricusco_api_rate_limit_hits_total[5m])`
- Active agents: `maricusco_active_agents`