
DevOps Operations

This document covers infrastructure, deployment, and operational procedures for the Redhound trading system.

Infrastructure Overview

Infrastructure Architecture

graph TB
    subgraph "External Access"
        USER[Users/Developers]
        LB[Load Balancer<br/>Optional]
    end

    subgraph "Application Layer"
        APP[Redhound Application<br/>FastAPI + Uvicorn<br/>Port: 8000]
    end

    subgraph "Data Services"
        PG[(PostgreSQL<br/>TimescaleDB<br/>pgvector<br/>Port: 5432)]
        REDIS[(Redis Cache<br/>Port: 6379)]
    end

    subgraph "Monitoring Stack"
        PROM[Prometheus<br/>Metrics Collection<br/>Port: 9090]
        GRAF[Grafana<br/>Dashboards<br/>Port: 3000]
    end

    subgraph "Storage"
        VOL1[(postgres_data)]
        VOL2[(redis_data)]
        VOL3[(prometheus_data)]
        VOL4[(grafana_data)]
    end

    USER --> LB
    LB --> APP
    USER --> APP

    APP --> PG
    APP --> REDIS
    APP -->|/metrics| PROM
    APP -->|/health| PROM

    PROM --> GRAF

    PG --> VOL1
    REDIS --> VOL2
    PROM --> VOL3
    GRAF --> VOL4

    style APP fill:#7A9FB3,stroke:#6B8FA3,color:#fff
    style PG fill:#9B8AAB,stroke:#8B7A9B,color:#fff
    style REDIS fill:#9B8AAB,stroke:#8B7A9B,color:#fff
    style PROM fill:#7A9A7A,stroke:#6B8E6B,color:#fff
    style GRAF fill:#7A9A7A,stroke:#6B8E6B,color:#fff
    style LB fill:#C4A484,stroke:#B49474,color:#fff
    style VOL1 fill:#C4A484,stroke:#B49474,color:#fff
    style VOL2 fill:#C4A484,stroke:#B49474,color:#fff
    style VOL3 fill:#C4A484,stroke:#B49474,color:#fff
    style VOL4 fill:#C4A484,stroke:#B49474,color:#fff

Service Architecture

The system runs as a containerized application with the following services:

Note: All services are automatically deployed and monitored through the CI/CD pipeline.

| Service     | Container           | Port | Purpose                                            |
|-------------|---------------------|------|----------------------------------------------------|
| Application | redhound-app        | 8000 | Main FastAPI application                           |
| PostgreSQL  | redhound-postgres   | 5432 | Time-series database (TimescaleDB) with pgvector   |
| Redis       | redhound-redis      | 6379 | Caching layer                                      |
| Prometheus  | redhound-prometheus | 9090 | Metrics collection                                 |
| Grafana     | redhound-grafana    | 3000 | Metrics visualization                              |

Network Configuration

All services communicate via the redhound-network bridge network. Services use internal DNS names (e.g., postgres, redis) for inter-service communication.

Resource Limits

Default resource constraints per service:

app:
  limits: { cpus: '2.0', memory: 4G }
  reservations: { cpus: '0.5', memory: 512M }

postgres:
  limits: { cpus: '2.0', memory: 2G }
  reservations: { cpus: '0.5', memory: 512M }

redis:
  limits: { cpus: '1.0', memory: 1G }
  reservations: { cpus: '0.25', memory: 256M }

grafana:
  limits: { cpus: '1.0', memory: 512M }
  reservations: { cpus: '0.25', memory: 128M }

Adjust these values in docker-compose.yml based on workload requirements.
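In full docker-compose.yml syntax, the shorthand above expands to a deploy.resources block. A sketch for the app service, assuming the values listed above (the actual file may differ):

```yaml
services:
  app:
    deploy:
      resources:
        limits:
          cpus: '2.0'      # hard cap: throttled beyond this
          memory: 4G       # hard cap: OOM-killed beyond this
        reservations:
          cpus: '0.5'      # guaranteed minimum scheduling share
          memory: 512M     # guaranteed minimum memory
```

Limits bound the worst case; reservations express the baseline the scheduler should guarantee.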

Deployment

Prerequisites

  • Docker 20.10+ and Docker Compose 2.0+
  • Minimum 8GB RAM, 4 CPU cores
  • 20GB free disk space for volumes

Initial Deployment

  1. Clone repository:

    git clone https://github.com/redhound-labs/redhound.git
    cd redhound
    

  2. Configure environment:

    cp .env.example .env
    # Edit .env with production values
    

  3. Set required environment variables:

    # API Keys
    export OPENAI_API_KEY=sk-...
    export ALPHA_VANTAGE_API_KEY=...
    
    # Database credentials
    export POSTGRES_PASSWORD=<secure-password>
    export POSTGRES_USER=redhound
    export POSTGRES_DB=redhound
    
    # Grafana credentials
    export GF_SECURITY_ADMIN_PASSWORD=<secure-password>
    

  4. Start services:

    docker-compose up -d
    

  5. Verify deployment:

    docker-compose ps
    # All services should show "healthy" status
    
    # Check application health
    curl http://localhost:8000/health
    

Application Configuration

The application entrypoint supports environment variable overrides:

# Uvicorn configuration
export APP_MODULE=redhound.api.app:app
export APP_HOST=0.0.0.0
export APP_PORT=8000
export UVICORN_WORKERS=4  # Scale workers
export UVICORN_RELOAD=false
export EXTRA_UVICORN_ARGS="--log-level info"
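A hedged sketch of how an entrypoint script might assemble the uvicorn command from these variables, using the documented defaults; the real entrypoint may differ:

```shell
# Fall back to documented defaults when a variable is unset
APP_MODULE="${APP_MODULE:-redhound.api.app:app}"
APP_HOST="${APP_HOST:-0.0.0.0}"
APP_PORT="${APP_PORT:-8000}"
UVICORN_WORKERS="${UVICORN_WORKERS:-1}"
EXTRA_UVICORN_ARGS="${EXTRA_UVICORN_ARGS:-}"

# Build the final command line (printed here instead of exec'd)
CMD="uvicorn ${APP_MODULE} --host ${APP_HOST} --port ${APP_PORT} --workers ${UVICORN_WORKERS} ${EXTRA_UVICORN_ARGS}"
echo "$CMD"
```

A real entrypoint would end with `exec $CMD` so uvicorn becomes PID 1 and receives container stop signals directly.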

Scaling

Horizontal scaling (multiple app instances):

  1. Update docker-compose.yml:

    app:
      deploy:
        replicas: 3
    

  2. Use a load balancer (nginx, traefik) in front of multiple instances.
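When fronting replicas with nginx, the upstream block distributes requests across instances. A minimal sketch, assuming three instances reachable as app_1..app_3 on the internal network (names and ports are illustrative):

```nginx
upstream redhound_app {
    # Compose-scaled containers, resolved via the internal Docker network
    server app_1:8000;
    server app_2:8000;
    server app_3:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://redhound_app;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```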

Vertical scaling (resource limits):

Update resource limits in docker-compose.yml based on monitoring metrics.

Health Checks

sequenceDiagram
    participant DC as Docker Compose
    participant APP as Application
    participant PG as PostgreSQL
    participant RD as Redis

    Note over DC: Health check interval: 30s

    DC->>APP: Health check request<br/>(/usr/local/bin/healthcheck)
    APP->>APP: Check /health endpoint
    APP->>PG: Test connection<br/>(pg_isready)
    PG-->>APP: Connection status
    APP->>RD: Test connection<br/>(redis-cli ping)
    RD-->>APP: PONG
    APP-->>DC: Health status<br/>(healthy/unhealthy)

    Note over DC: All dependencies must be healthy<br/>for container to be marked healthy

Health checks run every 30 seconds with a 2-second timeout:

  • Application: /health endpoint (required)
  • PostgreSQL: pg_isready command
  • Redis: redis-cli ping
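The interval and timeout above map onto Compose healthcheck blocks. A sketch for the app service (the test command path follows the diagram; retries and start_period are assumptions):

```yaml
services:
  app:
    healthcheck:
      test: ["CMD", "/usr/local/bin/healthcheck"]
      interval: 30s      # check every 30 seconds
      timeout: 2s        # fail the check if it takes longer than 2s
      retries: 3         # consecutive failures before "unhealthy"
      start_period: 10s  # grace period during startup
```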

Verify health status:

docker-compose ps
# Check individual service
docker exec redhound-app curl -f http://localhost:8000/health

Database Migrations

Schema changes are managed with Alembic. The database must be PostgreSQL with the TimescaleDB and pgvector extensions installed (e.g. the Docker Compose Postgres service, initialized via docker/postgres/init-timescaledb.sql).

Migration commands

  • Apply all migrations: ./scripts/db_migrate.sh or alembic upgrade head
  • Rollback one revision: ./scripts/db_rollback.sh or alembic downgrade -1
  • Reset database (dev only): ./scripts/db_reset.sh (downgrade to base, then upgrade head)
  • Seed test data (dev): ./scripts/db_seed.sh

Connection settings come from POSTGRES_* / REDHOUND_DATABASE_* environment variables. See Configuration Reference.
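A hedged sketch of how a SQLAlchemy-style DSN could be assembled from the POSTGRES_* variables; the POSTGRES_HOST default of "postgres" (the Compose service name) and the exact variable names used by the application are assumptions:

```shell
# Fall back to illustrative defaults when a variable is unset
POSTGRES_USER="${POSTGRES_USER:-redhound}"
POSTGRES_PASSWORD="${POSTGRES_PASSWORD:-changeme}"
POSTGRES_HOST="${POSTGRES_HOST:-postgres}"
POSTGRES_PORT="${POSTGRES_PORT:-5432}"
POSTGRES_DB="${POSTGRES_DB:-redhound}"

# Assemble a standard libpq/SQLAlchemy connection URL
DATABASE_URL="postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT}/${POSTGRES_DB}"
echo "$DATABASE_URL"
```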

Creating a new migration

  1. Change SQLAlchemy models in redhound/database/models/ as needed.
  2. Generate a revision: alembic revision --autogenerate -m "description of change".
  3. Edit the generated file in alembic/versions/: add TimescaleDB or pgvector operations if required (see below).
  4. Test: alembic upgrade head then alembic downgrade -1 then alembic upgrade head.
  5. Commit the new version file.

TimescaleDB-specific migrations

For new time-series tables that should be hypertables:

  1. Create the table in the migration (or via autogenerate).
  2. Convert it to a hypertable using the helpers in redhound/database/migration_utils/timescale.py, e.g. create_hypertable(connection, table_name, time_column_name), where the time column is typically created_at or timestamp.
  3. Optionally add compression and retention policies in the same or a later migration:
    • add_compression_policy(connection, table_name, "7 days")
    • add_retention_policy(connection, table_name, "2 years")

In downgrade, remove policies (e.g. remove_retention_policy, remove_compression_policy) then drop the table if applicable.
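The steps above can be sketched as the SQL that the create_hypertable helper is assumed to issue. The table name "prices" and time column "timestamp" are hypothetical; the block only prints the statement, with the psql invocation shown as a comment:

```shell
# Hypothetical table and time column for illustration
TABLE="prices"
TIME_COL="timestamp"

# SQL the TimescaleDB helper is assumed to run for the conversion
SQL="SELECT create_hypertable('${TABLE}', '${TIME_COL}', if_not_exists => TRUE);"
echo "$SQL"

# To execute against the Compose database:
#   docker exec redhound-postgres psql -U redhound redhound -c "$SQL"
```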

pgvector migrations

  • Enable the extension once (migration 001_enable_pgvector.py).
  • For vector columns, create the table then add an HNSW index, e.g.: CREATE INDEX ... ON table_name USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
  • Downgrade: drop the index then drop the table; drop the extension only if no tables use vector.

Production migration runbook

  1. Pre-migration: Backup the database. Put the application in maintenance mode or ensure no schema-dependent deploys run during the migration.
  2. Apply: Run alembic upgrade head against the production database (using production credentials). Use the same virtualenv and code version that includes the new migration.
  3. Verify: Run alembic current and spot-check application health and key queries.
  4. Rollback (if needed): Run alembic downgrade -1 (or to a specific revision). Fix any application or data issues before re-running upgrade.

Rollback procedures

  • One revision: alembic downgrade -1
  • To a specific revision: alembic downgrade <revision> (e.g. 002, 001)
  • To empty schema: alembic downgrade base (drops all managed objects)

Downgrade order matters: remove retention/compression policies before dropping hypertables; drop tables that use pgvector before dropping the vector extension.

Environment Management

Environment Variables

Configuration priority (highest to lowest):

  1. Environment variables
  2. .env file
  3. Default values in code
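The precedence can be sketched with shell parameter defaults; LOG_LEVEL here is a hypothetical setting, not a documented variable:

```shell
# Start from a clean slate so only the simulated sources apply
unset LOG_LEVEL

LOG_LEVEL_DEFAULT="info"       # 3) default baked into code (lowest priority)
DOTENV_LOG_LEVEL="warning"     # 2) simulated .env entry
# 1) an exported LOG_LEVEL would win; none is set in this sketch

# Resolve in priority order: env var, then .env value, then code default
LOG_LEVEL="${LOG_LEVEL:-${DOTENV_LOG_LEVEL:-$LOG_LEVEL_DEFAULT}}"
echo "$LOG_LEVEL"   # "warning": the .env value beats the code default
```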

Security:

  • Never commit .env files to version control
  • Use secrets management (AWS Secrets Manager, HashiCorp Vault) in production
  • Rotate credentials regularly

See Configuration Reference for complete environment variable documentation.

CI/CD Pipeline

The project uses GitHub Actions for continuous integration and deployment. See CI/CD Pipeline for complete documentation.

Key stages:

  1. Lock file validation
  2. Code quality (lint, type check, security)
  3. Test suite execution
  4. Docker image build and scan
  5. Deployment (manual or automated)

Local CI simulation:

make ci-check  # Run all CI checks locally

Monitoring and Observability

Monitoring Flow

flowchart LR
    subgraph "Application"
        APP[Redhound App]
        METRICS[/metrics endpoint]
        HEALTH[/health endpoint]
    end

    subgraph "Collection"
        PROM[Prometheus<br/>Scrapes every 15s]
    end

    subgraph "Visualization"
        GRAF[Grafana<br/>Dashboards]
    end

    subgraph "Storage"
        TSDB[(Time Series DB<br/>15 day retention)]
    end

    APP --> METRICS
    APP --> HEALTH
    METRICS -->|HTTP GET| PROM
    HEALTH -->|HTTP GET| PROM
    PROM --> TSDB
    PROM -->|Query| GRAF
    TSDB -->|Query| GRAF
    GRAF -->|Display| USER[Users/Operators]

    style APP fill:#7A9FB3,stroke:#6B8FA3,color:#fff
    style PROM fill:#7A9A7A,stroke:#6B8E6B,color:#fff
    style GRAF fill:#7A9A7A,stroke:#6B8E6B,color:#fff
    style TSDB fill:#9B8AAB,stroke:#8B7A9B,color:#fff
    style METRICS fill:#C4A484,stroke:#B49474,color:#fff
    style HEALTH fill:#C4A484,stroke:#B49474,color:#fff

See Monitoring and Metrics for complete documentation on metrics, logging, and Grafana dashboards.

Security

Container Security

  • Non-root user (appuser, UID 1000) in application container
  • Minimal base images (Python slim)
  • Multi-stage builds to reduce image size
  • Regular security scans via Trivy in CI/CD

Network Security

  • Services communicate via internal Docker network
  • Expose only necessary ports (8000 for app, 3000 for Grafana)
  • Use reverse proxy (nginx, traefik) for TLS termination

Secrets Management

Production recommendations:

  • Use a secrets management service (AWS Secrets Manager, HashiCorp Vault)
  • Never hardcode credentials
  • Rotate API keys and passwords regularly
  • Use least-privilege access principles

Security Scanning

Docker image scanning:

docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  aquasec/trivy image redhound:latest

Dependency scanning:

uv run pip-audit --desc

Fixing vulnerabilities

Trivy and pip-audit report OS (Debian) and Python issues. Address them as follows.

  1. List findings
    • Image (OS + Python in image): docker build -t redhound:test ., then docker run --rm -v /var/run/docker.sock:/var/run/docker.sock aquasec/trivy:0.68.1 image --scanners vuln redhound:test
    • Python deps only: uv run pip-audit --format json --desc or uv run pip-audit --desc
  2. OS (Debian) packages
    • The Dockerfile runs apt-get update && apt-get upgrade -y before apt-get install to apply security fixes for the base image (e.g. curl, libc). Rebuild the image to pick up fixes.
    • If a fix is not yet in the base image, use a newer python:3.12.12-slim (or python:3.12-slim) once the official image is refreshed, or pin to a digest after verifying it includes the fix.
  3. Python packages
    • pip-audit (and Trivy) name the vulnerable package and the fixed version. In pyproject.toml, set the package to the fixed version (or a compatible newer one), then run uv lock and uv sync. Re-run pip-audit and Trivy to confirm.
  4. Re-scan
    • Rebuild the image and run Trivy again, and run uv run pip-audit --desc after uv lock/uv sync.

Forcing base-image security updates

The Dockerfile uses a build-arg SECURITY_UPDATES. When it changes, the apt-get upgrade layer is invalidated and re-runs, so the image gets the latest Debian security updates. In CI, SECURITY_UPDATES is set to the run date (UTC) so the apt layer refreshes at least once per calendar day; same-day runs share the cache. For local builds, pass --build-arg SECURITY_UPDATES=$(date +%Y%m%d) to force a refresh.
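A hedged sketch of the pattern (the actual Dockerfile may differ): referencing the build-arg inside the RUN layer ties that layer's cache key to the arg's value, so changing SECURITY_UPDATES forces apt-get upgrade to re-run:

```dockerfile
ARG SECURITY_UPDATES=unset
# Using the arg in this RUN invalidates the layer cache whenever its value changes
RUN echo "security-updates: ${SECURITY_UPDATES}" \
    && apt-get update \
    && apt-get upgrade -y \
    && rm -rf /var/lib/apt/lists/*
```

Local build forcing a refresh: docker build --build-arg SECURITY_UPDATES=$(date +%Y%m%d) -t redhound:latest .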

Temporarily accepted vulnerabilities

Vulnerabilities with no fix yet (e.g. upstream not released) may be listed in .trivyignore with an expiry date. Re-evaluate before the expiry and remove the entry once a patched version is available. Current entries include CVE-2026-0994 (protobuf; no patch yet) and CVE-2026-0861 (glibc; Debian Trixie has no fix; re-evaluate when Debian security provides an update). The Dockerfile upgrades pip to ≥25.3 to address CVE-2025-8869.
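A .trivyignore sketch matching the entries above; Trivy reads one vulnerability ID per line and skips lines starting with #, so the rationale and re-evaluation reminder live in comments:

```
# protobuf: no upstream patch yet; re-evaluate before expiry
CVE-2026-0994
# glibc: no fix in Debian Trixie; re-check when Debian security updates ship
CVE-2026-0861
```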

Maintenance

Log Management

View logs:

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f app

# Last 100 lines
docker-compose logs --tail=100 app

Log cleanup:

# Clean old logs (if file logging enabled)
find redhound/logs -name "*.log.*" -mtime +30 -delete

Database Maintenance

PostgreSQL maintenance:

# Vacuum database
docker exec redhound-postgres psql -U redhound redhound -c "VACUUM ANALYZE;"

# Check database size
docker exec redhound-postgres psql -U redhound redhound -c "SELECT pg_size_pretty(pg_database_size('redhound'));"

Redis maintenance:

# Check memory usage
docker exec redhound-redis redis-cli INFO memory

# Clear cache (use with caution)
docker exec redhound-redis redis-cli FLUSHDB

Updates and Upgrades

Application update:

# Pull latest code
git pull origin main

# Rebuild and restart
docker-compose build app
docker-compose up -d app

Dependency updates:

  • Dependabot automatically creates PRs for dependency updates
  • Review and merge PRs after CI passes
  • See CI/CD Pipeline for details

Troubleshooting

Service Won't Start

Check logs:

docker-compose logs app

Common issues:

  • Missing environment variables → Check .env file
  • Port conflicts → Change port mappings in docker-compose.yml
  • Insufficient resources → Increase resource limits
  • Database connection failure → Verify POSTGRES_* environment variables

Health Check Failures

Application unhealthy:

# Check health endpoint directly
docker exec redhound-app curl -v http://localhost:8000/health

# Check dependency connectivity
docker exec redhound-app ping postgres
docker exec redhound-app ping redis

Database connection issues:

# Test PostgreSQL connection
docker exec redhound-postgres pg_isready -U redhound

# Check PostgreSQL logs
docker-compose logs postgres

Performance Issues

High memory usage:

# Check container resource usage
docker stats

# Adjust resource limits in docker-compose.yml

Slow queries:

# Enable query logging in PostgreSQL
# Check slow query logs
docker-compose logs postgres | grep "slow query"

Data Issues

Corrupted or missing data:

  1. Check volume mounts: docker volume inspect redhound_postgres_data
  2. Check service logs: docker-compose logs <service>
  3. Verify database connectivity and permissions

Operational Procedures

Service Restart

Restart all services:

docker-compose restart

Restart specific service:

docker-compose restart app

Graceful restart (zero downtime):

# Rolling restart with new image
docker-compose up -d --no-deps --build app

Service Scaling

Scale application:

# Scale to 3 instances
docker-compose up -d --scale app=3

Note: Use a load balancer in front of multiple instances; Docker Compose's built-in scaling is limited and not intended for production use.

Maintenance Windows

Scheduled maintenance:

  1. Notify users of the maintenance window
  2. Stop services: docker-compose down
  3. Perform maintenance (updates, backups, etc.)
  4. Start services: docker-compose up -d
  5. Verify health: curl http://localhost:8000/health

References