DevOps Operations¶
This document covers infrastructure, deployment, and operational procedures for the Redhound trading system.
Infrastructure Overview¶
Infrastructure Architecture¶
```mermaid
graph TB
    subgraph "External Access"
        USER[Users/Developers]
        LB[Load Balancer<br/>Optional]
    end
    subgraph "Application Layer"
        APP[Redhound Application<br/>FastAPI + Uvicorn<br/>Port: 8000]
    end
    subgraph "Data Services"
        PG[(PostgreSQL<br/>TimescaleDB<br/>pgvector<br/>Port: 5432)]
        REDIS[(Redis Cache<br/>Port: 6379)]
    end
    subgraph "Monitoring Stack"
        PROM[Prometheus<br/>Metrics Collection<br/>Port: 9090]
        GRAF[Grafana<br/>Dashboards<br/>Port: 3000]
    end
    subgraph "Storage"
        VOL1[(postgres_data)]
        VOL2[(redis_data)]
        VOL3[(prometheus_data)]
        VOL4[(grafana_data)]
    end
    USER --> LB
    LB --> APP
    USER --> APP
    APP --> PG
    APP --> REDIS
    APP -->|/metrics| PROM
    APP -->|/health| PROM
    PROM --> GRAF
    PG --> VOL1
    REDIS --> VOL2
    PROM --> VOL3
    GRAF --> VOL4
    style APP fill:#7A9FB3,stroke:#6B8FA3,color:#fff
    style PG fill:#9B8AAB,stroke:#8B7A9B,color:#fff
    style REDIS fill:#9B8AAB,stroke:#8B7A9B,color:#fff
    style PROM fill:#7A9A7A,stroke:#6B8E6B,color:#fff
    style GRAF fill:#7A9A7A,stroke:#6B8E6B,color:#fff
    style LB fill:#C4A484,stroke:#B49474,color:#fff
    style VOL1 fill:#C4A484,stroke:#B49474,color:#fff
    style VOL2 fill:#C4A484,stroke:#B49474,color:#fff
    style VOL3 fill:#C4A484,stroke:#B49474,color:#fff
    style VOL4 fill:#C4A484,stroke:#B49474,color:#fff
```
Service Architecture¶
The system runs as a containerized application with the following services:
Note: All services are automatically deployed and monitored through the CI/CD pipeline.
| Service | Container | Port | Purpose |
|---|---|---|---|
| Application | `redhound-app` | 8000 | Main FastAPI application |
| PostgreSQL | `redhound-postgres` | 5432 | Time-series database (TimescaleDB) with pgvector |
| Redis | `redhound-redis` | 6379 | Caching layer |
| Prometheus | `redhound-prometheus` | 9090 | Metrics collection |
| Grafana | `redhound-grafana` | 3000 | Metrics visualization |
Network Configuration¶
All services communicate via the `redhound-network` bridge network. Services use internal DNS names (e.g., `postgres`, `redis`) for inter-service communication.
Resource Limits¶
Default resource constraints per service:
```yaml
app:
  limits: { cpus: '2.0', memory: 4G }
  reservations: { cpus: '0.5', memory: 512M }
postgres:
  limits: { cpus: '2.0', memory: 2G }
  reservations: { cpus: '0.5', memory: 512M }
redis:
  limits: { cpus: '1.0', memory: 1G }
  reservations: { cpus: '0.25', memory: 256M }
grafana:
  limits: { cpus: '1.0', memory: 512M }
  reservations: { cpus: '0.25', memory: 128M }
```
Adjust these values in `docker-compose.yml` based on workload requirements.
Deployment¶
Prerequisites¶
- Docker 20.10+ and Docker Compose 2.0+
- Minimum 8GB RAM, 4 CPU cores
- 20GB free disk space for volumes
Initial Deployment¶
1. Clone the repository.
2. Configure the environment (copy and edit the `.env` file).
3. Set the required environment variables.
4. Start the services with Docker Compose.
5. Verify the deployment via the health endpoint.
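As a hypothetical walkthrough of the steps above (the repository URL and env-file name are placeholders, not the project's actual values):

```bash
# Placeholder repo URL and env template; substitute your own.
git clone https://github.com/example/redhound.git
cd redhound

cp .env.example .env                  # assumes an example env file exists
export POSTGRES_PASSWORD=change-me    # plus any other required variables

docker-compose up -d                  # start all services in the background

docker-compose ps                     # confirm containers report "healthy"
curl -f http://localhost:8000/health  # verify the health endpoint
```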
Application Configuration¶
The application entrypoint supports environment variable overrides:
```bash
# Uvicorn configuration
export APP_MODULE=redhound.api.app:app
export APP_HOST=0.0.0.0
export APP_PORT=8000
export UVICORN_WORKERS=4  # Scale workers
export UVICORN_RELOAD=false
export EXTRA_UVICORN_ARGS="--log-level info"
```
Scaling¶
Horizontal scaling (multiple app instances):
1. Update `docker-compose.yml` to run multiple application instances.
2. Use a load balancer (nginx, traefik) in front of the instances.

Vertical scaling (resource limits):

Update the resource limits in `docker-compose.yml` based on monitoring metrics.
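As a sketch, horizontal scaling in Compose v2 can use `deploy.replicas`; the replica count here is illustrative, and the fixed host-port mapping must be removed so replicas do not collide:

```yaml
# Illustrative docker-compose.yml fragment: three app replicas behind an
# external load balancer. Replicas cannot all bind host port 8000, so the
# load balancer should reach containers over the internal network instead.
services:
  app:
    deploy:
      replicas: 3
    ports: []   # drop the fixed 8000:8000 mapping when running replicas
```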
Health Checks¶
```mermaid
sequenceDiagram
    participant DC as Docker Compose
    participant APP as Application
    participant PG as PostgreSQL
    participant RD as Redis
    Note over DC: Health check interval: 30s
    DC->>APP: Health check request<br/>(/usr/local/bin/healthcheck)
    APP->>APP: Check /health endpoint
    APP->>PG: Test connection<br/>(pg_isready)
    PG-->>APP: Connection status
    APP->>RD: Test connection<br/>(redis-cli ping)
    RD-->>APP: PONG
    APP-->>DC: Health status<br/>(healthy/unhealthy)
    Note over DC: All dependencies must be healthy<br/>for container to be marked healthy
```
Health checks run every 30 seconds with a 2-second timeout:
- Application: `/health` endpoint (required)
- PostgreSQL: `pg_isready` command
- Redis: `redis-cli ping`
Verify health status:
```bash
docker-compose ps

# Check individual service
docker exec redhound-app curl -f http://localhost:8000/health
```
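The corresponding Compose healthcheck stanza might look like the following; the interval, timeout, and script path come from the description above, while the `retries` value is an assumption:

```yaml
# Sketch of the app service healthcheck in docker-compose.yml.
services:
  app:
    healthcheck:
      test: ["CMD", "/usr/local/bin/healthcheck"]
      interval: 30s
      timeout: 2s
      retries: 3   # assumed value; not specified in this document
```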
Database Migrations¶
Schema changes are managed with Alembic. The database must be PostgreSQL with the TimescaleDB and pgvector extensions (e.g. the Docker Compose Postgres service initialized with `docker/postgres/init-timescaledb.sql`).
Migration commands¶
- Apply all migrations: `./scripts/db_migrate.sh` or `alembic upgrade head`
- Roll back one revision: `./scripts/db_rollback.sh` or `alembic downgrade -1`
- Reset database (dev only): `./scripts/db_reset.sh` (downgrade to base, then upgrade head)
- Seed test data (dev): `./scripts/db_seed.sh`
Connection settings come from the `POSTGRES_*` / `REDHOUND_DATABASE_*` environment variables. See Configuration Reference.
Creating a new migration¶
1. Change SQLAlchemy models in `redhound/database/models/` as needed.
2. Generate a revision: `alembic revision --autogenerate -m "description of change"`.
3. Edit the generated file in `alembic/versions/`: add TimescaleDB or pgvector operations if required (see below).
4. Test: `alembic upgrade head`, then `alembic downgrade -1`, then `alembic upgrade head`.
5. Commit the new version file.
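A generated revision file typically has the shape below. This is a non-runnable sketch: the revision identifiers, table, and column names are placeholders, not part of the Redhound schema.

```python
"""description of change

Illustrative Alembic revision skeleton; real revision IDs are
generated by `alembic revision --autogenerate`.
"""
import sqlalchemy as sa
from alembic import op

revision = "abc123def456"   # placeholder ID
down_revision = "fedcba987654"  # placeholder ID


def upgrade() -> None:
    # Hypothetical column addition; adapt to your actual model change.
    op.add_column("trades", sa.Column("venue", sa.String(32), nullable=True))


def downgrade() -> None:
    # Mirror of upgrade(), so `alembic downgrade -1` cleanly reverts it.
    op.drop_column("trades", "venue")
```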
TimescaleDB-specific migrations¶
For new time-series tables that should be hypertables:
1. Create the table in the migration (or via autogenerate).
2. Convert it to a hypertable using the helpers in `redhound/database/migration_utils/timescale.py`: `create_hypertable(connection, table_name, time_column_name)` (e.g. `created_at` or `timestamp`).
3. Optionally add compression and retention in the same or a later migration:
   - `add_compression_policy(connection, table_name, "7 days")`
   - `add_retention_policy(connection, table_name, "2 years")`

In `downgrade`, remove the policies (e.g. `remove_retention_policy`, `remove_compression_policy`), then drop the table if applicable.
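Putting the steps above together, a hypertable migration might look like this sketch. The helper names come from `redhound/database/migration_utils/timescale.py` as described above; the table, columns, and revision IDs are illustrative, and the file is not runnable outside the project.

```python
import sqlalchemy as sa
from alembic import op

# Helpers documented above; signatures assumed from their described usage.
from redhound.database.migration_utils.timescale import (
    add_compression_policy,
    add_retention_policy,
    create_hypertable,
    remove_compression_policy,
    remove_retention_policy,
)

revision = "010_add_ticks"       # placeholder ID
down_revision = "009_previous"   # placeholder ID


def upgrade() -> None:
    # Hypothetical time-series table.
    op.create_table(
        "ticks",
        sa.Column("timestamp", sa.DateTime(timezone=True), nullable=False),
        sa.Column("symbol", sa.String(16), nullable=False),
        sa.Column("price", sa.Numeric(18, 8), nullable=False),
    )
    conn = op.get_bind()
    create_hypertable(conn, "ticks", "timestamp")
    add_compression_policy(conn, "ticks", "7 days")
    add_retention_policy(conn, "ticks", "2 years")


def downgrade() -> None:
    # Policies must be removed before the hypertable is dropped.
    conn = op.get_bind()
    remove_retention_policy(conn, "ticks")
    remove_compression_policy(conn, "ticks")
    op.drop_table("ticks")
```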
pgvector migrations¶
1. Enable the extension once (migration `001_enable_pgvector.py`).
2. For vector columns, create the table, then add an HNSW index, e.g.: `CREATE INDEX ... ON table_name USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);`
3. Downgrade: drop the index, then drop the table; drop the extension only if no tables use the `vector` type.
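Expanding the index statement above into full upgrade/downgrade SQL, as a sketch (table name, column name, and vector dimension are illustrative):

```sql
-- Upgrade:
CREATE EXTENSION IF NOT EXISTS vector;  -- done once, in 001_enable_pgvector.py

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    embedding vector(384)               -- dimension is an assumption
);

CREATE INDEX ix_documents_embedding
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Downgrade (reverse order):
DROP INDEX ix_documents_embedding;
DROP TABLE documents;
-- DROP EXTENSION vector;  -- only if no remaining tables use the vector type
```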
Production migration runbook¶
1. **Pre-migration:** Back up the database. Put the application in maintenance mode, or ensure no schema-dependent deploys run during the migration.
2. **Apply:** Run `alembic upgrade head` against the production database (using production credentials). Use the same virtualenv and code version that includes the new migration.
3. **Verify:** Run `alembic current` and spot-check application health and key queries.
4. **Rollback (if needed):** Run `alembic downgrade -1` (or to a specific revision). Fix any application or data issues before re-running the upgrade.
Rollback procedures¶
- One revision: `alembic downgrade -1`
- To a specific revision: `alembic downgrade <revision>` (e.g. `002`, `001`)
- To an empty schema: `alembic downgrade base` (drops all managed objects)
Downgrade order matters: remove retention/compression policies before dropping hypertables; drop tables that use pgvector before dropping the vector extension.
Environment Management¶
Environment Variables¶
Configuration priority (highest to lowest):
1. Environment variables
2. .env file
3. Default values in code
Security:
- Never commit .env files to version control
- Use secrets management (AWS Secrets Manager, HashiCorp Vault) in production
- Rotate credentials regularly
See Configuration Reference for complete environment variable documentation.
CI/CD Pipeline¶
The project uses GitHub Actions for continuous integration and deployment. See CI/CD Pipeline for complete documentation.
Key stages:

1. Lock file validation
2. Code quality (lint, type check, security)
3. Test suite execution
4. Docker image build and scan
5. Deployment (manual or automated)
Local CI simulation:
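The project's own simulation command is not reproduced here. As a rough approximation of the stages above (the `ruff`, `mypy`, and `pytest` invocations are assumptions; `uv`, `pip-audit`, and Trivy appear elsewhere in this document):

```bash
uv lock --check                  # 1. lock file validation (assumed flag)
uv run ruff check .              # 2. lint (assumed tool)
uv run mypy .                    # 2. type check (assumed tool)
uv run pip-audit --desc          # 2. security
uv run pytest                    # 3. test suite (assumed tool)
docker build -t redhound:test .  # 4. image build
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  aquasec/trivy image redhound:test   # 4. image scan
```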
Monitoring and Observability¶
Monitoring Flow¶
```mermaid
flowchart LR
    subgraph "Application"
        APP[Redhound App]
        METRICS[/metrics endpoint]
        HEALTH[/health endpoint]
    end
    subgraph "Collection"
        PROM[Prometheus<br/>Scrapes every 15s]
    end
    subgraph "Visualization"
        GRAF[Grafana<br/>Dashboards]
    end
    subgraph "Storage"
        TSDB[(Time Series DB<br/>15 day retention)]
    end
    APP --> METRICS
    APP --> HEALTH
    METRICS -->|HTTP GET| PROM
    HEALTH -->|HTTP GET| PROM
    PROM --> TSDB
    PROM -->|Query| GRAF
    TSDB -->|Query| GRAF
    GRAF -->|Display| USER[Users/Operators]
    style APP fill:#7A9FB3,stroke:#6B8FA3,color:#fff
    style PROM fill:#7A9A7A,stroke:#6B8E6B,color:#fff
    style GRAF fill:#7A9A7A,stroke:#6B8E6B,color:#fff
    style TSDB fill:#9B8AAB,stroke:#8B7A9B,color:#fff
    style METRICS fill:#C4A484,stroke:#B49474,color:#fff
    style HEALTH fill:#C4A484,stroke:#B49474,color:#fff
```
See Monitoring and Metrics for complete documentation on metrics, logging, and Grafana dashboards.
Security¶
Container Security¶
- Non-root user (`appuser`, UID 1000) in application container
- Minimal base images (Python slim)
- Multi-stage builds to reduce image size
- Regular security scans via Trivy in CI/CD
Network Security¶
- Services communicate via internal Docker network
- Expose only necessary ports (8000 for app, 3000 for Grafana)
- Use reverse proxy (nginx, traefik) for TLS termination
Secrets Management¶
Production recommendations:

- Use a secrets management service (AWS Secrets Manager, HashiCorp Vault)
- Never hardcode credentials
- Rotate API keys and passwords regularly
- Use least-privilege access principles
Security Scanning¶
Docker image scanning:
```bash
docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  aquasec/trivy image redhound:latest
```
Dependency scanning:
Fixing vulnerabilities
Trivy and pip-audit report OS (Debian) and Python issues. Address them as follows.
- **List findings**
  - Image (OS + Python in image): `docker build -t redhound:test .`, then `docker run --rm -v /var/run/docker.sock:/var/run/docker.sock aquasec/trivy:0.68.1 image --scanners vuln redhound:test`
  - Python deps only: `uv run pip-audit --format json --desc` or `uv run pip-audit --desc`
- **OS (Debian) packages**
  - The Dockerfile runs `apt-get update && apt-get upgrade -y` before `apt-get install` to apply security fixes for the base image (e.g. curl, libc). Rebuild the image to pick up fixes.
  - If a fix is not yet in the base image, use a newer `python:3.12.12-slim` (or `python:3.12-slim`) once the official image is refreshed, or pin to a digest after verifying it includes the fix.
- **Python packages**
  - pip-audit (and Trivy) name the vulnerable package and the fixed version. In `pyproject.toml`, set the package to the fixed version (or a compatible newer one), then run `uv lock` and `uv sync`. Re-run pip-audit and Trivy to confirm.
- **Re-scan**
  - Rebuild the image and run Trivy again. Run `uv run pip-audit --desc` after `uv lock` / `uv sync`.
Forcing base-image security updates
The Dockerfile uses a build-arg `SECURITY_UPDATES`. When it changes, the `apt-get upgrade` layer is invalidated and re-runs, so the image gets the latest Debian security updates. In CI, `SECURITY_UPDATES` is set to the run date (UTC) so the apt layer refreshes at least once per calendar day; same-day runs share the cache. For local builds, pass `--build-arg SECURITY_UPDATES=$(date +%Y%m%d)` to force a refresh.
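The local refresh described above can be scripted; this sketch computes the daily cache-busting value and runs the build (the image tag is illustrative):

```shell
# Daily cache-busting value, UTC, as described above (e.g. 20250101).
stamp=$(date -u +%Y%m%d)

# Rebuild with the fresh stamp so the apt-get upgrade layer re-runs.
docker build --build-arg SECURITY_UPDATES="${stamp}" -t redhound:latest .
```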
Temporarily accepted vulnerabilities
Vulnerabilities with no fix yet (e.g. upstream not released) may be listed in `.trivyignore` with an expiry date. Re-evaluate before the expiry and remove the entry once a patched version is available. Current entries include CVE-2026-0994 (protobuf; no patch yet) and CVE-2026-0861 (glibc; Debian Trixie has no fix; re-evaluate when Debian security provides an update). The Dockerfile upgrades pip to ≥25.3 to address CVE-2025-8869.
Maintenance¶
Log Management¶
View logs:
```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f app

# Last 100 lines
docker-compose logs --tail=100 app
```
Log cleanup:
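The project's cleanup command is not reproduced here; one common approach is capping container log size via the Compose logging driver (the size and file-count values are illustrative):

```yaml
# Illustrative per-service log rotation in docker-compose.yml:
# caps each container at roughly 3 × 10 MB of JSON logs.
services:
  app:
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
```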
Database Maintenance¶
PostgreSQL maintenance:
```bash
# Vacuum database
docker exec redhound-postgres psql -U redhound redhound -c "VACUUM ANALYZE;"

# Check database size
docker exec redhound-postgres psql -U redhound redhound -c "SELECT pg_size_pretty(pg_database_size('redhound'));"
```
Redis maintenance:
```bash
# Check memory usage
docker exec redhound-redis redis-cli INFO memory

# Clear cache (use with caution)
docker exec redhound-redis redis-cli FLUSHDB
```
Updates and Upgrades¶
Application update:
```bash
# Pull latest code
git pull origin main

# Rebuild and restart
docker-compose build app
docker-compose up -d app
```
Dependency updates:

- Dependabot automatically creates PRs for dependency updates
- Review and merge PRs after CI passes
- See CI/CD Pipeline for details
Troubleshooting¶
Service Won't Start¶
Check logs:
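The specific command is not reproduced above; the log commands from the Log Management section apply here, e.g.:

```bash
docker-compose ps                    # container state and health status
docker-compose logs --tail=100 app   # recent logs for the failing service
```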
Common issues:
- Missing environment variables → Check the `.env` file
- Port conflicts → Change port mappings in `docker-compose.yml`
- Insufficient resources → Increase resource limits
- Database connection failure → Verify `POSTGRES_*` environment variables
Health Check Failures¶
Application unhealthy:
```bash
# Check health endpoint directly
docker exec redhound-app curl -v http://localhost:8000/health

# Check dependency connectivity
docker exec redhound-app ping postgres
docker exec redhound-app ping redis
```
Database connection issues:
```bash
# Test PostgreSQL connection
docker exec redhound-postgres pg_isready -U redhound

# Check PostgreSQL logs
docker-compose logs postgres
```
Performance Issues¶
High memory usage:
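The diagnostic command is not reproduced above; `docker stats` is the usual first check (standard Docker, not project-specific):

```bash
docker stats --no-stream   # one-shot snapshot of per-container CPU and memory
```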
Slow queries:
```bash
# Enable query logging in PostgreSQL
# Check slow query logs
docker-compose logs postgres | grep "slow query"
```
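One way to enable slow-query logging is via `log_min_duration_statement`; this is standard PostgreSQL (not a project-specific procedure), and the 500 ms threshold is illustrative:

```bash
# Log every statement slower than 500 ms; ALTER SYSTEM persists the
# setting, pg_reload_conf() applies it without a restart.
docker exec redhound-postgres psql -U redhound redhound \
  -c "ALTER SYSTEM SET log_min_duration_statement = '500ms';"
docker exec redhound-postgres psql -U redhound redhound \
  -c "SELECT pg_reload_conf();"
```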
Data Issues¶
Corrupted or missing data:
1. Check volume mounts: `docker volume inspect redhound_postgres_data`
2. Check service logs: `docker-compose logs <service>`
3. Verify database connectivity and permissions
Operational Procedures¶
Service Restart¶
Restart all services:
Restart specific service:
Graceful restart (zero downtime):
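The original commands are not shown above; typical equivalents are sketched below. Note that with a single app instance a "graceful" restart still incurs a brief gap; true zero downtime needs multiple instances behind a load balancer.

```bash
# Restart all services
docker-compose restart

# Restart a specific service
docker-compose restart app

# Recreate one service without touching its dependencies
docker-compose up -d --no-deps --build app
```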
Service Scaling¶
Scale application:
Note: Use a load balancer for multiple instances; Docker Compose scaling is limited for production use.
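Subject to the caveat above, one way to scale with Compose is `--scale` (the instance count is illustrative, and fixed host-port mappings must be removed first):

```bash
# Run three app instances; they must not all bind host port 8000.
docker-compose up -d --scale app=3
```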
Maintenance Windows¶
Scheduled maintenance:
1. Notify users of maintenance window
2. Stop services: `docker-compose down`
3. Perform maintenance (updates, backups, etc.)
4. Start services: `docker-compose up -d`
5. Verify health: `curl http://localhost:8000/health`
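The steps above can be sketched as a script; the backup command is a placeholder for whatever backup procedure the project actually uses, and the sleep duration is illustrative:

```bash
set -euo pipefail

# 1. Notify users out of band before starting.

# Back up while the database is still running (placeholder command).
docker exec redhound-postgres pg_dumpall -U redhound > backup.sql

docker-compose down        # 2. stop services
# 3. perform maintenance here (updates, etc.)
docker-compose up -d       # 4. start services

sleep 10                   # give health checks time to run (illustrative)
curl -f http://localhost:8000/health   # 5. verify health
```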
References¶
- Docker Setup - Detailed Docker configuration
- CI/CD Pipeline - Continuous integration and deployment
- Monitoring and Metrics - Observability setup
- Configuration Reference - Complete configuration options