
Error Handling & Retry Logic

This document provides a comprehensive guide to the error handling and retry infrastructure in Redhound. The system is designed to handle transient failures, API rate limits, and network errors gracefully while preventing cascading failures and ensuring system reliability.

Overview

The error handling system provides:

  • Automatic Retries: Exponential backoff with jitter for transient failures
  • Circuit Breakers: Prevent cascading failures by failing fast when services are unhealthy
  • Error Classification: Intelligent categorization of errors to determine retry behavior
  • Rate Limit Handling: Extended backoff periods for API rate limits (HTTP 429)
  • Async Support: Full support for both synchronous and asynchronous operations
  • Observability: Integrated with structured logging and Prometheus metrics

Architecture

Components

The error handling system consists of three main components:

  1. Error Classification (ErrorType enum)
     • Categorizes exceptions into: TRANSIENT, PERMANENT, RATE_LIMIT, TIMEOUT
     • Determines retry behavior based on error type

  2. Retry Logic (@retry_with_backoff decorator)
     • Exponential backoff with configurable parameters
     • Jitter to prevent the thundering herd problem
     • Extended backoff for rate limits
     • Support for both sync and async functions

  3. Circuit Breaker (CircuitBreaker class)
     • Three states: CLOSED (healthy), OPEN (failing), HALF_OPEN (recovery)
     • Per-vendor isolation to prevent one failing service from affecting others
     • Configurable failure threshold and recovery timeout

Error Classification

Errors are automatically classified into four types:

class ErrorType(Enum):
    TRANSIENT = "transient"      # Network hiccups, server errors (500-504)
    PERMANENT = "permanent"      # Client errors (400, 401, 403, 404)
    RATE_LIMIT = "rate_limit"    # HTTP 429, quota exceeded
    TIMEOUT = "timeout"          # Request timeouts

Retryable Errors: TRANSIENT, TIMEOUT, RATE_LIMIT
Non-Retryable Errors: PERMANENT
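
The classification rules above can be sketched as a mapping from HTTP status codes to an ErrorType. The helper below is an illustrative sketch (classify_status is a hypothetical name, not necessarily the module's actual function); timeout exceptions are classified separately from status codes:

```python
from enum import Enum


class ErrorType(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    RATE_LIMIT = "rate_limit"
    TIMEOUT = "timeout"


def classify_status(status_code: int) -> ErrorType:
    """Map an HTTP status code to an ErrorType (illustrative sketch)."""
    if status_code == 429:
        return ErrorType.RATE_LIMIT       # quota exceeded: extended backoff
    if 500 <= status_code <= 504:
        return ErrorType.TRANSIENT        # server errors: safe to retry
    return ErrorType.PERMANENT            # 400, 401, 403, 404, ...: do not retry
```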

Retry Strategy

The system uses exponential backoff with jitter:

  1. First Retry: Wait base_delay seconds (default: 1.0s)
  2. Second Retry: Wait base_delay * multiplier seconds (default: 2.0s)
  3. Third Retry: Wait base_delay * multiplier^2 seconds (default: 4.0s)
  4. Rate Limits: Use extended rate_limit_base_delay (default: 60.0s)

Jitter adds randomness (±20% by default) to prevent synchronized retries from multiple instances.
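
The delay schedule above reduces to one formula: exponential growth from base_delay, capped at max_delay, then multiplied by a random jitter factor. A minimal sketch (backoff_delay is an illustrative name, not the module's actual helper):

```python
import random


def backoff_delay(attempt: int, base_delay: float = 1.0, multiplier: float = 2.0,
                  max_delay: float = 60.0, jitter: float = 0.2) -> float:
    """Delay before retry `attempt` (1-based): exponential, capped, jittered."""
    delay = min(base_delay * multiplier ** (attempt - 1), max_delay)
    # Jitter spreads retries from many instances across +/-20% of the delay,
    # preventing synchronized thundering-herd retries.
    return delay * random.uniform(1 - jitter, 1 + jitter)
```

With jitter=0.0 the first three retries wait 1.0s, 2.0s, and 4.0s, matching the schedule listed above.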

Circuit Breaker Pattern

Circuit breakers protect downstream services by failing fast when errors exceed a threshold:

  • CLOSED: Service is healthy, requests pass through normally
  • OPEN: Failure threshold exceeded, requests fail immediately without attempting
  • HALF_OPEN: Recovery period, allows test requests to check if service recovered

CLOSED ──(failures >= threshold)──> OPEN
  ↑                                   │
  │                        (recovery timeout elapses)
  │                                   ↓
  └──(test request succeeds)──── HALF_OPEN ──(test request fails)──> OPEN
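
The transitions above can be sketched as a minimal state machine. This SimpleCircuitBreaker is illustrative only, not the project's actual CircuitBreaker implementation:

```python
import time


class CircuitBreakerError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    """Minimal sketch of the CLOSED / OPEN / HALF_OPEN state machine."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"   # allow one test request through
            else:
                raise CircuitBreakerError("circuit open, failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failure in HALF_OPEN, or crossing the threshold, opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"          # success closes the circuit
            return result
```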

Configuration

Retry Configuration

Configure retry behavior using RetryConfig:

from backend.utils.error_handling import retry_with_backoff, RetryConfig

@retry_with_backoff(
    config=RetryConfig(
        max_attempts=5,           # Maximum retry attempts
        base_delay=2.0,           # Initial delay in seconds
        max_delay=120.0,          # Maximum delay cap
        multiplier=2.0,           # Exponential backoff multiplier
        jitter=0.2,               # ±20% random jitter
        rate_limit_base_delay=60.0  # Extended delay for rate limits
    ),
    context_name="my_operation"  # Context for logging
)
def my_api_call():
    # API call logic
    pass

Environment Variables

Configure system-wide error handling via environment variables:

# Retry settings (per operation type)
export REDHOUND_ERROR_HANDLING__DEFAULT__MAX_ATTEMPTS=3
export REDHOUND_ERROR_HANDLING__DEFAULT__BASE_DELAY=1.0
export REDHOUND_ERROR_HANDLING__DEFAULT__MAX_DELAY=60.0

# Data vendor specific settings
export REDHOUND_ERROR_HANDLING__DATA_VENDORS__MAX_ATTEMPTS=5
export REDHOUND_ERROR_HANDLING__DATA_VENDORS__BASE_DELAY=2.0

# LLM provider specific settings
export REDHOUND_ERROR_HANDLING__LLM_PROVIDERS__MAX_ATTEMPTS=5
export REDHOUND_ERROR_HANDLING__LLM_PROVIDERS__BASE_DELAY=2.0
export REDHOUND_ERROR_HANDLING__LLM_PROVIDERS__RATE_LIMIT_BASE_DELAY=60.0

# Circuit breaker settings
export REDHOUND_ERROR_HANDLING__CIRCUIT_BREAKER__FAILURE_THRESHOLD=5
export REDHOUND_ERROR_HANDLING__CIRCUIT_BREAKER__RECOVERY_TIMEOUT=60

# Timeout settings (seconds)
export REDHOUND_ERROR_HANDLING__TIMEOUT_DATA_VENDOR=30
export REDHOUND_ERROR_HANDLING__TIMEOUT_LLM_PROVIDER=120
export REDHOUND_ERROR_HANDLING__TIMEOUT_DATABASE=10
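
The double-underscore naming above typically maps flat environment variables onto a nested configuration tree (the convention used by pydantic-settings' env_nested_delimiter, for example). A stdlib-only sketch of that mapping, with a hypothetical load_env_config helper:

```python
import os


def load_env_config(prefix: str = "REDHOUND_") -> dict:
    """Build a nested dict from PREFIX_SECTION__SUBSECTION__KEY=value variables."""
    config: dict = {}
    for name, value in os.environ.items():
        if not name.startswith(prefix):
            continue
        node = config
        # "ERROR_HANDLING__DEFAULT__MAX_ATTEMPTS" -> ["error_handling", "default"] + "max_attempts"
        *parents, leaf = name[len(prefix):].lower().split("__")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config
```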

See Configuration Reference for complete options.

Circuit Breaker Configuration

Configure circuit breakers via typed configuration:

from backend.config import get_config

config = get_config()

# Circuit breaker settings
config.error_handling.circuit_breaker.failure_threshold = 5
config.error_handling.circuit_breaker.recovery_timeout = 60
config.error_handling.circuit_breaker.expected_exception = Exception

Usage Examples

Basic Retry Decorator

Apply retry logic to any function:

import requests

from backend.utils.error_handling import retry_with_backoff, RetryConfig

@retry_with_backoff(
    config=RetryConfig(max_attempts=3, base_delay=1.0),
    context_name="fetch_stock_data"
)
def fetch_stock_data(symbol: str):
    """Fetch stock data with automatic retry on transient failures."""
    response = requests.get(f"https://api.example.com/stock/{symbol}")
    response.raise_for_status()
    return response.json()

Async Functions

Full support for async operations:

import aiohttp

from backend.utils.error_handling import async_retry_with_backoff, RetryConfig

@async_retry_with_backoff(
    config=RetryConfig(max_attempts=5, base_delay=2.0),
    context_name="fetch_llm_response"
)
async def fetch_llm_response(prompt: str):
    """Fetch LLM response with automatic retry."""
    async with aiohttp.ClientSession() as session:
        async with session.post("https://api.openai.com/v1/chat/completions", json={
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}]
        }) as response:
            response.raise_for_status()
            return await response.json()

Custom Retry Conditions

Define custom retry logic:

from backend.utils.error_handling import retry_with_backoff, RetryConfig

def should_retry_on_custom_error(exc: Exception) -> bool:
    """Custom retry condition."""
    return isinstance(exc, (CustomTransientError, CustomTimeoutError))

@retry_with_backoff(
    config=RetryConfig(max_attempts=3, base_delay=1.0),
    retry_on=should_retry_on_custom_error,
    context_name="custom_operation"
)
def custom_operation():
    # Operation logic
    pass

Circuit Breaker Usage

Protect services with circuit breakers:

from backend.utils.error_handling import CircuitBreaker, CircuitBreakerError

# Create circuit breaker for external service
breaker = CircuitBreaker(
    failure_threshold=5,      # Open after 5 failures
    recovery_timeout=60,      # Wait 60s before attempting recovery
    name="alpha_vantage_api"
)

def fetch_market_data(symbol: str):
    """Fetch data with circuit breaker protection."""
    try:
        # Circuit breaker wraps the call
        return breaker.call(make_api_request, symbol)
    except CircuitBreakerError:
        # Service is currently unhealthy, fail fast
        logger.warning("Circuit breaker open for alpha_vantage_api")
        return cached_data(symbol)  # Fallback to cache

Async Circuit Breaker

Circuit breakers work with async functions:

async def fetch_async_data(symbol: str):
    """Async fetch with circuit breaker."""
    try:
        result = await breaker.call_async(make_async_request, symbol)
        return result
    except CircuitBreakerError:
        logger.warning("Circuit breaker open, using fallback")
        return await fallback_data(symbol)

Combining Retry and Circuit Breaker

Use both patterns together for robust error handling:

from backend.utils.error_handling import retry_with_backoff, RetryConfig, CircuitBreaker

# Create per-vendor circuit breakers
alpha_vantage_breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=60,
    name="alpha_vantage"
)

@retry_with_backoff(
    config=RetryConfig(
        max_attempts=3,
        base_delay=1.0,
        rate_limit_base_delay=60.0
    ),
    context_name="alpha_vantage_api"
)
def fetch_alpha_vantage_data(symbol: str):
    """Fetch with both retry and circuit breaker."""
    # Circuit breaker prevents retries when service is unhealthy
    return alpha_vantage_breaker.call(make_api_call, symbol)

Integration Points

Data Vendors

Error handling is integrated across all data vendor modules:

Alpha Vantage (redhound/data/vendors/alpha_vantage_common.py)

@retry_with_backoff(
    config=RetryConfig(
        max_attempts=3,
        base_delay=1.0,
        rate_limit_base_delay=60.0  # Extended backoff for rate limits
    ),
    context_name="alpha_vantage_api"
)
def _make_api_request(function_name: str, params: dict) -> str:
    """Make Alpha Vantage API request with retry and rate limit handling."""
    # API call logic with automatic retry

FMP (redhound/data/vendors/fmp.py)

@retry_with_backoff(
    config=RetryConfig(max_attempts=3, base_delay=1.0),
    context_name="fmp_get_stock"
)
def get_stock_data_online(symbol: str, start_date: str, end_date: str):
    """Fetch FMP data with automatic retry on network failures."""
    # Data fetch logic

All FMP functions have retry decorators:

  • get_stock_data_online()
  • get_balance_sheet()
  • get_cashflow()
  • get_income_statement()
  • get_insider_transactions()

OpenAI Vendor (redhound/data/vendors/openai.py)

@retry_with_backoff(
    config=RetryConfig(max_attempts=5, base_delay=2.0),
    context_name="openai_stock_news"
)
def get_stock_news_openai(query: str, start_date: str, end_date: str):
    """Fetch OpenAI news data with retry for LLM API calls."""
    # LLM API call logic

Data Interface (redhound/data/interface.py)

The data routing layer implements per-vendor circuit breakers:

from backend.utils.error_handling import retry_with_backoff, RetryConfig, CircuitBreaker, CircuitBreakerError

# Per-vendor circuit breakers (isolate failures)
_circuit_breakers: dict[str, CircuitBreaker] = {}

def _get_circuit_breaker(vendor: str) -> CircuitBreaker:
    """Get or create circuit breaker for vendor."""
    if vendor not in _circuit_breakers:
        _circuit_breakers[vendor] = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=60,
            name=f"vendor_{vendor}"
        )
    return _circuit_breakers[vendor]

def route_to_vendor(method: str, *args, **kwargs):
    """Route data call to vendor with retry and circuit breaker."""
    for impl_func, vendor_name in vendor_methods:
        retry_config = RetryConfig(max_attempts=3, base_delay=1.0)
        breaker = _get_circuit_breaker(vendor_name)

        @retry_with_backoff(
            config=retry_config,
            context_name=f"{vendor_name}_{method}"
        )
        def _call_vendor():
            return breaker.call(impl_func, *args, **kwargs)

        try:
            return _call_vendor()
        except CircuitBreakerError:
            logger.warning(f"Circuit breaker open for {vendor_name}, trying next vendor")
            continue  # Try next vendor
        except Exception as e:
            logger.error(f"Error calling {vendor_name}: {e}")
            continue  # Try next vendor

LLM Providers

LLM Factory (redhound/data/llm_factory.py)

@retry_with_backoff(
    config=RetryConfig(max_attempts=5, base_delay=2.0),
    context_name="llm_creation"
)
def create_llm(provider: str, model: str, **kwargs):
    """Create LLM instance with retry on initialization failures."""
    # LLM initialization logic

Trading Graph (redhound/orchestration/trading_graph.py)

LLM instances are created via llm_factory with retry logic. LangChain provides built-in retry for LLM invocations, so no additional wrapping is needed at the orchestration layer.

Error Logging and Context

Structured Logging

All retry attempts and errors are logged with rich context:

# Retry attempt logs
logger.info(
    "retry_attempt",
    attempt=attempt_number,
    max_attempts=max_attempts,
    delay=delay_seconds,
    error_type=error_type,
    context=context_name
)

# Retry exhausted logs (with full traceback)
logger.exception(
    "retry_exhausted",
    max_attempts=max_attempts,
    context=context_name,
    error=str(exception)
)

# Circuit breaker state changes
logger.warning(
    "circuit_breaker_opened",
    vendor=vendor_name,
    failure_count=failure_count,
    threshold=failure_threshold
)

Context Preservation

Error context is preserved across retries:

  • Original error message and stack trace
  • Request parameters (sanitized to avoid logging sensitive data)
  • Retry attempt history
  • Vendor/service identifier
  • Correlation IDs for distributed tracing

Metrics and Monitoring

Prometheus Metrics

The error handling system integrates with Prometheus metrics:

# Retry attempt counter
record_retry(
    vendor=vendor_name,
    operation=operation_name,
    attempt=attempt_number
)

# Error classification counter
record_error_by_type(
    vendor=vendor_name,
    operation=operation_name,
    error_type=error_type
)

# API call metrics (from metrics.py)
increment_counter(api_calls_total, labels)
increment_counter(api_errors_total, labels)
record_latency(execution_time_seconds, labels)

Available Metrics

  • api_calls_total: Total API calls by vendor/operation
  • api_errors_total: Total errors by vendor/operation
  • execution_time_seconds: Latency histogram for operations
  • retry_attempts_total: Retry attempts by vendor/operation/attempt
  • error_type_total: Errors classified by type (transient/permanent/rate_limit/timeout)

Grafana Dashboards

Monitor error handling in Grafana:

  • Retry Rate: Percentage of operations requiring retries
  • Error Rate by Type: Breakdown of transient vs permanent errors
  • Circuit Breaker States: Per-vendor circuit breaker health
  • Rate Limit Frequency: HTTP 429 occurrences by vendor
  • Recovery Time: Time spent in retry backoff

Best Practices

1. Choose Appropriate Max Attempts

  • Data vendors: 3-5 attempts (higher for critical data)
  • LLM providers: 5 attempts (higher latency, more expensive failures)
  • Database operations: 2-3 attempts (fast fail to avoid bottlenecks)

2. Configure Backoff Delays

  • Network operations: base_delay=1.0, max_delay=60.0
  • LLM API calls: base_delay=2.0, max_delay=120.0 (higher latency tolerance)
  • Rate limits: rate_limit_base_delay=60.0 (respect API quotas)

3. Use Circuit Breakers for External Services

  • Isolate failures per vendor to prevent cascading failures
  • Set failure_threshold based on expected error rate (typically 3-5)
  • Configure recovery_timeout based on service SLA (typically 60-120s)

4. Log with Context

Always include relevant context in error logs:

logger.exception(
    "operation_failed",
    vendor=vendor_name,
    symbol=symbol,
    operation=operation_name,
    correlation_id=correlation_id,
    error=str(exc)
)

5. Implement Fallbacks

Provide graceful degradation when errors occur:

try:
    data = fetch_from_primary_vendor(symbol)
except Exception:
    logger.warning("Primary vendor failed, using fallback")
    data = fetch_from_cache(symbol) or fetch_from_secondary_vendor(symbol)

6. Monitor Retry Rates

High retry rates indicate:

  • Unreliable external services
  • Network issues
  • Insufficient rate limits

Set up alerts for retry rates exceeding thresholds (e.g., >10% of requests).

7. Sanitize Logged Data

Never log sensitive information in error context:

# BAD: Logs API key
logger.error("API call failed", api_key=api_key)

# GOOD: Sanitize sensitive data
logger.error("API call failed", api_key="***REDACTED***")

8. Test Retry Behavior

Write tests to verify retry logic:

from unittest.mock import Mock

from backend.utils.error_handling import retry_with_backoff, RetryConfig


def test_retry_on_transient_failure():
    """Verify retry on network errors."""
    mock = Mock(side_effect=[ConnectionError(), {"data": "success"}])

    @retry_with_backoff(config=RetryConfig(max_attempts=3))
    def operation():
        return mock()

    result = operation()
    assert result == {"data": "success"}
    assert mock.call_count == 2  # First failed, second succeeded

Troubleshooting

High Retry Rates

Symptom: Many operations requiring multiple retries

Causes:

  • Unreliable network connection
  • External service degradation
  • Rate limits exceeded

Solutions:

  1. Check network connectivity
  2. Verify external service status
  3. Increase rate limit quotas or reduce request frequency
  4. Add caching to reduce API calls

Circuit Breakers Constantly Opening

Symptom: Circuit breakers frequently entering OPEN state

Causes:

  • Service consistently failing
  • Failure threshold too low
  • Insufficient recovery timeout

Solutions:

  1. Investigate the root cause of service failures
  2. Increase failure_threshold if errors are expected
  3. Increase recovery_timeout to allow the service to recover
  4. Implement health checks before reopening the circuit

Rate Limit Errors Not Handled

Symptom: HTTP 429 errors not triggering extended backoff

Causes:

  • Error classification not detecting rate limit errors
  • rate_limit_base_delay too short
  • Rate limit headers not parsed

Solutions:

  1. Verify error messages contain "rate limit", "429", or "too many requests"
  2. Increase rate_limit_base_delay (default: 60s)
  3. Check if the vendor returns a Retry-After header and respect it
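
Respecting Retry-After can be sketched as below. Per RFC 9110, the header carries either an integer number of seconds or an HTTP-date; retry_after_seconds is a hypothetical helper name:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header: str, fallback: float = 60.0) -> float:
    """Parse a Retry-After header value: delta-seconds or an HTTP-date."""
    try:
        return max(0.0, float(header))          # e.g. "120"
    except ValueError:
        pass
    try:                                        # e.g. "Wed, 21 Oct 2026 07:28:00 GMT"
        when = parsedate_to_datetime(header)
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return fallback                         # unparseable: fall back to default delay
```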

Retries Exhausted Quickly

Symptom: Operations failing after all retries exhausted

Causes:

  • Max attempts too low
  • Base delay too short
  • Permanent errors being retried

Solutions:

  1. Increase max_attempts for critical operations
  2. Increase base_delay to give the service time to recover
  3. Ensure permanent errors (400, 401, 403, 404) are not retried

Memory Leaks from Circuit Breakers

Symptom: Increasing memory usage over time

Causes:

  • Circuit breaker instances not reused
  • Too many circuit breakers created

Solutions:

  1. Use _get_circuit_breaker() to reuse instances per vendor
  2. Limit circuit breaker creation to a fixed set of vendors
  3. Monitor the circuit breaker count in metrics

Testing

Unit Tests

Verify error handling components:

# Run error handling unit tests
uv run pytest tests/utils/test_error_handling.py -v

Tests cover:

  • Error classification correctness
  • Retry logic with exponential backoff
  • Jitter randomization
  • Max retry limit enforcement
  • Circuit breaker state transitions
  • Async function retry behavior

Integration Tests

Verify end-to-end error recovery:

# Run error handling integration tests
uv run pytest tests/integration/test_error_handling_integration.py -v

Tests cover:

  • Retry scenarios with mock failures
  • Circuit breaker behavior with real vendor calls
  • Rate limit handling (HTTP 429 with extended backoff)
  • Concurrent retry scenarios
  • Metrics verification

Manual Testing

Test retry behavior manually:

from backend.utils.error_handling import retry_with_backoff, RetryConfig

@retry_with_backoff(
    config=RetryConfig(max_attempts=3, base_delay=1.0, jitter=0.0),
    context_name="test_operation"
)
def test_operation():
    """Simulate transient failure."""
    import random
    if random.random() < 0.7:  # 70% failure rate
        raise ConnectionError("Simulated network error")
    return "success"

# Run multiple times to observe retry behavior
for i in range(10):
    try:
        result = test_operation()
        print(f"Attempt {i+1}: {result}")
    except Exception as e:
        print(f"Attempt {i+1}: Failed after retries - {e}")

References