
Error Handling & Retry Logic

This document provides a comprehensive guide to the error handling and retry infrastructure in Redhound. The system is designed to handle transient failures, API rate limits, and network errors gracefully while preventing cascading failures and ensuring system reliability.

Overview

The error handling system provides:

  • Automatic Retries: Exponential backoff with jitter for transient failures
  • Circuit Breakers: Prevent cascading failures by failing fast when services are unhealthy
  • Error Classification: Intelligent categorization of errors to determine retry behavior
  • Rate Limit Handling: Extended backoff periods for API rate limits (HTTP 429)
  • Async Support: Full support for both synchronous and asynchronous operations
  • Observability: Integrated with structured logging and Prometheus metrics

Architecture

Components

The error handling system consists of three main components:

  1. Error Classification (ErrorType enum)
     • Categorizes exceptions into: TRANSIENT, PERMANENT, RATE_LIMIT, TIMEOUT
     • Determines retry behavior based on error type

  2. Retry Logic (@retry_with_backoff decorator)
     • Exponential backoff with configurable parameters
     • Jitter to prevent the thundering herd problem
     • Extended backoff for rate limits
     • Support for both sync and async functions

  3. Circuit Breaker (CircuitBreaker class)
     • Three states: CLOSED (healthy), OPEN (failing), HALF_OPEN (recovery)
     • Per-vendor isolation to prevent one failing service from affecting others
     • Configurable failure threshold and recovery timeout

Error Classification

Errors are automatically classified into four types:

class ErrorType(Enum):
    TRANSIENT = "transient"      # Network hiccups, server errors (500-504)
    PERMANENT = "permanent"      # Client errors (400, 401, 403, 404)
    RATE_LIMIT = "rate_limit"    # HTTP 429, quota exceeded
    TIMEOUT = "timeout"          # Request timeouts

Retryable Errors: TRANSIENT, TIMEOUT, RATE_LIMIT
Non-Retryable Errors: PERMANENT
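
The classification rules above can be sketched as a mapping from HTTP status codes to an ErrorType. The helper below is an illustrative sketch (classify_status is a hypothetical name, not necessarily the module's actual function); timeout exceptions are classified separately from status codes:

```python
from enum import Enum


class ErrorType(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    RATE_LIMIT = "rate_limit"
    TIMEOUT = "timeout"


def classify_status(status_code: int) -> ErrorType:
    """Map an HTTP status code to an ErrorType (illustrative sketch)."""
    if status_code == 429:
        return ErrorType.RATE_LIMIT       # quota exceeded: extended backoff
    if 500 <= status_code <= 504:
        return ErrorType.TRANSIENT        # server errors: safe to retry
    return ErrorType.PERMANENT            # 400, 401, 403, 404, ...: do not retry
```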

Retry Strategy

The system uses exponential backoff with jitter:

  1. First Retry: Wait base_delay seconds (default: 1.0s)
  2. Second Retry: Wait base_delay * multiplier seconds (default: 2.0s)
  3. Third Retry: Wait base_delay * multiplier^2 seconds (default: 4.0s)
  4. Rate Limits: Use extended rate_limit_base_delay (default: 60.0s)

Jitter adds randomness (±20% by default) to prevent synchronized retries from multiple instances.
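
The delay schedule above reduces to one formula: exponential growth from base_delay, capped at max_delay, then multiplied by a random jitter factor. A minimal sketch (backoff_delay is an illustrative name, not the module's actual helper):

```python
import random


def backoff_delay(attempt: int, base_delay: float = 1.0, multiplier: float = 2.0,
                  max_delay: float = 60.0, jitter: float = 0.2) -> float:
    """Delay before retry `attempt` (1-based): exponential, capped, jittered."""
    delay = min(base_delay * multiplier ** (attempt - 1), max_delay)
    # Jitter spreads retries from many instances across +/-20% of the delay,
    # preventing synchronized thundering-herd retries.
    return delay * random.uniform(1 - jitter, 1 + jitter)
```

With jitter=0.0 the first three retries wait 1.0s, 2.0s, and 4.0s, matching the schedule listed above.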

Circuit Breaker Pattern

Circuit breakers protect downstream services by failing fast when errors exceed a threshold:

  • CLOSED: Service is healthy, requests pass through normally
  • OPEN: Failure threshold exceeded, requests fail immediately without attempting
  • HALF_OPEN: Recovery period, allows test requests to check if service recovered

CLOSED ──(failures >= threshold)──> OPEN
  ↑                                   │
  │                        (recovery timeout elapses)
  │                                   ↓
  └──(test request succeeds)──── HALF_OPEN ──(test request fails)──> OPEN
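
The transitions above can be sketched as a minimal state machine. This SimpleCircuitBreaker is illustrative only, not the project's actual CircuitBreaker implementation:

```python
import time


class CircuitBreakerError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    """Minimal sketch of the CLOSED / OPEN / HALF_OPEN state machine."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"   # allow one test request through
            else:
                raise CircuitBreakerError("circuit open, failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failure in HALF_OPEN, or crossing the threshold, opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"          # success closes the circuit
            return result
```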

Configuration

Retry Configuration

Configure retry behavior using RetryConfig:

from backend.utils.error_handling import retry_with_backoff, RetryConfig

@retry_with_backoff(
    config=RetryConfig(
        max_attempts=5,           # Maximum retry attempts
        base_delay=2.0,           # Initial delay in seconds
        max_delay=120.0,          # Maximum delay cap
        multiplier=2.0,           # Exponential backoff multiplier
        jitter=0.2,               # ±20% random jitter
        rate_limit_base_delay=60.0  # Extended delay for rate limits
    ),
    context_name="my_operation"  # Context for logging
)
def my_api_call():
    # API call logic
    pass

Environment Variables

Configure system-wide error handling via environment variables:

# Retry settings (per operation type)
export REDHOUND_ERROR_HANDLING__DEFAULT__MAX_ATTEMPTS=3
export REDHOUND_ERROR_HANDLING__DEFAULT__BASE_DELAY=1.0
export REDHOUND_ERROR_HANDLING__DEFAULT__MAX_DELAY=60.0

# Data vendor specific settings
export REDHOUND_ERROR_HANDLING__DATA_VENDORS__MAX_ATTEMPTS=5
export REDHOUND_ERROR_HANDLING__DATA_VENDORS__BASE_DELAY=2.0

# LLM provider specific settings
export REDHOUND_ERROR_HANDLING__LLM_PROVIDERS__MAX_ATTEMPTS=5
export REDHOUND_ERROR_HANDLING__LLM_PROVIDERS__BASE_DELAY=2.0
export REDHOUND_ERROR_HANDLING__LLM_PROVIDERS__RATE_LIMIT_BASE_DELAY=60.0

# Circuit breaker settings
export REDHOUND_ERROR_HANDLING__CIRCUIT_BREAKER__FAILURE_THRESHOLD=5
export REDHOUND_ERROR_HANDLING__CIRCUIT_BREAKER__RECOVERY_TIMEOUT=60

# Timeout settings (seconds)
export REDHOUND_ERROR_HANDLING__TIMEOUT_DATA_VENDOR=30
export REDHOUND_ERROR_HANDLING__TIMEOUT_LLM_PROVIDER=120
export REDHOUND_ERROR_HANDLING__TIMEOUT_DATABASE=10
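
The double-underscore naming above typically maps flat environment variables onto a nested configuration tree (the convention used by pydantic-settings' env_nested_delimiter, for example). A stdlib-only sketch of that mapping, with a hypothetical load_env_config helper:

```python
import os


def load_env_config(prefix: str = "REDHOUND_") -> dict:
    """Build a nested dict from PREFIX_SECTION__SUBSECTION__KEY=value variables."""
    config: dict = {}
    for name, value in os.environ.items():
        if not name.startswith(prefix):
            continue
        node = config
        # "ERROR_HANDLING__DEFAULT__MAX_ATTEMPTS" -> ["error_handling", "default"] + "max_attempts"
        *parents, leaf = name[len(prefix):].lower().split("__")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config
```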

See Configuration Reference for complete options.

Circuit Breaker Configuration

Configure circuit breakers via typed configuration:

from backend.config import get_config

config = get_config()

# Circuit breaker settings
config.error_handling.circuit_breaker.failure_threshold = 5
config.error_handling.circuit_breaker.recovery_timeout = 60
config.error_handling.circuit_breaker.expected_exception = Exception

Usage Examples

Basic Retry Decorator

Apply retry logic to any function:

import requests

from backend.utils.error_handling import retry_with_backoff, RetryConfig

@retry_with_backoff(
    config=RetryConfig(max_attempts=3, base_delay=1.0),
    context_name="fetch_stock_data"
)
def fetch_stock_data(symbol: str):
    """Fetch stock data with automatic retry on transient failures."""
    response = requests.get(f"https://api.example.com/stock/{symbol}")
    response.raise_for_status()
    return response.json()

Async Functions

Full support for async operations:

import aiohttp

from backend.utils.error_handling import async_retry_with_backoff, RetryConfig

@async_retry_with_backoff(
    config=RetryConfig(max_attempts=5, base_delay=2.0),
    context_name="fetch_llm_response"
)
async def fetch_llm_response(prompt: str):
    """Fetch LLM response with automatic retry."""
    async with aiohttp.ClientSession() as session:
        async with session.post("https://api.openai.com/v1/chat/completions", json={
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}]
        }) as response:
            response.raise_for_status()
            return await response.json()

Custom Retry Conditions

Define custom retry logic:

from backend.utils.error_handling import retry_with_backoff, RetryConfig

def should_retry_on_custom_error(exc: Exception) -> bool:
    """Custom retry condition."""
    return isinstance(exc, (CustomTransientError, CustomTimeoutError))

@retry_with_backoff(
    config=RetryConfig(max_attempts=3, base_delay=1.0),
    retry_on=should_retry_on_custom_error,
    context_name="custom_operation"
)
def custom_operation():
    # Operation logic
    pass

Circuit Breaker Usage

Protect services with circuit breakers:

from backend.utils.error_handling import CircuitBreaker, CircuitBreakerError

# Create circuit breaker for external service
breaker = CircuitBreaker(
    failure_threshold=5,      # Open after 5 failures
    recovery_timeout=60,      # Wait 60s before attempting recovery
    name="alpha_vantage_api"
)

def fetch_market_data(symbol: str):
    """Fetch data with circuit breaker protection."""
    try:
        # Circuit breaker wraps the call
        return breaker.call(make_api_request, symbol)
    except CircuitBreakerError:
        # Service is currently unhealthy, fail fast
        logger.warning("Circuit breaker open for alpha_vantage_api")
        return cached_data(symbol)  # Fallback to cache

Async Circuit Breaker

Circuit breakers work with async functions:

async def fetch_async_data(symbol: str):
    """Async fetch with circuit breaker."""
    try:
        result = await breaker.call_async(make_async_request, symbol)
        return result
    except CircuitBreakerError:
        logger.warning("Circuit breaker open, using fallback")
        return await fallback_data(symbol)

Combining Retry and Circuit Breaker

Use both patterns together for robust error handling:

from backend.utils.error_handling import retry_with_backoff, RetryConfig, CircuitBreaker

# Create per-vendor circuit breakers
alpha_vantage_breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=60,
    name="alpha_vantage"
)

@retry_with_backoff(
    config=RetryConfig(
        max_attempts=3,
        base_delay=1.0,
        rate_limit_base_delay=60.0
    ),
    context_name="alpha_vantage_api"
)
def fetch_alpha_vantage_data(symbol: str):
    """Fetch with both retry and circuit breaker."""
    # Circuit breaker prevents retries when service is unhealthy
    return alpha_vantage_breaker.call(make_api_call, symbol)

Integration Points

Data Vendors

Error handling is integrated across all data vendor modules:

Alpha Vantage (redhound/data/vendors/alpha_vantage_common.py)

@retry_with_backoff(
    config=RetryConfig(
        max_attempts=3,
        base_delay=1.0,
        rate_limit_base_delay=60.0  # Extended backoff for rate limits
    ),
    context_name="alpha_vantage_api"
)
def _make_api_request(function_name: str, params: dict) -> str:
    """Make Alpha Vantage API request with retry and rate limit handling."""
    # API call logic with automatic retry

FMP (redhound/data/vendors/fmp.py)

@retry_with_backoff(
    config=RetryConfig(max_attempts=3, base_delay=1.0),
    context_name="fmp_get_stock"
)
def get_stock_data_online(symbol: str, start_date: str, end_date: str):
    """Fetch FMP data with automatic retry on network failures."""
    # Data fetch logic

All FMP functions have retry decorators:

  • get_stock_data_online()
  • get_balance_sheet()
  • get_cashflow()
  • get_income_statement()
  • get_insider_transactions()

OpenAI Vendor (redhound/data/vendors/openai.py)

@retry_with_backoff(
    config=RetryConfig(max_attempts=5, base_delay=2.0),
    context_name="openai_stock_news"
)
def get_stock_news_openai(query: str, start_date: str, end_date: str):
    """Fetch OpenAI news data with retry for LLM API calls."""
    # LLM API call logic

Data Interface (redhound/data/interface.py)

The data routing layer implements per-vendor circuit breakers:

from backend.utils.error_handling import retry_with_backoff, RetryConfig, CircuitBreaker, CircuitBreakerError

# Per-vendor circuit breakers (isolate failures)
_circuit_breakers: dict[str, CircuitBreaker] = {}

def _get_circuit_breaker(vendor: str) -> CircuitBreaker:
    """Get or create circuit breaker for vendor."""
    if vendor not in _circuit_breakers:
        _circuit_breakers[vendor] = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=60,
            name=f"vendor_{vendor}"
        )
    return _circuit_breakers[vendor]

def route_to_vendor(method: str, *args, **kwargs):
    """Route data call to vendor with retry and circuit breaker."""
    for impl_func, vendor_name in vendor_methods:
        retry_config = RetryConfig(max_attempts=3, base_delay=1.0)
        breaker = _get_circuit_breaker(vendor_name)

        @retry_with_backoff(
            config=retry_config,
            context_name=f"{vendor_name}_{method}"
        )
        def _call_vendor():
            return breaker.call(impl_func, *args, **kwargs)

        try:
            return _call_vendor()
        except CircuitBreakerError:
            logger.warning(f"Circuit breaker open for {vendor_name}, trying next vendor")
            continue  # Try next vendor
        except Exception as e:
            logger.error(f"Error calling {vendor_name}: {e}")
            continue  # Try next vendor

LLM Providers

LLM Factory (redhound/data/llm_factory.py)

@retry_with_backoff(
    config=RetryConfig(max_attempts=5, base_delay=2.0),
    context_name="llm_creation"
)
def create_llm(provider: str, model: str, **kwargs):
    """Create LLM instance with retry on initialization failures."""
    # LLM initialization logic

Trading Graph (redhound/orchestration/trading_graph.py)

LLM instances are created via llm_factory with retry logic. LangChain provides built-in retry for LLM invocations, so no additional wrapping is needed at the orchestration layer.

Error Logging and Context

Structured Logging

All retry attempts and errors are logged with rich context:

# Retry attempt logs
logger.info(
    "retry_attempt",
    attempt=attempt_number,
    max_attempts=max_attempts,
    delay=delay_seconds,
    error_type=error_type,
    context=context_name
)

# Retry exhausted logs (with full traceback)
logger.exception(
    "retry_exhausted",
    max_attempts=max_attempts,
    context=context_name,
    error=str(exception)
)

# Circuit breaker state changes
logger.warning(
    "circuit_breaker_opened",
    vendor=vendor_name,
    failure_count=failure_count,
    threshold=failure_threshold
)

Context Preservation

Error context is preserved across retries:

  • Original error message and stack trace
  • Request parameters (sanitized to avoid logging sensitive data)
  • Retry attempt history
  • Vendor/service identifier
  • Correlation IDs for distributed tracing

Metrics and Monitoring

Prometheus Metrics

The error handling system integrates with Prometheus metrics:

# Retry attempt counter
record_retry(
    vendor=vendor_name,
    operation=operation_name,
    attempt=attempt_number
)

# Error classification counter
record_error_by_type(
    vendor=vendor_name,
    operation=operation_name,
    error_type=error_type
)

# API call metrics (from metrics.py)
increment_counter(api_calls_total, labels)
increment_counter(api_errors_total, labels)
record_latency(execution_time_seconds, labels)

Available Metrics

  • api_calls_total: Total API calls by vendor/operation
  • api_errors_total: Total errors by vendor/operation
  • execution_time_seconds: Latency histogram for operations
  • retry_attempts_total: Retry attempts by vendor/operation/attempt
  • error_type_total: Errors classified by type (transient/permanent/rate_limit/timeout)

Grafana Dashboards

Monitor error handling in Grafana:

  • Retry Rate: Percentage of operations requiring retries
  • Error Rate by Type: Breakdown of transient vs permanent errors
  • Circuit Breaker States: Per-vendor circuit breaker health
  • Rate Limit Frequency: HTTP 429 occurrences by vendor
  • Recovery Time: Time spent in retry backoff

Best Practices

1. Choose Appropriate Max Attempts

  • Data vendors: 3-5 attempts (higher for critical data)
  • LLM providers: 5 attempts (higher latency, more expensive failures)
  • Database operations: 2-3 attempts (fast fail to avoid bottlenecks)

2. Configure Backoff Delays

  • Network operations: base_delay=1.0, max_delay=60.0
  • LLM API calls: base_delay=2.0, max_delay=120.0 (higher latency tolerance)
  • Rate limits: rate_limit_base_delay=60.0 (respect API quotas)

3. Use Circuit Breakers for External Services

  • Isolate failures per vendor to prevent cascading failures
  • Set failure_threshold based on expected error rate (typically 3-5)
  • Configure recovery_timeout based on service SLA (typically 60-120s)

4. Log with Context

Always include relevant context in error logs:

logger.exception(
    "operation_failed",
    vendor=vendor_name,
    symbol=symbol,
    operation=operation_name,
    correlation_id=correlation_id,
    error=str(exc)
)

5. Implement Fallbacks

Provide graceful degradation when errors occur:

try:
    data = fetch_from_primary_vendor(symbol)
except Exception:
    logger.warning("Primary vendor failed, using fallback")
    data = fetch_from_cache(symbol) or fetch_from_secondary_vendor(symbol)

6. Monitor Retry Rates

High retry rates indicate:

  • Unreliable external services
  • Network issues
  • Insufficient rate limits

Set up alerts for retry rates exceeding thresholds (e.g., >10% of requests).

7. Sanitize Logged Data

Never log sensitive information in error context:

# BAD: Logs API key
logger.error("API call failed", api_key=api_key)

# GOOD: Sanitize sensitive data
logger.error("API call failed", api_key="***REDACTED***")

8. Test Retry Behavior

Write tests to verify retry logic:

from unittest.mock import Mock

from backend.utils.error_handling import retry_with_backoff, RetryConfig


def test_retry_on_transient_failure():
    """Verify retry on network errors."""
    mock = Mock(side_effect=[ConnectionError(), {"data": "success"}])

    @retry_with_backoff(config=RetryConfig(max_attempts=3))
    def operation():
        return mock()

    result = operation()
    assert result == {"data": "success"}
    assert mock.call_count == 2  # First failed, second succeeded

Troubleshooting

High Retry Rates

Symptom: Many operations requiring multiple retries

Causes:

  • Unreliable network connection
  • External service degradation
  • Rate limits exceeded

Solutions:

  1. Check network connectivity
  2. Verify external service status
  3. Increase rate limit quotas or reduce request frequency
  4. Add caching to reduce API calls

Circuit Breakers Constantly Opening

Symptom: Circuit breakers frequently entering OPEN state

Causes:

  • Service consistently failing
  • Failure threshold too low
  • Insufficient recovery timeout

Solutions:

  1. Investigate the root cause of service failures
  2. Increase failure_threshold if errors are expected
  3. Increase recovery_timeout to allow the service to recover
  4. Implement health checks before reopening the circuit

Rate Limit Errors Not Handled

Symptom: HTTP 429 errors not triggering extended backoff

Causes:

  • Error classification not detecting rate limit errors
  • rate_limit_base_delay too short
  • Rate limit headers not parsed

Solutions:

  1. Verify error messages contain "rate limit", "429", or "too many requests"
  2. Increase rate_limit_base_delay (default: 60s)
  3. Check if the vendor returns a Retry-After header and respect it
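
Respecting Retry-After can be sketched as below. Per RFC 9110, the header carries either an integer number of seconds or an HTTP-date; retry_after_seconds is a hypothetical helper name:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header: str, fallback: float = 60.0) -> float:
    """Parse a Retry-After header value: delta-seconds or an HTTP-date."""
    try:
        return max(0.0, float(header))          # e.g. "120"
    except ValueError:
        pass
    try:                                        # e.g. "Wed, 21 Oct 2026 07:28:00 GMT"
        when = parsedate_to_datetime(header)
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return fallback                         # unparseable: fall back to default delay
```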

Retries Exhausted Quickly

Symptom: Operations failing after all retries exhausted

Causes:

  • Max attempts too low
  • Base delay too short
  • Permanent errors being retried

Solutions:

  1. Increase max_attempts for critical operations
  2. Increase base_delay to give the service time to recover
  3. Ensure permanent errors (400, 401, 403, 404) are not retried

Memory Leaks from Circuit Breakers

Symptom: Increasing memory usage over time

Causes:

  • Circuit breaker instances not reused
  • Too many circuit breakers created

Solutions:

  1. Use _get_circuit_breaker() to reuse instances per vendor
  2. Limit circuit breaker creation to a fixed set of vendors
  3. Monitor the circuit breaker count in metrics

Testing

Unit Tests

Verify error handling components:

# Run error handling unit tests
uv run pytest tests/utils/test_error_handling.py -v

Tests cover:

  • Error classification correctness
  • Retry logic with exponential backoff
  • Jitter randomization
  • Max retry limit enforcement
  • Circuit breaker state transitions
  • Async function retry behavior

Integration Tests

Verify end-to-end error recovery:

# Run error handling integration tests
uv run pytest tests/integration/test_error_handling_integration.py -v

Tests cover:

  • Retry scenarios with mock failures
  • Circuit breaker behavior with real vendor calls
  • Rate limit handling (HTTP 429 with extended backoff)
  • Concurrent retry scenarios
  • Metrics verification

Manual Testing

Test retry behavior manually:

from backend.utils.error_handling import retry_with_backoff, RetryConfig

@retry_with_backoff(
    config=RetryConfig(max_attempts=3, base_delay=1.0, jitter=0.0),
    context_name="test_operation"
)
def test_operation():
    """Simulate transient failure."""
    import random
    if random.random() < 0.7:  # 70% failure rate
        raise ConnectionError("Simulated network error")
    return "success"

# Run multiple times to observe retry behavior
for i in range(10):
    try:
        result = test_operation()
        print(f"Attempt {i+1}: {result}")
    except Exception as e:
        print(f"Attempt {i+1}: Failed after retries - {e}")

References