@lsvishaal
Contributor

PR Description for the Circuit Breaker


Summary

Implements a production-grade Circuit Breaker pattern to prevent cascade failures when external services (LLMs, APIs) are degraded or unavailable.

Why This Change?

Bindu agents call external LLMs (OpenAI, Anthropic, etc.) which can experience:

  • Rate limits
  • Latency spikes
  • Complete outages

The existing retry mechanism (bindu/utils/retry.py) handles transient failures but keeps retrying indefinitely on degraded services. This can:

  • Block workers waiting on slow/dead services
  • Exhaust connection pools
  • Cause cascade failures across the agent network

The circuit breaker complements retry: retry means "try again," whereas the circuit breaker means "stop trying, it's broken."

What's Included

| Component | Description |
| --- | --- |
| bindu/utils/circuit_breaker.py | Core implementation (580 lines) |
| bindu/settings.py | CircuitBreakerSettings config class |
| tests/unit/test_circuit_breaker.py | 34 async tests (TDD approach) |
| examples/circuit_breaker_demo.py | Manual verification script |

Key Features

  • Three-state FSM: CLOSED → OPEN → HALF_OPEN → CLOSED
  • Async-native: Coroutine-safe state transitions with asyncio.Lock (asyncio.Lock serializes coroutines on one event loop; it is not safe across OS threads)
  • Three usage patterns: Decorator, context manager, direct call
  • Global registry: Monitor all circuit breakers in one place
  • Configurable: Via CircuitBreakerSettings (Pydantic)
  • Observable: Sentry integration for state change alerts
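
To make the three-state cycle concrete, here is a minimal sketch of a CLOSED → OPEN → HALF_OPEN → CLOSED breaker guarded by asyncio.Lock. It is not the code in bindu/utils/circuit_breaker.py: the names SketchBreaker, CircuitOpenError, recovery_timeout, and the injectable clock are assumptions made for illustration, and the sketch lets every caller through while HALF_OPEN rather than limiting trial calls.

```python
import asyncio
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls pass through normally
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # a trial call is allowed to probe recovery


class CircuitOpenError(Exception):
    """Hypothetical error raised when a call is rejected while OPEN."""


class SketchBreaker:
    """Illustrative breaker; not the PR's implementation."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock            # injectable so tests can control time
        self._lock = asyncio.Lock()    # serializes state changes between coroutines
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    async def call(self, func, *args, **kwargs):
        # Gate: reject while OPEN, or move to HALF_OPEN once the timeout elapsed.
        async with self._lock:
            if self.state is State.OPEN:
                if self._clock() - self._opened_at < self.recovery_timeout:
                    raise CircuitOpenError("circuit is open")
                self.state = State.HALF_OPEN
        try:
            result = await func(*args, **kwargs)
        except Exception:
            async with self._lock:
                self._failures += 1
                # A failed trial call, or too many consecutive failures, opens the circuit.
                if (self.state is State.HALF_OPEN
                        or self._failures >= self.failure_threshold):
                    self.state = State.OPEN
                    self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the consecutive-failure count.
        async with self._lock:
            self.state = State.CLOSED
            self._failures = 0
        return result
```

The lock is held only around state reads and writes, never across the awaited call, so a slow downstream service does not block other coroutines from being rejected quickly.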

Usage Example

from bindu.utils.circuit_breaker import circuit_breaker

@circuit_breaker(name="openai", failure_threshold=5)
async def call_openai(prompt: str) -> str:
    # After 5 consecutive failures, the circuit opens.
    # Calls then fail fast until the recovery timeout elapses.
    ...

Testing

# Run circuit breaker tests
pytest tests/unit/test_circuit_breaker.py -v

# Run demo
python examples/circuit_breaker_demo.py

# Verify no regressions (516 tests)
pytest tests/ -q

Reference

Based on Michael T. Nygard's "Release It!" pattern for production resilience.

✅ Checklist

  • Tested thoroughly (34 tests, all passing)
  • Followed project style and linting (ruff clean)
  • Added demo script for verification
  • No regressions (516 existing tests still pass)

Implements production-grade circuit breaker to prevent cascade failures
when external services (LLMs, APIs) are degraded or unavailable.

Why:
- Agents call external LLMs (OpenAI, Anthropic, etc.)
- These services have rate limits, latency spikes, outages
- Current retry mechanism retries indefinitely on degraded services
- Circuit breaker provides fail-fast behavior + auto-recovery

What:
- Three-state FSM: CLOSED → OPEN → HALF_OPEN → CLOSED
- Async-native design with coroutine-safe state transitions
- Three usage patterns: decorator, context manager, direct call
- Global registry for monitoring all circuit breakers
- Configurable via CircuitBreakerSettings (Pydantic)
- Integration with Sentry for state change notifications
- 34 comprehensive async tests (TDD approach)
- Demo script for manual verification

Reference: Michael T. Nygard's "Release It!" pattern
Addresses: Resilience gap identified in production hardening roadmap
@raahulrahl added the enhancement label on Dec 12, 2025
@raahulrahl self-assigned this on Dec 12, 2025
@raahulrahl marked this pull request as draft on December 12, 2025, 19:02
@raahulrahl added the wontfix label and removed the enhancement label on Dec 14, 2025