Intelligent SRE Agent with EKS MCP Server

An intelligent Site Reliability Engineering (SRE) agent system built with Strands Agents SDK and Amazon Bedrock AgentCore for autonomous Kubernetes/EKS operations.

What This Is

This system implements a multi-agent architecture where a K8s agent autonomously investigates cluster issues by making multiple tool calls in an agentic loop (up to 20 iterations) without user intervention. The K8s agent is wrapped as a tool using the agents-as-tools pattern from Strands, allowing an orchestrator agent to delegate Kubernetes tasks to it. The agents connect to an EKS MCP server (20+ Kubernetes tools) hosted on AgentCore Runtime and exposed through AgentCore Gateway with OAuth authentication.

Architecture: User/Slack → Orchestrator Agent → K8s Agent (agentic loop) → AgentCore Gateway (OAuth) → AgentCore Runtime (MCP Server) → EKS Cluster. The K8s agent autonomously gathers information (pods, logs, events, health) across multiple iterations and synthesizes comprehensive reports with root cause analysis and recommendations.
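
For orientation, here is a minimal, illustrative sketch of the agents-as-tools pattern in Strands, with model configuration and MCP tool loading elided (the names below are simplified, not the repo's exact code):

from strands import Agent, tool

@tool
def k8s_agent_tool(query: str) -> str:
    """Delegate a Kubernetes investigation to the K8s specialist."""
    # The sub-agent runs its own agentic loop over the MCP tools,
    # making as many tool calls as it needs before answering.
    k8s_agent = Agent(
        name="k8s_agent",
        system_prompt="You are a Kubernetes operations specialist...",
        tools=[],  # the 20+ MCP tools from the AgentCore Gateway go here
    )
    return str(k8s_agent(query))

# The orchestrator sees the entire K8s agent as a single tool.
orchestrator = Agent(
    name="orchestrator",
    system_prompt="Route SRE tasks to the right specialist.",
    tools=[k8s_agent_tool],
)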

[Architecture diagram]

[Frontend screenshot]

Quick Start (TL;DR)

For those who want to get started quickly:

# 1. Install dependencies
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

# 2. Set up a test EKS cluster (skip if you have one)
cd infrastructure/eks_cluster && ./create_test_cluster.sh
cd error-scenarios && ./deploy_all.sh
export EKS_CLUSTER_NAME=eks-mcp-test-cluster
export AWS_REGION=us-west-2

# 3. Deploy AgentCore infrastructure (Runtime + Gateway)
cd ../../scripts/agentcore/runtime
./deploy_all.sh

# 4. Start the application
cd ../../../
./start.sh

# 5. Open browser to http://localhost:5173 and start chatting!

First command to try: Check the health of all pods in test-errors namespace

For detailed step-by-step instructions, continue reading below.


Setup Overview

The complete setup follows these steps:

Step  What                  Time       Description
1     Install Dependencies  2 min      Install uv and Python packages
2     Set Up EKS Cluster    0-15 min   Use existing cluster (0 min) or create test cluster (15 min)
3     Deploy AgentCore      10-15 min  Deploy Runtime + Gateway with OAuth
4     Verify Deployment     1 min      Test connectivity and credentials
5     Start Application     1 min      Launch web interface

Total Time: 15-35 minutes depending on whether you create a new cluster.


🧠 Memory & Context Retention

This system supports AgentCore Memory for long-term context retention across sessions:

  • Remember cluster issues: K8s agent recalls previous errors, patterns, and recurring problems (e.g., "pod crashes every Monday")
  • Learn from debugging: Holmes agent builds a knowledge base of successful solutions over time
  • Maintain user context: Orchestrator remembers user preferences and conversation history
  • Cross-session intelligence: Agents learn from past interactions to provide better recommendations

Quick Setup:

# 1. Create memory resources (one-time setup)
uv run python scripts/agentcore/memory/create_agent_memories.py --region us-west-2

# 2. Add memory IDs to .env file (from script output)
AGENTCORE_MEMORY_K8S_AGENT_MEMORY_ID=mem-abc123
AGENTCORE_MEMORY_ORCHESTRATOR_MEMORY_ID=mem-def456

# 3. Use agents with memory enabled
from sre_agent.agents.orchestrator import create_orchestrator_agent
import uuid

orchestrator = create_orchestrator_agent(
    cluster_name="my-cluster",
    session_id=str(uuid.uuid4()),  # Unique per conversation
    actor_id="user-alice",          # Unique per user
    enable_memory=True              # Default
)

Memory is optional but recommended for production use. See the AgentCore Memory Integration section for details.

Prerequisites

  • Python 3.11+
  • Node.js 18+ (for the web frontend)
  • AWS CLI configured with appropriate credentials
  • AWS Account with permissions for: Amazon EKS, Amazon Cognito, Amazon Bedrock AgentCore, Amazon ECR, IAM
  • kubectl for Kubernetes cluster management
  • jq for JSON parsing (Install: brew install jq on macOS or apt install jq on Linux)

Getting Started

This guide walks you through setting up the Intelligent SRE Agent from scratch. The process takes about 15-35 minutes, depending on whether you need to create a new cluster.

Step 1: Install uv and Dependencies

First, install uv package manager:

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or using pip
pip install uv

Then install project dependencies:

cd /path/to/Intelligent_SRE_Agent

# Install dependencies (creates .venv automatically)
uv sync

Step 2: Set Up Kubernetes Cluster

You have two options depending on whether you have an existing EKS cluster.

Option A: Use an Existing EKS Cluster

If you already have an EKS cluster, simply note down the cluster name and region. You'll need these in Step 3.

# Example: List your existing clusters
aws eks list-clusters --region us-west-2

# Set environment variables for deployment
export EKS_CLUSTER_NAME=your-existing-cluster-name
export AWS_REGION=us-west-2

Option B: Create a Test EKS Cluster with Sample Errors

If you don't have an EKS cluster, create a test cluster with pre-configured error scenarios for testing the SRE agent:

# Create a test EKS cluster (takes ~15 minutes)
cd infrastructure/eks_cluster
./create_test_cluster.sh

# Deploy error scenarios (pods with various issues for testing)
cd error-scenarios
./deploy_all.sh

# The cluster name will be: eks-mcp-test-cluster
# Note: Make sure to export this for the next step
export EKS_CLUSTER_NAME=eks-mcp-test-cluster
export AWS_REGION=us-west-2

The test cluster includes various error scenarios:

  • ImagePullBackOff errors
  • CrashLoopBackOff pods
  • Resource limit issues
  • Configuration errors

These provide realistic scenarios for testing the SRE agent's diagnostic capabilities.

Step 3: Deploy AgentCore Infrastructure (Runtime + Gateway)

Now deploy the complete infrastructure with a single command. This deploys:

  • Cognito OAuth for authentication
  • AgentCore Runtime with EKS MCP Server
  • AgentCore Gateway for unified access
  • Project .env file with all credentials

# Navigate to the runtime deployment directory
cd scripts/agentcore/runtime

# Deploy everything (takes ~10-15 minutes)
./deploy_all.sh

The script will automatically:

  1. Create Cognito OAuth configuration for M2M authentication
  2. Deploy the EKS MCP server to AgentCore Runtime with full EKS permissions
  3. Test the Runtime deployment
  4. Create an AgentCore Gateway
  5. Attach the Runtime as a Gateway target
  6. Update your project's .env file with all credentials

Important: The deployment script uses the EKS_CLUSTER_NAME and AWS_REGION environment variables you set in Step 2.

Step 4: Verify Deployment

# Navigate back to project root
cd ../../..

# Check that credentials were saved
cat scripts/agentcore/runtime/credentials.json

# Verify .env file was created with all credentials
cat .env

# Test MCP tools are accessible
uv run python scripts/agentcore/runtime/test_mcp_client.py

Expected output should show:

  • Gateway URL and OAuth credentials
  • List of available MCP tools (20+ Kubernetes operations)
  • Successful connection to your EKS cluster

Step 5: Start the Application

Now you're ready to run the Intelligent SRE Agent with its web interface!

Start Both Backend and Frontend

# From project root directory
./start.sh

This script will:

  • Start the FastAPI backend on http://127.0.0.1:8000
  • Start the React frontend on http://localhost:5173
  • Open the web UI automatically in your browser

Access the Web Interface

The web UI will open automatically, or navigate to: http://localhost:5173

What you'll see:

  • Left Panel: Real-time agent activity monitor

    • 🌐 System - Gateway connection status
    • 🎯 Orchestrator - Task coordinator (lights up when processing)
    • ☸️ K8s Agent - Kubernetes operations agent (lights up when executing)
  • Right Panel: Terminal interface for commands

    • Natural language input
    • Command history (use ↑/↓ arrows)
    • Real-time log streaming from agents
    • Color-coded responses and errors

Try These Example Commands

If you created the test cluster with error scenarios:

Check the health of all pods in test-errors namespace
Show me any failing pods and their logs
Investigate pod crashes in test-errors and provide root cause analysis

For any cluster:

Are all nodes ready?
What is the current status of the Kubernetes cluster?
Show resource usage for all pods

Stopping the Application

When you're done:

./stop.sh

This stops both the backend API and frontend development server.


Alternative Ways to Use the Agent

Option 1: Web UI (Recommended - Covered Above)

The web interface provides the best experience with real-time visualization of agent activity.

Option 2: Manual Start (Advanced)

If you prefer to start services individually:

# Terminal 1: Start backend API (from project root)
uv run python -m sre_agent.api.app

# Terminal 2: Start frontend (from project root)
cd frontend
npm install  # First time only
npm run dev

Important: Always use uv run python -m sre_agent.api.app (not python app.py) to ensure imports work correctly.

Option 3: Using the Orchestrator Programmatically

You can also use the orchestrator agent directly in your Python code:

from sre_agent.agents.orchestrator import create_orchestrator_agent

# Create orchestrator with K8s sub-agent
orchestrator = create_orchestrator_agent(
    cluster_name="your-cluster-name"
)

# Ask questions - agents operate autonomously
response = orchestrator("Check the health of all pods in the default namespace")
print(response)

# More examples
orchestrator("Why is my deployment failing?")
orchestrator("Get logs from the api-server pod in kube-system")
orchestrator("Investigate pod crashes in production namespace")

Using the Orchestrator with Memory

Enable memory to allow agents to remember context across conversations:

from sre_agent.agents.orchestrator import create_orchestrator_agent
import uuid

# Create orchestrator with memory enabled
orchestrator = create_orchestrator_agent(
    cluster_name="your-cluster-name",
    session_id=str(uuid.uuid4()),  # Unique per conversation
    actor_id="user-alice",          # Unique per user
    enable_memory=True              # Default is True
)

# First conversation - agent stores context
response1 = orchestrator("Check pods in production namespace")
print(response1)

# Follow-up - agent remembers previous context!
response2 = orchestrator("What issues did we find earlier?")
print(response2)

# Agent now remembers:
# - Previous cluster issues and error patterns
# - User preferences for output format
# - Historical troubleshooting context

Option 4: CLI Chat Interface

For terminal-only interaction without the web UI:

uv run python scripts/cli_chat_simple.py

# With debug logging to see full agent reasoning
uv run python scripts/cli_chat_simple.py --debug

Option 5: Using K8s Agent Directly

from sre_agent.agents.k8s_agent import create_k8s_agent

# Create K8s agent directly
k8s_agent = create_k8s_agent(cluster_name="your-cluster-name")

# Agent runs in agentic loop - makes multiple tool calls autonomously
response = k8s_agent("Diagnose why pods are failing in the app namespace")
print(response)

How the Agentic Loop Works

When you ask "Check pod health in default namespace", the K8s agent autonomously:

  1. Iteration 1: Calls get_pods(namespace="default") → sees 2 failing pods
  2. Iteration 2: Calls describe_pod() for failing pods → finds ImagePullBackOff
  3. Iteration 3: Calls get_events() → identifies image registry auth failure
  4. Iteration 4: Calls get_pod_logs() → confirms authentication error
  5. Synthesizes: Returns comprehensive report with root cause and recommendations

No user intervention needed - the agent decides what tools to call and when to stop.
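
Conceptually, the loop Strands runs internally looks roughly like the sketch below; you never write this yourself, and decide_next_step, call_tool, and summarize are stand-ins for the model call, MCP tool dispatch, and final synthesis:

MAX_ITERATIONS = 20  # the K8s agent's cap

def run_agentic_loop(question: str) -> str:
    conversation = [question]
    for _ in range(MAX_ITERATIONS):
        step = decide_next_step(conversation)   # model picks a tool or answers
        if step.is_final_answer:
            return step.answer                  # synthesized root-cause report
        result = call_tool(step.tool_name, step.arguments)  # e.g. get_pods(...)
        conversation.append(result)             # evidence feeds the next decision
    return summarize(conversation)              # iteration cap reached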

Common Use Cases

# Health checks
orchestrator("Check overall cluster health")
orchestrator("Show me all unhealthy pods")
orchestrator("Are all nodes ready?")

# Troubleshooting
orchestrator("Why are pods crashing in production?")
orchestrator("Diagnose why the api-deployment is unavailable")
orchestrator("What events have occurred in the last hour?")

# Resource monitoring
orchestrator("Show resource usage for all pods")
orchestrator("Check node capacity and utilization")

Example Technical Queries

These advanced queries showcase the multi-agent system's ability to perform complex investigations through orchestration:

1. Multi-Step Pod Failure Investigation

Investigate why pods in the test-errors namespace are failing. Check pod status, get logs from any crashed pods, examine recent events, and provide root cause analysis with recommendations.

What happens: Orchestrator delegates to K8s agent, which autonomously:

  • Lists all pods in test-errors namespace
  • Identifies failing/crashed pods
  • Retrieves logs from failed pods
  • Examines cluster events related to failures
  • Analyzes patterns and synthesizes root cause report

2. Deployment Health Deep Dive

Check the health of the api-server deployment in the production namespace. Include pod status, replica counts, recent rollout history, resource usage, and any configuration issues.

What happens: K8s agent performs comprehensive deployment analysis:

  • Gets deployment status and replica counts
  • Describes pod specifications and health
  • Reviews rollout status and history
  • Checks resource requests vs actual usage
  • Identifies misconfigurations or resource constraints

3. Node Resource Capacity Analysis

Analyze all nodes in the cluster for resource utilization. Show CPU and memory usage, identify any nodes under pressure, and highlight pods consuming the most resources.

What happens: Agent gathers cluster-wide resource metrics:

  • Lists all nodes with capacity information
  • Gets resource usage statistics per node
  • Identifies resource-intensive pods
  • Detects nodes approaching capacity limits
  • Provides recommendations for scaling or optimization

4. Cross-Namespace Event Correlation

Get all warning and error events across all namespaces from the last 30 minutes. Group by type and identify any patterns or recurring issues that need attention.

What happens: Agent performs event analysis:

  • Retrieves cluster events across namespaces
  • Filters for warnings and errors
  • Groups by event type and reason
  • Identifies patterns in timing and affected resources
  • Highlights critical recurring issues

5. Service Connectivity Troubleshooting

Check if the backend-api service in the app namespace is accessible. Verify the service endpoints, check backing pods health, review recent logs, and diagnose any connectivity issues.

What happens: Agent performs multi-layer investigation:

  • Gets service definition and endpoints
  • Verifies backing pods are ready and healthy
  • Reviews pod logs for connection errors
  • Checks service selector matches pod labels
  • Analyzes network policies or ingress configuration
  • Provides connectivity troubleshooting report

Try these queries through the Web UI or programmatically to see the agents work through complex investigations autonomously!

Configuration

The agents automatically load configuration from .env:

# Gateway endpoint (from deployment)
AGENTCORE_GATEWAY_ENDPOINT=https://xxxxx.gateway.bedrock-agentcore...

# OAuth credentials (from deployment)
GATEWAY_COGNITO_APP_CLIENT_ID=xxxxx
GATEWAY_COGNITO_APP_CLIENT_SECRET=xxxxx
GATEWAY_COGNITO_TOKEN_ENDPOINT=https://...amazoncognito.com/oauth2/token

# Your cluster
EKS_CLUSTER_NAME=your-cluster-name
AWS_REGION=us-west-2

The K8s agent uses these credentials to (the token step is sketched after this list):

  1. Obtain OAuth token from Cognito (automatic, cached, auto-refreshed)
  2. Connect to AgentCore Gateway with Bearer token
  3. Load 20+ Kubernetes MCP tools
  4. Execute operations against your EKS cluster
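
Step 1 is a standard OAuth2 client-credentials exchange against Cognito. A minimal sketch of roughly what OAuthTokenManager does under the hood (the real implementation also caches and auto-refreshes the token):

import requests

def fetch_token(token_endpoint: str, client_id: str, client_secret: str) -> str:
    # Cognito M2M flow: POST the client-credentials grant with HTTP Basic auth.
    resp = requests.post(
        token_endpoint,
        data={"grant_type": "client_credentials"},
        auth=(client_id, client_secret),
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]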

Available MCP Tools

The EKS MCP server provides 20+ tools including:

  • Cluster: get_cluster_info, get_cluster_health, list_clusters
  • Pods: get_pods, describe_pod, get_pod_logs
  • Deployments: get_deployments, scale_deployment, rollout_status
  • Resources: get_services, get_namespaces, get_nodes, get_events
  • Monitoring: get_resource_usage, get_node_resource_usage

All operations are read-only by default for production safety.
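
You can inspect the available tools yourself with the official mcp Python client. A minimal sketch, assuming the gateway speaks the MCP streamable-HTTP transport and the endpoint comes from the .env written by the deployment script (the repo's test_mcp_client.py likely does something similar):

import asyncio
import os

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def list_gateway_tools(token: str) -> None:
    # Gateway endpoint from the .env written by the deployment script.
    url = os.environ["AGENTCORE_GATEWAY_ENDPOINT"]
    headers = {"Authorization": f"Bearer {token}"}
    async with streamablehttp_client(url, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.list_tools()
            for t in result.tools:
                print(t.name)

# Pass a token obtained via the client-credentials flow shown earlier:
# asyncio.run(list_gateway_tools(token))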

AgentCore Memory Integration

The system now supports Long-Term Memory (LTM) using Amazon Bedrock AgentCore Memory, enabling agents to:

  • Remember previous cluster issues and error patterns
  • Learn from past debugging sessions
  • Maintain user conversation context across sessions
  • Build a knowledge base of solutions over time

Memory Setup

1. Install Memory Dependencies

# Install AgentCore Memory with Strands integration
uv add 'bedrock-agentcore[strands-agents]'

2. Create Memory Resources

Use the memory manager to create long-term memory for your agents:

from sre_agent.memory import AgentCoreMemoryManager

# Initialize memory manager
manager = AgentCoreMemoryManager(region_name="us-east-1")

# Create long-term memory with multiple strategies
# - Summary: Condenses conversation sessions
# - Preferences: Captures user preferences
# - Semantic: Extracts and stores facts
memory = manager.create_long_term_memory(
    name="k8s-agent-memory",
    description="Long-term memory for K8s agent"
)

memory_id = memory.get("id")
print(f"Memory ID: {memory_id}")

# Save memory ID as environment variable
# export AGENTCORE_MEMORY_K8S_AGENT_MEMORY_ID=mem-abc123

3. Use Memory with Agents

Each agent can be configured with its own memory session:

from sre_agent.memory import (
    create_k8s_agent_session_manager,
    create_holmes_agent_session_manager,
    create_orchestrator_session_manager,
)
from strands import Agent

# Create session manager for K8s agent
session_manager = create_k8s_agent_session_manager(
    session_id="user-session-001",  # Unique per conversation
    actor_id="user-123",             # Unique per user
    region_name="us-east-1"
)

# Use with K8s agent
k8s_agent = Agent(
    name="k8s_agent",
    system_prompt="You are a Kubernetes specialist...",
    tools=[...],
    session_manager=session_manager,  # Memory enabled!
    max_iterations=20
)

# Agent now remembers context across invocations
response1 = k8s_agent("Check pod health in default namespace")
response2 = k8s_agent("What was the issue we found earlier?")  # Remembers!

Memory Retrieval Configuration

Each agent has an optimized retrieval configuration (summarized below, with a data sketch after these lists):

K8s Agent - Prioritizes cluster history:

  • Facts: Top 15 results (relevance 0.6) - Previous cluster issues and patterns
  • Summaries: Top 5 results (relevance 0.5) - Recent session context
  • Preferences: Top 3 results (relevance 0.7) - User output preferences

Holmes Agent - Prioritizes debugging knowledge:

  • Facts: Top 20 results (relevance 0.5) - Broad debugging history
  • Summaries: Top 5 results (relevance 0.5) - Past debugging sessions
  • Preferences: Top 3 results (relevance 0.7) - User preferences

Orchestrator - Balanced retrieval:

  • Preferences: Top 5 results (relevance 0.7) - User context
  • Summaries: Top 5 results (relevance 0.6) - Conversation history
  • Facts: Top 10 results (relevance 0.6) - General knowledge
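
The K8s agent's settings expressed as data might look like this (illustrative shapes only; this is an assumption, not the project's actual API):

K8S_AGENT_RETRIEVAL = {
    "facts":       {"top_k": 15, "relevance_threshold": 0.6},  # cluster issues and patterns
    "summaries":   {"top_k": 5,  "relevance_threshold": 0.5},  # recent session context
    "preferences": {"top_k": 3,  "relevance_threshold": 0.7},  # user output preferences
}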

Memory Namespaces

AgentCore Memory uses three namespace patterns:

  1. /preferences/{actorId} - Cross-session user preferences
  2. /facts/{actorId} - Cross-session facts and knowledge
  3. /summaries/{actorId}/{sessionId} - Session-specific summaries

These namespaces allow agents to share knowledge across sessions while maintaining user-specific context.
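
For example, for actor_id "user-alice" and session_id "session-001", the resolved namespaces are:

/preferences/user-alice
/facts/user-alice
/summaries/user-alice/session-001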

Environment Variables

Store memory IDs as environment variables for production use:

# Add to .env file
AGENTCORE_MEMORY_K8S_AGENT_MEMORY_ID=mem-abc123
AGENTCORE_MEMORY_HOLMES_AGENT_MEMORY_ID=mem-def456
AGENTCORE_MEMORY_ORCHESTRATOR_MEMORY_ID=mem-ghi789

The memory manager automatically uses these environment variables if available.
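
A sketch of that fallback pattern with a hypothetical helper (the real lookup lives inside AgentCoreMemoryManager):

import os

def resolve_memory_id(agent_name: str, explicit_id: str | None = None) -> str | None:
    # An explicit ID wins; otherwise fall back to the environment, e.g.
    # AGENTCORE_MEMORY_K8S_AGENT_MEMORY_ID for agent_name="k8s_agent".
    return explicit_id or os.environ.get(
        f"AGENTCORE_MEMORY_{agent_name.upper()}_MEMORY_ID"
    )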

Memory Benefits

For K8s Agent:

  • Remembers recurring cluster issues (e.g., "This pod crashes every Monday")
  • Learns common error patterns and their solutions
  • Maintains context when troubleshooting across multiple sessions

For Holmes Agent:

  • Builds a knowledge base of successful debugging strategies
  • Recalls similar past errors and their root causes
  • Improves debugging accuracy over time

For Orchestrator:

  • Maintains user conversation context across sessions
  • Remembers user preferences for output format and detail level
  • Provides personalized responses based on user history

Important Notes

  • Only one agent per session is supported when using memory
  • Memory IDs should be created once and stored as environment variables
  • Use unique session_id for each conversation
  • Use consistent actor_id for each user to enable cross-session learning
  • Memory retrieval uses relevance scoring (0.0-1.0) to filter results

Example: Multi-Session Context

from sre_agent.memory import create_k8s_agent_session_manager
from strands import Agent

# Session 1
session_mgr_1 = create_k8s_agent_session_manager(
    session_id="session-001",
    actor_id="user-alice"
)
agent = Agent(name="k8s_agent", session_manager=session_mgr_1, ...)
agent("There's a pod crashing in the api namespace")
# Agent stores: "Pod crash in api namespace"

# Session 2 (different session, same user)
session_mgr_2 = create_k8s_agent_session_manager(
    session_id="session-002",  # New session
    actor_id="user-alice"      # Same user
)
agent = Agent(name="k8s_agent", session_manager=session_mgr_2, ...)
agent("Is the api namespace still having issues?")
# Agent recalls: Previous facts about api namespace crashes!

Reference: AgentCore Memory Documentation

How Memory is Integrated into Agents

Memory is automatically enabled in both the Orchestrator and K8s agents when you provide session_id and actor_id parameters.

Automatic Memory Integration

When you create agents with memory parameters, the system automatically:

  1. Creates memory session managers for each agent type
  2. Configures optimized retrieval based on agent role
  3. Enables context retention across conversations
  4. Builds knowledge base over time

Example: Creating Agents with Memory

from sre_agent.agents.orchestrator import create_orchestrator_agent
import uuid

# Generate session ID (unique per conversation)
session_id = str(uuid.uuid4())
actor_id = "sre-alice"  # Unique per user

# Create orchestrator - memory is enabled automatically
orchestrator = create_orchestrator_agent(
    cluster_name="my-eks-cluster",
    session_id=session_id,
    actor_id=actor_id,
    enable_memory=True  # This is the default
)

# Both orchestrator AND K8s agent now have memory!
# The K8s agent inherits the session context

What Gets Remembered

Orchestrator Agent remembers:

  • User conversation context and preferences
  • Task routing patterns
  • User's output format preferences (verbose vs concise)
  • Cross-session user context

K8s Agent remembers:

  • Previous cluster issues and error patterns
  • Recurring pod failures (e.g., "pod crashes every Monday")
  • Common troubleshooting steps for specific errors
  • Historical resource usage patterns
  • Solutions that worked for similar issues

Memory Configuration in Agent Code

The memory integration happens automatically in the agent creation functions:

In orchestrator.py:

# Memory is initialized when session_id and actor_id are provided
def create_orchestrator_agent(
    cluster_name: str | None = None,
    session_id: str | None = None,
    actor_id: str | None = None,
    enable_memory: bool = True,
) -> Agent:
    # ...
    if enable_memory and session_id and actor_id:
        session_manager = create_orchestrator_session_manager(
            session_id=session_id,
            actor_id=actor_id,
        )

    orchestrator = Agent(
        name="orchestrator",
        tools=[...],
        session_manager=session_manager,  # Memory enabled!
    )

In k8s_agent.py:

# K8s agent tool receives memory configuration
def k8s_agent_tool(
    cluster_name: str | None = None,
    session_id: str | None = None,
    actor_id: str | None = None,
    enable_memory: bool = True,
):
    # ...
    if enable_memory and session_id and actor_id:
        session_manager = create_k8s_agent_session_manager(
            session_id=session_id,
            actor_id=actor_id,
        )

    # Agent created inside tool with memory
    agent = Agent(
        name="k8s_agent",
        tools=mcp_tools,
        session_manager=session_manager,  # Memory enabled!
    )

Disabling Memory (Optional)

If you want to run agents without memory:

# Disable memory explicitly
orchestrator = create_orchestrator_agent(
    cluster_name="my-cluster",
    enable_memory=False
)

# Or simply don't provide session_id/actor_id
orchestrator = create_orchestrator_agent(
    cluster_name="my-cluster"
)

Memory Setup Checklist

Before using memory with agents:

  1. Install dependencies: uv add 'bedrock-agentcore[strands-agents]'
  2. Create memory resources: uv run python scripts/agentcore/memory/create_agent_memories.py
  3. Add memory IDs to .env: Copy environment variables from script output
  4. Generate session/actor IDs: Use uuid.uuid4() for sessions, consistent IDs for users
  5. Create agents with memory: Pass session_id and actor_id to agent creation

Verifying Memory is Enabled

Look for these log messages:

INFO: Enabling memory for orchestrator - session: abc-123, actor: user-alice
INFO: ✅ Orchestrator memory enabled
INFO: Enabling memory for K8s agent - session: abc-123, actor: user-alice
INFO: ✅ K8s agent memory enabled - will remember cluster issues and patterns

If you see these warnings, memory is not enabled:

WARNING: Memory module not available - install 'bedrock-agentcore[strands-agents]'
WARNING: Failed to enable memory: ...

Memory Best Practices

Session IDs:

  • Generate new session_id for each conversation: str(uuid.uuid4())
  • Use same session_id for related questions in one conversation
  • Create new session_id when starting a new topic

Actor IDs:

  • Use consistent actor_id for the same user: "user-alice"
  • This enables cross-session learning for that user
  • Different users should have different actor_ids

Example Session Management:

import uuid

# Start of conversation
session_id = str(uuid.uuid4())
actor_id = "sre-alice"

orchestrator = create_orchestrator_agent(
    cluster_name="my-cluster",
    session_id=session_id,
    actor_id=actor_id
)

# Multiple related questions in same session
orchestrator("Check pod health")
orchestrator("What issues did we find?")  # Same session
orchestrator("Get logs for failed pods")  # Same session

# New conversation = new session_id
new_session_id = str(uuid.uuid4())
orchestrator = create_orchestrator_agent(
    cluster_name="my-cluster",
    session_id=new_session_id,  # New session
    actor_id=actor_id             # Same user!
)

# Agent remembers user preferences and facts from previous sessions!
orchestrator("Any issues today?")

Development

# Install dependencies
uv sync

# Run linting and formatting
uv run ruff check --fix .
uv run ruff format .

# Run tests
uv run pytest

# Pre-commit workflow
uv run ruff check --fix . && uv run ruff format . && uv run pytest

Web UI Features

Real-Time Agent Visualization

  • Agent Activity Panel: See which agents are actively working on your request
  • Animated Status Indicators: Visual feedback showing WORKING/IDLE/OFFLINE states
  • Agent Badges: Each log entry shows which agent generated it (🎯 Orchestrator, ☸️ K8s)
  • Pulse Animations: Active agents have glowing borders that pulse

Terminal Interface

  • Command History: Navigate previous commands with ↑/↓ arrow keys
  • Auto-Scroll: Automatically scrolls to latest output
  • Color-Coded Output:
    • 🟢 Green: Success responses and agent output
    • 🔵 Blue: Info logs and agent activity
    • 🔴 Red: Errors and failures
    • ⚪ White: User commands
  • Session Stats: Track number of commands in current session

WebSocket Streaming

  • Real-Time Updates: See logs as agents generate them
  • No Polling: Efficient WebSocket connection for instant feedback
  • Reconnection: Automatically reconnects if connection is lost

Project Structure

Intelligent_SRE_Agent/
├── README.md                          # This file
├── QUICKSTART.md                      # Quick start guide for web UI
├── start.sh                           # Start backend + frontend
├── stop.sh                            # Stop all services
├── frontend/                          # React web UI
│   ├── src/
│   │   ├── App.jsx                    # Main app component
│   │   └── components/
│   │       ├── TerminalInterface.jsx  # Terminal UI
│   │       └── TerminalInterface.css  # Styling
│   └── package.json
├── src/sre_agent/
│   ├── agents/
│   │   ├── orchestrator.py           # Orchestrator agent
│   │   └── k8s_agent.py              # K8s agent with agentic loop
│   ├── api/
│   │   └── app.py                    # FastAPI backend with WebSocket
│   └── config/
│       ├── oauth.py                   # OAuth token manager
│       └── settings.py                # Configuration (loads .env)
├── scripts/
│   ├── cli_chat_simple.py            # CLI chat interface
│   └── agentcore/                    # Deployment scripts
│       ├── README.md                  # Deployment guide
│       ├── runtime/                   # Deploy MCP server to runtime
│       ├── gateway/                   # Create AgentCore Gateway
│       └── mcp_servers/               # EKS MCP server implementation
└── tests/                             # Test suite

Troubleshooting

Web UI Issues

"Frontend can't connect to backend"

# Check if backend is running
curl http://127.0.0.1:8000/health

# Check backend logs
tail -f /tmp/sre-agent-backend.log

# Restart services
./stop.sh
./start.sh

"WebSocket disconnected"

  • Check that backend is running on port 8000
  • Verify CORS settings in src/sre_agent/api/app.py
  • Check browser console for WebSocket errors

"Agents not showing activity"

  • Ensure logs contain agent names ("orchestrator", "k8s_agent")
  • Check WebSocket connection in browser dev tools
  • Verify log messages are being emitted by agents

Backend/Agent Issues

"Credentials file not found"

# Run deployment scripts in order
cd scripts/agentcore
uv run python runtime/setup_oauth.py
uv run python runtime/deploy_mcp_to_runtime.py
uv run python gateway/setup_gateway.py

"MCP tools not loading"

# Verify gateway URL in .env
cat .env

# Test gateway connection
uv run python scripts/agentcore/runtime/test_mcp_listtools.py

# Check gateway status
aws bedrock-agentcore-control get-gateway --gateway-identifier <gateway_id> --region $AWS_REGION

"Invalid OAuth token"

# Verify OAuth credentials match
diff <(grep COGNITO .env) <(grep -i cognito scripts/agentcore/runtime/credentials.json)

# Test token acquisition
uv run python - <<'EOF'
from sre_agent.config.oauth import OAuthTokenManager
from sre_agent.config.settings import settings

tm = OAuthTokenManager(
    settings.gateway_cognito_token_endpoint,
    settings.gateway_cognito_app_client_id,
    settings.gateway_cognito_app_client_secret,
)
print("Token:", tm.get_token()[:20] + "...")
EOF

Cleanup

When you're done testing and want to remove all resources:

Option 1: Complete Cleanup (Recommended)

Use the automated cleanup script:

# Navigate to runtime directory
cd scripts/agentcore/runtime

# Clean up AgentCore infrastructure (Gateway, Runtime, Cognito)
./deploy_all.sh --cleanup

This removes:

  • AgentCore Gateway
  • AgentCore Runtime
  • Cognito OAuth resources
  • IAM roles
  • Parameter Store entries

Option 2: Delete EKS Test Cluster

If you created the test cluster, delete it to avoid ongoing charges:

# Delete the test EKS cluster (takes ~10 minutes)
eksctl delete cluster --name eks-mcp-test-cluster --region us-west-2

Note: Only delete the cluster if you created it specifically for testing. Do not delete production clusters!

Option 3: Manual Cleanup (Advanced)

If you need to manually clean up specific resources:

# Get resource IDs from credentials file
cd scripts/agentcore/runtime
cat credentials.json

# Delete gateway
aws bedrock-agentcore-control delete-gateway \
  --gateway-identifier <gateway_id> \
  --region $AWS_REGION

# Delete runtime
aws bedrock-agentcore-control delete-runtime \
  --runtime-identifier <runtime_agent_id> \
  --region $AWS_REGION

# Delete Cognito User Pool
aws cognito-idp delete-user-pool \
  --user-pool-id <cognito_pool_id> \
  --region $AWS_REGION

License

See LICENSE file for details.
