An intelligent Site Reliability Engineering (SRE) agent system built with Strands Agents SDK and Amazon Bedrock AgentCore for autonomous Kubernetes/EKS operations.
This system implements a multi-agent architecture where a K8s agent autonomously investigates cluster issues by making multiple tool calls in an agentic loop (up to 20 iterations) without user intervention. The K8s agent is wrapped as a tool using the agents-as-tools pattern from Strands, allowing an orchestrator agent to delegate Kubernetes tasks to it. The agents connect to an EKS MCP server (20+ Kubernetes tools) hosted on AgentCore Runtime and exposed through AgentCore Gateway with OAuth authentication.
Architecture: User/Slack → Orchestrator Agent → K8s Agent (agentic loop) → AgentCore Gateway (OAuth) → AgentCore Runtime (MCP Server) → EKS Cluster. The K8s agent autonomously gathers information (pods, logs, events, health) across multiple iterations and synthesizes comprehensive reports with root cause analysis and recommendations.
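The agents-as-tools pattern at the heart of this architecture can be sketched roughly as follows. This is a simplified illustration, not the project's actual wiring: the real agents live in src/sre_agent/agents/ and load their Kubernetes tools from the AgentCore Gateway rather than the empty placeholder list used here.

```python
from strands import Agent, tool


@tool
def k8s_agent_tool(request: str) -> str:
    """Delegate a Kubernetes investigation to the K8s specialist agent."""
    k8s_agent = Agent(
        name="k8s_agent",
        system_prompt="You are a Kubernetes specialist. Investigate autonomously.",
        tools=[],  # placeholder: the real agent loads 20+ MCP tools via the Gateway
    )
    # The sub-agent runs its own agentic loop and returns a synthesized report
    return str(k8s_agent(request))


orchestrator = Agent(
    name="orchestrator",
    system_prompt="Route cluster questions to the K8s specialist.",
    tools=[k8s_agent_tool],
)
print(orchestrator("Check the health of all pods in the default namespace"))
```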
For those who want to get started quickly:
# 1. Install dependencies
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
# 2. Set up a test EKS cluster (skip if you have one)
cd infrastructure/eks_cluster && ./create_test_cluster.sh
cd error-scenarios && ./deploy_all.sh
export EKS_CLUSTER_NAME=eks-mcp-test-cluster
export AWS_REGION=us-west-2
# 3. Deploy AgentCore infrastructure (Runtime + Gateway)
cd ../../scripts/agentcore/runtime
./deploy_all.sh
# 4. Start the application
cd ../../../
./start.sh
# 5. Open browser to http://localhost:5173 and start chatting!
First command to try: Check the health of all pods in test-errors namespace
For detailed step-by-step instructions, continue reading below.
The complete setup follows these steps:
| Step | What | Time | Description |
|---|---|---|---|
| 1 | Install Dependencies | 2 min | Install uv and Python packages |
| 2 | Set Up EKS Cluster | 0-15 min | Use existing cluster (0 min) or create test cluster (15 min) |
| 3 | Deploy AgentCore | 10-15 min | Deploy Runtime + Gateway with OAuth |
| 4 | Verify Deployment | 1 min | Test connectivity and credentials |
| 5 | Start Application | 1 min | Launch web interface |
Total Time: 15-35 minutes depending on whether you create a new cluster.
This system supports AgentCore Memory for long-term context retention across sessions:
- Remember cluster issues: K8s agent recalls previous errors, patterns, and recurring problems (e.g., "pod crashes every Monday")
- Learn from debugging: Holmes agent builds a knowledge base of successful solutions over time
- Maintain user context: Orchestrator remembers user preferences and conversation history
- Cross-session intelligence: Agents learn from past interactions to provide better recommendations
Quick Setup:
# 1. Create memory resources (one-time setup)
uv run python scripts/agentcore/memory/create_agent_memories.py --region us-west-2
# 2. Add memory IDs to .env file (from script output)
AGENTCORE_MEMORY_K8S_AGENT_MEMORY_ID=mem-abc123
AGENTCORE_MEMORY_ORCHESTRATOR_MEMORY_ID=mem-def456
# 3. Use agents with memory enabled
from sre_agent.agents.orchestrator import create_orchestrator_agent
import uuid
orchestrator = create_orchestrator_agent(
cluster_name="my-cluster",
session_id=str(uuid.uuid4()), # Unique per conversation
actor_id="user-alice", # Unique per user
enable_memory=True # Default
)
Memory is optional but recommended for production use. See the AgentCore Memory Integration section for details.
- Python 3.11+
- Node.js 18+ (for the web frontend)
- AWS CLI configured with appropriate credentials
- AWS Account with permissions for: Amazon EKS, Amazon Cognito, Amazon Bedrock AgentCore, Amazon ECR, IAM
- kubectl for Kubernetes cluster management
- jq for JSON parsing (install with brew install jq on macOS or apt install jq on Linux)
This guide walks you through setting up the Intelligent SRE Agent from scratch. The process takes about 15-20 minutes, plus roughly 15 minutes if you also create a new test cluster.
First, install uv package manager:
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or using pip
pip install uv
Then install project dependencies:
cd /path/to/Intelligent_SRE_Agent
# Install dependencies (creates .venv automatically)
uv sync
You have two options depending on whether you have an existing EKS cluster.
If you already have an EKS cluster, simply note down the cluster name and region. You'll need these in Step 3.
# Example: List your existing clusters
aws eks list-clusters --region us-west-2
# Set environment variables for deployment
export EKS_CLUSTER_NAME=your-existing-cluster-name
export AWS_REGION=us-west-2
If you don't have an EKS cluster, create a test cluster with pre-configured error scenarios for testing the SRE agent:
# Create a test EKS cluster (takes ~15 minutes)
cd infrastructure/eks_cluster
./create_test_cluster.sh
# Deploy error scenarios (pods with various issues for testing)
cd error-scenarios
./deploy_all.sh
# The cluster name will be: eks-mcp-test-cluster
# Note: Make sure to export this for the next step
export EKS_CLUSTER_NAME=eks-mcp-test-cluster
export AWS_REGION=us-west-2
The test cluster includes various error scenarios:
- ImagePullBackOff errors
- CrashLoopBackOff pods
- Resource limit issues
- Configuration errors
These provide realistic scenarios for testing the SRE agent's diagnostic capabilities.
Now deploy the complete infrastructure with a single command. This deploys:
- Cognito OAuth for authentication
- AgentCore Runtime with EKS MCP Server
- AgentCore Gateway for unified access
- Project .env file with all credentials
# Navigate to the runtime deployment directory
cd scripts/agentcore/runtime
# Deploy everything (takes ~10-15 minutes)
./deploy_all.sh
The script will automatically:
- Create Cognito OAuth configuration for M2M authentication
- Deploy the EKS MCP server to AgentCore Runtime with full EKS permissions
- Test the Runtime deployment
- Create an AgentCore Gateway
- Attach the Runtime as a Gateway target
- Update your project's .env file with all credentials
Important: The deployment script uses the EKS_CLUSTER_NAME and AWS_REGION environment variables you set in Step 2.
# Navigate back to project root
cd ../../..
# Check that credentials were saved
cat scripts/agentcore/runtime/credentials.json
# Verify .env file was created with all credentials
cat .env
# Test MCP tools are accessible
uv run python scripts/agentcore/runtime/test_mcp_client.py
Expected output should show (see the conceptual sketch below the list):
- Gateway URL and OAuth credentials
- List of available MCP tools (20+ Kubernetes operations)
- Successful connection to your EKS cluster
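For reference, the connectivity test amounts to something like the sketch below. It is illustrative only and assumes the Strands MCPClient with the MCP streamable-HTTP transport; the actual logic lives in scripts/agentcore/runtime/test_mcp_client.py, and the environment variable names come from the generated .env file.

```python
import os

from mcp.client.streamable_http import streamablehttp_client
from strands.tools.mcp import MCPClient

# Bearer token obtained via the Cognito client-credentials flow
# (see the Configuration section for a sketch of that exchange).
token = "<access token>"

client = MCPClient(lambda: streamablehttp_client(
    os.environ["AGENTCORE_GATEWAY_ENDPOINT"],
    headers={"Authorization": f"Bearer {token}"},
))
with client:
    tools = client.list_tools_sync()
    print(f"Gateway reachable - {len(tools)} MCP tools available")
```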
Now you're ready to run the Intelligent SRE Agent with its web interface!
# From project root directory
./start.sh
This script will:
- Start the FastAPI backend on http://127.0.0.1:8000
- Start the React frontend on http://localhost:5173
- Open the web UI automatically in your browser
The web UI will open automatically, or navigate to: http://localhost:5173
What you'll see:
- Left Panel: Real-time agent activity monitor
  - 🌐 System - Gateway connection status
  - 🎯 Orchestrator - Task coordinator (lights up when processing)
  - ☸️ K8s Agent - Kubernetes operations agent (lights up when executing)
- Right Panel: Terminal interface for commands
  - Natural language input
  - Command history (use ↑/↓ arrows)
  - Real-time log streaming from agents
  - Color-coded responses and errors
If you created the test cluster with error scenarios:
Check the health of all pods in test-errors namespace
Show me any failing pods and their logs
Investigate pod crashes in test-errors and provide root cause analysis
For any cluster:
Are all nodes ready?
What is the current status of the Kubernetes cluster?
Show resource usage for all pods
When you're done:
./stop.sh
This stops both the backend API and frontend development server.
The web interface provides the best experience with real-time visualization of agent activity.
If you prefer to start services individually:
# Terminal 1: Start backend API (from project root)
uv run python -m sre_agent.api.app
# Terminal 2: Start frontend (from project root)
cd frontend
npm install # First time only
npm run dev
Important: Always use uv run python -m sre_agent.api.app (not python app.py) to ensure imports work correctly.
You can also use the orchestrator agent directly in your Python code:
from sre_agent.agents.orchestrator import create_orchestrator_agent
# Create orchestrator with K8s sub-agent
orchestrator = create_orchestrator_agent(
cluster_name="your-cluster-name"
)
# Ask questions - agents operate autonomously
response = orchestrator("Check the health of all pods in the default namespace")
print(response)
# More examples
orchestrator("Why is my deployment failing?")
orchestrator("Get logs from the api-server pod in kube-system")
orchestrator("Investigate pod crashes in production namespace")Enable memory to allow agents to remember context across conversations:
from sre_agent.agents.orchestrator import create_orchestrator_agent
import uuid
# Create orchestrator with memory enabled
orchestrator = create_orchestrator_agent(
cluster_name="your-cluster-name",
session_id=str(uuid.uuid4()), # Unique per conversation
actor_id="user-alice", # Unique per user
enable_memory=True # Default is True
)
# First conversation - agent stores context
response1 = orchestrator("Check pods in production namespace")
print(response1)
# Follow-up - agent remembers previous context!
response2 = orchestrator("What issues did we find earlier?")
print(response2)
# Agent now remembers:
# - Previous cluster issues and error patterns
# - User preferences for output format
# - Historical troubleshooting context
For terminal-only interaction without the web UI:
uv run python scripts/cli_chat_simple.py
# With debug logging to see full agent reasoning
uv run python scripts/cli_chat_simple.py --debug
from sre_agent.agents.k8s_agent import create_k8s_agent
# Create K8s agent directly
k8s_agent = create_k8s_agent(cluster_name="your-cluster-name")
# Agent runs in agentic loop - makes multiple tool calls autonomously
response = k8s_agent("Diagnose why pods are failing in the app namespace")
print(response)
When you ask "Check pod health in default namespace", the K8s agent autonomously:
- Iteration 1: Calls get_pods(namespace="default") → sees 2 failing pods
- Iteration 2: Calls describe_pod() for failing pods → finds ImagePullBackOff
- Iteration 3: Calls get_events() → identifies image registry auth failure
- Iteration 4: Calls get_pod_logs() → confirms authentication error
- Synthesizes: Returns comprehensive report with root cause and recommendations
No user intervention needed - the agent decides what tools to call and when to stop.
# Health checks
orchestrator("Check overall cluster health")
orchestrator("Show me all unhealthy pods")
orchestrator("Are all nodes ready?")
# Troubleshooting
orchestrator("Why are pods crashing in production?")
orchestrator("Diagnose why the api-deployment is unavailable")
orchestrator("What events have occurred in the last hour?")
# Resource monitoring
orchestrator("Show resource usage for all pods")
orchestrator("Check node capacity and utilization")These advanced queries showcase the multi-agent system's ability to perform complex investigations through orchestration:
Investigate why pods in the test-errors namespace are failing. Check pod status, get logs from any crashed pods, examine recent events, and provide root cause analysis with recommendations.
What happens: Orchestrator delegates to K8s agent, which autonomously:
- Lists all pods in test-errors namespace
- Identifies failing/crashed pods
- Retrieves logs from failed pods
- Examines cluster events related to failures
- Analyzes patterns and synthesizes root cause report
Check the health of the api-server deployment in the production namespace. Include pod status, replica counts, recent rollout history, resource usage, and any configuration issues.
What happens: K8s agent performs comprehensive deployment analysis:
- Gets deployment status and replica counts
- Describes pod specifications and health
- Reviews rollout status and history
- Checks resource requests vs actual usage
- Identifies misconfigurations or resource constraints
Analyze all nodes in the cluster for resource utilization. Show CPU and memory usage, identify any nodes under pressure, and highlight pods consuming the most resources.
What happens: Agent gathers cluster-wide resource metrics:
- Lists all nodes with capacity information
- Gets resource usage statistics per node
- Identifies resource-intensive pods
- Detects nodes approaching capacity limits
- Provides recommendations for scaling or optimization
Get all warning and error events across all namespaces from the last 30 minutes. Group by type and identify any patterns or recurring issues that need attention.
What happens: Agent performs event analysis:
- Retrieves cluster events across namespaces
- Filters for warnings and errors
- Groups by event type and reason
- Identifies patterns in timing and affected resources
- Highlights critical recurring issues
Check if the backend-api service in the app namespace is accessible. Verify the service endpoints, check backing pods health, review recent logs, and diagnose any connectivity issues.
What happens: Agent performs multi-layer investigation:
- Gets service definition and endpoints
- Verifies backing pods are ready and healthy
- Reviews pod logs for connection errors
- Checks service selector matches pod labels
- Analyzes network policies or ingress configuration
- Provides connectivity troubleshooting report
Try these queries through the Web UI or programmatically to see the agents work through complex investigations autonomously!
The agents automatically load configuration from .env:
# Gateway endpoint (from deployment)
AGENTCORE_GATEWAY_ENDPOINT=https://xxxxx.gateway.bedrock-agentcore...
# OAuth credentials (from deployment)
GATEWAY_COGNITO_APP_CLIENT_ID=xxxxx
GATEWAY_COGNITO_APP_CLIENT_SECRET=xxxxx
GATEWAY_COGNITO_TOKEN_ENDPOINT=https://...amazoncognito.com/oauth2/token
# Your cluster
EKS_CLUSTER_NAME=your-cluster-name
AWS_REGION=us-west-2
The K8s agent uses these credentials to (a minimal token-request sketch follows this list):
- Obtain OAuth token from Cognito (automatic, cached, auto-refreshed)
- Connect to AgentCore Gateway with Bearer token
- Load 20+ Kubernetes MCP tools
- Execute operations against your EKS cluster
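Step 1 is a standard OAuth2 client-credentials exchange against Cognito. The project's OAuthTokenManager (src/sre_agent/config/oauth.py) handles this for you, including caching and refresh; the sketch below only illustrates the underlying request using the values from .env.

```python
import os

import requests

# Assumes the .env values have been exported into the environment.
resp = requests.post(
    os.environ["GATEWAY_COGNITO_TOKEN_ENDPOINT"],
    data={
        "grant_type": "client_credentials",
        "client_id": os.environ["GATEWAY_COGNITO_APP_CLIENT_ID"],
        "client_secret": os.environ["GATEWAY_COGNITO_APP_CLIENT_SECRET"],
    },
    timeout=10,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]

# The token is then sent as a Bearer header on every Gateway (MCP) request.
headers = {"Authorization": f"Bearer {access_token}"}
```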
The EKS MCP server provides 20+ tools including:
- Cluster: get_cluster_info, get_cluster_health, list_clusters
- Pods: get_pods, describe_pod, get_pod_logs
- Deployments: get_deployments, scale_deployment, rollout_status
- Resources: get_services, get_namespaces, get_nodes, get_events
- Monitoring: get_resource_usage, get_node_resource_usage
All operations are read-only by default for production safety.
The system now supports Long-Term Memory (LTM) using Amazon Bedrock AgentCore Memory, enabling agents to:
- Remember previous cluster issues and error patterns
- Learn from past debugging sessions
- Maintain user conversation context across sessions
- Build a knowledge base of solutions over time
# Install AgentCore Memory with Strands integration
uv add 'bedrock-agentcore[strands-agents]'
Use the memory manager to create long-term memory for your agents:
from sre_agent.memory import AgentCoreMemoryManager
# Initialize memory manager
manager = AgentCoreMemoryManager(region_name="us-east-1")
# Create long-term memory with multiple strategies
# - Summary: Condenses conversation sessions
# - Preferences: Captures user preferences
# - Semantic: Extracts and stores facts
memory = manager.create_long_term_memory(
name="k8s-agent-memory",
description="Long-term memory for K8s agent"
)
memory_id = memory.get("id")
print(f"Memory ID: {memory_id}")
# Save memory ID as environment variable
# export AGENTCORE_MEMORY_K8S_AGENT_MEMORY_ID=mem-abc123
Each agent can be configured with its own memory session:
from sre_agent.memory import (
create_k8s_agent_session_manager,
create_holmes_agent_session_manager,
create_orchestrator_session_manager,
)
from strands import Agent
# Create session manager for K8s agent
session_manager = create_k8s_agent_session_manager(
session_id="user-session-001", # Unique per conversation
actor_id="user-123", # Unique per user
region_name="us-east-1"
)
# Use with K8s agent
k8s_agent = Agent(
name="k8s_agent",
system_prompt="You are a Kubernetes specialist...",
tools=[...],
session_manager=session_manager, # Memory enabled!
max_iterations=20
)
# Agent now remembers context across invocations
response1 = k8s_agent("Check pod health in default namespace")
response2 = k8s_agent("What was the issue we found earlier?") # Remembers!Each agent has optimized retrieval configurations:
K8s Agent - Prioritizes cluster history:
- Facts: Top 15 results (relevance 0.6) - Previous cluster issues and patterns
- Summaries: Top 5 results (relevance 0.5) - Recent session context
- Preferences: Top 3 results (relevance 0.7) - User output preferences
Holmes Agent - Prioritizes debugging knowledge:
- Facts: Top 20 results (relevance 0.5) - Broad debugging history
- Summaries: Top 5 results (relevance 0.5) - Past debugging sessions
- Preferences: Top 3 results (relevance 0.7) - User preferences
Orchestrator - Balanced retrieval:
- Preferences: Top 5 results (relevance 0.7) - User context
- Summaries: Top 5 results (relevance 0.6) - Conversation history
- Facts: Top 10 results (relevance 0.6) - General knowledge
AgentCore Memory uses three namespace patterns:
- /preferences/{actorId} - Cross-session user preferences
- /facts/{actorId} - Cross-session facts and knowledge
- /summaries/{actorId}/{sessionId} - Session-specific summaries
These namespaces allow agents to share knowledge across sessions while maintaining user-specific context.
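As an illustration, the K8s agent's retrieval settings listed above map onto these namespaces roughly as follows. The literals simply restate the documented defaults; the real configuration is applied inside create_k8s_agent_session_manager, so you normally never write this mapping yourself.

```python
# Illustrative restatement of the K8s agent's retrieval defaults:
# namespace pattern -> (how many results to pull, minimum relevance score).
K8S_AGENT_RETRIEVAL = {
    "/facts/{actorId}": {"top_k": 15, "relevance_score": 0.6},
    "/summaries/{actorId}/{sessionId}": {"top_k": 5, "relevance_score": 0.5},
    "/preferences/{actorId}": {"top_k": 3, "relevance_score": 0.7},
}
```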
Store memory IDs as environment variables for production use:
# Add to .env file
AGENTCORE_MEMORY_K8S_AGENT_MEMORY_ID=mem-abc123
AGENTCORE_MEMORY_HOLMES_AGENT_MEMORY_ID=mem-def456
AGENTCORE_MEMORY_ORCHESTRATOR_MEMORY_ID=mem-ghi789
The memory manager automatically uses these environment variables if available.
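Conceptually, that reuse looks like the sketch below (illustrative only; the exact fallback logic lives in the memory manager):

```python
import os

from sre_agent.memory import AgentCoreMemoryManager

# Reuse the memory ID from .env when present; otherwise create it once.
memory_id = os.environ.get("AGENTCORE_MEMORY_K8S_AGENT_MEMORY_ID")
if not memory_id:
    manager = AgentCoreMemoryManager(region_name="us-west-2")
    memory = manager.create_long_term_memory(
        name="k8s-agent-memory",
        description="Long-term memory for K8s agent",
    )
    memory_id = memory.get("id")
print(f"Using memory: {memory_id}")
```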
For K8s Agent:
- Remembers recurring cluster issues (e.g., "This pod crashes every Monday")
- Learns common error patterns and their solutions
- Maintains context when troubleshooting across multiple sessions
For Holmes Agent:
- Builds a knowledge base of successful debugging strategies
- Recalls similar past errors and their root causes
- Improves debugging accuracy over time
For Orchestrator:
- Maintains user conversation context across sessions
- Remembers user preferences for output format and detail level
- Provides personalized responses based on user history
- Only one agent per session is supported when using memory
- Memory IDs should be created once and stored as environment variables
- Use a unique session_id for each conversation
- Use a consistent actor_id for each user to enable cross-session learning
- Memory retrieval uses relevance scoring (0.0-1.0) to filter results
from sre_agent.memory import create_k8s_agent_session_manager
from strands import Agent
# Session 1
session_mgr_1 = create_k8s_agent_session_manager(
session_id="session-001",
actor_id="user-alice"
)
agent = Agent(name="k8s_agent", session_manager=session_mgr_1, ...)
agent("There's a pod crashing in the api namespace")
# Agent stores: "Pod crash in api namespace"
# Session 2 (different session, same user)
session_mgr_2 = create_k8s_agent_session_manager(
session_id="session-002", # New session
actor_id="user-alice" # Same user
)
agent = Agent(name="k8s_agent", session_manager=session_mgr_2, ...)
agent("Is the api namespace still having issues?")
# Agent recalls: Previous facts about api namespace crashes!
Reference: AgentCore Memory Documentation
Memory is automatically enabled in both the Orchestrator and K8s agents when you provide session_id and actor_id parameters.
When you create agents with memory parameters, the system automatically:
- Creates memory session managers for each agent type
- Configures optimized retrieval based on agent role
- Enables context retention across conversations
- Builds knowledge base over time
from sre_agent.agents.orchestrator import create_orchestrator_agent
import uuid
# Generate session ID (unique per conversation)
session_id = str(uuid.uuid4())
actor_id = "sre-alice" # Unique per user
# Create orchestrator - memory is enabled automatically
orchestrator = create_orchestrator_agent(
cluster_name="my-eks-cluster",
session_id=session_id,
actor_id=actor_id,
enable_memory=True # This is the default
)
# Both orchestrator AND K8s agent now have memory!
# The K8s agent inherits the session context
Orchestrator Agent remembers:
- User conversation context and preferences
- Task routing patterns
- User's output format preferences (verbose vs concise)
- Cross-session user context
K8s Agent remembers:
- Previous cluster issues and error patterns
- Recurring pod failures (e.g., "pod crashes every Monday")
- Common troubleshooting steps for specific errors
- Historical resource usage patterns
- Solutions that worked for similar issues
The memory integration happens automatically in the agent creation functions:
In orchestrator.py:
# Memory is initialized when session_id and actor_id are provided
def create_orchestrator_agent(
cluster_name: str | None = None,
session_id: str | None = None,
actor_id: str | None = None,
enable_memory: bool = True,
) -> Agent:
# ...
if enable_memory and session_id and actor_id:
session_manager = create_orchestrator_session_manager(
session_id=session_id,
actor_id=actor_id,
)
orchestrator = Agent(
name="orchestrator",
tools=[...],
session_manager=session_manager, # Memory enabled!
)
In k8s_agent.py:
# K8s agent tool receives memory configuration
def k8s_agent_tool(
cluster_name: str | None = None,
session_id: str | None = None,
actor_id: str | None = None,
enable_memory: bool = True,
):
# ...
if enable_memory and session_id and actor_id:
session_manager = create_k8s_agent_session_manager(
session_id=session_id,
actor_id=actor_id,
)
# Agent created inside tool with memory
agent = Agent(
name="k8s_agent",
tools=mcp_tools,
session_manager=session_manager, # Memory enabled!
)
If you want to run agents without memory:
# Disable memory explicitly
orchestrator = create_orchestrator_agent(
cluster_name="my-cluster",
enable_memory=False
)
# Or simply don't provide session_id/actor_id
orchestrator = create_orchestrator_agent(
cluster_name="my-cluster"
)
Before using memory with agents:
- ✅ Install dependencies: uv add 'bedrock-agentcore[strands-agents]'
- ✅ Create memory resources: uv run python scripts/create_agent_memories.py
- ✅ Add memory IDs to .env: Copy environment variables from script output
- ✅ Generate session/actor IDs: Use uuid.uuid4() for sessions, consistent IDs for users
- ✅ Create agents with memory: Pass session_id and actor_id to agent creation
Look for these log messages:
INFO: Enabling memory for orchestrator - session: abc-123, actor: user-alice
INFO: ✅ Orchestrator memory enabled
INFO: Enabling memory for K8s agent - session: abc-123, actor: user-alice
INFO: ✅ K8s agent memory enabled - will remember cluster issues and patterns
If you see these warnings, memory is not enabled:
WARNING: Memory module not available - install 'bedrock-agentcore[strands-agents]'
WARNING: Failed to enable memory: ...
Session IDs:
- Generate a new session_id for each conversation: str(uuid.uuid4())
- Use the same session_id for related questions in one conversation
- Create a new session_id when starting a new topic
Actor IDs:
- Use a consistent actor_id for the same user: "user-alice"
- This enables cross-session learning for that user
- Different users should have different actor_ids
Example Session Management:
import uuid
# Start of conversation
session_id = str(uuid.uuid4())
actor_id = "sre-alice"
orchestrator = create_orchestrator_agent(
cluster_name="my-cluster",
session_id=session_id,
actor_id=actor_id
)
# Multiple related questions in same session
orchestrator("Check pod health")
orchestrator("What issues did we find?") # Same session
orchestrator("Get logs for failed pods") # Same session
# New conversation = new session_id
new_session_id = str(uuid.uuid4())
orchestrator = create_orchestrator_agent(
cluster_name="my-cluster",
session_id=new_session_id, # New session
actor_id=actor_id # Same user!
)
# Agent remembers user preferences and facts from previous sessions!
orchestrator("Any issues today?")# Install dependencies
uv sync
# Run linting and formatting
uv run ruff check --fix .
uv run ruff format .
# Run tests
uv run pytest
# Pre-commit workflow
uv run ruff check --fix . && uv run ruff format . && uv run pytest- Agent Activity Panel: See which agents are actively working on your request
- Animated Status Indicators: Visual feedback showing WORKING/IDLE/OFFLINE states
- Agent Badges: Each log entry shows which agent generated it (🎯 Orchestrator, ☸️ K8s)
- Pulse Animations: Active agents have glowing borders that pulse
- Command History: Navigate previous commands with ↑/↓ arrow keys
- Auto-Scroll: Automatically scrolls to latest output
- Color-Coded Output:
  - 🟢 Green: Success responses and agent output
  - 🔵 Blue: Info logs and agent activity
  - 🔴 Red: Errors and failures
  - ⚪ White: User commands
- Session Stats: Track number of commands in current session
- Real-Time Updates: See logs as agents generate them
- No Polling: Efficient WebSocket connection for instant feedback
- Reconnection: Automatically reconnects if connection is lost
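If you want to consume the same stream outside the browser, a minimal client might look like the sketch below. The /ws path is an assumption, not the project's documented route; check src/sre_agent/api/app.py for the actual WebSocket endpoint.

```python
import asyncio

import websockets


async def tail_agent_logs() -> None:
    # Hypothetical endpoint: verify the real route in src/sre_agent/api/app.py.
    async with websockets.connect("ws://127.0.0.1:8000/ws") as ws:
        async for message in ws:
            print(message)  # each message is a log/status event from the agents


asyncio.run(tail_agent_logs())
```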
Intelligent_SRE_Agent/
├── README.md # This file
├── QUICKSTART.md # Quick start guide for web UI
├── start.sh # Start backend + frontend
├── stop.sh # Stop all services
├── frontend/ # React web UI
│ ├── src/
│ │ ├── App.jsx # Main app component
│ │ └── components/
│ │ ├── TerminalInterface.jsx # Terminal UI
│ │ └── TerminalInterface.css # Styling
│ └── package.json
├── src/sre_agent/
│ ├── agents/
│ │ ├── orchestrator.py # Orchestrator agent
│ │ └── k8s_agent.py # K8s agent with agentic loop
│ ├── api/
│ │ └── app.py # FastAPI backend with WebSocket
│ └── config/
│ ├── oauth.py # OAuth token manager
│ └── settings.py # Configuration (loads .env)
├── scripts/
│ ├── cli_chat_simple.py # CLI chat interface
│ └── agentcore/ # Deployment scripts
│ ├── README.md # Deployment guide
│ ├── runtime/ # Deploy MCP server to runtime
│ ├── gateway/ # Create AgentCore Gateway
│ └── mcp_servers/ # EKS MCP server implementation
└── tests/ # Test suite
"Frontend can't connect to backend"
# Check if backend is running
curl http://127.0.0.1:8000/health
# Check backend logs
tail -f /tmp/sre-agent-backend.log
# Restart services
./stop.sh
./start.sh"WebSocket disconnected"
- Check that backend is running on port 8000
- Verify CORS settings in src/sre_agent/api/app.py
"Agents not showing activity"
- Ensure logs contain agent names ("orchestrator", "k8s_agent")
- Check WebSocket connection in browser dev tools
- Verify log messages are being emitted by agents
"Credentials file not found"
# Run deployment scripts in order
cd scripts/agentcore
uv run python runtime/setup_oauth.py
uv run python runtime/deploy_mcp_to_runtime.py
uv run python gateway/setup_gateway.py"MCP tools not loading"
# Verify gateway URL in .env
cat .env
# Test gateway connection
uv run python scripts/agentcore/runtime/test_mcp_listtools.py
# Check gateway status
aws bedrock-agentcore-control get-gateway --gateway-identifier <gateway_id> --region $AWS_REGION
"Invalid OAuth token"
# Verify OAuth credentials match
diff <(cat .env | grep COGNITO) <(cat scripts/.infrastructure_credentials.json | grep cognito)
# Test token acquisition
python -c "from sre_agent.config.oauth import OAuthTokenManager; from sre_agent.config.settings import settings; tm = OAuthTokenManager(settings.gateway_cognito_token_endpoint, settings.gateway_cognito_app_client_id, settings.gateway_cognito_app_client_secret); print('Token:', tm.get_token()[:20] + '...')"When you're done testing and want to remove all resources:
Use the automated cleanup script:
# Navigate to runtime directory
cd scripts/agentcore/runtime
# Clean up AgentCore infrastructure (Gateway, Runtime, Cognito)
./deploy_all.sh --cleanup
This removes:
- AgentCore Gateway
- AgentCore Runtime
- Cognito OAuth resources
- IAM roles
- Parameter Store entries
If you created the test cluster, delete it to avoid ongoing charges:
# Delete the test EKS cluster (takes ~10 minutes)
eksctl delete cluster --name eks-mcp-test-cluster --region us-west-2
Note: Only delete the cluster if you created it specifically for testing. Do not delete production clusters!
If you need to manually clean up specific resources:
# Get resource IDs from credentials file
cd scripts/agentcore/runtime
cat credentials.json
# Delete gateway
aws bedrock-agentcore-control delete-gateway \
--gateway-identifier <gateway_id> \
--region $AWS_REGION
# Delete runtime
aws bedrock-agentcore-control delete-runtime \
--runtime-identifier <runtime_agent_id> \
--region $AWS_REGION
# Delete Cognito User Pool
aws cognito-idp delete-user-pool \
--user-pool-id <cognito_pool_id> \
--region $AWS_REGION
- QUICKSTART.md - Quick start guide for the web UI
- scripts/agentcore/README.md - Deployment guide for AgentCore infrastructure
See LICENSE file for details.

