πŸ† 1st Place @ uOttawaHack 8

Built for uOttaHack 8 - IT: Agentic Compare Challenge

Features

  • Three Games: Tic-Tac-Toe, Connect-4, and Battleship
  • Three LLM Providers: OpenAI GPT, DeepSeek, and Google Gemini with multiple model variants
  • Streaming Gameplay: Real-time server-sent events for live match updates
  • Real-time Animations: Piece drops, mark fade-ins, winning line highlights, confetti celebrations
  • Match Replay: Step through moves or auto-play at 500ms intervals
  • Global Leaderboard: Track wins by model across all matches
  • Robust Validation: Server-side move validation with retry logic
  • Advanced Metrics: Token usage tracking, move duration, error rates per agent
  • Configurable Models: Select specific model variants (gpt-4o, deepseek-reasoner, gemini-2.0-flash, etc.)

Architecture Overview

β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ app/                         # Next.js App Router
β”‚   β”‚   β”œβ”€β”€ api/
β”‚   β”‚   β”‚   β”œβ”€β”€ play/route.ts        # POST - Run a match (synchronous)
β”‚   β”‚   β”‚   β”œβ”€β”€ play-stream/route.ts # POST - Stream match updates (SSE)
β”‚   β”‚   β”‚   └── stats/route.ts       # GET - Fetch global stats
β”‚   β”‚   β”œβ”€β”€ layout.tsx
β”‚   β”‚   β”œβ”€β”€ page.tsx                 # Main UI
β”‚   β”‚   └── globals.css
β”‚   β”œβ”€β”€ components/                  # React components
β”‚   β”‚   β”œβ”€β”€ TicTacToeBoard.tsx
β”‚   β”‚   β”œβ”€β”€ Connect4Board.tsx
β”‚   β”‚   β”œβ”€β”€ BattleshipBoard.tsx      # Battleship game board
β”‚   β”‚   β”œβ”€β”€ AgentPanel.tsx
β”‚   β”‚   β”œβ”€β”€ MatchResultCard.tsx
β”‚   β”‚   β”œβ”€β”€ GlobalStats.tsx
β”‚   β”‚   β”œβ”€β”€ GameControls.tsx
β”‚   β”‚   β”œβ”€β”€ ReplayControls.tsx
β”‚   β”‚   β”œβ”€β”€ LiveOutput.tsx           # Real-time move stream
β”‚   β”‚   └── LLMSettings.tsx          # Model variant selection
β”‚   β”œβ”€β”€ lib/
β”‚   β”‚   β”œβ”€β”€ games/                   # Game engines
β”‚   β”‚   β”‚   β”œβ”€β”€ tictactoe.ts
β”‚   β”‚   β”‚   β”œβ”€β”€ connect4.ts
β”‚   β”‚   β”‚   └── battleship.ts        # Battleship logic
β”‚   β”‚   β”œβ”€β”€ agents/                  # LLM adapters
β”‚   β”‚   β”‚   β”œβ”€β”€ gpt.ts               # OpenAI integration
β”‚   β”‚   β”‚   β”œβ”€β”€ deepseek.ts          # DeepSeek integration
β”‚   β”‚   β”‚   β”œβ”€β”€ gemini.ts            # Google Gemini integration
β”‚   β”‚   β”‚   β”œβ”€β”€ prompts.ts
β”‚   β”‚   β”‚   └── index.ts
β”‚   β”‚   β”œβ”€β”€ simulation.ts            # Match runner
β”‚   β”‚   β”œβ”€β”€ db.ts                    # File-based JSON storage
β”‚   β”‚   β”œβ”€β”€ ui/
β”‚   β”‚   β”‚   └── providerStyles.ts    # Theme colors per model
β”‚   β”‚   └── utils/
β”‚   β”‚       └── pricing.ts           # Token cost calculations
β”‚   └── types/
β”‚       └── index.ts                 # TypeScript types
β”œβ”€β”€ data/
β”‚   └── matches.json                 # Match history (auto-created)
└── package.json

Quick Start

1. Install Dependencies

npm install

2. Configure Environment

Create a .env.local file:

# OpenAI API Key (required for GPT models)
OPENAI_API_KEY=sk-your-openai-api-key-here

# DeepSeek API Key and Base URL (required for DeepSeek models)
DEEPSEEK_API_KEY=your-deepseek-api-key-here
DEEPSEEK_BASE_URL=https://api.deepseek.com/v1

# Google Gemini API Key (required for Gemini models)
GEMINI_API_KEY=your-gemini-api-key-here

3. Run Development Server

npm run dev

Open http://localhost:3000 in your browser.

How It Works

Match Flow

  1. Select a game type (Tic-Tac-Toe, Connect-4, or Battleship)
  2. Choose AI models for Agent A and Agent B (GPT, DeepSeek, or Gemini variants)
  3. Click "Run Single Match" to start streaming gameplay
  4. The server streams real-time updates via Server-Sent Events (SSE)
  5. Each agent receives the current board state and must return a valid move (the match loop is sketched below)
  6. Invalid JSON or illegal moves trigger retries (up to 5 attempts)
  7. Results are stored and the leaderboard updates automatically
  8. Replay the match with step-through or auto-play controls
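In outline, the match runner (src/lib/simulation.ts) alternates agents until the game engine reports a result. A minimal sketch of that loop; the GameEngine interface and its method names are illustrative assumptions, not the project's actual types:

// Simplified sketch of the match loop. The GameEngine interface and its
// method names are assumptions for illustration, not the real types.
interface GameEngine {
  legalMoves(): number[];
  apply(move: number, player: "A" | "B"): void;
  winner(): "A" | "B" | "draw" | null; // null means the game is still running
}

async function runMatch(
  engine: GameEngine,
  getMove: (player: "A" | "B", legal: number[]) => Promise<number | null>
): Promise<"A" | "B" | "draw"> {
  let player: "A" | "B" = "A";
  while (engine.winner() === null) {
    const move = await getMove(player, engine.legalMoves());
    if (move === null) return player === "A" ? "B" : "A"; // forfeit: opponent wins
    engine.apply(move, player);
    player = player === "A" ? "B" : "A"; // alternate turns
  }
  return engine.winner()!;
}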

Supported Models

OpenAI (GPT):

  • gpt-4o-mini (default)
  • gpt-4o
  • gpt-4-turbo
  • gpt-3.5-turbo

DeepSeek:

  • deepseek-chat (default)
  • deepseek-reasoner

Google Gemini:

  • gemini-2.0-flash (default)
  • gemini-2.0-flash-lite
  • gemini-1.5-flash
  • gemini-1.5-pro

Agent Prompting

Agents receive a structured prompt with:

  • Current board state (visual representation)
  • List of legal moves
  • Winning/blocking move hints (for Tic-Tac-Toe and Connect-4)
  • Game-specific strategic advice
  • Instructions to return JSON with move and reason

Example prompt for Tic-Tac-Toe:

You are playing Tic-Tac-Toe as X.

Board is a 3x3 grid in row-major order:

Index mapping:
0 1 2
3 4 5
6 7 8

Current board:
[X, O, _,
 _, X, _,
 O, _, _]

Legal moves: [2, 3, 5, 7, 8]
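Given this prompt, the agent must reply with a single JSON object containing its move and reason. An illustrative reply (the exact field values vary by model):

{ "move": 8, "reason": "Playing index 8 completes the X diagonal 0-4-8 for the win." }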

Validation & Retry Logic

  • Invalid JSON: If the agent returns malformed JSON, retry with an error message
  • Illegal Move: If the move is not in the legal-moves list, retry with an error message
  • Max Retries: After 5 failed attempts, the agent forfeits and the opponent wins
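A minimal sketch of this retry loop, assuming a requestMove adapter function; the name, signature, and error-message wording are illustrative, not the code in src/lib/simulation.ts:

// Illustrative sketch of the retry loop described above. `requestMove`
// stands in for the provider adapter call; its name and signature are
// assumptions, not the project's actual API.
async function getValidMove(
  agent: { requestMove: (prompt: string) => Promise<string> },
  prompt: string,
  legalMoves: number[],
  maxRetries = 5
): Promise<number | null> {
  let feedback = "";
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const raw = await agent.requestMove(prompt + feedback);
    let parsed: { move?: number };
    try {
      parsed = JSON.parse(raw);
    } catch {
      feedback = "\nYour last reply was not valid JSON. Return JSON only.";
      continue; // invalid JSON: retry with an error message
    }
    if (typeof parsed.move !== "number" || !legalMoves.includes(parsed.move)) {
      feedback = `\nMove ${parsed.move} is illegal. Legal moves: [${legalMoves.join(", ")}].`;
      continue; // illegal move: retry with an error message
    }
    return parsed.move;
  }
  return null; // forfeit after maxRetries failed attempts
}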

API Endpoints

POST /api/play-stream

Stream real-time match updates via Server-Sent Events (SSE). This is the primary endpoint used by the UI.

Request:

{
  "gameType": "ttt",  // or "c4" or "bs"
  "agentA": { "model": "gpt", "variant": "gpt-4o-mini" },
  "agentB": { "model": "gemini", "variant": "gemini-2.0-flash" }
}

Response Stream Events:

  • thinking: Agent is processing next move
  • move: Move completed with board state and reasoning
  • complete: Match finished with final results
  • error: Error occurred during match
  • forfeit: Agent forfeited after max retries
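Because the endpoint uses POST, the browser's EventSource API does not apply; a client can instead read the stream with fetch. A sketch of one way to do this, where the SSE framing details and any payload fields beyond the event names above are assumptions:

// Sketch: reading the POST SSE stream with fetch. Event names match the
// list above; the frame parsing and payload shapes are assumptions.
const res = await fetch("/api/play-stream", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    gameType: "ttt",
    agentA: { model: "gpt", variant: "gpt-4o-mini" },
    agentB: { model: "gemini", variant: "gemini-2.0-flash" },
  }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";
for (;;) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const frames = buffer.split("\n\n"); // SSE frames end with a blank line
  buffer = frames.pop() ?? "";
  for (const frame of frames) {
    const dataLine = frame.split("\n").find((l) => l.startsWith("data:"));
    if (dataLine) console.log(JSON.parse(dataLine.slice(5)));
  }
}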

POST /api/play

Run a single match synchronously (no streaming).

Request:

{
  "gameType": "ttt",  // "ttt", "c4", or "bs"
  "agentA": { "model": "gpt", "variant": "gpt-4o-mini" },
  "agentB": { "model": "deepseek", "variant": "deepseek-chat" }
}

Response:

{
  "id": "uuid",
  "winner": "A",  // "A", "B", or "draw"
  "winnerModel": "gpt",
  "moves": [...],
  "finalBoard": [...],
  "metrics": {
    "totalMoves": 9,
    "durationMs": 5432,
    "agentA": {
      "invalidJsonCount": 0,
      "illegalMoveCount": 1,
      "retryCount": 1,
      "inputTokens": 245,
      "outputTokens": 52
    },
    "agentB": { /* same structure */ }
  }
}

GET /api/stats

Fetch global leaderboard statistics.
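For example (assuming the dev server is running on localhost:3000):

curl http://localhost:3000/api/stats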

Response:

{
  "stats": {
    "ttt": { "matchesPlayed": 10, "draws": 2, "winsByModel": { "gpt": 5, "deepseek": 2, "gemini": 1 } },
    "c4": { "matchesPlayed": 5, "draws": 0, "winsByModel": { "gpt": 3, "deepseek": 2 } },
    "bs": { "matchesPlayed": 3, "draws": 0, "winsByModel": { "gemini": 2, "gpt": 1 } }
  },
  "recentMatches": [...]
}

Tech Stack

  • Framework: Next.js 14 (App Router)
  • Language: TypeScript
  • Styling: Tailwind CSS + Lucide Icons
  • LLM APIs: OpenAI, DeepSeek, Google Gemini
  • Database: File-based JSON
  • Animations: canvas-confetti

Judge Pitch

Agent Arena transforms the abstract concept of "agentic AI comparison" into a fun, visual, and educational experience. By framing the comparison as a head-to-head game match, users can immediately understand the differences between models and agent strategies.

Key differentiators:

  • Engaging UX: Animated game boards with piece drops, winning highlights, and confetti celebrations
  • Meaningful Metrics: Track not just wins, but also token usage, error rates, and retries per model
  • Fair Comparison: Same prompts, same game state, different brains - see how GPT, DeepSeek, and Gemini reason differently
  • Robust Design: Server-side validation ensures no cheating; retry logic handles LLM quirks
  • Multiple Complexities: From simple Tic-Tac-Toe to strategic Battleship - test agents across varying difficulty levels

The leaderboard provides a persistent record of AI capabilities, making Agent Arena not just a demo, but a benchmarking tool for agentic AI performance.

License

MIT
