Built for uOttaHack 8 - IT: Agentic Compare Challenge
- Three Games: Tic-Tac-Toe, Connect-4, and Battleship
- Three LLM Providers: OpenAI GPT, DeepSeek, and Google Gemini with multiple model variants
- Streaming Gameplay: Real-time server-sent events for live match updates
- Real-time Animations: Piece drops, mark fade-ins, winning line highlights, confetti celebrations
- Match Replay: Step through moves or auto-play at 500ms intervals
- Global Leaderboard: Track wins by model across all matches
- Robust Validation: Server-side move validation with retry logic
- Advanced Metrics: Token usage tracking, move duration, error rates per agent
- Configurable Models: Select specific model variants (gpt-4o, deepseek-reasoner, gemini-2.0-flash, etc.)
```
├── src/
│   ├── app/                         # Next.js App Router
│   │   ├── api/
│   │   │   ├── play/route.ts        # POST - Run a match (synchronous)
│   │   │   ├── play-stream/route.ts # POST - Stream match updates (SSE)
│   │   │   └── stats/route.ts       # GET - Fetch global stats
│   │   ├── layout.tsx
│   │   ├── page.tsx                 # Main UI
│   │   └── globals.css
│   ├── components/                  # React components
│   │   ├── TicTacToeBoard.tsx
│   │   ├── Connect4Board.tsx
│   │   ├── BattleshipBoard.tsx      # Battleship game board
│   │   ├── AgentPanel.tsx
│   │   ├── MatchResultCard.tsx
│   │   ├── GlobalStats.tsx
│   │   ├── GameControls.tsx
│   │   ├── ReplayControls.tsx
│   │   ├── LiveOutput.tsx           # Real-time move stream
│   │   └── LLMSettings.tsx          # Model variant selection
│   ├── lib/
│   │   ├── games/                   # Game engines
│   │   │   ├── tictactoe.ts
│   │   │   ├── connect4.ts
│   │   │   └── battleship.ts        # Battleship logic
│   │   ├── agents/                  # LLM adapters
│   │   │   ├── gpt.ts               # OpenAI integration
│   │   │   ├── deepseek.ts          # DeepSeek integration
│   │   │   ├── gemini.ts            # Google Gemini integration
│   │   │   ├── prompts.ts
│   │   │   └── index.ts
│   │   ├── simulation.ts            # Match runner
│   │   ├── db.ts                    # File-based JSON storage
│   │   ├── ui/
│   │   │   └── providerStyles.ts    # Theme colors per model
│   │   └── utils/
│   │       └── pricing.ts           # Token cost calculations
│   └── types/
│       └── index.ts                 # TypeScript types
├── data/
│   └── matches.json                 # Match history (auto-created)
└── package.json
```
Install dependencies:

```bash
npm install
```

Create a `.env.local` file:

```bash
# OpenAI API Key (required for GPT models)
OPENAI_API_KEY=sk-your-openai-api-key-here
# DeepSeek API Key and Base URL (required for DeepSeek models)
DEEPSEEK_API_KEY=your-deepseek-api-key-here
DEEPSEEK_BASE_URL=https://api.deepseek.com/v1
# Google Gemini API Key (required for Gemini models)
GEMINI_API_KEY=your-gemini-api-key-here
```

Start the dev server:

```bash
npm run dev
```

Open http://localhost:3000 in your browser.
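These keys are consumed by the adapters in `src/lib/agents/`. As a rough sketch of that wiring (assuming the official `openai` and `@google/generative-ai` SDKs; the actual adapter code may differ):

```typescript
// Sketch: constructing provider clients from .env.local.
import OpenAI from "openai";
import { GoogleGenerativeAI } from "@google/generative-ai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// DeepSeek exposes an OpenAI-compatible API, which is why it takes a
// base URL: the same client class works against a different endpoint.
const deepseek = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: process.env.DEEPSEEK_BASE_URL, // https://api.deepseek.com/v1
});

const gemini = new GoogleGenerativeAI(process.env.GEMINI_API_KEY ?? "");
```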
- Select a game type: Tic-Tac-Toe, Connect-4, or Battleship
- Choose the AI models for Agent A and Agent B (GPT, DeepSeek, or Gemini variants)
- Click "Run Single Match" to start streaming gameplay
- The server streams real-time updates via Server-Sent Events (SSE) while the match loop runs (sketched below)
- Each agent receives the current board state and must return a valid move
- Invalid JSON or illegal moves trigger retries (up to 5 attempts)
- Results are stored and the leaderboard updates automatically
- Match replay is available with step-through or auto-play controls
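At its core, the match runner (`src/lib/simulation.ts`) alternates between the two agents until the game engine reports a terminal state. A simplified sketch; the `GameEngine`/`Agent` shapes and `getMove` are illustrative stand-ins for the real engines (`src/lib/games/`) and adapters (`src/lib/agents/`):

```typescript
// Simplified sketch of the match loop; names are illustrative.
interface GameState { board: unknown[]; turn: 0 | 1 }

interface GameEngine {
  initialState(): GameState;
  isTerminal(s: GameState): boolean;
  applyMove(s: GameState, move: number): GameState;
  winner(s: GameState): "A" | "B" | "draw";
}

interface Agent { model: string; variant: string }

async function runMatch(
  engine: GameEngine,
  agents: [Agent, Agent],
  // getMove wraps prompting, JSON parsing, and the retry logic shown later
  getMove: (a: Agent, s: GameState) => Promise<{ move: number; reason: string }>,
) {
  let state = engine.initialState();
  const moves: { player: "A" | "B"; move: number; reason: string }[] = [];

  while (!engine.isTerminal(state)) {
    const turn = state.turn; // 0 = Agent A, 1 = Agent B
    const { move, reason } = await getMove(agents[turn], state);
    state = engine.applyMove(state, move);
    moves.push({ player: turn === 0 ? "A" : "B", move, reason });
  }

  return { winner: engine.winner(state), moves, finalBoard: state.board };
}
```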
OpenAI (GPT):
- gpt-4o-mini (default)
- gpt-4o
- gpt-4-turbo
- gpt-3.5-turbo
DeepSeek:
- deepseek-chat (default)
- deepseek-reasoner
Google Gemini:
- gemini-2.0-flash (default)
- gemini-2.0-flash-lite
- gemini-1.5-flash
- gemini-1.5-pro
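In request terms, an agent is identified by a provider plus an optional variant. A sketch of that shape, using the defaults listed above (the exact type names in `src/types/index.ts` may differ):

```typescript
// Sketch of the agent selection shape used in the API examples below.
type Provider = "gpt" | "deepseek" | "gemini";

interface AgentSelection {
  model: Provider;  // provider key, as in {"model": "gpt", ...}
  variant?: string; // specific model variant; falls back to the default
}

// Default variant per provider, as listed above.
const DEFAULT_VARIANT: Record<Provider, string> = {
  gpt: "gpt-4o-mini",
  deepseek: "deepseek-chat",
  gemini: "gemini-2.0-flash",
};
```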
Agents receive a structured prompt with:
- Current board state (visual representation)
- List of legal moves
- Winning/blocking move hints (for Tic-Tac-Toe and Connect-4)
- Game-specific strategic advice
- Instructions to return a JSON object with `move` and `reason` fields (example reply below)
Example prompt for Tic-Tac-Toe:
```
You are playing Tic-Tac-Toe as X.
Board is a 3x3 grid in row-major order:
Index mapping:
0 1 2
3 4 5
6 7 8
Current board:
[X, O, _,
 _, X, _,
 O, _, _]
Legal moves: [2, 3, 5, 7, 8]
```
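The agent is expected to reply with a single JSON object. For the board above, a strong reply would be:

```json
{ "move": 8, "reason": "Completes the 0-4-8 diagonal for the win." }
```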
- Invalid JSON: If the agent returns malformed JSON, it is retried with an error message describing the problem
- Illegal Move: If the move is not in the legal-moves list, it is retried with an error message
- Max Retries: After 5 failed attempts, the agent forfeits and the opponent wins (loop sketched below)
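A sketch of that validation loop, with illustrative names (the real logic lives in the agent adapters and match runner):

```typescript
// Parse the reply, validate the move, and feed any error back to the
// model on the next attempt. `ask` performs one LLM round-trip.
const MAX_RETRIES = 5;

async function getValidMove(
  ask: (errorHint?: string) => Promise<string>,
  legalMoves: number[],
): Promise<{ move: number; reason: string }> {
  let errorHint: string | undefined;

  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    const raw = await ask(errorHint);
    try {
      const parsed = JSON.parse(raw) as { move: number; reason: string };
      if (!legalMoves.includes(parsed.move)) {
        errorHint = `Illegal move ${parsed.move}; legal moves: ${legalMoves.join(", ")}`;
        continue;
      }
      return parsed;
    } catch {
      errorHint = "Your previous reply was not valid JSON. Reply with JSON only.";
    }
  }
  throw new Error("forfeit"); // after 5 failures, the opponent wins
}
```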
POST /api/play-stream streams real-time match updates via Server-Sent Events (SSE). This is the primary endpoint used by the UI.
Request:

```json
{
"gameType": "ttt", // or "c4" or "bs"
"agentA": { "model": "gpt", "variant": "gpt-4o-mini" },
"agentB": { "model": "gemini", "variant": "gemini-2.0-flash" }
}
```

Response Stream Events:
- `thinking`: Agent is processing its next move
- `move`: Move completed, with board state and reasoning
- `complete`: Match finished, with the final result
- `error`: Error occurred during the match
- `forfeit`: Agent forfeited after max retries
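Because the endpoint is a POST, `EventSource` cannot be used directly; one way to consume the stream is to read the response body manually. A minimal client-side sketch (event names follow the list above; payload shapes are illustrative):

```typescript
// Read the SSE stream over fetch and dispatch on event type.
async function watchMatch(): Promise<void> {
  const res = await fetch("/api/play-stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      gameType: "ttt",
      agentA: { model: "gpt", variant: "gpt-4o-mini" },
      agentB: { model: "gemini", variant: "gemini-2.0-flash" },
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE frames are separated by a blank line.
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? ""; // keep any partial frame for the next read
    for (const frame of frames) {
      const event = /^event: (.+)$/m.exec(frame)?.[1];
      const data = /^data: (.+)$/m.exec(frame)?.[1];
      if (!event || !data) continue;
      switch (event) {
        case "thinking": /* show a spinner for the active agent */ break;
        case "move":     /* apply JSON.parse(data) to the board */ break;
        case "complete": /* render the result card, refresh stats */ break;
        case "forfeit":
        case "error":    /* surface the failure to the user */ break;
      }
    }
  }
}
```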
POST /api/play runs a single match synchronously (no streaming).
Request:

```json
{
"gameType": "ttt", // "ttt", "c4", or "bs"
"agentA": { "model": "gpt", "variant": "gpt-4o-mini" },
"agentB": { "model": "deepseek", "variant": "deepseek-chat" }
}
```

Response:

```json
{
"id": "uuid",
"winner": "A", // "A", "B", or "draw"
"winnerModel": "gpt",
"moves": [...],
"finalBoard": [...],
"metrics": {
"totalMoves": 9,
"durationMs": 5432,
"agentA": {
"invalidJsonCount": 0,
"illegalMoveCount": 1,
"retryCount": 1,
"inputTokens": 245,
"outputTokens": 52
},
"agentB": { /* same structure */ }
}
}
```
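For example, the synchronous endpoint can be called from a script like this (request and response fields as documented above):

```typescript
// Run one synchronous match and log the outcome.
async function playOnce(): Promise<void> {
  const res = await fetch("/api/play", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      gameType: "c4",
      agentA: { model: "gpt", variant: "gpt-4o" },
      agentB: { model: "deepseek", variant: "deepseek-reasoner" },
    }),
  });
  const result = await res.json();
  console.log(result.winner, result.metrics);
}
```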
GET /api/stats fetches global leaderboard statistics.

Response:

```json
{
"stats": {
"ttt": { "matchesPlayed": 10, "draws": 2, "winsByModel": { "gpt": 5, "deepseek": 2, "gemini": 1 } },
"c4": { "matchesPlayed": 5, "draws": 0, "winsByModel": { "gpt": 3, "deepseek": 2 } },
"bs": { "matchesPlayed": 3, "draws": 0, "winsByModel": { "gemini": 2, "gpt": 1 } }
},
"recentMatches": [...]
}
```

- Framework: Next.js 14 (App Router)
- Language: TypeScript
- Styling: Tailwind CSS + Lucide Icons
- LLM APIs: OpenAI, DeepSeek, Google Gemini
- Database: File-based JSON
- Animations: canvas-confetti
Agent Arena turns the abstract concept of "agentic AI comparison" into a fun, visual, and educational experience. Framing the comparison as a head-to-head game match lets users immediately see how models and agent strategies differ.
Key differentiators:
- Engaging UX: Animated game boards with piece drops, winning highlights, and confetti celebrations
- Meaningful Metrics: Track not just wins, but also token usage, error rates, and retries per model
- Fair Comparison: The same prompts and the same game state go to different brains, showing how GPT, DeepSeek, and Gemini reason differently
- Robust Design: Server-side validation ensures no cheating; retry logic handles LLM quirks
- Multiple Complexities: From simple Tic-Tac-Toe to strategic Battleship, agents are tested across varying difficulty levels
The leaderboard provides a persistent record of AI capabilities, making Agent Arena not just a demo, but a benchmarking tool for agentic AI performance.
MIT

