The platform for evaluations and RL environments for agents.
Wrap real software as environments, run benchmarks, and train with RL – locally or at scale.
📅 Hop on a call or email us at 📧 founders@hud.ai
HUD is an open-source platform for evaluating and training agents:
- Turn any piece of software into an RL environment.
- Run reproducible benchmarks and submit to public leaderboards (see the example below).
- Train agents with RL (GRPO) on top of those environments with any RL provider.
- Inspect every tool call, observation, and reward in real time on hud.ai.
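For example, running an agent against a public benchmark looks roughly like the sketch below. This is a minimal sketch of the flow, not a definitive reference: the `run_dataset` helper, its parameters, and the `ClaudeAgent` class are assumed names for illustration, so check hud-evals/hud-python for the actual SDK entry points.

```python
# Minimal sketch of an evaluation run. `run_dataset`, its arguments, and
# `ClaudeAgent` are assumed names for illustration; see hud-evals/hud-python
# for the real SDK surface.
import asyncio

from hud.agents import ClaudeAgent      # assumed import path
from hud.datasets import run_dataset    # assumed import path

async def main():
    results = await run_dataset(
        name="sheetbench-demo",
        dataset="hud-evals/SheetBench-50",  # public benchmark on hud.ai
        agent_class=ClaudeAgent,
    )
    print(results)  # tool calls, observations, and rewards also stream to hud.ai

asyncio.run(main())
```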
The core Python SDK lives in hud-evals/hud-python.
We love contributions!
Check the open issues and pull requests in hud-python to get involved.
- 🧱 MCP environment skeleton – wrap any system as an MCP server so any agent can call it (see the sketch after this list).
- 📡 Live telemetry – traces for every trajectory and tool call, streamed to hud.ai.
- 📊 Public benchmarks & leaderboards – OSWorld-Verified, SheetBench-50, and more at hud.ai/leaderboards.
- 🌐 Cloud browsers – integrations for browser-based environments (AnchorBrowser, Steel, BrowserBase, etc.).
- 🔁 Hot-reload dev loop – `hud dev` for iterating on environments without rebuilds.
- 🧪 One-click RL – `hud rl` to train GRPO policies on any HUD dataset.
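To make the MCP environment skeleton concrete, here is a minimal sketch of wrapping a toy piece of software as an MCP server. It uses `FastMCP` from the official `mcp` Python SDK rather than HUD's own scaffolding, and the `counter-env` tools are hypothetical, so treat it as the general shape of an environment rather than the exact HUD template.

```python
# Sketch of wrapping software as an MCP server, using FastMCP from the
# official `mcp` Python SDK. The environment and its tools are hypothetical.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("counter-env")  # hypothetical toy environment
state = {"count": 0}

@mcp.tool()
def increment() -> int:
    """Act on the wrapped software: bump the counter."""
    state["count"] += 1
    return state["count"]

@mcp.tool()
def evaluate(target: int) -> float:
    """Reward signal: 1.0 once the counter reaches the target."""
    return 1.0 if state["count"] >= target else 0.0

if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio so any MCP agent can call them
```

Once a server like this exists, `hud dev` can hot-reload it while you iterate, and any MCP-capable agent can call its tools.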