
Miles Logo

Enterprise-Grade Reinforcement Learning for Large-Scale Model Training

High-Performance Rollout • Low Precision Training • Production Stability


Latest Updates | Quick Start | Key Features | Documentation


Latest Updates

  • [2026/01] 💎 INT4 Quantization-Aware Training (QAT): Inspired by the Kimi K2-Thinking report, Miles now features a full-stack INT4 W4A16 QAT pipeline. This allows 1TB-scale models to fit into single-machine VRAM (e.g., NVIDIA H200), doubling rollout efficiency by eliminating cross-node bottlenecks while maintaining BF16-equivalent accuracy. Blog
  • [2026/01] 💎 Unified VLM/LLM Multi-Turn Training: We provide an implementation of the VLM multi-turn sampling paradigm. Developers only need to write a custom rollout function to start multi-turn RL for a VLM, just as they would for an LLM. [Blog]
  • [2026/01] 🤖 Multi-Agent Co-Evolution: Miles now supports MrlX, a novel asynchronous co-evolutionary framework for Multi-Agent RL. Achieve superior performance in complex tasks like Doctor-Patient simulations and DeepResearch pipelines by enabling specialized agents to evolve together symbiotically. [Link]
  • [2025/12] 🔄 Rollout Routing Replay (R3): In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [Paper]
  • [2025/11] 🔥 Unified FP8 Release: Solves the stability issues in MoE RL by ensuring training and inference use the exact same FP8 quantization logic. [Blog]
  • [2025/11] Speculative Decoding in RL: Integrated speculative rollout with online SFT for draft models, achieving a 25%+ rollout speedup. [Blog]
  • [2025/11] 🎉 Miles Project Launch: A joint effort by InfiXAI, Ant Group, SGLang RL Team, and the Miles community. [Announcement]

What is Miles?

Miles is a high-performance, enterprise-ready reinforcement learning (RL) framework specifically optimized for large-scale model post-training. Built as a powerful fork of slime, Miles bridges the gap between research-grade RL and production-grade reliability by integrating SGLang for high-throughput rollout and Megatron-LM for scalable training.

"A journey of a thousand miles begins with a single step." — Miles focuses on the low-level system optimizations that make large-scale RL stable, efficient, and reproducible.


Key Features

🌪️ Advanced MoE & Low-Precision Training

  • Unified FP8 Pipeline: The first framework to implement end-to-end FP8 sampling and training. By unifying precision across rollout and training, Miles eliminates the quantization-induced discrepancy that causes RL collapse in large MoE models.
  • Rollout Routing Replay (R3): Records expert routing decisions during SGLang inference and replays them during training to ensure bit-wise expert alignment (a minimal sketch follows this list).
  • INT4 QAT Support: Recommended for 1TB+ models; enables single-machine (e.g., H200) deployment by significantly reducing the memory footprint.
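
To make Rollout Routing Replay concrete, here is a minimal sketch of the record-and-replay idea. The RouterWithReplay class below is illustrative only (it is not the Miles or SGLang implementation): during rollout the router picks the top-k experts and records the indices; during training the recorded indices are replayed so both phases use the same token-to-expert assignment.

import torch
import torch.nn.functional as F

# Illustrative sketch of Rollout Routing Replay (R3); not the actual Miles/SGLang code.
class RouterWithReplay(torch.nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, recorded_topk=None):
        logits = self.gate(x)  # [num_tokens, num_experts]
        if recorded_topk is None:
            # Rollout: route normally and record which experts were chosen.
            recorded_topk = logits.topk(self.top_k, dim=-1).indices
        # Training: reuse the recorded indices so the token-to-expert assignment
        # is identical to the one used during inference.
        weights = F.softmax(logits.gather(-1, recorded_topk), dim=-1)
        return recorded_topk, weights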

🛡️ Eliminating Train-Inference Mismatch

  • Bit-wise Identical Training and Inference Log Probs: System-level solution achieving deterministic forward/backward passes through kernel-level optimization (FlashAttention-3, DeepGEMM).
  • Algorithmic Correction (TIS/MIS): When a mismatch is unavoidable, Miles provides Truncated Importance Sampling (TIS) and Masked Importance Sampling (MIS) to mitigate off-policy bias and prevent training divergence (a simplified TIS sketch appears after this list).
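
As a rough illustration of the TIS correction, here is a simplified REINFORCE-style sketch under our own assumptions (it is not the exact Miles loss): the token-level importance ratio between the training-engine policy and the inference-engine policy that generated the tokens is clipped from above, so occasional large ratios cannot destabilize the update.

import torch

# Simplified sketch of Truncated Importance Sampling (TIS); not the exact Miles loss.
def tis_policy_loss(train_logprobs, rollout_logprobs, advantages, clip_c=2.0):
    # Importance ratio between the training-engine policy and the inference-engine
    # policy that actually generated the sampled tokens.
    ratio = torch.exp(train_logprobs - rollout_logprobs).detach()
    # Truncate (clip from above) so rare large ratios cannot blow up the update.
    ratio = torch.clamp(ratio, max=clip_c)
    # Off-policy-corrected REINFORCE-style policy-gradient loss.
    return -(ratio * advantages * train_logprobs).mean()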

⚡ Extreme Performance & Efficiency

  • Speculative RL Training: Achieves a 25%+ rollout speedup by using an Online SFT Draft Model. Unlike frozen draft models, Miles updates the draft policy during RL to prevent policy drift.
  • Zero-Copy Weight Sync: Optimized weight refit via CUDA IPC zero-copy mapping, async tensor gathering, and bucketed flattening. Sync time reduced by 50% compared to standard HTTP/RPC transfers.
  • Partial Rollout & Over-Sampling: Handles the "Long-Tail Effect" in multi-turn RL by over-sampling requests and recycling half-finished trajectories to maximize GPU utilization (a toy sketch appears after this list).
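
The over-sampling idea can be sketched in a few lines. The engine.launch and engine.poll calls below are hypothetical placeholders, not the Miles scheduler API: launch more requests than the batch needs, train on whichever trajectories finish first, and carry the half-finished ones over to the next step instead of throwing their tokens away.

# Toy sketch of over-sampling with partial-rollout recycling.
# engine.launch and engine.poll are hypothetical placeholders, not real APIs.
def collect_rollout_batch(engine, prompts, batch_size, oversample_ratio=1.5, backlog=None):
    backlog = backlog or []
    # Resume carried-over trajectories first, then launch extra requests to hide long-tail latency.
    pending = backlog + engine.launch(prompts, num_requests=int(batch_size * oversample_ratio))
    finished, unfinished = engine.poll(pending, stop_after=batch_size)
    # Train on the first batch_size finished trajectories; recycle the rest next step.
    return finished[:batch_size], unfinished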

Model Support & Training Diversity

🏗️ Supported Models

Miles supports a wide range of state-of-the-art architectures, with a special emphasis on DeepSeek, Qwen, Llama, and other mainstream models.

Family   | Supported Models
DeepSeek | R1, V3, V3.2
Qwen     | Qwen 2, 2.5, 3
Llama    | Llama 3, 3.1, 3.3, 4
Gemma    | Gemma 2, 3, 3N
GLM      | GLM-4.5, GLM-4.6, GLM-4.7
MiniMax  | M2, M2.1
Others   | Mistral, Mixtral, Phi, gpt-oss, and any model supported by SGLang and Megatron

🧩 Diverse Training Scenarios

Miles is designed to handle the complexity of modern RL workloads across various dimensions:

  • Multi-Turn Interaction: Optimized for complex, multi-round conversations and tool-use scenarios.
  • VLM & LLM Support: Unified framework for both Vision-Language and pure Text models.
  • Reasoning & Coding: Specific recipes and optimizations for Reasoning (Math/Logic) and Coding Agent tasks.
  • Multi-Agent Training: Support for advanced co-training and collaborative multi-agent reinforcement learning.

Quick Start

Installation

We recommend using our official Docker image for the best performance and compatibility:

# Pull the latest image
docker pull radixark/miles:latest

# Or install from source
pip install -r requirements.txt
pip install -e .

Launch Training

Miles provides a unified entry point for complex RL tasks. Here is an example of FP8 GRPO training for Qwen3:

python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8
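
For reference, the GRPO estimator selected by --advantage-estimator grpo normalizes each sample's reward against the other samples drawn from the same prompt, which is why --n-samples-per-prompt matters. A minimal sketch of the group normalization (not the Miles implementation):

import torch

# Minimal sketch of GRPO-style group-normalized advantages.
def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [num_prompts, n_samples_per_prompt], one row per prompt group.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)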

For comprehensive guides on environment setup and custom reward functions, see the Quick Start Guide.
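
To give a flavor of what a custom reward function computes, here is a self-contained illustration; the actual Miles reward-function interface is described in the Quick Start Guide and may differ.

# Illustrative reward: exact-match scoring of a sampled response against a reference answer.
# This is a standalone example, not the Miles reward-function interface.
def exact_match_reward(response: str, reference: str) -> float:
    # Compare the final non-empty line of the response with the reference answer.
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    prediction = lines[-1] if lines else ""
    return 1.0 if prediction == reference.strip() else 0.0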


Roadmap

✅ Completed

  • Unified FP8 E2E Training & Rollout
  • INT4 Quantization-Aware Training (QAT): Single-machine 1TB models
  • Speculative RL with Online SFT
  • Multi-Agent RL (Co-evolutionary frameworks like MrlX)
  • Support DeepSeek V3.2 Models
  • VLM Multi-Turn Training
  • Aligning SGLang with Megatron in Dense Models
  • Rollout Routing Replay (R3)

🏗️ In Progress & Planned

  • Zero mismatch for MoE RL
  • Aligning SGLang with Megatron in MoE Models
  • Diffusion RL Support
  • Omni RL Support
  • Diffusion LLM RL Support
  • Elastic Resource Scheduling: Dynamic scaling of rollout vs. training workers

Acknowledgements

Miles is built upon the shoulders of giants in the LLM infrastructure ecosystem:

  • slime: The core modular architecture and inspiration.
  • SGLang: The high-performance inference engine.
  • Megatron-LM: Robust large-scale training components.

Special thanks to InfiXAI Team, Ant Group AQ Team, SGLang RL Team, and the Miles Team. We also thank DataCrunch for compute sponsorship and NVIDIA for technical support on Transformer Engine (TE).


Links

Give Miles a ⭐️ Star if it helps your RL journey!
