goddoe/RLYX

RLYX

A hackable, simple, and research-friendly GRPO training framework with high-speed weight synchronization in a multinode environment.

Note: This project, previously simple-r1, has been refactored and renamed to RLYX with improved modular architecture using a decorator-based registry system.

Features

  • High-Speed Weight Synchronization between Training Process and Inference Workers: Traditional RLHF frameworks (e.g., Open-R1) combine training and inference within a single process, which leads to high memory overhead. RLYX decouples inference from training and updates the weights of vLLM-based inference workers via direct NCCL communication among distributed nodes, which keeps weight updates fast.
  • High-Performance Inference with Ray Serve: Ray Serve is a scalable serving framework that provides load balancing for inference workers. RLYX uses Ray Serve to sample generated text from vLLM efficiently.
  • Modular Architecture with Registry Pattern: RLYX uses a decorator-based registry system for easy extension and customization of components (rewards, tokenizers, evaluators, etc.).
  • Hackable: No Hugging Face Trainer. You can fully customize your training loop.
  • Simple: Minimal abstraction, minimal files, minimal dependencies.
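
Since the training loop is meant to be customized, it may help to see the one step that gives GRPO its name: rewards for completions sampled from the same prompt are normalized within that group. The following is a minimal sketch of that idea in plain Python, not RLYX's actual code. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of completions sampled for ONE prompt.

    Each reward is centered on the group mean and scaled by the group
    standard deviation (population std is an assumption here); eps avoids
    division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two judged correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, so no separate value model is needed to form a baseline.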

Architecture

[Figure: sample generation using Ray Serve and vLLM]
[Figure: weight synchronization]

Project Structure

rlyx/
├── registries.py         # Centralized registry system
├── train.py              # Main training script
├── evaluation.py         # Model evaluation
├── infer_workers.py      # Inference workers with Ray Serve
├── arguments.py          # Training arguments
├── utils/                # Utilities and helpers
├── chat_templates/       # Chat templates (registry-based)
├── dataset_loaders/      # Dataset loaders (registry-based)
├── evaluators/           # Model evaluators (registry-based)
├── rewards/              # Reward functions (registry-based)
└── tokenizers/           # Tokenizers (registry-based)
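
To illustrate what "registry-based" can mean for the directories above, here is a minimal sketch of a decorator-based registry. It is an assumption about the general pattern, not the contents of `registries.py`: the `Registry` class, its `get` method, `DEMO_REWARD_REGISTRY`, and `length_penalty` are all hypothetical names.

```python
from typing import Callable, Dict


class Registry:
    """Hypothetical name-to-callable registry (not RLYX's actual class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._items: Dict[str, Callable] = {}

    def register(self, key: str) -> Callable[[Callable], Callable]:
        """Return a decorator that stores the decorated callable under `key`."""
        def decorator(fn: Callable) -> Callable:
            if key in self._items:
                raise KeyError(f"{key!r} is already registered in {self.name}")
            self._items[key] = fn
            return fn  # the function itself is returned unchanged
        return decorator

    def get(self, key: str) -> Callable:
        """Look up a previously registered callable by name."""
        return self._items[key]


DEMO_REWARD_REGISTRY = Registry("reward")


@DEMO_REWARD_REGISTRY.register("length_penalty")
def length_penalty(pred_text: str, gold_text: str, **kwargs) -> float:
    # Toy reward: longer predictions are penalized slightly.
    return -0.001 * len(pred_text)


# A config can then refer to the component by its string name.
reward_fn = DEMO_REWARD_REGISTRY.get("length_penalty")
```

The appeal of this pattern is that adding a component only requires a new file with a decorated function; the training script looks components up by the names listed in the experiment config.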

TODO

  • Implement a basic training loop to reproduce DeepSeek R1-Zero.
  • Implement high-speed weight synchronization using NCCL between training and inference nodes.
  • Improve code readability, enhance documentation, and refactor the code with modular architecture.
  • Test distributed training with a single training node and a single inference node.
  • Test distributed training with multiple training and inference nodes.
  • Test and support large models.

R1 Zero Simple Reproduction

GSM8K (Qwen2.5-0.5B)

Training Figures

[Figure: gsm8k_sample]

| Model | Description | GSM8K |
|---|---|---|
| Qwen.2.5-0.5B | Qwen2.5-0.5B baseline | 41.6 |
| Qwen.2.5-0.5B-r1-zero-reproduction | R1-Zero training with Qwen2.5-0.5B, limited output length (max output length 500) | 62.0 |

Working Environment

  • Python 3.12
  • PyTorch 2.5.1
  • CUDA Toolkit 12.1 ~ 12.4
    • (Due to vLLM and Ray compatibility issues, CUDA versions must be between 12.1 and 12.4.)

Installation

Install PyTorch

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

Extending RLYX

Adding Custom Components

RLYX uses a decorator-based registry system. To add custom components:

Example: Adding a Custom Reward Function

# rlyx/rewards/my_custom_reward.py
from rlyx.registries import REWARD_REGISTRY

@REWARD_REGISTRY.register("my_custom_reward")
def my_custom_reward_func(pred_text: str, gold_text: str, **kwargs) -> float:
    # Your implementation here
    return 1.0

Use it in your experiment config:

reward_function_names: ["my_custom_reward"]
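
As a slightly more concrete illustration of the reward signature above, the sketch below scores a GSM8K-style answer by comparing the last number in the prediction with the last number in the reference. It is an assumption-laden example, not a reward shipped with RLYX: `last_number` and `gsm8k_exact_match_reward` are hypothetical names, and the "last number wins" rule is a simplification. The registry decorator is left out so the block runs on its own; inside RLYX it would be registered with `@REWARD_REGISTRY.register(...)` as shown earlier.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative, with optional thousands commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number found in `text`, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_exact_match_reward(pred_text: str, gold_text: str, **kwargs) -> float:
    """1.0 if the final numbers agree, else 0.0 (hypothetical reward)."""
    pred = last_number(pred_text)
    gold = last_number(gold_text)
    if pred is None or gold is None:
        return 0.0
    return 1.0 if abs(pred - gold) < 1e-6 else 0.0


# GSM8K references end with a line of the form "#### <answer>".
gold = "Natalia sold 48/2 = 24 clips in May. 48 + 24 = 72. #### 72"
score_right = gsm8k_exact_match_reward("So the answer is 72.", gold)
score_wrong = gsm8k_exact_match_reward("I think it is 70", gold)
```

A binary reward like this is easy to verify but gives no partial credit; format or length rewards can be added as separate registered functions and listed together in `reward_function_names`.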

Training the Model

Single Node Setup

(TBU)

1 Training Node, N Inference Nodes Setup

One node for training, multiple nodes for inference workers.

Setting Up Inference Workers

Step 1. Launch Ray Master

# On the Master Node
./exps/exp-gsm8k-qwen-2.5-0.5b-base-example/prep_01_start_ray_on_master.sh

Step 2. Attach Inference Workers to Ray Master

The inference worker nodes must connect to the Ray Master.

# On the Inference Worker Node
ray start --address="$RAY_MASTER_ADDRESS:$RAY_MASTER_PORT" --block

Step 3. Launch Inference Workers

# On the Master Node
./exps/exp-gsm8k-qwen-2.5-0.5b-base-example/prep_02_start_serve_on_master.sh

Step 4. Launch Training

The master node for training and the Ray master node must be the same node.

# At Master Node

# Configure Accelerate for Training
accelerate config

# Run Training
./exps/exp-gsm8k-qwen-2.5-0.5b-base-example/run_train.sh

Termination

# On the Master Node
# kill the training process
ps -ef | grep "[p]ython -m rlyx.train" | awk '{print $2}' | xargs kill -9

# stop the Ray serve
ray stop

N Training Nodes, K Inference Nodes Setup

(TBU)

@misc{kim2025rlyx,
      title={RLYX: A Hackable, Simple, and Research Friendly RL Framework with High-Speed Weight Synchronization}, 
      author={Sungju Kim},
      year={2025},
      url={}, 
}
