This project fine-tunes a Qwen2.5-0.5B-Instruct model on the PubMedQA dataset using a manually supervised GRPO (Group Relative Policy Optimization) pipeline, with a Qwen 7B model acting as the subagent.
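GRPO's core idea — scoring each sampled completion against its own sampling group instead of a learned value baseline — can be sketched in a few lines. This is a simplified illustration, not the project's actual implementation:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward within its group: A_i = (r_i - mean) / (std + eps).

    In GRPO, several completions are sampled per prompt; each completion's
    advantage is its reward relative to the group, so no value network is needed.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one PubMedQA question, scored 1.0 if correct
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # correct answers get positive advantage, wrong ones negative
```

These advantages then weight the policy-gradient update for each completion's tokens.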
**Option A: Run with Docker**

Step 1: Go into the project folder:

```bash
cd supervisor
```

Step 2: Build the Docker image:

```bash
docker build -t grpo-pubmedqa:latest .
```

Step 3: Run the container with GPU support:

```bash
docker run -it --gpus all \
  -v ${PWD}:/workspace \
  -e WANDB_API_KEY=your_key_here \
  -e WANDB_PROJECT=GRPO-Qwen-PubMedQA-Manual \
  grpo-pubmedqa:latest
```
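The `-e` flags above inject the W&B credentials into the container's environment. Inside the training process they can be read like this — a minimal sketch; the actual variable handling in `main.py` may differ:

```python
import os

# W&B picks these up automatically, but reading them explicitly lets the
# script warn early instead of failing mid-training.
api_key = os.environ.get("WANDB_API_KEY")  # set via -e WANDB_API_KEY=...
project = os.environ.get("WANDB_PROJECT", "GRPO-Qwen-PubMedQA-Manual")

if api_key is None:
    print("WANDB_API_KEY not set; W&B logging will be disabled.")
print(f"Logging to W&B project: {project}")
```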
(You can pass a model name to replace the subagent model.)
(If you want to change the model you train, change it in `main.py`.)

**Option B: Run locally**

Step 1: Install dependencies:

```bash
pip install -r requirements.txt
```

Step 2: Set up Weights & Biases (W&B) for experiment tracking:
```bash
export WANDB_API_KEY=your_key_here
export WANDB_PROJECT=GRPO-Qwen-PubMedQA-Manual
```

Step 3: Run the training script:

```bash
python main.py
```

Notes:

- Ensure you have a working GPU + CUDA setup.
- Weights & Biases is optional but recommended for tracking metrics and losses.
- You can modify the default model name or dataset path directly inside `main.py` if needed.
- The model and tokenizer will be automatically downloaded from Hugging Face on first run.
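One convenient way to change the model names without editing `main.py` each time is a small CLI flag. This is a hypothetical sketch — the flags `--model` and `--subagent-model` are illustrative and not part of the current script:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical CLI; defaults mirror the models mentioned in this README.
    p = argparse.ArgumentParser(description="GRPO PubMedQA training")
    p.add_argument("--model", default="Qwen/Qwen2.5-0.5B-Instruct",
                   help="Hugging Face ID of the model to fine-tune")
    p.add_argument("--subagent-model", default="Qwen/Qwen2.5-7B-Instruct",
                   help="Hugging Face ID of the subagent model")
    return p.parse_args(argv)

args = parse_args(["--subagent-model", "Qwen/Qwen2.5-7B-Instruct"])
print(args.model, args.subagent_model)
```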
On Windows, set the W&B environment variables with `setx` instead:

```cmd
setx WANDB_API_KEY "your_key_here"
setx WANDB_PROJECT "GRPO-Qwen-PubMedQA-Manual"
```

That’s it! 🎯
You’re ready to train and evaluate your GRPO-based PubMedQA supervisor model.