Can you help me with this problem?
Description
When running AgentRL training, AsyncSglangWorker encounters a CUDA error when calling NCCL dist.barrier(), causing the training process to crash.
Error Message
ERROR worker.py:430 -- Unhandled error: ray::AsyncSglangWorker.async_call_plugin()
File "/angel/hzt/MultiTaskAgent/AgentRL/trainer/src/agentrl/trainer/workers/abstract.py", line 72, in async_call_plugin
return await getattr(plugin, method)(*args, **kwargs)
File "/angel/hzt/MultiTaskAgent/AgentRL/trainer/src/agentrl/trainer/components/nccl_tensor_comm.py", line 90, in async_receive
await asyncio.to_thread(dist.barrier, group=self.pg)
...
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.27.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
Environment Information
System:
$ python --version
Python 3.10.19
$ nvidia-smi --query-gpu=name,driver_version,count --format=csv,noheader
NVIDIA H20, 570.133.20, 8

Python Packages:
$ conda activate agent-rl
$ pip list | grep -E "(torch|ray|sglang|flash-attn|transformers|accelerate|numpy)"
accelerate 1.11.0
flash_attn 2.8.3
numpy 2.2.6
ray 2.51.1
sglang 0.5.5
torch 2.8.0
transformers 4.57.1

CUDA/NCCL:
$ python -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'CUDA Version: {torch.version.cuda}'); print(f'NCCL Version: {torch.cuda.nccl.version()}')"
CUDA Available: True
CUDA Version: 12.8
NCCL Version: (2, 27, 3)

GPU Configuration:
- Total GPUs: 8 x NVIDIA H20
- Rollout GPUs: 3
- Rollout Stale GPUs: 1
- Actor GPUs: 4
- CUDA_HOME: /usr/local/cuda-12.8
Steps to Reproduce
- Use config file examples/training/configs/qw3b_ws.yaml
- Run training script:
  export WANDB_MODE=disabled
  export NCCL_DEBUG=INFO
  export CUDA_HOME='/usr/local/cuda-12.8'
  python ./examples/training/agentrl_trainer.py --config examples/training/configs/qw3b_ws.yaml
- Error occurs during initialization, specifically when establishing NCCL communication between FSDPWorker and AsyncSglangWorker
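For reference, a minimal way to check which GPU mapping each Ray worker actually sees before the NCCL groups are created (this diagnostic snippet is not part of AgentRL; the remote task below is only an illustrative sketch):

import os
import ray
import torch

@ray.remote(num_gpus=1)
def report_gpus():
    # Report the GPU ids Ray assigned to this worker and what CUDA sees inside it.
    return {
        "ray_gpu_ids": ray.get_gpu_ids(),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_device_count": torch.cuda.device_count(),
        "torch_current_device": torch.cuda.current_device(),
    }

if __name__ == "__main__":
    ray.init()
    print(ray.get([report_gpus.remote() for _ in range(4)]))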
Relevant Logs
Initialization logs show:
- Multiple AsyncSglangWorker processes started successfully
- FSDPWorker loaded model successfully
- Error occurred during NCCL process group initialization
Key warning messages:
[W1116 21:48:34.292995660 ProcessGroupNCCL.cpp:5023] [PG ID 3 PG GUID nccl_comm_172.25.116.138_58583 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
/angel/hzt/miniconda3/envs/agent-rl/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier`. Using the current device set by the user.
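The warning points at the likely trigger: the NCCL group is created without an explicit device, so the process falls back to GPU 0. A minimal sketch of what the warning suggests, assuming the usual env:// rendezvous and a LOCAL_RANK environment variable (neither of which is necessarily how AgentRL sets up its groups):

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # assumed env var
device = torch.device(f"cuda:{local_rank}")

torch.cuda.set_device(device)       # make the current device explicit
dist.init_process_group(
    backend="nccl",
    device_id=device,               # binds the NCCL communicator to this GPU (PyTorch >= 2.3)
)
dist.barrier(device_ids=[local_rank])   # barrier pinned to the same device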
Related Code
The issue occurs in the NCCLTensorReceiver.async_receive() method in nccl_tensor_comm.py:
async def async_receive(self):
    await asyncio.to_thread(dist.barrier, group=self.pg)  # <- Error occurs here
    done = False
    await self.worker.async_acquire_writer_lock()
    task = asyncio.sleep(0)
    while not done:
        buffer = await asyncio.to_thread(receive_buffer, self.pg)
        await task
        task = asyncio.create_task(self.worker.update_params(buffer))
        lst: list[bool | None] = [None]
        await asyncio.to_thread(broadcast_object_list, lst, group_src=0, group=self.pg)
        done: bool = lst[0]
    await task
    await self.worker.async_release_writer_lock()
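If it helps narrow things down: asyncio.to_thread runs dist.barrier in a fresh worker thread, and the current CUDA device is thread-local, so that thread may not be bound to the GPU this rank owns. A hypothetical device-aware variant of the failing call (self.device and _barrier_on_device are mine, not part of nccl_tensor_comm.py):

import asyncio
import torch
import torch.distributed as dist

def _barrier_on_device(pg, device: torch.device) -> None:
    # Bind this worker thread to the intended GPU before issuing the NCCL collective.
    torch.cuda.set_device(device)
    dist.barrier(group=pg, device_ids=[device.index])

async def async_receive(self):
    await asyncio.to_thread(_barrier_on_device, self.pg, self.device)
    ...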