Can you help me with this problem?
Description
When running AgentRL training, AsyncSglangWorker encounters a CUDA error when calling NCCL dist.barrier(), causing the training process to crash.
Error Message
ERROR worker.py:430 -- Unhandled error: ray::AsyncSglangWorker.async_call_plugin()
File "/angel/hzt/MultiTaskAgent/AgentRL/trainer/src/agentrl/trainer/workers/abstract.py", line 72, in async_call_plugin
return await getattr(plugin, method)(*args, **kwargs)
File "/angel/hzt/MultiTaskAgent/AgentRL/trainer/src/agentrl/trainer/components/nccl_tensor_comm.py", line 90, in async_receive
await asyncio.to_thread(dist.barrier, group=self.pg)
...
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.27.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'
Environment Information
System:
$ python --version
Python 3.10.19
$ nvidia-smi --query-gpu=name,driver_version,count --format=csv,noheader
NVIDIA H20, 570.133.20, 8

Python Packages:
$ conda activate agent-rl
$ pip list | grep -E "(torch|ray|sglang|flash-attn|transformers|accelerate|numpy)"
accelerate 1.11.0
flash_attn 2.8.3
numpy 2.2.6
ray 2.51.1
sglang 0.5.5
torch 2.8.0
transformers 4.57.1

CUDA/NCCL:
$ python -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'CUDA Version: {torch.version.cuda}'); print(f'NCCL Version: {torch.cuda.nccl.version()}')"
CUDA Available: True
CUDA Version: 12.8
NCCL Version: (2, 27, 3)

GPU Configuration:
- Total GPUs: 8 x NVIDIA H20
- Rollout GPUs: 3
- Rollout Stale GPUs: 1
- Actor GPUs: 4
- CUDA_HOME: /usr/local/cuda-12.8
Steps to Reproduce
- Use config file examples/training/configs/qw3b_ws.yaml
- Run training script:
  export WANDB_MODE=disabled
  export NCCL_DEBUG=INFO
  export CUDA_HOME='/usr/local/cuda-12.8'
  python ./examples/training/agentrl_trainer.py --config examples/training/configs/qw3b_ws.yaml
- Error occurs during initialization, specifically when establishing NCCL communication between FSDPWorker and AsyncSglangWorker
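For reference, a minimal way to check which GPU mapping each Ray worker actually sees before the NCCL groups are created (this diagnostic snippet is not part of AgentRL; the remote task below is only an illustrative sketch):

import os
import ray
import torch

@ray.remote(num_gpus=1)
def report_gpus():
    # Report the GPU ids Ray assigned to this worker and what CUDA sees inside it.
    return {
        "ray_gpu_ids": ray.get_gpu_ids(),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_device_count": torch.cuda.device_count(),
        "torch_current_device": torch.cuda.current_device(),
    }

if __name__ == "__main__":
    ray.init()
    print(ray.get([report_gpus.remote() for _ in range(4)]))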
Relevant Logs
Initialization logs show:
- Multiple AsyncSglangWorker processes started successfully
- FSDPWorker loaded model successfully
- Error occurred during NCCL process group initialization
Key warning messages:
[W1116 21:48:34.292995660 ProcessGroupNCCL.cpp:5023] [PG ID 3 PG GUID nccl_comm_172.25.116.138_58583 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
/angel/hzt/miniconda3/envs/agent-rl/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier`. Using the current device set by the user.
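The warning points at the likely trigger: the NCCL group is created without an explicit device, so the process falls back to GPU 0. A minimal sketch of what the warning suggests, assuming the usual env:// rendezvous and a LOCAL_RANK environment variable (neither of which is necessarily how AgentRL sets up its groups):

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # assumed env var
device = torch.device(f"cuda:{local_rank}")

torch.cuda.set_device(device)       # make the current device explicit
dist.init_process_group(
    backend="nccl",
    device_id=device,               # binds the NCCL communicator to this GPU (PyTorch >= 2.3)
)
dist.barrier(device_ids=[local_rank])   # barrier pinned to the same device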
Related Code
The issue occurs in the NCCLTensorReceiver.async_receive() method in nccl_tensor_comm.py:
async def async_receive(self):
    await asyncio.to_thread(dist.barrier, group=self.pg)  # <- Error occurs here
    done = False
    await self.worker.async_acquire_writer_lock()
    task = asyncio.sleep(0)
    while not done:
        buffer = await asyncio.to_thread(receive_buffer, self.pg)
        await task
        task = asyncio.create_task(self.worker.update_params(buffer))
        lst: list[bool | None] = [None]
        await asyncio.to_thread(broadcast_object_list, lst, group_src=0, group=self.pg)
        done: bool = lst[0]
    await task
    await self.worker.async_release_writer_lock()
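If it helps narrow things down: asyncio.to_thread runs dist.barrier in a fresh worker thread, and the current CUDA device is thread-local, so that thread may not be bound to the GPU this rank owns. A hypothetical device-aware variant of the failing call (self.device and _barrier_on_device are mine, not part of nccl_tensor_comm.py):

import asyncio
import torch
import torch.distributed as dist

def _barrier_on_device(pg, device: torch.device) -> None:
    # Bind this worker thread to the intended GPU before issuing the NCCL collective.
    torch.cuda.set_device(device)
    dist.barrier(group=pg, device_ids=[device.index])

async def async_receive(self):
    await asyncio.to_thread(_barrier_on_device, self.pg, self.device)
    ...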