
NCCL Error: "Cuda failure 'invalid argument'" in NCCLTensorReceiver.async_receive() #7

@Gikiman


Can you help me with this problem?

Description

When running AgentRL training, AsyncSglangWorker hits a CUDA error as soon as it calls dist.barrier() on its NCCL process group, which crashes the training process.

Error Message

ERROR worker.py:430 -- Unhandled error: ray::AsyncSglangWorker.async_call_plugin()
  File "/angel/hzt/MultiTaskAgent/AgentRL/trainer/src/agentrl/trainer/workers/abstract.py", line 72, in async_call_plugin
    return await getattr(plugin, method)(*args, **kwargs)
  File "/angel/hzt/MultiTaskAgent/AgentRL/trainer/src/agentrl/trainer/components/nccl_tensor_comm.py", line 90, in async_receive
    await asyncio.to_thread(dist.barrier, group=self.pg)
  ...
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.27.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'invalid argument'

Environment Information

System:

$ python --version
Python 3.10.19

$ nvidia-smi --query-gpu=name,driver_version,count --format=csv,noheader
NVIDIA H20, 570.133.20, 8

Python Packages:

$ conda activate agent-rl
$ pip list | grep -E "(torch|ray|sglang|flash-attn|transformers|accelerate|numpy)"
accelerate                1.11.0
flash_attn                2.8.3
numpy                     2.2.6
ray                       2.51.1
sglang                    0.5.5
torch                     2.8.0
transformers              4.57.1

CUDA/NCCL:

$ python -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'CUDA Version: {torch.version.cuda}'); print(f'NCCL Version: {torch.cuda.nccl.version()}')"
CUDA Available: True
CUDA Version: 12.8
NCCL Version: (2, 27, 3)

GPU Configuration:

  • Total GPUs: 8 x NVIDIA H20
  • Rollout GPUs: 3
  • Rollout Stale GPUs: 1
  • Actor GPUs: 4
  • CUDA_HOME: /usr/local/cuda-12.8

Steps to Reproduce

  1. Use config file examples/training/configs/qw3b_ws.yaml
  2. Run training script:
    export WANDB_MODE=disabled 
    export NCCL_DEBUG=INFO
    export CUDA_HOME='/usr/local/cuda-12.8'
    python ./examples/training/agentrl_trainer.py --config examples/training/configs/qw3b_ws.yaml
  3. Error occurs during initialization, specifically when establishing NCCL communication between FSDPWorker and AsyncSglangWorker

Relevant Logs

Initialization logs show:

  • Multiple AsyncSglangWorker processes started successfully
  • FSDPWorker loaded model successfully
  • Error occurred during NCCL process group initialization

Key warning messages:

[W1116 21:48:34.292995660 ProcessGroupNCCL.cpp:5023] [PG ID 3 PG GUID nccl_comm_172.25.116.138_58583 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
/angel/hzt/miniconda3/envs/agent-rl/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier`. Using the current device set by the user.
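
Both warnings point at the same thing: this process has not told ProcessGroupNCCL which GPU it owns, so the NCCL call falls back to "GPU 0 / current device". For reference only, here is a generic PyTorch sketch (not AgentRL code; local_rank is a placeholder for however AgentRL assigns GPUs to workers) of making that mapping explicit, which is the option the first warning refers to:

import os
import torch
import torch.distributed as dist

# Generic sketch of binding a rank to its GPU before any NCCL collective.
# Assumes the usual RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT env vars are set
# by the launcher; LOCAL_RANK here is a placeholder, not an AgentRL setting.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),  # the device_id the warning mentions
)
dist.barrier()  # now runs against the bound device instead of an unknown one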

Related Code

The issue occurs in the NCCLTensorReceiver.async_receive() method in nccl_tensor_comm.py:

# Imports this excerpt relies on (not shown in the snippet):
import asyncio
import torch.distributed as dist
from torch.distributed import broadcast_object_list

async def async_receive(self):
    # Barrier on the NCCL process group shared with the sender (FSDPWorker).
    await asyncio.to_thread(dist.barrier, group=self.pg)  # <- Error occurs here
    done = False
    await self.worker.async_acquire_writer_lock()
    task = asyncio.sleep(0)
    while not done:
        # receive_buffer is a helper defined elsewhere in nccl_tensor_comm.py
        buffer = await asyncio.to_thread(receive_buffer, self.pg)
        await task
        task = asyncio.create_task(self.worker.update_params(buffer))

        # Rank 0 broadcasts whether more parameter chunks are coming.
        lst: list[bool | None] = [None]
        await asyncio.to_thread(broadcast_object_list, lst, group_src=0, group=self.pg)
        done: bool = lst[0]
    await task
    await self.worker.async_release_writer_lock()
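
One detail worth noting, since the barrier is dispatched via asyncio.to_thread: the CUDA "current device" is per-thread, so a device selected with torch.cuda.set_device() in the main thread is not automatically in effect inside the worker thread that actually runs dist.barrier(). As a sketch only (device_ids is a standard dist.barrier argument for the NCCL backend; whether this resolves the crash here is unverified), the first line of async_receive could pin the device explicitly:

# Sketch only: capture the device in the calling thread and pass it explicitly,
# so the barrier does not depend on the worker thread's thread-local device.
device = torch.cuda.current_device()
await asyncio.to_thread(dist.barrier, group=self.pg, device_ids=[device])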
