Pointops error #7

@Tengau

Description

Hi, while running your code I hit the error below. The call to pointops' `safe_interpolation` appears to fail with out-of-bounds indices, and I am not sure how to fix it. Has anyone else run into this?
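One note on reading the trace: it ends inside a `print()` call rather than the interpolation itself, because device-side asserts surface at a later synchronization point. A minimal sketch for localizing the failing op (plain environment config; it must take effect before the first CUDA call, or be exported in the shell instead):

```python
import os

# CUDA kernel launches are asynchronous by default, so a device-side assert
# is often reported at a later, unrelated sync point (here: printing a tensor).
# Forcing synchronous launches makes the Python traceback point at the kernel
# that actually failed. Set this before any CUDA work happens.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```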

 

2025-03-31 04:10:43,006 - INFO - Infer 1/3827
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [116,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [117,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

...

Interpolation failed: CUDA error: device-side assert triggered
Shapes: p_from=torch.Size([2048, 3]), p_to=torch.Size([8, 3]), x=torch.Size([2048, 250])
Traceback (most recent call last):
File "/home/user/app/test/SymPoint/svgnet/model/basic_operators.py", line 32, in get_subscene_features
new_feat = pointops.safe_interpolation(p_from, p_to, x, o_from, o_to, k=kr)
File "/home/user/app/test/SymPoint/modules/pointops/functions/pointops.py", line 335, in safe_interpolation
new_feat += feat[idx[:, i].long(), :] * weight[:, i].unsqueeze(-1)
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "tools/test.py", line 102, in <module>
main()
File "tools/test.py", line 72, in main
res = model(batch,return_loss=False)
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/app/test/SymPoint/svgnet/model/svgnet.py", line 31, in forward
return self._forward(coords,feats,offsets,semantic_labels,lengths,return_loss=return_loss)
File "/home/user/app/test/SymPoint/svgnet/util/utils.py", line 204, in wrapper
return func(*new_args, **new_kwargs)
File "/home/user/app/test/SymPoint/svgnet/model/svgnet.py", line 84, in _forward
outputs = self.decoder(stage_list)
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/app/test/SymPoint/svgnet/model/decoder.py", line 139, in forward
output_class, outputs_mask,attn_mask = self.mask_module(queries,
File "/home/user/app/test/SymPoint/svgnet/model/decoder.py", line 210, in mask_module
attn_mask = get_subscene_features("up", step, stage_list, attn_mask, torch.tensor([4, 4, 4, 4]))
File "/home/user/app/test/SymPoint/svgnet/model/basic_operators.py", line 36, in get_subscene_features
print(f"Offsets: o_from={o_from}, o_to={o_to}")
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor.py", line 572, in __format__
return object.__format__(self, format_spec)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor.py", line 249, in __repr__
return torch._tensor_str._str(self)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor_str.py", line 415, in _str
return _str_intern(self)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor_str.py", line 390, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor_str.py", line 251, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor_str.py", line 86, in __init__
value_str = '{}'.format(value)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor.py", line 571, in __format__
return self.item().__format__(format_spec)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa81b185d62 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c5f3 (0x7fa86b9755f3 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fa86b976002 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fa81b16f314 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x29a879 (0x7fa9162dc879 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae0751 (0x7fa916b22751 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7fa916b22a52 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: /usr/bin/python3.8() [0x5ec66b]
frame #8: /usr/bin/python3.8() [0x5ae4ca]
frame #9: /usr/bin/python3.8() [0x5ae4ca]
frame #10: /usr/bin/python3.8() [0x5ecb30]
frame #11: /usr/bin/python3.8() [0x543e48]
frame #12: /usr/bin/python3.8() [0x543e9a]
frame #13: /usr/bin/python3.8() [0x543e9a]
frame #14: /usr/bin/python3.8() [0x543e9a]
frame #15: PyDict_SetItemString + 0x536 (0x5cdf86 in /usr/bin/python3.8)
frame #16: PyImport_Cleanup + 0x79 (0x6844d9 in /usr/bin/python3.8)
frame #17: Py_FinalizeEx + 0x7f (0x67f76f in /usr/bin/python3.8)
frame #18: Py_RunMain + 0x32d (0x6b6e5d in /usr/bin/python3.8)
frame #19: Py_BytesMain + 0x2d (0x6b70cd in /usr/bin/python3.8)
frame #20: __libc_start_main + 0xf3 (0x7fa91c29c083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: _start + 0x2e (0x5fac3e in /usr/bin/python3.8)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 5089) of binary: /usr/bin/python3.8
Traceback (most recent call last):
File "/home/user/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/user/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/user/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/user/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/test.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-03-31_04:10:46
host : 4fcab01c4b38
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 5089)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 5089
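In case it helps: the assert fires on the gather at pointops.py line 335, so the k-NN index tensor must contain values outside `feat`'s first dimension. A quick bounds check I sketched to locate the bad entries (`check_knn_indices` is a hypothetical helper, not part of pointops; NumPy arrays stand in for CPU copies of the tensors):

```python
import numpy as np

def check_knn_indices(idx, n_points):
    """Return the (row, k) positions in a k-NN index array that fall
    outside [0, n_points), i.e. would trigger the CUDA index assert.

    idx: (n, k) integer array of neighbor indices into a point set
    n_points: number of rows in the feature tensor being indexed
    """
    bad = (idx < 0) | (idx >= n_points)
    return np.argwhere(bad)

# Toy reproduction of the failure mode: 8 query points, k=3 neighbors,
# one index pointing past the end of a 2048-row feature tensor.
idx = np.zeros((8, 3), dtype=np.int64)
idx[5, 2] = 2048  # out of bounds for feat of shape (2048, 250)
print(check_knn_indices(idx, 2048))  # -> [[5 2]]
```

Running this kind of check on CPU copies of `idx` and `feat` right before the failing line should confirm whether the interpolation's neighbor search is producing invalid indices for the very small `p_to` (8 points) case.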
