Description
Hi, while running your code, I ran into the error below. The call to pointops' safe_interpolation appears to index out of bounds, and I am not sure how to fix it. Has anyone else run into an issue like this?
2025-03-31 04:10:43,006 - INFO - Infer 1/3827
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [116,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [117,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
Interpolation failed: CUDA error: device-side assert triggered
Shapes: p_from=torch.Size([2048, 3]), p_to=torch.Size([8, 3]), x=torch.Size([2048, 250])
Traceback (most recent call last):
File "/home/user/app/test/SymPoint/svgnet/model/basic_operators.py", line 32, in get_subscene_features
new_feat = pointops.safe_interpolation(p_from, p_to, x, o_from, o_to, k=kr)
File "/home/user/app/test/SymPoint/modules/pointops/functions/pointops.py", line 335, in safe_interpolation
new_feat += feat[idx[:, i].long(), :] * weight[:, i].unsqueeze(-1)
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "tools/test.py", line 102, in <module>
main()
File "tools/test.py", line 72, in main
res = model(batch,return_loss=False)
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/app/test/SymPoint/svgnet/model/svgnet.py", line 31, in forward
return self._forward(coords,feats,offsets,semantic_labels,lengths,return_loss=return_loss)
File "/home/user/app/test/SymPoint/svgnet/util/utils.py", line 204, in wrapper
return func(*new_args, **new_kwargs)
File "/home/user/app/test/SymPoint/svgnet/model/svgnet.py", line 84, in _forward
outputs = self.decoder(stage_list)
File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/app/test/SymPoint/svgnet/model/decoder.py", line 139, in forward
output_class, outputs_mask,attn_mask = self.mask_module(queries,
File "/home/user/app/test/SymPoint/svgnet/model/decoder.py", line 210, in mask_module
attn_mask = get_subscene_features("up", step, stage_list, attn_mask, torch.tensor([4, 4, 4, 4]))
File "/home/user/app/test/SymPoint/svgnet/model/basic_operators.py", line 36, in get_subscene_features
print(f"Offsets: o_from={o_from}, o_to={o_to}")
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor.py", line 572, in __format__
return object.__format__(self, format_spec)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor.py", line 249, in __repr__
return torch._tensor_str._str(self)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor_str.py", line 415, in _str
return _str_intern(self)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor_str.py", line 390, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor_str.py", line 251, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor_str.py", line 86, in __init__
value_str = '{}'.format(value)
File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor.py", line 571, in __format__
return self.item().__format__(format_spec)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa81b185d62 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c5f3 (0x7fa86b9755f3 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fa86b976002 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fa81b16f314 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x29a879 (0x7fa9162dc879 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xae0751 (0x7fa916b22751 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7fa916b22a52 in /home/user/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: /usr/bin/python3.8() [0x5ec66b]
frame #8: /usr/bin/python3.8() [0x5ae4ca]
frame #9: /usr/bin/python3.8() [0x5ae4ca]
frame #10: /usr/bin/python3.8() [0x5ecb30]
frame #11: /usr/bin/python3.8() [0x543e48]
frame #12: /usr/bin/python3.8() [0x543e9a]
frame #13: /usr/bin/python3.8() [0x543e9a]
frame #14: /usr/bin/python3.8() [0x543e9a]
frame #15: PyDict_SetItemString + 0x536 (0x5cdf86 in /usr/bin/python3.8)
frame #16: PyImport_Cleanup + 0x79 (0x6844d9 in /usr/bin/python3.8)
frame #17: Py_FinalizeEx + 0x7f (0x67f76f in /usr/bin/python3.8)
frame #18: Py_RunMain + 0x32d (0x6b6e5d in /usr/bin/python3.8)
frame #19: Py_BytesMain + 0x2d (0x6b70cd in /usr/bin/python3.8)
frame #20: __libc_start_main + 0xf3 (0x7fa91c29c083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: _start + 0x2e (0x5fac3e in /usr/bin/python3.8)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 5089) of binary: /usr/bin/python3.8
Traceback (most recent call last):
File "/home/user/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/user/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/user/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/user/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/test.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-03-31_04:10:46
host : 4fcab01c4b38
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 5089)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 5089
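
A note for anyone trying to narrow this down: once the device-side assert fires, every later CUDA call in the same process fails with the same error, which is why the print of the offsets at line 36 of basic_operators.py raised as well. The offsets therefore have to be inspected before the interpolation call. Below is a minimal sketch of such a pre-check. It assumes the offsets follow a cumulative-count convention (each entry is the running total of points up to and including that batch element), which I have not confirmed against the SymPoint source. check_offsets is a hypothetical helper, not part of pointops, and the numbers in the usage lines are made up for illustration.

```python
from typing import List, Sequence


def check_offsets(n_points: int, offsets: Sequence[int], k: int) -> List[str]:
    """Return a list of human-readable problems found in an offset list.

    Assumption (not confirmed from the source): offsets[i] is the cumulative
    number of points up to and including batch element i, so the last entry
    should equal n_points.
    """
    if len(offsets) == 0:
        return ["offset list is empty"]

    problems = []
    if offsets[-1] != n_points:
        problems.append(
            f"last offset {offsets[-1]} != number of points {n_points}"
        )

    start = 0
    for i, end in enumerate(offsets):
        size = end - start
        if size <= 0:
            problems.append(f"segment {i} is empty or offsets are not increasing")
        elif size < k:
            # A k-nearest-neighbour query cannot find k distinct neighbours here.
            problems.append(f"segment {i} has {size} points, fewer than k={k}")
        start = end
    return problems


# Hypothetical values, for illustration only.
print(check_offsets(2048, [1024, 2048], k=4))  # prints []
print(check_offsets(8, [2, 8], k=4))           # reports one undersized segment
```

With real tensors, the inputs would be something like p_from.shape[0] and o_from.tolist(), read before safe_interpolation is called. Running with the environment variable CUDA_LAUNCH_BLOCKING=1 may also help the reported Python line match the kernel that actually failed.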