
BitBLAS's compilation seems to conflict with vLLM's torch.compile integration? #315

@xxxxyu

Description

Running the GPTQ model https://huggingface.co/JunHowie/Qwen3-8B-GPTQ-Int4 with the latest vLLM and quantization="gptq_bitblas" fails.

Got:

(EngineCore_DP0 pid=506800) torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
(EngineCore_DP0 pid=506800)   Explanation: Dynamo does not know how to trace the builtin `<unknown module>._SimpleCData.__new__.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
... (omitted)
(APIServer pid=506583) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Complete log (after tuning):
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.58it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.54it/s]
(EngineCore_DP0 pid=506800) 
(EngineCore_DP0 pid=506800) INFO 10-14 21:31:21 [default_loader.py:314] Loading weights took 1.41 seconds
(EngineCore_DP0 pid=506800) INFO 10-14 21:31:50 [gpu_model_runner.py:2910] Model loading took 5.6824 GiB and 489.310448 seconds
(EngineCore_DP0 pid=506800) /home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1481: UserWarning: Dynamo does not know how to trace the builtin `<unknown module>._SimpleCData.__new__.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
(EngineCore_DP0 pid=506800) If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
(EngineCore_DP0 pid=506800) If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
(EngineCore_DP0 pid=506800)   torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792] EngineCore failed to start.
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792] Traceback (most recent call last):
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core.py", line 783, in run_engine_core
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core.py", line 555, in __init__
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     super().__init__(
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core.py", line 223, in _initialize_kv_caches
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/executor/abstract.py", line 88, in determine_available_memory
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/executor/uniproc_executor.py", line 74, in collective_rpc
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/utils/__init__.py", line 2977, in run_method
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     return func(*args, **kwargs)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     return func(*args, **kwargs)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/worker/gpu_worker.py", line 280, in determine_available_memory
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     self.model_runner.profile_run()
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/worker/gpu_model_runner.py", line 3722, in profile_run
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]                                         ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     return func(*args, **kwargs)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/worker/gpu_model_runner.py", line 3455, in _dummy_run
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     outputs = self.model(
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]               ^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/models/qwen3.py", line 321, in forward
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     hidden_states = self.model(
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]                     ^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/compilation/decorators.py", line 407, in __call__
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     output = self.compiled_callable(*args, **kwargs)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 745, in compile_wrapper
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     raise e.with_traceback(None) from e.__cause__  # User compiler error
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792] torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   Explanation: Dynamo does not know how to trace the builtin `<unknown module>._SimpleCData.__new__.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   Hint: If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   Hint: If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792] 
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   Developer debug context: module: <unknown module>, qualname: _SimpleCData.__new__, skip reason: <missing reason>
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792] 
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792] 
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792] from user code:
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]    File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/models/qwen2.py", line 385, in forward
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/models/qwen3.py", line 225, in forward
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     hidden_states = self.self_attn(
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/models/qwen3.py", line 144, in forward
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     qkv, _ = self.qkv_proj(hidden_states)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/layers/linear.py", line 582, in forward
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/layers/quantization/gptq_bitblas.py", line 479, in apply
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     out = self.kernel.apply_gptq_bitblas_linear(layer, x)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/layers/quantization/kernels/mixed_precision/bitblas.py", line 315, in apply_gptq_bitblas_linear
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     output = self.bitblas_matmul(*args)  # type: ignore[operator]
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/bitblas/ops/general_matmul/__init__.py", line 756, in __call__
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     return self.forward(*args, **kwds)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/bitblas/ops/general_matmul/__init__.py", line 751, in forward
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     self._forward_from_prebuild_lib(*args, stream=stream.cuda_stream)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/bitblas/ops/operator.py", line 459, in _forward_from_prebuild_lib
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     ctypes.c_void_p(arr.data_ptr()) if not isinstance(arr, int) else arr for arr in args
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/polyfills/__init__.py", line 204, in instantiate_user_defined_class_object
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792]     obj = cls.__new__(cls, *args, **kwargs)
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792] 
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore_DP0 pid=506800) ERROR 10-14 21:31:51 [core.py:792] 
(EngineCore_DP0 pid=506800) Process EngineCore_DP0:
(EngineCore_DP0 pid=506800) Traceback (most recent call last):
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=506800)     self.run()
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=506800)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core.py", line 796, in run_engine_core
(EngineCore_DP0 pid=506800)     raise e
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core.py", line 783, in run_engine_core
(EngineCore_DP0 pid=506800)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=506800)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core.py", line 555, in __init__
(EngineCore_DP0 pid=506800)     super().__init__(
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore_DP0 pid=506800)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=506800)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core.py", line 223, in _initialize_kv_caches
(EngineCore_DP0 pid=506800)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=506800)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/executor/abstract.py", line 88, in determine_available_memory
(EngineCore_DP0 pid=506800)     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=506800)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/executor/uniproc_executor.py", line 74, in collective_rpc
(EngineCore_DP0 pid=506800)     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=506800)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/utils/__init__.py", line 2977, in run_method
(EngineCore_DP0 pid=506800)     return func(*args, **kwargs)
(EngineCore_DP0 pid=506800)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=506800)     return func(*args, **kwargs)
(EngineCore_DP0 pid=506800)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/worker/gpu_worker.py", line 280, in determine_available_memory
(EngineCore_DP0 pid=506800)     self.model_runner.profile_run()
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/worker/gpu_model_runner.py", line 3722, in profile_run
(EngineCore_DP0 pid=506800)     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=506800)                                         ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=506800)     return func(*args, **kwargs)
(EngineCore_DP0 pid=506800)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/worker/gpu_model_runner.py", line 3455, in _dummy_run
(EngineCore_DP0 pid=506800)     outputs = self.model(
(EngineCore_DP0 pid=506800)               ^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=506800)     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=506800)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=506800)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=506800)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=506800)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=506800)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/models/qwen3.py", line 321, in forward
(EngineCore_DP0 pid=506800)     hidden_states = self.model(
(EngineCore_DP0 pid=506800)                     ^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/compilation/decorators.py", line 407, in __call__
(EngineCore_DP0 pid=506800)     output = self.compiled_callable(*args, **kwargs)
(EngineCore_DP0 pid=506800)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 745, in compile_wrapper
(EngineCore_DP0 pid=506800)     raise e.with_traceback(None) from e.__cause__  # User compiler error
(EngineCore_DP0 pid=506800)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=506800) torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
(EngineCore_DP0 pid=506800)   Explanation: Dynamo does not know how to trace the builtin `<unknown module>._SimpleCData.__new__.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
(EngineCore_DP0 pid=506800)   Hint: If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
(EngineCore_DP0 pid=506800)   Hint: If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
(EngineCore_DP0 pid=506800) 
(EngineCore_DP0 pid=506800)   Developer debug context: module: <unknown module>, qualname: _SimpleCData.__new__, skip reason: <missing reason>
(EngineCore_DP0 pid=506800) 
(EngineCore_DP0 pid=506800) 
(EngineCore_DP0 pid=506800) from user code:
(EngineCore_DP0 pid=506800)    File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/models/qwen2.py", line 385, in forward
(EngineCore_DP0 pid=506800)     hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/models/qwen3.py", line 225, in forward
(EngineCore_DP0 pid=506800)     hidden_states = self.self_attn(
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/models/qwen3.py", line 144, in forward
(EngineCore_DP0 pid=506800)     qkv, _ = self.qkv_proj(hidden_states)
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/layers/linear.py", line 582, in forward
(EngineCore_DP0 pid=506800)     output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/layers/quantization/gptq_bitblas.py", line 479, in apply
(EngineCore_DP0 pid=506800)     out = self.kernel.apply_gptq_bitblas_linear(layer, x)
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/model_executor/layers/quantization/kernels/mixed_precision/bitblas.py", line 315, in apply_gptq_bitblas_linear
(EngineCore_DP0 pid=506800)     output = self.bitblas_matmul(*args)  # type: ignore[operator]
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/bitblas/ops/general_matmul/__init__.py", line 756, in __call__
(EngineCore_DP0 pid=506800)     return self.forward(*args, **kwds)
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/bitblas/ops/general_matmul/__init__.py", line 751, in forward
(EngineCore_DP0 pid=506800)     self._forward_from_prebuild_lib(*args, stream=stream.cuda_stream)
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/bitblas/ops/operator.py", line 459, in _forward_from_prebuild_lib
(EngineCore_DP0 pid=506800)     ctypes.c_void_p(arr.data_ptr()) if not isinstance(arr, int) else arr for arr in args
(EngineCore_DP0 pid=506800)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/polyfills/__init__.py", line 204, in instantiate_user_defined_class_object
(EngineCore_DP0 pid=506800)     obj = cls.__new__(cls, *args, **kwargs)
(EngineCore_DP0 pid=506800) 
(EngineCore_DP0 pid=506800) Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore_DP0 pid=506800) 
[rank0]:[W1014 21:31:51.724849712 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=506583) Traceback (most recent call last):
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=506583)     sys.exit(main())
(APIServer pid=506583)              ^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=506583)     args.dispatch_function(args)
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/entrypoints/cli/serve.py", line 62, in cmd
(APIServer pid=506583)     uvloop.run(run_server(args))
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=506583)     return __asyncio.run(
(APIServer pid=506583)            ^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=506583)     return runner.run(main)
(APIServer pid=506583)            ^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=506583)     return self._loop.run_until_complete(task)
(APIServer pid=506583)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=506583)     return await main
(APIServer pid=506583)            ^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/entrypoints/openai/api_server.py", line 1917, in run_server
(APIServer pid=506583)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/entrypoints/openai/api_server.py", line 1933, in run_server_worker
(APIServer pid=506583)     async with build_async_engine_client(
(APIServer pid=506583)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=506583)     return await anext(self.gen)
(APIServer pid=506583)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/entrypoints/openai/api_server.py", line 191, in build_async_engine_client
(APIServer pid=506583)     async with build_async_engine_client_from_engine_args(
(APIServer pid=506583)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=506583)     return await anext(self.gen)
(APIServer pid=506583)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/entrypoints/openai/api_server.py", line 238, in build_async_engine_client_from_engine_args
(APIServer pid=506583)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=506583)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/utils/__init__.py", line 1336, in inner
(APIServer pid=506583)     return fn(*args, **kwargs)
(APIServer pid=506583)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/async_llm.py", line 208, in from_vllm_config
(APIServer pid=506583)     return cls(
(APIServer pid=506583)            ^^^^
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/async_llm.py", line 130, in __init__
(APIServer pid=506583)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=506583)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core_client.py", line 121, in make_async_mp_client
(APIServer pid=506583)     return AsyncMPClient(*client_args)
(APIServer pid=506583)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core_client.py", line 807, in __init__
(APIServer pid=506583)     super().__init__(
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/core_client.py", line 468, in __init__
(APIServer pid=506583)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=506583)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=506583)   File "/home/lixiangyu/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=506583)     next(self.gen)
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/utils.py", line 816, in launch_core_engines
(APIServer pid=506583)     wait_for_engine_startup(
(APIServer pid=506583)   File "/home/lixiangyu/repos/xxxxyu/vllm/vllm/v1/engine/utils.py", line 873, in wait_for_engine_startup
(APIServer pid=506583)     raise RuntimeError(
(APIServer pid=506583) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

With enforce_eager=True there is no such error, so it looks like BitBLAS's compilation conflicts with vLLM's torch.compile integration?
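For context on why the error names `_SimpleCData.__new__`: the traceback ends in BitBLAS's `_forward_from_prebuild_lib`, which wraps each tensor's `data_ptr()` in a `ctypes.c_void_p` before calling into its prebuilt CUDA library. A minimal stdlib-only sketch of that construction (the pointer value here is made up, standing in for a real `torch.Tensor.data_ptr()`):

```python
import ctypes

# BitBLAS (operator.py, _forward_from_prebuild_lib) converts tensor arguments
# to raw pointers roughly like this before the foreign call:
fake_data_ptr = 0x7F00_0000_0000  # placeholder for torch.Tensor.data_ptr()
arg = ctypes.c_void_p(fake_data_ptr)

# c_void_p is built on ctypes._SimpleCData, whose C-level __new__ is exactly
# the "builtin Dynamo does not know how to trace" from the error message.
print(issubclass(ctypes.c_void_p, ctypes._SimpleCData))  # True
print(arg.value == fake_data_ptr)  # True
```

Since Dynamo cannot trace a C-level constructor, any compiled region that reaches this code path fails, which matches the observation that eager mode works.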

If so, what is the best practice for running vLLM + BitBLAS?
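For anyone hitting the same error in the meantime, the eager-mode workaround mentioned above maps to the `vllm serve` CLI like this (a sketch; `--enforce-eager` disables torch.compile and CUDA graphs at the cost of some throughput):

```shell
# Same model and quantization as above, but bypassing torch.compile entirely
vllm serve JunHowie/Qwen3-8B-GPTQ-Int4 \
  --quantization gptq_bitblas \
  --enforce-eager
```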
