Issue about using ipex on cpu #197
When `ipex` is set to `true` on CPU, the `**model_config.dict()` expansion passed to the model loader evaluates to `trust_remote_code=False, use_auth_token='', load_in_4bit=False, torch_dtype=torch.float16, revision=None`.
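In other words, the effective load call is roughly equivalent to the sketch below (the model id is illustrative; the actual call site is inside llm-on-ray's transformer predictor):

```python
import torch
from transformers import AutoModelForCausalLM

# Roughly what the deployment ends up calling, with the kwargs
# expanded from model_config.dict() (model id is illustrative):
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    trust_remote_code=False,
    use_auth_token="",
    load_in_4bit=False,
    torch_dtype=torch.float16,  # float16 on CPU is what ipex.llm.optimize rejects
    revision=None,
)
```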
But when `llm_on_ray-serve` is executed, a warning appears: `lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/optimize.py:948: UserWarning: fail to apply ipex.llm.optimize due to: Unsupported input type, fallback to the origin model`. And after sending a request, the server reports an error:
```
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) 2024-04-19 02:06:37,419 - llm_on_ray.inference.predictor_deployment - INFO - Handling dynamic batch (size=1) ...
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) ERROR 2024-04-19 02:06:37,427 llama-2-7b-chat-hf_PredictorDeployment c5fp5i4e 3e35b1cb-a52d-4ebc-aa5c-dce9703fc4b4 /llama-2-7b-chat-hf/llama-2-7b-chat-hf replica.py:352 - Request failed:
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) ray::ServeReplica:llama-2-7b-chat-hf:PredictorDeployment.handle_request_with_rejection() (pid=2799681, ip=10.0.11.2)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/utils.py", line 164, in wrap_to_ray_error
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) raise exception
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 1102, in call_user_method
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) await self._call_func_or_gen(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 828, in _call_func_or_gen
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) result = await result
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 403, in __call__
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return await self.handle_non_streaming(prompts, config)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 220, in handle_non_streaming
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return await self.handle_dynamic_batch((prompts, config))
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/batching.py", line 591, in batch_wrapper
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return await enqueue_request(args, kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/batching.py", line 243, in _assign_func_results
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) results = await func_future
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 249, in handle_dynamic_batch
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) batch_results = self.predictor.generate(prompts, **config)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/transformer_predictor.py", line 113, in generate
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) gen_tokens = self.model.generate(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return func(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/generation/utils.py", line 1719, in generate
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self.sample(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/generation/utils.py", line 2801, in sample
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) outputs = self(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/models.py", line 108, in LlamaForCausalLM_forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) outputs = self.model(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) layer_outputs = decoder_layer(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/modules/decoder.py", line 874, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return LlamaDecoderLayer_forward(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/modules/decoder.py", line 26, in LlamaDecoderLayer_forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) hidden_states = self.input_layernorm(hidden_states)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/cpu/fusions/mha_fusion.py", line 137, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return torch.ops.torch_ipex.rmsnorm(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/_ops.py", line 755, in __call__
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) return self._op(*args, **(kwargs or {}))
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) RuntimeError: Unsupported input type
```
If I remove the parameter `torch_dtype=torch.float16` from `model_config`, it works fine.
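That is consistent with the traceback: the fused CPU kernel (`torch.ops.torch_ipex.rmsnorm`) rejects float16 inputs, while float32 and bfloat16 are the dtypes IPEX's CPU path is built around. A minimal sketch of a working CPU setup, assuming bfloat16 is acceptable for this deployment:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

# Load in bfloat16 (or float32) instead of float16 for CPU inference.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # illustrative model id
    torch_dtype=torch.bfloat16,
)
model.eval()
# Apply IPEX's fused LLM optimizations; with bfloat16 this should not
# fall back to the origin model the way float16 does in this report.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True)
```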
**Conda env:**
- intel-extension-for-pytorch 2.2.0+cpu
- torch 2.2.2+cpu
- transformers 4.35.0

**Model:**
- Llama-2-7b-hf