This repository was archived by the owner on Sep 23, 2025. It is now read-only.

Issue about using ipex on cpu #197

@KepingYan

Description


When ipex is set to true on CPU, the value here (in model_config) will be:

trust_remote_code=False use_auth_token='' load_in_4bit=False torch_dtype=torch.float16 revision=None
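
In effect, the CPU path ends up doing something like the following minimal sketch (the checkpoint name and loading call are illustrative, not the exact llm-on-ray code; ipex.llm.optimize on CPU documents float32 and bfloat16, so float16 is presumably what it rejects):

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

# model_config is forwarded into from_pretrained, so the weights are
# loaded as float16 even though inference runs on CPU.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # illustrative checkpoint
    torch_dtype=torch.float16,
)
model.eval()

# This emits "fail to apply ipex.llm.optimize due to: Unsupported input
# type, fallback to the origin model", yet the traceback below shows the
# IPEX reference Llama modules still end up handling forward().
model = ipex.llm.optimize(model, dtype=torch.float16, inplace=True)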

When llm_on_ray-serve is then executed, the following warning appears:

lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/optimize.py:948: UserWarning: fail to apply ipex.llm.optimize due to: Unsupported input type, fallback to the origin model

After sending a request, the server reports this error:

(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) 2024-04-19 02:06:37,419 - llm_on_ray.inference.predictor_deployment - INFO - Handling dynamic batch (size=1) ...
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) ERROR 2024-04-19 02:06:37,427 llama-2-7b-chat-hf_PredictorDeployment c5fp5i4e 3e35b1cb-a52d-4ebc-aa5c-dce9703fc4b4 /llama-2-7b-chat-hf/llama-2-7b-chat-hf replica.py:352 - Request failed:
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) ray::ServeReplica:llama-2-7b-chat-hf:PredictorDeployment.handle_request_with_rejection() (pid=2799681, ip=10.0.11.2)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/utils.py", line 164, in wrap_to_ray_error
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     raise exception
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 1102, in call_user_method
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     await self._call_func_or_gen(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 828, in _call_func_or_gen
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     result = await result
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 403, in __call__
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return await self.handle_non_streaming(prompts, config)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 220, in handle_non_streaming
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return await self.handle_dynamic_batch((prompts, config))
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/batching.py", line 591, in batch_wrapper
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return await enqueue_request(args, kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/ray/serve/batching.py", line 243, in _assign_func_results
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     results = await func_future
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/predictor_deployment.py", line 249, in handle_dynamic_batch
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     batch_results = self.predictor.generate(prompts, **config)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/project/performance/llm-on-ray/llm_on_ray/inference/transformer_predictor.py", line 113, in generate
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     gen_tokens = self.model.generate(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return func(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/generation/utils.py", line 1719, in generate
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self.sample(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/generation/utils.py", line 2801, in sample
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     outputs = self(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/models.py", line 108, in LlamaForCausalLM_forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     outputs = self.model(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     layer_outputs = decoder_layer(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/modules/decoder.py", line 874, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return LlamaDecoderLayer_forward(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/modules/decoder.py", line 26, in LlamaDecoderLayer_forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     hidden_states = self.input_layernorm(hidden_states)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self._call_impl(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return forward_call(*args, **kwargs)
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/cpu/fusions/mha_fusion.py", line 137, in forward
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return torch.ops.torch_ipex.rmsnorm(
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)   File "/home/ykp/miniconda3/envs/llmonray_master_ipex/lib/python3.9/site-packages/torch/_ops.py", line 755, in __call__
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681)     return self._op(*args, **(kwargs or {}))
(ServeReplica:llama-2-7b-chat-hf:PredictorDeployment pid=2799681) RuntimeError: Unsupported input type

If I remove the parameter torch_dtype=torch.float16 from model_config, it works fine.
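
Note that the traceback bottoms out in torch.ops.torch_ipex.rmsnorm rejecting the float16 hidden states, which matches the earlier fallback warning. A hedged sketch of the workaround (checkpoint name illustrative): load the weights in the default float32 and request bfloat16 from ipex.llm.optimize, which is a documented CPU dtype:

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
# No torch_dtype=torch.float16 here; weights stay float32 on CPU.
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

# bfloat16 is a supported low-precision dtype for ipex.llm.optimize on CPU.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True)

inputs = tokenizer("Hello", return_tensors="pt")
with torch.inference_mode(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))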

conda env

intel-extension-for-pytorch 2.2.0+cpu
torch                       2.2.2+cpu
transformers                4.35.0

model
Llama-2-7b-hf
