Add ChatVLLM wrapper for proper chat template support #8
Description
The default LangChain VLLM wrapper uses vllm.LLM.generate(), which does not apply the model's chat template (see reference [1]). This causes problems for models that rely on chat templates (e.g., <|im_start|>, <|im_end|>, and <think> tags), leading to malformed inputs, truncated outputs, and biased judge evaluations.

This PR implements a ChatVLLM wrapper that uses vllm.LLM.chat() instead, which automatically applies the model's native chat template from the tokenizer_config.json stored in each model directory; a minimal sketch is shown below.
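A minimal sketch of the idea, assuming langchain_core and a vLLM version that provides LLM.chat(); the field and helper names here are illustrative, not necessarily the exact code added by this PR:

```python
# Sketch only: a LangChain chat model that routes generation through
# vllm.LLM.chat(), so the model's own chat template (from tokenizer_config.json)
# is applied before decoding.
from typing import Any, List, Optional

from langchain_core.callbacks import CallbackManagerForLLMRun
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, SystemMessage
from langchain_core.outputs import ChatGeneration, ChatResult
from vllm import SamplingParams


class ChatVLLM(BaseChatModel):
    """Chat wrapper around a vllm.LLM engine that uses LLM.chat()."""

    llm: Any                 # the underlying vllm.LLM instance
    max_tokens: int = 8192
    temperature: float = 0.0

    @property
    def _llm_type(self) -> str:
        return "chat-vllm"

    @staticmethod
    def _to_openai_messages(messages: List[BaseMessage]) -> List[dict]:
        # Convert LangChain messages into OpenAI-style role/content dicts;
        # vLLM renders these through the model's native chat template.
        role_map = {SystemMessage: "system", HumanMessage: "user", AIMessage: "assistant"}
        return [
            {"role": role_map.get(type(m), "user"), "content": m.content}
            for m in messages
        ]

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        params = SamplingParams(
            max_tokens=self.max_tokens,
            temperature=self.temperature,
            stop=stop,
        )
        # LLM.chat() applies the chat template automatically, unlike LLM.generate().
        outputs = self.llm.chat(self._to_openai_messages(messages), params)
        text = outputs[0].outputs[0].text
        return ChatResult(generations=[ChatGeneration(message=AIMessage(content=text))])
```

In make_model(), the VLLM provider can then return such a wrapper around the shared vllm.LLM engine (the constructor arguments above belong to the sketch, not necessarily the PR).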
The ChatVLLM wrapper converts LangChain prompts to OpenAI-style messages, then vLLM applies the chat template automatically.

Example Impact
Without Chat Template (LangChain default):
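An illustrative sketch of this case (the real judge prompt in this repo will differ): the rendered prompt reaches the model as one bare string, with no role markers or special tokens.

```python
# Hypothetical example: with the default wrapper, the prompt is passed straight
# to vllm.LLM.generate(), so the chat template is never applied.
raw_prompt = (
    "You are an impartial judge. Rate the answer below from 1 to 10.\n"
    "Answer: ..."
)
# llm.generate([raw_prompt], sampling_params)  # no <|im_start|>/<|im_end|> framing
```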
With Chat Template (ChatVLLM):
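An illustrative sketch of the same request through ChatVLLM, shown for a ChatML-style template; the exact tokens come from the model's tokenizer_config.json.

```python
# Hypothetical example: ChatVLLM builds OpenAI-style messages and lets
# vllm.LLM.chat() render them through the model's own template, producing e.g.:
#   <|im_start|>system
#   You are an impartial judge. Rate the answer below from 1 to 10.<|im_end|>
#   <|im_start|>user
#   Answer: ...<|im_end|>
#   <|im_start|>assistant
messages = [
    {"role": "system", "content": "You are an impartial judge. Rate the answer below from 1 to 10."},
    {"role": "user", "content": "Answer: ..."},
]
# llm.chat(messages, sampling_params)  # template applied automatically
```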
Changes
- New ChatVLLM class in utils.py: uses vllm.LLM.chat() instead of generate()
- Updated make_model() to use ChatVLLM for the VLLM provider
- Raised max_tokens from 200 to 8192 to prevent runaway generation

References
[1] https://docs.vllm.ai/en/latest/getting_started/quickstart/#offline-batched-inference:~:text=r%7D%22)-,Note,same%20format%20as%20those%20passed%20to%20OpenAI%27s%20client.chat.completions%3A,-Code
"The llm.generate method does not automatically apply the model's chat template to the input prompt. Therefore, if you are using an Instruct model or Chat model, you should manually apply the corresponding chat template to ensure the expected behavior. Alternatively, you can use the llm.chat method..."