Conversation

@ferreirafabio ferreirafabio (Contributor) commented Jan 8, 2026

Description

The default LangChain VLLM wrapper uses vllm.LLM.generate(), which does not apply the model's chat template (see reference [1]). This causes issues for models that rely on chat templates (e.g., with <|im_start|>, <|im_end|>, <think> tags), leading to malformed inputs, truncated outputs, and biased judge evaluations.

This PR implements a ChatVLLM wrapper that uses vllm.LLM.chat() instead, which automatically applies the model's native chat template (defined in the tokenizer_config.json stored in each model directory).

The ChatVLLM wrapper converts LangChain prompts to OpenAI-style messages; vLLM then applies the chat template automatically.

# BEFORE: LangChain VLLM default (in class VLLM(BaseLLM))
def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
    ...
    # Raw string prompt - no message formatting, no chat template
    outputs = self.llm.generate([prompt], self.sampling_params)
    ...

# AFTER: ChatVLLM (applies chat template)
def batch(self, inputs: list, **invoke_kwargs) -> list[str]:
    ...
    outputs = llm.chat(messages_batch, sampling_params, add_generation_prompt=True)
    ...
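
For orientation, a minimal self-contained sketch of how such a wrapper can be put together, assembled from the snippets in this PR. The message conversion, sampling parameters, and output extraction shown here are assumptions; the actual utils.py implementation and its LangChain plumbing may differ.

# Minimal sketch only - not the exact utils.py code.
from vllm import LLM, SamplingParams

class ChatVLLM:
    def __init__(self, model: str, max_tokens: int = 8192, trust_remote_code: bool = True, **vllm_kwargs):
        self.model_path = model
        self.max_tokens = max_tokens
        self.llm = LLM(model=model, trust_remote_code=trust_remote_code, **vllm_kwargs)

    def batch(self, inputs: list, **invoke_kwargs) -> list[str]:
        # Convert plain LangChain prompt strings into OpenAI-style message lists.
        messages_batch = [[{"role": "user", "content": prompt}] for prompt in inputs]
        sampling_params = SamplingParams(max_tokens=self.max_tokens)
        # llm.chat() renders each conversation with the model's chat template before generating.
        outputs = self.llm.chat(messages_batch, sampling_params, add_generation_prompt=True)
        return [out.outputs[0].text for out in outputs]

Usage then mirrors the old wrapper, e.g. ChatVLLM(model="/path/to/judge-model").batch(prompts) with a hypothetical model path.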

Example Impact

Without Chat Template (LangChain default):

User: How to write a function?
Assistant: def my_function(): pass  # Incomplete, no structure

With Chat Template (ChatVLLM):

<|im_start|>user
How to write a function?<|im_end|>
<|im_start|>assistant
<think>Let me think about this systematically...</think>

def my_function(x):
    """A complete, well-documented function."""
    return x * 2<|im_end|>
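
To see exactly what llm.chat() feeds the model, the template can be rendered directly with the tokenizer. This is an illustrative sketch; the model id is a placeholder for whichever chat/instruct model is being evaluated.

from transformers import AutoTokenizer

# Placeholder model id - any model whose tokenizer_config.json defines a chat template works.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "How to write a function?"}]
# Prints the fully templated prompt (<|im_start|>user ... <|im_start|>assistant) instead of the raw string.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))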

Changes

  • Added ChatVLLM class in utils.py:
    • Uses vllm.LLM.chat() instead of generate()
    • Converts LangChain prompts to OpenAI-style messages
    • Maintains LangChain compatibility
  • Updated make_model() to use ChatVLLM for the VLLM provider (sketched below)
  • Updated default max_tokens from 200 to 8192 to avoid truncating long responses
  • All changes have been tested
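
The make_model() routing amounts to one extra branch. A simplified sketch follows; the surrounding signature and the non-VLLM handling are assumptions, only the VLLM branch mirrors this PR.

# Sketch only - signature and other providers are assumptions, not the repo's exact code.
def make_model(model_provider: str, model_name: str, max_tokens: int | None = None, **kwargs):
    if model_provider == "VLLM":
        return ChatVLLM(
            model=model_name,
            max_tokens=max_tokens if max_tokens else 8192,
            **kwargs,
        )
    # ...other providers handled as before...
    raise ValueError(f"Unknown model provider: {model_provider}")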

References

[1] https://docs.vllm.ai/en/latest/getting_started/quickstart/#offline-batched-inference:~:text=r%7D%22)-,Note,same%20format%20as%20those%20passed%20to%20OpenAI%27s%20client.chat.completions%3A,-Code

"The llm.generate method does not automatically apply the model's chat template to the input prompt. Therefore, if you are using an Instruct model or Chat model, you should manually apply the corresponding chat template to ensure the expected behavior. Alternatively, you can use the llm.chat method..."

@geoalgo geoalgo (Collaborator) left a comment

Thanks again for the catch. Just have two minor comments.

Comment on lines +106 to +111
def __init__(self, model: str, max_tokens: int = 8192, **vllm_kwargs):
    from vllm import LLM, SamplingParams

    self.model_path = model
    self.max_tokens = max_tokens
    self.llm = LLM(model=model, trust_remote_code=True, **vllm_kwargs)

We probably want to leave the possibility to change trust_remote_code.

Suggested change
-def __init__(self, model: str, max_tokens: int = 8192, **vllm_kwargs):
-    from vllm import LLM, SamplingParams
-    self.model_path = model
-    self.max_tokens = max_tokens
-    self.llm = LLM(model=model, trust_remote_code=True, **vllm_kwargs)
+def __init__(self, model: str, max_tokens: int = 8192, trust_remote_code: bool = True, **vllm_kwargs):
+    from vllm import LLM, SamplingParams
+    self.model_path = model
+    self.max_tokens = max_tokens
+    self.llm = LLM(model=model, trust_remote_code=trust_remote_code, **vllm_kwargs)
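
With the suggested signature, a call site loading models from untrusted sources could then opt out explicitly (hypothetical example):

judge_llm = ChatVLLM(model=model_name, trust_remote_code=False)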

if model_provider == "VLLM":
    return ChatVLLM(
        model=model_name,
        max_tokens=max_tokens if max_tokens else 8192,

why do we need else 8192? The default is already 8192 in l168 right?

Suggested change
-        max_tokens=max_tokens if max_tokens else 8192,
+        max_tokens=max_tokens,

@ferreirafabio ferreirafabio merged commit cf5c2a2 into OpenEuroLLM:main Jan 12, 2026
1 check failed
@ferreirafabio ferreirafabio deleted the feature/vllm-chat-wrapper branch January 12, 2026 14:40