Conversation

@ferreirafabio ferreirafabio (Contributor) commented Jan 8, 2026

Description

The default LangChain VLLM wrapper uses vllm.LLM.generate(), which does not apply the model's chat template (see reference [1]). This causes issues for models that rely on chat templates (e.g., with <|im_start|>, <|im_end|>, <think> tags), leading to malformed inputs, truncated outputs, and biased judge evaluations.

This PR implements a ChatVLLM wrapper that uses vllm.LLM.chat() instead, which automatically applies the model's native chat template (defined in the tokenizer_config.json stored in each model directory).

The ChatVLLM wrapper converts LangChain prompts to OpenAI-style messages; vLLM then applies the chat template automatically.

# BEFORE: LangChain VLLM default (in class VLLM(BaseLLM))
def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
    ...
    # Raw string prompt - no message formatting, no chat template
    outputs = self.llm.generate([prompt], self.sampling_params)
    ...

# AFTER: ChatVLLM (applies chat template)
def batch(self, inputs: list, **invoke_kwargs) -> list[str]:
    ...
    outputs = llm.chat(messages_batch, sampling_params, add_generation_prompt=True)
    ...
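
For orientation, a minimal self-contained sketch of how such a wrapper can be put together, assembled from the snippets in this PR. The message conversion, sampling parameters, and output extraction shown here are assumptions; the actual utils.py implementation and its LangChain plumbing may differ.

# Minimal sketch only - not the exact utils.py code.
from vllm import LLM, SamplingParams

class ChatVLLM:
    def __init__(self, model: str, max_tokens: int = 8192, trust_remote_code: bool = True, **vllm_kwargs):
        self.model_path = model
        self.max_tokens = max_tokens
        self.llm = LLM(model=model, trust_remote_code=trust_remote_code, **vllm_kwargs)

    def batch(self, inputs: list, **invoke_kwargs) -> list[str]:
        # Convert plain LangChain prompt strings into OpenAI-style message lists.
        messages_batch = [[{"role": "user", "content": prompt}] for prompt in inputs]
        sampling_params = SamplingParams(max_tokens=self.max_tokens)
        # llm.chat() renders each conversation with the model's chat template before generating.
        outputs = self.llm.chat(messages_batch, sampling_params, add_generation_prompt=True)
        return [out.outputs[0].text for out in outputs]

Usage then mirrors the old wrapper, e.g. ChatVLLM(model="/path/to/judge-model").batch(prompts) with a hypothetical model path.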

Example Impact

Without Chat Template (LangChain default):

User: How to write a function?
Assistant: def my_function(): pass  # Incomplete, no structure

With Chat Template (ChatVLLM):

<|im_start|>user
How to write a function?<|im_end|>
<|im_start|>assistant
<think>Let me think about this systematically...</think>

def my_function(x):
    """A complete, well-documented function."""
    return x * 2<|im_end|>
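
To see exactly what llm.chat() feeds the model, the template can be rendered directly with the tokenizer. This is an illustrative sketch; the model id is a placeholder for whichever chat/instruct model is being evaluated.

from transformers import AutoTokenizer

# Placeholder model id - any model whose tokenizer_config.json defines a chat template works.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "How to write a function?"}]
# Prints the fully templated prompt (<|im_start|>user ... <|im_start|>assistant) instead of the raw string.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))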

Changes

  • Added ChatVLLM class in utils.py:
    • Uses vllm.LLM.chat() instead of generate()
    • Converts LangChain prompts to OpenAI-style messages
    • Maintains LangChain compatibility
  • Updated make_model() to use ChatVLLM for the VLLM provider (sketched below)
  • Updated default max_tokens from 200 to 8192 to avoid truncating long responses
  • All changes have been tested
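
The make_model() routing amounts to one extra branch. A simplified sketch follows; the surrounding signature and the non-VLLM handling are assumptions, only the VLLM branch mirrors this PR.

# Sketch only - signature and other providers are assumptions, not the repo's exact code.
def make_model(model_provider: str, model_name: str, max_tokens: int | None = None, **kwargs):
    if model_provider == "VLLM":
        return ChatVLLM(
            model=model_name,
            max_tokens=max_tokens if max_tokens else 8192,
            **kwargs,
        )
    # ...other providers handled as before...
    raise ValueError(f"Unknown model provider: {model_provider}")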

References

[1] https://docs.vllm.ai/en/latest/getting_started/quickstart/#offline-batched-inference:~:text=r%7D%22)-,Note,same%20format%20as%20those%20passed%20to%20OpenAI%27s%20client.chat.completions%3A,-Code

"The llm.generate method does not automatically apply the model's chat template to the input prompt. Therefore, if you are using an Instruct model or Chat model, you should manually apply the corresponding chat template to ensure the expected behavior. Alternatively, you can use the llm.chat method..."

@geoalgo geoalgo (Collaborator) left a comment

Thanks again for the catch. Just have two minor comments.

Comment on lines +106 to +111
def __init__(self, model: str, max_tokens: int = 8192, **vllm_kwargs):
    from vllm import LLM, SamplingParams

    self.model_path = model
    self.max_tokens = max_tokens
    self.llm = LLM(model=model, trust_remote_code=True, **vllm_kwargs)

We probably want to leave the possibility to change trust_remote_code.

Suggested change
-def __init__(self, model: str, max_tokens: int = 8192, **vllm_kwargs):
-    from vllm import LLM, SamplingParams
-    self.model_path = model
-    self.max_tokens = max_tokens
-    self.llm = LLM(model=model, trust_remote_code=True, **vllm_kwargs)
+def __init__(self, model: str, max_tokens: int = 8192, trust_remote_code: bool = True, **vllm_kwargs):
+    from vllm import LLM, SamplingParams
+    self.model_path = model
+    self.max_tokens = max_tokens
+    self.llm = LLM(model=model, trust_remote_code=trust_remote_code, **vllm_kwargs)
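
With the suggested signature, a call site loading models from untrusted sources could then opt out explicitly (hypothetical example):

judge_llm = ChatVLLM(model=model_name, trust_remote_code=False)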

if model_provider == "VLLM":
    return ChatVLLM(
        model=model_name,
        max_tokens=max_tokens if max_tokens else 8192,

why do we need else 8192? The default is already 8192 in l168 right?

Suggested change
-        max_tokens=max_tokens if max_tokens else 8192,
+        max_tokens=max_tokens,

@ferreirafabio ferreirafabio merged commit cf5c2a2 into OpenEuroLLM:main Jan 12, 2026
1 check failed
@ferreirafabio ferreirafabio deleted the feature/vllm-chat-wrapper branch January 12, 2026 14:40