
Conversation

@kacperlukawski
Member

All Submissions:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  • Does your submission pass the existing tests?
  • Have you added tests for your feature?
  • Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

New models submission:

  • Have you added an explanation of why it's important to include this model?
  • Have you added tests for the new model? Were canonical values for tests computed via the original model?
  • Have you added the code snippet for how canonical values were computed?
  • Have you successfully run tests with your changes locally?

@kacperlukawski kacperlukawski force-pushed the feature/colmodernvbert-support branch 2 times, most recently from bf215e1 to 2f0e85a on December 8, 2025 12:04
@kacperlukawski kacperlukawski requested a review from joein December 8, 2025 12:46
@kacperlukawski kacperlukawski marked this pull request as ready for review December 8, 2025 12:46

Copilot AI left a comment


Pull request overview

This PR adds support for the ColModernVBERT model (Qdrant/colmodernvbert), a late-interaction multimodal embedding model that uses bidirectional attention for retrieval tasks. The implementation extends the existing late-interaction multimodal embedding infrastructure with support for image splitting and dynamic patch handling.

Key Changes:

  • Introduced ColModernVBERT class with support for image splitting and patch-based processing
  • Added new image transformation operators (ImageSplitter, ResizeForVisionEncoder, SquareResize, ResizeLongestEdge) to handle Idefics3-style image preprocessing
  • Enhanced ONNX multimodal model to handle nested image structures with attention masks and metadata
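
To make the registry-level API concrete, here is a minimal usage sketch. It assumes only what this PR states: the Qdrant/colmodernvbert model id, the LateInteractionMultimodalEmbedding entry point, and its embed_text/embed_image methods; the query string and image path are placeholders, and the exact import path may differ from the released package layout.

from fastembed.late_interaction_multimodal import LateInteractionMultimodalEmbedding

# Instantiate the newly registered model by its id.
model = LateInteractionMultimodalEmbedding("Qdrant/colmodernvbert")

# Each text query yields one (num_tokens, 128) matrix for late interaction.
query_embeddings = list(model.embed_text(["What is shown in the figure?"]))

# Large images are split into patches during preprocessing before embedding.
image_embeddings = list(model.embed_image(["path/to/image.png"]))  # placeholder path

print(query_embeddings[0].shape, image_embeddings[0].shape)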

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Summary per file:
  • fastembed/late_interaction_multimodal/colmodernvbert.py: New model implementation with text/image embedding support, patch computation logic, and worker classes
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py: Added ColModernVBERT to the embeddings registry
  • fastembed/late_interaction_multimodal/onnx_multimodal_model.py: Updated image embedding to support nested structures with patch-level attention masks
  • fastembed/image/transform/operators.py: Added transforms for longest edge resizing, vision encoder resizing, image splitting, and square resizing
  • fastembed/image/transform/functional.py: Added utility functions for resizing longest edge, cropping ndarrays, and resizing ndarrays
  • fastembed/common/preprocessor_utils.py: Added conditional check to avoid re-enabling padding if already configured
  • fastembed/common/onnx_model.py: Added optional metadata field to OnnxOutputContext
  • tests/test_late_interaction_multimodal.py: Added canonical values and test cases for the new model


Comment on lines +159 to +166
        Post-process the ONNX model output to convert it into a usable format.

        Args:
            output (OnnxOutputContext): The raw output from the ONNX model.

        Returns:
            Iterable[NumpyArray]: Post-processed output as NumPy arrays.

Copilot AI Dec 8, 2025


The docstring describes post-processing output, but this method actually preprocesses text input. The docstring should describe what the method does: adding empty image placeholders to text input for the ONNX model. The Args should describe onnx_input (not output), and the Returns should describe the updated onnx_input dict.

Suggested change
-        Post-process the ONNX model output to convert it into a usable format.
-
-        Args:
-            output (OnnxOutputContext): The raw output from the ONNX model.
-
-        Returns:
-            Iterable[NumpyArray]: Post-processed output as NumPy arrays.
+        Preprocesses the text input for the ONNX model by adding empty image placeholders.
+
+        Args:
+            onnx_input (dict[str, NumpyArray]): The text input dictionary for the ONNX model.
+
+        Returns:
+            dict[str, NumpyArray]: The updated input dictionary with empty image placeholders added.


coderabbitai bot commented Dec 8, 2025

📝 Walkthrough

Adds ColModernVBERT, a new late-interaction multimodal ONNX embedding model, and registers it. Introduces extensive image utilities and transforms (resize_longest_edge, crop_ndarray, resize_ndarray, ResizeLongestEdge, ResizeForVisionEncoder, ImageSplitter, SquareResize) and updates Normalize/Rescale to accept nested patch lists. Makes tokenizer padding conditional. Updates ONNX image handling to use ExitStack for context-managed image opening. Extends OnnxOutputContext with an optional metadata field. Adds CI-aware model caching and a module-scoped model_cache fixture plus related test updates.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Suggested reviewers

  • tbung
  • generall
🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 20.34%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title 'Model: ModernVBERT/colmodernvbert' directly relates to the main change: adding support for a new ColModernVBERT model to the FastEmbed library.
  • Description check ✅ Passed: The description is related to the changeset, confirming that a new model (ModernVBERT/colmodernvbert) was added with tests and canonical values, matching the implementation changes shown in the raw summary.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (5)
fastembed/image/transform/functional.py (1)

194-223: Verify float input range assumption.

The function assumes float arrays are normalized to the [0, 1] range (line 209: img_hwc * 255). Values outside this range will clip or overflow. Also, float64 inputs are converted to float32 on output (line 212).

If this is intentional for the pipeline (where rescale happens later), consider adding a docstring note about the expected input range.
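
A minimal sketch of such a docstring note, assuming the [0, 1] convention described above; the function name mirrors functional.py, but the body is illustrative rather than the PR's actual code.

import numpy as np
from PIL import Image

def resize_ndarray(img_hwc: np.ndarray, size: tuple[int, int]) -> np.ndarray:
    """Resize an HWC array via PIL.

    Note: float inputs are assumed to be normalized to the [0, 1] range;
    values outside it are clipped when converting to uint8 for PIL.
    """
    if np.issubdtype(img_hwc.dtype, np.floating):
        as_uint8 = np.clip(img_hwc * 255.0, 0, 255).astype(np.uint8)
        resized = Image.fromarray(as_uint8).resize(size)  # size is (width, height)
        return np.asarray(resized).astype(np.float32) / 255.0
    return np.asarray(Image.fromarray(img_hwc).resize(size))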

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

16-19: Consider annotating with ClassVar for type safety.

The EMBEDDINGS_REGISTRY is a mutable class attribute. Annotating it with ClassVar makes the intent explicit and prevents accidental instance-level assignment.

+from typing import Any, ClassVar, Iterable, Optional, Sequence, Type, Union
-from typing import Any, Iterable, Optional, Sequence, Type, Union
-    EMBEDDINGS_REGISTRY: list[Type[LateInteractionMultimodalEmbeddingBase]] = [
+    EMBEDDINGS_REGISTRY: ClassVar[list[Type[LateInteractionMultimodalEmbeddingBase]]] = [
         ColPali,
         ColModernVBERT,
     ]
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)

173-177: Unused context manager.

The ExitStack is created but never used to manage any resources. If this was intended for managing image file handles, consider either implementing that pattern or removing the unused context manager.
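
For reference, the resource-managed pattern this comment alludes to looks roughly like the sketch below (not the PR's code): PIL images are context managers, and ExitStack closes every registered handle when the block exits, even on error.

from contextlib import ExitStack
from PIL import Image

def load_images(paths: list[str]) -> list[Image.Image]:
    with ExitStack() as stack:
        # Register each opened image so all file handles close on exit.
        images = [stack.enter_context(Image.open(path)) for path in paths]
        for image in images:
            image.load()  # force pixel data into memory before handles close
        return images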

fastembed/late_interaction_multimodal/colmodernvbert.py (2)

136-153: Consider adding error handling for missing config files.

If any of the config files (processor_config.json, preprocessor_config.json, config.json) are missing, the error won't clearly indicate which model configuration is incomplete. Consider wrapping in try-except with a clearer message.

         # Load image processing configuration
         processor_config_path = self._model_dir / "processor_config.json"
-        with open(processor_config_path) as f:
-            processor_config = json.load(f)
-            self.image_seq_len = processor_config.get("image_seq_len", 64)
+        try:
+            with open(processor_config_path) as f:
+                processor_config = json.load(f)
+                self.image_seq_len = processor_config.get("image_seq_len", 64)
+        except FileNotFoundError:
+            raise FileNotFoundError(
+                f"processor_config.json not found at {processor_config_path}. "
+                "Ensure the model was downloaded correctly."
+            )

167-172: Add guard for self.image_size being None.

self.image_size is Optional[int] and initialized to None. While load_onnx_model should be called before this method, adding an assertion would provide a clearer error message and satisfy type checking.

     def _preprocess_onnx_text_input(
         self, onnx_input: dict[str, NumpyArray], **kwargs: Any
     ) -> dict[str, NumpyArray]:
-        """
-        Post-process the ONNX model output to convert it into a usable format.
-
-        Args:
-            output (OnnxOutputContext): The raw output from the ONNX model.
-
-        Returns:
-            Iterable[NumpyArray]: Post-processed output as NumPy arrays.
-        """
+        """Preprocess text input by adding empty image placeholders."""
+        assert self.image_size is not None, "image_size not initialized. Call load_onnx_model first."
         batch_size, seq_length = onnx_input["input_ids"].shape
         empty_image_placeholder: NumpyArray = np.zeros(
-            (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32  # type: ignore[type-var,arg-type,assignment]
+            (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32
         )
         onnx_input["pixel_values"] = empty_image_placeholder
         return onnx_input

Note: The docstring also appears to be copy-pasted from a post-process method — updated it to match the actual purpose.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 428381c and e462bf0.

📒 Files selected for processing (8)
  • fastembed/common/onnx_model.py (1 hunks)
  • fastembed/common/preprocessor_utils.py (1 hunks)
  • fastembed/image/transform/functional.py (1 hunks)
  • fastembed/image/transform/operators.py (9 hunks)
  • fastembed/late_interaction_multimodal/colmodernvbert.py (1 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1 hunks)
  • tests/test_late_interaction_multimodal.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.14.7)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py

16-19: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

fastembed/late_interaction_multimodal/colmodernvbert.py

156-156: Unused method argument: kwargs

(ARG002)


189-189: Unused method argument: kwargs

(ARG002)


194-194: Unused method argument: kwargs

(ARG002)

fastembed/image/transform/operators.py

390-392: Avoid specifying long messages outside the exception class

(TRY003)


429-429: Avoid specifying long messages outside the exception class

(TRY003)


432-432: Unused static method argument: config

(ARG004)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Agent
🔇 Additional comments (18)
fastembed/common/preprocessor_utils.py (1)

53-56: LGTM! Conditional padding avoids overwriting existing configuration.

The guard prevents re-enabling padding when it's already configured, which is useful when tokenizer configs already include padding settings.
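
The guard pattern can be sketched with the tokenizers API (Tokenizer.padding is None until a padding strategy is enabled); parameter values here are illustrative, not the PR's exact code.

from tokenizers import Tokenizer

def ensure_padding(tokenizer: Tokenizer, pad_id: int, pad_token: str) -> None:
    # Only enable padding when no strategy is configured yet, so settings
    # already loaded from tokenizer.json are not clobbered.
    if tokenizer.padding is None:
        tokenizer.enable_padding(pad_id=pad_id, pad_token=pad_token)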

fastembed/image/transform/operators.py (9)

44-55: LGTM! Safe handling of both flat and nested image structures.

The images and isinstance(images[0], list) pattern correctly short-circuits on empty input, preventing IndexError.


75-86: Consistent with Normalize pattern.

Same safe short-circuit evaluation for nested structure detection.


109-121: Clean implementation delegating to functional utility.


124-167: Logic correctly aligns dimensions to multiples of max_size.

The aspect ratio calculation and rounding ensure proper vision encoder compatibility. PIL's (width, height) convention is correctly used in the resize call.
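
An illustrative reconstruction of that alignment step, assuming Idefics3-style round-up-to-multiple behavior; not the PR's exact code.

import math

def align_to_multiples(height: int, width: int, max_size: int) -> tuple[int, int]:
    # Round each side up to the nearest multiple of max_size, deriving the
    # other side from the aspect ratio so proportions are roughly preserved.
    aspect_ratio = width / height
    if width >= height:
        width = math.ceil(width / max_size) * max_size
        height = math.ceil((width / aspect_ratio) / max_size) * max_size
    else:
        height = math.ceil(height / max_size) * max_size
        width = math.ceil((height * aspect_ratio) / max_size) * max_size
    return height, width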


170-242: Patch splitting logic handles boundary patches correctly.

Edge patches are properly clamped via min(start_x + optimal_width, width), ensuring no out-of-bounds access. Note that boundary patches may have smaller dimensions than interior patches, which is expected behavior for grid-based splitting.
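
A sketch of the clamped grid split described here; split_into_grid and its parameters are illustrative names, not the PR's ImageSplitter.

import numpy as np

def split_into_grid(image: np.ndarray, patch_h: int, patch_w: int) -> list[np.ndarray]:
    height, width = image.shape[:2]
    patches = []
    for start_y in range(0, height, patch_h):
        for start_x in range(0, width, patch_w):
            end_y = min(start_y + patch_h, height)  # clamp at the bottom edge
            end_x = min(start_x + patch_w, width)   # clamp at the right edge
            patches.append(image[start_y:end_y, start_x:end_x])
    return patches  # boundary patches may be smaller than interior ones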


245-267: Output format correctly matches ImageSplitter for pipeline compatibility.

Wrapping single resized image in inner list ensures downstream transforms (Normalize, Rescale) work uniformly.


386-404: Proper handling of Idefics3ImageProcessor configuration.

The resample parameter conversion from int to Image.Resampling handles configs that serialize enums as integers.


426-429: Intentional no-op for Idefics3ImageProcessor center crop.

The unused config parameter (ARG004 hint) is required by the method signature pattern. Safely ignore this static analysis warning.


435-455: Clean conditional pipeline configuration for image splitting.

The @classmethod decorator is appropriate here since it follows the pattern of other _get_* methods. The chained .get() calls provide safe defaults.

fastembed/image/transform/functional.py (2)

152-175: Even dimension enforcement is useful for vision encoder compatibility.

The aspect ratio calculation and dimension adjustment logic is correct. Ensuring even dimensions (lines 170-173) is a common requirement for certain encoder architectures.
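
A minimal sketch of the dimension logic described (illustrative names; very small inputs would need extra guards).

def resize_longest_edge_dims(height: int, width: int, longest_edge: int) -> tuple[int, int]:
    # Scale so the longest side equals longest_edge, preserving aspect ratio.
    scale = longest_edge / max(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    # Force even dimensions, which some vision encoders require.
    return new_h - new_h % 2, new_w - new_w % 2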


178-191: Simple and correct crop implementation.

Relies on caller for bounds validation, which is appropriate since ImageSplitter already clamps coordinates before calling this function.

tests/test_late_interaction_multimodal.py (4)

24-34: Test data added for colmodernvbert image embeddings.

Per the PR objectives, the code snippet showing how canonical test values were computed was not added. Consider adding this to the PR description or as a comment for reproducibility.


49-59: Query embedding canonical values added.

Consistent with image embedding test data structure.


115-117: Embedding size verification for colmodernvbert.

Confirms the model reports 128-dimensional embeddings, matching the existing colpali model.


130-133: Instance-based embedding size check added.

Tests lazy-loaded model instance reports correct embedding size.

fastembed/common/onnx_model.py (1)

19-24: Clean backward-compatible extension of OnnxOutputContext.

The optional metadata field allows carrying additional context (e.g., patch counts for image splitting) without breaking existing code that doesn't use it.
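
A sketch of the extended dataclass: metadata is the new field, while the other fields follow the attributes referenced elsewhere in this review (model_output, attention_mask, input_ids) and may not match the file exactly.

from dataclasses import dataclass
from typing import Any, Optional
import numpy as np

@dataclass
class OnnxOutputContext:
    model_output: np.ndarray
    attention_mask: Optional[np.ndarray] = None
    input_ids: Optional[np.ndarray] = None
    # Optional extra context, e.g. {"patch_counts": [9, 1, 4]} for image splitting.
    metadata: Optional[dict[str, Any]] = None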

fastembed/late_interaction_multimodal/colmodernvbert.py (1)

35-45: Clean implementation following existing patterns.

The ColModernVBERT class properly inherits from both LateInteractionMultimodalEmbeddingBase and OnnxMultimodalModel, follows the established patterns from ColPali, and implements the required abstract methods. The visual prompt prefix and overall structure look correct.

Comment on lines 186 to 194
            patch_counts = [len(patches) for patches in processed]
            max_patches = max(patch_counts)

            # Get dimensions from first patch
            C, H, W = processed[0][0].shape

            # Create padded array
            batch_size = len(processed)
            encoded = np.zeros((batch_size, max_patches, C, H, W), dtype=processed[0][0].dtype)

⚠️ Potential issue | 🟡 Minor

Potential IndexError if an image produces no patches.

If processed[0] is an empty list, accessing processed[0][0] will raise an IndexError. Consider adding a guard or handling the edge case where an image might yield zero patches.

             patch_counts = [len(patches) for patches in processed]
             max_patches = max(patch_counts)
+            
+            if max_patches == 0:
+                raise ValueError("All images produced zero patches")

             # Get dimensions from first patch
-            C, H, W = processed[0][0].shape
+            # Find first non-empty patch list to get dimensions
+            first_patch = next(
+                (patches[0] for patches in processed if len(patches) > 0), None
+            )
+            if first_patch is None:
+                raise ValueError("No valid patches found in processed images")
+            C, H, W = first_patch.shape

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/onnx_multimodal_model.py around lines
186-194, the code assumes processed[0] contains at least one patch and uses
processed[0][0].shape which will IndexError if any image produced zero patches;
change it to first find a non-empty entry (e.g., iterate processed to locate the
first patches list with length>0) and use that patch's shape for C,H,W, and
handle the case where no images have patches (max_patches == 0) by either
returning an appropriately shaped empty encoded array or raising a clear
exception; ensure subsequent logic that pads/fills encoded handles the
zero-patch case.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (2)
fastembed/late_interaction_multimodal/colmodernvbert.py (2)

37-41: Clarify model identifier in docstring.

The docstring references ModernVBERT/colmodernvbert but the registered model identifier is Qdrant/colmodernvbert. Consider clarifying that this is the ONNX-converted version of the original model.


158-166: Incorrect docstring: describes post-processing but method preprocesses.

This docstring describes post-processing output, but the method actually preprocesses text input by adding empty image placeholders.

🧹 Nitpick comments (1)
fastembed/late_interaction_multimodal/colmodernvbert.py (1)

136-153: Add error handling for config file loading.

The config file loading lacks error handling for missing or malformed JSON files. Consider wrapping in try-except to provide clearer error messages when config files are missing or corrupted.

         # Load image processing configuration
         processor_config_path = self._model_dir / "processor_config.json"
-        with open(processor_config_path) as f:
-            processor_config = json.load(f)
-            self.image_seq_len = processor_config.get("image_seq_len", 64)
+        try:
+            with open(processor_config_path) as f:
+                processor_config = json.load(f)
+                self.image_seq_len = processor_config.get("image_seq_len", 64)
+        except FileNotFoundError:
+            raise FileNotFoundError(
+                f"Required config file not found: {processor_config_path}"
+            )
+        except json.JSONDecodeError as e:
+            raise ValueError(f"Invalid JSON in {processor_config_path}: {e}")

Apply similar handling to the other config file reads.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e462bf0 and 35b9026.

📒 Files selected for processing (1)
  • fastembed/late_interaction_multimodal/colmodernvbert.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
fastembed/late_interaction_multimodal/colmodernvbert.py

156-156: Unused method argument: kwargs

(ARG002)


189-189: Unused method argument: kwargs

(ARG002)


194-194: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (7)
fastembed/late_interaction_multimodal/colmodernvbert.py (7)

1-32: LGTM!

Imports and model description are well-structured and follow the existing patterns in the codebase.


174-191: LGTM!

These methods are straightforward. The unused kwargs parameters are likely intentional for interface compatibility with base class method signatures.


258-280: LGTM!

The grid dimension computation logic is sound with appropriate fallback handling for edge cases.


330-346: LGTM!

The post-processing logic correctly reshapes the output tensor using the model's embedding dimension.


348-418: LGTM!

Both embedding methods follow the established patterns and correctly delegate to internal implementation methods.


420-426: LGTM!

Worker class references are correct.


429-446: LGTM!

Worker classes correctly implement init_embedding with appropriate defaults for parallel processing contexts.

Comment on lines +106 to +111
        self.mask_token_id = None
        self.pad_token_id = None
        self.image_seq_len: Optional[int] = None
        self.max_image_size: Optional[int] = None
        self.image_size: Optional[int] = None

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, check the class definition and its base classes
head -50 fastembed/late_interaction_multimodal/colmodernvbert.py | grep -A 5 "^class "

Repository: qdrant/fastembed

Length of output: 367


🏁 Script executed:

#!/bin/bash
# Search for mask_token_id and pad_token_id usage across the codebase
rg -n "mask_token_id|pad_token_id" --type py

Repository: qdrant/fastembed

Length of output: 1452


🏁 Script executed:

#!/bin/bash
# Check if the file has any references to these attributes
cat -n fastembed/late_interaction_multimodal/colmodernvbert.py | grep -E "(mask_token_id|pad_token_id)"

Repository: qdrant/fastembed

Length of output: 332


🏁 Script executed:

#!/bin/bash
# Find the LateInteractionMultimodalEmbeddingBase class definition
fd -e py | xargs grep -l "class LateInteractionMultimodalEmbeddingBase"

Repository: qdrant/fastembed

Length of output: 143


🏁 Script executed:

#!/bin/bash
# Check the base class for any requirements on mask_token_id and pad_token_id
rg -A 30 "class LateInteractionMultimodalEmbeddingBase" --type py

Repository: qdrant/fastembed

Length of output: 3890


Remove unused instance attributes mask_token_id and pad_token_id.

These attributes are initialized to None at lines 106-107 but are never assigned values or used anywhere in the class. The base class LateInteractionMultimodalEmbeddingBase does not require them, and the class logic uses local variables instead (e.g., pad_token_id is obtained directly from self.tokenizer.padding["pad_id"] at line 235). These are dead code and should be removed.

🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/colmodernvbert.py around lines 106 to
110, remove the dead instance attributes mask_token_id and pad_token_id which
are initialized to None but never used; delete those two attribute
initializations and ensure no other code depends on self.mask_token_id or
self.pad_token_id (use local tokenizer/padding lookups already present), then
run tests/linters to confirm no references remain.

Comment on lines 167 to 173
        batch_size, seq_length = onnx_input["input_ids"].shape
        empty_image_placeholder: NumpyArray = np.zeros(
            (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32  # type: ignore[type-var,arg-type,assignment]
        )
        onnx_input["pixel_values"] = empty_image_placeholder
        return onnx_input

⚠️ Potential issue | 🟠 Major

Guard against None value for self.image_size.

self.image_size is Optional[int] initialized to None, and is only set during load_onnx_model(). If this method is called before the model is loaded (e.g., with lazy_load=True), using self.image_size will produce an invalid array shape.

+        if self.image_size is None:
+            raise RuntimeError(
+                "Model not loaded. Call load_onnx_model() first or initialize with lazy_load=False"
+            )
         batch_size, seq_length = onnx_input["input_ids"].shape
-        empty_image_placeholder: NumpyArray = np.zeros(
-            (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32  # type: ignore[type-var,arg-type,assignment]
-        )
+        empty_image_placeholder: NumpyArray = np.zeros(
+            (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32
+        )
🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/colmodernvbert.py around lines 167-172,
guard against self.image_size being None before using it to build the
empty_image_placeholder; if self.image_size is None either call
self.load_onnx_model() (if safe here) or raise a clear ValueError indicating the
model must be loaded (or lazy_load disabled) before creating pixel_values, and
ensure you cast/check the value to an int before using it to construct the numpy
shape so an invalid shape is never created.

Comment on lines +233 to +241
        # Get padding config from tokenizer
        padding_direction = self.tokenizer.padding["direction"]  # type: ignore[index,union-attr]
        pad_token_id = self.tokenizer.padding["pad_id"]  # type: ignore[index,union-attr]

        # Initialize with pad token
        padded_input_ids = np.full((batch_size, max_len), pad_token_id, dtype=np.int64)

⚠️ Potential issue | 🟡 Minor

Add guard for uninitialized tokenizer.

self.tokenizer could be None if the model isn't fully loaded. The type: ignore comments suppress the error but don't prevent a runtime TypeError.

+        if self.tokenizer is None or self.tokenizer.padding is None:
+            raise RuntimeError("Tokenizer not initialized. Ensure model is loaded.")
+
         # Get padding config from tokenizer
-        padding_direction = self.tokenizer.padding["direction"]  # type: ignore[index,union-attr]
-        pad_token_id = self.tokenizer.padding["pad_id"]  # type: ignore[index,union-attr]
+        padding_direction = self.tokenizer.padding["direction"]
+        pad_token_id = self.tokenizer.padding["pad_id"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        # Get padding config from tokenizer
-        padding_direction = self.tokenizer.padding["direction"]  # type: ignore[index,union-attr]
-        pad_token_id = self.tokenizer.padding["pad_id"]  # type: ignore[index,union-attr]
-
-        # Initialize with pad token
-        padded_input_ids = np.full((batch_size, max_len), pad_token_id, dtype=np.int64)
+        if self.tokenizer is None or self.tokenizer.padding is None:
+            raise RuntimeError("Tokenizer not initialized. Ensure model is loaded.")
+
+        # Get padding config from tokenizer
+        padding_direction = self.tokenizer.padding["direction"]
+        pad_token_id = self.tokenizer.padding["pad_id"]
+
+        # Initialize with pad token
+        padded_input_ids = np.full((batch_size, max_len), pad_token_id, dtype=np.int64)
🤖 Prompt for AI Agents
fastembed/late_interaction_multimodal/colmodernvbert.py lines 233-238: the code
assumes self.tokenizer is set and immediately accesses its padding attributes,
which can raise a TypeError if tokenizer is None; add a guard that checks if
self.tokenizer is None and either (a) raise a clear RuntimeError/ValueError
stating the model/tokenizer is not initialized and instructing how to load it,
or (b) attempt to lazy-initialize or retrieve a fallback tokenizer before
accessing padding; after the guard, safely read padding_direction and
pad_token_id and remove reliance on the type: ignore suppression so the runtime
can fail loudly or recover with a proper fallback.

Comment on lines +282 to +314
    def _create_single_image_prompt_string(self) -> str:
        return (
            "<fake_token_around_image>"
            + "<global-img>"
            + "<image>" * self.image_seq_len  # type: ignore[operator]
            + "<fake_token_around_image>"
        )

    def _create_split_image_prompt_string(self, rows: int, cols: int) -> str:
        text_split_images = ""

        # Add tokens for each patch in the grid
        for n_h in range(rows):
            for n_w in range(cols):
                text_split_images += (
                    "<fake_token_around_image>"
                    + f"<row_{n_h + 1}_col_{n_w + 1}>"
                    + "<image>" * self.image_seq_len  # type: ignore[operator]
                )
            text_split_images += "\n"

        # Add global image at the end
        text_split_images += (
            "\n<fake_token_around_image>"
            + "<global-img>"
            + "<image>" * self.image_seq_len  # type: ignore[operator]
            + "<fake_token_around_image>"
        )

        return text_split_images

⚠️ Potential issue | 🟡 Minor

Guard against None value for self.image_seq_len.

Similar to self.image_size, self.image_seq_len could be None if the model isn't loaded, causing string multiplication to fail with TypeError.

Consider adding a runtime check at the start of methods that depend on model configuration:

if self.image_seq_len is None:
    raise RuntimeError("Model not loaded. Call load_onnx_model() first.")
🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/colmodernvbert.py around lines 282 to
311, both _create_single_image_prompt_string and
_create_split_image_prompt_string assume self.image_seq_len is not None and will
raise a TypeError when it's None; add a runtime guard at the start of each
method that checks if self.image_seq_len is None and raises a RuntimeError with
a clear message like "Model not loaded. Call load_onnx_model() first." so
callers get a meaningful error instead of a TypeError from string
multiplication.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (4)
fastembed/late_interaction_multimodal/colmodernvbert.py (4)

159-165: Fix incorrect docstring.

The docstring describes post-processing output, but this method actually preprocesses text input by adding empty image placeholders. The Args describe output when the parameter is onnx_input, and Returns describe the updated input dictionary.

-        Post-process the ONNX model output to convert it into a usable format.
+        Preprocesses the text input for the ONNX model by adding empty image placeholders.
 
         Args:
-            output (OnnxOutputContext): The raw output from the ONNX model.
+            onnx_input (dict[str, NumpyArray]): The text input dictionary for the ONNX model.
 
         Returns:
-            Iterable[NumpyArray]: Post-processed output as NumPy arrays.
+            dict[str, NumpyArray]: The updated input dictionary with empty image placeholders added.

233-238: Add guard for uninitialized tokenizer.

self.tokenizer is Optional[Tokenizer] and could be None if the model isn't fully loaded. Accessing self.tokenizer.padding at lines 234-235 will raise an AttributeError if called with lazy_load=True before the model is loaded. The type: ignore comments suppress type errors but don't prevent the runtime exception.

+        if self.tokenizer is None or self.tokenizer.padding is None:
+            raise RuntimeError("Tokenizer not initialized. Ensure model is loaded.")
+
         # Get padding config from tokenizer
         padding_direction = self.tokenizer.padding["direction"]  # type: ignore[index,union-attr]
         pad_token_id = self.tokenizer.padding["pad_id"]  # type: ignore[index,union-attr]

167-170: Guard against None value for self.image_size.

self.image_size is Optional[int] initialized to None (line 110) and only set during load_onnx_model(). If this method is called with lazy_load=True before the model is loaded, using self.image_size at line 169 will create an invalid array shape or raise a TypeError.

+        if self.image_size is None:
+            raise RuntimeError(
+                "Model not loaded. Call load_onnx_model() first or initialize with lazy_load=False"
+            )
         batch_size, seq_length = onnx_input["input_ids"].shape
         empty_image_placeholder: NumpyArray = np.zeros(
             (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32  # type: ignore[type-var,arg-type,assignment]

282-311: Guard against None value for self.image_seq_len.

Both _create_single_image_prompt_string and _create_split_image_prompt_string use self.image_seq_len for string multiplication (lines 286, 299, 307). self.image_seq_len is Optional[int] initialized to None (line 108) and only set during load_onnx_model(). If called with lazy_load=True before the model is loaded, these will raise TypeError: can't multiply sequence by NoneType.

Add a runtime check at the start of both methods:

 def _create_single_image_prompt_string(self) -> str:
+    if self.image_seq_len is None:
+        raise RuntimeError("Model not loaded. Call load_onnx_model() first.")
     return (
 def _create_split_image_prompt_string(self, rows: int, cols: int) -> str:
+    if self.image_seq_len is None:
+        raise RuntimeError("Model not loaded. Call load_onnx_model() first.")
     text_split_images = ""
🧹 Nitpick comments (2)
fastembed/late_interaction_multimodal/colmodernvbert.py (2)

37-40: Clarify the relationship between original and ONNX models in docstring.

The docstring references "ModernVBERT/colmodernvbert" but the actual model identifier used is "Qdrant/colmodernvbert" (line 23). Consider clarifying that Qdrant hosts the ONNX-converted version of the original ModernVBERT model to help users understand which model files are being loaded.

-    The ModernVBERT/colmodernvbert model implementation. This model uses
-    bidirectional attention, which proves to work better for retrieval.
-
-    See: https://huggingface.co/ModernVBERT/colmodernvbert
+    The ColModernVBERT model implementation using ONNX runtime. This model uses
+    bidirectional attention, which proves to work better for retrieval.
+
+    This is the ONNX-converted version hosted at https://huggingface.co/Qdrant/colmodernvbert
+    Original model: https://huggingface.co/ModernVBERT/colmodernvbert

106-107: Remove unused instance attributes.

Based on learnings, mask_token_id and pad_token_id are initialized here but never assigned values or used anywhere in the class. The padding logic at lines 234-235 retrieves pad_token_id directly from self.tokenizer.padding["pad_id"] instead. These lines are dead code and can be safely removed.

-        self.mask_token_id = None
-        self.pad_token_id = None
         self.image_seq_len: Optional[int] = None
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 35b9026 and e0acfe7.

📒 Files selected for processing (1)
  • fastembed/late_interaction_multimodal/colmodernvbert.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
fastembed/late_interaction_multimodal/colmodernvbert.py

156-156: Unused method argument: kwargs

(ARG002)


189-189: Unused method argument: kwargs

(ARG002)


194-194: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
🔇 Additional comments (3)
fastembed/late_interaction_multimodal/colmodernvbert.py (3)

330-346: LGTM!

The post-processing logic correctly reshapes the output and includes a proper assertion for the model dimension.


348-418: LGTM!

Both embed_text and embed_image methods properly delegate to inherited base methods with all configuration parameters correctly passed through.


420-446: LGTM!

Worker class methods and class definitions follow the standard pattern. Both text and image workers correctly initialize ColModernVBERT instances with single-threaded configuration.

@kacperlukawski kacperlukawski marked this pull request as draft December 9, 2025 12:42
@kacperlukawski kacperlukawski marked this pull request as ready for review December 9, 2025 15:15

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (2)

217-225: Guard against empty patch lists in _process_nested_patches.

If processed[0] contains zero patches (or the first few images yield no patches), processed[0][0] will raise an IndexError, even though later images may have valid patches. Similarly, if all images produce zero patches, max(patch_counts) is 0 and you have no patch to infer C, H, W from.

Consider locating the first non-empty patch list and handling the all-empty case explicitly:

-        patch_counts = [len(patches) for patches in processed]
-        max_patches = max(patch_counts)
-
-        # Get dimensions from first patch
-        C, H, W = processed[0][0].shape
+        patch_counts = [len(patches) for patches in processed]
+        max_patches = max(patch_counts)
+
+        if max_patches == 0:
+            raise ValueError("All images produced zero patches")
+
+        # Get dimensions from the first non-empty patch list
+        first_patch = next((patches[0] for patches in processed if patches), None)
+        if first_patch is None:
+            raise ValueError("No valid patches found in processed images")
+        C, H, W = first_patch.shape

This avoids hard failures on edge cases where some or all images contribute no patches.


187-190: Use encoded.shape[0] as the batch size source in _process_flat_images.

_process_flat_images currently takes num_images from the caller and uses it to size attention_mask and patch_counts. If a processor ever returns a different number of encoded images than the original images list (e.g., filtering, de-duplication), this will cause shape or length mismatches.

You can make the function self-consistent by deriving the batch size from encoded and dropping the num_images argument:

-            else:
-                encoded, attention_mask, metadata = self._process_flat_images(
-                    processed,  # type: ignore[arg-type]
-                    len(images),
-                )
+            else:
+                encoded, attention_mask, metadata = self._process_flat_images(
+                    processed,  # type: ignore[arg-type]
+                )

And in _process_flat_images:

-    def _process_flat_images(
-        self, processed: list[NumpyArray], num_images: int
-    ) -> tuple[NumpyArray, NumpyArray, dict[str, Any]]:
+    def _process_flat_images(
+        self, processed: list[NumpyArray]
+    ) -> tuple[NumpyArray, NumpyArray, dict[str, Any]]:
@@
-        encoded = np.array(processed)
+        encoded = np.array(processed)
+        batch_size = encoded.shape[0]
@@
-        if len(encoded.shape) == 5:
-            # 5D tensor: attention_mask shape is (batch, num_patches)
-            attention_mask = np.ones((num_images, encoded.shape[1]), dtype=np.int64)
-            metadata = {"patch_counts": [encoded.shape[1]] * num_images}
+        if len(encoded.shape) == 5:
+            # 5D tensor: attention_mask shape is (batch, num_patches)
+            attention_mask = np.ones((batch_size, encoded.shape[1]), dtype=np.int64)
+            metadata = {"patch_counts": [encoded.shape[1]] * batch_size}
         else:
-            # 4D tensor: attention_mask shape is (batch, 1)
-            attention_mask = np.ones((num_images, 1), dtype=np.int64)
-            metadata = {"patch_counts": [1] * num_images}
+            # 4D tensor: attention_mask shape is (batch, 1)
+            attention_mask = np.ones((batch_size, 1), dtype=np.int64)
+            metadata = {"patch_counts": [1] * batch_size}

This keeps attention masks and metadata consistent with the actual encoded tensor regardless of how the processor behaves.

Also applies to: 258-275

🧹 Nitpick comments (2)
tests/test_late_interaction_multimodal.py (1)

24-34: Canonical values for ColModernVBERT look consistent; please keep derivation reproducible.

The added CANONICAL_IMAGE_VALUES and CANONICAL_QUERY_VALUES entries for "Qdrant/colmodernvbert" integrate cleanly with the existing tests (shapes match the token×dim slicing logic used below). To keep these stable over time and aid future debugging, it would be helpful to either:

  • add a short note or link to the script/notebook used to generate these canonical values, or
  • include the missing “how canonical values were computed” snippet in the test docs/checklist.

This isn’t blocking, but it will make future updates to the model/tests much easier to validate.

Also applies to: 49-56

fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)

179-191: Propagate preprocessed attention_mask into OnnxOutputContext for image path.

After _preprocess_onnx_image_input, you always pass the original attention_mask into OnnxOutputContext. If a subclass overrides _preprocess_onnx_image_input and adjusts or replaces the mask, the context will become inconsistent with what was actually fed to the model.

You can mirror the text path by reading from onnx_input first:

-        onnx_input = {"pixel_values": encoded, "attention_mask": attention_mask}
-        onnx_input = self._preprocess_onnx_image_input(onnx_input, **kwargs)
-        model_output = self.model.run(None, onnx_input)  # type: ignore[union-attr]
-
-        return OnnxOutputContext(
-            model_output=model_output[0],
-            attention_mask=attention_mask,  # type: ignore[arg-type]
-            metadata=metadata,
-        )
+        onnx_input = {"pixel_values": encoded, "attention_mask": attention_mask}
+        onnx_input = self._preprocess_onnx_image_input(onnx_input, **kwargs)
+        model_output = self.model.run(None, onnx_input)  # type: ignore[union-attr]
+
+        return OnnxOutputContext(
+            model_output=model_output[0],
+            attention_mask=onnx_input.get("attention_mask", attention_mask),  # type: ignore[arg-type]
+            metadata=metadata,
+        )

This keeps the context aligned with any preprocessing logic while remaining backward compatible.

Also applies to: 196-200

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e0acfe7 and f4b17fa.

📒 Files selected for processing (2)
  • fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1 hunks)
  • tests/test_late_interaction_multimodal.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_late_interaction_multimodal.py (1)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (3)
  • LateInteractionMultimodalEmbedding (15-189)
  • get_embedding_size (98-122)
  • embedding_size (91-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (2)
tests/test_late_interaction_multimodal.py (1)

112-113: Embedding size tests for ColModernVBERT match registry expectations.

The new assertions for "Qdrant/colmodernvbert" in both test_get_embedding_size and test_embedding_size correctly expect a dim of 128 and are aligned with the registry-based lookup logic. No issues here.

Also applies to: 127-129

fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)

277-299: Patch-dimension detection via ONNX input shape looks sound.

_needs_patch_dimension cleanly inspects the pixel_values input shape to distinguish 4D vs 5D models and defaults to False when the model is not yet loaded or the input is missing. This matches the intended behavior for ColModernVBERT (5D) vs earlier 4D models and should be robust against dynamic dimension names.
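
That probe can be sketched with the onnxruntime API (InferenceSession.get_inputs() returns the input signatures, including symbolic dims); the function name is illustrative.

import onnxruntime as ort

def needs_patch_dimension(session: ort.InferenceSession) -> bool:
    # A 5D pixel_values input (batch, patches, C, H, W) signals a patch-aware
    # model such as ColModernVBERT; 4D models take flat (batch, C, H, W).
    for node in session.get_inputs():
        if node.name == "pixel_values":
            return len(node.shape) == 5
    return False  # input missing or model not loaded: default to flat handling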


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (6)
fastembed/late_interaction_multimodal/colmodernvbert.py (6)

37-41: Docstring references original model, but code uses Qdrant ONNX version.

The docstring mentions ModernVBERT/colmodernvbert and links to that HuggingFace page, but the actual model identifier in supported_colmodernvbert_models is Qdrant/colmodernvbert. Consider clarifying this distinction.


107-108: Remove unused instance attributes mask_token_id and pad_token_id.

These attributes are initialized but never assigned meaningful values or used anywhere in the class. This was previously flagged.


159-166: Incorrect docstring: describes post-processing but method preprocesses input.

The docstring says "Post-process the ONNX model output" but this method actually preprocesses text input by adding empty image placeholders.


168-172: Guard against None value for self.image_size.

With lazy_load=True, self.image_size remains None until load_onnx_model() is called, causing a runtime error when creating the placeholder array. This was previously flagged.


236-238: Add guard for uninitialized tokenizer.

self.tokenizer could be None if the model isn't loaded. This was previously flagged.


285-314: Guard against None value for self.image_seq_len.

self.image_seq_len could be None if the model isn't loaded, causing TypeError on string multiplication. This was previously flagged.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f4b17fa and 94366ea.

📒 Files selected for processing (2)
  • fastembed/late_interaction_multimodal/colmodernvbert.py
  • tests/test_late_interaction_multimodal.py
🧰 Additional context used
🪛 Ruff (0.14.10)
fastembed/late_interaction_multimodal/colmodernvbert.py

46-46: Possible hardcoded password assigned to: "QUERY_AUGMENTATION_TOKEN"

(S105)


157-157: Unused method argument: kwargs

(ARG002)


190-190: Unused method argument: kwargs

(ARG002)


197-197: Unused method argument: kwargs

(ARG002)

🔇 Additional comments (11)
fastembed/late_interaction_multimodal/colmodernvbert.py (7)

21-32: Model description looks well-structured.

The model descriptor includes all necessary fields and the additional file processor_config.json is correctly specified for the image processing configuration.


244-258: Padding logic correctly handles both left and right directions.

The implementation properly places tokens based on padding_direction and constructs appropriate attention masks.


261-283: Grid computation logic handles edge cases appropriately.

The method correctly handles single patches (patch_count <= 1), attempts square grid detection first, and falls back to factor enumeration for non-square grids. The final fallback to (0, 0) ensures graceful degradation.
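
A hypothetical reconstruction of that flow, assuming patch_count excludes the global image; the PR's actual factor enumeration may differ.

import math

def compute_grid_dims(patch_count: int) -> tuple[int, int]:
    if patch_count <= 1:
        return (0, 0)  # single patch: no split grid
    root = math.isqrt(patch_count)
    if root * root == patch_count:
        return (root, root)  # square grid detected
    for rows in range(root, 1, -1):  # fall back to the most square factor pair
        if patch_count % rows == 0:
            return (rows, patch_count // rows)
    return (0, 0)  # graceful degradation, e.g. prime patch counts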


316-331: Prompt construction and tokenization logic is correct.

The method correctly selects the appropriate prompt format based on grid dimensions and properly tokenizes the expanded prompt.


333-349: Post-processing correctly reshapes output to expected dimensions.

The assertion guards against misconfigured models, and the reshape operation produces the expected (batch, n, dim) output format.
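
A hedged sketch of that reshape (the real method may read the dimension from the model config rather than a parameter).

import numpy as np

def post_process(model_output: np.ndarray, embedding_dim: int = 128) -> np.ndarray:
    # Flattened ONNX output (batch, n * dim) -> late-interaction (batch, n, dim).
    assert model_output.shape[-1] % embedding_dim == 0, "width must be a multiple of dim"
    return model_output.reshape(model_output.shape[0], -1, embedding_dim)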


351-421: Embedding methods correctly delegate to base class utilities.

Both embed_text and embed_image properly pass through all configuration options and use yield from for efficient lazy iteration.


432-449: Worker classes are correctly implemented.

Both ColModernVBERTTextEmbeddingWorker and ColModernVBERTImageEmbeddingWorker properly initialize the ColModernVBERT model with threads=1 for parallel processing scenarios.

tests/test_late_interaction_multimodal.py (4)

24-34: Canonical image embedding values added for new model.

The test data follows the established pattern with abridged and rounded values. Note that the PR checklist indicates the code snippet showing how these canonical values were computed was not added - consider adding this for reproducibility.


49-59: Canonical query embedding values added for new model.

The test data is consistent with the existing format.


115-117: Embedding size test added for new model.

The assertion correctly validates the 128-dimensional embedding size matching the model description.


130-133: Lazy-load embedding size test added for new model.

The test correctly validates that the embedding size is accessible even with lazy_load=True.

Comment on lines +142 to +147
        preprocessor_config_path = self._model_dir / "preprocessor_config.json"
        with open(preprocessor_config_path) as f:
            preprocessor_config = json.load(f)
            self.max_image_size = preprocessor_config.get("max_image_size", {}).get(
                "longest_edge", 512
            )

⚠️ Potential issue | 🟡 Minor

Potential AttributeError if max_image_size is not a dict.

If preprocessor_config.get("max_image_size", {}) returns a non-dict value (e.g., an integer), the subsequent .get("longest_edge", 512) call will fail with an AttributeError.

🔎 Suggested defensive handling
         preprocessor_config_path = self._model_dir / "preprocessor_config.json"
         with open(preprocessor_config_path) as f:
             preprocessor_config = json.load(f)
-            self.max_image_size = preprocessor_config.get("max_image_size", {}).get(
-                "longest_edge", 512
-            )
+            max_image_size_config = preprocessor_config.get("max_image_size", {})
+            if isinstance(max_image_size_config, dict):
+                self.max_image_size = max_image_size_config.get("longest_edge", 512)
+            else:
+                self.max_image_size = max_image_size_config if isinstance(max_image_size_config, int) else 512
🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/colmodernvbert.py around lines 142 to
147, the code assumes preprocessor_config.get("max_image_size", {}) returns a
dict and calls .get("longest_edge", 512), which will raise AttributeError if
it's an int or other non-dict; change this to read the value into a variable,
check isinstance(value, dict) and if so use value.get("longest_edge", 512), elif
isinstance(value, int) use the int directly, otherwise fall back to 512, and
assign that result to self.max_image_size so the code is robust to different
config shapes.

@kacperlukawski kacperlukawski force-pushed the feature/colmodernvbert-support branch from 94366ea to 5a6d884 on January 2, 2026 20:11

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (7)
fastembed/late_interaction_multimodal/colmodernvbert.py (6)

35-46: Class docstring references original model but deployed model differs.

The docstring references ModernVBERT/colmodernvbert but the actual model identifier is Qdrant/colmodernvbert (line 23). Consider clarifying that this is the ONNX-converted version.


107-108: Remove unused instance attributes mask_token_id and pad_token_id.

These attributes are initialized to None but never assigned values or used anywhere in the class. The tokenizer padding is accessed directly via self.tokenizer.padding["pad_id"] at line 238. These are dead code.


142-147: Potential AttributeError if max_image_size is not a dict.

If preprocessor_config.get("max_image_size", {}) returns a non-dict value (e.g., an integer), the subsequent .get("longest_edge", 512) call will fail.

🔎 Proposed defensive handling
         preprocessor_config_path = self._model_dir / "preprocessor_config.json"
         with open(preprocessor_config_path) as f:
             preprocessor_config = json.load(f)
-            self.max_image_size = preprocessor_config.get("max_image_size", {}).get(
-                "longest_edge", 512
-            )
+            max_image_size_config = preprocessor_config.get("max_image_size", {})
+            if isinstance(max_image_size_config, dict):
+                self.max_image_size = max_image_size_config.get("longest_edge", 512)
+            else:
+                self.max_image_size = max_image_size_config if isinstance(max_image_size_config, int) else 512

156-173: Docstring describes post-processing but method preprocesses input; also missing None guard for self.image_size.

Two issues:

  1. The docstring incorrectly describes post-processing output when this method preprocesses text input by adding image placeholders.
  2. self.image_size is Optional[int] and only set during load_onnx_model(). With lazy_load=True, using it before model load will produce an invalid array shape.
🔎 Proposed fix
     def _preprocess_onnx_text_input(
         self, onnx_input: dict[str, NumpyArray], **kwargs: Any
     ) -> dict[str, NumpyArray]:
         """
-        Post-process the ONNX model output to convert it into a usable format.
+        Preprocess text input by adding empty image placeholders for the ONNX model.

         Args:
-            output (OnnxOutputContext): The raw output from the ONNX model.
+            onnx_input (dict[str, NumpyArray]): The text input dictionary for the ONNX model.

         Returns:
-            Iterable[NumpyArray]: Post-processed output as NumPy arrays.
+            dict[str, NumpyArray]: Updated input dictionary with empty image placeholders.
         """
+        if self.image_size is None:
+            raise RuntimeError(
+                "Model not loaded. Call load_onnx_model() first or initialize with lazy_load=False"
+            )
         batch_size, seq_length = onnx_input["input_ids"].shape

236-238: Add guard for uninitialized tokenizer.

self.tokenizer could be None if the model isn't fully loaded. The type: ignore comments suppress the error but don't prevent a runtime TypeError.

🔎 Proposed fix
+        if self.tokenizer is None or self.tokenizer.padding is None:
+            raise RuntimeError("Tokenizer not initialized. Ensure model is loaded.")
+
         # Get padding config from tokenizer
-        padding_direction = self.tokenizer.padding["direction"]  # type: ignore[index,union-attr]
-        pad_token_id = self.tokenizer.padding["pad_id"]  # type: ignore[index,union-attr]
+        padding_direction = self.tokenizer.padding["direction"]
+        pad_token_id = self.tokenizer.padding["pad_id"]

285-314: Guard against None value for self.image_seq_len.

self.image_seq_len could be None if the model isn't loaded, causing string multiplication ("<image>" * self.image_seq_len) to fail with TypeError.

🔎 Proposed fix

Add a runtime check at the start of methods that use image_seq_len:

if self.image_seq_len is None:
    raise RuntimeError("Model not loaded. Call load_onnx_model() first.")
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)

217-237: Potential IndexError if an image produces zero patches.

The past review comment about this issue was not fully addressed. If processed[0] is an empty list, accessing processed[0][0].shape at line 221 will raise an IndexError. Additionally, if processed itself is empty, max(patch_counts) raises a ValueError; if all images produce zero patches, max_patches is 0 and the dimension extraction will fail.

🔎 Proposed defensive handling
     def _process_nested_patches(
         self, processed: list[list[NumpyArray]]
     ) -> tuple[NumpyArray, NumpyArray, dict[str, Any]]:
         ...
         patch_counts = [len(patches) for patches in processed]
         max_patches = max(patch_counts)
+        
+        if max_patches == 0:
+            raise ValueError("All images produced zero patches")

         # Get dimensions from first patch
-        C, H, W = processed[0][0].shape
+        # Find first non-empty patch list to get dimensions
+        first_patch = next(
+            (patches[0] for patches in processed if len(patches) > 0), None
+        )
+        if first_patch is None:
+            raise ValueError("No valid patches found in processed images")
+        C, H, W = first_patch.shape
         batch_size = len(processed)
🧹 Nitpick comments (3)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

16-19: Registry extended to include ColModernVBERT.

The EMBEDDINGS_REGISTRY now includes both ColPali and ColModernVBERT, enabling model discovery and instantiation via the unified interface.

Per the static analysis hint (RUF012), consider annotating mutable class attributes with typing.ClassVar for clarity:

EMBEDDINGS_REGISTRY: ClassVar[list[Type[LateInteractionMultimodalEmbeddingBase]]] = [...]

This is a minor improvement and not blocking.

fastembed/image/transform/functional.py (1)

192-221: Consider validating the normalization assumption for float inputs.

For float32/float64 dtypes, the code assumes values are normalized to the [0, 1] range (line 206). If values fall outside this range, the uint8 conversion will overflow and wrap around rather than clip, silently corrupting pixel values.

Consider adding validation:

🔎 Add range validation for float inputs
     # Handle different dtypes
     if img_hwc.dtype == np.float32 or img_hwc.dtype == np.float64:
-        # Assume normalized, scale to 0-255 for PIL
+        # Validate normalized range and scale to 0-255 for PIL
+        if img_hwc.min() < 0 or img_hwc.max() > 1:
+            raise ValueError(
+                f"Float images must be normalized to [0, 1] range, got [{img_hwc.min()}, {img_hwc.max()}]"
+            )
         img_hwc_scaled = (img_hwc * 255).astype(np.uint8)
fastembed/image/transform/operators.py (1)

386-404: Optional: Extract error message to exception class.

Static analysis suggests moving the long error message to a custom exception class for better maintainability.

🔎 Refactor suggestion

Define a custom exception:

class InvalidImageProcessorConfigError(ValueError):
    def __init__(self, processor_type: str, missing_key: str):
        super().__init__(
            f"Size dictionary must contain '{missing_key}' key for {processor_type}"
        )

Then use it:

                 if "longest_edge" not in size:
-                    raise ValueError(
-                        "Size dictionary must contain 'longest_edge' key for Idefics3ImageProcessor"
-                    )
+                    raise InvalidImageProcessorConfigError("Idefics3ImageProcessor", "longest_edge")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94366ea and 5a6d884.

📒 Files selected for processing (8)
  • fastembed/common/onnx_model.py
  • fastembed/common/preprocessor_utils.py
  • fastembed/image/transform/functional.py
  • fastembed/image/transform/operators.py
  • fastembed/late_interaction_multimodal/colmodernvbert.py
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
  • fastembed/late_interaction_multimodal/onnx_multimodal_model.py
  • tests/test_late_interaction_multimodal.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • fastembed/common/preprocessor_utils.py
  • fastembed/common/onnx_model.py
🧰 Additional context used
🧬 Code graph analysis (5)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
fastembed/late_interaction_multimodal/colmodernvbert.py (1)
  • ColModernVBERT (35-429)
tests/test_late_interaction_multimodal.py (1)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2)
  • get_embedding_size (98-122)
  • embedding_size (91-95)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (3)
fastembed/late_interaction_multimodal/colmodernvbert.py (1)
  • _preprocess_onnx_image_input (196-259)
fastembed/late_interaction_multimodal/colpali.py (1)
  • _preprocess_onnx_image_input (210-227)
fastembed/common/onnx_model.py (1)
  • OnnxOutputContext (20-24)
fastembed/image/transform/operators.py (2)
fastembed/image/transform/functional.py (5)
  • crop_ndarray (176-189)
  • normalize (63-92)
  • resize (95-111)
  • resize_longest_edge (150-173)
  • resize_ndarray (192-221)
fastembed/common/utils.py (1)
  • normalize (18-23)
fastembed/late_interaction_multimodal/colmodernvbert.py (6)
fastembed/common/model_description.py (2)
  • DenseModelDescription (35-40)
  • ModelSource (7-20)
fastembed/common/onnx_model.py (2)
  • OnnxOutputContext (20-24)
  • _select_exposed_session_options (128-137)
fastembed/common/utils.py (1)
  • define_cache_dir (48-59)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • LateInteractionMultimodalEmbeddingBase (10-86)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (4)
  • OnnxMultimodalModel (20-356)
  • load_onnx_model (82-83)
  • _load_onnx_model (59-80)
  • tokenize (85-86)
fastembed/common/model_management.py (2)
  • _get_model_description (67-84)
  • download_model (372-471)
🪛 GitHub Actions: Tests
fastembed/image/transform/functional.py

[error] 153-153: NameError: name 'Union' is not defined. Possibly missing 'from typing import Union' in the module.

🪛 GitHub Actions: type-checkers
fastembed/image/transform/functional.py

[error] 153-153: Name 'Union' is not defined. Did you forget to import it from 'typing'?


[error] 195-195: Name 'Union' is not defined. Did you forget to import it from 'typing'?

fastembed/image/transform/operators.py

[error] 45-45: Name 'Union' is not defined. Did you forget to import it from 'typing'?


[error] 46-46: Name 'Union' is not defined. Did you forget to import it from 'typing'?


[error] 76-76: Name 'Union' is not defined. Did you forget to import it from 'typing'?


[error] 77-77: Name 'Union' is not defined. Did you forget to import it from 'typing'?

🪛 Ruff (0.14.10)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py

16-19: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

fastembed/image/transform/functional.py

153-153: Undefined name Union

(F821)


195-195: Undefined name Union

(F821)

fastembed/image/transform/operators.py

45-45: Undefined name Union

(F821)


46-46: Undefined name Union

(F821)


76-76: Undefined name Union

(F821)


77-77: Undefined name Union

(F821)


390-392: Avoid specifying long messages outside the exception class

(TRY003)


429-429: Avoid specifying long messages outside the exception class

(TRY003)


432-432: Unused static method argument: config

(ARG004)

fastembed/late_interaction_multimodal/colmodernvbert.py

46-46: Possible hardcoded password assigned to: "QUERY_AUGMENTATION_TOKEN"

(S105)


157-157: Unused method argument: kwargs

(ARG002)


190-190: Unused method argument: kwargs

(ARG002)


197-197: Unused method argument: kwargs

(ARG002)

🔇 Additional comments (19)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (3)

179-200: Refactored image embedding flow with proper dispatch logic looks good.

The dispatch between nested patches and flat images based on isinstance(processed[0], list) is clear and appropriate for distinguishing ColModernVBERT's patch-based processing from standard processors.
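
For reference, a minimal sketch of this dispatch; _process_flat_images is a hypothetical stand-in for the flat-path helper:

def _embed_images_dispatch(self, processed):
    # Nested structure: ColModernVBERT-style list of patch lists per image
    if processed and isinstance(processed[0], list):
        return self._process_nested_patches(processed)
    # Flat structure: one array per image, as standard processors emit
    return self._process_flat_images(processed)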


239-275: Flat image processing handles both 4D and 5D tensor shapes correctly.

The logic to conditionally add a patch dimension based on ONNX model signature is sound. The attention mask and metadata shapes are derived consistently from the encoded tensor dimensions.


277-299: ONNX signature inspection for patch dimension detection is well-implemented.

Checking len(input_meta.shape) == 5 for the pixel_values input is a reliable way to determine model expectations. The fallback to False ensures backward compatibility.
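
For illustration, a minimal sketch of such a check, assuming an onnxruntime.InferenceSession is available (the function name is hypothetical):

def expects_patch_dimension(session) -> bool:
    for input_meta in session.get_inputs():
        if input_meta.name == "pixel_values":
            # A 5-D pixel_values input implies (batch, patches, C, H, W)
            return len(input_meta.shape) == 5
    return False  # default preserves backward compatibility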

tests/test_late_interaction_multimodal.py (4)

24-34: Canonical image values for ColModernVBERT added correctly.

The test data follows the same structure and precision as the existing ColPali values.


49-59: Canonical query values for ColModernVBERT added correctly.

Consistent with the existing test pattern for query embeddings.


115-116: Embedding size test for ColModernVBERT.

The expected embedding size of 128 matches the model description defined in colmodernvbert.py.


130-132: Embedding size property test for ColModernVBERT with lazy_load.

Good addition to ensure the embedding_size property works correctly with lazy loading for the new model.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

7-7: Import for ColModernVBERT added correctly.

The import follows the existing pattern for ColPali.

fastembed/late_interaction_multimodal/colmodernvbert.py (6)

21-32: Model description for ColModernVBERT is well-defined.

The model registry entry includes all required fields and correctly specifies the HuggingFace source.


261-283: Patch grid computation logic is correct.

The algorithm to compute rows/cols from patch count handles both square and non-square grids, with appropriate fallback to unsplit (0, 0) when computation fails.
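
For reference, a minimal sketch of this rows/cols recovery, assuming the patch count encodes a grid of patches plus one appended global view (the function name is hypothetical):

import math

def compute_grid_dims(num_patches: int) -> tuple[int, int]:
    if num_patches <= 1:
        return 0, 0  # single patch: image was not split
    grid = num_patches - 1  # drop the appended global view
    side = math.isqrt(grid)
    if side * side == grid:
        return side, side  # square grid
    for rows in range(side, 1, -1):  # non-square fallback: factor search
        if grid % rows == 0:
            return rows, grid // rows
    return 0, 0  # computation failed: treat as unsplit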


316-331: Input ID building logic for image prompts is well-structured.

The method correctly dispatches between single and split image prompt strings and tokenizes the expanded prompt.


333-349: Image output post-processing correctly reshapes to expected dimensions.

The reshape using model_description.dim ensures consistent embedding output format.


351-421: embed_text and embed_image methods delegate correctly to base implementation.

Both methods properly pass all configuration parameters including providers, device settings, and session options.


423-449: Worker classes correctly instantiate ColModernVBERT with single thread.

The worker implementations follow the established pattern for parallel processing with threads=1.

fastembed/image/transform/operators.py (5)

44-55: LGTM: Clean backward-compatible nested list support.

The modifications to Normalize and Rescale elegantly handle both flat and nested list structures for patch-based processing while maintaining backward compatibility.

Also applies to: 75-86


124-167: Verify that significant aspect ratio changes are acceptable.

The implementation rounds both width and height up to the nearest multiple of max_size independently (lines 149-156), which can significantly alter the aspect ratio. For example, a 1000×700 image with max_size=512 becomes 1024×1024 (aspect ratio changes from ~1.43 to 1.0).

While the docstring mentions "approximately" preserving aspect ratio, please confirm this behavior aligns with the ColModernVBERT model's requirements.
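
To make the distortion concrete, a small sketch of the independent round-up (the helper name is hypothetical):

import math

def round_up_to_multiple(value: int, max_size: int) -> int:
    return math.ceil(value / max_size) * max_size

# 1000x700 with max_size=512 -> 1024x1024, shifting the aspect ratio to 1.0
assert round_up_to_multiple(1000, 512) == 1024
assert round_up_to_multiple(700, 512) == 1024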


170-242: LGTM: Robust image splitting logic.

The ImageSplitter implementation correctly handles grid-based patch generation with optimal patch sizing and appropriately includes a global view only when splitting is necessary.
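
For reference, a minimal sketch of this splitting scheme; the resize of the global view is omitted and names are hypothetical:

import math
import numpy as np

def split_image(img: np.ndarray, max_size: int) -> list[np.ndarray]:
    _, h, w = img.shape  # (C, H, W)
    if h <= max_size and w <= max_size:
        return [img]  # small image: returned unchanged, no global view
    patches = [
        img[:, r * max_size : (r + 1) * max_size, c * max_size : (c + 1) * max_size]
        for r in range(math.ceil(h / max_size))
        for c in range(math.ceil(w / max_size))
    ]
    # The real transform appends a global view resized to max_size x max_size;
    # ResizeForVisionEncoder upstream guarantees uniform patch shapes.
    patches.append(img)
    return patches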


245-267: LGTM: Consistent return structure with ImageSplitter.

The SquareResize class correctly returns list[list[NumpyArray]] to maintain API consistency with ImageSplitter, enabling seamless switching between splitting modes.


435-455: LGTM: Well-structured conditional image splitting logic.

The _get_image_splitting method cleanly orchestrates the choice between patch-based splitting and square resizing based on configuration. The docstring appropriately documents the ordering dependency with PILtoNDarray.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
fastembed/image/transform/operators.py (1)

170-242: Implicit dependency on ResizeForVisionEncoder for uniform patches.

The patch extraction logic produces uniform max_size × max_size patches only when input dimensions are exact multiples of max_size—which is guaranteed by ResizeForVisionEncoder in the pipeline. If ImageSplitter is used standalone on arbitrary inputs, patches may have varying dimensions.

Consider adding a note in the docstring about this prerequisite, or adding a runtime assertion/warning.

🔎 Optional: Add docstring clarification
 class ImageSplitter(Transform):
     """
     Split images into grid of patches plus a global view.
 
     If image dimensions exceed max_size:
     - Divide into ceil(H/max_size) x ceil(W/max_size) patches
     - Each patch is cropped from the image
     - Add a global view (original resized to max_size x max_size)
 
     If image is smaller than max_size:
     - Return single image unchanged
 
     Works on numpy arrays in (C, H, W) format.
+
+    Note: For uniform max_size x max_size patches, input dimensions should
+    be multiples of max_size. Use ResizeForVisionEncoder before this transform.
     """
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a6d884 and 2dc7de8.

📒 Files selected for processing (2)
  • fastembed/image/transform/functional.py
  • fastembed/image/transform/operators.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • fastembed/image/transform/functional.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T10:48:30.978Z
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.

Applied to files:

  • fastembed/image/transform/operators.py
🧬 Code graph analysis (1)
fastembed/image/transform/operators.py (2)
fastembed/image/transform/functional.py (9)
  • center_crop (15-60)
  • convert_to_rgb (7-12)
  • crop_ndarray (176-189)
  • normalize (63-92)
  • pil2ndarray (118-121)
  • rescale (114-115)
  • resize (95-111)
  • resize_longest_edge (150-173)
  • resize_ndarray (192-221)
fastembed/common/utils.py (1)
  • normalize (18-23)
🪛 Ruff (0.14.10)
fastembed/image/transform/operators.py

390-392: Avoid specifying long messages outside the exception class

(TRY003)


429-429: Avoid specifying long messages outside the exception class

(TRY003)


432-432: Unused static method argument: config

(ARG004)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (11)
fastembed/image/transform/operators.py (11)

1-17: LGTM!

The new imports (math, crop_ndarray, resize_longest_edge, resize_ndarray) are correctly added to support the new transform classes.


44-55: LGTM!

The polymorphic handling of flat vs nested image structures is implemented correctly. The images and isinstance(images[0], list) check safely handles empty lists via short-circuit evaluation.


75-86: LGTM!

Consistent with the Normalize implementation, correctly handling both flat and nested structures.


109-121: LGTM!

Clean wrapper around the resize_longest_edge functional helper, following the established Transform pattern.


139-167: LGTM!

The resize logic correctly ensures both dimensions become multiples of max_size. The aspect ratio is "approximately" preserved as documented, since independent ceiling operations on each dimension can introduce slight distortion—this is acceptable for vision encoder compatibility.


245-267: LGTM!

The singleton-list output format (list[list[NumpyArray]]) correctly matches ImageSplitter's output shape, enabling uniform downstream handling regardless of which path is taken.


306-306: LGTM!

Documentation correctly updated to reflect the new longest_edge size key support.


317-317: LGTM!

Correct pipeline ordering: _get_image_splitting is called after _get_pil2ndarray (requires numpy arrays) and before _get_rescale/_get_normalize.


386-404: LGTM!

The Idefics3ImageProcessor handling follows the established pattern. The resample int-to-enum conversion correctly handles cases where configuration provides the numeric value instead of the enum.


426-427: LGTM!

No-op for center crop is correct; Idefics3ImageProcessor uses image splitting instead.


435-455: LGTM!

The conditional transform wiring is correct:

  • Splitting enabled: ResizeForVisionEncoderImageSplitter (ensures proper grid extraction)
  • Splitting disabled: SquareResize (simple square output)

The nested dict access with chained .get() calls provides safe fallbacks.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_late_interaction_multimodal.py (1)

92-95: Unused loop variable name.

As flagged by the linter, the loop variable name is not used within the loop body.

♻️ Suggested fix
     if is_ci:
-        for name, model in cache.items():
+        for _, model in cache.items():
             delete_model_cache(model.model._model_dir)
     cache.clear()

Alternatively, use .values():

     if is_ci:
-        for name, model in cache.items():
-            delete_model_cache(model.model._model_dir)
+        for model in cache.values():
+            delete_model_cache(model.model._model_dir)
     cache.clear()
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2dc7de8 and 46453a2.

📒 Files selected for processing (4)
  • fastembed/image/onnx_image_model.py
  • fastembed/late_interaction_multimodal/colmodernvbert.py
  • fastembed/late_interaction_multimodal/onnx_multimodal_model.py
  • tests/test_late_interaction_multimodal.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_late_interaction_multimodal.py (2)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (6)
  • LateInteractionMultimodalEmbedding (15-189)
  • embed_image (147-168)
  • embed_text (124-145)
  • get_embedding_size (98-122)
  • embedding_size (91-95)
  • token_count (170-189)
tests/utils.py (1)
  • delete_model_cache (11-39)
🪛 Ruff (0.14.10)
tests/test_late_interaction_multimodal.py

93-93: Loop control variable name not used within loop body

Rename unused name to _name

(B007)

fastembed/late_interaction_multimodal/colmodernvbert.py

46-46: Possible hardcoded password assigned to: "QUERY_AUGMENTATION_TOKEN"

(S105)


157-157: Unused method argument: kwargs

(ARG002)


191-191: Unused method argument: kwargs

(ARG002)


202-202: Unused method argument: kwargs

(ARG002)


215-215: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
🔇 Additional comments (17)
fastembed/image/onnx_image_model.py (1)

78-92: LGTM! Good resource management with ExitStack.

The ExitStack properly manages image file handles, ensuring they're closed after processing. The image processing occurs within the context block before the handles are released, which is correct.
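
For reference, a minimal sketch of the ExitStack pattern, where process() is a hypothetical stand-in for the actual preprocessing step:

from contextlib import ExitStack
from PIL import Image

def load_and_process(images):
    with ExitStack() as stack:
        opened = [
            img if isinstance(img, Image.Image)
            else stack.enter_context(Image.open(img))
            for img in images
        ]
        return process(opened)  # path-backed handles are closed on exit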

tests/test_late_interaction_multimodal.py (1)

151-163: Good test coverage for token counting.

The test properly validates that:

  1. Individual token counts sum correctly
  2. Batching doesn't affect total count
  3. Extension tokens are counted when include_extension=True
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (4)

172-202: LGTM! Clean dispatch logic for nested vs flat image processing.

The method correctly:

  1. Uses ExitStack for resource management
  2. Dispatches based on processor output structure
  3. Returns a complete OnnxOutputContext with attention mask and metadata

204-239: Well-structured patch processing with proper attention masking.

The method correctly handles variable patch counts per image by:

  1. Finding max patches for batch padding
  2. Zero-padding shorter sequences
  3. Using attention mask to distinguish real patches from padding

241-277: LGTM! Proper handling of 4D/5D tensor requirements.

The method adapts to model input requirements by checking the ONNX signature and appropriately reshaping tensors for backward compatibility.


279-301: LGTM! Good defensive programming.

The method safely handles uninitialized models and defaults to False for backward compatibility.

fastembed/late_interaction_multimodal/colmodernvbert.py (11)

35-46: LGTM! Clean class structure.

The class constants are appropriately defined. The static analysis warning about QUERY_AUGMENTATION_TOKEN being a "hardcoded password" is a false positive—it's clearly a prompt formatting token, not a credential.


48-114: LGTM! Comprehensive initialization with good device handling.

The constructor properly handles:

  • Lazy loading for parallel processing scenarios
  • Device ID fallback (explicit → first from list → None)
  • Model download and configuration setup

191-195: LGTM! Query augmentation follows colpali-engine pattern.

The 10 augmentation tokens match the upstream process_queries logic as noted in the comment. The unused **kwargs parameter maintains API consistency with the base class.
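
For illustration, a minimal sketch of this augmentation; the token value is an assumption, only the count of 10 comes from the upstream logic:

QUERY_AUGMENTATION_TOKEN = "<pad>"  # assumed value; the real token may differ

def augment_query(query: str) -> str:
    # Appends 10 augmentation tokens that act as soft query-expansion slots
    return query + QUERY_AUGMENTATION_TOKEN * 10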


197-212: LGTM! Clean token counting implementation.

The method properly:

  1. Ensures the model/tokenizer is loaded before counting
  2. Supports both single strings and iterables
  3. Uses appropriate tokenization based on include_extension flag

214-277: LGTM! Well-implemented dynamic input construction.

The method correctly:

  1. Counts real patches from attention mask
  2. Computes grid dimensions from patch count
  3. Builds input_ids with appropriate padding direction from tokenizer config

279-302: LGTM! Correct grid dimension computation.

The method properly handles:

  • Single/no patches (returns 0, 0)
  • Square and near-square grids
  • Non-square grid fallback search

304-351: LGTM! Clean prompt construction following Idefics3 format.

The methods properly construct prompt strings with appropriate special tokens for both single and split image scenarios.


352-368: LGTM! Correct output reshaping for late-interaction format.

The method properly reshapes model output to the expected (batch, n, dim) format for late-interaction retrieval.
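
A minimal sketch of the reshape; dim=128 matches the embedding size asserted elsewhere in this review:

import numpy as np

def reshape_output(flat: np.ndarray, dim: int = 128) -> np.ndarray:
    batch = flat.shape[0]
    # Collapse any intermediate axes into n token vectors of width dim
    return flat.reshape(batch, -1, dim)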


370-440: LGTM! Clean delegation to base class methods.

Both embedding methods properly forward all configuration options to the underlying implementation.


451-467: LGTM! Standard worker implementations.

The workers correctly initialize the model with threads=1, which is appropriate for parallel processing where each worker handles a subset of the workload.


136-155: Add error handling for JSON config file loading.

Lines 137-154 open three JSON config files without error handling. While these files are guaranteed to exist (config.json and preprocessor_config.json are part of HuggingFace's standard model files, and processor_config.json is explicitly listed in additional_files), adding try/except blocks would improve debuggability if a file is corrupted or contains invalid JSON.
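
A minimal sketch of the suggested defensive loading (the helper name is hypothetical):

import json
from pathlib import Path

def load_json_config(path: Path) -> dict:
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError as e:
        raise RuntimeError(f"Missing model config file: {path}") from e
    except json.JSONDecodeError as e:
        raise RuntimeError(f"Invalid JSON in model config file: {path}") from e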

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @tests/test_late_interaction_multimodal.py:
- Around line 70-96: The cache key is lowercased but comparisons use the
original MODELS_TO_CACHE and the constructor is given the lowercased name; fix
by keeping MODELS_TO_CACHE entries lowercased or by comparing
lowercase_model_name against a lowercase set (e.g., compare lowercase_model_name
in {m.lower() for m in MODELS_TO_CACHE}), pass the original model_name (not
lowercase_model_name) into LateInteractionMultimodalEmbedding so the canonical
name is preserved, and replace the final cleanup loop that unpacks (name, model)
with iterating cache.values() to avoid the unused-variable warning (Ruff B007)
when calling delete_model_cache(model.model._model_dir).
🧹 Nitpick comments (3)
tests/test_late_interaction_multimodal.py (3)

2-10: Keep delete_model_cache usage resilient to internal API changes.

This test now depends on a private field chain (model_inst.model._model_dir). If the internal structure changes, CI cleanup will crash the fixture instead of degrading to best-effort cleanup. Consider guarding the access (or using a public cache-dir accessor if one exists).


25-35: Canonical vectors: consider pinning dtype for stability.

These arrays default to float64, while embedding outputs are typically float32. Pinning dtype=np.float32 can reduce subtle cross-platform numeric drift and memory use.

Also applies to: 50-60
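
For example (values here are illustrative, not the real canonical numbers):

import numpy as np

CANONICAL_IMAGE_VALUES = {
    "Qdrant/colmodernvbert": np.array([[0.015, -0.062]], dtype=np.float32),
}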


151-164: token_count test is solid; consider making CI behavior explicit.

If CI cleanup is important here, the fixture fix above is the key. Optionally, you could also make is_ci a real boolean (e.g., is_ci = os.getenv("CI") == "true"), since any non-empty value is truthy today.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 46453a2 and 8f6f057.

📒 Files selected for processing (1)
  • tests/test_late_interaction_multimodal.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_late_interaction_multimodal.py (6)
fastembed/image/image_embedding_base.py (2)
  • get_embedding_size (48-50)
  • embedding_size (53-55)
fastembed/late_interaction/late_interaction_embedding_base.py (3)
  • get_embedding_size (64-66)
  • embedding_size (69-71)
  • token_count (73-80)
fastembed/text/text_embedding_base.py (3)
  • get_embedding_size (64-66)
  • embedding_size (69-71)
  • token_count (73-75)
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)
  • token_count (212-224)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (131-143)
fastembed/sparse/minicoil.py (1)
  • token_count (192-195)
🪛 Ruff (0.14.10)
tests/test_late_interaction_multimodal.py

93-93: Loop control variable name not used within loop body

Rename unused name to _name

(B007)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (2)
tests/test_late_interaction_multimodal.py (2)

98-132: Updated tests use model_cache cleanly; ensure they benefit from caching after the fixture fix.

Once the cache-key/casing bug is fixed, these three tests should stop re-initializing the same model repeatedly (especially important in CI).


141-149: Embedding size assertions for Qdrant/colmodernvbert look good.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @tests/test_late_interaction_multimodal.py:
- Around line 70-71: _MODELS_TO_CACHE is fine but MODELS_TO_CACHE is currently a
generator expression which gets exhausted on first membership check; replace it
with a reusable sequence (e.g., a tuple or list) by materializing the generator
so repeated checks like the membership test against lowercase_model_name can
succeed (update the assignment of MODELS_TO_CACHE accordingly, keeping the
lowercase conversion logic).
🧹 Nitpick comments (2)
tests/test_late_interaction_multimodal.py (2)

74-96: Model caching fixture looks good.

The implementation correctly provides module-scoped model caching with CI-aware cleanup. The pattern aligns well with the existing test_text_cross_encoder.py implementation.

Optional improvement: Line 94 uses _ for the unused loop variable. If the key is meant to be used, give it a meaningful name:

for name, model in cache.items():
    delete_model_cache(model.model._model_dir)

However, _ is acceptable Python convention for intentionally unused variables.


146-149: Consider using model_cache for consistency.

While lazy_load=True minimizes resource usage, this test creates a model instance without going through the model_cache fixture. For consistency with other tests and proper CI cleanup, consider refactoring to use the caching pattern.

♻️ Suggested refactor
-def test_embedding_size():
+def test_embedding_size(model_cache):
     model_name = "Qdrant/colmodernvbert"
-    model = LateInteractionMultimodalEmbedding(model_name=model_name, lazy_load=True)
-    assert model.embedding_size == 128
+    with model_cache(model_name) as model:
+        assert model.embedding_size == 128
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f6f057 and 01965c9.

📒 Files selected for processing (1)
  • tests/test_late_interaction_multimodal.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_late_interaction_multimodal.py (2)
tests/utils.py (1)
  • delete_model_cache (11-39)
tests/test_text_cross_encoder.py (2)
  • model_cache (25-46)
  • get_model (30-39)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
🔇 Additional comments (5)
tests/test_late_interaction_multimodal.py (5)

2-2: LGTM - Imports support the caching infrastructure.

The new imports are necessary for the model caching fixture and CI cleanup functionality.

Also applies to: 10-10


25-35: Document the canonical value computation.

The PR objectives note that the code snippet demonstrating how these canonical test values were computed was not added. This is an incomplete checklist item that affects reproducibility and verification.

Please add documentation (inline comments or separate file) showing how these expected values were generated.

Also applies to: 50-60


99-111: Test refactoring looks good.

The tests correctly use the model_cache fixture with appropriate CI-aware behavior. The conditional skipping of ColPali in CI (due to model size) is reasonable, and the test assertions properly validate the embedding outputs.

Also applies to: 113-122, 124-133


142-143: LGTM - Assertion added for new model.

The new assertion correctly validates the embedding size for the ColModernVBERT model.


152-165: Token count test implementation is correct.

The test properly uses the model_cache fixture and validates token counting behavior across different scenarios (single docs, multiple docs, different batch sizes, and with extensions).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/test_late_interaction_multimodal.py (2)

93-96: Minor: Consider using cache.values() for iteration.

The loop variable _ indicates the key is unused. Using cache.values() is more idiomatic.

Suggested change
     if is_ci:
-        for _, model in cache.items():
+        for model in cache.values():
             delete_model_cache(model.model._model_dir)
     cache.clear()

101-102: Minor: Simplify case-insensitive comparison.

The comparison can be simplified since you're lowercasing both sides.

Suggested change
-        if model_name.lower() == "Qdrant/colpali-v1.3-fp16".lower() and os.getenv("CI"):
+        if model_name.lower() == "qdrant/colpali-v1.3-fp16" and os.getenv("CI"):

This pattern is repeated at lines 115-116 and 126-127 as well.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01965c9 and ef9c496.

📒 Files selected for processing (1)
  • tests/test_late_interaction_multimodal.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
🔇 Additional comments (7)
tests/test_late_interaction_multimodal.py (7)

70-71: LGTM!

The constant definition with a private tuple and its lowercase transformation for case-insensitive model matching is clear and appropriate.


25-35: LGTM!

Canonical image embedding values for the new model follow the established pattern.


50-60: LGTM!

Canonical query embedding values are consistent with the existing structure.


105-110: LGTM!

The test correctly uses the model_cache fixture with context manager pattern and validates embeddings against canonical values with appropriate tolerance.
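
For reference, a minimal sketch of this pattern; images and expected stand in for the test file's fixtures and canonical values:

import numpy as np

def test_image_embedding(model_cache):
    with model_cache("Qdrant/colmodernvbert") as model:
        embeddings = list(model.embed_image(images))
        # Compare a slice of the first embedding against the canonical values
        assert np.allclose(embeddings[0][:1, :2], expected, atol=2e-3)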


142-149: LGTM!

The embedding size tests appropriately verify the new model's embedding dimension. Using lazy_load=True in test_embedding_size correctly tests the property without fully loading the model.


152-165: LGTM!

The token count test is well-structured, verifying:

  • Token count additivity for individual vs. combined documents
  • Consistency across different batch sizes
  • The effect of include_extension parameter

79-89: Case normalization is intentional and correct; private attribute access is pervasive across tests.

The lowercase model name conversion on line 83 is by design. Lines 70–71 explicitly show that MODELS_TO_CACHE is built from lowercased model names (tuple(model_name.lower() for model_name in _MODELS_TO_CACHE)), and this pattern is consistently used across all embedding test fixtures in the codebase. Model instantiation with lowercase names is the intended behavior.

The use of model.model._model_dir (line 88) does rely on a private attribute, but this is a widespread pattern throughout the test suite and model implementations—not unique to this fixture. All model classes define _model_dir, and there is no public API alternative for accessing the model directory path.

Likely an incorrect or invalid review comment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @fastembed/late_interaction_multimodal/colmodernvbert.py:
- Around line 37-43: Docstring model link and description for class
ColModernVBERT point to "ModernVBERT/colmodernvbert" while the configured
Hugging Face source in the registry uses "Qdrant/colmodernvbert"; update the
docstring to match the configured source or explicitly note that this class
implements the Qdrant fork of ModernVBERT (e.g., change the URL and text in
ColModernVBERT's docstring to "Qdrant/colmodernvbert" or add "Qdrant fork of
ModernVBERT" next to the existing link), ensuring the reference aligns with the
registry/model source used by the code.
- Around line 238-277: The _process_nested_patches static method currently
assumes processed is non-empty and that every element contains at least one
patch; add explicit validation at the top of _process_nested_patches to raise a
clear ValueError if processed is empty or if any patches list is empty (e.g.,
"processed must contain at least one image and every image must contain at least
one patch"); only then compute patch_counts, max_patches and access
processed[0][0]; this prevents obscure exceptions from max(...) or
processed[0][0] and makes failures predictable and debuggable.
- Around line 158-177: In _preprocess_onnx_text_input replace the huge per-token
image allocation with a single dummy image per example: allocate
empty_image_placeholder with shape (batch_size, 1, 3, self.image_size,
self.image_size) (not (batch_size, seq_length, ...)) and set
onnx_input["pixel_values"] = empty_image_placeholder; also fix the method
docstring to say "Pre-process the ONNX model input" and update the return/docs
to reflect that pixel_values uses (batch, num_patches, C, H, W).
🧹 Nitpick comments (2)
fastembed/late_interaction_multimodal/colmodernvbert.py (2)

127-157: Config reads need defensive defaults or clearer failure mode.

These open() calls (Line 139-156) will raise if files are absent/corrupt. Even if you decide to hard-fail, it’d be helpful to raise a targeted exception explaining which file is missing and how to fix it (esp. when running with local_files_only=True).

Suggested hard-fail with clearer error (minimal change)
         # Load image processing configuration
         processor_config_path = self._model_dir / "processor_config.json"
-        with open(processor_config_path) as f:
+        with open(processor_config_path, encoding="utf-8") as f:
             processor_config = json.load(f)
             self.image_seq_len = processor_config.get("image_seq_len", 64)

         preprocessor_config_path = self._model_dir / "preprocessor_config.json"
-        with open(preprocessor_config_path) as f:
+        with open(preprocessor_config_path, encoding="utf-8") as f:
             preprocessor_config = json.load(f)
             self.max_image_size = preprocessor_config.get("max_image_size", {}).get(
                 "longest_edge", 512
             )

         # Load model configuration
         config_path = self._model_dir / "config.json"
-        with open(config_path) as f:
+        with open(config_path, encoding="utf-8") as f:
             model_config = json.load(f)
             vision_config = model_config.get("vision_config", {})
             self.image_size = vision_config.get("image_size", 512)

399-415: Consider caching prompt tokenization per (rows, cols) to avoid repeated work.

_build_input_ids_for_image() calls self.tokenizer.encode(...) (Line 411) per image. Since the prompt depends only on (rows, cols), you can memoize the resulting np.ndarray to reduce overhead in large batches.
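
A minimal sketch of such memoization; the helper names and the dict cache are assumptions:

import numpy as np

class PromptCacheMixin:
    def _prompt_ids_for_grid(self, rows: int, cols: int) -> np.ndarray:
        key = (rows, cols)
        if key not in self._prompt_cache:  # plain dict created in __init__
            prompt = self._build_image_prompt(rows, cols)  # assumed prompt builder
            self._prompt_cache[key] = np.asarray(self.tokenizer.encode(prompt).ids)
        return self._prompt_cache[key].copy()  # copy so callers cannot mutate the cache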

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef9c496 and 8b9f50c.

📒 Files selected for processing (2)
  • fastembed/late_interaction_multimodal/colmodernvbert.py
  • fastembed/late_interaction_multimodal/onnx_multimodal_model.py
🧰 Additional context used
🪛 Ruff (0.14.10)
fastembed/late_interaction_multimodal/colmodernvbert.py

48-48: Possible hardcoded password assigned to: "QUERY_AUGMENTATION_TOKEN"

(S105)


159-159: Unused method argument: kwargs

(ARG002)


193-193: Unused method argument: kwargs

(ARG002)


204-204: Unused method argument: kwargs

(ARG002)


279-279: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
🔇 Additional comments (6)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)

172-187: ExitStack resource management is correct; ImageInput contract matches Image.open() inputs.

The type definition confirms ImageInput is strictly str | Path | Image.Image, so Image.open() handles all required inputs without conversion. Path-like inputs are properly cleaned up, and user-provided PIL.Image.Image instances correctly bypass the context manager to preserve caller ownership. No issues found.

fastembed/late_interaction_multimodal/colmodernvbert.py (5)

216-237: onnx_embed_image looks solid; metadata passthrough is a good addition.

Using ExitStack() to manage image file handles (Line 217-223) and threading metadata through OnnxOutputContext.metadata (Line 232-236) is a clean integration point.


343-367: Rows/cols derivation is reasonable and has a safe fallback.

The “+1 global image” handling and factor fallback (Line 348-366) looks consistent with the patch-count encoding.


515-532: Worker setup is straightforward.

Pinning threads=1 per worker (Line 520-531) is consistent with multiprocessing patterns (avoid oversubscription).


23-34: Review comment is incorrect — the missing files are already downloaded.

The files config.json and preprocessor_config.json are included in the default allow_patterns list (hardcoded at line 199-203 in fastembed/common/model_management.py). The download mechanism automatically includes these files regardless of whether they're in additional_files. The final allow_patterns includes:

  1. A hardcoded default list containing both config.json and preprocessor_config.json
  2. The model file (model.model_file)
  3. Any files in additional_files

Since config.json and preprocessor_config.json are already in the hardcoded list, they will be downloaded automatically when snapshot_download() is called. There is no runtime failure risk.

Only processor_config.json (which is correctly listed in additional_files) is not in the default list and relies on explicit inclusion.

Likely an incorrect or invalid review comment.


278-342: No issue — tokenizer.padding access is guaranteed safe.

The tokenizer is initialized via load_tokenizer() in _load_onnx_model() (line 78 of onnx_multimodal_model.py), which explicitly enables padding before returning: if not tokenizer.padding: tokenizer.enable_padding(...). The padding dict with "pad_id" and "direction" keys is always present by the time _preprocess_onnx_image_input() is called. This pattern is already proven safe elsewhere in the codebase (e.g., colbert.py line 179) with no defensive guards, and the initialization contract is enforced by the framework.

Likely an incorrect or invalid review comment.

Comment on lines +37 to +43
class ColModernVBERT(LateInteractionMultimodalEmbeddingBase, OnnxMultimodalModel[NumpyArray]):
"""
The ModernVBERT/colmodernvbert model implementation. This model uses
bidirectional attention, which proves to work better for retrieval.
See: https://huggingface.co/ModernVBERT/colmodernvbert
"""

⚠️ Potential issue | 🟡 Minor

Model reference mismatch in docstring/link vs configured HF source.

The docstring points to ModernVBERT/colmodernvbert (Line 42), while the registry entry uses Qdrant/colmodernvbert (Line 25, 30). If this is intentional (fork vs upstream), consider aligning the link and/or explicitly stating “Qdrant fork of …”.

🤖 Prompt for AI Agents
In @fastembed/late_interaction_multimodal/colmodernvbert.py around lines 37 -
43, Docstring model link and description for class ColModernVBERT point to
"ModernVBERT/colmodernvbert" while the configured Hugging Face source in the
registry uses "Qdrant/colmodernvbert"; update the docstring to match the
configured source or explicitly note that this class implements the Qdrant fork
of ModernVBERT (e.g., change the URL and text in ColModernVBERT's docstring to
"Qdrant/colmodernvbert" or add "Qdrant fork of ModernVBERT" next to the existing
link), ensuring the reference aligns with the registry/model source used by the
code.

Comment on lines +158 to +177
def _preprocess_onnx_text_input(
self, onnx_input: dict[str, NumpyArray], **kwargs: Any
) -> dict[str, NumpyArray]:
"""
Post-process the ONNX model output to convert it into a usable format.
Args:
output (OnnxOutputContext): The raw output from the ONNX model.
Returns:
Iterable[NumpyArray]: Post-processed output as NumPy arrays.
"""
batch_size, seq_length = onnx_input["input_ids"].shape
empty_image_placeholder: NumpyArray = np.zeros(
(batch_size, seq_length, 3, self.image_size, self.image_size),
dtype=np.float32, # type: ignore[type-var,arg-type,assignment]
)
onnx_input["pixel_values"] = empty_image_placeholder
return onnx_input


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find the file and examine it
find . -name "colmodernvbert.py" -type f

Repository: qdrant/fastembed

Length of output: 117


🏁 Script executed:

# Check the content of the file around the mentioned lines
cat -n fastembed/late_interaction_multimodal/colmodernvbert.py | sed -n '150,185p'

Repository: qdrant/fastembed

Length of output: 1623


🏁 Script executed:

# Look at the class definition and initialization to understand image_size
cat -n fastembed/late_interaction_multimodal/colmodernvbert.py | sed -n '1,50p'

Repository: qdrant/fastembed

Length of output: 2128


🏁 Script executed:

# Search for image_size initialization and usage
rg "image_size" fastembed/late_interaction_multimodal/colmodernvbert.py -n -B 2 -A 2

Repository: qdrant/fastembed

Length of output: 1085


🏁 Script executed:

# Check how pixel_values is used after this method
rg "pixel_values" fastembed/late_interaction_multimodal/colmodernvbert.py -n -B 3 -A 3

Repository: qdrant/fastembed

Length of output: 1611


🏁 Script executed:

# Look for usages of _preprocess_onnx_text_input
rg "_preprocess_onnx_text_input" . -n -B 2 -A 5

Repository: qdrant/fastembed

Length of output: 3353


🏁 Script executed:

# Check what the ONNX model expects for text-only inputs
find . -name "*.json" -path "*/colmodernvbert*" -o -name "config.json" | head -10

Repository: qdrant/fastembed

Length of output: 42


🏁 Script executed:

# Look at how other models (colpali) handle text preprocessing
cat -n fastembed/late_interaction_multimodal/colpali.py | sed -n '193,220p'

Repository: qdrant/fastembed

Length of output: 1351


🏁 Script executed:

# Check if there's any test or usage pattern for text-only inputs
rg "_preprocess_onnx_text_input|text_to_embeddings" fastembed/late_interaction_multimodal/ -n -B 2 -A 5

Repository: qdrant/fastembed

Length of output: 3289


🏁 Script executed:

# Check IMAGE_PLACEHOLDER_SIZE in colpali to understand the pattern better
rg "IMAGE_PLACEHOLDER_SIZE" fastembed/late_interaction_multimodal/colpali.py -B 3 -A 3

Repository: qdrant/fastembed

Length of output: 673


🏁 Script executed:

# Check if colmodernvbert has similar constant or expected shape
rg "IMAGE_PLACEHOLDER_SIZE|pixel_values.*shape" fastembed/late_interaction_multimodal/colmodernvbert.py -n

Repository: qdrant/fastembed

Length of output: 106


🏁 Script executed:

# Check the docstring/comments around _preprocess_onnx_image_input to see expected shapes
cat -n fastembed/late_interaction_multimodal/colmodernvbert.py | sed -n '280,330p'

Repository: qdrant/fastembed

Length of output: 2726


🏁 Script executed:

# Verify the image processing path to understand the full context
cat -n fastembed/late_interaction_multimodal/colmodernvbert.py | sed -n '215,240p'

Repository: qdrant/fastembed

Length of output: 1365


Fix memory explosion in text-only preprocessing: allocate single image placeholder instead of one per token.

The empty_image_placeholder is shaped (batch_size, seq_length, 3, image_size, image_size), which allocates massive memory: for typical seq_length=512 and image_size=512, a single batch item uses 512 × 3 × 512 × 512 × 4 bytes ≈ 1.6 GB. This is conceptually wrong: text-only inputs need only a single dummy image, not one per token.

The expected format for pixel_values is (batch, num_patches, C, H, W) (see docstring at line 288). For text-only runs, use (batch_size, 1, 3, image_size, image_size) to represent a single dummy patch.

Also note the docstring incorrectly states "Post-process"—this is a preprocessing method.

Suggested fix
     def _preprocess_onnx_text_input(
         self, onnx_input: dict[str, NumpyArray], **kwargs: Any
     ) -> dict[str, NumpyArray]:
-        batch_size, seq_length = onnx_input["input_ids"].shape
+        batch_size, _seq_length = onnx_input["input_ids"].shape
         empty_image_placeholder: NumpyArray = np.zeros(
-            (batch_size, seq_length, 3, self.image_size, self.image_size),
+            (batch_size, 1, 3, self.image_size, self.image_size),
             dtype=np.float32,  # type: ignore[type-var,arg-type,assignment]
         )
         onnx_input["pixel_values"] = empty_image_placeholder
         return onnx_input
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     def _preprocess_onnx_text_input(
         self, onnx_input: dict[str, NumpyArray], **kwargs: Any
     ) -> dict[str, NumpyArray]:
         """
-        Post-process the ONNX model output to convert it into a usable format.
+        Pre-process the ONNX text input by attaching a dummy image placeholder.

         Args:
-            output (OnnxOutputContext): The raw output from the ONNX model.
+            onnx_input (dict[str, NumpyArray]): The text input dictionary for the ONNX model.

         Returns:
-            Iterable[NumpyArray]: Post-processed output as NumPy arrays.
+            dict[str, NumpyArray]: Input with a (batch, 1, C, H, W) pixel_values placeholder.
         """
-        batch_size, seq_length = onnx_input["input_ids"].shape
+        batch_size, _seq_length = onnx_input["input_ids"].shape
         empty_image_placeholder: NumpyArray = np.zeros(
-            (batch_size, seq_length, 3, self.image_size, self.image_size),
+            (batch_size, 1, 3, self.image_size, self.image_size),
             dtype=np.float32,  # type: ignore[type-var,arg-type,assignment]
         )
         onnx_input["pixel_values"] = empty_image_placeholder
         return onnx_input
🧰 Tools
🪛 Ruff (0.14.10)

159-159: Unused method argument: kwargs

(ARG002)

🤖 Prompt for AI Agents
In @fastembed/late_interaction_multimodal/colmodernvbert.py around lines 158 -
177, In _preprocess_onnx_text_input replace the huge per-token image allocation
with a single dummy image per example: allocate empty_image_placeholder with
shape (batch_size, 1, 3, self.image_size, self.image_size) (not (batch_size,
seq_length, ...)) and set onnx_input["pixel_values"] = empty_image_placeholder;
also fix the method docstring to say "Pre-process the ONNX model input" and
update the return/docs to reflect that pixel_values uses (batch, num_patches, C,
H, W).

Comment on lines +238 to +277
@staticmethod
def _process_nested_patches(
processed: list[list[NumpyArray]],
) -> tuple[NumpyArray, NumpyArray, dict[str, Any]]:
"""
Process nested image patches (from ImageSplitter).
Args:
processed: List of patch lists, one per image [[img1_patches], [img2_patches], ...]
Returns:
tuple: (encoded array, attention_mask, metadata)
- encoded: (batch_size, max_patches, C, H, W)
- attention_mask: (batch_size, max_patches) with 1 for real patches, 0 for padding
- metadata: Dict with 'patch_counts' key
"""
patch_counts = [len(patches) for patches in processed]
max_patches = max(patch_counts)

# Get dimensions from first patch
channels, height, width = processed[0][0].shape
batch_size = len(processed)

# Create padded array
encoded = np.zeros(
(batch_size, max_patches, channels, height, width), dtype=processed[0][0].dtype
)

# Create attention mask (1 for real patches, 0 for padding)
attention_mask = np.zeros((batch_size, max_patches), dtype=np.int64)

# Fill in patches and attention mask
for i, patches in enumerate(processed):
for j, patch in enumerate(patches):
encoded[i, j] = patch
attention_mask[i, j] = 1

metadata = {"patch_counts": patch_counts}
return encoded, attention_mask, metadata # type: ignore[return-value]


⚠️ Potential issue | 🟠 Major

Patch processing assumes non-empty inputs; add explicit validation.

max(patch_counts) (Line 255) and processed[0][0] (Line 258) will throw on empty images, or if any image yields zero patches. A small guard makes failures predictable and easier to debug.

Proposed fix
     def _process_nested_patches(
         processed: list[list[NumpyArray]],
     ) -> tuple[NumpyArray, NumpyArray, dict[str, Any]]:
+        if not processed:
+            raise ValueError("No images/patches to process (processed is empty)")
+        if any(len(patches) == 0 for patches in processed):
+            raise ValueError("At least one image produced zero patches")
         patch_counts = [len(patches) for patches in processed]
         max_patches = max(patch_counts)
🤖 Prompt for AI Agents
In @fastembed/late_interaction_multimodal/colmodernvbert.py around lines 238 -
277, The _process_nested_patches static method currently assumes processed is
non-empty and that every element contains at least one patch; add explicit
validation at the top of _process_nested_patches to raise a clear ValueError if
processed is empty or if any patches list is empty (e.g., "processed must
contain at least one image and every image must contain at least one patch");
only then compute patch_counts, max_patches and access processed[0][0]; this
prevents obscure exceptions from max(...) or processed[0][0] and makes failures
predictable and debuggable.

@joein joein merged commit 800f388 into main Jan 9, 2026
13 checks passed
@joein joein deleted the feature/colmodernvbert-support branch January 9, 2026 11:52