new: drop python3.9, replace optional and union with | #574
📝 Walkthrough
The PR uniformly migrates type annotations to Python 3.10+ PEP 604 syntax (X | Y) and replaces older typing generics (Optional, Union, List, Dict, Set) with built-in generics or pipe unions across many modules. CI and packaging were updated to require/run Python >=3.10 (pyproject.toml and GitHub Actions workflows). A new module-level constant
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~30 minutes
Areas needing extra attention:
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
fastembed/common/types.py (1)
13-20: Reconsider the NumpyArray expansion scope: it masks a semantic type violation in BM25.
The expansion to include integer types was added to support token IDs (appropriate), but it overly broadens the contract for SparseEmbedding.values, which should semantically represent score values (floats).
Issue: BM25 creates integer values (np.ones_like(token_ids) with dtype=np.int32) that now pass strict type checking only because NumpyArray was expanded. Before this change, BM25 would have failed type validation; the bug was latent. The # type: ignore comment in as_dict() reveals the developers recognized the type mismatch.
Recommended approach:
- Narrow NumpyArray to float types only for SparseEmbedding.values
- Use a separate/specific type for token ID arrays (e.g., in tokenize(), convert_ids_to_tokens())
- Fix BM25 to create actual float values (BM25 scores) instead of integer ones
The current change allows type-incorrect code to pass strict mode validation without addressing the underlying semantic contract violation.
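A minimal sketch of the dtype issue described above, assuming numpy is available. The token IDs and variable names are made up for illustration and are not taken from fastembed's BM25 code; the point is only that np.ones_like inherits the integer dtype of its input unless a float dtype is requested.

```python
import numpy as np

# Hypothetical token IDs; real ones would come from a tokenizer.
token_ids = np.array([101, 2054, 102], dtype=np.int32)

# ones_like copies the dtype of its input, so these "values" are integers.
int_values = np.ones_like(token_ids)

# Requesting a float dtype explicitly yields score-like float values.
float_values = np.ones_like(token_ids, dtype=np.float32)


def is_float_array(arr: np.ndarray) -> bool:
    """Return True when the array holds floating-point values."""
    return bool(np.issubdtype(arr.dtype, np.floating))


print(int_values.dtype, is_float_array(int_values))
print(float_values.dtype, is_float_array(float_values))
```

A check like is_float_array is one way a narrowed float-only contract for the values field could be enforced at runtime; whether that belongs in the library is a design decision for the maintainers.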
🧹 Nitpick comments (6)
fastembed/late_interaction/token_embeddings.py (1)
2-69: Use PEP 604 for documents, drop Union import
Switching this parameter to str | Iterable[str] lets us remove the remaining Union import and keeps the file consistent with the rest of the modernization.
-from typing import Union, Iterable, Any, Type
+from typing import Iterable, Any, Type
...
- documents: Union[str, Iterable[str]],
+ documents: str | Iterable[str],
fastembed/late_interaction_multimodal/colpali.py (1)
60-75: Docstring still references Optional[...]
The annotations now use | None, so consider updating the docstring wording (e.g., “Sequence[OnnxProvider] | None”) to avoid confusing readers skimming the docs.
fastembed/sparse/splade_pp.py (1)
80-95: Bring the docstring in line with the new union syntax.
Signature now advertises str | None, Sequence[...] | None, etc., but the docstring still talks about Optional[...]. Updating the prose keeps API docs consistent with the public surface.
- cache_dir (str, optional): The path to the cache directory.
+ cache_dir (str | None, optional): The path to the cache directory.
@@
- providers (Optional[Sequence[OnnxProvider]], optional): The list of onnxruntime providers to use.
+ providers (Sequence[OnnxProvider] | None, optional): The list of onnxruntime providers to use.
@@
- device_ids (Optional[list[int]], optional): The list of device ids to use for data parallel processing in
+ device_ids (list[int] | None, optional): The list of device ids to use for data parallel processing in
@@
- device_id (Optional[int], optional): The device id to use for loading the model in the worker process.
- specific_model_path (Optional[str], optional): The specific path to the onnx model dir if it should be imported from somewhere else
+ device_id (int | None, optional): The device id to use for loading the model in the worker process.
+ specific_model_path (str | None, optional): The specific path to the onnx model dir if it should be imported from somewhere else
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)
92-106: Update the docstring to mirror the union annotations.
After switching the signature to str | None / Sequence[...] | None, the docstring should follow suit to avoid stale Optional[...] references.
- cache_dir (str, optional): The path to the cache directory.
+ cache_dir (str | None, optional): The path to the cache directory.
@@
- threads (int, optional): The number of threads single onnxruntime session can use. Defaults to None.
+ threads (int | None, optional): The number of threads single onnxruntime session can use. Defaults to None.
@@
- providers (Optional[Sequence[OnnxProvider]], optional): The list of onnxruntime providers to use.
+ providers (Sequence[OnnxProvider] | None, optional): The list of onnxruntime providers to use.
@@
- device_ids (Optional[list[int]], optional): The list of device ids to use for data parallel processing in
+ device_ids (list[int] | None, optional): The list of device ids to use for data parallel processing in
@@
- device_id (Optional[int], optional): The device id to use for loading the model in the worker process.
- specific_model_path (Optional[str], optional): The specific path to the onnx model dir if it should be imported from somewhere else
+ device_id (int | None, optional): The device id to use for loading the model in the worker process.
+ specific_model_path (str | None, optional): The specific path to the onnx model dir if it should be imported from somewhere else
fastembed/late_interaction/late_interaction_embedding_base.py (1)
1-61: LGTM! Type hint modernization is correct.
The conversion from Optional[T] to T | None and Union[A, B] to A | B is syntactically correct and maintains functional equivalence. All parameter types in __init__, embed, and query_embed have been updated consistently.
Consider updating the docstring at line 51 to align with the new type syntax: it still references Union[str, Iterable[str]] while the actual parameter type now uses str | Iterable[str]. This applies to similar docstrings in other methods as well.
fastembed/late_interaction/late_interaction_text_embedding.py (1)
1-153: LGTM! Type annotations updated correctly.
All type hint conversions to Python 3.10+ union syntax are correct. The changes maintain functional equivalence while modernizing the codebase.
The docstring at line 146 still references the old Union[str, Iterable[str]] syntax. For consistency, consider updating docstrings to match the new type annotation style throughout the codebase.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (42)
- .github/workflows/python-publish.yml (1 hunks)
- .github/workflows/python-tests.yml (0 hunks)
- .github/workflows/type-checkers.yml (1 hunks)
- fastembed/common/model_description.py (1 hunks)
- fastembed/common/model_management.py (4 hunks)
- fastembed/common/onnx_model.py (4 hunks)
- fastembed/common/types.py (1 hunks)
- fastembed/embedding.py (2 hunks)
- fastembed/image/image_embedding.py (4 hunks)
- fastembed/image/image_embedding_base.py (2 hunks)
- fastembed/image/onnx_embedding.py (4 hunks)
- fastembed/image/onnx_image_model.py (4 hunks)
- fastembed/image/transform/functional.py (5 hunks)
- fastembed/image/transform/operators.py (7 hunks)
- fastembed/late_interaction/colbert.py (6 hunks)
- fastembed/late_interaction/late_interaction_embedding_base.py (3 hunks)
- fastembed/late_interaction/late_interaction_text_embedding.py (5 hunks)
- fastembed/late_interaction/token_embeddings.py (2 hunks)
- fastembed/late_interaction_multimodal/colpali.py (5 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (5 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (3 hunks)
- fastembed/parallel_processor.py (4 hunks)
- fastembed/postprocess/muvera.py (1 hunks)
- fastembed/rerank/cross_encoder/custom_text_cross_encoder.py (2 hunks)
- fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (4 hunks)
- fastembed/rerank/cross_encoder/onnx_text_model.py (4 hunks)
- fastembed/rerank/cross_encoder/text_cross_encoder.py (4 hunks)
- fastembed/rerank/cross_encoder/text_cross_encoder_base.py (3 hunks)
- fastembed/sparse/bm25.py (5 hunks)
- fastembed/sparse/bm42.py (5 hunks)
- fastembed/sparse/minicoil.py (5 hunks)
- fastembed/sparse/sparse_embedding_base.py (4 hunks)
- fastembed/sparse/sparse_text_embedding.py (4 hunks)
- fastembed/sparse/splade_pp.py (4 hunks)
- fastembed/sparse/utils/sparse_vectors_converter.py (7 hunks)
- fastembed/text/custom_text_embedding.py (2 hunks)
- fastembed/text/multitask_embedding.py (4 hunks)
- fastembed/text/onnx_embedding.py (4 hunks)
- fastembed/text/onnx_text_model.py (5 hunks)
- fastembed/text/text_embedding.py (6 hunks)
- fastembed/text/text_embedding_base.py (3 hunks)
- tests/utils.py (2 hunks)
💤 Files with no reviewable changes (1)
- .github/workflows/python-tests.yml
🧰 Additional context used
🧬 Code graph analysis (21)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)
- fastembed/late_interaction_multimodal/colpali.py (1): embed_text (209-242)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): embed_text (120-141)
fastembed/image/image_embedding_base.py (2)
- fastembed/image/image_embedding.py (1): embed (114-135)
- fastembed/image/onnx_embedding.py (1): embed (149-184)
fastembed/sparse/sparse_embedding_base.py (4)
- fastembed/sparse/bm25.py (1): query_embed (305-321)
- fastembed/sparse/bm42.py (1): query_embed (317-336)
- fastembed/sparse/minicoil.py (1): query_embed (220-238)
- fastembed/sparse/sparse_text_embedding.py (1): query_embed (118-128)
fastembed/text/text_embedding_base.py (4)
- fastembed/image/image_embedding_base.py (1): embed (23-45)
- fastembed/late_interaction/late_interaction_embedding_base.py (2): embed (22-29), query_embed (46-61)
- fastembed/sparse/sparse_embedding_base.py (2): embed (47-54), query_embed (71-86)
- fastembed/text/text_embedding.py (2): embed (165-187), query_embed (189-200)
fastembed/late_interaction/late_interaction_embedding_base.py (3)
- fastembed/late_interaction/colbert.py (2): embed (205-239), query_embed (241-251)
- fastembed/late_interaction/token_embeddings.py (1): embed (64-71)
- fastembed/text/text_embedding_base.py (2): embed (22-29), query_embed (46-61)
fastembed/late_interaction/late_interaction_text_embedding.py (3)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): query_embed (46-61)
- fastembed/text/text_embedding.py (1): query_embed (189-200)
- fastembed/late_interaction/colbert.py (1): query_embed (241-251)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (3)
- fastembed/image/image_embedding.py (1): embedding_size (81-85)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): embedding_size (84-88)
- fastembed/text/text_embedding.py (1): embedding_size (132-136)
fastembed/late_interaction/colbert.py (2)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): query_embed (46-61)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): query_embed (141-153)
fastembed/sparse/utils/sparse_vectors_converter.py (3)
- fastembed/common/utils.py (2): get_all_punctuation (62-65), remove_non_alphanumeric (68-69)
- fastembed/sparse/utils/vocab_resolver.py (1): vocab_size (54-56)
- fastembed/sparse/sparse_embedding_base.py (1): SparseEmbedding (13-31)
fastembed/sparse/bm25.py (4)
- fastembed/late_interaction/colbert.py (1): query_embed (241-251)
- fastembed/sparse/bm42.py (1): query_embed (317-336)
- fastembed/sparse/sparse_embedding_base.py (2): query_embed (71-86), SparseEmbedding (13-31)
- fastembed/sparse/sparse_text_embedding.py (1): query_embed (118-128)
fastembed/sparse/minicoil.py (5)
- fastembed/sparse/utils/vocab_resolver.py (1): VocabResolver (31-201)
- fastembed/sparse/utils/minicoil_encoder.py (1): Encoder (11-146)
- fastembed/sparse/utils/sparse_vectors_converter.py (1): SparseVectorConverter (24-244)
- fastembed/sparse/sparse_embedding_base.py (2): query_embed (71-86), SparseEmbedding (13-31)
- fastembed/sparse/sparse_text_embedding.py (1): query_embed (118-128)
fastembed/postprocess/muvera.py (2)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): LateInteractionTextEmbeddingBase (8-71)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1): LateInteractionMultimodalEmbeddingBase (10-78)
fastembed/text/onnx_text_model.py (2)
- fastembed/common/onnx_model.py (1): _preprocess_onnx_input (49-55)
- fastembed/text/multitask_embedding.py (1): _preprocess_onnx_input (60-69)
tests/utils.py (1)
- fastembed/common/model_description.py (1): BaseModelDescription (24-31)
fastembed/text/multitask_embedding.py (2)
- fastembed/image/image_embedding.py (1): embed (114-135)
- fastembed/text/text_embedding.py (1): embed (165-187)
fastembed/image/onnx_image_model.py (1)
- fastembed/image/transform/operators.py (1): Compose (85-269)
fastembed/sparse/bm42.py (2)
- fastembed/sparse/bm25.py (1): query_embed (305-321)
- fastembed/sparse/sparse_embedding_base.py (2): query_embed (71-86), SparseEmbedding (13-31)
fastembed/image/transform/operators.py (1)
- fastembed/image/transform/functional.py (1): pil2ndarray (118-121)
fastembed/image/image_embedding.py (3)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): embedding_size (84-88)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): embedding_size (87-91)
- fastembed/text/text_embedding.py (1): embedding_size (132-136)
fastembed/sparse/sparse_text_embedding.py (4)
- fastembed/sparse/bm25.py (1): query_embed (305-321)
- fastembed/sparse/bm42.py (1): query_embed (317-336)
- fastembed/sparse/minicoil.py (1): query_embed (220-238)
- fastembed/sparse/sparse_embedding_base.py (2): query_embed (71-86), SparseEmbedding (13-31)
fastembed/text/text_embedding.py (4)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): query_embed (141-153)
- fastembed/text/text_embedding_base.py (1): query_embed (46-61)
- fastembed/sparse/sparse_embedding_base.py (1): query_embed (71-86)
- fastembed/text/multitask_embedding.py (1): query_embed (84-85)
🪛 Ruff (0.14.4)
fastembed/late_interaction/colbert.py
241-241: Unused method argument: kwargs
(ARG002)
fastembed/sparse/bm25.py
305-305: Unused method argument: kwargs
(ARG002)
fastembed/text/onnx_text_model.py
42-42: Unused method argument: kwargs
(ARG002)
fastembed/sparse/bm42.py
317-317: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on windows-latest test
- GitHub Check: Python 3.10.x on windows-latest test
- GitHub Check: Python 3.11.x on macos-latest test
- GitHub Check: Python 3.12.x on macos-latest test
- GitHub Check: Python 3.11.x on windows-latest test
- GitHub Check: Python 3.10.x on macos-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
🔇 Additional comments (42)
fastembed/sparse/minicoil.py (4)
3-3: LGTM! Import statement correctly updated.
The removal of Optional and Union imports is appropriate since the code now uses PEP 604 union syntax (|).
68-78: LGTM! Constructor parameters correctly modernized.
All optional parameter type hints have been properly converted to PEP 604 syntax. The conversions maintain functional equivalence while aligning with Python 3.10+ conventions.
118-126: LGTM! Instance attributes properly typed with appropriate None guards.
The optional type annotations correctly reflect the lazy loading pattern. The assertions at lines 151 and 264-266 properly guard against None usage before accessing these attributes.
180-220: LGTM! Method signatures correctly updated and consistent with base classes.
The type hint modernization for embed() and query_embed() methods maintains consistency with the base class signatures shown in the related code.
.github/workflows/type-checkers.yml (1)
11-11: Matrix drop matches supported runtimes
Removing 3.9 keeps our type-checking matrix aligned with the new 3.10+ typing surface; looks good.
.github/workflows/python-publish.yml (1)
28-28: Publish workflow aligned with 3.10 floor
Bumping the release job to Python 3.10 keeps packaging consistent with the rest of the toolchain.
fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)
11-45: Constructor unions look good
The move to str | None / int | None mirrors the rest of the PEP 604 migration and keeps the interface unchanged. LGTM.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
14-52: Base multimodal signatures updated cleanly
Nice work updating the constructor and embed entry points to the new union syntax while keeping behavior intact.
fastembed/image/transform/functional.py (1)
16-128: Image utilities now speak the same typing dialect
The widened unions for PIL images vs. ndarrays and the refined mean/std annotations match the rest of the refactor and preserve runtime logic; looks solid.
fastembed/image/transform/operators.py (1)
18-92: Transform stack consistent with functional helpers
The union-typed signatures align perfectly with the helper functions and keep the transform pipeline typings coherent. Nicely done.
fastembed/sparse/utils/sparse_vectors_converter.py (3)
18-21: Type annotations modernized successfully.
The dataclass fields now use built-in generic types (list[str] instead of List[str]), aligning with Python 3.10+ conventions.
25-41: New parameter avg_len adds flexibility to BM25 calculation.
The avg_len parameter is now exposed in the constructor with a sensible default of 150.0. This maintains backward compatibility while allowing customization of the average document length used in BM25 scoring.
60-64: Method signatures consistently updated.
All method signatures throughout the file now use built-in generics (list[float] instead of List[float]), maintaining consistency with the modernization effort.
tests/utils.py (2)
11-11: Type annotations modernized with union syntax.
The function signature now uses str | Path instead of Union[str, Path], aligning with Python 3.10+ typing conventions.
42-47: Parameter type updated consistently.
The is_ci parameter now uses str | None instead of Optional[str], maintaining consistency with the modernization effort across the codebase.
fastembed/common/model_description.py (1)
34-40: Type hints updated to modern union syntax.
The DenseModelDescription fields now use int | None and dict[str, Any] | None instead of Optional[...]. The __post_init__ validation for dim remains unchanged, correctly enforcing that dim must be provided at runtime.
fastembed/postprocess/muvera.py (1)
12-12: Type alias modernized with union operator.
The MultiVectorModel type alias now uses the | operator instead of Union, consistent with Python 3.10+ typing conventions.
fastembed/rerank/cross_encoder/custom_text_cross_encoder.py (1)
11-23: Constructor signature modernized comprehensively.
All optional parameters now use the modern union syntax (str | None, int | None, Sequence[OnnxProvider] | None, list[int] | None). The changes are consistent and maintain full backward compatibility.
fastembed/common/model_management.py (3)
180-194: Nested function types updated consistently.
The _collect_file_metadata function's return type now uses dict[str, dict[str, int | str]] with the modern union syntax for the inner dict values. This correctly expresses that metadata values can be either int or str.
195-201: Parameter typing aligned with return type.
The _save_file_metadata function parameter type matches the return type of _collect_file_metadata, maintaining type consistency across the helper functions.
376-404: Public method signature modernized.
The download_model method's specific_model_path parameter now uses str | None instead of Optional[str], consistent with the modernization effort across the codebase.
fastembed/rerank/cross_encoder/onnx_text_model.py (3)
21-22: Class variable type correctly updated.
The ONNX_OUTPUT_NAMES class attribute now uses list[str] | None instead of Optional[list[str]], maintaining consistency with Python 3.10+ conventions.
28-46: Method signature parameters modernized.
The _load_onnx_model method parameters now use the union operator for optional types (int | None, Sequence[OnnxProvider] | None), aligning with modern Python typing.
87-100: Complex method signature updated consistently.
All optional parameters in _rerank_pairs now use the modern union syntax. The method signature is cleaner and more readable with the | operator.
fastembed/embedding.py (1)
16-24: Deprecated class updated with modern type hints.
The JinaEmbedding constructor parameters now use str | None and int | None instead of Optional types. While this class is deprecated, maintaining consistent typing is good practice.
fastembed/sparse/sparse_text_embedding.py (1)
1-128: LGTM! Type hint modernization is consistent.
The updates to union syntax are correct and align with the broader project changes.
fastembed/sparse/bm25.py (1)
1-321: LGTM! Type hint updates are correct.
The conversion to Python 3.10+ union syntax is consistent throughout the file.
Note: The static analysis hint about unused kwargs at line 305 can be safely ignored: the parameter is maintained for interface consistency with other embedding classes' query_embed methods, as seen in the codebase pattern.
fastembed/image/image_embedding.py (1)
1-135: LGTM! Type annotations modernized correctly.
All parameter and variable type hints have been updated to Python 3.10+ union syntax consistently.
fastembed/text/text_embedding.py (1)
1-214: LGTM! Type hint modernization is complete and correct.
All type annotations have been successfully updated to Python 3.10+ union syntax across the public API.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
1-164: LGTM! Type annotations updated consistently.
All type hints have been correctly modernized to Python 3.10+ union syntax.
fastembed/common/types.py (1)
9-12: LGTM! Type alias updates are correct.
The conversion of PathInput, ImageInput, and OnnxProvider to Python 3.10+ union syntax is correct and maintains semantic equivalence.
fastembed/parallel_processor.py (1)
11-11: LGTM! Type annotations modernized correctly.
The type hint updates from Optional[T] to T | None are syntactically correct and align with Python 3.10+ PEP 604 union syntax. The removal of Optional from the typing imports is also appropriate.
Also applies to: 41-41, 96-97, 102-103, 110-110, 223-223
fastembed/sparse/sparse_embedding_base.py (1)
38-39: LGTM! Type annotations modernized correctly.
The updates to cache_dir, threads, documents, parallel, and query parameters correctly adopt the Python 3.10+ union syntax.
Also applies to: 49-51, 71-71
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
1-1: LGTM! Type annotations modernized correctly.
All type hint updates from Optional[T] to T | None are correct and consistent. The removal of Optional from imports is appropriate.
Also applies to: 56-58, 60-60, 105-105, 143-143
fastembed/text/onnx_embedding.py (1)
1-1: LGTM! Type annotations modernized correctly.
All type hint updates throughout the OnnxTextEmbedding class are syntactically correct and align with Python 3.10+ union syntax. The removal of Optional from imports is appropriate.
Also applies to: 202-204, 206-206, 208-209, 242-242, 263-263, 265-265
fastembed/text/text_embedding_base.py (1)
1-1: LGTM! Type annotations modernized correctly.
All type hint updates in the TextEmbeddingBase class are correct and consistent with Python 3.10+ union syntax. The removal of Optional and Union from imports is appropriate.
Also applies to: 12-13, 20-20, 24-24, 26-26, 46-46
fastembed/image/onnx_image_model.py (1)
5-5: LGTM! Type annotations modernized correctly.
All type hint updates in the OnnxImageModel class are syntactically correct and align with Python 3.10+ union syntax. The removal of Optional from imports is appropriate.
Also applies to: 40-40, 54-55, 57-57, 94-94, 96-97, 99-99, 101-101
fastembed/image/onnx_embedding.py (1)
1-1: LGTM! Type annotations modernized correctly.
All type hint updates in the OnnxImageEmbedding class are syntactically correct and align with Python 3.10+ union syntax. The removal of Optional from imports is appropriate.
Also applies to: 66-68, 70-70, 72-73, 107-107, 151-151, 153-153
fastembed/text/onnx_text_model.py (1)
4-4: LGTM! Type annotations modernized correctly.
All type hint updates in the OnnxTextModel class are syntactically correct and align with Python 3.10+ union syntax. The removal of Optional from imports is appropriate.
The return type on line 43 (dict[str, NumpyArray | NDArray[np.int64]]) correctly uses the new union syntax for flexible preprocessing return types.
Also applies to: 18-18, 38-38, 43-43, 53-54, 56-56, 105-105, 107-108, 110-110, 112-112
fastembed/late_interaction/colbert.py (1)
2-2: LGTM! Clean type annotation modernization.
The conversion from Optional[T] and Union[A, B] to Python 3.10+ union syntax (T | None, A | B) is correctly applied throughout the file. All type changes maintain functional equivalence while aligning with modern Python typing conventions.
Note: The static analysis warning about unused kwargs in query_embed (line 241) can be safely ignored: it's part of the API contract inherited from the base class for extensibility.
Also applies to: 111-118, 152-152, 168-169, 172-172, 207-207, 209-209, 241-241
fastembed/sparse/bm42.py (1)
4-4: LGTM! Consistent type annotation updates.
The PEP 604 union syntax conversion is correctly and consistently applied. The changes maintain functional equivalence while modernizing the type hints.
Note: The static analysis warning about unused kwargs in query_embed (line 317) can be safely ignored: it follows the same extensibility pattern as other embedding classes in the codebase.
Also applies to: 60-68, 104-104, 275-275, 277-277, 317-317
fastembed/text/multitask_embedding.py (1)
2-2: LGTM! Type annotations correctly modernized.
The conversion to PEP 604 union syntax is properly applied across all method signatures and attribute annotations. The task_id handling pattern is correct: the public embed method accepts an optional task_id and defaults it to self.default_task_id, ensuring that by the time _preprocess_onnx_input is called, a non-None value is present. The ValueError in _preprocess_onnx_input (line 67) serves as an appropriate safeguard.
Also applies to: 48-48, 50-50, 63-63, 73-73, 75-76, 84-84
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
pyproject.toml(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.959Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on windows-latest test
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.12.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on macos-latest test
- GitHub Check: Python 3.12.x on macos-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.10.x on macos-latest test
- GitHub Check: Python 3.10.x on windows-latest test
🔇 Additional comments (1)
pyproject.toml (1)
14-14: Python 3.10 minimum version requirement correctly updated.
The change from >=3.9.0 to >=3.10.0 aligns with the PR objectives and enables use of modern type-hinting features (PEP 604 union syntax with |).
Actionable comments posted: 0
🧹 Nitpick comments (4)
fastembed/late_interaction/colbert.py (1)
244-244: Consider removing unused kwargs parameter.
The kwargs parameter is declared but never used in the method body. While this may be intentional for API consistency with other query_embed implementations, consider whether it's necessary.
If the parameter isn't needed, apply this diff:
- def query_embed(self, query: str | Iterable[str], **kwargs: Any) -> Iterable[NumpyArray]:
+ def query_embed(self, query: str | Iterable[str]) -> Iterable[NumpyArray]:
fastembed/sparse/bm42.py (1)
328-328: Consider removing unused kwargs parameter.
The kwargs parameter is declared but never used in the method body. While this may be intentional for API consistency across different embedding implementations, consider whether it's necessary.
If the parameter isn't needed, apply this diff:
- def query_embed(self, query: str | Iterable[str], **kwargs: Any) -> Iterable[SparseEmbedding]:
+ def query_embed(self, query: str | Iterable[str]) -> Iterable[SparseEmbedding]:
fastembed/text/onnx_text_model.py (1)
41-47: Unused kwargs in _preprocess_onnx_input (Ruff ARG002)
_preprocess_onnx_input accepts **kwargs but ignores it in the base implementation; Ruff flags this. If you want to keep the hook while silencing the warning, you can explicitly mark it unused:
 def _preprocess_onnx_input(
     self, onnx_input: dict[str, NumpyArray], **kwargs: Any
 ) -> dict[str, NumpyArray | NDArray[np.int64]]:
-    """
-    Preprocess the onnx input.
-    """
-    return onnx_input
+    """
+    Preprocess the onnx input.
+    """
+    _ = kwargs
+    return onnx_input
fastembed/text/onnx_embedding.py (1)
213-229: Optional: sync docstrings with new type hints
The constructor docstring still refers to Optional[...] while the signature now uses str | None, int | None, Sequence[OnnxProvider] | None, etc. Not a blocker, but updating the prose to match the new annotation style would avoid minor confusion for users reading the docs.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (43)
- .github/workflows/python-publish.yml (1 hunks)
- .github/workflows/python-tests.yml (0 hunks)
- .github/workflows/type-checkers.yml (1 hunks)
- fastembed/common/model_description.py (1 hunks)
- fastembed/common/model_management.py (4 hunks)
- fastembed/common/onnx_model.py (4 hunks)
- fastembed/common/types.py (1 hunks)
- fastembed/embedding.py (2 hunks)
- fastembed/image/image_embedding.py (4 hunks)
- fastembed/image/image_embedding_base.py (2 hunks)
- fastembed/image/onnx_embedding.py (4 hunks)
- fastembed/image/onnx_image_model.py (4 hunks)
- fastembed/image/transform/functional.py (5 hunks)
- fastembed/image/transform/operators.py (7 hunks)
- fastembed/late_interaction/colbert.py (6 hunks)
- fastembed/late_interaction/late_interaction_embedding_base.py (3 hunks)
- fastembed/late_interaction/late_interaction_text_embedding.py (5 hunks)
- fastembed/late_interaction/token_embeddings.py (2 hunks)
- fastembed/late_interaction_multimodal/colpali.py (5 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (5 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (3 hunks)
- fastembed/parallel_processor.py (4 hunks)
- fastembed/postprocess/muvera.py (1 hunks)
- fastembed/rerank/cross_encoder/custom_text_cross_encoder.py (2 hunks)
- fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (4 hunks)
- fastembed/rerank/cross_encoder/onnx_text_model.py (4 hunks)
- fastembed/rerank/cross_encoder/text_cross_encoder.py (4 hunks)
- fastembed/rerank/cross_encoder/text_cross_encoder_base.py (3 hunks)
- fastembed/sparse/bm25.py (5 hunks)
- fastembed/sparse/bm42.py (5 hunks)
- fastembed/sparse/minicoil.py (5 hunks)
- fastembed/sparse/sparse_embedding_base.py (4 hunks)
- fastembed/sparse/sparse_text_embedding.py (4 hunks)
- fastembed/sparse/splade_pp.py (4 hunks)
- fastembed/sparse/utils/sparse_vectors_converter.py (7 hunks)
- fastembed/text/custom_text_embedding.py (2 hunks)
- fastembed/text/multitask_embedding.py (4 hunks)
- fastembed/text/onnx_embedding.py (4 hunks)
- fastembed/text/onnx_text_model.py (5 hunks)
- fastembed/text/text_embedding.py (6 hunks)
- fastembed/text/text_embedding_base.py (3 hunks)
- pyproject.toml (1 hunks)
- tests/utils.py (2 hunks)
💤 Files with no reviewable changes (1)
- .github/workflows/python-tests.yml
🚧 Files skipped from review as they are similar to previous changes (17)
- .github/workflows/python-publish.yml
- fastembed/parallel_processor.py
- fastembed/common/model_description.py
- .github/workflows/type-checkers.yml
- fastembed/postprocess/muvera.py
- fastembed/sparse/minicoil.py
- fastembed/image/transform/functional.py
- fastembed/common/types.py
- fastembed/sparse/sparse_embedding_base.py
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
- fastembed/late_interaction/late_interaction_embedding_base.py
- fastembed/late_interaction/token_embeddings.py
- fastembed/image/image_embedding.py
- fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py
- fastembed/text/custom_text_embedding.py
- tests/utils.py
- fastembed/text/text_embedding_base.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
📚 Learning: 2025-11-12T10:48:30.978Z
Applied to files:
- fastembed/sparse/splade_pp.py
- fastembed/sparse/sparse_text_embedding.py
- fastembed/image/onnx_embedding.py
- fastembed/text/onnx_embedding.py
- fastembed/sparse/bm42.py
- fastembed/image/transform/operators.py
- fastembed/text/text_embedding.py
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py
- fastembed/sparse/bm25.py
🧬 Code graph analysis (8)
fastembed/sparse/sparse_text_embedding.py (9)
- fastembed/late_interaction/colbert.py (1): query_embed(244-254)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): query_embed(46-61)
- fastembed/sparse/bm25.py (1): query_embed(305-321)
- fastembed/sparse/bm42.py (1): query_embed(328-347)
- fastembed/sparse/minicoil.py (1): query_embed(231-249)
- fastembed/sparse/sparse_embedding_base.py (2): query_embed(71-86), SparseEmbedding(13-31)
- fastembed/text/multitask_embedding.py (1): query_embed(84-85)
- fastembed/text/text_embedding.py (1): query_embed(189-200)
- fastembed/text/text_embedding_base.py (1): query_embed(46-61)
fastembed/late_interaction/late_interaction_text_embedding.py (6)
- fastembed/late_interaction/late_interaction_embedding_base.py (2): embedding_size(69-71), query_embed(46-61)
- fastembed/text/text_embedding.py (2): embedding_size(132-136), query_embed(189-200)
- fastembed/late_interaction/colbert.py (1): query_embed(244-254)
- fastembed/sparse/bm25.py (1): query_embed(305-321)
- fastembed/sparse/bm42.py (1): query_embed(328-347)
- fastembed/sparse/sparse_embedding_base.py (1): query_embed(71-86)
fastembed/late_interaction/colbert.py (5)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): query_embed(46-61)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): query_embed(141-153)
- fastembed/sparse/sparse_embedding_base.py (1): query_embed(71-86)
- fastembed/text/multitask_embedding.py (1): query_embed(84-85)
- fastembed/text/text_embedding_base.py (1): query_embed(46-61)
fastembed/sparse/utils/sparse_vectors_converter.py (2)
- fastembed/common/utils.py (2): get_all_punctuation(62-65), remove_non_alphanumeric(68-69)
- fastembed/sparse/sparse_embedding_base.py (1): SparseEmbedding(13-31)
fastembed/image/transform/operators.py (1)
- fastembed/image/transform/functional.py (1): pil2ndarray(118-121)
fastembed/image/onnx_image_model.py (1)
- fastembed/image/transform/operators.py (1): Compose(85-269)
fastembed/image/image_embedding_base.py (1)
- fastembed/image/onnx_embedding.py (1): embed(151-187)
fastembed/sparse/bm25.py (5)
- fastembed/sparse/bm42.py (1): query_embed(328-347)
- fastembed/sparse/minicoil.py (1): query_embed(231-249)
- fastembed/sparse/sparse_embedding_base.py (2): query_embed(71-86), SparseEmbedding(13-31)
- fastembed/sparse/sparse_text_embedding.py (1): query_embed(118-128)
- fastembed/text/text_embedding.py (1): query_embed(189-200)
🪛 Ruff (0.14.7)
fastembed/sparse/bm42.py
328-328: Unused method argument: kwargs
(ARG002)
fastembed/late_interaction/colbert.py
244-244: Unused method argument: kwargs
(ARG002)
fastembed/sparse/bm25.py
305-305: Unused method argument: kwargs
(ARG002)
fastembed/text/onnx_text_model.py
42-42: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (33)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (4)
1-1: LGTM! Import cleanup aligns with PEP 604 syntax. The removal of `Optional` and `Union` from imports is correct, as all type hints in the file now use the Python 3.10+ union operator (`|`).
11-22: LGTM! Type hints correctly updated to PEP 604 syntax. The conversions of `Optional[str]` → `str | None` and `Optional[int]` → `int | None` are correct and maintain equivalent type semantics with no runtime behavior changes.
24-30: LGTM! Method signature correctly updated to PEP 604 syntax. The type hint updates (`Union[str, Iterable[str]]` → `str | Iterable[str]` and `Optional[int]` → `int | None`) are correct and maintain equivalent semantics.
48-54: LGTM! Method signature correctly updated to PEP 604 syntax. The type hint updates (`Union[ImageInput, Iterable[ImageInput]]` → `ImageInput | Iterable[ImageInput]` and `Optional[int]` → `int | None`) are correct and maintain equivalent semantics.
fastembed/common/model_management.py (1)
8-8: LGTM! Type annotation modernization is correct. The migration to Python 3.10+ union syntax is consistent across import removal, internal helper signatures, and the public `download_model` API. Runtime behavior remains unchanged.
Also applies to: 183-184, 196-196, 398-398
pyproject.toml (1)
14-14: Python version constraint updated correctly. The minimum Python version is now 3.10.0, aligning with the PEP 604 union syntax adoption throughout the codebase.
fastembed/embedding.py (1)
1-1: LGTM! Consistent type annotation updates. The migration to union syntax is correct, and the deprecated `JinaEmbedding` wrapper maintains API compatibility.
Also applies to: 20-21
fastembed/sparse/splade_pp.py (1)
1-1: LGTM! Type annotations modernized correctly. The migration to PEP 604 union syntax is consistent across constructor parameters, instance attributes, and method signatures. Runtime behavior is preserved.
Also applies to: 68-75, 109-109, 142-144
fastembed/sparse/bm25.py (1)
5-5: LGTM! Type annotations updated consistently. The migration to union syntax is correct across all method signatures. The unused `kwargs` parameter in `query_embed` (flagged by static analysis) is intentional for interface consistency with other sparse embedding implementations.
Also applies to: 94-94, 101-101, 161-165, 208-210, 305-305
fastembed/sparse/sparse_text_embedding.py (1)
1-1: LGTM! Wrapper class type annotations updated correctly. The migration to PEP 604 union syntax is consistent across the facade class. All delegation to underlying models remains intact.
Also applies to: 56-60, 96-98, 118-118
fastembed/common/onnx_model.py (2)
22-25: Correct handling of dataclass NDArray fields. The `attention_mask` field correctly uses `Optional` syntax instead of `|` due to runtime evaluation constraints with NDArray types, with an explanatory comment. The `input_ids` field uses the same pattern, and the comment on line 23 applies to both fields.
Based on learnings from this codebase.
48-49: LGTM! Regular class attributes updated correctly. The `model` and `tokenizer` attributes, along with `_load_onnx_model` parameters, correctly use PEP 604 union syntax since these are not dataclass fields with NDArray types.
Also applies to: 63-67
fastembed/image/onnx_embedding.py (1)
1-1: LGTM! Image embedding type annotations modernized. The migration to PEP 604 union syntax is consistent with the rest of the codebase. Constructor parameters, instance attributes, and method signatures all updated correctly.
Also applies to: 66-73, 108-108, 153-155
fastembed/late_interaction/late_interaction_text_embedding.py (1)
1-1: LGTM! Typing modernization applied consistently. The file has been successfully updated to use Python 3.10+ union syntax (PEP 604) across all public method signatures and type annotations. The changes are consistent with the PR objective and maintain backward compatibility at runtime.
Also applies to: 54-60, 104-104, 119-121, 141-141
fastembed/late_interaction/colbert.py (1)
2-2: LGTM! Typing modernization applied consistently. The type annotations have been successfully updated to Python 3.10+ union syntax across the class constructor, instance attributes, and method signatures. The changes maintain API consistency with the broader codebase.
Also applies to: 111-118, 153-153, 169-173, 209-211
fastembed/sparse/bm42.py (1)
4-4: LGTM! Typing modernization applied consistently. The type annotations have been successfully migrated to Python 3.10+ union syntax throughout the class, including constructor parameters, instance attributes, and method signatures.
Also applies to: 68-76, 113-113, 285-287
fastembed/late_interaction_multimodal/colpali.py (1)
1-1: LGTM! Typing modernization applied consistently. The type annotations have been successfully updated to Python 3.10+ union syntax across the ColPali class, including constructor parameters, instance attributes, and both text and image embedding methods. The changes align with the broader modernization effort across the codebase.
Also applies to: 49-56, 90-90, 213-215, 249-251
fastembed/image/transform/operators.py (1)
1-1: LGTM! Typing modernization applied consistently. All type annotations have been successfully updated to Python 3.10+ union syntax across the transform operators. The changes include:
- Method signatures using `X | Y` instead of `Union[X, Y]`
- Optional parameters using `X | None` instead of `Optional[X]`
The modifications maintain the existing functionality while modernizing the type hints.
Also applies to: 18-18, 36-36, 47-47, 66-66, 74-74, 90-91, 256-256
fastembed/sparse/utils/sparse_vectors_converter.py (2)
4-6: LGTM! Typing modernization applied consistently. The file has been successfully updated to use Python 3.10+ built-in generic types (`list`, `dict`, `set`) instead of their `typing` module equivalents. The addition of the `INT32_MAX` constant improves code readability and maintainability.
Note: The `WordEmbedding` dataclass fields use plain `list` types (not numpy `NDArray`), so the PEP 604 union syntax is appropriate here.
Also applies to: 12-12, 18-18, 21-21, 27-27, 60-60, 67-68, 87-87, 129-129, 158-159, 173-173, 208-208, 216-217
31-31: No action required: the `avg_len` parameter is correctly implemented. The parameter addition is backward compatible (keyword argument with default value), and the only existing caller in `minicoil.py` already passes `avg_len` through correctly. The BM25 formula at line 56 properly uses `self.avg_len` for document length normalization, and the default value of 150.0 is consistent across both `MiniCOIL` and `SparseVectorConverter` with appropriate documentation in the class docstring.
fastembed/rerank/cross_encoder/custom_text_cross_encoder.py (1)
1-1: LGTM! Typing modernization applied consistently. The constructor parameters have been successfully updated to use Python 3.10+ union syntax, maintaining consistency with the broader codebase modernization effort.
Also applies to: 14-21
fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)
1-1: LGTM! Typing modernization applied consistently. The base class has been successfully updated to use Python 3.10+ union syntax for optional parameters in both the constructor and the `rerank_pairs` method. These changes align with the broader typing modernization across the codebase.
Also applies to: 11-12, 44-44
fastembed/text/text_embedding.py (1)
1-6: LGTM! Clean migration to PEP 604 union syntax. The removal of `Optional` and `Union` from imports and the consistent use of `|` syntax throughout the file aligns well with the Python 3.10+ target.
fastembed/image/image_embedding_base.py (1)
1-29: LGTM! Type annotations properly updated. The migration to `str | None`, `int | None`, and `ImageInput | Iterable[ImageInput]` is correct for Python 3.10+. The instance attribute `_embedding_size` on line 21 is fine with `|` syntax since this is a regular class, not a dataclass.
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
1-4: LGTM! Consistent typing updates. The import cleanup and migration to PEP 604 union syntax (`str | None`, `Sequence[OnnxProvider] | None`, etc.) is consistent with the rest of the codebase migration.
fastembed/rerank/cross_encoder/onnx_text_model.py (2)
1-36: LGTM! Type annotations properly migrated. The class attribute `ONNX_OUTPUT_NAMES: list[str] | None` and method parameters are correctly updated to PEP 604 syntax. This is a regular class (not a dataclass), so the `|` operator works correctly at runtime.
89-103: LGTM! Method signature updates are consistent. The `_rerank_pairs` method parameters are correctly migrated to use `int | None`, `Sequence[OnnxProvider] | None`, `list[int] | None`, `str | None`, and `dict[str, Any] | None` syntax.
fastembed/text/multitask_embedding.py (3)
48-50: LGTM! Clear default task handling. The initialization correctly stores the provided `task_id` or falls back to `PASSAGE_TASK`. The type annotation `Task | int` for `default_task_id` accurately reflects the possible values.
71-85: LGTM! Proper task_id propagation. The `embed` method correctly resolves `task_id` from the parameter or `default_task_id` before delegation, ensuring `_preprocess_onnx_input` always receives a valid value. The `query_embed` method explicitly passes `task_id=self.QUERY_TASK`.
60-69: Good defensive validation added. The `ValueError` on line 67 provides a clear fail-fast mechanism if `task_id` is unexpectedly `None`. This validation is justified: all public entry points (`embed`, `query_embed`, `passage_embed`) ensure `task_id` is set before reaching this method. In the non-parallel path, `task_id` flows through `**kwargs` to `onnx_embed()` and then to `_preprocess_onnx_input()`. In the parallel path, `JinaEmbeddingV3Worker.process()` explicitly passes `task_id=self.model.default_task_id` to `onnx_embed()`. The check safeguards against edge cases where the method might be called directly or through an unexpected code path without proper `task_id` initialization.
fastembed/text/onnx_text_model.py (1)
4-4: PEP 604 typing migration here looks consistent and non‑breaking. The updated imports and annotations (`list[str] | None`, `Tokenizer | None`, `int | None`, `Sequence[OnnxProvider] | None`, `str | Iterable[str]`, etc.) align with the project‑wide 3.10+ move and keep the runtime behavior intact. The `documents: str | Iterable[str]` change is already handled correctly via the explicit `isinstance(documents, str)` check before batching.
Also applies to: 18-18, 36-39, 49-58, 103-115
fastembed/text/onnx_embedding.py (1)
1-1: OnnxTextEmbedding PEP 604 migration is consistent with existing behavior. The updated annotations for `cache_dir`, `threads`, `providers`, `device_ids`, `device_id`, `specific_model_path`, and `documents`/`parallel` keep the previous semantics, and the `device_id` selection logic (explicit `device_id` wins, else first from `device_ids`) is unchanged. `self.device_id: int | None = None` just makes the intent explicit.
Also applies to: 199-211, 241-247, 261-266
fastembed/image/onnx_image_model.py (1)
5-5: Image ONNX model typing updates look correct and behavior‑preserving. The new annotations (`Compose | None`, `int | None`, `Sequence[OnnxProvider] | None`, `ImageInput | Iterable[ImageInput]`, `list[int] | None`, etc.) match existing usage: `self.processor` is asserted non‑None before use, image inputs are normalized to lists before batching, and parallel/extra session options are threaded through unchanged. I don’t see any functional changes introduced here.
Also applies to: 38-41, 50-59, 92-105
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- fastembed/common/utils.py (2 hunks)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (5 hunks)
- fastembed/text/custom_text_embedding.py (3 hunks)
- tests/test_multi_gpu.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- fastembed/text/custom_text_embedding.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
🧬 Code graph analysis (1)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
- fastembed/image/transform/operators.py (1): Compose(85-269)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (10)
fastembed/common/utils.py (1)
8-8: LGTM! Type annotation modernization is correct. The removal of `Optional` from imports and the update to PEP 604 union syntax (`str | None`) in the function signature are correct and consistent with the PR's objective to modernize the codebase for Python 3.10+.
Also applies to: 48-48
tests/test_multi_gpu.py (3)
2-2: LGTM! Import cleanup is appropriate. Removing the `Optional` and `Union` imports is correct since the type annotations now use PEP 604 union syntax (`|`) which doesn't require these imports.
17-17: LGTM! Type annotation correctly updated. The type hint `int | None` correctly replaces `Optional[int]` and matches the parametrized test values.
89-89: LGTM! Type annotation correctly updated. The type hint `list[int] | None` correctly replaces `Optional[list[int]]` and matches the parametrized test values.
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (6)
5-5: LGTM: Import statement correctly updated. The removal of `Optional` and `Union` from imports is appropriate since PEP 604 pipe operator syntax (`|`) is now used throughout the file.
21-21: LGTM: Class attribute annotation modernized. The type annotation correctly uses PEP 604 syntax.
25-26: LGTM: Instance attribute annotations modernized. Both attribute annotations correctly use PEP 604 syntax and maintain consistency with the existing None defaults.
59-68: LGTM: Method signature modernized. All parameter type annotations correctly use PEP 604 syntax. The parameter defaults and requirements remain unchanged from the previous implementation.
115-129: LGTM: Method signature modernized. All parameter type annotations in `_embed_documents` correctly use PEP 604 syntax, with union types and optional parameters properly converted.
186-200: LGTM: Method signature modernized. All parameter type annotations in `_embed_images` correctly use PEP 604 syntax, with union types and optional parameters properly converted.
     "device_ids,parallel", [(None, None), (None, 2), ([1], None), ([1], 1), ([1], 2), ([0, 1], 2)]
 )
-def test_multi_gpu_parallel_inference(device_ids: Optional[list[int]], parallel: int) -> None:
+def test_multi_gpu_parallel_inference(device_ids: list[int] | None, parallel: int) -> None:
Fix type annotation mismatch for parallel parameter.
The parallel parameter is typed as int, but the parametrize decorator passes None in multiple test cases (e.g., (None, None), ([1], None)). This type annotation mismatch will fail static type checking and is inconsistent with the test data.
Apply this diff to fix the type annotation:
-def test_multi_gpu_parallel_inference(device_ids: list[int] | None, parallel: int) -> None:
+def test_multi_gpu_parallel_inference(device_ids: list[int] | None, parallel: int | None) -> None:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
     "device_ids,parallel", [(None, None), (None, 2), ([1], None), ([1], 1), ([1], 2), ([0, 1], 2)]
 )
 def test_multi_gpu_parallel_inference(device_ids: list[int] | None, parallel: int | None) -> None:
🤖 Prompt for AI Agents
In tests/test_multi_gpu.py around lines 172 to 174, the test function signature
types `parallel` as int while the parametrize decorator supplies None for some
cases; update the function annotation to accept None (use int | None or
Optional[int]) so it matches the test data and static type checkers, i.e.,
change the signature to accept parallel: int | None and leave device_ids as
list[int] | None.
tbung
left a comment
dataclasses also support | None
@dataclass
class SparseEmbedding:
    values: NumpyArray
    indices: Union[NDArray[np.int64], NDArray[np.int32]]
Union
hf: Optional[str] = None
url: Optional[str] = None
Optional
requires_idf: Optional[bool] = None
vocab_size: Optional[int] = None
Optional
def embed(
    self,
    documents: Union[str, Iterable[str]],
Union
Actionable comments posted: 2
🧹 Nitpick comments (3)
fastembed/late_interaction/token_embeddings.py (1)
2-2: LGTM! Verify parent class signature consistency. The type annotation updates correctly implement PEP 604 union syntax, replacing `Union[str, Iterable[str]]` with `str | Iterable[str]` and `Optional[int]` with `int | None`. The import changes appropriately remove unused `Union` and `Optional` from the typing module.
Since this method overrides `OnnxTextEmbedding.embed` (line 71), verify the parent class has been updated for consistent union syntax across the codebase.
fastembed/common/model_description.py (1)
3-3: PEP 604 migration here looks good; consider tightening `tasks` type.
- Using `str | None`, `int | None`, and `bool | None` here is consistent with the rest of the PR and doesn’t change behavior.
- For `tasks: dict[str, Any] | None = field(default_factory=dict)`, the default is always a dict; `None` is only possible if explicitly passed. If you don’t expect `None` in practice, you could simplify this to `dict[str, Any]` to better reflect actual usage (or keep as-is if callers legitimately pass `None`).
Also applies to: 8-9, 36-37, 45-46
fastembed/common/types.py (1)
3-3: Type alias modernisation looks good; optional consistency tweak for `NumpyArray`.
- `PathInput`, `ImageInput`, and `OnnxProvider` using PEP 604 unions with `TypeAlias` are clear and align with the rest of the PR.
- `NumpyArray` now uses a `|`-based union of the supported dtypes, which keeps usage sites readable.
- For consistency with the other exported aliases, you might optionally declare `NumpyArray` as a `TypeAlias` as well:
-NumpyArray = (
+NumpyArray: TypeAlias = (
     NDArray[np.float64]
     | NDArray[np.float32]
     | NDArray[np.float16]
     | NDArray[np.int8]
     | NDArray[np.int64]
     | NDArray[np.int32]
 )
This is stylistic only; behavior is already correct.
Also applies to: 6-6, 9-10, 12-20
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- fastembed/common/model_description.py (2 hunks)
- fastembed/common/onnx_model.py (4 hunks)
- fastembed/common/types.py (1 hunks)
- fastembed/late_interaction/token_embeddings.py (2 hunks)
- fastembed/sparse/sparse_embedding_base.py (5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
📚 Learning: 2025-11-12T10:48:30.978Z
Applied to files:
- fastembed/sparse/sparse_embedding_base.py
- fastembed/common/types.py
🔇 Additional comments (4)
fastembed/sparse/sparse_embedding_base.py (3)
38-39: LGTM! The PEP 604 pipe syntax for optional parameters is correct and consistent with the PR objectives.
49-51: LGTM! The union syntax updates for method parameters are appropriate and align with Python 3.10+ conventions.
71-86: LGTM! The union syntax update and the concrete implementation that dispatches to `embed` for both string and iterable inputs are correct.
fastembed/common/onnx_model.py (1)
61-66: The `threads` parameter in `_load_onnx_model` and all its subclass overrides consistently has `int | None` without a default. All call sites throughout the codebase (fastembed/text/onnx_embedding.py, fastembed/image/onnx_embedding.py, fastembed/late_interaction/colbert.py, fastembed/late_interaction_multimodal/colpali.py, fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py, and sparse models) explicitly pass `threads=self.threads`. No callers rely on a missing default, so no breaking change is occurring. The concern about backward compatibility is not supported by the codebase.
Likely an incorrect or invalid review comment.
attention_mask: NDArray[np.int64] | None = None
input_ids: NDArray[np.int64] | None = None
Avoid NDArray[...] | None in dataclass fields; keep Union here.
For OnnxOutputContext, annotating attention_mask and input_ids as NDArray[np.int64] | None can trigger runtime issues because dataclasses evaluate annotations at class creation time and NDArray[...] does not reliably support the | operator in this codebase. Based on prior experience, these fields should keep typing.Union[...] instead of PEP 604 unions, with a short comment explaining why.
A possible adjustment:
-from numpy.typing import NDArray
+from numpy.typing import NDArray
+from typing import Union
@@
- attention_mask: NDArray[np.int64] | None = None
- input_ids: NDArray[np.int64] | None = None
+ # NOTE: Use Union instead of `|` with NDArray in dataclasses to avoid
+ # runtime annotation evaluation issues.
+ attention_mask: Union[NDArray[np.int64], None] = None
+    input_ids: Union[NDArray[np.int64], None] = None
Based on learnings, this keeps the modernisation elsewhere while avoiding the known NDArray/dataclass pitfall.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-attention_mask: NDArray[np.int64] | None = None
-input_ids: NDArray[np.int64] | None = None
+# NOTE: Use Union instead of `|` with NDArray in dataclasses to avoid
+# runtime annotation evaluation issues.
+attention_mask: Union[NDArray[np.int64], None] = None
+input_ids: Union[NDArray[np.int64], None] = None
🤖 Prompt for AI Agents
In fastembed/common/onnx_model.py around lines 22 to 23, the dataclass fields
use PEP 604 style unions (NDArray[np.int64] | None) which can break at dataclass
creation time for NDArray; change these annotations back to
typing.Union[NDArray[np.int64], None] (or Optional[NDArray[np.int64]]) and add a
short comment noting this workaround (dataclass/NDArray incompatibility) so
other maintainers understand why Union is used instead of the | operator.
@@ -1,5 +1,5 @@
 from dataclasses import dataclass
-from typing import Iterable, Optional, Union, Any
+from typing import Iterable, Any
Revert to Union syntax for NDArray types in dataclass fields.
Based on learnings, dataclasses evaluate annotations at runtime, and NDArray types don't support the __or__ operator required for PEP 604 pipe syntax. This will cause runtime errors.
You encountered this issue earlier in the PR review when you noted "dataclass complaints on this change."
Apply this diff to revert to Union syntax with an explanatory comment:
-from typing import Iterable, Any
+from typing import Iterable, Any, Union

 @dataclass
 class SparseEmbedding:
     values: NumpyArray
-    indices: NDArray[np.int64] | NDArray[np.int32]
+    indices: Union[NDArray[np.int64], NDArray[np.int32]]  # Union required: dataclasses can't handle | with NDArray
Based on learnings, this approach is already used correctly elsewhere in your PR (e.g., fastembed/common/onnx_model.py).
Also applies to: 15-15
🤖 Prompt for AI Agents
In fastembed/sparse/sparse_embedding_base.py around lines 2 and 15, the
dataclass annotations currently use PEP 604 pipe (|) syntax for NDArray types
which fails at runtime because dataclasses evaluate annotations and NDArray
types don't implement __or__; revert those annotations to typing.Union[...] form
(e.g., Union[NDArray, NoneType] or Union[NDArray, Iterable[NDArray]] as
appropriate) and add a brief comment noting that Union is required because
dataclasses evaluate annotations at runtime and NDArray does not support the
pipe operator.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fastembed/common/types.py(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
📚 Learning: 2025-11-12T10:48:30.978Z
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
Applied to files:
fastembed/common/types.py
🔇 Additional comments (1)
fastembed/common/types.py (1)
9-12: PathInput / ImageInput / OnnxProvider aliases look good
Using PEP 604 unions here is clean and consistent with the 3.10+ baseline; aliases are clear and match the existing semantics.
NumpyArray: TypeAlias = (
    NDArray[np.float64]
    | NDArray[np.float32]
    | NDArray[np.float16]
    | NDArray[np.int8]
    | NDArray[np.int64]
    | NDArray[np.int32]
)
🧩 Analysis chain
🏁 Script executed:
# First, find and examine the project's dependency specifications
find . -maxdepth 2 -type f \( -name "setup.py" -o -name "requirements.txt" -o -name "pyproject.toml" -o -name "setup.cfg" \) | head -20
Repository: qdrant/fastembed
Length of output: 76
🏁 Script executed:
# Examine the current state of fastembed/common/types.py
cat -n fastembed/common/types.py
Repository: qdrant/fastembed
Length of output: 652
🏁 Script executed:
# Search for usages of NumpyArray alias in the codebase
rg "NumpyArray" --type py -B 2 -A 2
Repository: qdrant/fastembed
Length of output: 50372
🏁 Script executed:
# Check if there are any recent changes or related issues with NDArray
git log --oneline --all --grep="NDArray\|NumpyArray" | head -20
Repository: qdrant/fastembed
Length of output: 42
🏁 Script executed:
# Check numpy version requirements
cat pyproject.toml | grep -A 10 -B 2 "numpy\|python"
Repository: qdrant/fastembed
Length of output: 146
🏁 Script executed:
# Try to import the types module to see if there are any immediate issues
python3 << 'EOF'
try:
    import sys
    sys.path.insert(0, '/repo')
    from fastembed.common.types import NumpyArray
    print("✓ NumpyArray imported successfully")
    print(f"NumpyArray = {NumpyArray}")
except Exception as e:
    print(f"✗ Import failed: {type(e).__name__}: {e}")
EOF
Repository: qdrant/fastembed
Length of output: 146
🏁 Script executed:
# Examine the sparse_embedding_base.py file where the prior issue occurred
head -30 fastembed/sparse/sparse_embedding_base.py
Repository: qdrant/fastembed
Length of output: 980
Use Union[...] instead of | operator for NumpyArray type alias
The current pipe operator syntax for NDArray[...] | NDArray[...] can cause issues at runtime when this type alias is evaluated in dataclass field annotations. NDArray types don't reliably support the __or__ operator across numpy versions, especially in dataclass contexts where annotations are evaluated at class definition time.
Switch to Union[...] syntax:
from typing import Union

NumpyArray: TypeAlias = Union[
    NDArray[np.float64],
    NDArray[np.float32],
    NDArray[np.float16],
    NDArray[np.int8],
    NDArray[np.int64],
    NDArray[np.int32],
]
Add a comment explaining why Union is preferred over the pipe operator for NDArray unions.
🤖 Prompt for AI Agents
In fastembed/common/types.py around lines 13 to 20, replace the pipe-operation
union of NDArray types with typing.Union: import Union from typing, change the
NumpyArray TypeAlias to use Union[...] with the listed NDArray[...] types, and
add a brief comment stating that Union is preferred because NDArray does not
reliably support the | operator across NumPy versions and dataclass annotation
evaluation; ensure imports are ordered and no runtime evaluation of the |
operator remains.
994cfee to
985d185
Compare
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
fastembed/sparse/sparse_embedding_base.py (1)
71-88: Fix token_count's Union name error and modernize its annotation
token_count uses Union[str, Iterable[str]] on line 88, but Union is not imported from typing. Since this method signature isn't constrained by dataclass limitations, update it to PEP 604 syntax to match the rest of the base class:
-    def token_count(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> int:
+    def token_count(self, texts: str | Iterable[str], **kwargs: Any) -> int:
fastembed/sparse/splade_pp.py (1)
1-59: Fix token_count annotation after dropping Union import
Union was removed from the imports, but token_count still uses it, causing a runtime NameError.
Align with the rest of the file and remove the Union usage:
 class SpladePP(SparseTextEmbeddingBase, OnnxTextModel[SparseEmbedding]):
     def token_count(
-        self, texts: Union[str, Iterable[str]], batch_size: int = 1024, **kwargs: Any
+        self, texts: str | Iterable[str], batch_size: int = 1024, **kwargs: Any
     ) -> int:
         return self._token_count(texts, batch_size=batch_size, **kwargs)
This keeps the public API shape while resolving the Union name error and modernizing to PEP 604 syntax.
fastembed/late_interaction/late_interaction_embedding_base.py (1)
73-79: Fix undefined Union in token_count signature.
The pipeline failure indicates Union is used on line 75 but not imported. Either import Union from typing or convert to pipe syntax for consistency with the rest of the file.
-    def token_count(
-        self,
-        texts: Union[str, Iterable[str]],
-        batch_size: int = 1024,
-        **kwargs: Any,
-    ) -> int:
+    def token_count(
+        self,
+        texts: str | Iterable[str],
+        batch_size: int = 1024,
+        **kwargs: Any,
+    ) -> int:
fastembed/sparse/sparse_text_embedding.py (1)
130-142: Fix undefined Union in token_count signature.
The pipeline failure indicates Union is used on line 131 but not imported. Convert to pipe syntax for consistency with the rest of the file.
     def token_count(
-        self, texts: Union[str, Iterable[str]], batch_size: int = 1024, **kwargs: Any
+        self, texts: str | Iterable[str], batch_size: int = 1024, **kwargs: Any
     ) -> int:
fastembed/text/text_embedding.py (1)
216-228: Fix undefined Union in token_count signature.
The pipeline failure indicates Union is used on line 217 but not imported. Convert to pipe syntax for consistency with the rest of the file.
     def token_count(
-        self, texts: Union[str, Iterable[str]], batch_size: int = 1024, **kwargs: Any
+        self, texts: str | Iterable[str], batch_size: int = 1024, **kwargs: Any
     ) -> int:
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
82-82: Critical: Missing Union import causes NameError.
Line 82 uses Union[str, Iterable[str]] but Union was removed from the imports. This is confirmed by the pipeline failure.
Apply this diff to fix:
-    def token_count(
-        self,
-        texts: Union[str, Iterable[str]],
-        **kwargs: Any,
-    ) -> int:
+    def token_count(
+        self,
+        texts: str | Iterable[str],
+        **kwargs: Any,
+    ) -> int:
fastembed/late_interaction/colbert.py (1)
99-106: Critical: Missing Union import causes NameError.
Line 101 uses Union[str, Iterable[str]] but Union was removed from the imports at line 2. This is confirmed by the pipeline failure.
Apply this diff to fix:
     def token_count(
         self,
-        texts: Union[str, Iterable[str]],
+        texts: str | Iterable[str],
         batch_size: int = 1024,
         is_doc: bool = True,
         include_extension: bool = False,
         **kwargs: Any,
     ) -> int:
♻️ Duplicate comments (3)
fastembed/common/onnx_model.py (1)
19-23: Revert NDArray dataclass fields to Union to avoid runtime issues
Using NDArray[np.int64] | None in a @dataclass can break at class creation time because the | operator is evaluated and NDArray[...] doesn’t reliably support __or__ in this codebase. Based on previous failures and recorded learnings, these fields should stick with typing.Union.
Suggested change:
-from typing import Any, Generic, Iterable, Sequence, Type, TypeVar
+from typing import Any, Generic, Iterable, Sequence, Type, TypeVar, Union
@@
 @dataclass
 class OnnxOutputContext:
     model_output: NumpyArray
-    attention_mask: NDArray[np.int64] | None = None
-    input_ids: NDArray[np.int64] | None = None
+    # NOTE: Use Union instead of `|` with NDArray in dataclasses to avoid
+    # runtime annotation evaluation issues.
+    attention_mask: Union[NDArray[np.int64], None] = None
+    input_ids: Union[NDArray[np.int64], None] = None
This matches the earlier guidance you added elsewhere for NDArray in dataclasses.
fastembed/sparse/sparse_embedding_base.py (1)
2-3: Use Union for NDArray in SparseEmbedding dataclass
SparseEmbedding.indices is a dataclass field with NDArray[np.int64] | NDArray[np.int32]. As captured in previous iterations, dataclasses evaluate annotations at runtime and NumPy’s NDArray typing objects don’t reliably support __or__, which can lead to runtime errors. Based on learnings, this field should keep typing.Union.
Suggested change:
-from typing import Iterable, Any
+from typing import Iterable, Any, Union
@@
 @dataclass
 class SparseEmbedding:
     values: NumpyArray
-    indices: NDArray[np.int64] | NDArray[np.int32]
+    indices: Union[NDArray[np.int64], NDArray[np.int32]]  # Union required: dataclasses can't handle `|` with NDArray
This keeps the PEP 604 style elsewhere while avoiding the known NDArray/dataclass pitfall.
Also applies to: 12-16
fastembed/common/types.py (1)
13-20: Use Union[...] instead of | operator for NumpyArray type alias.
Based on retrieved learnings: when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Consider adding a comment explaining this constraint.
+from typing import Any, TypeAlias, Union
-from typing import Any, TypeAlias
-NumpyArray: TypeAlias = (
-    NDArray[np.float64]
-    | NDArray[np.float32]
-    | NDArray[np.float16]
-    | NDArray[np.int8]
-    | NDArray[np.int64]
-    | NDArray[np.int32]
-)
+# Use Union instead of | for NDArray types because dataclasses evaluate
+# annotations at runtime and NDArray doesn't support __or__
+NumpyArray: TypeAlias = Union[
+    NDArray[np.float64],
+    NDArray[np.float32],
+    NDArray[np.float16],
+    NDArray[np.int8],
+    NDArray[np.int64],
+    NDArray[np.int32],
+]
🧹 Nitpick comments (11)
fastembed/sparse/utils/sparse_vectors_converter.py (4)
4-12: Hash-space mapping for OOV tokens looks correct; consider guarding against extreme shift values
The combination of GAP, unknown_words_shift = ((vocab_size * embedding_size) // GAP + 2) * GAP, and INT32_MAX keeps OOV indices in [unknown_words_shift, INT32_MAX - 1] and avoids overlap with in-vocab indices under realistic vocab/embedding sizes. That said, if vocab_size * embedding_size ever became large enough that shift >= INT32_MAX, range_size would be non‑positive and token_hash % range_size would fail.
A small defensive check in unkn_word_token_id would turn that into an explicit, easier‑to‑debug error instead of a cryptic crash:
     @classmethod
     def unkn_word_token_id(
         cls, word: str, shift: int
     ) -> int:  # 2-3 words can collide in 1 index with this mapping, not considering mm3 collisions
-        token_hash = abs(mmh3.hash(word))
-
-        range_size = INT32_MAX - shift
+        if shift >= INT32_MAX:
+            raise ValueError(
+                f"shift={shift} must be < INT32_MAX={INT32_MAX} to keep OOV indices in int32 range"
+            )
+
+        token_hash = abs(mmh3.hash(word))
+        range_size = INT32_MAX - shift
         remapped_hash = shift + (token_hash % range_size)
Not urgent, but it future‑proofs this mapping against pathological configurations.
Also applies to: 44-52, 173-200
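The guard suggested above can be tried in isolation with a minimal sketch. Several things here are assumptions made only for illustration: the value of GAP is not given in the review, zlib.crc32 stands in for abs(mmh3.hash(word)) so the snippet needs only the standard library, and the functions are free functions rather than class methods.

```python
import zlib

INT32_MAX = 2**31 - 1
GAP = 32_000  # assumed value, for illustration only


def unknown_words_shift(vocab_size: int, embedding_size: int, gap: int = GAP) -> int:
    # Same formula as quoted in the review comment.
    return ((vocab_size * embedding_size) // gap + 2) * gap


def unkn_word_token_id(word: str, shift: int) -> int:
    # Explicit guard: without it, a non-positive range_size would make the
    # modulo below fail or produce indices outside the intended range.
    if shift >= INT32_MAX:
        raise ValueError(f"shift={shift} must be < INT32_MAX={INT32_MAX}")
    # Stand-in hash (assumption): crc32 is non-negative, like abs(mmh3.hash(word)).
    token_hash = zlib.crc32(word.encode("utf-8"))
    range_size = INT32_MAX - shift
    return shift + (token_hash % range_size)


shift = unknown_words_shift(vocab_size=30_000, embedding_size=4)
token_id = unkn_word_token_id("embedding", shift)
```

With these inputs the remapped index always lands in [shift, INT32_MAX - 1], and passing shift=INT32_MAX raises ValueError instead of failing inside the modulo.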
59-65: Minor micro-optimization in normalize_vector
The logic and new type hints are correct. If you want a tiny efficiency/readability win, you can avoid allocating an intermediate list:
-        norm = sum([x**2 for x in vector]) ** 0.5
+        norm = sum(x * x for x in vector) ** 0.5
Behavior is identical.
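A small standalone sketch of the generator-based norm follows. The function body is hypothetical (the real normalize_vector is not shown in this review), and the zero-vector handling in particular is an assumption.

```python
import math


def normalize_vector(vector: list[float]) -> list[float]:
    # Generator expression: sum() consumes values one by one,
    # so no intermediate list is allocated.
    norm = sum(x * x for x in vector) ** 0.5
    if norm == 0.0:
        return vector  # assumed handling of the zero vector
    return [x / norm for x in vector]


unit = normalize_vector([3.0, 4.0])
unit_length = math.sqrt(sum(x * x for x in unit))
```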
66-88: clean_words typing matches actual usage; consider aligning docstring with dataclass-based input
The updated type hints (dict[str, WordEmbedding] and new_sentence_embedding: dict[str, WordEmbedding]) correctly reflect that the values are WordEmbedding instances, and the logic around stemming / merging embeddings is unchanged.
The docstring examples still show raw dicts as values; if those are no longer accepted in practice, updating the examples to use WordEmbedding (or noting that a conversion happens earlier in the pipeline) would avoid confusion for users of this API.
Also applies to: 97-125
206-244: Check intended semantics of word_id >= 0 in query path vs > 0 in training path
In embedding_to_vector, known tokens use the dense miniCOIL embedding only when word_id > 0, while in embedding_to_vector_query the condition is word_id >= 0. If word_id == 0 is ever used as an UNK/sentinel, these two branches would diverge:
- embedding_to_vector: word_id == 0 → treated as OOV, goes through unkn_word_token_id.
- embedding_to_vector_query: word_id == 0 → treated as known, mapped to indices starting at 0 * embedding_size.
If the upstream pipeline guarantees word_id is either > 0 or < 0 (never 0), this is a non-issue, but the inconsistency can be surprising. For clarity and future safety you might want to standardize on the same predicate as embedding_to_vector:
-            if word_id >= 0:  # miniCOIL starts with ID 1
+            if word_id > 0:  # miniCOIL starts with ID 1
Please double-check which behavior is actually desired in your data.
pyproject.toml (1)
14-35: Python >=3.10 constraint matches the PEP 604 refactor
The python = ">=3.10.0" constraint lines up with the type-annotation modernization. Note that the pillow entry for python = "<3.10" is now unreachable; you can keep it for historical clarity or drop it in a follow‑up cleanup.
fastembed/sparse/bm25.py (1)
314-330: Optionally silence the unused kwargs warning in query_embed
Ruff flags kwargs as unused here. If you want to keep **kwargs for API compatibility, you can explicitly ignore it:
-    def query_embed(self, query: str | Iterable[str], **kwargs: Any) -> Iterable[SparseEmbedding]:
+    def query_embed(self, query: str | Iterable[str], **kwargs: Any) -> Iterable[SparseEmbedding]:
@@
-        if isinstance(query, str):
+        # Accept **kwargs for API compatibility; currently unused.
+        del kwargs
+        if isinstance(query, str):
Alternatively, adding # noqa: ARG002 on the function definition line would also address the lint without changing behavior.
fastembed/text/onnx_text_model.py (1)
41-47: Silence Ruff ARG002 by marking kwargs as intentionally unused.
_preprocess_onnx_input must accept **kwargs to match the base interface, but it doesn't use them. Renaming the parameter makes this explicit and should satisfy ARG002 without changing behavior.
-    def _preprocess_onnx_input(
-        self, onnx_input: dict[str, NumpyArray], **kwargs: Any
-    ) -> dict[str, NumpyArray | NDArray[np.int64]]:
+    def _preprocess_onnx_input(
+        self, onnx_input: dict[str, NumpyArray], **_kwargs: Any
+    ) -> dict[str, NumpyArray | NDArray[np.int64]]:
fastembed/sparse/bm42.py (1)
328-348: Mark kwargs as intentionally unused in query_embed.
query_embed accepts **kwargs to conform to the base API but doesn’t use them, which triggers Ruff ARG002. Renaming the parameter is a low-noise way to document this and quiet the warning.
-    def query_embed(self, query: str | Iterable[str], **kwargs: Any) -> Iterable[SparseEmbedding]:
+    def query_embed(self, query: str | Iterable[str], **_kwargs: Any) -> Iterable[SparseEmbedding]:
fastembed/image/transform/operators.py (1)
1-92: Typing updates across image transforms are consistent with PEP 604 and current usage.
The new union annotations (float | list[float], int | tuple[int, int], Image.Image | NumpyArray, str | int | tuple[int, ...], and list[Image.Image] | list[NumpyArray]) match how these transforms are actually called via Compose. _interpolation_resolver(resample: str | None) also aligns with the interpolation config handling.
If you ever want stricter typing, Transform could be made generic over input/output types instead of using list[Any], but that’s not necessary for this PR.
Also applies to: 255-269
tests/utils.py (1)
tests/utils.py (1)
11-20: Aligndelete_model_cachedocstring with updated annotation.The function now takes
model_dir: str | Path, but the docstring still mentionsUnion[str, Path]. For clarity and consistency, update the docstring to match the annotation.- model_dir (Union[str, Path]): The path to the model cache directory. + model_dir (str | Path): The path to the model cache directory.fastembed/common/model_description.py (1)
36-37: Inconsistent default for tasks field.
The type annotation dict[str, Any] | None suggests None is a valid value, but default_factory=dict means the default is an empty dict {}, not None. This inconsistency could cause confusion. Consider either:
- Using tasks: dict[str, Any] = field(default_factory=dict) if None is never valid
- Using tasks: dict[str, Any] | None = None if None is a valid state
 @dataclass(frozen=True)
 class DenseModelDescription(BaseModelDescription):
     dim: int | None = None
-    tasks: dict[str, Any] | None = field(default_factory=dict)
+    tasks: dict[str, Any] = field(default_factory=dict)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (46)
- .github/workflows/python-publish.yml (1 hunks)
- .github/workflows/python-tests.yml (0 hunks)
- .github/workflows/type-checkers.yml (1 hunks)
- fastembed/common/model_description.py (2 hunks)
- fastembed/common/model_management.py (4 hunks)
- fastembed/common/onnx_model.py (4 hunks)
- fastembed/common/types.py (1 hunks)
- fastembed/common/utils.py (2 hunks)
- fastembed/embedding.py (2 hunks)
- fastembed/image/image_embedding.py (4 hunks)
- fastembed/image/image_embedding_base.py (2 hunks)
- fastembed/image/onnx_embedding.py (4 hunks)
- fastembed/image/onnx_image_model.py (4 hunks)
- fastembed/image/transform/functional.py (5 hunks)
- fastembed/image/transform/operators.py (7 hunks)
- fastembed/late_interaction/colbert.py (6 hunks)
- fastembed/late_interaction/late_interaction_embedding_base.py (3 hunks)
- fastembed/late_interaction/late_interaction_text_embedding.py (5 hunks)
- fastembed/late_interaction/token_embeddings.py (2 hunks)
- fastembed/late_interaction_multimodal/colpali.py (5 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (5 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (3 hunks)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (5 hunks)
- fastembed/parallel_processor.py (4 hunks)
- fastembed/postprocess/muvera.py (1 hunks)
- fastembed/rerank/cross_encoder/custom_text_cross_encoder.py (2 hunks)
- fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (4 hunks)
- fastembed/rerank/cross_encoder/onnx_text_model.py (4 hunks)
- fastembed/rerank/cross_encoder/text_cross_encoder.py (4 hunks)
- fastembed/rerank/cross_encoder/text_cross_encoder_base.py (3 hunks)
- fastembed/sparse/bm25.py (5 hunks)
- fastembed/sparse/bm42.py (5 hunks)
- fastembed/sparse/minicoil.py (5 hunks)
- fastembed/sparse/sparse_embedding_base.py (5 hunks)
- fastembed/sparse/sparse_text_embedding.py (4 hunks)
- fastembed/sparse/splade_pp.py (4 hunks)
- fastembed/sparse/utils/sparse_vectors_converter.py (7 hunks)
- fastembed/text/custom_text_embedding.py (3 hunks)
- fastembed/text/multitask_embedding.py (4 hunks)
- fastembed/text/onnx_embedding.py (4 hunks)
- fastembed/text/onnx_text_model.py (5 hunks)
- fastembed/text/text_embedding.py (6 hunks)
- fastembed/text/text_embedding_base.py (3 hunks)
- pyproject.toml (1 hunks)
- tests/test_multi_gpu.py (4 hunks)
- tests/utils.py (2 hunks)
💤 Files with no reviewable changes (1)
- .github/workflows/python-tests.yml
🚧 Files skipped from review as they are similar to previous changes (14)
- .github/workflows/python-publish.yml
- tests/test_multi_gpu.py
- .github/workflows/type-checkers.yml
- fastembed/image/image_embedding.py
- fastembed/text/multitask_embedding.py
- fastembed/text/text_embedding_base.py
- fastembed/common/model_management.py
- fastembed/image/onnx_embedding.py
- fastembed/embedding.py
- fastembed/late_interaction/late_interaction_text_embedding.py
- fastembed/text/custom_text_embedding.py
- fastembed/late_interaction_multimodal/colpali.py
- fastembed/sparse/minicoil.py
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
📚 Learning: 2025-11-12T10:48:30.978Z
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
Applied to files:
- fastembed/postprocess/muvera.py
- fastembed/common/types.py
- fastembed/sparse/splade_pp.py
- fastembed/sparse/bm42.py
- fastembed/image/transform/functional.py
- fastembed/text/onnx_embedding.py
- fastembed/image/transform/operators.py
- fastembed/common/onnx_model.py
- fastembed/text/text_embedding.py
- fastembed/sparse/sparse_embedding_base.py
🧬 Code graph analysis (11)
fastembed/postprocess/muvera.py (2)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
LateInteractionTextEmbeddingBase(8-80)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
LateInteractionMultimodalEmbeddingBase(10-86)
fastembed/image/image_embedding_base.py (7)
fastembed/sparse/sparse_embedding_base.py (1)
embed(47-54)
fastembed/image/image_embedding.py (1)
embed(114-135)
fastembed/image/onnx_embedding.py (1)
embed(151-187)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
embed(22-29)
fastembed/sparse/bm25.py (1)
embed(206-236)
fastembed/text/text_embedding.py (1)
embed(165-187)
fastembed/text/text_embedding_base.py (1)
embed(22-29)
fastembed/text/onnx_text_model.py (8)
fastembed/common/onnx_model.py (1)
_preprocess_onnx_input(49-55)
fastembed/image/onnx_embedding.py (1)
_preprocess_onnx_input(193-200)
fastembed/image/onnx_image_model.py (1)
_preprocess_onnx_input(42-48)
fastembed/late_interaction/colbert.py (1)
_preprocess_onnx_input(71-81)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
_preprocess_onnx_input(160-166)
fastembed/text/multitask_embedding.py (1)
_preprocess_onnx_input(60-69)
fastembed/text/onnx_embedding.py (1)
_preprocess_onnx_input(302-308)
fastembed/late_interaction/jina_colbert.py (1)
_preprocess_onnx_input(40-48)
fastembed/common/model_description.py (1)
fastembed/sparse/utils/vocab_resolver.py (1)
vocab_size(54-56)
fastembed/image/transform/operators.py (1)
fastembed/image/transform/functional.py (1)
pil2ndarray(118-121)
fastembed/sparse/sparse_text_embedding.py (4)
fastembed/sparse/sparse_embedding_base.py (2)
query_embed(71-86)
SparseEmbedding(13-31)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
query_embed(46-61)
fastembed/text/multitask_embedding.py (1)
query_embed(84-85)
fastembed/text/text_embedding_base.py (1)
query_embed(46-61)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
fastembed/image/transform/operators.py (1)
Compose(85-269)
fastembed/sparse/sparse_embedding_base.py (6)
fastembed/late_interaction/colbert.py (1)
query_embed(276-286)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
query_embed(141-153)
fastembed/sparse/bm25.py (1)
query_embed(314-330)
fastembed/sparse/bm42.py (1)
query_embed(328-347)
fastembed/sparse/minicoil.py (1)
query_embed(236-254)
fastembed/text/text_embedding_base.py (1)
query_embed(46-61)
fastembed/image/onnx_image_model.py (1)
fastembed/image/transform/operators.py (1)
Compose(85-269)
fastembed/late_interaction/colbert.py (9)
fastembed/sparse/sparse_embedding_base.py (1)
query_embed(71-86)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
query_embed(46-61)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
query_embed(141-153)
fastembed/sparse/bm25.py (1)
query_embed(314-330)
fastembed/sparse/bm42.py (1)
query_embed(328-347)
fastembed/sparse/minicoil.py (1)
query_embed(236-254)
fastembed/sparse/sparse_text_embedding.py (1)
query_embed(118-128)
fastembed/text/multitask_embedding.py (1)
query_embed(84-85)
fastembed/text/text_embedding.py (1)
query_embed(189-200)
fastembed/sparse/bm25.py (5)
fastembed/sparse/sparse_embedding_base.py (2)
query_embed(71-86)
SparseEmbedding(13-31)
fastembed/late_interaction/colbert.py (1)
query_embed(276-286)
fastembed/sparse/bm42.py (1)
query_embed(328-347)
fastembed/sparse/minicoil.py (1)
query_embed(236-254)
fastembed/text/text_embedding_base.py (1)
query_embed(46-61)
🪛 GitHub Actions: Tests
fastembed/late_interaction/late_interaction_embedding_base.py
[error] 75-75: NameError: name 'Union' is not defined. Likely missing 'from typing import Union, Iterable' import for type hints in LateInteractionTextEmbeddingBase.
🪛 GitHub Actions: type-checkers
fastembed/text/onnx_text_model.py
[error] 163-163: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
fastembed/sparse/splade_pp.py
[error] 57-57: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
fastembed/late_interaction/late_interaction_embedding_base.py
[error] 75-75: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py
[error] 82-82: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
fastembed/sparse/bm42.py
[error] 354-354: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
fastembed/text/onnx_embedding.py
[error] 335-335: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
fastembed/sparse/sparse_text_embedding.py
[error] 131-131: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
fastembed/text/text_embedding.py
[error] 217-217: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
fastembed/sparse/sparse_embedding_base.py
[error] 88-88: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
fastembed/late_interaction/colbert.py
[error] 101-101: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
fastembed/sparse/bm25.py
[error] 271-271: Name 'Union' is not defined [name-defined]. Did you forget to import it from typing? (Suggestion: 'from typing import Union')
🪛 Ruff (0.14.8)
fastembed/text/onnx_text_model.py
42-42: Unused method argument: kwargs
(ARG002)
fastembed/sparse/bm42.py
328-328: Unused method argument: kwargs
(ARG002)
fastembed/late_interaction/colbert.py
276-276: Unused method argument: kwargs
(ARG002)
fastembed/sparse/bm25.py
314-314: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (28)
fastembed/image/transform/functional.py (1)
16-16: LGTM! Clean migration to Python 3.10+ union syntax.
The type annotations have been correctly updated from Union[X, Y] to X | Y syntax across all functions. The changes are annotation-only with no logic modifications, and the semantic meaning is preserved.
Note: The retrieved learning about keeping Union syntax for NDArray types applies specifically to dataclass fields, not function parameters like these.
Also applies to: 65-66, 97-98, 118-118, 127-127
fastembed/sparse/utils/sparse_vectors_converter.py (3)
15-22: Dataclass annotations with built-in generics look good
Switching forms and embedding to list[str] / list[float] is idiomatic for Python ≥3.10 and matches actual usage in this module. There are no NDArray | ... unions here, so this dataclass avoids the PEP 604 + NDArray pitfall noted elsewhere in the codebase.
24-42: Constructor and BM25 TF logic are consistent and safe for typical inputs
- Typing stopwords as set[str] matches the use of the set union operator (punctuation | special_tokens | stopwords).
- Storing avg_len on the instance and using it in bm25_tf keeps the BM25 parameters clearly configurable.
- The BM25 formula itself is standard and only uses sentence_len after it’s computed as a positive sum of counts, so there’s no division‑by‑zero risk from sentence_len.
No changes needed here.
Also applies to: 54-57
127-133: embedding_to_vector typing and TF application look consistent
- The new annotations for sentence_embedding, indices, and values using built‑in generics are correct and match the actual structures built in the method.
- The computation of unknown_words_shift with GAP is consistent with the explanatory comment and ensures OOV buckets start beyond the vocab region for typical vocab/embedding sizes.
- BM25 TF using bm25_tf with the cleaned sentence length is correctly applied before normalizing the embedding and writing into the sparse vector.
Implementation and types look solid here.
Also applies to: 158-160, 173-205
fastembed/common/utils.py (1)
48-59: define_cache_dir annotation update looks good
Switching to cache_dir: str | None keeps the behavior and aligns with the project-wide PEP 604 move. No issues spotted.
fastembed/postprocess/muvera.py (1)
12-12: MultiVectorModel alias is consistent and clear
Using LateInteractionTextEmbeddingBase | LateInteractionMultimodalEmbeddingBase here matches the actual usage in from_multivector_model and the broader typing style.
fastembed/sparse/splade_pp.py (1)
70-81: PEP 604 updates in init and embed look consistent
The new str | None, int | None, Sequence[OnnxProvider] | None, and str | Iterable[str] annotations in __init__ and embed match the patterns used elsewhere (e.g., sparse bases, ONNX helpers). The device_id: int | None attribute also fits existing CUDA/provider branching.
Also applies to: 114-115, 145-150
fastembed/common/onnx_model.py (1)
45-66: Other PEP 604 updates in OnnxModel look safe
The changes to annotate model / tokenizer as ort.InferenceSession | None / Tokenizer | None, and _load_onnx_model parameters (threads: int | None, providers: Sequence[OnnxProvider] | None, device_id: int | None, extra_session_options: dict[str, Any] | None) preserve the existing control flow and checks (if threads is not None, if extra_session_options is not None, etc.).
fastembed/parallel_processor.py (1)
35-42: ParallelWorkerPool typing refinements look correct
The switch to dict[str, Any] | None, str | None, list[int] | None, and Queue | None / BaseValue | None matches actual usage (with initialization in start and runtime asserts) and doesn’t alter the multiprocessing behavior.
Also applies to: 91-111, 223-234
fastembed/text/onnx_text_model.py (1)
49-58: PEP 604-style optional parameters in `_load_onnx_model`/`_embed_documents` look consistent.
The updated signatures (`threads: int | None`, `providers: Sequence[OnnxProvider] | None`, `device_id: int | None`, `extra_session_options: dict[str, Any] | None`, `documents: str | Iterable[str]`, `parallel: int | None`, etc.) match the rest of the codebase and are forwarded correctly to `super()._load_onnx_model` and `ParallelWorkerPool` without changing behavior. Re-running type checkers after this migration should confirm callers are still compatible.
Also applies to: 103-116
fastembed/sparse/bm42.py (1)
65-78: Constructor, `embed`, and `query_embed` typings now align with project-wide PEP 604 style.
Using `cache_dir: str | None`, `threads: int | None`, `providers: Sequence[OnnxProvider] | None`, `device_ids: list[int] | None`, `device_id: int | None`, `specific_model_path: str | None`, and `documents`/`query: str | Iterable[str]` with `parallel: int | None` matches the other embedding classes and doesn’t change runtime behavior.
Based on learnings, the NDArray/`Union` dataclass constraint is not relevant here since `Bm42` is not a dataclass.
Also applies to: 283-288, 328-348
fastembed/text/onnx_embedding.py (1)
199-210: OnnxTextEmbedding’s constructor, `embed`, and worker wiring match the new typing style.
Using `cache_dir: str | None`, `threads: int | None`, `providers: Sequence[OnnxProvider] | None`, `device_ids: list[int] | None`, `device_id: int | None`, `specific_model_path: str | None`, and `documents: str | Iterable[str]`/`parallel: int | None` keeps this class aligned with other embedding implementations and with `OnnxTextEmbeddingWorker.init_embedding`.
Also applies to: 241-247, 261-296, 340-352
fastembed/late_interaction/token_embeddings.py (1)
2-2: TokenEmbeddingsModel.embed signature now matches the modernized base API.
Switching to `documents: str | Iterable[str]` and `parallel: int | None` is consistent with `OnnxTextEmbedding.embed` and removes the need for `Union` in this module.
Also applies to: 64-71
fastembed/rerank/cross_encoder/custom_text_cross_encoder.py (1)
1-1: CustomTextCrossEncoder constructor typing is correctly migrated.
The updated `__init__` matches the new `OnnxTextCrossEncoder` signature and forwards all parameters unchanged, so behavior is preserved while adopting `| None` across optional args.
Also applies to: 11-35
fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)
1-2: Base TextCrossEncoder typings are updated consistently.
Using `cache_dir: str | None`, `threads: int | None`, and `parallel: int | None` in `rerank_pairs` aligns this abstract base with its concrete subclasses and the rest of the PEP 604 migration without changing behavior.
Also applies to: 8-18, 40-46
fastembed/image/image_embedding_base.py (1)
1-1: LGTM! Type hints modernized to Python 3.10+ union syntax.
The type annotation updates are consistent with the patterns used in the related embedding base classes (`TextEmbeddingBase`, `SparseTextEmbeddingBase`, `LateInteractionTextEmbeddingBase`).
Also applies to: 13-14, 21-21, 25-25, 27-27
fastembed/common/model_description.py (1)
3-3: LGTM! Type hints correctly updated to Python 3.10+ union syntax.
The dataclass fields use simple types (str, int, bool) which support the pipe operator at runtime without issues.
Also applies to: 8-9, 45-46
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)
1-1: LGTM! Type annotations modernized consistently.
The type hint updates align with the broader PR goal and are consistent across the `__init__`, attribute declarations, and method signatures.
Also applies to: 80-87, 127-127, 183-183
fastembed/common/types.py (1)
9-12: LGTM! Simple type aliases correctly use pipe syntax.
`PathInput`, `ImageInput`, and `OnnxProvider` use basic types (`str`, `Path`, `Image.Image`, `tuple`) which properly support the pipe operator at runtime.
fastembed/late_interaction/late_interaction_embedding_base.py (1)
1-1: Type hint updates look good.
The modernization to Python 3.10+ union syntax is consistent across the `__init__`, `embed`, and `query_embed` methods.
Also applies to: 12-13, 20-20, 24-24, 26-26, 46-46
fastembed/sparse/sparse_text_embedding.py (1)
1-1: Type hint updates are consistent.
The modernization to Python 3.10+ union syntax is properly applied across `__init__`, `embed`, and `query_embed` methods.
Also applies to: 56-58, 60-60, 96-96, 98-98, 118-118
fastembed/text/text_embedding.py (1)
2-2: Type hint updates are well-applied.
The modernization to Python 3.10+ union syntax is consistent across `add_custom_model`, `__init__`, `get_embedding_size`, `embed`, and `query_embed` methods.
Also applies to: 54-54, 82-84, 86-86, 152-152, 167-167, 169-169, 189-189
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
4-4: LGTM! Clean migration to Python 3.10+ union syntax.
The type annotations have been consistently updated across the file using the pipe operator (`|`) instead of `Optional` and `Union`. All changes are annotation-only with no behavioral impact.
Also applies to: 22-22, 32-36, 95-101
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
1-1: LGTM! Consistent union syntax updates.
Type annotations correctly migrated to Python 3.10+ syntax across the public API surface. No runtime behavior changes.
Also applies to: 56-60, 105-105, 143-143
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
1-1: LGTM! Other type annotations correctly updated.
The remaining type annotations properly use Python 3.10+ union syntax.
Also applies to: 14-15, 22-22, 26-30, 50-54
fastembed/late_interaction/colbert.py (1)
2-2: LGTM! Other type annotations correctly migrated.
The remaining type annotations properly use Python 3.10+ union syntax. The `kwargs` parameter at line 276 maintains interface consistency with the base class, so the unused parameter warning can be safely ignored.
Also applies to: 143-150, 185-185, 201-202, 205-205, 239-243, 276-286
fastembed/image/onnx_image_model.py (1)
5-5: LGTM! Clean typing migration.
Type annotations consistently updated to Python 3.10+ union syntax. All changes are annotation-only with no runtime impact.
Also applies to: 40-40, 54-58, 96-104
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
5-5: LGTM! Comprehensive typing updates.
Type annotations correctly migrated to Python 3.10+ union syntax across class attributes and method signatures. No behavioral changes.
Also applies to: 21-21, 25-26, 63-67, 119-127, 190-198
Actionable comments posted: 0
♻️ Duplicate comments (1)
fastembed/sparse/sparse_embedding_base.py (1)
2-2: Revert NDArray union in dataclass to `Union[...]` to avoid runtime errors
Using `NDArray[np.int64] | NDArray[np.int32]` in a `@dataclass` field causes the `|` expression to be evaluated at class definition time; NumPy’s `NDArray[...]` types don’t implement `__or__`, which has previously led to dataclass complaints in this repo. Per the established exception for NDArray fields, this should stay on `typing.Union` with an explanatory comment.
Suggested fix:
    -from typing import Iterable, Any
    +from typing import Iterable, Any, Union
    @@
     @dataclass
     class SparseEmbedding:
         values: NumpyArray
    -    indices: NDArray[np.int64] | NDArray[np.int32]
    +    indices: Union[NDArray[np.int64], NDArray[np.int32]]  # Union required: dataclasses can't handle | with NDArray
Based on learnings, this mirrors how similar NDArray annotations are handled elsewhere in the codebase.
Also applies to: 15-15
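The general mechanism behind this suggestion can be sketched without NumPy: an annotation that is not deferred is an ordinary expression, so `|` only works when the operands support it, whereas `Union[...]` is a subscription that never calls `__or__`. The example below uses a string forward reference purely as a stand-in for an operand that lacks `|` support; it does not reproduce the NDArray behavior itself, and `Node` is a hypothetical class.

```python
from dataclasses import dataclass
from typing import Union


def pipe_is_evaluated_eagerly() -> bool:
    """Return True if applying `|` to a forward-reference string raises."""
    ref = "Node"
    try:
        ref | None  # the same expression an un-deferred annotation would run
    except TypeError:
        return True
    return False


@dataclass
class Node:
    value: int
    # Union[...] accepts the forward reference without calling `__or__`.
    next: Union["Node", None] = None
```

When an operand cannot take part in `|`, keeping `Union[...]` together with a short comment explaining the exception is the least invasive workaround.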
🧹 Nitpick comments (4)
fastembed/sparse/sparse_embedding_base.py (1)
34-45: SparseTextEmbeddingBase API updates and default `query_embed` implementation look good
The migration to `str | None`/`int | None` in `__init__` and `str | Iterable[str]` in `embed`, `query_embed`, and `token_count` matches the dense `TextEmbeddingBase` and sparse callers (e.g., SpladePP, Bm25). The new `query_embed` default correctly normalizes `str` into a singleton list and otherwise forwards iterables to `embed`, centralizing behavior without breaking existing overrides.
If you want perfect consistency, you could also update the docstring argument types from `Union[...]` to `str | Iterable[str]` in this file later.
Also applies to: 47-55, 56-70, 71-87, 88-90
fastembed/sparse/bm25.py (1)
91-103: Bm25 union-style annotations are consistent with sparse base; only kwargs lint remains
The move to `str | None`, `str | Iterable[str]`, and `int | None` in `__init__`, `_embed_documents`, `embed`, `token_count`, and `query_embed` matches `SparseTextEmbeddingBase` and preserves existing behavior and batching/parallelism logic.
Ruff’s ARG002 on unused `kwargs` in `token_count` and `query_embed` is purely stylistic. If you want to appease the linter while keeping the flexible API, you can rename the variadic parameter:
    - def token_count(self, texts: str | Iterable[str], **kwargs: Any) -> int:
    + def token_count(self, texts: str | Iterable[str], **_kwargs: Any) -> int:
    @@
    - def query_embed(self, query: str | Iterable[str], **kwargs: Any) -> Iterable[SparseEmbedding]:
    + def query_embed(self, query: str | Iterable[str], **_kwargs: Any) -> Iterable[SparseEmbedding]:
This keeps the call surface identical but avoids the unused-argument warning.
Also applies to: 157-166, 206-236, 271-278, 314-330
fastembed/text/text_embedding_base.py (1)
8-21: TextEmbeddingBase typing and default `query_embed` implementation are coherent
The base text embedding now cleanly uses `str | None`, `int | None`, and `str | Iterable[str]` across `__init__`, `embed`, `query_embed`, and `token_count`, with `_embedding_size: int | None` supporting the usual lazy caching in subclasses. The new `query_embed` default correctly normalizes a single string to a list and forwards iterables directly to `embed`, matching how callers already use this API.
If desired, you can later update the docstring types from `Union[...]` to `str | Iterable[str]` for consistency.
Also applies to: 22-28, 31-45, 46-62, 73-75
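The lazy caching that `_embedding_size: int | None` supports can be illustrated with a small sketch. `EmbeddingBaseSketch`, the `lookups` counter, and the value 384 are assumptions made for the example, not details taken from fastembed.

```python
class EmbeddingBaseSketch:
    """Hypothetical base class showing an `int | None` lazy cache."""

    def __init__(self) -> None:
        self._embedding_size: int | None = None  # unknown until first access
        self.lookups = 0  # counts how often the size is actually resolved

    def get_embedding_size(self) -> int:
        self.lookups += 1
        return 384  # arbitrary illustrative value

    @property
    def embedding_size(self) -> int:
        # Resolve once, then reuse the cached value.
        if self._embedding_size is None:
            self._embedding_size = self.get_embedding_size()
        return self._embedding_size
```

Here `None` marks "not computed yet", which is why the attribute is annotated `int | None` while the property itself returns a plain `int`.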
fastembed/text/text_embedding.py (1)
189-201: Align `query_embed` docstring with updated type hint
The signature now uses `query: str | Iterable[str]`, but the docstring still mentions `Union[str, Iterable[str]]`. Consider updating the docstring for consistency with the new annotation style; behavior is otherwise unchanged.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- fastembed/late_interaction/colbert.py (7 hunks)
- fastembed/late_interaction/late_interaction_embedding_base.py (4 hunks)
- fastembed/late_interaction/late_interaction_text_embedding.py (6 hunks)
- fastembed/late_interaction_multimodal/colpali.py (6 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (6 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (4 hunks)
- fastembed/sparse/bm25.py (6 hunks)
- fastembed/sparse/bm42.py (6 hunks)
- fastembed/sparse/minicoil.py (5 hunks)
- fastembed/sparse/sparse_embedding_base.py (6 hunks)
- fastembed/sparse/sparse_text_embedding.py (5 hunks)
- fastembed/sparse/splade_pp.py (5 hunks)
- fastembed/text/onnx_embedding.py (5 hunks)
- fastembed/text/onnx_text_model.py (6 hunks)
- fastembed/text/text_embedding.py (7 hunks)
- fastembed/text/text_embedding_base.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- fastembed/late_interaction/late_interaction_embedding_base.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
📚 Learning: 2025-11-12T10:48:30.978Z
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
Applied to files:
- fastembed/text/onnx_embedding.py
- fastembed/text/onnx_text_model.py
- fastembed/text/text_embedding.py
- fastembed/sparse/bm42.py
- fastembed/sparse/sparse_embedding_base.py
- fastembed/sparse/bm25.py
🧬 Code graph analysis (4)
fastembed/text/text_embedding.py (9)
fastembed/late_interaction/late_interaction_embedding_base.py (2)
embedding_size(69-71), query_embed(46-61)
fastembed/late_interaction/late_interaction_text_embedding.py (2)
embedding_size(84-88), query_embed(141-153)
fastembed/text/text_embedding_base.py (2)
embedding_size(69-71), query_embed(46-61)
fastembed/image/image_embedding_base.py (1)
embedding_size(53-55)
fastembed/late_interaction/colbert.py (1)
query_embed(276-286)
fastembed/sparse/bm25.py (1)
query_embed(314-330)
fastembed/sparse/bm42.py (1)
query_embed(328-347)
fastembed/sparse/sparse_embedding_base.py (1)
query_embed(71-86)
fastembed/text/multitask_embedding.py (1)
query_embed(84-85)
fastembed/sparse/bm42.py (7)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
query_embed(46-61)
fastembed/sparse/bm25.py (1)
query_embed(314-330)
fastembed/sparse/minicoil.py (1)
query_embed(236-254)
fastembed/sparse/sparse_embedding_base.py (2)
query_embed(71-86), SparseEmbedding(13-31)
fastembed/text/text_embedding.py (1)
query_embed(189-200)
fastembed/text/text_embedding_base.py (1)
query_embed(46-61)
fastembed/text/multitask_embedding.py (1)
query_embed(84-85)
fastembed/sparse/sparse_text_embedding.py (3)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
query_embed(46-61)
fastembed/sparse/sparse_embedding_base.py (2)
query_embed(71-86), SparseEmbedding(13-31)
fastembed/text/text_embedding_base.py (1)
query_embed(46-61)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (5)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
embedding_size(84-88)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
embedding_size(76-78)
fastembed/text/text_embedding.py (1)
embedding_size(132-136)
fastembed/image/image_embedding.py (1)
embedding_size(81-85)
fastembed/postprocess/muvera.py (1)
embedding_size(216-217)
embedding_size(216-217)
🪛 Ruff (0.14.8)
fastembed/text/onnx_text_model.py
42-42: Unused method argument: kwargs
(ARG002)
fastembed/sparse/bm42.py
328-328: Unused method argument: kwargs
(ARG002)
fastembed/late_interaction/colbert.py
276-276: Unused method argument: kwargs
(ARG002)
fastembed/sparse/bm25.py
271-271: Unused method argument: kwargs
(ARG002)
314-314: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
🔇 Additional comments (43)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
54-76: PEP 604 type-hint migration here looks consistent and non-breaking
Constructor, `get_embedding_size`, `embed_text`/`embed_image`, and `token_count` now use `str | None`, `int | None`, and `str | Iterable[...]`/`ImageInput | Iterable[...]` in line with the multimodal base and ColPali implementations, with no behavior changes beyond annotations.
Also applies to: 107-108, 120-126, 143-149, 166-172
fastembed/late_interaction_multimodal/colpali.py (1)
46-57: ColPali typing updates are consistent with the multimodal base and helpers
Constructor, `device_id`, `token_count`, `embed_text`, and `embed_image` now use PEP 604 unions in line with `LateInteractionMultimodalEmbeddingBase` and the Onnx worker APIs, while preserving existing control flow and data handling.
Also applies to: 80-90, 175-181, 228-234, 264-270
fastembed/sparse/splade_pp.py (1)
56-59: SpladePP PEP 604 unions align with sparse base and Onnx text model
The updated `__init__`, `token_count`, and `embed` signatures correctly adopt `str | None`, `int | None`, and `str | Iterable[str]` while remaining compatible with `SparseTextEmbeddingBase` and `OnnxTextModel`. No behavioral changes introduced.
Also applies to: 70-81, 114-115, 145-151, 167-180
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
10-22: Base multimodal API typing matches concrete implementations
The switch to `str | None`/`int | None` and `str | Iterable[...]`/`ImageInput | Iterable[...]` in the base class matches the concrete multimodal embeddings, and `_embedding_size: int | None` aligns with the standard lazy caching pattern. Abstract methods remain clearly defined.
Also applies to: 24-30, 48-54, 71-77, 80-84
fastembed/late_interaction/late_interaction_text_embedding.py (1)
51-61: LateInteractionTextEmbedding PEP 604 unions are consistent and preserve behavior
The constructor, `get_embedding_size`, `embed`, `query_embed`, and `token_count` now use `str | None`, `int | None`, and `str | Iterable[str]` in line with the late-interaction base and other embedding classes. The `embedding_size` property still lazily caches `_embedding_size` based on `get_embedding_size`, so runtime semantics remain unchanged.
Also applies to: 83-89, 90-105, 117-123, 141-153, 155-180
fastembed/text/onnx_embedding.py (4)
1-1: LGTM! Import cleanup aligns with PEP 604 migration.
The import statement correctly removes `Optional` and `Union` since all type annotations now use the native `|` syntax.
202-209: LGTM! Constructor parameters updated to PEP 604 syntax.
All optional parameters now correctly use `str | None`, `int | None`, `Sequence[OnnxProvider] | None`, and `list[int] | None` syntax consistently.
261-267: LGTM! Method signatures updated consistently.
The `embed` method parameters `documents: str | Iterable[str]` and `parallel: int | None` follow the same PEP 604 pattern.
334-336: LGTM! Previous review concern addressed.
The `token_count` method now correctly uses `str | Iterable[str]` instead of the previously flagged `Union` annotation.
fastembed/sparse/sparse_text_embedding.py (5)
1-1: LGTM! Import statement simplified.
Correctly imports only `Any, Iterable, Sequence, Type` after removing `Optional` and `Union`.
53-62: LGTM! Constructor type hints modernized.
All parameters use consistent PEP 604 syntax: `cache_dir: str | None`, `threads: int | None`, `providers: Sequence[OnnxProvider] | None`, `device_ids: list[int] | None`.
94-100: LGTM! Method signatures updated.
`embed` method correctly uses `documents: str | Iterable[str]` and `parallel: int | None`.
118-128: LGTM! `query_embed` signature updated.
The signature now uses `query: str | Iterable[str]` consistently with the base class implementation.
130-132: LGTM! `token_count` signature updated.
Uses `texts: str | Iterable[str]` consistently with other methods in this file.
fastembed/text/onnx_text_model.py (6)
4-4: LGTM! Import simplified for PEP 604 migration.
Correctly removes `Optional` and `Union` from imports.
17-18: LGTM! Class attribute type annotation updated.
`ONNX_OUTPUT_NAMES: list[str] | None` correctly uses the new union syntax.
36-43: LGTM! Instance attributes and return types updated.
`self.tokenizer: Tokenizer | None` and the return type `dict[str, NumpyArray | NDArray[np.int64]]` correctly use PEP 604 syntax. These are not dataclass fields, so no runtime evaluation concerns.
49-58: LGTM! `_load_onnx_model` parameters updated.
All optional parameters use consistent `| None` syntax.
103-117: LGTM! `_embed_documents` parameters modernized.
All parameters correctly use PEP 604 syntax: `documents: str | Iterable[str]`, `parallel: int | None`, `providers: Sequence[OnnxProvider] | None`, etc.
162-162: LGTM! Previous review concern addressed.
The `_token_count` method now correctly uses `texts: str | Iterable[str]` instead of the previously flagged `Union` annotation.
fastembed/sparse/minicoil.py (5)
3-3: LGTM! Import updated for PEP 604.
Correctly imports only required types without `Optional` or `Union`.
72-86: LGTM! Constructor parameters modernized.
All optional parameters use consistent `| None` syntax: `cache_dir`, `threads`, `providers`, `device_ids`, `device_id`, `specific_model_path`.
126-135: LGTM! Instance attribute annotations updated.
All optional attributes correctly use `| None` syntax: `tokenizer: Tokenizer | None`, `vocab_resolver: VocabResolver | None`, `encoder: Encoder | None`, `output_dim: int | None`, `sparse_vector_converter: SparseVectorConverter | None`. These are regular class attributes initialized in `__init__`, so PEP 604 works correctly.
190-199: LGTM! Method signatures updated.
`token_count` and `embed` methods correctly use `str | Iterable[str]` and `int | None` syntax.
236-236: LGTM! `query_embed` signature updated.
Uses `query: str | Iterable[str]` consistently with other sparse embedding implementations.
fastembed/late_interaction/colbert.py (6)
2-2: LGTM! Import simplified.
Correctly removes `Optional` and `Union` from typing imports.
99-106: LGTM! `token_count` parameter updated.
Uses `texts: str | Iterable[str]` consistently with other embedding classes.
140-151: LGTM! Constructor parameters modernized.
All optional parameters use consistent `| None` syntax.
185-205: LGTM! Instance attribute annotations updated.
`device_id: int | None`, `mask_token_id: int | None`, `pad_token_id: int | None`, and `query_tokenizer: Tokenizer | None` all correctly use PEP 604 syntax.
239-244: LGTM! `embed` method parameters updated.
Uses `documents: str | Iterable[str]` and `parallel: int | None` consistently.
276-276: LGTM! `query_embed` signature updated.
Uses `query: str | Iterable[str]` consistently with other embedding implementations.
fastembed/sparse/bm42.py (6)
4-4: LGTM! Import simplified.
Correctly removes `Optional` and `Union` from typing imports.
65-77: LGTM! Constructor parameters modernized.
All optional parameters use consistent `| None` syntax: `cache_dir`, `threads`, `providers`, `device_ids`, `device_id`, `specific_model_path`.
112-117: LGTM! Instance attribute annotation updated.
`self.device_id: int | None` correctly uses PEP 604 syntax.
283-289: LGTM! `embed` method parameters updated.
Uses `documents: str | Iterable[str]` and `parallel: int | None` consistently.
328-328: LGTM! Previous review concern addressed.
The `query_embed` method now correctly uses `query: str | Iterable[str]` instead of the previously flagged `Union` annotation.
353-355: LGTM! Previous review concern addressed.
The `token_count` method now correctly uses `texts: str | Iterable[str]` instead of the previously flagged `Union` annotation.
fastembed/text/text_embedding.py (6)
2-2: Typing imports now match the updated annotation style
Using `Any`, `Iterable`, `Sequence`, and `Type` here is appropriate and aligns with the new builtin generic + `|` union style used below. No issues.
54-74: `additional_files` annotation and default handling look correct
Changing `additional_files` to `list[str] | None` and then normalizing via `additional_files or []` avoids shared mutable defaults while keeping the API behavior unchanged. This is a clean modernization of the type hint.
79-90: Constructor type hints are modernized without changing behavior
Updating `cache_dir`, `threads`, `providers`, and `device_ids` to use `X | None` is backward compatible and matches how these values are forwarded to `TextEmbeddingBase` and the concrete model instances. No behavioral change introduced.
151-163: Local `embedding_size` typing is clear and safe
Using `embedding_size: int | None = None` with the subsequent `None` check keeps the control flow explicit and doesn’t alter behavior. This is fine as-is.
165-187: `embed` signature update is consistent with downstream usage
`documents: str | Iterable[str]` and `parallel: int | None` match the expected inputs of the underlying model’s `embed` implementation, and only the annotation style changed. The forwarding call `self.model.embed(documents, batch_size, parallel, **kwargs)` remains correct.
216-228: `token_count` annotation matches intended usage
Typing `texts` as `str | Iterable[str]` with the unchanged delegation to `self.model.token_count` preserves behavior and clarifies accepted inputs. The docstring already reflects the new union style; nothing else to change.
* new: drop python3.9, replace optional and union with |
* new: remove python 3.9 from pyproject
* refactor: replace remaining union and optional with |
* new: remove optional and union in dataclasses
* fix: add typealias to numpy type
* new: replace union with | in token count
No description provided.