Enhance encode_targets.py with dataset and model support #11
base: master
Conversation
Refactor encoding script to support new dataset loading and model options.
Pull Request Overview
This PR enhances the encode_targets.py script to support multiple data sources and embedding models. The refactor transforms a simple file-based encoding script into a more flexible tool that can handle both HuggingFace datasets and local files with different embedding frameworks.
Key changes:
- Added support for HuggingFace datasets (MS MARCO) alongside existing file-based input
- Implemented dual model support (FastEmbed and SentenceTransformer) with automatic fallback
- Enhanced argument parsing with vocabulary filtering and improved batch processing
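The "automatic fallback" between FastEmbed and SentenceTransformer mentioned above could be structured roughly as follows. This is a sketch of the dispatch idea only; the function name `pick_backend` and the exact detection logic are assumptions, not the actual code in encode_targets.py:

```python
import importlib.util

def pick_backend(model_name):
    """Choose an embedding backend label for `model_name`.

    Prefers FastEmbed when the package is installed, otherwise falls
    back to SentenceTransformer. A hypothetical sketch of the PR's
    dual-model support; the real loading code may differ.
    """
    if importlib.util.find_spec("fastembed") is not None:
        return "fastembed"
    return "sentence_transformers"
```

The caller can then branch on the returned label when instantiating the model and when encoding, so the rest of the script stays backend-agnostic.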
```python
        if max_count and i >= max_count:
            break
else:
    raise ValueError(f"Dataset non supporté: {name}")
```
Copilot AI (Oct 12, 2025):
Error message should be in English to maintain consistency with the rest of the codebase.
Suggested change:

```diff
- raise ValueError(f"Dataset non supporté: {name}")
+ raise ValueError(f"Unsupported dataset: {name}")
```
```python
    )
    yield embeddings
else:
    raise ValueError(f"Model type non supporté: {model_type}")
```
Copilot AI (Oct 12, 2025):
Error message should be in English to maintain consistency with the rest of the codebase.
Suggested change:

```diff
- raise ValueError(f"Model type non supporté: {model_type}")
+ raise ValueError(f"Unsupported model type: {model_type}")
```
```python
texts_list = list(texts_generator)
for embeddings_batch in encode_texts(texts_list, model, model_type, args.batch_size, args.use_cuda):
```
Copilot AI (Oct 12, 2025):
Converting the entire generator to a list loads all texts into memory at once, which could cause memory issues with large datasets. Consider processing texts in chunks or streaming them directly to the encoding function.
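The chunked-processing approach suggested here could look like the sketch below. The helper name `chunked` and the chunk size are illustrative assumptions; the point is to replace `list(texts_generator)` with bounded-memory iteration:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable,
    so the full input never has to be materialized in memory."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Instead of:
#   texts_list = list(texts_generator)
# stream fixed-size chunks to the encoding function:
#   for chunk in chunked(texts_generator, 10_000):
#       for embeddings_batch in encode_texts(chunk, model, model_type, ...):
#           ...  # write embeddings_batch to the output file
```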
```python
embeddings = model.encode(
    list(texts),
    batch_size=batch_size,
    convert_to_numpy=True,
    show_progress_bar=True
)
yield embeddings
```
Copilot AI (Oct 12, 2025):
The SentenceTransformer branch converts texts to a list again and yields all embeddings at once, negating the benefit of the generator pattern. Consider yielding embeddings in batches to maintain memory efficiency.
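A minimal sketch of the batched-yield pattern the comment asks for, assuming the model exposes a SentenceTransformer-style `encode(list_of_str, batch_size=..., convert_to_numpy=True)` API (the function name `encode_in_batches` is hypothetical):

```python
def encode_in_batches(model, texts, batch_size):
    """Yield one embeddings array per batch instead of encoding
    everything at once, so `texts` can stay a generator and peak
    memory is bounded by a single batch."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) >= batch_size:
            yield model.encode(batch, batch_size=batch_size, convert_to_numpy=True)
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield model.encode(batch, batch_size=batch_size, convert_to_numpy=True)
```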
```python
    raise ValueError("Must specify either --input-file or --dataset")

model, model_type = get_model(args.model_name)
os.makedirs(os.path.dirname(args.output_file), exist_ok=True)
```
Copilot AI (Oct 12, 2025):
This will fail if args.output_file is just a filename without a directory path, as os.path.dirname() would return an empty string. Consider checking if dirname is non-empty before calling makedirs.
Suggested change:

```diff
- os.makedirs(os.path.dirname(args.output_file), exist_ok=True)
+ output_dir = os.path.dirname(args.output_file)
+ if output_dir:
+     os.makedirs(output_dir, exist_ok=True)
```