
Conversation


@smokyngt smokyngt commented Oct 5, 2025

Refactor encoding script to support new dataset loading and model options.

@generall generall requested a review from Copilot October 12, 2025 19:10

Copilot AI left a comment


Pull Request Overview

This PR enhances the encode_targets.py script to support multiple data sources and embedding models. The refactor transforms a simple file-based encoding script into a more flexible tool that can handle both HuggingFace datasets and local files with different embedding frameworks.

Key changes:

  • Added support for HuggingFace datasets (MS MARCO) alongside existing file-based input
  • Implemented dual model support (FastEmbed and SentenceTransformer) with automatic fallback
  • Enhanced argument parsing with vocabulary filtering and improved batch processing
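The "automatic fallback" between the two embedding backends can be reduced to a small selection helper. The sketch below is illustrative only — the function name, signature, and fallback order are assumptions based on the PR description, not the script's actual API:

```python
def pick_backend(available, preferred="fastembed"):
    """Return the first usable embedding backend name.

    `available` is the set of importable package names (in the real
    script this would come from try/except ImportError around the
    imports); the fallback order mirrors the PR description:
    FastEmbed first, then SentenceTransformer.
    """
    fallback_order = [preferred] + [
        p for p in ("fastembed", "sentence_transformers") if p != preferred
    ]
    for pkg in fallback_order:
        if pkg in available:
            return pkg
    raise ImportError("no supported embedding backend installed")
```

Keeping the decision in one pure function makes the fallback order easy to test without either library installed.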


            if max_count and i >= max_count:
                break
    else:
        raise ValueError(f"Dataset non supporté: {name}")

Copilot AI Oct 12, 2025


Error message should be in English to maintain consistency with the rest of the codebase.

Suggested change
raise ValueError(f"Dataset non supporté: {name}")
raise ValueError(f"Unsupported dataset: {name}")

        )
        yield embeddings
    else:
        raise ValueError(f"Model type non supporté: {model_type}")

Copilot AI Oct 12, 2025


Error message should be in English to maintain consistency with the rest of the codebase.

Suggested change
raise ValueError(f"Model type non supporté: {model_type}")
raise ValueError(f"Unsupported model type: {model_type}")

Comment on lines +111 to +112
    texts_list = list(texts_generator)
    for embeddings_batch in encode_texts(texts_list, model, model_type, args.batch_size, args.use_cuda):

Copilot AI Oct 12, 2025


Converting the entire generator to a list loads all texts into memory at once, which could cause memory issues with large datasets. Consider processing texts in chunks or streaming them directly to the encoding function.
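One way to keep memory bounded, assuming `encode_texts` accepts any iterable of texts, is to feed it fixed-size chunks via a small stdlib-only helper (the name `batched` is illustrative):

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of up to batch_size items, so the full
    generator is never materialized in memory at once."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, batch_size)):
        yield chunk
```

The main loop could then call `encode_texts` once per chunk instead of on the fully materialized `texts_list`.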

Comment on lines +64 to +70
        embeddings = model.encode(
            list(texts),
            batch_size=batch_size,
            convert_to_numpy=True,
            show_progress_bar=True
        )
        yield embeddings

Copilot AI Oct 12, 2025


The SentenceTransformer branch converts texts to a list again and yields all embeddings at once, negating the benefit of the generator pattern. Consider yielding embeddings in batches to maintain memory efficiency.
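A sketch of per-batch yielding for this branch — the helper name is hypothetical, and `model.encode` is assumed to follow the SentenceTransformer signature quoted above:

```python
def encode_in_batches(texts, model, batch_size):
    """Yield one embeddings array per batch rather than encoding the
    whole corpus in a single model.encode() call."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield model.encode(batch, batch_size=batch_size,
                               convert_to_numpy=True,
                               show_progress_bar=False)
            batch = []
    if batch:  # trailing partial batch
        yield model.encode(batch, batch_size=batch_size,
                           convert_to_numpy=True,
                           show_progress_bar=False)
```

Because `texts` is consumed lazily, this version works with the generator directly and matches the memory profile of the FastEmbed branch.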

        raise ValueError("Must specify either --input-file or --dataset")

    model, model_type = get_model(args.model_name)
    os.makedirs(os.path.dirname(args.output_file), exist_ok=True)

Copilot AI Oct 12, 2025


This will fail if args.output_file is just a filename without a directory path, as os.path.dirname() would return an empty string. Consider checking if dirname is non-empty before calling makedirs.

Suggested change
os.makedirs(os.path.dirname(args.output_file), exist_ok=True)
output_dir = os.path.dirname(args.output_file)
if output_dir:
    os.makedirs(output_dir, exist_ok=True)
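An equivalent fix using `pathlib`, sketched as a helper (the function name is illustrative): `Path.parent` of a bare filename is `.`, which already exists, so the empty-dirname case is handled without an explicit check.

```python
from pathlib import Path

def ensure_parent_dir(output_file: str) -> None:
    """Create the parent directory of output_file if it is missing.
    For a bare filename the parent is ".", and mkdir(exist_ok=True)
    on an existing directory is a no-op."""
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
```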
