Enhance encode_targets.py with dataset and model support #11
base: master
Conversation
Refactor encoding script to support new dataset loading and model options.
Pull Request Overview
This PR enhances the encode_targets.py script to support multiple data sources and embedding models. The refactor transforms a simple file-based encoding script into a more flexible tool that can handle both HuggingFace datasets and local files with different embedding frameworks.
Key changes:
- Added support for HuggingFace datasets (MS MARCO) alongside existing file-based input
- Implemented dual model support (FastEmbed and SentenceTransformer) with automatic fallback
- Enhanced argument parsing with vocabulary filtering and improved batch processing
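The "automatic fallback" between FastEmbed and SentenceTransformer mentioned above could be structured roughly as follows. This is a sketch of the dispatch idea only; the function name `pick_backend` and the exact detection logic are assumptions, not the actual code in encode_targets.py:

```python
import importlib.util

def pick_backend(model_name):
    """Choose an embedding backend label for `model_name`.

    Prefers FastEmbed when the package is installed, otherwise falls
    back to SentenceTransformer. A hypothetical sketch of the PR's
    dual-model support; the real loading code may differ.
    """
    if importlib.util.find_spec("fastembed") is not None:
        return "fastembed"
    return "sentence_transformers"
```

The caller can then branch on the returned label when instantiating the model and when encoding, so the rest of the script stays backend-agnostic.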
```python
        if max_count and i >= max_count:
            break
else:
    raise ValueError(f"Dataset non supporté: {name}")
```
Copilot AI (Oct 12, 2025):
Error message should be in English to maintain consistency with the rest of the codebase.
Suggested change:

```diff
- raise ValueError(f"Dataset non supporté: {name}")
+ raise ValueError(f"Unsupported dataset: {name}")
```
```python
    )
    yield embeddings
else:
    raise ValueError(f"Model type non supporté: {model_type}")
```
Copilot AI (Oct 12, 2025):
Error message should be in English to maintain consistency with the rest of the codebase.
Suggested change:

```diff
- raise ValueError(f"Model type non supporté: {model_type}")
+ raise ValueError(f"Unsupported model type: {model_type}")
```
```python
texts_list = list(texts_generator)
for embeddings_batch in encode_texts(texts_list, model, model_type, args.batch_size, args.use_cuda):
```
Copilot AI (Oct 12, 2025):
Converting the entire generator to a list loads all texts into memory at once, which could cause memory issues with large datasets. Consider processing texts in chunks or streaming them directly to the encoding function.
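The chunked-processing approach suggested here could look like the sketch below. The helper name `chunked` and the chunk size are illustrative assumptions; the point is to replace `list(texts_generator)` with bounded-memory iteration:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable,
    so the full input never has to be materialized in memory."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Instead of:
#   texts_list = list(texts_generator)
# stream fixed-size chunks to the encoding function:
#   for chunk in chunked(texts_generator, 10_000):
#       for embeddings_batch in encode_texts(chunk, model, model_type, ...):
#           ...  # write embeddings_batch to the output file
```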
```python
embeddings = model.encode(
    list(texts),
    batch_size=batch_size,
    convert_to_numpy=True,
    show_progress_bar=True
)
yield embeddings
```
Copilot AI (Oct 12, 2025):
The SentenceTransformer branch converts texts to a list again and yields all embeddings at once, negating the benefit of the generator pattern. Consider yielding embeddings in batches to maintain memory efficiency.
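A minimal sketch of the batched-yield pattern the comment asks for, assuming the model exposes a SentenceTransformer-style `encode(list_of_str, batch_size=..., convert_to_numpy=True)` API (the function name `encode_in_batches` is hypothetical):

```python
def encode_in_batches(model, texts, batch_size):
    """Yield one embeddings array per batch instead of encoding
    everything at once, so `texts` can stay a generator and peak
    memory is bounded by a single batch."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) >= batch_size:
            yield model.encode(batch, batch_size=batch_size, convert_to_numpy=True)
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield model.encode(batch, batch_size=batch_size, convert_to_numpy=True)
```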
```python
    raise ValueError("Must specify either --input-file or --dataset")

model, model_type = get_model(args.model_name)
os.makedirs(os.path.dirname(args.output_file), exist_ok=True)
```
Copilot AI (Oct 12, 2025):
This will fail if args.output_file is just a filename without a directory path, as os.path.dirname() would return an empty string. Consider checking if dirname is non-empty before calling makedirs.
Suggested change:

```diff
- os.makedirs(os.path.dirname(args.output_file), exist_ok=True)
+ output_dir = os.path.dirname(args.output_file)
+ if output_dir:
+     os.makedirs(output_dir, exist_ok=True)
```