Conversation
Pull request overview
Copilot reviewed 39 out of 64 changed files in this pull request and generated 27 comments.
```python
# CRITICAL FIX: Clip predictions to valid token ID range
vocab_size = len(tokenizer)
preds = np.clip(preds, 0, vocab_size - 1)
```
The training script clips token IDs to the vocabulary size to prevent out-of-range errors during evaluation. However, clipping silently maps out-of-range IDs onto unrelated valid tokens rather than fixing anything. A better approach would be to investigate why predictions are out of range in the first place, as this usually indicates an issue with model generation or tokenization configuration.
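One common culprit in HuggingFace seq2seq evaluation is the `-100` ignore-index used for padded label positions leaking into decoding. A minimal sketch of that fix, assuming a standard `compute_metrics` setup where `tokenizer` is in scope:

```python
import numpy as np

def decode_predictions(preds, tokenizer):
    # -100 marks padded positions and is not a real token ID; map it back to
    # the pad token before decoding instead of clipping the whole array.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    return tokenizer.batch_decode(preds, skip_special_tokens=True)
```

If the out-of-range values are not `-100`, that points at a genuine generation or vocabulary mismatch that clipping would only hide.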
| """NLLB-200 translator with fastText language detection.""" | ||
|
|
||
| def __init__(self, | ||
| model_name="src/multimeditron/translation/models/nllb-consensus-finetuned-1epoch", #Fine tuned model - to use the base NLLB-200 3.3B model, add HF path here (nllb-200-3.3B) |
The default model path uses a relative path that may not work correctly depending on where the code is executed from. Consider constructing an absolute path from `__file__`, or making this a required parameter without a default value. The comment also suggests this should point to a HuggingFace model ID for the base model, but the current default is a local path.
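A minimal sketch of the `__file__`-based approach, assuming `translator.py` lives in `src/multimeditron/translation/` next to the `models/` directory implied by the current default:

```python
from pathlib import Path

# Resolve the bundled checkpoint relative to this source file so the default
# works regardless of the current working directory.
_DEFAULT_MODEL_DIR = Path(__file__).resolve().parent / "models" / "nllb-consensus-finetuned-1epoch"

def __init__(self, model_name: str = str(_DEFAULT_MODEL_DIR)):
    ...
```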
| print(f"[INFO] Loading NLLB model: {model_name}") | ||
| self.tokenizer = AutoTokenizer.from_pretrained(model_name) | ||
| self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name) | ||
|
|
||
|
|
||
| self.device = "cuda" if torch.cuda.is_available() else "cpu" | ||
| self.model.to(self.device) | ||
|
|
||
| print(f"[INFO] Loading fastText language detection model") | ||
| try: | ||
| import fasttext | ||
| model_path = hf_hub_download( | ||
| repo_id="facebook/fasttext-language-identification", | ||
| filename="model.bin" | ||
| ) | ||
| fasttext.FastText.eprint = lambda x: None | ||
| self.lang_detector = fasttext.load_model(model_path) | ||
| print(f"[INFO] fastText model loaded successfully") | ||
| except Exception as e: | ||
| print(f"[ERROR] Failed to load fastText: {e}") | ||
| print("[INFO] Ensure: pip install 'numpy<2.0' fasttext") | ||
| raise | ||
|
|
||
| self.detected_user_lang = None | ||
| print(f"[INFO] NLLB translator ready on {self.device}") |
The print statements are suitable for debugging but should be replaced with proper logging (using the logging module) for production code. This allows users to control log levels and outputs more flexibly.
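A minimal sketch of the logging-based equivalent (logger name and message wording are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

logger.info("Loading NLLB model: %s", model_name)
...
logger.error("Failed to load fastText: %s", e)
...
logger.info("NLLB translator ready on %s", self.device)
```

Callers can then tune verbosity with `logging.basicConfig(level=logging.WARNING)` or attach their own handlers without editing the translator.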
```python
try:
    token_id = self.tokenizer.convert_tokens_to_ids(detected_code)
    if token_id == self.tokenizer.unk_token_id:
        print(f"[WARNING] '{detected_code}' not supported. Defaulting to eng_Latn.")
        return 'eng_Latn'
```
When the detected language code is not supported by the tokenizer, the method returns 'eng_Latn' as a fallback. However, this could lead to incorrect behavior, as the actual text might not be in English. Consider raising an exception, or surfacing the fallback through warnings/logging rather than a bare print, so the behavior is explicit to callers.
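One hedged alternative, sketched below, is to surface the fallback through the `warnings` module so callers can filter it or escalate it to an error:

```python
import warnings

token_id = self.tokenizer.convert_tokens_to_ids(detected_code)
if token_id == self.tokenizer.unk_token_id:
    # Visible through the warnings machinery; callers can escalate with
    # warnings.simplefilter("error") if a silent English fallback is unacceptable.
    warnings.warn(
        f"Detected language '{detected_code}' is not in the NLLB vocabulary; "
        "falling back to eng_Latn.",
        UserWarning,
    )
    return 'eng_Latn'
```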
```python
def translate_from_english(self, text: str, tgt_lang: str = None) -> str:
    """
    Translate from English back to original language.
    If original was low confidence (eng_Latn), passes through unchanged.
    """
    if tgt_lang is None:
        if self.detected_user_lang is None:
            print("[WARNING] No detected language stored. Returning as-is.")
            return text
        tgt_lang = self.detected_user_lang

    if tgt_lang == 'eng_Latn':
        return text

    return self.translate(text, 'eng_Latn', tgt_lang)
```
The translate_from_english method relies on the detected_user_lang instance variable set by translate_to_english. This creates a stateful dependency between method calls that could lead to bugs if the methods are called out of order or in a multi-threaded context. Consider making this stateless by requiring the target language as a parameter or documenting this requirement clearly.
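A stateless sketch of the round trip, where the detected language flows through return values instead of instance state (`detect_language` is a hypothetical name for the class's fastText detection step):

```python
def translate_to_english(self, text: str) -> tuple[str, str]:
    """Return (translation, detected_lang) so the caller owns the state."""
    src_lang = self.detect_language(text)  # hypothetical helper name
    return self.translate(text, src_lang, 'eng_Latn'), src_lang

def translate_from_english(self, text: str, tgt_lang: str) -> str:
    """tgt_lang is now required: pass the code translate_to_english returned."""
    if tgt_lang == 'eng_Latn':
        return text
    return self.translate(text, 'eng_Latn', tgt_lang)
```

This also makes concurrent use safe, since no per-request language is stored on `self`.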
```python
    b = bleu.sentence_score(candidate, refs).score
    c = chrf.sentence_score(candidate, refs).score
    scores[model] = 0.5 * b + 0.5 * c
except:
```
The bare `except:` block directly handles `BaseException`, which also swallows `KeyboardInterrupt` and `SystemExit`.
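A sketch of the narrowed handler (the warning message is illustrative):

```python
try:
    b = bleu.sentence_score(candidate, refs).score
    c = chrf.sentence_score(candidate, refs).score
    scores[model] = 0.5 * b + 0.5 * c
except Exception as exc:
    # Exception excludes KeyboardInterrupt/SystemExit, so Ctrl-C still works;
    # record the failure instead of silently skipping this model.
    print(f"[WARNING] Scoring failed for {model}: {exc}")
```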
```python
except:
    pass
```
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
-except:
-    pass
+except Exception as exc:
+    # Gradient checkpointing is an optional optimization; continue without it if enabling fails.
+    print(f" ⚠️ Could not enable gradient checkpointing: {exc}")
```
```python
except:
    pass
```
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
-except:
-    pass
+except Exception as save_err:
+    log(f"⚠️ Failed to save emergency checkpoint: {save_err}")
```
```python
lang_data = load_jsonl(filepath)

if SAMPLES_PER_LANGUAGE:
    lang_data = lang_data[:SAMPLES_PER_LANGUAGE]
```
This statement is unreachable, likely because `SAMPLES_PER_LANGUAGE` is a constant that evaluates as falsy (e.g. `None` or `0`), so the truncation branch can never execute.
```python
if lang not in writers:
    out_path = out_dir / f"wikipedia_{lang}_pretraining.jsonl"
    writers[lang] = open(out_path, "w", encoding=ENCODING)
```
File is opened but is not closed.
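A sketch using `contextlib.ExitStack` to guarantee every per-language writer is closed, even if processing raises part-way through (the `records` loop is assumed for illustration):

```python
import json
from contextlib import ExitStack

with ExitStack() as stack:
    writers = {}
    for record in records:  # hypothetical iteration over the input data
        lang = record["lang"]
        if lang not in writers:
            out_path = out_dir / f"wikipedia_{lang}_pretraining.jsonl"
            # enter_context registers the file for closing when the block exits.
            writers[lang] = stack.enter_context(open(out_path, "w", encoding=ENCODING))
        writers[lang].write(json.dumps(record, ensure_ascii=False) + "\n")
```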
This PR introduces a translation interface for NLLB-200 with fastText detection.

✨ Key Contributions

- Translator (`translator.py`) for multimeditron inference
- Consensus-based data generation
- Fine-tuning & evaluation framework