Skip to content

🚀 Added a Translation Pipeline#43

Open
leagrieder wants to merge 10 commits intoEPFLiGHT:masterfrom
leagrieder:addTranslationModel
Open

🚀 Added a Translation Pipeline#43
leagrieder wants to merge 10 commits intoEPFLiGHT:masterfrom
leagrieder:addTranslationModel

Conversation

@leagrieder
Copy link

@leagrieder leagrieder commented Jan 28, 2026

This PR introduces a translation interface for NLLB-200 with fasttext detection.

✨ Key Contributions

  • Translator (translator.py) for multimeditron inference

    • Automatic language detection with fastText (80% confidence threshold)
    • Smart routing to prevent mistranslation of ambiguous inputs
    • Bidirectional medical translation (user language ↔ English)
    • Compatible with base and fine-tuned NLLB-200 models
  • Consensus-based data generation

    • Synthetic parallel medical corpora built from multi-model translation agreement
    • Scalable approach for low-resource languages
  • Fine-tuning & evaluation framework

    • Scripts for NLLB-200 medical fine-tuning
    • Comprehensive experiments on translation quality and downstream medical QA

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 39 out of 64 changed files in this pull request and generated 27 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +218 to +220
# CRITICAL FIX: Clip predictions to valid token ID range
vocab_size = len(tokenizer)
preds = np.clip(preds, 0, vocab_size - 1)
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The training script clips token IDs to the vocabulary size to prevent out-of-range errors during evaluation. However, clipping predictions could produce invalid tokens. A better approach would be to investigate why predictions are out of range in the first place, as this indicates a potential issue with model generation or tokenization configuration.

Copilot uses AI. Check for mistakes.
"""NLLB-200 translator with fastText language detection."""

def __init__(self,
model_name="src/multimeditron/translation/models/nllb-consensus-finetuned-1epoch", #Fine tuned model - to use the base NLLB-200 3.3B model, add HF path here (nllb-200-3.3B)
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default model path uses a relative path that may not work correctly depending on where the code is executed from. Consider using an absolute path constructed from file or making this a required parameter without a default value. The comment also suggests this should point to a HuggingFace model ID for the base model, but the current default is a local path.

Copilot uses AI. Check for mistakes.
Comment on lines +32 to +56
print(f"[INFO] Loading NLLB model: {model_name}")
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.model.to(self.device)

print(f"[INFO] Loading fastText language detection model")
try:
import fasttext
model_path = hf_hub_download(
repo_id="facebook/fasttext-language-identification",
filename="model.bin"
)
fasttext.FastText.eprint = lambda x: None
self.lang_detector = fasttext.load_model(model_path)
print(f"[INFO] fastText model loaded successfully")
except Exception as e:
print(f"[ERROR] Failed to load fastText: {e}")
print("[INFO] Ensure: pip install 'numpy<2.0' fasttext")
raise

self.detected_user_lang = None
print(f"[INFO] NLLB translator ready on {self.device}")
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The print statements are suitable for debugging but should be replaced with proper logging (using the logging module) for production code. This allows users to control log levels and outputs more flexibly.

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +86
try:
token_id = self.tokenizer.convert_tokens_to_ids(detected_code)
if token_id == self.tokenizer.unk_token_id:
print(f"[WARNING] '{detected_code}' not supported. Defaulting to eng_Latn.")
return 'eng_Latn'
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When detected language code is not supported by the tokenizer, the method returns 'eng_Latn' as a fallback. However, this could lead to incorrect behavior as the actual text might not be in English. Consider raising an exception or logging a warning to make this fallback behavior more explicit to callers.

Copilot uses AI. Check for mistakes.
Comment on lines +155 to +169
def translate_from_english(self, text: str, tgt_lang: str = None) -> str:
"""
Translate from English back to original language.
If original was low confidence (eng_Latn), passes through unchanged.
"""
if tgt_lang is None:
if self.detected_user_lang is None:
print("[WARNING] No detected language stored. Returning as-is.")
return text
tgt_lang = self.detected_user_lang

if tgt_lang == 'eng_Latn':
return text

return self.translate(text, 'eng_Latn', tgt_lang) No newline at end of file
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The translate_from_english method relies on the detected_user_lang instance variable set by translate_to_english. This creates a stateful dependency between method calls that could lead to bugs if the methods are called out of order or in a multi-threaded context. Consider making this stateless by requiring the target language as a parameter or documenting this requirement clearly.

Copilot uses AI. Check for mistakes.
b = bleu.sentence_score(candidate, refs).score
c = chrf.sentence_score(candidate, refs).score
scores[model] = 0.5 * b + 0.5 * c
except:
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except block directly handles BaseException.

Copilot uses AI. Check for mistakes.
Comment on lines +523 to +524
except:
pass
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except:
pass
except Exception as exc:
# Gradient checkpointing is an optional optimization; continue without it if enabling fails.
print(f" ⚠️ Could not enable gradient checkpointing: {exc}")

Copilot uses AI. Check for mistakes.
Comment on lines +654 to +655
except:
pass
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except:
pass
except Exception as save_err:
log(f"⚠️ Failed to save emergency checkpoint: {save_err}")

Copilot uses AI. Check for mistakes.
lang_data = load_jsonl(filepath)

if SAMPLES_PER_LANGUAGE:
lang_data = lang_data[:SAMPLES_PER_LANGUAGE]
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement is unreachable.

Copilot uses AI. Check for mistakes.

if lang not in writers:
out_path = out_dir / f"wikipedia_{lang}_pretraining.jsonl"
writers[lang] = open(out_path, "w", encoding=ENCODING)
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File is opened but is not closed.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant