Conversation
Pull request overview
Copilot reviewed 39 out of 64 changed files in this pull request and generated 27 comments.
```python
# CRITICAL FIX: Clip predictions to valid token ID range
vocab_size = len(tokenizer)
preds = np.clip(preds, 0, vocab_size - 1)
```
The training script clips token IDs to the vocabulary size to prevent out-of-range errors during evaluation. However, clipping silently maps out-of-range IDs onto unrelated valid tokens rather than fixing anything. A better approach would be to investigate why predictions are out of range in the first place, as this usually indicates an issue with model generation or tokenization configuration.
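One common culprit in HuggingFace seq2seq evaluation is the `-100` ignore-index used for padded label positions leaking into decoding. A minimal sketch of that fix, assuming a standard `compute_metrics` setup where `tokenizer` is in scope:

```python
import numpy as np

def decode_predictions(preds, tokenizer):
    # -100 marks padded positions and is not a real token ID; map it back to
    # the pad token before decoding instead of clipping the whole array.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    return tokenizer.batch_decode(preds, skip_special_tokens=True)
```

If the out-of-range values are not `-100`, that points at a genuine generation or vocabulary mismatch that clipping would only hide.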
| """NLLB-200 translator with fastText language detection.""" | ||
|
|
||
| def __init__(self, | ||
| model_name="src/multimeditron/translation/models/nllb-consensus-finetuned-1epoch", #Fine tuned model - to use the base NLLB-200 3.3B model, add HF path here (nllb-200-3.3B) |
The default model path uses a relative path that may not work correctly depending on where the code is executed from. Consider constructing an absolute path from `__file__`, or making this a required parameter without a default value. The comment also suggests this should point to a HuggingFace model ID for the base model, but the current default is a local path.
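A minimal sketch of the `__file__`-based approach, assuming `translator.py` lives in `src/multimeditron/translation/` next to the `models/` directory implied by the current default:

```python
from pathlib import Path

# Resolve the bundled checkpoint relative to this source file so the default
# works regardless of the current working directory.
_DEFAULT_MODEL_DIR = Path(__file__).resolve().parent / "models" / "nllb-consensus-finetuned-1epoch"

def __init__(self, model_name: str = str(_DEFAULT_MODEL_DIR)):
    ...
```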
| print(f"[INFO] Loading NLLB model: {model_name}") | ||
| self.tokenizer = AutoTokenizer.from_pretrained(model_name) | ||
| self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name) | ||
|
|
||
|
|
||
| self.device = "cuda" if torch.cuda.is_available() else "cpu" | ||
| self.model.to(self.device) | ||
|
|
||
| print(f"[INFO] Loading fastText language detection model") | ||
| try: | ||
| import fasttext | ||
| model_path = hf_hub_download( | ||
| repo_id="facebook/fasttext-language-identification", | ||
| filename="model.bin" | ||
| ) | ||
| fasttext.FastText.eprint = lambda x: None | ||
| self.lang_detector = fasttext.load_model(model_path) | ||
| print(f"[INFO] fastText model loaded successfully") | ||
| except Exception as e: | ||
| print(f"[ERROR] Failed to load fastText: {e}") | ||
| print("[INFO] Ensure: pip install 'numpy<2.0' fasttext") | ||
| raise | ||
|
|
||
| self.detected_user_lang = None | ||
| print(f"[INFO] NLLB translator ready on {self.device}") |
The print statements are suitable for debugging but should be replaced with proper logging (using the logging module) for production code. This allows users to control log levels and outputs more flexibly.
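A minimal sketch of the logging-based equivalent (logger name and message wording are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

logger.info("Loading NLLB model: %s", model_name)
...
logger.error("Failed to load fastText: %s", e)
...
logger.info("NLLB translator ready on %s", self.device)
```

Callers can then tune verbosity with `logging.basicConfig(level=logging.WARNING)` or attach their own handlers without editing the translator.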
```python
try:
    token_id = self.tokenizer.convert_tokens_to_ids(detected_code)
    if token_id == self.tokenizer.unk_token_id:
        print(f"[WARNING] '{detected_code}' not supported. Defaulting to eng_Latn.")
        return 'eng_Latn'
```
When the detected language code is not supported by the tokenizer, the method returns 'eng_Latn' as a fallback. However, this could lead to incorrect behavior, as the actual text might not be in English. Consider raising an exception, or surfacing the fallback through warnings/logging rather than a bare print, so the behavior is explicit to callers.
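One hedged alternative, sketched below, is to surface the fallback through the `warnings` module so callers can filter it or escalate it to an error:

```python
import warnings

token_id = self.tokenizer.convert_tokens_to_ids(detected_code)
if token_id == self.tokenizer.unk_token_id:
    # Visible through the warnings machinery; callers can escalate with
    # warnings.simplefilter("error") if a silent English fallback is unacceptable.
    warnings.warn(
        f"Detected language '{detected_code}' is not in the NLLB vocabulary; "
        "falling back to eng_Latn.",
        UserWarning,
    )
    return 'eng_Latn'
```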
```python
def translate_from_english(self, text: str, tgt_lang: str = None) -> str:
    """
    Translate from English back to original language.
    If original was low confidence (eng_Latn), passes through unchanged.
    """
    if tgt_lang is None:
        if self.detected_user_lang is None:
            print("[WARNING] No detected language stored. Returning as-is.")
            return text
        tgt_lang = self.detected_user_lang

    if tgt_lang == 'eng_Latn':
        return text

    return self.translate(text, 'eng_Latn', tgt_lang)
```
The translate_from_english method relies on the detected_user_lang instance variable set by translate_to_english. This creates a stateful dependency between method calls that could lead to bugs if the methods are called out of order or in a multi-threaded context. Consider making this stateless by requiring the target language as a parameter or documenting this requirement clearly.
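A stateless sketch of the round trip, where the detected language flows through return values instead of instance state (`detect_language` is a hypothetical name for the class's fastText detection step):

```python
def translate_to_english(self, text: str) -> tuple[str, str]:
    """Return (translation, detected_lang) so the caller owns the state."""
    src_lang = self.detect_language(text)  # hypothetical helper name
    return self.translate(text, src_lang, 'eng_Latn'), src_lang

def translate_from_english(self, text: str, tgt_lang: str) -> str:
    """tgt_lang is now required: pass the code translate_to_english returned."""
    if tgt_lang == 'eng_Latn':
        return text
    return self.translate(text, 'eng_Latn', tgt_lang)
```

This also makes concurrent use safe, since no per-request language is stored on `self`.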
```python
    b = bleu.sentence_score(candidate, refs).score
    c = chrf.sentence_score(candidate, refs).score
    scores[model] = 0.5 * b + 0.5 * c
except:
```
The bare `except:` block directly handles `BaseException`, which also swallows `KeyboardInterrupt` and `SystemExit`.
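A sketch of the narrowed handler (the warning message is illustrative):

```python
try:
    b = bleu.sentence_score(candidate, refs).score
    c = chrf.sentence_score(candidate, refs).score
    scores[model] = 0.5 * b + 0.5 * c
except Exception as exc:
    # Exception excludes KeyboardInterrupt/SystemExit, so Ctrl-C still works;
    # record the failure instead of silently skipping this model.
    print(f"[WARNING] Scoring failed for {model}: {exc}")
```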
```python
except:
    pass
```
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
-except:
-    pass
+except Exception as exc:
+    # Gradient checkpointing is an optional optimization; continue without it if enabling fails.
+    print(f" ⚠️ Could not enable gradient checkpointing: {exc}")
```
```python
except:
    pass
```
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
-except:
-    pass
+except Exception as save_err:
+    log(f"⚠️ Failed to save emergency checkpoint: {save_err}")
```
```python
lang_data = load_jsonl(filepath)

if SAMPLES_PER_LANGUAGE:
    lang_data = lang_data[:SAMPLES_PER_LANGUAGE]
```
This statement is unreachable, likely because `SAMPLES_PER_LANGUAGE` is a constant that evaluates as falsy (e.g. `None` or `0`), so the truncation branch can never execute.
```python
if lang not in writers:
    out_path = out_dir / f"wikipedia_{lang}_pretraining.jsonl"
    writers[lang] = open(out_path, "w", encoding=ENCODING)
```
File is opened but is not closed.
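A sketch using `contextlib.ExitStack` to guarantee every per-language writer is closed, even if processing raises part-way through (the `records` loop is assumed for illustration):

```python
import json
from contextlib import ExitStack

with ExitStack() as stack:
    writers = {}
    for record in records:  # hypothetical iteration over the input data
        lang = record["lang"]
        if lang not in writers:
            out_path = out_dir / f"wikipedia_{lang}_pretraining.jsonl"
            # enter_context registers the file for closing when the block exits.
            writers[lang] = stack.enter_context(open(out_path, "w", encoding=ENCODING))
        writers[lang].write(json.dumps(record, ensure_ascii=False) + "\n")
```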
This PR introduces a translation interface for NLLB-200 with fastText detection.

✨ Key Contributions

- Translator (`translator.py`) for multimeditron inference
- Consensus-based data generation
- Fine-tuning & evaluation framework