examples/puzzletron/README.md (16 changes: 2 additions & 14 deletions)

@@ -275,21 +275,9 @@ vllm bench throughput --model path/to/model --input-len 2000 --output-len 100 --

## Knowledge Distillation

-To recover degradation in the quality of the compressed model, we can use knowledge distillation. This allows transferring the capabilities of the original model to the pruned one. For this, we will use [NeMo framework](https://github.com/NVIDIA-NeMo/NeMo) with the [nemo:25.07](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=25.07) container.
+To recover degradation in the quality of the compressed model, we can use knowledge distillation. This allows transferring the capabilities of the original model to the pruned one.

-First, convert the HF model to NeMo format:
-
-```bash
-python -m nemo_export/convert_hf_to_nemo --input-ckpt-path path/to/HF-model --output-ckpt-path path/to/save/model-nemo
-```
-
-Now you can utilize all the training features available in NeMo, including distillation. Please refer to the [NeMo distillation documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/distillation/distillation.html).
-
-[Optional] Once distillation is complete, you can convert the distilled model back to the HuggingFace format.
-
-```bash
-python -m nemo_export/convert_nemo_to_hf --input-ckpt-path path/to/nemo-model --output-ckpt-path path/to/save/model-HF
-```
+See [mbridge_distillation/README.md](./mbridge_distillation/README.md) for instructions on using Megatron-Bridge for knowledge distillation.

## Advanced Usage

examples/puzzletron/mbridge_distillation/README.md (new file, 146 additions)

@@ -0,0 +1,146 @@
# Knowledge Distillation with Megatron-Bridge

This guide shows how to perform knowledge distillation on Puzzletron-compressed AnyModel checkpoints using Megatron-Bridge.

## Overview

1. Set up the environment with Megatron-Bridge
2. Prepare and tokenize a training dataset
3. Convert AnyModel checkpoints (student and teacher) to Megatron-Bridge format
4. Run knowledge distillation training
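
For intuition about what the distillation step optimizes, the sketch below shows a minimal logit-distillation objective in plain PyTorch. It is illustrative only: the actual objective used by `distill_anymodel.py` and Megatron-Bridge may differ (for example, by mixing the distillation term with the standard language-modeling loss or by distilling intermediate activations).

```python
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student next-token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2, as in standard distillation formulations.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2


# Toy example with random logits shaped [batch, sequence, vocab]
student = torch.randn(2, 8, 1024)
teacher = torch.randn(2, 8, 1024)
print(kd_loss(student, teacher))
```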

## Setup

> **Temporary Setup:** The NeMo docker container ships Megatron-Bridge from its main branch, but Puzzletron requires a specific pinned commit that is not included by default, so the manual setup below is needed. Once the container includes the required version, this step will no longer be necessary.

**Note:** Set `$WORKSPACE` to your project root directory before running these commands:

```bash
export WORKSPACE=/path/to/your/project
```

1. **Clone Megatron-Bridge:**

> **Review comment (Collaborator):** Megatron-Bridge is already cloned in the container at `/opt/Megatron-Bridge`. Why don't we just do the following inside the container: `cd /opt/Megatron-Bridge && git checkout 960a718cb8989676b258e107d538642717e22e39`?

Clone [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) and check out the specific commit required for Puzzletron:

```bash
cd $WORKSPACE
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
cd Megatron-Bridge
git checkout 960a718cb8989676b258e107d538642717e22e39
```

2. **Initialize Megatron-Bridge submodules:**

```bash
cd $WORKSPACE/Megatron-Bridge
git submodule init
git submodule update
```

3. **Start Docker container with mounts:**

> **Review comment (Collaborator):** If these same steps work for 26.02, can we use that instead?

Use the [NeMo 25.11 container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=25.11):

```bash
docker run --gpus all -it --rm \
  -v $WORKSPACE:/workspace \
  -v $WORKSPACE/Megatron-Bridge/3rdparty/Megatron-LM:/opt/megatron-lm \
  nvcr.io/nvidia/nemo:25.11 \
  /bin/bash
```

> **Review comment (Collaborator):** FYI, for 26.02, Megatron-LM is at `/opt/Megatron-Bridge/3rdparty/Megatron-LM`.

**Note:** The mount `/opt/megatron-lm` is required because Megatron-Bridge depends on the Megatron-LM submodule.

4. **Set up the environment inside the container:**

```bash
export PYTHONPATH="/workspace/Megatron-Bridge/src:/workspace/Model-Optimizer:${PYTHONPATH}"
```
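
Optionally, run a quick import check to confirm the paths are picked up. This is a minimal sanity check that assumes the import names `megatron.bridge` (from the Megatron-Bridge source tree) and `modelopt` (from Model-Optimizer); adjust if your layout differs:

```bash
# Sanity check that the mounted source trees are importable (assumed package names).
python -c "import megatron.bridge; import modelopt; print('Megatron-Bridge and ModelOpt imports OK')"
```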

## Dataset Preparation

This section describes how to prepare a dataset for knowledge distillation. The examples below use a toy dataset (WikiText-103) for illustration; the same process applies to production datasets such as Nemotron-Post-Training-Dataset-v2 (see the note below).

> **Note:** For actual knowledge distillation, use a larger, more representative dataset like [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

### Step 1: Download Dataset

First, download the dataset and save it in JSONL format. For WikiText-103, you can use the following script:

```python
# download_hf_wikitext_dataset.py
import json
import os
from datasets import load_dataset

DATA_PATH = "path/to/hf_datasets/wikitext-103-v1"
# Load the WikiText-103 dataset
dataset = load_dataset("wikitext", "wikitext-103-v1", split="train")

# Define the destination folder
os.makedirs(DATA_PATH, exist_ok=True)

# Save the train split to a JSONL file
with open(f"{DATA_PATH}/wikitext-train.jsonl", "w") as file:
    file.writelines(json.dumps(item) + "\n" for item in dataset)

print(f"Raw dataset saved to {DATA_PATH}/wikitext-train.jsonl")
```
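
Each line of the resulting JSONL file is a single record with a `"text"` field, which is what the tokenization step below selects via `json_keys=["text"]`. A quick check, reusing the same placeholder `DATA_PATH`:

```python
# Print the first JSONL record to confirm the {"text": ...} layout.
DATA_PATH = "path/to/hf_datasets/wikitext-103-v1"
with open(f"{DATA_PATH}/wikitext-train.jsonl") as f:
    print(f.readline())
```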

### Step 2: Tokenize Dataset

Next, tokenize the JSONL dataset using the tokenizer from your model. This converts the text data into token IDs that can be used for training:

```python
# tokenize_wikitext_dataset.py
from modelopt.torch.utils.plugins import megatron_preprocess_data

DATA_PATH = "path/to/hf_datasets/wikitext-103-v1"
HF_MODEL_NAME_OR_PATH = "path/to/your/model/checkpoint"

megatron_preprocess_data(
    input_path=f"{DATA_PATH}/wikitext-train.jsonl",
    output_dir=DATA_PATH,
    tokenizer_name_or_path=HF_MODEL_NAME_OR_PATH,
    json_keys=["text"],
    workers=32,
    log_interval=100000,
)
```
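
Assuming `megatron_preprocess_data` follows Megatron-LM's standard preprocessing behavior (an assumption, not something this guide states), it writes an indexed dataset as a `.bin`/`.idx` pair under `DATA_PATH`, and the shared prefix of that pair (the path without the extension) is what a Megatron-style `--data-path` argument typically expects in the distillation step below:

```bash
# Inspect the tokenized output; expect a .bin/.idx pair whose common prefix
# (path without extension) is later passed as --data-path.
ls path/to/hf_datasets/wikitext-103-v1
```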

## Convert Checkpoints to Megatron-Bridge Format

Convert both student and teacher checkpoints:

```bash
# Convert student checkpoint
torchrun --nproc_per_node=1 examples/puzzletron/mbridge_distillation/import_anymodel_to_mbridge.py \
  --input-ckpt-path /path/to/student/anymodel/checkpoint \
  --output-ckpt-path /path/to/student/mbridge/checkpoint

# Convert teacher checkpoint
torchrun --nproc_per_node=1 examples/puzzletron/mbridge_distillation/import_anymodel_to_mbridge.py \
  --input-ckpt-path /path/to/teacher/anymodel/checkpoint \
  --output-ckpt-path /path/to/teacher/mbridge/checkpoint
```
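
The distillation step below points at an `iter_0000000` subdirectory inside each converted checkpoint, so a quick existence check (using the same placeholder paths) can catch conversion problems early:

```bash
# Placeholder paths from the conversion step above; substitute your own.
ls /path/to/student/mbridge/checkpoint/iter_0000000
ls /path/to/teacher/mbridge/checkpoint/iter_0000000
```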

## Run Knowledge Distillation

Run distillation with the tokenized dataset:

```bash
torchrun --nproc_per_node=8 examples/puzzletron/mbridge_distillation/distill_anymodel.py \
  --student-mbridge-ckpt /path/to/student/mbridge/checkpoint/iter_0000000 \
  --teacher-mbridge-ckpt /path/to/teacher/mbridge/checkpoint/iter_0000000 \
  --data-path /path/to/tokenized/dataset \
  --output-dir ./distilled_output \
  dataset.sequence_length=8192 \
  model.tensor_model_parallel_size=8 \
  model.teacher.tensor_model_parallel_size=8 \
  train.global_batch_size=4 \
  train.micro_batch_size=1 \
  train.train_iters=5000 \
  logger.log_interval=1
```
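
As a sanity check on the settings above (assuming standard Megatron parallelism semantics): with 8 GPUs and `model.tensor_model_parallel_size=8`, the data-parallel size is 8 / 8 = 1, so `train.global_batch_size=4` with `train.micro_batch_size=1` implies 4 / (1 × 1) = 4 gradient-accumulation steps per iteration.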

The distilled checkpoint will be saved to `--output-dir`.