examples/puzzletron/README.md (16 changes: 2 additions & 14 deletions)

@@ -275,21 +275,9 @@ vllm bench throughput --model path/to/model --input-len 2000 --output-len 100 --

## Knowledge Distillation

-To recover degradation in the quality of the compressed model, we can use knowledge distillation. This allows transferring the capabilities of the original model to the pruned one. For this, we will use [NeMo framework](https://github.com/NVIDIA-NeMo/NeMo) with the [nemo:25.07](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=25.07) container.
+To recover degradation in the quality of the compressed model, we can use knowledge distillation. This allows transferring the capabilities of the original model to the pruned one.

-First, convert the HF model to NeMo format:
-
-```bash
-python -m nemo_export/convert_hf_to_nemo --input-ckpt-path path/to/HF-model --output-ckpt-path path/to/save/model-nemo
-```
-
-Now you can utilize all the training features available in NeMo, including distillation. Please refer to the [NeMo distillation documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/distillation/distillation.html).
-
-[Optional] Once distillation is complete, you can convert the distilled model back to the HuggingFace format.
-
-```bash
-python -m nemo_export/convert_nemo_to_hf --input-ckpt-path path/to/nemo-model --output-ckpt-path path/to/save/model-HF
-```
+See [mbridge_distillation/README.md](./mbridge_distillation/README.md) for instructions on using Megatron-Bridge for knowledge distillation.

## Advanced Usage

examples/puzzletron/mbridge_distillation/README.md (new file, 146 additions)

@@ -0,0 +1,146 @@
# Knowledge Distillation with Megatron-Bridge

This guide shows how to perform knowledge distillation on Puzzletron-compressed AnyModel checkpoints using Megatron-Bridge.

## Overview

1. Set up the environment with Megatron-Bridge
2. Prepare and tokenize a training dataset
3. Convert AnyModel checkpoints (student and teacher) to Megatron-Bridge format
4. Run knowledge distillation training
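
For intuition about what the distillation step optimizes, the sketch below shows a minimal logit-distillation objective in plain PyTorch. It is illustrative only: the actual objective used by `distill_anymodel.py` and Megatron-Bridge may differ (for example, by mixing the distillation term with the standard language-modeling loss or by distilling intermediate activations).

```python
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student next-token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2, as in standard distillation formulations.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2


# Toy example with random logits shaped [batch, sequence, vocab]
student = torch.randn(2, 8, 1024)
teacher = torch.randn(2, 8, 1024)
print(kd_loss(student, teacher))
```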

## Setup

> **Temporary Setup:** The NeMo docker container ships Megatron-Bridge from its main branch, but Puzzletron requires a specific pinned commit that is not included by default, so the manual setup below is needed. Once the container includes the required version, this step will no longer be necessary.

**Note:** Set `$WORKSPACE` to your project root directory before running these commands:

```bash
export WORKSPACE=/path/to/your/project
```

1. **Clone Megatron-Bridge:**

> **Review comment (Collaborator):** Megatron-Bridge is already cloned in the container at `/opt/Megatron-Bridge`. Why don't we just do the following inside the container: `cd /opt/Megatron-Bridge && git checkout 960a718cb8989676b258e107d538642717e22e39`?

Clone [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) and check out the specific commit required for Puzzletron:

```bash
cd $WORKSPACE
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
cd Megatron-Bridge
git checkout 960a718cb8989676b258e107d538642717e22e39
```

2. **Initialize Megatron-Bridge submodules:**

```bash
cd $WORKSPACE/Megatron-Bridge
git submodule init
git submodule update
```

3. **Start Docker container with mounts:**

> **Review comment (Collaborator):** If these same steps work for 26.02, can we use that instead?

Use the [NeMo 25.11 container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=25.11):

```bash
docker run --gpus all -it --rm \
  -v $WORKSPACE:/workspace \
  -v $WORKSPACE/Megatron-Bridge/3rdparty/Megatron-LM:/opt/megatron-lm \
  nvcr.io/nvidia/nemo:25.11 \
  /bin/bash
```

> **Review comment (Collaborator):** FYI, for 26.02, Megatron-LM is at `/opt/Megatron-Bridge/3rdparty/Megatron-LM`.

**Note:** The mount `/opt/megatron-lm` is required because Megatron-Bridge depends on the Megatron-LM submodule.

4. **Set up the environment inside the container:**

```bash
export PYTHONPATH="/workspace/Megatron-Bridge/src:/workspace/Model-Optimizer:${PYTHONPATH}"
```
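
Optionally, run a quick import check to confirm the paths are picked up. This is a minimal sanity check that assumes the import names `megatron.bridge` (from the Megatron-Bridge source tree) and `modelopt` (from Model-Optimizer); adjust if your layout differs:

```bash
# Sanity check that the mounted source trees are importable (assumed package names).
python -c "import megatron.bridge; import modelopt; print('Megatron-Bridge and ModelOpt imports OK')"
```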

## Dataset Preparation

This section describes how to prepare a dataset for knowledge distillation. The examples below use a toy dataset (WikiText-103) for illustration; the same process applies to production datasets such as Nemotron-Post-Training-Dataset-v2 (see the note below).

> **Note:** For actual knowledge distillation, use a larger, more representative dataset like [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

### Step 1: Download Dataset

First, download the dataset and save it in JSONL format. For WikiText-103, you can use the following script:

```python
# download_hf_wikitext_dataset.py
import json
import os
from datasets import load_dataset

DATA_PATH = "path/to/hf_datasets/wikitext-103-v1"
# Load the WikiText-103 dataset
dataset = load_dataset("wikitext", "wikitext-103-v1", split="train")

# Define the destination folder
os.makedirs(DATA_PATH, exist_ok=True)

# Save the train split to a JSONL file
with open(f"{DATA_PATH}/wikitext-train.jsonl", "w") as file:
    file.writelines(json.dumps(item) + "\n" for item in dataset)

print(f"Raw dataset saved to {DATA_PATH}/wikitext-train.jsonl")
```
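
Each line of the resulting JSONL file is a single record with a `"text"` field, which is what the tokenization step below selects via `json_keys=["text"]`. A quick check, reusing the same placeholder `DATA_PATH`:

```python
# Print the first JSONL record to confirm the {"text": ...} layout.
DATA_PATH = "path/to/hf_datasets/wikitext-103-v1"
with open(f"{DATA_PATH}/wikitext-train.jsonl") as f:
    print(f.readline())
```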

### Step 2: Tokenize Dataset

Next, tokenize the JSONL dataset using the tokenizer from your model. This converts the text data into token IDs that can be used for training:

```python
# tokenize_wikitext_dataset.py
from modelopt.torch.utils.plugins import megatron_preprocess_data

DATA_PATH = "path/to/hf_datasets/wikitext-103-v1"
HF_MODEL_NAME_OR_PATH = "path/to/your/model/checkpoint"

megatron_preprocess_data(
    input_path=f"{DATA_PATH}/wikitext-train.jsonl",
    output_dir=DATA_PATH,
    tokenizer_name_or_path=HF_MODEL_NAME_OR_PATH,
    json_keys=["text"],
    workers=32,
    log_interval=100000,
)
```
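
Assuming `megatron_preprocess_data` follows Megatron-LM's standard preprocessing behavior (an assumption, not something this guide states), it writes an indexed dataset as a `.bin`/`.idx` pair under `DATA_PATH`, and the shared prefix of that pair (the path without the extension) is what a Megatron-style `--data-path` argument typically expects in the distillation step below:

```bash
# Inspect the tokenized output; expect a .bin/.idx pair whose common prefix
# (path without extension) is later passed as --data-path.
ls path/to/hf_datasets/wikitext-103-v1
```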

## Convert Checkpoints to Megatron-Bridge Format

Convert both student and teacher checkpoints:

```bash
# Convert student checkpoint
torchrun --nproc_per_node=1 examples/puzzletron/mbridge_distillation/import_anymodel_to_mbridge.py \
  --input-ckpt-path /path/to/student/anymodel/checkpoint \
  --output-ckpt-path /path/to/student/mbridge/checkpoint

# Convert teacher checkpoint
torchrun --nproc_per_node=1 examples/puzzletron/mbridge_distillation/import_anymodel_to_mbridge.py \
  --input-ckpt-path /path/to/teacher/anymodel/checkpoint \
  --output-ckpt-path /path/to/teacher/mbridge/checkpoint
```
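
The distillation step below points at an `iter_0000000` subdirectory inside each converted checkpoint, so a quick existence check (using the same placeholder paths) can catch conversion problems early:

```bash
# Placeholder paths from the conversion step above; substitute your own.
ls /path/to/student/mbridge/checkpoint/iter_0000000
ls /path/to/teacher/mbridge/checkpoint/iter_0000000
```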

## Run Knowledge Distillation

Run distillation with the tokenized dataset:

```bash
torchrun --nproc_per_node=8 examples/puzzletron/mbridge_distillation/distill_anymodel.py \
  --student-mbridge-ckpt /path/to/student/mbridge/checkpoint/iter_0000000 \
  --teacher-mbridge-ckpt /path/to/teacher/mbridge/checkpoint/iter_0000000 \
  --data-path /path/to/tokenized/dataset \
  --output-dir ./distilled_output \
  dataset.sequence_length=8192 \
  model.tensor_model_parallel_size=8 \
  model.teacher.tensor_model_parallel_size=8 \
  train.global_batch_size=4 \
  train.micro_batch_size=1 \
  train.train_iters=5000 \
  logger.log_interval=1
```
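
As a sanity check on the settings above (assuming standard Megatron parallelism semantics): with 8 GPUs and `model.tensor_model_parallel_size=8`, the data-parallel size is 8 / 8 = 1, so `train.global_batch_size=4` with `train.micro_batch_size=1` implies 4 / (1 × 1) = 4 gradient-accumulation steps per iteration.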

The distilled checkpoint will be saved to `--output-dir`.