
Support mbridge distillation for any_model #904

Open
danielkorzekwa wants to merge 6 commits into dkorzekwa/any_model from dkorzekwa/any_model_mbridge_distillation

Conversation

@danielkorzekwa

What does this PR do?

  • hf_to_mcore mbridge converter (examples for llama and qwen models)
  • distillation script

- Add distill_anymodel.py: Knowledge distillation script for AnyModel checkpoints
- Add import_anymodel_to_mbridge.py: Import script to convert HF AnyModel to MBridge format
- Update base.py: Simplify HeterogeneousBridgeMixin for AnyModel support
- Add __init__.py: Module initialization
- Add llama.py: Llama bridge implementation
- Add qwen3.py: Qwen3 bridge implementation
@danielkorzekwa danielkorzekwa requested review from a team as code owners February 18, 2026 18:26
@danielkorzekwa danielkorzekwa requested review from realAsma and removed request for a team February 18, 2026 18:26
@coderabbitai

coderabbitai bot commented Feb 18, 2026

Review skipped

Auto reviews are disabled on base/target branches other than the default branch (auto-reviewed bases: main, release/.*, feature/.*). To trigger a single review, comment @coderabbitai review.

@kevalmorabia97 kevalmorabia97 requested review from AAnoosheh and kevalmorabia97 and removed request for realAsma February 18, 2026 18:29
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
OmegaConf conversion tries to access per_block_parameters which may not
be initialized when loading from YAML. Return empty list as fallback.
"""
if name == "per_block_parameters":
Collaborator:

Should this also check whether that attr is set, and only return [] when it is not, otherwise returning whatever is already set?
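For illustration, a minimal sketch of that behavior as a standalone helper; the function name and the getattr-based lookup are assumptions, not the PR's actual structure, and the real change would live where the `if name == "per_block_parameters":` branch is today:

```python
def per_block_parameters_or_default(cfg):
    """Hypothetical helper illustrating the suggestion above."""
    value = getattr(cfg, "per_block_parameters", None)
    # Fall back to [] only when the attribute was never initialized;
    # otherwise return whatever is already set.
    return [] if value is None else value
```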

Collaborator:

Do you think at some point we should just upstream this to the Megatron-Bridge repo, since it's a standard Megatron feature with nothing to do with model optimization?

```bash
export WORKSPACE=/path/to/your/project
```

1. **Clone Megatron-Bridge:**
Collaborator:

Megatron-Bridge is already cloned in the container at /opt/Megatron-Bridge. Why don't we just do the following inside the container: cd /opt/Megatron-Bridge && git checkout 960a718cb8989676b258e107d538642717e22e39?

```bash
git submodule update
```

3. **Start Docker container with mounts:**
Collaborator:

If these same steps work for 26.02, can we use that instead?

```bash
docker run --gpus all -it --rm \
-v $WORKSPACE:/workspace \
-v $WORKSPACE/Megatron-Bridge/3rdparty/Megatron-LM:/opt/megatron-lm \
```

Collaborator:

Fyi, for 26.02, Megatron-LM is at /opt/Megatron-Bridge/3rdparty/Megatron-LM

),
dataset=GPTDatasetConfig(
random_seed=1234,
blend=[[data_path], [1.0]],
Collaborator:

For larger real datasets, we will likely have more than one data path (because the dataset is split into multiple shards). We can let the user pass all of them as a command-line argument instead of hard-coding a one-element list. See my distill.py for reference.
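A rough sketch of that suggestion, assuming an argparse-based entry point; the --data-paths flag name and equal weighting are illustrative choices, GPTDatasetConfig is the one already imported in the script, and its other fields are omitted:

```python
import argparse

parser = argparse.ArgumentParser()
# Illustrative flag: accept one or more preprocessed dataset prefixes.
parser.add_argument("--data-paths", nargs="+", required=True)
args = parser.parse_args()

# Weight all shards equally; blend takes ([paths], [weights]) as in the hunk above.
weights = [1.0 / len(args.data_paths)] * len(args.data_paths)
dataset = GPTDatasetConfig(
    random_seed=1234,
    blend=[args.data_paths, weights],
    # ...remaining dataset fields as in the current script...
)
```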

checkpoint=CheckpointConfig(load=student_ckpt, save=output_dir),
logger=LoggerConfig(),
tokenizer=TokenizerConfig(tokenizer_type="HuggingFaceTokenizer", tokenizer_model=None),
validation=ValidationConfig(eval_interval=500, eval_iters=100),
Collaborator:

I see now why your script does not work for 26.02: in 26.02 they had not yet introduced a separate ValidationConfig, and all of its args are instead passed to TrainingConfig.
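To make the version difference concrete, a hedged sketch of both layouts; the 26.02 field placement follows the comment above and should be verified against that release, and both config classes are assumed to come from the script's existing Megatron-Bridge imports:

```python
# Newer Megatron-Bridge (as in this PR): eval settings live on ValidationConfig.
validation = ValidationConfig(eval_interval=500, eval_iters=100)

# 26.02-era layout per the comment above: no separate ValidationConfig; the
# same args are passed to TrainingConfig instead (assumed field names).
train = TrainingConfig(
    eval_interval=500,
    eval_iters=100,
    # ...other training args as in the current script...
)
```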

use_fp16: bool,
) -> ConfigContainer:
"""Create base ConfigContainer with defaults."""
return ConfigContainer(
Collaborator:

You are missing mixed_precision="bf16_mixed" here, which would make training faster.
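For concreteness, a sketch of that change on the ConfigContainer from the hunk above; only fields visible in this diff are shown, and the rest of the container is assumed unchanged:

```python
return ConfigContainer(
    # Suggested addition: bf16 mixed precision for faster training.
    mixed_precision="bf16_mixed",
    checkpoint=CheckpointConfig(load=student_ckpt, save=output_dir),
    # ...remaining fields (logger, tokenizer, validation, ddp, ...) unchanged...
)
```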

logger=LoggerConfig(),
tokenizer=TokenizerConfig(tokenizer_type="HuggingFaceTokenizer", tokenizer_model=None),
validation=ValidationConfig(eval_interval=500, eval_iters=100),
ddp=DistributedDataParallelConfig(grad_reduce_in_fp32=True),
Collaborator:

This is what mbridge uses for its training scripts. I think we should use the same here as well:

        ddp=DistributedDataParallelConfig(
            check_for_nan_in_grad=True,
            grad_reduce_in_fp32=True,
            overlap_grad_reduce=True,
            overlap_param_gather=True,
            average_in_collective=True,
            use_distributed_optimizer=True,
        ),

)


def merge_checkpoint_configs(
Collaborator:

Why do we need this? Shouldn't the optimizer state automatically load from the checkpoint if resuming training?
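For illustration, a minimal sketch of the expectation behind this question, using only the CheckpointConfig fields visible in the diff; whether optimizer state is actually restored this way needs to be confirmed against Megatron-Bridge:

```python
# Assumed behavior: when resuming, pointing `load` at the previously saved
# checkpoint restores model and optimizer state without a manual merge step.
checkpoint = CheckpointConfig(
    load=output_dir,  # resume from the last saved distillation checkpoint
    save=output_dir,  # keep writing new checkpoints to the same directory
)
```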

…ze() on DistillationProvider.provide()

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>