Support mbridge distillation for any_model #904

danielkorzekwa wants to merge 6 commits into dkorzekwa/any_model from …
Conversation
- Add distill_anymodel.py: Knowledge distillation script for AnyModel checkpoints
- Add import_anymodel_to_mbridge.py: Import script to convert HF AnyModel to MBridge format
- Update base.py: Simplify HeterogeneousBridgeMixin for AnyModel support
- Add __init__.py: Module initialization
- Add llama.py: Llama bridge implementation
- Add qwen3.py: Qwen3 bridge implementation
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
```python
        OmegaConf conversion tries to access per_block_parameters which may not
        be initialized when loading from YAML. Return empty list as fallback.
        """
        if name == "per_block_parameters":
```
Should this also check whether that attribute is set, and only return `[]` when it is not, otherwise returning whatever is set?
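For illustration, a minimal sketch of that guard, assuming the value (when present) is stored on the object under the same name; the standalone helper and its name are hypothetical, not the actual HeterogeneousBridgeMixin code:

```python
from typing import Any, List


def _per_block_parameters_or_default(cfg: Any) -> List:
    """Return per_block_parameters if it was set; otherwise an empty list.

    Only falls back to [] when the attribute was never initialized,
    e.g. right after loading the config from YAML.
    """
    value = getattr(cfg, "per_block_parameters", None)
    return value if value is not None else []
```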
Do you think at some point we should just upstream this into the Megatron-Bridge repo, since it's a standard Megatron feature with nothing to do with model optimization?
````markdown
export WORKSPACE=/path/to/your/project
```

1. **Clone Megatron-Bridge:**
````
Megatron-Bridge is already cloned in the container at `/opt/Megatron-Bridge`. Why don't we just do the following inside the container: `cd /opt/Megatron-Bridge && git checkout 960a718cb8989676b258e107d538642717e22e39`?
````markdown
git submodule update
```

3. **Start Docker container with mounts:**
````
If these same steps work with 26.02, can we use that container instead?
```bash
docker run --gpus all -it --rm \
  -v $WORKSPACE:/workspace \
  -v $WORKSPACE/Megatron-Bridge/3rdparty/Megatron-LM:/opt/megatron-lm \
```
FYI, for 26.02, Megatron-LM is at `/opt/Megatron-Bridge/3rdparty/Megatron-LM`.
```python
        ),
        dataset=GPTDatasetConfig(
            random_seed=1234,
            blend=[[data_path], [1.0]],
```
For larger real datasets we will likely have more than one data path (the dataset is split into multiple shards). We can let the user pass them all as a command-line argument instead of hard-coding a one-element list. See my distill.py for reference.
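For illustration, a minimal sketch of that approach; the `--data-paths` flag name and equal weighting are assumptions, and the actual convention should follow the referenced distill.py:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--data-paths",
    nargs="+",
    required=True,
    help="One or more dataset shard prefixes, space separated.",
)
args = parser.parse_args()

# blend takes parallel lists of paths and weights; weight the shards equally here.
weights = [1.0 / len(args.data_paths)] * len(args.data_paths)
blend = [args.data_paths, weights]
```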
```python
        checkpoint=CheckpointConfig(load=student_ckpt, save=output_dir),
        logger=LoggerConfig(),
        tokenizer=TokenizerConfig(tokenizer_type="HuggingFaceTokenizer", tokenizer_model=None),
        validation=ValidationConfig(eval_interval=500, eval_iters=100),
```
I see now why your script does not work with 26.02: in 26.02 a separate ValidationConfig had not yet been introduced, and its arguments are instead passed to TrainingConfig.
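One way to tolerate both layouts, sketched below, is to feature-detect ValidationConfig instead of pinning a release; the import path is an assumption based on the classes used in this diff and may need adjusting:

```python
# Feature-detect whether this Megatron-Bridge build has a separate ValidationConfig.
# The module path below is assumed, not verified against 26.02.
try:
    from megatron.bridge.training.config import ValidationConfig  # assumed path

    HAS_VALIDATION_CONFIG = True
except ImportError:
    # 26.02-style layout: eval_interval / eval_iters are passed to TrainingConfig instead.
    HAS_VALIDATION_CONFIG = False
```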
```python
    use_fp16: bool,
) -> ConfigContainer:
    """Create base ConfigContainer with defaults."""
    return ConfigContainer(
```
You are missing `mixed_precision="bf16_mixed"` here, which would make training faster.
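For reference, a sketch of where that argument would go; only the new line is shown, the enclosing function is the one in the hunk above, and the remaining arguments stay unchanged:

```python
    return ConfigContainer(
        mixed_precision="bf16_mixed",  # reviewer's suggestion: BF16 mixed precision for faster training
        # ... existing logger / tokenizer / validation / ddp arguments unchanged ...
    )
```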
```python
        logger=LoggerConfig(),
        tokenizer=TokenizerConfig(tokenizer_type="HuggingFaceTokenizer", tokenizer_model=None),
        validation=ValidationConfig(eval_interval=500, eval_iters=100),
        ddp=DistributedDataParallelConfig(grad_reduce_in_fp32=True),
```
This is what mbridge uses for its training scripts; I think we should use the same settings here as well:
```python
ddp=DistributedDataParallelConfig(
    check_for_nan_in_grad=True,
    grad_reduce_in_fp32=True,
    overlap_grad_reduce=True,
    overlap_param_gather=True,
    average_in_collective=True,
    use_distributed_optimizer=True,
),
```
```python
    )


def merge_checkpoint_configs(
```
Why do we need this? Shouldn't the optimizer state automatically load from the checkpoint when resuming training?
…ze() on DistillationProvider.provide()
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
What does this PR do?