
Conversation

@chai-xiaonan

Reconstructs Nemo-Bridge on top of the restructured FlagScale codebase. FlagScale now supports a subset of Nemo-Bridge functionality, allowing the framework to load and save checkpoints in Hugging Face (HF) format during training. This version also adds a new option, save_hf_interval, which sets the number of iterations between HF weight saves. Accuracy has been verified for Deepseek V3 16_a3B, Qwen3-32B, and Qwen3-0.6B.

# Load the HF model from config
config_load = args.hf_config_path
config = safe_load_config_with_retry(config_load, trust_remote_code=False)
bridge = AutoBridge.from_hf_config(config)
Collaborator

Will this save-ckpt step allocate extra GPU memory when initializing an HF model?

bridge.load_hf_weights(ddp_model)
# no optimizer weights
iteration = 0
num_floating_point_operations_so_far = 0
Collaborator

Please add print_rank_0 here.
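For context on the suggestion above: print_rank_0 is the Megatron-LM convention for logging only from global rank 0 so that multi-GPU runs don't emit duplicated lines. A minimal sketch of such a helper (the real one lives in Megatron's utilities; this version is illustrative):

```python
import torch.distributed as dist


def print_rank_0(message: str) -> None:
    # Print only on global rank 0 when torch.distributed is initialized,
    # so each log line appears once instead of once per worker.
    if dist.is_initialized():
        if dist.get_rank() == 0:
            print(message, flush=True)
    else:
        # Single-process runs (no process group) just print normally.
        print(message, flush=True)
```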

# use megatron bridge
from megatron.nemo_bridge.models import AutoBridge

bridge = AutoBridge.from_hf_pretrained(load_dir)
bridge.load_hf_weights(ddp_model)
Collaborator

Can nemo-bridge's load_hf_weights handle a ddp_model directly, where ddp_model is wrapped by DistributedDataParallel?
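If the bridge turns out to expect a bare Megatron module rather than the DDP wrapper, one common pattern is to unwrap before passing the model. A hypothetical helper sketching that (the `unwrap_model` name is illustrative and not part of nemo-bridge's API):

```python
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def unwrap_model(model: nn.Module) -> nn.Module:
    # DDP stores the wrapped module in its `.module` attribute;
    # return that if the model is DDP-wrapped, else pass it through.
    return model.module if isinstance(model, DDP) else model
```

Whether this indirection is needed depends on what `load_hf_weights` accepts, which is exactly the question the reviewer is raising.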

@@ -0,0 +1,8 @@
# Copyright (c) 2025, BAAI. All rights reserved.
Collaborator

NeMo's megatron-bridge can be installed via pip (ref: https://pypi.org/project/megatron-bridge/).
Please remove the copied source code.

@@ -0,0 +1,8 @@
# Copyright (c) 2025, BAAI. All rights reserved.
Collaborator

Rename flagscale/train/megatron/nemo_bridge to flagscale/train/megatron/bridge so that it matches the import pattern from megatron.bridge

Contributor

@tengqm left a comment

When copy-pasting source code from other repos, we are obliged to preserve their copyright notice as well; we cannot claim copyright on this code.
The original code has the following copyright header, which must be preserved:

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@@ -0,0 +1,110 @@
# Copyright (c) 2025, BAAI. All rights reserved.
#
# Copied from: https://github.com/NVIDIA-NeMo/Megatron-Bridge
Contributor

If Megatron-Bridge has a copyright claim, we are supposed to paste their copyright statement here as well.
