
fix + tests dense & MoE TP all reduce (decoder only) #43722

Open
3outeille wants to merge 137 commits into main from fix-moe-ep

Conversation

Member

@3outeille 3outeille commented Feb 3, 2026

Let's make sure it works for decoder-only models first (we skip VLM + encoder-decoder for now).

Forward, backward, and generation (with convert mapping triggering) are tested by comparing TP against a non-TP baseline:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os
from torch.distributed.elastic.multiprocessing.errors import record

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"
# model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507"

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
device = torch.device(f"cuda:{rank}")
# Need to be initialized explicitly to use the `barrier` before loading
torch.distributed.init_process_group(backend="nccl", rank=rank, world_size=world_size, device_id=device)

@record
def main():

    # tp_plan="auto" applies the model's predefined tensor-parallel plan across the ranks
    model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, tp_plan="auto")
    # model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")  # non-TP baseline
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    messages = [
        {"role": "user", "content": "What do you think about life?"},
    ]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
    input_size = inputs.input_ids.shape[-1]
    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    text = tokenizer.batch_decode(output[:, input_size:])[0]
    print(text)

main()

torch.distributed.destroy_process_group()
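To reproduce, save the snippet as a script and launch it with e.g. `torchrun --nproc_per_node=2 script.py` (the filename is arbitrary): `torchrun` sets the `RANK` and `WORLD_SIZE` environment variables the script reads, and the `@record` decorator surfaces worker tracebacks through the elastic error handler.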
  • ./run_dense_tests.sh results_dense
  • ./run_moe_tests.sh results_moe

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…ed GPU management

- Updated `run_dense_tests.sh` and `run_moe_tests.sh` to support parallel execution of tests using available GPU pairs.
- Changed variable names for clarity, replacing `NUM_GPUS` with `GPUS_PER_TEST`.
- Enhanced output messages to reflect the number of parallel test slots and GPU usage.
- Implemented logic to handle skipped tests and updated result reporting to include skipped counts.
- Removed `TensorParallelTesterMixin` from `CausalLMModelTest` and integrated it into `ModelTesterMixin` for better structure in test classes.
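
For illustration only, a minimal Python sketch (hypothetical, not the actual run_dense_tests.sh / run_moe_tests.sh logic) of the GPU-pair slotting idea those scripts implement:

import os
import subprocess

# Hypothetical sketch: carve the visible GPUs into 2-GPU slots
# (GPUS_PER_TEST = 2) and run one test per slot in parallel.
GPUS_PER_TEST = 2
gpus = list(range(int(os.environ.get("NUM_VISIBLE_GPUS", "8"))))  # assumed env var
slots = [gpus[i:i + GPUS_PER_TEST] for i in range(0, len(gpus), GPUS_PER_TEST)]
tests = ["tests/models/mixtral", "tests/models/qwen2_moe"]  # placeholder targets

procs = []
for slot, test in zip(slots, tests):  # zip stops at the shorter list
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": ",".join(map(str, slot))}
    procs.append(subprocess.Popen(["python", "-m", "pytest", test, "-k", "test_tp_"], env=env))
exit_codes = [p.wait() for p in procs]  # nonzero exit codes mark failures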
Member

@Cyrilvallez Cyrilvallez left a comment


Just a few very early thoughts!

@3outeille 3outeille changed the base branch from main to fix-ep February 4, 2026 13:38
@3outeille 3outeille changed the title from "EP all reduce" to "tests EP all reduce" Feb 4, 2026
ArthurZucker and others added 11 commits February 4, 2026 13:44
- Modified `run_dense_tests.sh` and `run_moe_tests.sh` to change the pytest keyword from "test_tensor_parallel" to "test_tp_" for improved test targeting.
- Cleaned up comments and removed unused code in `test_tensor_parallel_mixin.py` to streamline the testing process and enhance readability.
@3outeille 3outeille changed the title from "tests EP all reduce" to "tests EP all reduce (decoder only)" Feb 4, 2026
@github-actions
Contributor

💔 This comment contains run-slow, but unknown error occurred and the workflow run aborted!


op_name = _format_op_name(op)

tb_str = "".join(traceback.format_exception(type(e), e, e.__traceback__))
Collaborator


let's keep this one please!
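
For context, the quoted line builds the full traceback string of a caught exception; a minimal standalone example:

import traceback

try:
    raise ValueError("tensor parallel op failed")  # stand-in error
except ValueError as e:
    # Same three-argument form as the snippet above; on Python >= 3.10,
    # traceback.format_exception(e) is an equivalent shorthand.
    tb_str = "".join(traceback.format_exception(type(e), e, e.__traceback__))
    print(tb_str)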

Comment on lines 970 to 972
load_config: Any,
tp_plan: dict[str, str] | None,
dtype_plan: dict | None = None,
Collaborator


not sure we want to revert this

Comment on lines 1158 to 1162
shard_index = (
len(mapping.collected_tensors.get(source_pattern, []))
if isinstance(mapping, WeightConverter) and isinstance(mapping.operations[0], MergeModulelist)
else None
)
Collaborator


this is important for "EP" sharding no?
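
As a toy illustration (hypothetical shapes and names, not the library's code) of why the count of already-collected tensors serves as the shard index when merging a ModuleList of experts:

import torch

# Each checkpoint tensor experts.{i}.w1 must land in row i of the merged
# [num_experts, out, in] tensor, so the loader tracks how many tensors it
# has already collected per source pattern and uses that count as the index.
num_experts, out_f, in_f = 4, 8, 8
collected: dict[str, list[torch.Tensor]] = {}
for i in range(num_experts):
    pattern = "experts.*.w1"
    collected.setdefault(pattern, []).append(torch.full((out_f, in_f), float(i)))
    shard_index = len(collected[pattern]) - 1  # index of the tensor just added
merged = torch.stack(collected[pattern])  # shape [4, 8, 8]
assert merged[2, 0, 0].item() == 2.0  # expert 2 landed in slot 2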

Comment on lines 292 to 295
if is_torch_greater_or_equal("2.3.0"):
str_to_torch_dtype["U16"] = torch.uint16
str_to_torch_dtype["U32"] = torch.uint32
str_to_torch_dtype["U64"] = torch.uint64
Collaborator


we don't support 2.3, only >= 2.4
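
Following that, the version guard could be dropped outright; a sketch assuming torch >= 2.4 is the enforced minimum:

# torch.uint16/32/64 were added in torch 2.3, and the minimum supported
# version is 2.4, so the mapping can be unconditional.
str_to_torch_dtype["U16"] = torch.uint16
str_to_torch_dtype["U32"] = torch.uint32
str_to_torch_dtype["U64"] = torch.uint64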

Collaborator


there is a lot to revert here still (cleanup)

@3outeille
Member Author

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma2, gemma3, gemma3n, glm4_moe, glm4_moe_lite, glm_moe_dsa, gpt_oss

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/apertus", "models/deepseek_v2", "models/deepseek_v3", "models/dots1", "models/ernie4_5_moe", "models/exaone4", "models/exaone_moe", "models/flex_olmo", "models/gemma2", "models/gemma3", "models/gemma3n", "models/glm4_moe", "models/glm4_moe_lite", "models/glm_moe_dsa", "models/gpt_oss"]
quantizations: []

@3outeille
Member Author

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma2, gemma3, gemma3n, glm4_moe, glm4_moe_lite, glm_moe_dsa, gpt_oss

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      542e74c6  merge commit
PR       3cde5991  branch commit
main     c8f112d4  base commit

⚠️ No test being reported (jobs are skipped or cancelled)!

@github-actions
Contributor

💔 This comment contains run-slow, but unknown error occurred and the workflow run aborted!

@3outeille
Member Author

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma3

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/apertus", "models/deepseek_v2", "models/deepseek_v3", "models/dots1", "models/ernie4_5_moe", "models/exaone4", "models/exaone_moe", "models/flex_olmo", "models/gemma3"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      efb11cbd  merge commit
PR       550b1428  branch commit
main     609e3d58  base commit

✅ No failing test specific to this PR 🎉 👏 !

@huggingface huggingface deleted a comment from github-actions bot Feb 13, 2026
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma3
