Simulation results contain no DP communication #199

@Georgecheng123

I generated a workload with aicb using the following command:

python -m workload_generator.SimAI_training_workload_generator --frame Megatron --gpu_type A100 --world_size=128 --tensor_model_parallel_size=4 --pipeline_model_parallel=4 --global_batch=512 --micro_batch=2 --epoch_num=10 --model_name=GPT-7.5B --hidden_size=4096 --num_layers=36 --seq_length=4096 --num_attention_heads=32 --vocab_size=50304 --max_position_embeddings=4096 --ffn_hidden_size=16384 --activation_func=gelu --dtype=bfloat16 --use_flash_attn --aiob_enable

This produced the file A100-GPT-7.5B-world_size128-tp4-pp4-ep1-gbs512-mbs2-seq4096-MOE-False-GEMM-False-flash_attn-True.txt. Running the SimAI training simulation directly on this file (TP=4, PP=4, DP=8, 128 GPUs) exits without reporting any error and prints the following log:
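As a sanity check on the layout quoted above, here is a minimal sketch of the arithmetic, assuming the usual Megatron-style convention that the data-parallel size is world_size / (TP × PP) and that each step runs global_batch / (micro_batch × DP) micro-batches. The function name `parallel_layout` is hypothetical and is not part of aicb or SimAI.

```python
def parallel_layout(world_size, tp, pp, global_batch, micro_batch):
    """Return (dp, num_microbatches) for a TP x PP x DP layout.

    Assumes the Megatron-style convention: DP = world_size / (TP * PP).
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    dp = world_size // model_parallel

    # Samples consumed by one micro-batch across all data-parallel replicas.
    samples_per_microstep = micro_batch * dp
    if global_batch % samples_per_microstep != 0:
        raise ValueError("global_batch must be divisible by micro_batch * dp")
    num_microbatches = global_batch // samples_per_microstep
    return dp, num_microbatches


# Values taken from the workload-generation command above.
dp, num_microbatches = parallel_layout(
    world_size=128, tp=4, pp=4, global_batch=512, micro_batch=2
)
print(f"DP={dp}, micro-batches per step={num_microbatches}")
```

With the values from the command this gives DP=8, which matches the configuration stated above, so DP traffic would be expected in the simulation output.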

Image

After deleting the first 7 lines of the workload file (see issue: https://github.com//issues/107):

Image
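The line-removal step described above can be sketched generically as follows. This is only an illustration of dropping the first N lines of a text file. The helper name `drop_leading_lines` and the idea of writing to a separate output file (so the generated workload is left untouched) are assumptions, not something prescribed by SimAI or the referenced issue.

```python
from pathlib import Path


def drop_leading_lines(src, dst, n=7):
    """Copy src to dst without its first n lines; return how many were dropped."""
    lines = Path(src).read_text().splitlines(keepends=True)
    kept = lines[n:]
    Path(dst).write_text("".join(kept))
    return len(lines) - len(kept)
```

For example, `drop_leading_lines("workload.txt", "workload_trimmed.txt")` would write a trimmed copy, where both file names are placeholders.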

I then ran the following command:

AS_NVLS_ENABLE=1 AS_SEND_LAT=3 AS_LOG_LEVEL=DEBUG ./bin/SimAI_simulator -t 16 -w A100-GPT-7.5B-world_size128-tp4-pp4-ep1-gbs512-mbs2-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n Spectrum-X_128g_8gps_200Gbps_A100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf

This generates the ncclFlowModel_EndToEnd.csv file, but the file contains only TP communication and no DP communication.
Image
A100-GPT-7.5B-world_size128-tp4-pp2-ep1-gbs256-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt

ncclFlowModel_EndToEnd.csv

Could you help me figure out what went wrong?
