Simulation results contain no DP communication #199

I generated a workload with aicb using the command: `python -m workload_generator.SimAI_training_workload_generator --frame Megatron --gpu_type A100 --world_size=128 --tensor_model_parallel_size=4 --pipeline_model_parallel=4 --global_batch=512 --micro_batch=2 --epoch_num=10 --model_name=GPT-7.5B --hidden_size=4096 --num_layers=36 --seq_length=4096 --num_attention_heads=32 --vocab_size=50304 --max_position_embeddings=4096 --ffn_hidden_size=16384 --activation_func=gelu --dtype=bfloat16 --use_flash_attn --aiob_enable`, which produced the file A100-GPT-7.5B-world_size128-tp4-pp4-ep1-gbs512-mbs2-seq4096-MOE-False-GEMM-False-flash_attn-True.txt. Running the SimAI model training simulation directly on this file (TP=4, PP=4, DP=8, 128 GPUs) exits without reporting any error. The printed log is as follows:

<img width="992" height="768" alt="Image" src="https://github.com/user-attachments/assets/2c00fe4a-e60e-48d0-a4cd-7efc91c1969f" />
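
The parallel layout stated above (TP=4, PP=4, DP=8 on 128 GPUs) can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming a Megatron-style layout in which the world size is the product of the TP, PP, and DP sizes, with no expert or context parallelism. The helper names `data_parallel_size` and `num_microbatches` are hypothetical and are not part of aicb or SimAI.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Return the DP size implied by world_size = tp * pp * dp (assumed layout)."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // model_parallel


def num_microbatches(global_batch: int, micro_batch: int, dp: int) -> int:
    """Return the number of micro-batches each DP replica runs per iteration."""
    per_step = micro_batch * dp
    if global_batch % per_step != 0:
        raise ValueError("global_batch must be divisible by micro_batch * dp")
    return global_batch // per_step


# Values taken from the generator command in this issue.
dp = data_parallel_size(world_size=128, tp=4, pp=4)
print(dp)                                                    # 8
print(num_microbatches(global_batch=512, micro_batch=2, dp=dp))  # 32
```

With DP=8, each iteration would normally end with gradient synchronization across the 8 data-parallel replicas, which is the traffic missing from the result file.
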
After deleting the first 7 lines of the workload file (see issue: https://github.com/aliyun/SimAI/issues/107):

<img width="1308" height="930" alt="Image" src="https://github.com/user-attachments/assets/3ac77658-efe5-4c22-8cd9-ac1349c72690" />
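
For reproducibility, the line removal described above can be scripted instead of done by hand. This is a minimal sketch, assuming only that the workload is a plain UTF-8 text file. The function name `drop_leading_lines` and the output file name in the usage comment are hypothetical, and the count of 7 simply mirrors what was done in this issue.

```python
from pathlib import Path


def drop_leading_lines(src: str, dst: str, n: int = 7) -> int:
    """Copy src to dst without its first n lines; return the number of lines kept."""
    lines = Path(src).read_text(encoding="utf-8").splitlines(keepends=True)
    kept = lines[n:]  # yields an empty list if the file has n lines or fewer
    Path(dst).write_text("".join(kept), encoding="utf-8")
    return len(kept)


# Hypothetical usage; the output name is an assumption:
# drop_leading_lines(
#     "A100-GPT-7.5B-world_size128-tp4-pp4-ep1-gbs512-mbs2-seq4096-MOE-False-GEMM-False-flash_attn-True.txt",
#     "workload_trimmed.txt",
# )
```

Writing to a separate file keeps the generated workload intact, so the trimmed and original versions can be compared later.
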



I then ran the command: `AS_NVLS_ENABLE=1 AS_SEND_LAT=3 AS_LOG_LEVEL=DEBUG ./bin/SimAI_simulator -t 16 -w A100-GPT-7.5B-world_size128-tp4-pp4-ep1-gbs512-mbs2-seq4096-MOE-False-GEMM-False-flash_attn-True.txt -n Spectrum-X_128g_8gps_200Gbps_A100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf`. This generates the file ncclFlowModel_EndToEnd.csv, but the file contains only TP communication and no DP communication.
<img width="1794" height="789" alt="Image" src="https://github.com/user-attachments/assets/e9eb1e30-089b-43c7-9174-b5c0b29ed0ec" />
[A100-GPT-7.5B-world_size128-tp4-pp2-ep1-gbs256-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt](https://github.com/user-attachments/files/23537615/A100-GPT-7.5B-world_size128-tp4-pp2-ep1-gbs256-mbs1-seq4096-MOE-False-GEMM-False-flash_attn-True.txt)



[ncclFlowModel_EndToEnd.csv](https://github.com/user-attachments/files/23537621/ncclFlowModel_EndToEnd.csv)




Could you help me figure out what is going wrong?
