1. Environment Info
- OS: Ubuntu 22.04.4 LTS
- SimAI Commit: e5d1251
- Python Version: 3.11.12
2. Steps to Reproduce
Step 1: Generate Workload
- Working Directory: /root/simai/SimAI/aicb/
- Command Executed:
  uv run -m workload_generator.SimAI_training_workload_generator \
    --frame=Megatron \
    --world_size=1472 \
    --tensor_model_parallel_size=8 \
    --pipeline_model_parallel=8 \
    --global_batch=32 \
    --micro_batch=1 \
    --epoch_num=1 \
    --model_name=gpt_13B \
    --hidden_size=5120 \
    --num_layers=40 \
    --seq_length=4096 \
    --num_attention_heads=40 \
    --vocab_size=50257 \
    --max_position_embeddings=4096 \
    --ffn_hidden_size=11008 \
    --dtype=bfloat16 \
    --enable_sequence_parallel \
    --swiglu \
    --make_vocab_size_divisible_by=128 \
    --workload_only \
    --output_filename=two_phase_opt_phase1/ws1472-tp8-pp8 \
    --aiob_enable \
    --comp_filepath=/root/simai/SimAI/aicb/workload/aiob_inputs/A100_A800.txt
- Generated Workload: ws1472-tp8-pp8.txt
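For reference, here is a rough sanity check of the parallelism flags used in Step 1. It assumes the common Megatron-style convention that the data-parallel size is world_size / (TP × PP) and that the global batch should be a multiple of micro_batch × DP. Whether AICB derives these values the same way is an assumption, and `derive_parallel_layout` is a hypothetical helper, not part of SimAI or AICB.

```python
def derive_parallel_layout(world_size, tp, pp, global_batch, micro_batch):
    """Derive the data-parallel size and check batch divisibility.

    Assumes the Megatron-style convention DP = world_size / (TP * PP);
    this is an illustration, not SimAI/AICB code.
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    dp = world_size // model_parallel
    # Samples consumed by all DP replicas in one micro-batch step.
    samples_per_step = dp * micro_batch
    remainder = global_batch % samples_per_step
    return {
        "dp": dp,
        "samples_per_step": samples_per_step,
        "batch_divisible": remainder == 0,
        "remainder": remainder,
    }


# Values taken from the workload generator command in Step 1.
layout = derive_parallel_layout(
    world_size=1472, tp=8, pp=8, global_batch=32, micro_batch=1
)
print(layout)
```

Under these assumptions the flags give a data-parallel size of 23, and a global batch of 32 is not a multiple of 23 × 1. This may be unrelated to the reported DP communication time, but it seems worth ruling out.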
Step 2: Run Analytical Simulation
- Working Directory: /root/simai/SimAI/
- Command Executed:
  /root/simai/SimAI/bin/SimAI_analytical \
    -w /root/simai/SimAI/aicb/results/workload/two_phase_opt_phase1/ws1472-tp8-pp8.txt \
    -g 1472 \
    -g_p_s 16 \
    -n_p_s 5 \
    -r two_phase_opt_phase1/ws1472-tp8-pp8 \
    -g_type A800 \
    -nic 37.47811737589823 \
    -dp_o 0.5 \
    -tp_o 0.7 \
    -ep_o 0.8 \
    -pp_o 0.5
- Result File: ws1472-tp8-pp8EndToEnd.csv
3. The Issue
In the generated EndToEnd.csv file, the Expose DP comm time is exceptionally large. This occurs consistently across many experiments.