Incorrect EndToEnd total_time in SimAI_analytical results #202

@xmoon2022

1. Environment Info

- OS: Ubuntu 22.04.4 LTS
- SimAI Commit: e5d1251
- Python Version: Python 3.11.12

2. Steps to Reproduce

Step 1: Generate Workload

  • Working Directory: /root/simai/SimAI/aicb/
  • Command Executed:
    uv run -m workload_generator.SimAI_training_workload_generator \
      --frame=Megatron \
      --world_size=1472 \
      --tensor_model_parallel_size=8 \
      --pipeline_model_parallel=8 \
      --global_batch=32 \
      --micro_batch=1 \
      --epoch_num=1 \
      --model_name=gpt_13B \
      --hidden_size=5120 \
      --num_layers=40 \
      --seq_length=4096 \
      --num_attention_heads=40 \
      --vocab_size=50257 \
      --max_position_embeddings=4096 \
      --ffn_hidden_size=11008 \
      --dtype=bfloat16 \
      --enable_sequence_parallel \
      --swiglu \
      --make_vocab_size_divisible_by=128 \
      --workload_only \
      --output_filename=two_phase_opt_phase1/ws1472-tp8-pp8 \
      --aiob_enable \
      --comp_filepath=/root/simai/SimAI/aicb/workload/aiob_inputs/A100_A800.txt
  • Generated Workload: ws1472-tp8-pp8.txt
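
As a sanity check on the arguments above, here is a minimal sketch of the parallel layout they imply. It assumes the usual Megatron convention that data-parallel size is world_size / (TP × PP). The helper name `derive_parallel_layout` is hypothetical and not part of AICB or SimAI. Whether the workload generator or SimAI_analytical requires the global batch to divide evenly, and whether this has anything to do with the reported value, is an assumption I have not confirmed.

```python
def derive_parallel_layout(world_size, tp, pp, global_batch, micro_batch):
    """Derive the data-parallel size and check batch divisibility.

    Hypothetical helper for illustration only; assumes dp = world_size / (tp * pp).
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size is not divisible by tp * pp")
    dp = world_size // model_parallel
    # Samples consumed by one micro-batch step across all data-parallel ranks.
    samples_per_micro_step = micro_batch * dp
    return {
        "dp": dp,
        "samples_per_micro_step": samples_per_micro_step,
        "batch_divisible": global_batch % samples_per_micro_step == 0,
    }


# Values taken from the Step 1 command.
layout = derive_parallel_layout(
    world_size=1472, tp=8, pp=8, global_batch=32, micro_batch=1
)
print(layout)
```

With these arguments the derived data-parallel size is 23, and a global batch of 32 is not a multiple of micro_batch × 23. This may or may not be related to the problem, but it could be worth ruling out by rerunning with a global batch that is a multiple of 23.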

Step 2: Run Analytical Simulation

  • Working Directory: /root/simai/SimAI/
  • Command Executed:
    /root/simai/SimAI/bin/SimAI_analytical \
      -w /root/simai/SimAI/aicb/results/workload/two_phase_opt_phase1/ws1472-tp8-pp8.txt \
      -g 1472 \
      -g_p_s 16 \
      -n_p_s 5 \
      -r two_phase_opt_phase1/ws1472-tp8-pp8 \
      -g_type A800 \
      -nic 37.47811737589823 \
      -dp_o 0.5 \
      -tp_o 0.7 \
      -ep_o 0.8 \
      -pp_o 0.5
  • Result File: ws1472-tp8-pp8EndToEnd.csv

3. The Issue

In the generated EndToEnd.csv file, the Expose DP comm time is exceptionally large. This occurs consistently across many experiments.
