Issue Description
Hello SimAI Team,
I am currently using the SimAI toolkit to simulate the training performance of a GPT model. My process involves two steps:
Generating a workload file using workload_generator.SimAI_training_workload_generator.
Running the analytical simulator SimAI_analytical with the generated workload file as input.
The workload generation step appears to work correctly. However, when I run SimAI_analytical, it prints log output to the console and exits, but the final EndToEnd result file is created empty (0 bytes).
I would like to understand if this is expected behavior under certain conditions, or if I might be misusing the tool. Could you please help clarify why the output file might be empty?
Steps to Reproduce
Generate the workload file with the following command:
python3 -m workload_generator.SimAI_training_workload_generator \
--frame=Megatron \
--world_size=16 \
--tensor_model_parallel_size=2 \
--pipeline_model_parallel=1 \
--global_batch=1024 \
--micro_batch=2 \
--epoch_num=1 \
--model_name=gpt_175B \
--hidden_size=12288 \
--num_layers=96 \
--seq_length=4096 \
--num_attention_heads=96 \
--vocab_size=50257 \
--max_position_embeddings=4096 \
--ffn_hidden_size=11008 \
--dtype=bfloat16 \
--swiglu \
--make_vocab_size_divisible_by=128 \
--workload_only \
--output_filename=gpt_175B-t1
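For context, the parallel layout implied by these flags can be worked out with a few lines of arithmetic. This is only a sketch under the assumption that the generator follows the usual Megatron-style relationships (data-parallel size is the world size divided by the product of the tensor and pipeline sizes, and each step is split into micro-batches across the data-parallel ranks). The variable names mirror the flags above and are not taken from the SimAI code.

```python
# Values copied from the workload generator flags above.
world_size = 16
tensor_model_parallel_size = 2
pipeline_model_parallel = 1
global_batch = 1024
micro_batch = 2

# Assumption: Megatron-style layout, where the data-parallel size is whatever
# remains after tensor and pipeline parallelism are accounted for.
model_parallel_size = tensor_model_parallel_size * pipeline_model_parallel
assert world_size % model_parallel_size == 0
data_parallel_size = world_size // model_parallel_size

# Assumption: the global batch is divided evenly into micro-batches
# across the data-parallel ranks.
samples_per_micro_step = micro_batch * data_parallel_size
assert global_batch % samples_per_micro_step == 0
num_micro_batches = global_batch // samples_per_micro_step

print(f"data_parallel_size = {data_parallel_size}")
print(f"num_micro_batches  = {num_micro_batches}")
```

Under those assumptions the configuration corresponds to a data-parallel size of 8 with 64 micro-batches per step.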
Run the analytical simulator with the generated workload:
./SimAI_analytical \
-w /root/simai/SimAI/aicb/results/workload/gpt_175B-t1.txt \
-g 16 \
-g_p_s 8 \
-n_p_s 8 \
-r gpt_175B-t1 \
-g_type H100 \
-nic 35.0 \
-dp_o 0.5 \
-tp_o 0.7 \
-ep_o 0.8 \
-pp_o 0.5
The SimAI_analytical command runs and prints the following output to the console (earlier lines are truncated):
id: ... (many similar lines omitted)
id: optimizer3 , depen: -1 , wg_comp_time: 0
id: optimizer4 , depen: -1 , wg_comp_time: 0
type: HYBRID_TRANSFORMER_FWD_IN_BCKWD ,num passes: 1 ,lines: 12427 compute scale: 1 ,comm scale: 1
stat path: ./results/mo_bo_gpt_175B_costrate0.1/gpt_175B-t1 ,total rows: 1 ,stat row: 0
CSV path and filename: ./results/mo_bo_gpt_175B_costrate0.1/gpt_175B-t1EndToEnd.csv
SimAI begin run Analytical
pass: 0 finished at time: 74540
workload stats for the job scheduled at NPU offset: 0
{"retcode":0, "info":"Success!", "node_count":1, "nic_type":"cx7", "gpus_pernode":2, "nics_pernode":8.0, "coll_type":"allgather", "cross_nic":0, "nccl_algo":"Ring", "theoretical_bus_bw_GBps":370.800}
warning! a callable is removed before call
SimAI-Analytical finished.
The corresponding endtoend result file is created in the results directory, but it is empty (0 bytes).
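A minimal sketch of how the empty result can be confirmed after a run. The path is the one printed on the "CSV path and filename" line of the log above and is assumed to be relative to the directory where SimAI_analytical was started. The helper name result_file_status is made up for this example and is not part of SimAI.

```python
from pathlib import Path


def result_file_status(path):
    """Return 'missing', 'empty', or the file size as '<n> bytes'."""
    p = Path(path)
    if not p.exists():
        return "missing"
    size = p.stat().st_size
    return "empty" if size == 0 else f"{size} bytes"


if __name__ == "__main__":
    # Path copied from the simulator log; adjust it if the run used another directory.
    csv_path = "./results/mo_bo_gpt_175B_costrate0.1/gpt_175B-t1EndToEnd.csv"
    print(csv_path, "->", result_file_status(csv_path))
```

In the run described here, this kind of check reports the file as empty rather than missing, which matches the 0-byte file in the results directory.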
Additional Information
Content of the generated workload file (gpt_175B-t1.txt):
Environment:
SimAI Version: latest master branch as of Sep 25 13:46:45 2025 +0800 (commit e5d125144ea864419d92fc1f15f36e378ee0e2a7)
Operating System: Ubuntu 22.04.4 LTS
Python Version: Python 3.11.12
Thank you for your time and assistance!