Skip to content

Conversation

@WWeiOne
Copy link
Contributor

@WWeiOne WWeiOne commented May 19, 2025

Problem Description

recvHash and all_sent_chunksize got integer overflow when the data size in collective communication is large, and there are implicit type conversions, leading to an early stop without an end-to-end result.

map<std::pair<int, std::pair<int, int>>, int> recvHash;
uint64_t count = recvHash[make_pair(tag, make_pair(t.src, t.dest))]

int all_sent_chunksize;
notify_sender_sending_finished(sid, did, all_sent_chunksize, flowTag);
void notify_sender_sending_finished(int sender_node, int receiver_node, **uint64_t** message_size, AstraSim::ncclFlowTag flowTag)

Observation

file: ncclFlowModel_EndToEnd.csv empty
file: SimAI.log, strange count, size and message size. and early stop.
Pasted image 20250519023943

Minimal Reproduction

Workload

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 vpp: 36 ga: 1 all_gpus: 4 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
2
grad_gather	-1	1	NONE	0	1	NONE	0	1	ALLGATHER	6467616768	100
grad_param_comm	-1	1	NONE	0	1	NONE	0	1	REDUCESCATTER	12935233536	100

Topology

7 2 2 1 8 H100
4 5 6 
0 4 2880Gbps 0.000025ms 0
0 6 100Gbps 0.0005ms 0
1 4 2880Gbps 0.000025ms 0
1 6 100Gbps 0.0005ms 0
2 5 2880Gbps 0.000025ms 0
2 6 100Gbps 0.0005ms 0
3 5 2880Gbps 0.000025ms 0
3 6 100Gbps 0.0005ms 0

Change Made:

  1. Update type of recvHash and all_sent_chunksize.
  2. Fix the relevant output log format specifiers and update to consistent language.

@CLAassistant
Copy link

CLAassistant commented May 19, 2025

CLA assistant check
All committers have signed the CLA.

@zyksir
Copy link
Collaborator

zyksir commented May 30, 2025

This commit looks good to me @Huoyuan100861

@gabrielecastellano
Copy link
Contributor

This indeed solves a problem that was hard to detect. Maybe also change total_bytes (in qp_finish) and all_sent_chunksize (in send_finish) to uint_64?

@WWeiOne
Copy link
Contributor Author

WWeiOne commented Jun 4, 2025

This indeed solves a problem that was hard to detect. Maybe also change total_bytes (in qp_finish) and all_sent_chunksize (in send_finish) to uint_64?

Good point! all_sent_chunksize has already been updated. I’ve now changed total_bytes to uint64, making it consistent with the type of q->m_size.

@WWeiOne
Copy link
Contributor Author

WWeiOne commented Jun 5, 2025

@zyksir @Huoyuan100861 All updates are in, ready for review

@gabrielecastellano
Copy link
Contributor

This very simple bugfix is from seven months ago, why was not merged yet?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants