[FR attribution] FR logic update to remove any use of PG description but window-based ordering #216
sbak5 wants to merge 4 commits into NVIDIA:main
@greptile can you review this PR and leave a summary here for reviewers to review this PR more quickly?
Greptile Overview
Greptile Summary
This PR replaces the heuristic-based process group (PG) ordering with a window-based approach for FR attribution analysis. The changes:
The new windowing algorithm:
This approach better handles partially different PG orderings across ranks (common in P2P operations) while respecting local scheduling order.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Main as preprocess_FR_dumps
participant GCW as group_collectives_by_windows
participant AM as analyze_matches
participant GP as group_pgs
Main->>Main: Process FR dump files
Main->>Main: Extract collectives_by_file
Main->>GCW: Call group_collectives_by_windows()
loop Until all ranks processed
GCW->>GCW: Find most common PG (wavefront)
GCW->>GCW: Check if new window needed
alt New window needed
GCW->>GCW: Increment pg_window_counter
end
GCW->>GCW: Collect consecutive collectives for PG
GCW->>GCW: Track rank participation
GCW->>GCW: Update active PG set
end
GCW-->>Main: Return collective_groups with windows
Main->>Main: Build collectives_to_order mapping
Main->>AM: Call analyze_matches()
AM->>AM: Extract group types from windows
loop For each collective group
AM->>AM: Match collectives per PG
AM->>AM: Identify missing/completed ranks
end
AM->>AM: Cross-window matching
AM-->>Main: Return completed_pg, missing_pg
Main->>GP: Call group_pgs(missing_pg)
GP->>GP: Build rank overlap graph
GP->>GP: Find longest paths using DFS
Note over GP: Uses collectives_to_order<br/>for path ordering
GP-->>Main: Return grouped PGs
Main->>Main: Return first PG as root cause
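The group_pgs steps in the diagram (build a rank overlap graph, then find the longest paths with DFS) can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's code: the names `build_overlap_graph` and `longest_paths`, the input shape (PG name mapped to a set of ranks), and the rule that an edge points from the earlier-ordered PG to the later one are all hypothetical.

```python
from typing import Dict, List, Set


def build_overlap_graph(
    pg_ranks: Dict[str, Set[int]], order: Dict[str, int]
) -> Dict[str, List[str]]:
    """Add an edge a -> b when a and b share a rank and a is ordered before b.

    Assumption: `order` plays the role of collectives_to_order in the diagram.
    """
    graph: Dict[str, List[str]] = {pg: [] for pg in pg_ranks}
    for a in pg_ranks:
        for b in pg_ranks:
            if a != b and order[a] < order[b] and pg_ranks[a] & pg_ranks[b]:
                graph[a].append(b)
    return graph


def longest_paths(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate maximal paths by DFS and keep only the longest ones."""
    all_paths: List[List[str]] = []

    def dfs(node: str, current_path: List[str]) -> None:
        current_path.append(node)
        if graph[node]:
            for nxt in graph[node]:
                dfs(nxt, current_path)
        else:
            # Leaf reached: record a copy, since current_path keeps mutating.
            all_paths.append(current_path.copy())
        current_path.pop()  # backtrack

    # Start only from PGs that no other PG points to.
    has_incoming = {n for targets in graph.values() for n in targets}
    for start in graph:
        if start not in has_incoming:
            dfs(start, [])
    best = max((len(p) for p in all_paths), default=0)
    return [p for p in all_paths if len(p) == best]
```

Because edges only run from a lower order to a strictly higher one, the graph in this sketch is acyclic, so the DFS needs no separate visited set. Under these assumptions, the first PG of the first longest path would be the candidate reported as root cause.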
     return None

-def find_type_val(type_name: str) -> int:
+def find_type_val(key: tuple[str, str, int]) -> int:
Inconsistent type hint syntax: use the capitalized Tuple from the typing module (already imported on line 17) instead of the lowercase builtin tuple, for consistency with the rest of the file.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
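A sketch of what the suggested signature could look like. Only the signature reflects the suggestion; the body, the `TYPE_ORDER` table, and the assumption that the first tuple element is the type name are placeholders for illustration, not the PR's implementation.

```python
from typing import Dict, Tuple

# Placeholder lookup table (hypothetical); the real mapping is not shown in this review.
TYPE_ORDER: Dict[str, int] = {"all_reduce": 0, "send": 1}


def find_type_val(key: Tuple[str, str, int]) -> int:
    """Return an ordering value for the type name, assumed to be key[0]."""
    type_name = key[0]
    # Unknown types sort after all known ones in this sketch.
    return TYPE_ORDER.get(type_name, len(TYPE_ORDER))
```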
     else:
         all_paths.append(current_path.copy())

     # visited_in_path.remove(node)
Commented-out code should be removed.
This PR removes the heuristic ordering of unmatched PGs. Instead, it creates windows of PGs and orders them to identify the wavefront of chained hanging PGs.
These windows serve as the global order of process groups, and the first PG in the graph analysis is returned, as before.
This approach respects the local scheduling order of process groups as well as the partially different ordering of PGs across ranks for P2P.
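To make the windowing idea concrete, here is a simplified sketch. It is not the code in this PR: the function name `group_by_windows`, the input shape (rank mapped to its locally ordered list of PG names), the tie-breaking by name, and the rule that a new window starts when the chosen PG already appeared in the current window are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def group_by_windows(
    per_rank: Dict[int, List[str]]
) -> List[Tuple[int, str, List[int]]]:
    """Return (window_id, pg, ranks) entries in wavefront order."""
    cursors = {rank: 0 for rank in per_rank}
    groups: List[Tuple[int, str, List[int]]] = []
    window = 0
    active_pgs = set()  # PGs already placed in the current window

    while True:
        # Head PG of every rank that still has pending collectives.
        heads = {
            r: seq[cursors[r]]
            for r, seq in per_rank.items()
            if cursors[r] < len(seq)
        }
        if not heads:
            break
        # Wavefront: the most common head PG; ties broken by name (assumption).
        counts = Counter(heads.values())
        pg = min(counts, key=lambda p: (-counts[p], p))
        # Assumed rule: seeing the same PG again opens a new window.
        if pg in active_pgs:
            window += 1
            active_pgs = set()
        ranks = sorted(r for r, h in heads.items() if h == pg)
        # Consume consecutive collectives of this PG on each participating rank.
        for r in ranks:
            seq = per_rank[r]
            while cursors[r] < len(seq) and seq[cursors[r]] == pg:
                cursors[r] += 1
        active_pgs.add(pg)
        groups.append((window, pg, ranks))
    return groups
```

Each rank is only ever advanced from the head of its own list, so the local scheduling order is preserved, while ranks whose head is a different PG simply wait for a later step. That is how this sketch tolerates partially different orderings across ranks.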