Skip to content

Comments

[FR attribution] FR logic update to remove any use of PG description but window-based ordering #216

Open
sbak5 wants to merge 4 commits intoNVIDIA:mainfrom
sbak5:sbak/fr_logic_update_pr
Open

[FR attribution] FR logic update to remove any use of PG description but window-based ordering #216
sbak5 wants to merge 4 commits intoNVIDIA:mainfrom
sbak5:sbak/fr_logic_update_pr

Conversation

@sbak5
Copy link
Contributor

@sbak5 sbak5 commented Nov 4, 2025

This PR removes any heuristic ordering unmatched PGs but create windows of PGs and order them to figure out the wavefront of chained hanging PGs.

  1. Pick the most common PG in the current active entry per rank
  2. Group them to a pg. and register this in the active pg list
  3. If the ranks in the current window overlaps with the correspoding PG registered in the active pg -> Then it creates a new pg
  4. if the new ranks belong to a previously registered active pg and they're matched later with the new ranks. the pg is deleted from the active pg because it's completed
  5. Repeat until all per-rank entry index == len(per_rank_entries)

Use these windows as the global order of process groups -> return the first PG in the graph analysis as we do previously

this approach respects the local scheduling order of process groups as well as the partially different ordering of PGs across ranks for p2p.

@sbak5 sbak5 force-pushed the sbak/fr_logic_update_pr branch 3 times, most recently from 446841a to aa2f927 Compare November 18, 2025 21:51
@sbak5 sbak5 added the ci-approved Approved to run CI label Dec 11, 2025
@sbak5 sbak5 force-pushed the sbak/fr_logic_update_pr branch from aa2f927 to 2b74226 Compare January 20, 2026 22:46
@sbak5 sbak5 force-pushed the sbak/fr_logic_update_pr branch from 2b74226 to f3447bf Compare January 20, 2026 22:48
@sbak5 sbak5 force-pushed the sbak/fr_logic_update_pr branch from 507e37a to 61791e1 Compare January 20, 2026 23:19
@sbak5
Copy link
Contributor Author

sbak5 commented Feb 9, 2026

@greptile can you review this PR and leave a summary here for reviewers to review this PR more quickly?

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

This PR replaces the heuristic-based process group (PG) ordering with a window-based approach for FR attribution analysis. The changes:

  • Removed: DEFAULT_PG_ORDER mapping and set_type_to_order() method that relied on PG descriptions from Megatron Core
  • Removed: --scheduling-order-file CLI argument and related logic
  • Added: New group_collectives_by_windows() method that creates temporal windows of PGs based on rank activity patterns
  • Added: record_id field to Collective dataclass
  • Enhanced: Dynamic FR debug logging via FR_DEBUG environment variable
  • Refactored: analyze_matches() extensively to support window-based grouping with cross-window rank matching

The new windowing algorithm:

  1. Tracks the most common PG among active ranks (the "wavefront")
  2. Groups collectives into windows based on rank participation patterns
  3. Creates new windows when ranks return to a PG or when significantly different rank sets arrive
  4. Performs cross-window matching to avoid false positives for missing ranks

This approach better handles partially different PG orderings across ranks (common in P2P operations) while respecting local scheduling order.

Confidence Score: 4/5

  • This PR is safe to merge with minor style issues
  • The refactoring is substantial but well-structured. The logic is sound and the new windowing approach is well-documented. Only minor style issues (inconsistent type hint, commented code) were found, no functional bugs detected.
  • No files require special attention

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/attribution/trace_analyzer/fr_attribution.py Major refactoring to replace heuristic PG ordering with window-based approach; adds new windowing algorithm, removes deprecated ordering logic, has one syntax issue with type hint

Sequence Diagram

sequenceDiagram
    participant Main as preprocess_FR_dumps
    participant GCW as group_collectives_by_windows
    participant AM as analyze_matches
    participant GP as group_pgs
    
    Main->>Main: Process FR dump files
    Main->>Main: Extract collectives_by_file
    Main->>GCW: Call group_collectives_by_windows()
    
    loop Until all ranks processed
        GCW->>GCW: Find most common PG (wavefront)
        GCW->>GCW: Check if new window needed
        alt New window needed
            GCW->>GCW: Increment pg_window_counter
        end
        GCW->>GCW: Collect consecutive collectives for PG
        GCW->>GCW: Track rank participation
        GCW->>GCW: Update active PG set
    end
    
    GCW-->>Main: Return collective_groups with windows
    Main->>Main: Build collectives_to_order mapping
    Main->>AM: Call analyze_matches()
    
    AM->>AM: Extract group types from windows
    loop For each collective group
        AM->>AM: Match collectives per PG
        AM->>AM: Identify missing/completed ranks
    end
    
    AM->>AM: Cross-window matching
    AM-->>Main: Return completed_pg, missing_pg
    
    Main->>GP: Call group_pgs(missing_pg)
    GP->>GP: Build rank overlap graph
    GP->>GP: Find longest paths using DFS
    Note over GP: Uses collectives_to_order<br/>for path ordering
    GP-->>Main: Return grouped PGs
    
    Main->>Main: Return first PG as root cause
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

return None

def find_type_val(type_name: str) -> int:
def find_type_val(key: tuple[str, str, int]) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inconsistent type hint syntax - use capital T Tuple from typing module (already imported on line 17) instead of lowercase tuple builtin for consistency with rest of file

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

else:
all_paths.append(current_path.copy())

# visited_in_path.remove(node)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented-out code should be removed

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved Approved to run CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant