@songhappy
Fixes https://jira.devtools.intel.com/browse/PYTORCHDGQ-6106.
oneCCL does not allow AVG and throws an error during its checks, so this PR replaces AVG with SUM followed by division by world_size. Nothing else is changed.

Copilot AI review requested due to automatic review settings January 6, 2026 06:56
Contributor

Copilot AI left a comment

Pull request overview

This PR replaces the compile-time conditional AVG reduction operation with a runtime emulation approach. Previously, the code used #if !defined(XCCL_HAS_AVG) preprocessor directives to conditionally emulate AVG by performing SUM followed by division. The new implementation unconditionally emulates AVG for all oneCCL operations since oneCCL doesn't support AVG on all scheduler paths.

Key changes:

  • Added helper functions shouldEmulateAvg() and toXcclReduceOp() to centralize AVG emulation logic
  • Removed all #if !defined(XCCL_HAS_AVG) preprocessor conditionals
  • Applied consistent AVG emulation across all reduction operations
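
The emulation described above can be sketched as follows. This is a minimal, standalone illustration and not the PR's code: the `ReduceOp` enum, `toBackendOp()`, and `allReduce()` here are hypothetical stand-ins (the real helpers take `at::Tensor` inputs and call into oneCCL), and the per-rank buffers are simulated with plain vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the framework's ReduceOp; only two values are modeled.
enum class ReduceOp { SUM, AVG };

// Mirrors the helper named in the review: AVG is the only op that gets emulated.
inline bool shouldEmulateAvg(ReduceOp op) { return op == ReduceOp::AVG; }

// Simplified mapping helper (assumption: no tensor argument, no dtype check).
inline ReduceOp toBackendOp(ReduceOp op) {
  return shouldEmulateAvg(op) ? ReduceOp::SUM : op;
}

// Toy all-reduce: ranks[r] is the buffer held by rank r.
std::vector<double> allReduce(const std::vector<std::vector<double>>& ranks,
                              ReduceOp op) {
  if (ranks.empty()) throw std::invalid_argument("need at least one rank");
  const ReduceOp backendOp = toBackendOp(op);
  assert(backendOp == ReduceOp::SUM);  // the toy backend only ever sees SUM
  (void)backendOp;                     // silence unused warning under NDEBUG

  // Step 1: element-wise SUM across all ranks.
  std::vector<double> out(ranks[0].size(), 0.0);
  for (const auto& buf : ranks) {
    if (buf.size() != out.size()) throw std::invalid_argument("buffer size mismatch");
    for (std::size_t i = 0; i < out.size(); ++i) out[i] += buf[i];
  }

  // Step 2: divide by the world size only when AVG was requested.
  if (shouldEmulateAvg(op)) {
    const double worldSize = static_cast<double>(ranks.size());
    for (double& v : out) v /= worldSize;
  }
  return out;
}
```

With two ranks holding {1, 2} and {3, 6}, requesting AVG yields {2, 4}, while SUM yields {4, 8}.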

return reduceOp == ReduceOp::AVG;
}

ReduceOp toXcclReduceOp(const ReduceOp& reduceOp, const at::Tensor& input) {
Copilot AI Jan 6, 2026

The function name 'toXcclReduceOp' is misleading since it doesn't actually convert to an XCCL-specific type but rather conditionally transforms AVG to SUM. Consider renaming to 'getEmulatedReduceOp' or 'transformReduceOpForXccl' to better reflect its purpose.

Suggested change
- ReduceOp toXcclReduceOp(const ReduceOp& reduceOp, const at::Tensor& input) {
+ ReduceOp getEmulatedReduceOp(const ReduceOp& reduceOp, const at::Tensor& input) {

Comment on lines +43 to +48
if (reduceOp == ReduceOp::AVG) {
  TORCH_CHECK(
      input.scalar_type() != at::kBool,
      "Cannot use ReduceOp.AVG with boolean inputs");
  return ReduceOp::SUM;
}
Copilot AI Jan 6, 2026

The boolean input validation is performed in 'toXcclReduceOp()', so the check runs at every reduction call site. Consider moving this validation earlier (e.g., into the parent functions that call the reduction operations) to avoid redundant checks in hot paths.
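
One way to read this suggestion is to split validation from mapping, so the dtype check runs once at the entry point and the mapping helper stays a pure function. The sketch below assumes hypothetical names throughout (`ScalarType`, `checkAvgInput`, `mapAvgToSum`, `planReductions`) and uses a plain exception in place of `TORCH_CHECK`; it is not taken from the PR.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for the framework types.
enum class ReduceOp { SUM, AVG };
enum class ScalarType { Float, Bool };

// Validation only: rejects AVG on boolean inputs (stands in for TORCH_CHECK).
inline void checkAvgInput(ReduceOp op, ScalarType dtype) {
  if (op == ReduceOp::AVG && dtype == ScalarType::Bool) {
    throw std::invalid_argument("Cannot use ReduceOp.AVG with boolean inputs");
  }
}

// Mapping only: no validation, so it is cheap to call per reduction.
inline ReduceOp mapAvgToSum(ReduceOp op) {
  return op == ReduceOp::AVG ? ReduceOp::SUM : op;
}

// Hypothetical entry point: validates once, then maps the op for each tensor.
std::vector<ReduceOp> planReductions(ReduceOp op, ScalarType dtype,
                                     std::size_t numTensors) {
  checkAvgInput(op, dtype);  // single check for the whole batch
  std::vector<ReduceOp> plan;
  plan.reserve(numTensors);
  for (std::size_t i = 0; i < numTensors; ++i) {
    plan.push_back(mapAvgToSum(op));
  }
  return plan;
}
```

Whether the check is actually costly enough to matter in the real call sites is not established in this thread; the sketch only shows the shape of the refactoring.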

@songhappy songhappy requested a review from pkourdis January 8, 2026 00:13
@Chao1Han
Contributor

Chao1Han commented Jan 8, 2026

oneCCL supports AVG directly over the XeLink path. Simulating everything with SUM followed by division would incur a performance penalty. Do we really have a strong justification to mandate that all AVG operations be emulated in this way?
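
The alternative implied by this question, emulating only where native AVG is unavailable, could look roughly like the sketch below. The capability flag `PathInfo::nativeAvg` is purely hypothetical: this thread does not say how, or whether, oneCCL exposes such information, so the example only illustrates the decision logic.

```cpp
// Hypothetical stand-in for the framework's ReduceOp.
enum class ReduceOp { SUM, AVG };

// Hypothetical capability descriptor for the communication path in use.
struct PathInfo {
  bool nativeAvg;  // assumption: true when the path can run AVG directly
};

// What to hand to the backend, and whether a division must follow.
struct ReducePlan {
  ReduceOp backendOp;
  bool divideAfter;
};

// Emulate AVG as SUM + division only when the path lacks native AVG.
inline ReducePlan planReduce(ReduceOp op, const PathInfo& path) {
  if (op == ReduceOp::AVG && !path.nativeAvg) {
    return {ReduceOp::SUM, true};
  }
  return {op, false};
}
```

Under this scheme a path with native AVG keeps the single-collective fast path, and only the unsupported paths pay for the extra division.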

#if !defined(XCCL_HAS_AVG)
if (opts.reduceOp == ReduceOp::AVG) {
input, output, comm, reduceOp, xcclStream, stream);
if (shouldEmulateAvg(opts.reduceOp)) {
Contributor

Have you evaluated the performance impact? If you find cases where AVG is unsupported, you should ask the oneCCL team to add support instead of replacing AVG with SUM.
