using sum/world_size to replace avg #2691

songhappy · 2026-01-06T06:56:30Z

fix https://jira.devtools.intel.com/browse/PYTORCHDGQ-6106.
oneCCL does not allow AVG and throws an error during its checks, so this PR replaces AVG with SUM followed by division by world_size. Nothing else is changed.

Copilot

Pull request overview

This PR replaces the compile-time conditional AVG reduction operation with a runtime emulation approach. Previously, the code used #if !defined(XCCL_HAS_AVG) preprocessor directives to conditionally emulate AVG by performing SUM followed by division. The new implementation unconditionally emulates AVG for all oneCCL operations since oneCCL doesn't support AVG on all scheduler paths.

Key changes:

Added helper functions shouldEmulateAvg() and toXcclReduceOp() to centralize AVG emulation logic
Removed all #if !defined(XCCL_HAS_AVG) preprocessor conditionals
Applied consistent AVG emulation across all reduction operations
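
The emulation described above can be sketched without PyTorch or oneCCL as follows. This is a minimal, self-contained illustration rather than the code in the PR: the `ReduceOp` enum is a two-value stand-in for the real type, `opForBackend` and `simulatedAllreduce` are hypothetical names, and the "ranks" are plain vectors in one process. Only `shouldEmulateAvg` mirrors a helper that appears in the diff.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in for the real ReduceOp type; only the two values needed here.
enum class ReduceOp { SUM, AVG };

// Mirrors the helper visible in the diff: AVG is the only op that is emulated.
bool shouldEmulateAvg(ReduceOp op) {
  return op == ReduceOp::AVG;
}

// Hypothetical helper: the op actually handed to the backend.
// AVG becomes SUM; every other op passes through unchanged.
ReduceOp opForBackend(ReduceOp op) {
  return shouldEmulateAvg(op) ? ReduceOp::SUM : op;
}

// Simulated allreduce over in-process buffers (rankInputs[r] belongs to rank r).
// The "backend" here only knows how to SUM, which is the situation being emulated.
std::vector<double> simulatedAllreduce(
    const std::vector<std::vector<double>>& rankInputs, ReduceOp op) {
  if (rankInputs.empty()) {
    throw std::invalid_argument("need at least one rank");
  }
  if (opForBackend(op) != ReduceOp::SUM) {
    throw std::invalid_argument("this sketch only handles SUM and AVG");
  }
  const std::size_t worldSize = rankInputs.size();
  const std::size_t n = rankInputs[0].size();
  std::vector<double> out(n, 0.0);

  // Step 1: element-wise SUM across ranks.
  for (const auto& buf : rankInputs) {
    if (buf.size() != n) {
      throw std::invalid_argument("all ranks must contribute equal-sized buffers");
    }
    for (std::size_t i = 0; i < n; ++i) {
      out[i] += buf[i];
    }
  }

  // Step 2: if AVG was requested, divide the summed result by world_size.
  if (shouldEmulateAvg(op)) {
    for (double& v : out) {
      v /= static_cast<double>(worldSize);
    }
  }
  return out;
}
```

With two ranks contributing `{1, 2}` and `{3, 6}`, a `ReduceOp::SUM` request yields `{4, 8}` and a `ReduceOp::AVG` request yields `{2, 4}`. The extra division pass in step 2 is the cost the reviewers ask about later in the thread.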

Copilot · 2026-01-06T06:57:25Z

src/xccl/ProcessGroupXCCL.cpp

+  return reduceOp == ReduceOp::AVG;
+}
+
+ReduceOp toXcclReduceOp(const ReduceOp& reduceOp, const at::Tensor& input) {


The function name 'toXcclReduceOp' is misleading since it doesn't actually convert to an XCCL-specific type but rather conditionally transforms AVG to SUM. Consider renaming to 'getEmulatedReduceOp' or 'transformReduceOpForXccl' to better reflect its purpose.

Suggested change

ReduceOp toXcclReduceOp(const ReduceOp& reduceOp, const at::Tensor& input) {

ReduceOp getEmulatedReduceOp(const ReduceOp& reduceOp, const at::Tensor& input) {

Copilot · 2026-01-06T06:57:26Z

src/xccl/ProcessGroupXCCL.cpp

+  if (reduceOp == ReduceOp::AVG) {
+    TORCH_CHECK(
+        input.scalar_type() != at::kBool,
+        "Cannot use ReduceOp.AVG with boolean inputs");
+    return ReduceOp::SUM;
+  }


The boolean input validation is performed in 'toXcclReduceOp()' but this check occurs at each reduction call site. Consider moving this validation earlier (e.g., in the parent functions that call the reduction operations) to avoid redundant checks in hot paths.

Chao1Han · 2026-01-08T05:31:31Z

oneCCL supports AVG directly over the XeLink path. Simulating everything with SUM followed by division would incur a performance penalty. Do we really have a strong justification to mandate that all AVG operations be emulated in this way?
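
One way to picture the alternative raised here is to fall back to SUM plus division only when native AVG is unavailable. The sketch below is an assumption-laden illustration, not an API of oneCCL or of this repository: `ReducePlan`, `planReduce`, and the `nativeAvgSupported` flag are all hypothetical, and how such a flag would be determined for a given path is left open.

```cpp
// Stand-in for the real ReduceOp type; only the two values needed here.
enum class ReduceOp { SUM, AVG };

// Hypothetical description of how a reduction will be carried out.
struct ReducePlan {
  ReduceOp backendOp;  // op handed to the communication backend
  bool divideAfter;    // true if the result must be divided by world_size afterwards
};

// nativeAvgSupported is a hypothetical capability flag supplied by the caller.
// AVG is emulated only when the backend cannot do it natively; otherwise the
// requested op is passed through and no extra division is scheduled.
ReducePlan planReduce(ReduceOp requested, bool nativeAvgSupported) {
  if (requested == ReduceOp::AVG && !nativeAvgSupported) {
    return ReducePlan{ReduceOp::SUM, true};
  }
  return ReducePlan{requested, false};
}
```

Under this scheme, paths with native AVG would avoid the extra division pass, while the remaining paths would keep the emulation introduced by the PR.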

zhangxiaoli73 · 2026-01-08T05:53:15Z

src/xccl/ProcessGroupXCCL.cpp

-#if !defined(XCCL_HAS_AVG)
-        if (opts.reduceOp == ReduceOp::AVG) {
+            input, output, comm, reduceOp, xcclStream, stream);
+        if (shouldEmulateAvg(opts.reduceOp)) {


Have you evaluated the performance impact? If you find cases where AVG is unsupported, you should ask oneCCL to add support instead of replacing AVG with SUM.

using sum/world_size to replace avg

b62c819

Copilot AI review requested due to automatic review settings January 6, 2026 06:56

Copilot AI reviewed Jan 6, 2026

songhappy requested a review from pkourdis January 8, 2026 00:13

zhangxiaoli73 reviewed Jan 8, 2026
