Conversation

@CuiYifeng (Contributor)

This PR removes the workaround for a functionality regression in oneMKL 2025.2 and fixes error-handling code that was never executed with 2025.2.

@CuiYifeng CuiYifeng added this to the PT2.10 milestone Nov 10, 2025
@CuiYifeng CuiYifeng added the mkl label Nov 10, 2025
@CuiYifeng CuiYifeng marked this pull request as ready for review December 1, 2025 08:49
Copilot AI review requested due to automatic review settings December 1, 2025 08:49

Copilot AI left a comment


Pull request overview

This PR addresses error handling in BatchLinearAlgebra operations by removing a workaround for a oneMKL 2025.2 regression and fixing indexing bugs in the error handling code. The changes enable proper exception handling that was previously non-functional.

Key Changes:

  • Removed temporary skip list entries for inverse operation tests that were disabled due to oneMKL regression
  • Fixed incorrect indexing in error handling code to use exception IDs instead of loop indices
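The indexing fix in the second bullet can be illustrated with a small sketch. This is plain Python with a mock batch error standing in for oneMKL's; the `ids`/`infos` names are illustrative, not the real oneMKL API:

```python
# Mock of a batch exception that reports which matrices in the batch failed.
# Illustrative stand-in only; not the real oneMKL batch_error interface.
class MockBatchError:
    def __init__(self, ids, infos):
        self.ids = ids      # ids of the failing matrices in the batch
        self.infos = infos  # LAPACK-style info value for each failure

def record_infos(err, info_cpu):
    """Write each failure's info value into the slot of the matrix it
    belongs to. The bug was writing info_cpu[i] (the loop index), which
    tags the wrong matrix whenever the i-th exception does not concern
    the i-th matrix in the batch."""
    for i in range(len(err.ids)):
        info_cpu[err.ids[i]] = err.infos[i]  # index by matrix id, not by i

info_cpu = [0, 0, 0, 0]         # batch of 4 matrices, all initially ok
err = MockBatchError([2], [3])  # only matrix 2 failed, with info = 3
record_infos(err, info_cpu)
print(info_cpu)  # → [0, 0, 3, 0]: the failure lands on matrix 2, not matrix 0
```

With the old `info_cpu[i]` indexing, the same input would have flagged matrix 0 and left matrix 2 looking healthy, which is why the downstream error checks never fired.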

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `test/xpu/skip_list_common.py` | Removed 8 test cases from the skip list that were temporarily disabled due to the oneMKL regression |
| `src/ATen/native/xpu/mkl/BatchLinearAlgebra.cpp` | Removed the workaround code block and fixed an indexing bug in the error handlers to use `ids[i]` instead of `i` |


```diff
         "\nDetail: ",
         e.detail());
-    info_cpu[i] = e.info();
+    info_cpu[ids[i]] = e.info();
```
Copilot AI Dec 1, 2025

Potential out-of-bounds access: ids[i] is used to index info_cpu, but there's no validation that ids[i] is within the bounds of info_cpu. This could cause a segfault or memory corruption if the batch exception contains an invalid matrix ID.

```diff
 } catch (const sycl::exception& e) {
   TORCH_WARN("Caught SYCL exception:\nWhat: ", e.what(), "\nInfo: -1");
-  info_cpu[i] = -1;
+  info_cpu[ids[i]] = -1;
```
Copilot AI Dec 1, 2025

Potential out-of-bounds access: Same issue as above - ids[i] is used to index info_cpu without bounds checking. Consider adding validation that ids[i] < info_cpu.size() before indexing.
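The validation both review comments ask for could look like the following sketch. It is plain Python mirroring the logic of the C++ handler, with a mock batch error; all names are illustrative, not oneMKL's API:

```python
class MockBatchError:
    """Illustrative stand-in for a batch exception carrying per-failure
    matrix ids and info values."""
    def __init__(self, ids, infos):
        self.ids, self.infos = ids, infos

def record_infos_checked(err, info_cpu):
    # Guard against out-of-range matrix ids before indexing, as the
    # review comments suggest. Out-of-range ids are collected and
    # returned for reporting instead of being written through, which
    # would corrupt memory in the C++ version.
    skipped = []
    for matrix_id, info in zip(err.ids, err.infos):
        if 0 <= matrix_id < len(info_cpu):
            info_cpu[matrix_id] = info
        else:
            skipped.append(matrix_id)  # invalid id: do not touch info_cpu
    return skipped

info_cpu = [0, 0, 0]
err = MockBatchError([1, 7], [2, 5])  # id 7 is out of bounds for a batch of 3
skipped = record_infos_checked(err, info_cpu)
print(info_cpu, skipped)  # → [0, 2, 0] [7]
```

In the actual C++ handler the equivalent guard would be a `TORCH_INTERNAL_ASSERT` or an explicit bounds check on `ids[i]` against the info tensor's length before the assignment.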

@CuiYifeng CuiYifeng force-pushed the yifeng/remove_mkl_wa branch 4 times, most recently from da52643 to cc9fcc0 Compare December 11, 2025 07:40
@CuiYifeng CuiYifeng force-pushed the yifeng/remove_mkl_wa branch 2 times, most recently from c266c55 to b3909ed Compare December 22, 2025 02:49
@github-actions

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| huggingface_float16_training | MT5ForConditionalGeneration | 0.904954 | 0.882927 |

@CuiYifeng CuiYifeng force-pushed the yifeng/remove_mkl_wa branch from b3909ed to b25f933 Compare December 24, 2025 00:58
@CuiYifeng (Contributor, Author)

Fix the following cases:

```python
        # Exception is temporarily unavailable due to regression in oneMKL
        "test_inv_errors_and_warnings_xpu_float32",
        "test_inv_errors_and_warnings_xpu_float64",
        "test_inverse_errors_large_xpu_float32",
        "test_inverse_errors_large_xpu_float64",
        "test_inverse_errors_xpu_float32",
        "test_inverse_errors_xpu_float64",
        "test_inv_ex_singular_xpu_float32",
        "test_inv_ex_singular_xpu_float64",
```

@CuiYifeng CuiYifeng force-pushed the yifeng/remove_mkl_wa branch from b25f933 to 28a22dc Compare December 31, 2025 01:59
@github-actions

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | dcgan | 0.844164 | 0.827297 |
| torchbench_bfloat16_training | resnet18 | 0.934244 | 0.835603 |
| torchbench_bfloat16_training | resnext50_32x4d | 0.952235 | 0.853000 |
| torchbench_bfloat16_training | densenet121 | 0.804804 | 0.858987 |
| torchbench_bfloat16_training | mnasnet1_0 | 1.021068 | 0.869143 |
| torchbench_bfloat16_training | squeezenet1_1 | 1.009145 | 0.880740 |

@CuiYifeng CuiYifeng force-pushed the yifeng/remove_mkl_wa branch from 28a22dc to 9d1a2f8 Compare January 6, 2026 03:23
@github-actions

github-actions bot commented Jan 6, 2026

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | resnet18 | 0.902028 | 0.805277 |

@CuiYifeng CuiYifeng force-pushed the yifeng/remove_mkl_wa branch from 9d1a2f8 to 9415070 Compare January 6, 2026 05:02
@github-actions

github-actions bot commented Jan 7, 2026

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | densenet121 | 0.783036 | 0.80363 |

  • 🟡 [80%, 90%), may be fluctuations

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | dcgan | 0.850267 | 0.801978 |
| torchbench_bfloat16_training | mobilenet_v3_large | 0.976237 | 0.825791 |
| torchbench_bfloat16_training | resnet18 | 0.921155 | 0.839705 |
| torchbench_bfloat16_training | squeezenet1_1 | 1.023057 | 0.886896 |
| huggingface_bfloat16_training | MT5ForConditionalGeneration | 0.915300 | 0.890694 |

@CuiYifeng CuiYifeng force-pushed the yifeng/remove_mkl_wa branch from 9415070 to 9b9d51d Compare January 7, 2026 07:49
@github-actions

github-actions bot commented Jan 7, 2026

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | resnet18 | 0.964222 | 0.831755 |
