
Commit f6727da

generatedunixname499836121tbohutyn authored and committed
Raise XPU tolerances for bf16 ResNet & BotNet TorchBench (#170552)
Summary:
Multiple TorchBench models on XPU fail accuracy tests because the numeric tolerance is too strict. Two contributing factors were identified:

1. A measurement methodology change (PyTorch 2.6.0 enforcing cosine_similarity, https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L2227) surfaced limitations and increased the sensitivity of the error checks for phlippe_resnet.
2. BatchNorm decomposition noise (~1e-5 RMSE per BN in fp16) accumulates across the BatchNorm layers in botnet26t_256, pushing aggregate diffs beyond the current thresholds.

**Analysis**
- The phlippe_resnet failures reproduce on both CPU and XPU; fp16 already uses a higher tolerance, implying the bf16 thresholds are misaligned.
- Disabling BN decomposition brings botnet26t_256 outputs within tolerance; with decomposition enabled, cumulative numeric error is expected.
- CI health indicates the changes are non-disruptive; failures, where present, are unrelated to these PRs.

Fixes intel/torch-xpu-ops#1799
Fixes intel/torch-xpu-ops#1305

X-link: pytorch/pytorch#170552
Approved by: https://github.com/EikanWang, https://github.com/desertfire
Reviewed By: seemethere
Differential Revision: D89434646
fbshipit-source-id: e5ce062b497201158578abb1bdebaac4b593dbfd
Co-authored-by: Tomasz Bohutyn <tbohutyn@habana.ai>
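To make the accumulation argument concrete, here is a back-of-the-envelope sketch (illustrative only, not part of the commit): if each decomposed BatchNorm contributes roughly independent noise of ~1e-5 RMSE in fp16, the aggregate error grows with the square root of the number of BN layers, which is enough to push a deep model like botnet26t_256 past a tight 1e-2 threshold's margin.

import math

# Illustrative only (not from the commit): assuming each decomposed
# BatchNorm adds roughly independent noise of ~1e-5 RMSE in fp16,
# the errors add in quadrature, so the aggregate RMSE scales with
# sqrt(number of BN layers).
PER_BN_RMSE = 1e-5

for num_bn_layers in (10, 25, 50, 100):
    aggregate_rmse = PER_BN_RMSE * math.sqrt(num_bn_layers)
    print(f"{num_bn_layers:>3} BN layers -> ~{aggregate_rmse:.1e} aggregate RMSE")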
1 parent c65e4e7 commit f6727da

File tree

2 files changed: +11 −0 lines changed


userbenchmark/dynamo/dynamobench/timm_models.py

Lines changed: 10 additions & 0 deletions
@@ -71,6 +71,10 @@ def pip_install(package):
     "mobilenetv3_large_100",
 }
 
+REQUIRE_HIGHER_TOLERANCE_FP16_XPU = {
+    "botnet26t_256",
+}
+
 REQUIRE_HIGHER_TOLERANCE_AMP = {}
 
 REQUIRE_EVEN_HIGHER_TOLERANCE = {
@@ -366,6 +370,12 @@ def get_tolerance_and_cosine_flag(self, is_training, current_device, name):
             self.args.amp and name in REQUIRE_HIGHER_TOLERANCE_AMP
         ):
             tolerance = 4 * 1e-2
+        elif (
+            name in REQUIRE_HIGHER_TOLERANCE_FP16_XPU
+            and self.args.float16
+            and current_device == "xpu"
+        ):
+            tolerance = 4 * 1e-2
         else:
             tolerance = 1e-2
         return tolerance, cosine
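For context, a minimal standalone sketch of the branch this hunk adds (the function name and signature here are simplified; in the real file the logic is a method on the benchmark runner and also consults other tolerance sets and the cosine flag):

# Simplified standalone sketch of the new branch; not the actual runner method.
REQUIRE_HIGHER_TOLERANCE_FP16_XPU = {"botnet26t_256"}

def pick_tolerance(name, use_float16, current_device):
    if (
        name in REQUIRE_HIGHER_TOLERANCE_FP16_XPU
        and use_float16
        and current_device == "xpu"
    ):
        return 4 * 1e-2  # relaxed threshold for fp16 on XPU
    return 1e-2  # default threshold

print(pick_tolerance("botnet26t_256", True, "xpu"))   # 0.04
print(pick_tolerance("botnet26t_256", True, "cuda"))  # 0.01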

userbenchmark/dynamo/dynamobench/torchbench.yaml

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@ tolerance:
   # These models need higher tolerance for xpu devices with bf16
   higher_bf16_xpu:
     - squeezenet1_1
+    - phlippe_resnet
 
   freezing:
     # Similar logic to timm_models.py:get_tolerance_and_cosine_flag
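As a rough sketch of how a harness could consume this list (the actual lookup lives in the dynamo benchmark harness; the 4e-2 threshold below is assumed for illustration):

import yaml

# Rough sketch, not the harness's actual code: load the tolerance config
# and relax the threshold for models listed under higher_bf16_xpu when
# running bf16 on XPU. The 4e-2 value is assumed for illustration.
with open("userbenchmark/dynamo/dynamobench/torchbench.yaml") as f:
    config = yaml.safe_load(f)

HIGHER_BF16_XPU = set(config["tolerance"]["higher_bf16_xpu"])

def bf16_xpu_tolerance(model_name):
    return 4e-2 if model_name in HIGHER_BF16_XPU else 1e-2

print(bf16_xpu_tolerance("phlippe_resnet"))  # relaxed after this change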
