
Conversation

@SS-JIA (Contributor) commented Dec 9, 2025

Stack from ghstack (oldest at bottom):

Context

The `conv2d_q8ta_q8csw_q8to_conv2d` and `conv2d_dw_q8ta_q8csw_q8to_conv2d` shaders currently have extremely high latency on Arm GPUs (Mali/Immortalis architecture).

In contrast, the `conv2d_q8ta_q8csw_q8to_linear_tiled` shaders, which are used for pointwise convolutions and 3x3 non-grouped convolutions, have reasonable latency.

Using `malioc` (the Mali Offline Compiler), it was found that the slow shaders reported usage of the "stack", whereas the fast shader did not.

According to the docs of the Mali Offline Compiler:

Stack use

Stack is a form of thread local storage that is used by compiler-generated memory allocations and register spills. The stack size metric in the report shows the size of the stack memory for a single shader thread. Later compilers generate additional sub-metrics that show the split between regions used for spilling and compiler-generated allocations.

You can reduce the size of your stack in the following ways:

  • Avoid coding patterns that require the compiler to allocate on the stack, such as dynamically indexing into temporary arrays.
  • Reduce register pressure to avoid stack spilling.

Unlike the fast shader, the two slow shaders loaded the input convolution window into a local array, then performed the calculation once the entire input window was loaded. This caused two issues:

  1. A large array is needed to store the input window, which requires many registers
  2. Performing the convolution calculation with the input window array requires dynamic indexing

Presumably, this caused a large amount of memory to be allocated via the stack, which led to the performance regression.
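
To make the problematic pattern concrete, below is a minimal compute-shader sketch (hypothetical code, not taken from this PR; the buffer names, the `Params` push constant, and the `MAX_WINDOW` bound are all illustrative) that pre-loads values into a thread-local array and then indexes that array with runtime-computed indices, which is the kind of construct malioc attributes to stack usage:

```glsl
#version 450

layout(set = 0, binding = 0) readonly buffer InBuf { float in_data[]; };
layout(set = 0, binding = 1) writeonly buffer OutBuf { float out_data[]; };

// Hypothetical push constant carrying a runtime kernel size (e.g. 3x3 = 9).
layout(push_constant) uniform Params {
  uint kernel_hw;
} params;

layout(local_size_x = 64) in;

const uint MAX_WINDOW = 25u; // worst-case window size, illustrative only

void main() {
  const uint tid = gl_GlobalInvocationID.x;

  // "Slow" pattern: pre-load the entire input window into a thread-local
  // array before doing any arithmetic.
  float window[MAX_WINDOW];
  for (uint k = 0u; k < params.kernel_hw; ++k) {
    window[k] = in_data[tid * MAX_WINDOW + k];
  }

  // The accumulation indexes `window` with a value that is not a
  // compile-time constant, so the compiler may be unable to keep the array
  // in registers and can allocate it on the stack instead.
  float acc = 0.0;
  for (uint k = 0u; k < params.kernel_hw; ++k) {
    acc += window[k];
  }
  out_data[tid] = acc;
}
```

malioc's stack-size metric would be expected to flag a shader like this, which is consistent with what was observed for the slow shaders.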

Changes

Rewrite the `conv2d_q8ta_q8csw_q8to_conv2d` shader so that it does not pre-load input values into a local array, and update the convolution calculation to avoid dynamic indexing.
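
For comparison, here is a sketch of the direction of the rewrite (again hypothetical, not the actual shader): each input value is folded into the accumulator as soon as it is loaded, and the loop bounds are compile-time constants (here via illustrative `KERNEL_H`/`KERNEL_W` defines), so no temporary window array or dynamic indexing is needed:

```glsl
#version 450

layout(set = 0, binding = 0) readonly buffer InBuf { float in_data[]; };
layout(set = 0, binding = 1) writeonly buffer OutBuf { float out_data[]; };

layout(local_size_x = 64) in;

// Illustrative compile-time window size; real shaders would typically bake
// this in via shader codegen or specialization constants.
#define KERNEL_H 3
#define KERNEL_W 3

void main() {
  const uint tid = gl_GlobalInvocationID.x;

  // "Fast" pattern: no thread-local window array. Each value is accumulated
  // as soon as it is loaded, and because the loop bounds are compile-time
  // constants the compiler can fully unroll the loops, making every index a
  // constant that can live in registers (no stack allocation needed).
  float acc = 0.0;
  for (int ky = 0; ky < KERNEL_H; ++ky) {
    for (int kx = 0; kx < KERNEL_W; ++kx) {
      acc += in_data[tid * uint(KERNEL_H * KERNEL_W) + uint(ky * KERNEL_W + kx)];
    }
  }
  out_data[tid] = acc;
}
```

The actual shader additionally handles quantization parameters, texture versus buffer storage, and output tiling; the sketch only illustrates the change in access pattern.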

Differential Revision: [D88702990](https://our.internmc.facebook.com/intern/diff/D88702990/)

pytorch-bot bot commented Dec 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16142

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit faa3a6b with merge base 78a73d4:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

SS-JIA pushed a commit that referenced this pull request Dec 9, 2025

ghstack-source-id: 327992212
Pull Request resolved: #16142
@meta-cla meta-cla bot added the CLA Signed label Dec 9, 2025
github-actions bot commented Dec 9, 2025

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example:
`@pytorchbot label "release notes: none"`

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

SS-JIA added a commit that referenced this pull request Dec 9, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #16142
* __->__ #16141

As title: update the way the test case names are displayed in the quantized convolution test binaries so that it is clearer what is being tested.

Before:

```
performance_64/32_128/128_3/3_Texture3D_Float
```

After:

```
PERF  32->64  I=128,128  k=3  Tex->Buf
```

Differential Revision: [D88702991](https://our.internmc.facebook.com/intern/diff/D88702991/)

Co-authored-by: ssjia <ssjia@devvm1479.ncg0.facebook.com>

Labels

CLA Signed, fb-exported, meta-exported