
Conversation

@SS-JIA (Contributor) commented Dec 9, 2025

Stack from ghstack (oldest at bottom):

Context

The `conv2d_q8ta_q8csw_q8to_conv2d` and `conv2d_dw_q8ta_q8csw_q8to_conv2d` shaders currently have extremely high latency on Arm GPUs (Mali/Immortalis architecture).

In contrast, the `conv2d_q8ta_q8csw_q8to_linear_tiled` shaders, which are used for pointwise convolutions and 3x3 non-grouped convolutions, have reasonable latency.

Using `malioc` (the Mali Offline Compiler), it was found that the slow shaders reported usage of the "stack", whereas the fast shader did not.

According to the docs of the Mali Offline Compiler:

Stack use

Stack is a form of thread local storage that is used by compiler-generated memory allocations and register spills. The stack size metric in the report shows the size of the stack memory for a single shader thread. Later compilers generate additional sub-metrics that show the split between regions used for spilling and compiler-generated allocations.

You can reduce the size of your stack in the following ways:

  • Avoid coding patterns that require the compiler to allocate on the stack, such as dynamically indexing into temporary arrays.
  • Reduce register pressure to avoid stack spilling.

Unlike the fast shader, the two slow shaders loaded the input convolution window into a local array, then performed the calculation once the entire input window was loaded. This caused two issues:

  1. A large array is needed to store the input window, which requires many registers
  2. Performing the convolution calculation with the input window array requires dynamic indexing

Presumably, this caused a large amount of memory to be allocated via the stack, which led to the performance regression.
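
To make the problematic pattern concrete, below is a minimal compute-shader sketch (hypothetical code, not taken from this PR; the buffer names, the `Params` push constant, and the `MAX_WINDOW` bound are all illustrative) that pre-loads values into a thread-local array and then indexes that array with runtime-computed indices, which is the kind of construct malioc attributes to stack usage:

```glsl
#version 450

layout(set = 0, binding = 0) readonly buffer InBuf { float in_data[]; };
layout(set = 0, binding = 1) writeonly buffer OutBuf { float out_data[]; };

// Hypothetical push constant carrying a runtime kernel size (e.g. 3x3 = 9).
layout(push_constant) uniform Params {
  uint kernel_hw;
} params;

layout(local_size_x = 64) in;

const uint MAX_WINDOW = 25u; // worst-case window size, illustrative only

void main() {
  const uint tid = gl_GlobalInvocationID.x;

  // "Slow" pattern: pre-load the entire input window into a thread-local
  // array before doing any arithmetic.
  float window[MAX_WINDOW];
  for (uint k = 0u; k < params.kernel_hw; ++k) {
    window[k] = in_data[tid * MAX_WINDOW + k];
  }

  // The accumulation indexes `window` with a value that is not a
  // compile-time constant, so the compiler may be unable to keep the array
  // in registers and can allocate it on the stack instead.
  float acc = 0.0;
  for (uint k = 0u; k < params.kernel_hw; ++k) {
    acc += window[k];
  }
  out_data[tid] = acc;
}
```

malioc's stack-size metric would be expected to flag a shader like this, which is consistent with what was observed for the slow shaders.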

Changes

Rewrite the `conv2d_q8ta_q8csw_q8to_conv2d` shader so that it does not pre-load input values into a local array, and update the convolution calculation to avoid dynamic indexing.
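
For comparison, here is a sketch of the direction of the rewrite (again hypothetical, not the actual shader): each input value is folded into the accumulator as soon as it is loaded, and the loop bounds are compile-time constants (here via illustrative `KERNEL_H`/`KERNEL_W` defines), so no temporary window array or dynamic indexing is needed:

```glsl
#version 450

layout(set = 0, binding = 0) readonly buffer InBuf { float in_data[]; };
layout(set = 0, binding = 1) writeonly buffer OutBuf { float out_data[]; };

layout(local_size_x = 64) in;

// Illustrative compile-time window size; real shaders would typically bake
// this in via shader codegen or specialization constants.
#define KERNEL_H 3
#define KERNEL_W 3

void main() {
  const uint tid = gl_GlobalInvocationID.x;

  // "Fast" pattern: no thread-local window array. Each value is accumulated
  // as soon as it is loaded, and because the loop bounds are compile-time
  // constants the compiler can fully unroll the loops, making every index a
  // constant that can live in registers (no stack allocation needed).
  float acc = 0.0;
  for (int ky = 0; ky < KERNEL_H; ++ky) {
    for (int kx = 0; kx < KERNEL_W; ++kx) {
      acc += in_data[tid * uint(KERNEL_H * KERNEL_W) + uint(ky * KERNEL_W + kx)];
    }
  }
  out_data[tid] = acc;
}
```

The actual shader additionally handles quantization parameters, texture versus buffer storage, and output tiling; the sketch only illustrates the change in access pattern.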

Differential Revision: [D88702990](https://our.internmc.facebook.com/intern/diff/D88702990/)

pytorch-bot bot commented Dec 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16142

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit faa3a6b with merge base 78a73d4:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

SS-JIA pushed a commit that referenced this pull request Dec 9, 2025

ghstack-source-id: 327992212
Pull Request resolved: #16142
@meta-cla meta-cla bot added the CLA Signed label Dec 9, 2025
github-actions bot commented Dec 9, 2025

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example:
`@pytorchbot label "release notes: none"`

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

SS-JIA added a commit that referenced this pull request Dec 9, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #16142
* __->__ #16141

As title: update the way the test case names are displayed in the quantized convolution test binaries so that it is clearer what is being tested.

Before:

```
performance_64/32_128/128_3/3_Texture3D_Float
```

After:

```
PERF  32->64  I=128,128  k=3  Tex->Buf
```

Differential Revision: [D88702991](https://our.internmc.facebook.com/intern/diff/D88702991/)

Co-authored-by: ssjia <ssjia@devvm1479.ncg0.facebook.com>

Labels

CLA Signed, fb-exported, meta-exported