[ET-VK] QConv: Avoid dynamic indexing of temporary arrays #16142
base: gh/SS-JIA/379/base
Conversation
## Context

The `conv2d_q8ta_q8csw_q8to_conv2d` and `conv2d_dw_q8ta_q8csw_q8to_conv2d` shaders currently have EXTREMELY slow latency on Arm GPUs (Mali/Immortalis architecture). Conversely, the `conv2d_q8ta_q8csw_q8to_linear_tiled` shaders, which are used for pointwise convolutions and 3x3 non-grouped convolutions, have reasonable latency.

Using `malioc` (the Mali Offline Compiler), it was found that the slow shaders reported usage of the "stack", whereas the fast shader did not. According to the Mali Offline Compiler docs:

### Stack use

Stack is a form of thread local storage that is used by compiler-generated memory allocations and register spills. The stack size metric in the report shows the size of the stack memory for a single shader thread. Later compilers generate additional sub-metrics that show the split between regions used for spilling and compiler-generated allocations.

You can reduce the size of your stack in the following ways:

* Avoid coding patterns that require the compiler to allocate on the stack, such as dynamically indexing into temporary arrays.
* Reduce register pressure to avoid stack spilling.

What the two slow shaders were doing that the fast shader was not was loading the input convolution window into a local array, then performing the calculation only once the entire input window had been loaded. This caused two issues:

1. A large array is needed to store the input window, which requires a lot of registers.
2. Performing the convolution calculation with the input window array requires dynamic indexing.

Presumably, this was causing a lot of memory to be allocated via the stack, which in turn caused the performance regressions. A minimal sketch contrasting the two access patterns is included below.

## Changes

Rewrite the `conv2d_q8ta_q8csw_q8to_conv2d` shader so that it does not pre-load input values into a local array, and update the convolution calculation so that it does not use dynamic indexing.

Differential Revision: [D88702990](https://our.internmc.facebook.com/intern/diff/D88702990/)

[ghstack-poisoned]
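To make the difference concrete, here is a minimal GLSL compute-shader sketch of the two access patterns. It is illustrative only: the bindings, the `load_input`/`load_weight` helpers, and the kernel dimensions are hypothetical placeholders, not the actual `conv2d_q8ta_q8csw_q8to_conv2d` source.

```glsl
#version 450

// Minimal sketch only: bindings, helper names, and kernel dimensions are
// hypothetical placeholders, not the actual ExecuTorch Vulkan shader code.

#define KERNEL_H 3
#define KERNEL_W 3

layout(set = 0, binding = 0) uniform sampler3D t_input;   // placeholder input texture
layout(set = 0, binding = 1) uniform sampler3D t_weight;  // placeholder weight texture
layout(set = 0, binding = 2, rgba32f) uniform writeonly image3D t_output;

layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

vec4 load_input(ivec2 pos)       { return texelFetch(t_input,  ivec3(pos, 0), 0); }
vec4 load_weight(int kx, int ky) { return texelFetch(t_weight, ivec3(kx, ky, 0), 0); }

// Pattern A (what the slow shaders did): pre-load the whole input window into
// a temporary array, then index it with a computed index in a second loop.
// The temporary array plus the dynamic indexing is what can push the compiler
// to place the data in thread-local stack memory instead of registers.
vec4 conv_window_preload(ivec2 base) {
    vec4 window[KERNEL_H * KERNEL_W];
    for (int ky = 0; ky < KERNEL_H; ky++) {
        for (int kx = 0; kx < KERNEL_W; kx++) {
            window[ky * KERNEL_W + kx] = load_input(base + ivec2(kx, ky));
        }
    }
    vec4 acc = vec4(0.0);
    for (int ky = 0; ky < KERNEL_H; ky++) {
        for (int kx = 0; kx < KERNEL_W; kx++) {
            acc += window[ky * KERNEL_W + kx] * load_weight(kx, ky);  // dynamic index
        }
    }
    return acc;
}

// Pattern B (the rewritten approach, in spirit): accumulate as each texel is
// loaded, so no temporary array and no dynamic indexing is needed and the
// intermediate values can stay in registers.
vec4 conv_window_accumulate(ivec2 base) {
    vec4 acc = vec4(0.0);
    for (int ky = 0; ky < KERNEL_H; ky++) {
        for (int kx = 0; kx < KERNEL_W; kx++) {
            acc += load_input(base + ivec2(kx, ky)) * load_weight(kx, ky);
        }
    }
    return acc;
}

void main() {
    const ivec2 pos = ivec2(gl_GlobalInvocationID.xy);
    imageStore(t_output, ivec3(pos, 0), conv_window_accumulate(pos));
}
```

(With small, fully constant loop bounds a compiler may be able to unroll Pattern A and eliminate the array anyway; the point of the sketch is the structural difference between buffering the whole window and accumulating in place.)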
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16142

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure) As of commit faa3a6b with merge base 78a73d4.

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #16142
* __->__ #16141

As title - update the way the test case names are displayed in the quantized convolution test binaries so that it is clearer what is being tested.

Before:

```
performance_64/32_128/128_3/3_Texture3D_Float
```

After:

```
PERF 32->64 I=128,128 k=3 Tex->Buf
```

Differential Revision: [D88702991](https://our.internmc.facebook.com/intern/diff/D88702991/)

Co-authored-by: ssjia <ssjia@devvm1479.ncg0.facebook.com>