[Experiment] Evaluate perf impact of striped vs. blocked SLM read/write 1D copy atoms #631
Description
The SYCL group load/store APIs used for 1D SLM <-> register copies support both blocked and striped copies. For example, store.slm.d32x4.a32 and store.slm.d64x2.a32 store 128 bits per work-item (blocked) but cause bank conflicts. Blocked layout is the default data placement for both APIs.
This PR switches to striped loads and stores (each work-item transfers one item at a time) as an experiment to check the performance impact (and any potential breakages). Striped stores seem to use messages such as store.slm.d64.a32 or store.slm.d32x2.a32. If the bank width of BMG/PVC is 64 bits (the documentation states 32 bits, but that part may not have been updated for Xe12), these messages write to SLM without bank conflicts.
Performance characteristics of either type of instruction don't seem to be publicly available. It's even possible that the first (blocked) type performs better because it needs fewer block messages (each one transfers twice as much data as an instruction of the second type), even though it entails bank conflicts.
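To make the bank-conflict argument concrete, here is a rough host-side sketch in plain C++ (not SYCL, and not code from this PR). It is a toy model only: the helper `conflict_degree`, the 16-lane sub-group, the 16-bank count, and the address-interleaved bank mapping are all assumptions for illustration, not documented BMG/PVC behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy bank-conflict model (hypothetical, not documented hardware behavior).
// Lane i accesses access_bytes contiguous bytes starting at i * stride_bytes.
// SLM is modeled as num_banks banks of bank_bytes each, interleaved by address.
// Returns the largest number of lanes that touch the same bank in one message.
std::size_t conflict_degree(std::size_t lanes, std::size_t access_bytes,
                            std::size_t stride_bytes, std::size_t bank_bytes,
                            std::size_t num_banks) {
  assert(access_bytes > 0 && bank_bytes > 0 && num_banks > 0);
  std::vector<std::size_t> lanes_per_bank(num_banks, 0);
  for (std::size_t lane = 0; lane < lanes; ++lane) {
    // Range of bank-sized slots covered by this lane's access.
    const std::size_t first_slot = (lane * stride_bytes) / bank_bytes;
    const std::size_t last_slot =
        (lane * stride_bytes + access_bytes - 1) / bank_bytes;
    // Count each bank at most once per lane.
    std::vector<bool> touched(num_banks, false);
    for (std::size_t slot = first_slot; slot <= last_slot; ++slot)
      touched[slot % num_banks] = true;
    for (std::size_t bank = 0; bank < num_banks; ++bank)
      if (touched[bank]) ++lanes_per_bank[bank];
  }
  return *std::max_element(lanes_per_bank.begin(), lanes_per_bank.end());
}
```

Under these assumed parameters, a blocked 128-bit store per lane, `conflict_degree(16, 16, 16, 8, 16)`, gives 2 with 64-bit banks, while a striped 64-bit store per lane, `conflict_degree(16, 8, 8, 8, 16)`, gives 1. With 32-bit banks the same striped store, `conflict_degree(16, 8, 8, 4, 16)`, gives 2, which is why the actual bank width matters for the claim above.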
Even if we have BF16 data to move to/from SLM, we could reinterpret-cast it to a dtype whose size equals the bank width, so that each lane still transfers one bank-wide item.
1D copies to/from global memory also support striped loads and stores, but bank conflicts aren't an issue there, so I didn't modify the corresponding copy atoms.
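As a small host-side illustration of that reinterpretation idea (a sketch, not the copy-atom code): the alias `bf16_bits` and the helpers `pack4`/`unpack4` below are hypothetical names, BF16 values are represented by their raw 16-bit storage, and a 64-bit bank width is assumed.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Raw 16-bit storage standing in for a BF16 value (hypothetical alias).
using bf16_bits = std::uint16_t;

// View four 16-bit values as one 64-bit item, assuming a 64-bit bank width.
// std::memcpy is used so the reinterpretation is well-defined in standard C++.
std::uint64_t pack4(const std::array<bf16_bits, 4>& values) {
  std::uint64_t word = 0;
  static_assert(sizeof(word) == sizeof(values), "sizes must match");
  std::memcpy(&word, values.data(), sizeof(word));
  return word;
}

// Inverse of pack4: recover the four 16-bit values from the 64-bit item.
std::array<bf16_bits, 4> unpack4(std::uint64_t word) {
  std::array<bf16_bits, 4> values{};
  std::memcpy(values.data(), &word, sizeof(word));
  return values;
}
```

The round trip `unpack4(pack4(v))` returns `v` unchanged, so the data could be moved as 64-bit items and viewed as BF16 again afterwards.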
Type
Performance
Testing
Performance
cc @pengzhao-intel