@sanchitintel sanchitintel commented Nov 15, 2025

Description

The SYCL group load/store API used for 1D SLM <-> register copies supports both blocked & striped copies. For example, store.slm.d32x4.a32 and store.slm.d64x2.a32 store 128 bits per work-item (blocked) but cause bank conflicts. Blocked layout is the default data placement for both the load and the store API.

This PR switches to striped loads & stores (each work-item transfers one item) as an experiment to check the performance impact (and any potential breakage). Striped stores seem to use messages such as store.slm.d64.a32 or store.slm.d32x2.a32. If the bank width of BMG/PVC is 64 bits (the documentation states 32 bits, but that part may not have been updated for Xe12), then these messages write to SLM without bank conflicts.

Performance characteristics of either type of instruction don't seem to be available in the public domain. It's even possible that the first (blocked) type performs better because it needs fewer block messages (each one transfers twice as much data as an instruction of the second type), even though it entails bank conflicts.
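To make the blocked vs. striped distinction concrete, here is a minimal host-side sketch in plain C++ (no SYCL) that counts how many distinct SLM words land in the same bank under each placement. Everything in it is an assumption for illustration: the names `Layout`, `element_index` and `max_words_per_bank` are hypothetical, the bank count of 16 is made up, and the model assumes that at step k every lane accesses its k-th element at once, which is a simplification and not documented hardware behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names; this models data placement only, not real hardware.
enum class Layout { Blocked, Striped };

// Index of the k-th element owned by `lane`.
// Blocked: each lane owns a contiguous run of elems_per_lane elements.
// Striped: consecutive elements are dealt out to consecutive lanes.
std::size_t element_index(Layout layout, std::size_t lane, std::size_t k,
                          std::size_t lanes, std::size_t elems_per_lane) {
  return layout == Layout::Blocked ? lane * elems_per_lane + k
                                   : k * lanes + lane;
}

// Worst case, over all steps k, of the number of distinct bank-wide words
// that map to the same bank. 1 means conflict-free in this simplified model.
std::size_t max_words_per_bank(Layout layout, std::size_t lanes,
                               std::size_t elems_per_lane,
                               std::size_t elem_bytes, std::size_t bank_bytes,
                               std::size_t num_banks) {
  assert(lanes > 0 && elem_bytes > 0 && bank_bytes > 0 && num_banks > 0);
  std::size_t worst = 0;
  for (std::size_t k = 0; k < elems_per_lane; ++k) {
    // Collect the bank-wide words touched in this step.
    std::vector<std::size_t> words;
    for (std::size_t lane = 0; lane < lanes; ++lane) {
      std::size_t byte =
          element_index(layout, lane, k, lanes, elems_per_lane) * elem_bytes;
      words.push_back(byte / bank_bytes);
    }
    // Lanes sharing one word do not conflict with each other.
    std::sort(words.begin(), words.end());
    words.erase(std::unique(words.begin(), words.end()), words.end());
    std::vector<std::size_t> hits(num_banks, 0);
    for (std::size_t w : words) {
      worst = std::max(worst, ++hits[w % num_banks]);
    }
  }
  return worst;
}
```

Under these assumptions, 16 lanes each moving four 32-bit elements over 16 banks of 32 bits give `max_words_per_bank(Layout::Blocked, 16, 4, 4, 4, 16) == 4`, while the striped placement gives 1. The model says nothing about message count or latency, so it cannot settle which variant is faster in practice.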

Even if we have BF16 data to move to/from SLM, we could reinterpret-cast it to a dtype whose size equals the width of each lane's bank.
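The reinterpretation idea can be sketched on the host in plain C++. This is not the code in this PR: `bf16_bits` is a hypothetical stand-in for BF16 storage (raw 16-bit patterns), `as_words` is a hypothetical helper, and the word type (32 or 64 bits) is whatever the bank width turns out to be. `std::memcpy` is used so the reinterpretation stays well-defined in standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for BF16 storage: the raw 16-bit pattern.
using bf16_bits = std::uint16_t;

// View a run of BF16 values as wider words (e.g. 32 or 64 bits) so that each
// work-item could move one bank-wide word instead of one 16-bit element.
template <typename Word>
std::vector<Word> as_words(const std::vector<bf16_bits>& src) {
  static_assert(sizeof(Word) % sizeof(bf16_bits) == 0,
                "Word must hold a whole number of BF16 values");
  constexpr std::size_t per_word = sizeof(Word) / sizeof(bf16_bits);
  assert(src.size() % per_word == 0);  // caller pads to a whole word count
  std::vector<Word> dst(src.size() / per_word);
  if (!dst.empty()) {
    // Byte-for-byte copy: the bit patterns are unchanged, only the type is.
    std::memcpy(dst.data(), src.data(), dst.size() * sizeof(Word));
  }
  return dst;
}
```

For instance, four BF16 values become two words with `as_words<std::uint32_t>` or one word with `as_words<std::uint64_t>`, and the underlying bytes are identical in both cases.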

While 1D copies to/from global memory also support striped loads/stores, bank conflicts aren't an issue there, so I didn't modify the corresponding copy atoms.

Type

Performance

Testing

  • Tests pass - [ ] Xe12 - [ ] Xe20

Performance

Metric Before After

cc @pengzhao-intel

BMG & PVC support blocked & striped SLM <-> register transfers.
`store.slm.d32x4.a32` and `store.slm.d64x2.a32` store 128 bits per work-item but cause bank conflicts. This change switches to striped loads & stores (each work-item transfers one item) to check the performance impact (and any potential breakage).

Even if we have BF16 data to move to/from SLM, we can reinterpret cast it to a dtype whose size is equal to the width of each lane's bank.

Performance characteristics of either don't seem to be available in the public domain.
@sanchitintel sanchitintel changed the title [Experiment] Evaluate perf impact of striped SLM reads/writes vs. batched reads/writes (current 1D SLM copy atoms) [Experiment] Evaluate perf impact of striped vs. batched SLM read/write 1D copy atoms Nov 15, 2025
@sanchitintel sanchitintel changed the title [Experiment] Evaluate perf impact of striped vs. batched SLM read/write 1D copy atoms [Experiment] Evaluate perf impact of striped vs. blocked SLM read/write 1D copy atoms Nov 15, 2025