[Experiment] Evaluate perf impact of striped vs. blocked SLM read/write 1D copy atoms #631

sanchitintel · 2025-11-15T05:59:38Z

Description

The SYCL group load/store APIs used for 1D SLM <-> register copies support both blocked & striped data placement. With blocked placement, messages such as `store.slm.d32x4.a32` and `store.slm.d64x2.a32` store 128 bits per work-item but cause bank conflicts. Blocked is the default data placement for both APIs.
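For reference, the two placements differ only in which element each work-item touches. A minimal sketch in plain C++ (not SYCL), assuming the usual definitions of blocked and striped placement; the helper names `blocked_index` and `striped_index` are hypothetical and not part of any API:

```cpp
#include <cassert>
#include <cstddef>

// Blocked: each work-item owns a contiguous run of `items_per_lane` elements.
inline std::size_t blocked_index(std::size_t lane, std::size_t j,
                                 std::size_t items_per_lane) {
  assert(j < items_per_lane);
  return lane * items_per_lane + j;
}

// Striped: consecutive work-items touch consecutive elements, and a
// work-item's j-th item lies `num_lanes` elements after its (j-1)-th.
inline std::size_t striped_index(std::size_t lane, std::size_t j,
                                 std::size_t num_lanes) {
  assert(lane < num_lanes);
  return lane + j * num_lanes;
}
```

With 16 work-items and 4 items each, work-item 1 touches elements 4..7 in the blocked case and elements 1, 17, 33, 49 in the striped case.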

This PR switches to striped loads & stores (each work-item transfers one item) as an experiment to check the performance impact (and any potential breakages). Striped stores seem to use messages such as `store.slm.d64.a32` or `store.slm.d32x2.a32`. If the bank width of BMG/PVC is 64 bits (the documentation states 32 bits, but that part may not have been updated for Xe12), these messages write to SLM without bank conflicts.
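The bank-conflict argument can be illustrated with a simplified textbook model, again in plain C++. Everything here is an assumption for illustration: each item is exactly one bank wide, bank `b` holds the words whose index is congruent to `b` modulo `num_banks`, all work-items access their j-th item in the same step, and the bank count is a free parameter rather than a documented BMG/PVC value. The function name `max_bank_conflict` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Worst-case number of work-items that hit the same bank in one access step.
// A result of 1 means conflict-free under this (assumed) model.
inline std::size_t max_bank_conflict(std::size_t num_lanes,
                                     std::size_t items_per_lane,
                                     std::size_t num_banks, bool striped) {
  assert(num_banks > 0);
  std::size_t worst = 0;
  for (std::size_t j = 0; j < items_per_lane; ++j) {
    std::vector<std::size_t> hits(num_banks, 0);  // per-bank hit count, step j
    for (std::size_t lane = 0; lane < num_lanes; ++lane) {
      // Word index under striped or blocked placement.
      const std::size_t idx =
          striped ? lane + j * num_lanes : lane * items_per_lane + j;
      worst = std::max(worst, ++hits[idx % num_banks]);
    }
  }
  return worst;
}
```

Under these assumptions, with 16 work-items and 16 banks, blocked placement with 2 items per work-item gives a 2-way conflict and with 4 items a 4-way conflict, while striped placement is conflict-free. Whether the hardware actually behaves this way is exactly what the experiment is meant to find out.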

Performance characteristics of either type of instruction don't seem to be available in the public domain. It's even possible that the first (blocked) type performs better because it needs fewer messages (each transfers twice as much data as an instruction of the second type), even though it entails bank conflicts.

Even if we have BF16 data to move to/from SLM, we could reinterpret-cast it to a dtype whose size equals the width of each lane's bank.
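A minimal sketch of the reinterpretation idea in plain C++, assuming a 32-bit bank width and representing BF16 values as raw `uint16_t` bit patterns (the actual copy atoms would use the library's own BF16 type; `view_as_words` and `view_as_bf16` are hypothetical helper names). `std::memcpy` is used rather than a pointer cast to avoid aliasing problems on the host:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Reinterpret pairs of 16-bit BF16 bit patterns as 32-bit words, so each
// transferred item is one (assumed) 32-bit bank wide. Bits are unchanged.
inline std::vector<std::uint32_t> view_as_words(
    const std::vector<std::uint16_t>& bf16_bits) {
  assert(bf16_bits.size() % 2 == 0);  // need whole 32-bit words
  std::vector<std::uint32_t> words(bf16_bits.size() / 2);
  if (!words.empty()) {
    std::memcpy(words.data(), bf16_bits.data(),
                words.size() * sizeof(std::uint32_t));
  }
  return words;
}

// Inverse view: recover the original BF16 bit patterns from the words.
inline std::vector<std::uint16_t> view_as_bf16(
    const std::vector<std::uint32_t>& words) {
  std::vector<std::uint16_t> bf16_bits(words.size() * 2);
  if (!words.empty()) {
    std::memcpy(bf16_bits.data(), words.data(),
                words.size() * sizeof(std::uint32_t));
  }
  return bf16_bits;
}
```

The round trip preserves every bit, so the copy can operate on bank-width items without caring about the element type.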

While 1D copies to/from global memory also support striped loads/stores, bank conflicts aren't an issue there, so I didn't modify the corresponding copy atoms.

Type

Performance

Testing

Tests pass
- [ ] Xe12
- [ ] Xe20

Performance

Metric	Before	After

cc @pengzhao-intel

BMG & PVC support blocked & striped SLM <-> register transfers. `store.slm.d32x4.a32` and `store.slm.d64x2.a32` store 128 bits per work-item but cause bank conflicts. Switching to striped loads & stores (each work-item transfers one item) to check performance impact (and any potential breakages). Even if we have BF16 data to move to/from SLM, we can reinterpret-cast it to a dtype whose size is equal to the width of each lane's bank. Performance characteristics of either don't seem to be available in the public domain.

sanchitintel changed the title ~~[Experiment] Evaluate perf impact of striped SLM reads/writes vs. batched reads/writes (current 1D SLM copy atoms)~~ [Experiment] Evaluate perf impact of striped vs. batched SLM read/write 1D copy atoms Nov 15, 2025

sanchitintel requested review from jiyang1011 and taozha2 November 15, 2025 06:10

sanchitintel changed the title ~~[Experiment] Evaluate perf impact of striped vs. batched SLM read/write 1D copy atoms~~ [Experiment] Evaluate perf impact of striped vs. blocked SLM read/write 1D copy atoms Nov 15, 2025
