[CK_TILE] Stream-K XCD remapping #3652

arai713 · 2026-01-26T18:32:17Z

Proposed changes

This PR adds support for XCD remapping as detailed in this document. On MI300 series GPUs, workgroups are typically scheduled round-robin across XCDs, which can lead to poor locality because workgroups working on adjacent tiles land on different XCDs. The remapping assigns the workgroups on each XCD to contiguous tiles, improving locality and the cache hit rate. It is implemented as a function from this PR that computes the contiguous mapping, which we have added to the StreamKTilePartitioner. The Stream-K algorithm itself needs only minimal changes: a single remap at the point where the workgroups are partitioned. The improved cache hit rate should help close the performance gaps seen with the default scheduling. Unit tests have been added to verify the function in isolation. This optimization is not specific to Stream-K GEMM and can be applied to other GEMM kernels.

Note: This only applies to MI300 series GPUs, since they are the ones that introduce XCDs.
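To make the remapping concrete, here is a minimal sketch of one way to compute such a contiguous mapping. It is not the function added to StreamKTilePartitioner: the name `remap_xcd`, its parameters, and the helper `build_remap_table` are hypothetical, and the sketch assumes that the workgroup with block id `b` is placed on XCD `b % num_xcds`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the CK Tile API). Assumes round-robin placement:
// block_id b runs on XCD (b % num_xcds). Returns a logical id such that the
// blocks sharing an XCD receive consecutive logical ids.
// Preconditions: num_xcds > 0 and block_id < grid_size.
constexpr std::uint32_t remap_xcd(std::uint32_t block_id,
                                  std::uint32_t grid_size,
                                  std::uint32_t num_xcds)
{
    const std::uint32_t xcd       = block_id % num_xcds; // XCD this block lands on
    const std::uint32_t local_idx = block_id / num_xcds; // position within that XCD

    // When grid_size is not a multiple of num_xcds, the first `tall` XCDs
    // hold one more block than the remaining ones.
    const std::uint32_t per_xcd   = (grid_size + num_xcds - 1) / num_xcds;
    const std::uint32_t remainder = grid_size % num_xcds;
    const std::uint32_t tall      = (remainder == 0) ? num_xcds : remainder;

    if(xcd < tall)
    {
        return xcd * per_xcd + local_idx;
    }
    // Skip the ranges of the "tall" XCDs, then the shorter ranges before this XCD.
    return tall * per_xcd + (xcd - tall) * (per_xcd - 1) + local_idx;
}

// Builds table[block_id] = remapped id and asserts the result is a permutation
// of 0..grid_size-1, i.e. no tile is dropped or assigned twice.
inline std::vector<std::uint32_t> build_remap_table(std::uint32_t grid_size,
                                                    std::uint32_t num_xcds)
{
    std::vector<std::uint32_t> table(grid_size);
    std::vector<bool> seen(grid_size, false);
    for(std::uint32_t b = 0; b < grid_size; ++b)
    {
        const std::uint32_t r = remap_xcd(b, grid_size, num_xcds);
        assert(r < grid_size && !seen[r]);
        seen[r]  = true;
        table[b] = r;
    }
    return table;
}
```

With 10 workgroups and 4 XCDs, for example, block ids 0, 4 and 8 share XCD 0 under the round-robin assumption and are remapped to logical ids 0, 1 and 2, so that XCD works on three neighboring tiles.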

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which helps the maintainers understand the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

This change adds a function that remaps block ids from their original round-robin assignment to a contiguous layout across XCDs. The function is added to the StreamKTilePartitioner and called in its operator() functions. Unit tests verify the correctness of the function on minimal arrays. These changes should improve locality and the cache hit rate, and therefore overall performance.
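As an illustration of what a check on a minimal array could look like, the sketch below builds the expected layout by direct enumeration and tests that every XCD sees consecutive ids. It is not the unit test from this PR: `reference_xcd_remap` and `is_contiguous_per_xcd` are hypothetical names, and the code again assumes block id `b` runs on XCD `b % num_xcds`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference model: walk the XCDs in order and hand out logical
// ids sequentially to the blocks each XCD receives under round-robin
// placement. Returns table[block_id] = remapped id.
inline std::vector<std::uint32_t> reference_xcd_remap(std::uint32_t grid_size,
                                                      std::uint32_t num_xcds)
{
    std::vector<std::uint32_t> table(grid_size);
    std::uint32_t next = 0;
    for(std::uint32_t xcd = 0; xcd < num_xcds; ++xcd)
    {
        for(std::uint32_t b = xcd; b < grid_size; b += num_xcds)
        {
            table[b] = next++;
        }
    }
    return table;
}

// Returns true if, on every XCD, successive blocks hold consecutive ids.
inline bool is_contiguous_per_xcd(const std::vector<std::uint32_t>& table,
                                  std::uint32_t num_xcds)
{
    for(std::size_t b = num_xcds; b < table.size(); ++b)
    {
        // table[b - num_xcds] is the previous block on the same XCD.
        if(table[b] != table[b - num_xcds] + 1)
        {
            return false;
        }
    }
    return true;
}
```

For 8 blocks on 4 XCDs the reference table is {0, 2, 4, 6, 1, 3, 5, 7}, which passes the contiguity check, while the identity mapping {0, 1, ..., 7} (the default round-robin order) fails it.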

arai713 self-assigned this Jan 26, 2026
