CUDA backend: fix output stride mismatch in delegate copy-back (#17945)
mergennachin merged 1 commit into main
Conversation
Pull request overview
This PR fixes a silent data corruption bug in the CUDA backend delegate copy-back path. When AOTI produces contiguous output but the .pte serializes a different dim_order (e.g., from SDPA's efficient attention returning a transposed view), the runtime would byte-copy GPU data into an ETensor that interprets it with wrong strides. The fix detects stride mismatches and, when strides differ, copies GPU data to a temporary CPU buffer and then rearranges it element-by-element to match the ETensor's expected layout.
Changes:
- Unifies `copy_slimtensor_to_etensor_async` and `copy_slimtensor_to_etensor` into a new `_copy_slimtensor_to_etensor_impl` helper with a fast path (strides match → raw byte copy) and a slow path (strides differ → element-wise rearrange via `_strided_copy`).
- Re-enables the accuracy assertion in `test_non_pow2_head_dim_with_bool_mask` that was previously gated behind a `TODO` comment.
- Adds a new test file, `test_output_stride_rearrange.py`, exercising both fast and slow copy paths.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `backends/cuda/runtime/utils.h` | Core fix: adds `_strides_match`, `_strided_copy`, and `_copy_slimtensor_to_etensor_impl`; refactors the public copy functions to call it |
| `backends/cuda/tests/test_triton_sdpa_nan.py` | Enables the previously disabled accuracy check for the non-pow2 head dim + bool mask test |
| `backends/cuda/tests/test_output_stride_rearrange.py` | New test file exercising the fast and slow copy-back paths |
The .pte serializes the output dim_order from PyTorch's SDPA composite (which may return a non-contiguous transposed view, e.g., efficient attention outputs [B,Lq,H,D] transposed to [B,H,Lq,D]). However, the AOTI delegate always produces contiguous output in its own layout, ignoring the .pte's expected dim_order. The runtime byte-copies the GPU data to the CPU ETensor, but the ETensor interprets it with the .pte's strides, causing silent data corruption when the layouts differ.

Fix: in `copy_slimtensor_to_etensor`, detect when the SlimTensor (GPU) and ETensor (CPU) have different strides. When they match (the common case), use the fast byte-copy path. When they differ, copy the GPU data to a temporary CPU buffer, then rearrange it element-by-element to match the ETensor's expected layout.

Also enables the accuracy check in `test_non_pow2_head_dim_with_bool_mask` and adds `test_output_stride_rearrange.py`, exercising both fast and slow copy paths with Triton ON and OFF.
Gasoonjia
left a comment
Thanks for fixing the issue!