
CUDA backend: fix output stride mismatch in delegate copy-back #17945

Merged
mergennachin merged 1 commit into main from update_pin_enable_test on Mar 6, 2026

Conversation

mergennachin (Contributor) commented on Mar 6, 2026

The .pte serializes the output dim_order from PyTorch's SDPA composite
(which may return a non-contiguous transposed view, e.g., efficient
attention outputs [B,Lq,H,D] transposed to [B,H,Lq,D]). However, the
AOTI delegate always produces contiguous output in its own layout,
ignoring the .pte's expected dim_order. The runtime byte-copies the
GPU data to the CPU ETensor, but the ETensor interprets it with the
.pte's strides — causing silent data corruption when the layouts differ.
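The corruption above can be shown with a toy model. This is pure Python, not the runtime code; the [2, 3] shape and the stride values (3, 1) vs. (1, 2) are illustrative stand-ins for the contiguous AOTI layout and a transposed-view dim_order, not values taken from the PR:

```python
# Toy demonstration of the bug: a contiguous result is byte-copied, but the
# receiving tensor interprets the same flat bytes with different strides.

def read(buf, sizes, strides):
    """Materialize a nested list by indexing buf with the given strides
    (strides counted in elements, not bytes)."""
    return [[buf[i * strides[0] + j * strides[1]] for j in range(sizes[1])]
            for i in range(sizes[0])]

gpu_out = [0, 1, 2, 3, 4, 5]               # contiguous [2, 3], strides (3, 1)
correct = read(gpu_out, (2, 3), (3, 1))     # [[0, 1, 2], [3, 4, 5]]

# A raw byte copy preserves the flat data, but the receiver then applies its
# own serialized strides (1, 2) to the same bytes:
corrupted = read(gpu_out, (2, 3), (1, 2))   # [[0, 2, 4], [1, 3, 5]] -- wrong
```

Every byte arrives intact, yet every logical element except the first and last lands at the wrong index, which is why the corruption is silent.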

Fix: in copy_slimtensor_to_etensor, detect when the SlimTensor (GPU)
and ETensor (CPU) have different strides. When they match (common case),
use the fast byte-copy path. When they differ, copy GPU data to a temp
CPU buffer then rearrange element-by-element to match the ETensor's
expected layout.
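A minimal Python model of that two-path copy is sketched below. The real helper is C++ in backends/cuda/runtime/utils.h and moves device memory; here plain lists stand in for the GPU and CPU buffers, and the function names merely mirror the PR description:

```python
# Hypothetical model of the fix: fast raw copy when strides agree,
# element-wise rearrange through a staging buffer when they differ.

def _strides_match(a, b):
    return list(a) == list(b)

def _strided_copy(src, sizes, src_strides, dst, dst_strides):
    """Slow path: visit every logical index, reading through the source
    strides and writing through the destination strides (in elements)."""
    def walk(dim, s_off, d_off):
        if dim == len(sizes):
            dst[d_off] = src[s_off]
            return
        for i in range(sizes[dim]):
            walk(dim + 1,
                 s_off + i * src_strides[dim],
                 d_off + i * dst_strides[dim])
    walk(0, 0, 0)

def copy_slimtensor_to_etensor(src, sizes, src_strides, dst, dst_strides):
    if _strides_match(src_strides, dst_strides):
        dst[:] = src                      # fast path: raw byte copy
    else:
        staged = list(src)                # temp CPU staging buffer
        _strided_copy(staged, sizes, src_strides, dst, dst_strides)

src = [0, 1, 2, 3, 4, 5]                  # contiguous [2, 3], strides (3, 1)
dst = [None] * 6                          # receiver expects strides (1, 2)
copy_slimtensor_to_etensor(src, (2, 3), (3, 1), dst, (1, 2))
```

After the slow-path copy, reading dst with the receiver's strides (1, 2) reproduces the original [[0, 1, 2], [3, 4, 5]], whereas a raw byte copy would not.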

Also enables the accuracy check in test_non_pow2_head_dim_with_bool_mask
and adds test_output_stride_rearrange.py exercising both fast and slow
copy paths with Triton ON and OFF.

Copilot AI review requested due to automatic review settings March 6, 2026 00:46
pytorch-bot bot commented Mar 6, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17945

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 12 Pending, 2 Unrelated Failures

As of commit bbc53d9 with merge base 4c96679:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label (managed by the Facebook bot; authors must sign the CLA before a PR can be reviewed) on Mar 6, 2026
github-actions bot commented Mar 6, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI (Contributor) left a comment


Pull request overview

This PR fixes a silent data corruption bug in the CUDA backend delegate copy-back path. When AOTI produces contiguous output but the .pte serializes a different dim_order (e.g., from SDPA's efficient attention returning a transposed view), the runtime would byte-copy GPU data into an ETensor that interprets it with wrong strides. The fix detects stride mismatches and, when strides differ, copies GPU data to a temporary CPU buffer and then rearranges it element-by-element to match the ETensor's expected layout.

Changes:

  • Unified copy_slimtensor_to_etensor_async and copy_slimtensor_to_etensor into a new _copy_slimtensor_to_etensor_impl helper with a fast path (strides match → raw byte copy) and a slow path (strides differ → element-wise rearrange via _strided_copy).
  • Re-enables the accuracy assertion in test_non_pow2_head_dim_with_bool_mask that was previously gated behind a TODO comment.
  • Adds a new test file test_output_stride_rearrange.py exercising both fast and slow copy paths.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • backends/cuda/runtime/utils.h: Core fix: adds _strides_match, _strided_copy, and _copy_slimtensor_to_etensor_impl; refactors the public copy functions to call it.
  • backends/cuda/tests/test_triton_sdpa_nan.py: Enables the previously disabled accuracy check for the non-pow2 head dim + bool mask test.
  • backends/cuda/tests/test_output_stride_rearrange.py: New test file exercising the fast and slow copy-back paths.


Copilot AI (Contributor) left a comment
Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



mergennachin force-pushed the update_pin_enable_test branch from dc6c070 to bbc53d9 on March 6, 2026 14:34
Gasoonjia (Contributor) left a comment

Thx for fixing the issue!

mergennachin merged commit eb77ed4 into main on Mar 6, 2026
207 of 215 checks passed
mergennachin deleted the update_pin_enable_test branch on March 6, 2026 17:54