CUDA backend: fix output stride mismatch in delegate copy-back (#17945)
mergennachin merged 1 commit into main
Conversation
Pull request overview
This PR fixes a silent data corruption bug in the CUDA backend delegate copy-back path. When AOTI produces contiguous output but the .pte serializes a different dim_order (e.g., from SDPA's efficient attention returning a transposed view), the runtime would byte-copy GPU data into an ETensor that interprets it with wrong strides. The fix detects stride mismatches and, when strides differ, copies GPU data to a temporary CPU buffer and then rearranges it element-by-element to match the ETensor's expected layout.
Changes:
- Unifies `copy_slimtensor_to_etensor_async` and `copy_slimtensor_to_etensor` into a new `_copy_slimtensor_to_etensor_impl` helper with a fast path (strides match → raw byte copy) and a slow path (strides differ → element-wise rearrange via `_strided_copy`).
- Re-enables the accuracy assertion in `test_non_pow2_head_dim_with_bool_mask` that was previously gated behind a `TODO` comment.
- Adds a new test file, `test_output_stride_rearrange.py`, exercising both fast and slow copy paths.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `backends/cuda/runtime/utils.h` | Core fix: adds `_strides_match`, `_strided_copy`, and `_copy_slimtensor_to_etensor_impl`; refactors the public copy functions to call it |
| `backends/cuda/tests/test_triton_sdpa_nan.py` | Enables the previously disabled accuracy check for the non-pow2 head dim + bool mask test |
| `backends/cuda/tests/test_output_stride_rearrange.py` | New test file exercising the fast and slow copy-back paths |
The .pte serializes the output dim_order from PyTorch's SDPA composite (which may return a non-contiguous transposed view, e.g., efficient attention outputs [B,Lq,H,D] transposed to [B,H,Lq,D]). However, the AOTI delegate always produces contiguous output in its own layout, ignoring the .pte's expected dim_order. The runtime byte-copies the GPU data to the CPU ETensor, but the ETensor interprets it with the .pte's strides, causing silent data corruption when the layouts differ.

Fix: in `copy_slimtensor_to_etensor`, detect when the SlimTensor (GPU) and ETensor (CPU) have different strides. When they match (the common case), use the fast byte-copy path. When they differ, copy the GPU data to a temporary CPU buffer, then rearrange it element-by-element to match the ETensor's expected layout.

Also enables the accuracy check in `test_non_pow2_head_dim_with_bool_mask` and adds `test_output_stride_rearrange.py`, exercising both fast and slow copy paths with Triton ON and OFF.
Gasoonjia
left a comment
Thanks for fixing the issue!