Skip to content

Fix/Improve warp_match_all#9192

Merged
fbusato merged 4 commits into
NVIDIA:mainfrom
fbusato:fix-warp-match-all
Jun 2, 2026
Merged

Fix/Improve warp_match_all#9192
fbusato merged 4 commits into
NVIDIA:mainfrom
fbusato:fix-warp-match-all

Conversation

@fbusato
Copy link
Copy Markdown
Contributor

@fbusato fbusato commented May 29, 2026

closes #6085

Description

cuda::device::warp_match_all requires types that bitwise copyable, UB otherwise.
The PR adds a compile-time check with the recently introduced trait and uses __builtin_clear_padding (clang-cuda 23) to support more types.

@fbusato fbusato self-assigned this May 29, 2026
@fbusato fbusato requested a review from a team as a code owner May 29, 2026 22:04
@fbusato fbusato requested a review from ericniebler May 29, 2026 22:05
@fbusato fbusato added the libcu++ For all items related to libcu++ label May 29, 2026
@fbusato fbusato added this to CCCL May 29, 2026
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 29, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 29, 2026
@fbusato fbusato requested a review from a team as a code owner May 29, 2026 22:10
@fbusato fbusato requested a review from gonidelis May 29, 2026 22:10
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f2cf9ab5-65cc-456f-b238-8025abe7afa6

📥 Commits

Reviewing files that changed from the base of the PR and between 50453ee and 7444c97.

📒 Files selected for processing (2)
  • libcudacxx/include/cuda/__warp/warp_match_all.h
  • libcudacxx/include/cuda/std/__cccl/builtin.h
🚧 Files skipped from review as they are similar to previous changes (2)
  • libcudacxx/include/cuda/std/__cccl/builtin.h
  • libcudacxx/include/cuda/__warp/warp_match_all.h

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced warp_match_all to properly handle struct padding bits during cross-lane data comparison operations when compiler support is available, ensuring more accurate results.
  • Documentation

    • Reorganized and clarified requirements documentation for warp_match_all, with dedicated sections separating constraints from preconditions for better clarity.
  • Tests

    • Expanded test coverage for warp operations to include improved testing of padded struct types.

important:

Walkthrough

Conditional support for clearing struct padding was added: a builtin wrapper macro, warp_match_all now requires trivially-copyable types and either clears padding when available or requires bitwise comparability, tests added for a padded struct, and docs updated to record constraints.

Changes

Padding-aware warp_match_all

Layer / File(s) Summary
Builtin clear_padding macro
libcudacxx/include/cuda/std/__cccl/builtin.h
New _CCCL_BUILTIN_CLEAR_PADDING(...) macro conditionally forwards to __builtin_clear_padding when the builtin exists and compilation is not device code.
warp_match_all padding-aware implementation
libcudacxx/include/cuda/__warp/warp_match_all.h
Copyright year update; includes adjusted to add is_trivially_copyable and conditionally include is_bitwise_comparable; implementation enforces _Tp is trivially copyable, conditionally clears padding via _CCCL_BUILTIN_CLEAR_PADDING (copy-then-clear-then-use address) or requires bitwise comparability, computes word ratio and memcpy's bytes into uint32_t array for lane matching.
Test coverage for padded struct
libcudacxx/test/libcudacxx/cuda/warp/warp_match.pass.cpp
Adds WithPadding struct with padding and conditionally invokes test_types<WithPadding>() when clear-padding builtin is available.
Docs: constraints reorg
docs/libcudacxx/extended_api/warp/warp_match_all.rst
Moves padding and trivially-copyable requirements into a new Constraints section; restricts Preconditions to SM support and non-zero lane_mask.

Assessment against linked issues

Objective Addressed Explanation
Eliminate undefined behavior in warp_match_all with padded types [#6085]
Handle padding via __builtin_clear_padding when available [#6085]
Enforce bitwise comparability when padding-clearing unavailable [#6085]
Add test coverage for padded input types [#6085]

important:

Walkthrough

Conditional support for clearing struct padding was added: a builtin wrapper macro, warp_match_all now requires trivially-copyable types and either clears padding when available or requires bitwise comparability, tests added for a padded struct, and docs updated to record constraints.

Changes

Padding-aware warp_match_all

Layer / File(s) Summary
Builtin clear_padding macro
libcudacxx/include/cuda/std/__cccl/builtin.h
New _CCCL_BUILTIN_CLEAR_PADDING(...) macro conditionally forwards to __builtin_clear_padding when the builtin exists and compilation is not device code.
warp_match_all padding-aware implementation
libcudacxx/include/cuda/__warp/warp_match_all.h
Copyright year update; includes adjusted to add is_trivially_copyable and conditionally include is_bitwise_comparable; implementation enforces _Tp is trivially copyable, conditionally clears padding via _CCCL_BUILTIN_CLEAR_PADDING (copy-then-clear-then-use address) or requires bitwise comparability, computes word ratio and memcpy's bytes into uint32_t array for lane matching.
Test coverage for padded struct
libcudacxx/test/libcudacxx/cuda/warp/warp_match.pass.cpp
Adds WithPadding struct with padding and conditionally invokes test_types<WithPadding>() when clear-padding builtin is available.
Docs: constraints reorg
docs/libcudacxx/extended_api/warp/warp_match_all.rst
Moves padding and trivially-copyable requirements into a new Constraints section; restricts Preconditions to SM support and non-zero lane_mask.

Assessment against linked issues

Objective Addressed Explanation
Eliminate undefined behavior in warp_match_all with padded types [#6085]
Handle padding via __builtin_clear_padding when available [#6085]
Enforce bitwise comparability when padding-clearing unavailable [#6085]
Add test coverage for padded input types [#6085]

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: afa37a68-ff4e-4abf-8cf2-d9104f987019

📥 Commits

Reviewing files that changed from the base of the PR and between 294326d and 395b10e.

📒 Files selected for processing (3)
  • libcudacxx/include/cuda/__warp/warp_match_all.h
  • libcudacxx/include/cuda/std/__cccl/builtin.h
  • libcudacxx/test/libcudacxx/cuda/warp/warp_match.pass.cpp

Comment thread libcudacxx/test/libcudacxx/cuda/warp/warp_match.pass.cpp
@github-actions

This comment has been minimized.

Comment thread libcudacxx/include/cuda/std/__cccl/builtin.h Outdated
Comment thread libcudacxx/include/cuda/__warp/warp_match_all.h Outdated
Comment thread libcudacxx/include/cuda/__warp/warp_match_all.h
@fbusato
Copy link
Copy Markdown
Contributor Author

fbusato commented Jun 1, 2026

discussed offline. Code updated

@fbusato fbusato requested a review from davebayer June 1, 2026 18:45
#endif // _CCCL_HAS_BUILTIN(__is_complete_type)

#if _CCCL_HAS_BUILTIN(__builtin_clear_padding) \
&& (_CCCL_HOST_COMPILATION() || !(_CCCL_COMPILER(GCC) || _CCCL_COMPILER(NVHPC)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the condition is correct, shouldn't this be:

Suggested change
&& (_CCCL_HOST_COMPILATION() || !(_CCCL_COMPILER(GCC) || _CCCL_COMPILER(NVHPC)))
&& (_CCCL_HOST_COMPILATION() && !(_CCCL_COMPILER(GCC) || _CCCL_COMPILER(NVHPC)))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is intentional.

  • If we are on the host and the builtin is supported, then enable this path.
  • If we are on the device and the builtin is supported, then enable this path except when the compiler is gcc or nvc++.

Comment on lines +50 to +58
# if defined(_CCCL_BUILTIN_CLEAR_PADDING)
auto __data_copy = __data;
_CCCL_BUILTIN_CLEAR_PADDING(&__data_copy);
const auto __data_ptr = ::cuda::std::addressof(__data_copy);
# else // ^^^ _CCCL_BUILTIN_CLEAR_PADDING ^^^ / vvv !_CCCL_BUILTIN_CLEAR_PADDING vvv
static_assert(is_bitwise_comparable_v<_Tp>, "data must be bitwise comparable");
const auto __data_ptr = ::cuda::std::addressof(__data);
# endif // _CCCL_BUILTIN_CLEAR_PADDING
::cuda::std::memcpy(__array, __data_ptr, sizeof(_Tp));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you just clear the padding after memcpy?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the padding could be within the argument. If we lose the original data types, it is not possible to understand where the padding is

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

🥳 CI Workflow Results

🟩 Finished in 4h 21m: Pass: 100%/116 | Total: 4d 12h | Max: 4h 17m | Hits: 46%/1287479

See results here.

@fbusato fbusato merged commit 121dd20 into NVIDIA:main Jun 2, 2026
140 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

libcu++ For all items related to libcu++

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[BUG]: cuda::warp_match_all is not reliable when the input has padding

2 participants