
[STF] Add re-launchable popped graphs to stackable_ctx #9178

Open
caugonnet wants to merge 7 commits into NVIDIA:main from caugonnet:stf_launchable_graphs

@caugonnet
Contributor

@caugonnet caugonnet commented May 29, 2026

Splits graph_ctx_node finalization into phases so a popped nested graph can be instantiated once and launched many times before the matching epilogue runs. Adds three public surfaces on stackable_ctx:

  • pop_prologue() / pop_epilogue(), where pop_prologue() returns a launchable_graph_handle that exposes exec(), stream(), graph(), and launch();
  • launchable_graph_scope, an RAII guard that pairs push() with a lazy pop_prologue() and runs pop_epilogue() in its destructor;
  • pop_prologue_shared() returning a copyable/storable launchable_graph whose destructor runs pop_epilogue() when the last copy dies.
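
The lifetime rule of the shared variant can be illustrated with a small host-only toy model. This is a sketch under stated assumptions, not the cudax implementation: `toy_ctx`, `toy_shared_graph`, and `run_shared_demo` are hypothetical names, and a `std::shared_ptr` custom deleter stands in for pop_epilogue().

```cpp
#include <memory>
#include <vector>

// Toy stand-in for the context (hypothetical, not a cudax type):
// it only counts launches and epilogue runs.
struct toy_ctx
{
  int launches  = 0;
  int epilogues = 0;
};

// Copyable handle: the "epilogue" runs exactly once, when the last copy dies.
class toy_shared_graph
{
public:
  explicit toy_shared_graph(toy_ctx& ctx)
      // The custom deleter plays the role of pop_epilogue(); it does not free ctx.
      : state_(&ctx, [](toy_ctx* c) { ++c->epilogues; })
  {}

  void launch() const { ++state_->launches; }

private:
  std::shared_ptr<toy_ctx> state_;
};

struct demo_result
{
  int launches;
  int epilogues_while_alive;
  int epilogues_after;
};

// Stores copies in a container, lets the original go out of scope,
// and records when the epilogue actually fires.
inline demo_result run_shared_demo()
{
  toy_ctx ctx;
  int while_alive = -1;
  {
    std::vector<toy_shared_graph> copies;
    {
      toy_shared_graph g(ctx);
      copies.push_back(g);
      copies.push_back(g);
      g.launch();
    } // the original handle dies here; the stored copies keep the state alive
    copies[0].launch();
    copies[1].launch();
    while_alive = ctx.epilogues; // sampled while two copies are still alive
  } // last copy destroyed: the deleter (toy epilogue) runs now
  return {ctx.launches, while_alive, ctx.epilogues};
}
```

In this model the epilogue count stays at zero for as long as any copy is stored, which is the property the "shared copies/container" tests mentioned below appear to exercise on the real type.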

The non-nested finalize path now flows through prepare_graph -> ensure_instantiated -> launch_once -> finalize_after_launch; the existing nested-graph behavior is preserved verbatim in finalize_nested(). push() / pop() guard against being called while a pop_prologue is still pending its matching pop_epilogue.
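
As a rough illustration of the phase ordering, here is a host-only toy model. The method names mirror the phases listed above, but the bodies are placeholders and the class `toy_graph_node` and helper `run_phases` are hypothetical; the real helpers drive CUDA graph calls. The sketch assumes instantiation is lazy and happens at most once, on the first launch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of a phased graph node (hypothetical, not the cudax class).
class toy_graph_node
{
public:
  void prepare_graph()
  {
    log_.push_back("prepare");
    prepared_ = true;
  }

  // Instantiates at most once; later calls are no-ops.
  void ensure_instantiated()
  {
    assert(prepared_);
    if (!instantiated_)
    {
      log_.push_back("instantiate");
      instantiated_ = true;
    }
  }

  // Each launch reuses the cached executable graph.
  void launch_once()
  {
    ensure_instantiated();
    log_.push_back("launch");
  }

  void finalize_after_launch()
  {
    log_.push_back("finalize");
    prepared_     = false;
    instantiated_ = false;
  }

  const std::vector<std::string>& log() const { return log_; }

private:
  bool prepared_     = false;
  bool instantiated_ = false;
  std::vector<std::string> log_;
};

// Runs prepare, n launches, then finalize, and returns the recorded phase order.
inline std::vector<std::string> run_phases(int n_launches)
{
  toy_graph_node node;
  node.prepare_graph();
  for (int i = 0; i < n_launches; ++i)
  {
    node.launch_once();
  }
  node.finalize_after_launch();
  return node.log();
}
```

With three launches the log contains a single "instantiate" entry followed by three "launch" entries; with zero launches the model goes straight from "prepare" to "finalize" without instantiating anything.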

Coverage lives in the stackable_ctx.cuh inline UNITTESTs: repeated launch, manual cudaGraphLaunch via exec()/stream(), zero-launch, handle invalidation, RAII scope, shared basic/copies/container/manual epilogue, and a CTK-12.4 pop_prologue + repeat_graph_scope test.

Description

closes

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@caugonnet caugonnet self-assigned this May 29, 2026
@caugonnet caugonnet added the stf Sequential Task Flow programming model label May 29, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 29, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL May 29, 2026
@caugonnet
Contributor Author

/ok to test 43486c3

@caugonnet caugonnet marked this pull request as ready for review May 31, 2026 08:16
@caugonnet caugonnet requested a review from a team as a code owner May 31, 2026 08:16
@caugonnet caugonnet requested a review from andralex May 31, 2026 08:16
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL May 31, 2026
@caugonnet
Contributor Author

/ok to test edbe4ad

@coderabbitai
Contributor

coderabbitai Bot commented May 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 90ddcc14-9a75-4004-a4f9-3e670d0e9d04

📥 Commits

Reviewing files that changed from the base of the PR and between 7178034 and d9db2ae.

📒 Files selected for processing (2)
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Add re-launchable "pop" workflow with returnable launch handles plus RAII and shared-handle variants for repeated graph launches and lazy instantiation.
  • Bug Fixes / Reliability

    • Stronger safeguards around push/pop to prevent invalid usage and ensure handles are invalidated after epilogue.
  • Tests

    • Expanded unit and integration tests for repeated relaunches, manual exec/launch/stream driving, zero-launch paths, RAII correctness, and shared-handle semantics.

Walkthrough

This PR adds a two‑phase re‑launchable pop workflow: pop_prologue() returns launchable handles for deferred/lazy execution, and pop_epilogue() finalizes and invalidates outstanding handles. Graph finalization is split into prepare/instantiate/sync/launch/finalize phases, with push/pop sequencing guards and shared/copyable RAII handle wrappers.

Changes

Re-launchable graph pop API and implementation

| Layer | File(s) | Summary |
|---|---|---|
| Graph finalization refactor | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh | Splits graph_ctx_node::finalize() into phased helpers (prepare_graph, ensure_instantiated, ensure_prereqs_synced, launch_once, finalize_after_launch) and extracts finalize_nested(); adds per-node flags and cached exec_graph_ state used by deferred launches. |
| Two-phase pop machinery | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh | Implements pop_prologue_impl()/pop_epilogue_impl(), pop_prologue_result, and helpers (launch_prepared_graph, prepare_handle_for_exec, prepare_handle_for_graph) enabling lazy instantiation and strict sequencing for launchable handles. |
| Pending-epilogue state | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh | Adds pending_epilogue_token_ and pending_epilogue_node_offset_ to atomically track and invalidate outstanding launch handles when pop_epilogue() runs. |
| Push/pop sequencing guards | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh | Adds guards that abort when push() is called while an epilogue is pending and when pop() is invoked during the prologue/epilogue window. |
| Public API and handle declarations | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh, .../stackable_ctx_impl.cuh | Adds forward decl for launchable_graph_handle, public stackable_ctx::pop_prologue() (returns launchable_graph_handle) and stackable_ctx::pop_epilogue() declarations, and friend/forwarders so handles drive prepared-graph execution and lazy prereq syncing. |
| Tests (added/extended) | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh | Extends tests to cover repeated relaunch after pop_prologue(), manual exec/stream driving, zero-launch prologue+epilogue, handle invalidation after epilogue (including copy/shared invalidation), RAII launchable_graph_scope, and shared-handle semantics for pop_prologue_shared(). |
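
One way to picture the pending-epilogue token and the push/pop guards summarized above is a generation counter that every handle captures. The following is a host-only sketch under assumptions: `toy_stack_ctx`, `toy_handle`, and `is_current` are hypothetical names, the token semantics are guessed from the summary rather than taken from the diff, and the guard returns false where the real code is described as aborting.

```cpp
#include <cassert>
#include <cstdint>

class toy_stack_ctx;

// Handle that remembers which prologue produced it (hypothetical type).
class toy_handle
{
public:
  bool valid() const;
  // Returns true when the launch would be accepted, false once invalidated.
  bool launch() const { return valid(); }

private:
  friend class toy_stack_ctx;
  toy_handle(const toy_stack_ctx* ctx, std::uint64_t token)
      : ctx_(ctx)
      , token_(token)
  {}

  const toy_stack_ctx* ctx_;
  std::uint64_t token_;
};

class toy_stack_ctx
{
public:
  // Guard: refuse to push while a prologue awaits its epilogue.
  bool push()
  {
    if (pending_)
    {
      return false;
    }
    ++depth_;
    return true;
  }

  toy_handle pop_prologue()
  {
    assert(depth_ > 0 && !pending_);
    pending_ = true;
    return toy_handle(this, token_);
  }

  // Bumping the token invalidates every handle (and copy) from this prologue.
  void pop_epilogue()
  {
    assert(pending_);
    pending_ = false;
    ++token_;
    --depth_;
  }

  bool is_current(std::uint64_t token) const { return pending_ && token == token_; }

private:
  int depth_           = 0;
  bool pending_        = false;
  std::uint64_t token_ = 0;
};

inline bool toy_handle::valid() const
{
  return ctx_->is_current(token_);
}
```

Because copies of a handle carry the same token value, a single increment in the toy epilogue invalidates all of them at once, without the context having to track each copy.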

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh (1)

801-818: 💤 Low value

suggestion: launched_ is set on line 689 but never read. Remove or use it.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5cb2f811-0bb5-4423-b7f6-236b7bd3fc9d

📥 Commits

Reviewing files that changed from the base of the PR and between fb8629d and edbe4ad.

📒 Files selected for processing (2)
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh

caugonnet and others added 2 commits May 31, 2026 21:49
Dep-A ordering is already tracked by synced_; launched_ was set in
launch_once() but never read.
@caugonnet
Contributor Author

/ok to test 4b21ca3

@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 43m 28s: Pass: 100%/55 | Total: 17h 04m | Max: 43m 25s | Hits: 24%/108848

See results here.

caugonnet and others added 2 commits June 1, 2026 16:03
Address review follow-ups on the re-launchable popped graphs:

* Fix docs that claimed pop_prologue() eagerly instantiates the
  cudaGraphExec_t. Instantiation is lazy (first exec()/launch()); graph()
  consumers never instantiate. Drop the stale prepare_launch() references.
* Route launchable_graph_handle through thin private stackable_ctx
  wrappers (launch_prepared_graph / prepare_handle_for_exec /
  prepare_handle_for_graph) instead of reaching into pimpl directly,
  mirroring the pop_epilogue() surface.
* Replace the ad-hoc validate_/check_ helpers and the impl-side
  fprintf+abort misuse guards with _CCCL_VERIFY, which stays enabled in
  release builds (unlike _CCCL_ASSERT). Genuine internal invariants remain
  _CCCL_ASSERT.
* Add a unit test that embeds handle.graph() as a child graph node via
  cudaGraphAddChildGraphNode, orders dep-A through an event on
  handle.stream(), and documents the pop_epilogue() ordering caveat.