[STF] Add re-launchable popped graphs to stackable_ctx #9178
Splits graph_ctx_node finalization into phases so a popped nested graph
can be instantiated once and launched many times before the matching
epilogue runs. Adds three public surfaces on stackable_ctx:
* pop_prologue() / pop_epilogue() returning a launchable_graph_handle
that exposes exec(), stream(), graph(), and launch();
* launchable_graph_scope, an RAII guard that pairs push() with a
lazy pop_prologue() and runs pop_epilogue() in its destructor;
* pop_prologue_shared() returning a copyable/storable launchable_graph
whose destructor runs pop_epilogue() when the last copy dies.
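
To make the ownership rules of the three surfaces concrete, here is a minimal host-only sketch of the pattern. It is not the stackable_ctx implementation: `toy_ctx`, `toy_scope`, and `toy_pop_prologue_shared` are hypothetical stand-ins with no CUDA involved, and the choice to trigger the lazy prologue on the first launch (or in the destructor when nothing was launched) is an assumption.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Toy stand-in for a context with a two-phase pop (hypothetical, not stackable_ctx).
class toy_ctx {
public:
  void push() {
    // Mirrors the described guard: no push() while an epilogue is still owed.
    if (pending_) throw std::logic_error("push() while pop_prologue is pending");
    ++depth_;
  }
  // Phase 1: close the nested scope so it can be launched repeatedly.
  void pop_prologue() {
    if (depth_ == 0 || pending_) throw std::logic_error("invalid pop_prologue()");
    pending_ = true;
  }
  // Launch the closed scope; may be called any number of times, including zero.
  void launch() {
    if (!pending_) throw std::logic_error("launch() without pop_prologue()");
    ++launches_;
  }
  // Phase 2: release the scope and return to the parent level.
  void pop_epilogue() {
    if (!pending_) throw std::logic_error("pop_epilogue() without pop_prologue()");
    pending_ = false;
    --depth_;
  }
  bool pending() const { return pending_; }
  int depth() const { return depth_; }
  int launches() const { return launches_; }

private:
  bool pending_ = false;
  int depth_    = 0;
  int launches_ = 0;
};

// RAII guard: push() on entry, lazy prologue on first launch, epilogue on exit.
class toy_scope {
public:
  explicit toy_scope(toy_ctx& ctx) : ctx_(ctx) { ctx_.push(); }
  toy_scope(const toy_scope&)            = delete;
  toy_scope& operator=(const toy_scope&) = delete;
  ~toy_scope() {
    if (!ctx_.pending()) ctx_.pop_prologue(); // zero-launch case (assumption)
    ctx_.pop_epilogue();
  }
  void launch() {
    if (!ctx_.pending()) ctx_.pop_prologue();
    ctx_.launch();
  }

private:
  toy_ctx& ctx_;
};

// Shared variant: the epilogue runs when the last copy of the handle dies.
std::shared_ptr<toy_ctx> toy_pop_prologue_shared(toy_ctx& ctx) {
  ctx.pop_prologue();
  // The deleter does not free ctx; it only runs the deferred epilogue.
  return std::shared_ptr<toy_ctx>(&ctx, [](toy_ctx* c) { c->pop_epilogue(); });
}

// Returns the launch count after a scoped relaunch loop, or -1 if the scope leaked.
int demo_scope() {
  toy_ctx ctx;
  {
    toy_scope scope(ctx);
    for (int i = 0; i < 3; ++i) scope.launch();
  }
  return (ctx.depth() == 0 && !ctx.pending()) ? ctx.launches() : -1;
}

// Returns true if the epilogue is deferred until the last copy is destroyed.
bool demo_shared() {
  toy_ctx ctx;
  ctx.push();
  auto h1 = toy_pop_prologue_shared(ctx);
  auto h2 = h1; // a copy keeps the scope alive
  h1.reset();
  const bool still_pending = ctx.pending();
  h2->launch();
  h2.reset(); // last copy: the epilogue runs here
  return still_pending && !ctx.pending() && ctx.launches() == 1;
}
```

The sketch only models bookkeeping: which call opens the scope, which one closes it, and who is responsible for the epilogue in the manual, scoped, and shared cases.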
The non-nested finalize path now flows through prepare_graph ->
ensure_instantiated -> launch_once -> finalize_after_launch; the
existing nested-graph behavior is preserved verbatim in
finalize_nested(). push() / pop() guard against being called while a
pop_prologue is still pending its matching pop_epilogue.
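
As a rough illustration of the phase split, the sketch below records the order in which the phases run. The phase names come from the description above; the `toy_node` type, the placeholder bodies, and the way the relaunch loop reuses the phases are assumptions, not the graph_ctx_node code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy node whose finalization is split into phases (hypothetical bodies).
struct toy_node {
  std::vector<std::string> trace; // records which phases actually did work
  bool prepared     = false;
  bool instantiated = false;

  void prepare_graph() {
    if (prepared) return;
    prepared = true;
    trace.push_back("prepare_graph");
  }
  void ensure_instantiated() {
    if (instantiated) return; // idempotent: instantiate at most once
    instantiated = true;
    trace.push_back("ensure_instantiated");
  }
  void launch_once() { trace.push_back("launch_once"); }
  void finalize_after_launch() { trace.push_back("finalize_after_launch"); }

  // One-shot path: all four phases back to back.
  void finalize() {
    prepare_graph();
    ensure_instantiated();
    launch_once();
    finalize_after_launch();
  }
};

// Relaunch path: same phases, but only the launch phase repeats.
std::vector<std::string> relaunch_trace(int launches) {
  toy_node node;
  node.prepare_graph();
  for (int i = 0; i < launches; ++i) {
    node.ensure_instantiated(); // does work on the first iteration only
    node.launch_once();
  }
  node.finalize_after_launch();
  return node.trace;
}
```

Splitting the work this way is what lets a caller sit between the instantiate and finalize phases and launch as many times as needed.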
Coverage lives in the stackable_ctx.cuh inline UNITTESTs: repeated
launch, manual cudaGraphLaunch via exec()/stream(), zero-launch,
handle invalidation, RAII scope, shared basic/copies/container/manual
epilogue, and a CTK-12.4 pop_prologue + repeat_graph_scope test.
/ok to test 43486c3
/ok to test edbe4ad
No actionable comments were generated in the recent review.
Walkthrough: This PR adds a two-phase re-launchable pop workflow.
Changes: Re-launchable graph pop API and implementation
Nitpick comments (1)
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh (1)
801-818: Low value
suggestion: launched_ is set on line 689 but never read. Remove or use it.
Files selected for processing (2)
* cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh
* cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh
Dep-A ordering is already tracked by synced_; launched_ was set in launch_once() but never read.
/ok to test 4b21ca3
CI Workflow Results: Finished in 43m 28s. Pass: 100%/55 | Total: 17h 04m | Max: 43m 25s | Hits: 24%/108848
Address review follow-ups on the re-launchable popped graphs:
* Fix docs that claimed pop_prologue() eagerly instantiates the cudaGraphExec_t. Instantiation is lazy (first exec()/launch()); graph() consumers never instantiate. Drop the stale prepare_launch() references.
* Route launchable_graph_handle through thin private stackable_ctx wrappers (launch_prepared_graph / prepare_handle_for_exec / prepare_handle_for_graph) instead of reaching into pimpl directly, mirroring the pop_epilogue() surface.
* Replace the ad-hoc validate_/check_ helpers and the impl-side fprintf+abort misuse guards with _CCCL_VERIFY, which stays enabled in release builds (unlike _CCCL_ASSERT). Genuine internal invariants remain _CCCL_ASSERT.
* Add a unit test that embeds handle.graph() as a child graph node via cudaGraphAddChildGraphNode, orders dep-A through an event on handle.stream(), and documents the pop_epilogue() ordering caveat.
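
The lazy-instantiation rule from the first follow-up can be pictured with a small host-only sketch. `toy_handle` and its integer ids are hypothetical placeholders, not launchable_graph_handle or real CUDA graph types; the sketch only shows that graph() never instantiates while the first exec() or launch() does so exactly once.

```cpp
#include <cassert>

// Toy handle with lazy instantiation (hypothetical; not launchable_graph_handle).
class toy_handle {
public:
  // Returns the graph description; never triggers instantiation.
  int graph() const { return graph_id_; }

  // Returns the executable form, instantiating it on first use only.
  int exec() {
    if (!instantiated_) {
      instantiated_ = true;
      ++instantiations_;
    }
    return graph_id_ + 1000; // placeholder id for the "executable" graph
  }

  // Launching goes through exec(), so it also instantiates lazily.
  void launch() {
    (void) exec();
    ++launches_;
  }

  int instantiations() const { return instantiations_; }
  int launches() const { return launches_; }

private:
  int graph_id_       = 1; // placeholder id for the graph description
  bool instantiated_  = false;
  int instantiations_ = 0;
  int launches_       = 0;
};
```

A caller that only embeds graph() elsewhere, as in the child-graph unit test described above, would therefore never pay for an instantiation in this model.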