Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support) by shoumikhin · Pull Request #4314 · pytorch/TensorRT

shoumikhin · 2026-05-30T14:18:05Z

What problem does this solve?

The ExecuTorch TensorRT delegate used to create its own private CUDA stream and run every inference on it. That left an application with no way to make the TensorRT engine run on a specific CUDA stream or context of its choosing.

This matters most for CUDA green contexts — a CUDA feature that hands a piece of work a slice of the GPU's compute units (SMs) instead of the whole GPU, so you can run several models side by side with predictable performance. To keep an engine inside a green context, its work has to run on a stream that belongs to that green context. With a delegate-owned stream, that was impossible.

What this changes

You can now tell the delegate which CUDA stream to run on, with a small RAII helper, CudaStreamGuard. Scope it around your inference call and the engine runs on your stream. If you don't use it, nothing changes — the delegate runs on the per-thread default stream, exactly as before.

This gives the libtorch-free ExecuTorch runtime the same "run on the caller's stream" capability the libtorch TensorRT runtime got in #4232.

Usage example

#include <torch_tensorrt/executorch/TensorRTBackend.h>
namespace et = torch_tensorrt::executorch_backend;

// 1. Create a CUDA stream on the context you want the engine to run on.
//    For SM partitioning, create it on a CUDA green context:
cudaStream_t stream;
cuGreenCtxStreamCreate(&stream, green_ctx, CU_STREAM_NON_BLOCKING, /*priority=*/0);

// 2. Run inference with that stream selected:
{
  et::CudaStreamGuard guard(stream);  // selects `stream` for this thread
  method->execute();                  // the TensorRT engine runs on `stream`
}                                     // selection is restored automatically here

// 3. For outputs that stay on the GPU, sync your stream before reading them:
cudaStreamSynchronize(stream);

The engine's kernels (and any host<->device copies it needs) run on stream, so a green-context stream keeps them inside that context's SM partition.

How it works (in plain terms)

The delegate no longer owns a stream. While a CudaStreamGuard is active on the calling thread, the engine's GPU work runs on the stream you provided.
A CUDA stream remembers which context it was created on, so just running on a green-context stream keeps the work inside that partition — you don't have to make the green context "current" yourself.
No guard, no change. Without a guard, the delegate runs on cudaStreamPerThread and waits for the work to finish before returning, exactly like before, so existing code is unaffected.
When you do provide a stream and all outputs stay on the GPU, the delegate leaves the work queued on your stream (no forced wait) so it composes efficiently with your other GPU work; you synchronize when you read the results. If any input or output had to be copied to/from CPU, it waits so your data is ready on return.
Reusing a handle stays safe with no sync between calls. The no-wait path returns while the engine work is still in flight, and a handle reuses a single execution context. To make that safe, the delegate records a CUDA event after the enqueue and waits on it (host-side) before the next execute() reconfigures that context — and before the handle is destroyed. So you can run the same method repeatedly on your stream, and tear it down, without ever reconfiguring or freeing a context whose work is still running. The default (no-guard) path is unchanged.

Notes

A given method runs single-threaded (one inference at a time), and the provided stream must be on the engine's GPU.
Verified on an H100: with a green-context stream selected, the engine's kernels run only on that context's SM partition.

lanluo-nvidia

LGTM except one comments from codex:

High: cpp/src/torch_tensorrt/executorch/TensorRTBackend.cpp:559 lets guarded, device-only executions return without synchronization, but the handle still
owns a single IExecutionContext. The mutex at :338 is released when execute() returns, so a later call can reuse the same context on another guarded
stream while the previous enqueueV3() from :531 is still running. The header contract at cpp/include/torch_tensorrt/executorch/TensorRTBackend.h:96 only
says “one thread at a time”, which does not prevent sequential async overlap. TensorRT’s docs say registered tensor memory must remain valid until stream
sync, and concurrent use of one execution context across streams is undefined. I’d require either per-stream/per-inflight execution contexts, or track a
completion event and wait before reusing the context; at minimum the public contract needs to say callers must synchronize the guarded stream before any
later execute() or destroy() on that handle.

Could you please rebase the code from the latest main, so that we can see CI are passing.

…tream The delegate created and owned a private CUDA stream in init() and ran every enqueueV3() on it, so an application could not place inference on a specific CUDA stream or context (for example a CUDA green context for SM partitioning). Let the caller select the stream instead, bringing the libtorch-free ExecuTorch runtime the same caller-stream capability the libtorch TensorRT runtime has (pytorch#4232): - Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select, per calling thread, the CUDA stream the delegate runs TensorRT on. With no guard active the delegate runs on cudaStreamPerThread. - execute() runs enqueueV3() and the staging copies on the selected stream; init() no longer creates a stream and the delegate owns none. - To confine inference to a CUDA green context's SM partition the caller scopes a guard with a stream created on that green context (cuGreenCtxStreamCreate); the partition confinement travels with the stream, so the green context need not be made current. cudaStreamPerThread is invalid while a green context is current (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard. - cudaSetDevice() is applied only when the engine's device differs from the current device and is restored on exit, so it no longer clobbers a context the caller established. - execute() leaves device-resident outputs enqueued (no end sync) only while a guard is active; the default path and host-staged outputs still synchronize before returning, preserving existing behavior. The caller synchronizes the selected stream when it reads device-resident results. - Make the no-sync path safe to reuse: the handle records a CUDA completion event after the enqueue, and the next execute() (and the destructor) waits on it before reconfiguring or freeing the shared IExecutionContext. A handle can thus be run repeatedly on a caller stream without the caller synchronizing between calls, and teardown never frees a context with an enqueue still in flight. No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.

shoumikhin · 2026-06-05T06:38:51Z

Thanks for the review — addressed both points.

Concurrency (single IExecutionContext reuse). The handle now records a CUDA event after the enqueue on the no-sync path, and both the next execute() and the destructor host-wait on it (cudaEventSynchronize) before reconfiguring or freeing the execution context. So a handle can be re-run on a caller stream — and torn down — without the caller synchronizing between calls, and we never mutate or destroy a context with an enqueue still in flight. The default path (no guard, or host-staged I/O) is unchanged. I also expanded the CudaStreamGuard contract comment to document this ordering.

I used a completion event rather than per-inflight execution contexts, to keep the single-context model and the compose-with-later-work behavior while making reuse and teardown safe — happy to switch to per-inflight contexts if you'd prefer.

Rebased onto latest main. The remaining red checks are the py3.10 dynamo runtime-cache tests, which ran out of memory on the runner (unrelated to this C++ delegate change, and green on main); re-running them.

meta-cla Bot added the cla signed label May 30, 2026

github-actions Bot added the component: api [C++] Issues re: C++ API label May 30, 2026

github-actions Bot requested a review from narendasan May 30, 2026 14:18

shoumikhin force-pushed the fix/et-trt-caller-cuda-stream branch 4 times, most recently from ec63f3f to 2fe2c7a Compare May 31, 2026 06:21

narendasan requested a review from lanluo-nvidia June 1, 2026 21:20

lanluo-nvidia requested changes Jun 3, 2026

View reviewed changes

lanluo-nvidia added the Force All Tests[L0+L1+L2] For run all the L0, L1, L2 tests label Jun 3, 2026

shoumikhin force-pushed the fix/et-trt-caller-cuda-stream branch from 2fe2c7a to c3a5f44 Compare June 5, 2026 05:15

shoumikhin marked this pull request as ready for review June 5, 2026 06:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support)#4314

Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support)#4314
shoumikhin wants to merge 1 commit into
pytorch:mainfrom
shoumikhin:fix/et-trt-caller-cuda-stream

shoumikhin commented May 30, 2026 •

edited

Loading

Uh oh!

lanluo-nvidia left a comment

Uh oh!

shoumikhin commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shoumikhin commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this solve?

What this changes

Usage example

How it works (in plain terms)

Notes

Uh oh!

lanluo-nvidia left a comment

Choose a reason for hiding this comment

Uh oh!

shoumikhin commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

shoumikhin commented May 30, 2026 •

edited

Loading