Skip to content

Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support)#4314

Open
shoumikhin wants to merge 1 commit into
pytorch:mainfrom
shoumikhin:fix/et-trt-caller-cuda-stream
Open

Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support)#4314
shoumikhin wants to merge 1 commit into
pytorch:mainfrom
shoumikhin:fix/et-trt-caller-cuda-stream

Conversation

@shoumikhin
Copy link
Copy Markdown
Contributor

@shoumikhin shoumikhin commented May 30, 2026

What problem does this solve?

The ExecuTorch TensorRT delegate used to create its own private CUDA stream and run every inference on it. That left an application with no way to make the TensorRT engine run on a specific CUDA stream or context of its choosing.

This matters most for CUDA green contexts — a CUDA feature that hands a piece of work a slice of the GPU's compute units (SMs) instead of the whole GPU, so you can run several models side by side with predictable performance. To keep an engine inside a green context, its work has to run on a stream that belongs to that green context. With a delegate-owned stream, that was impossible.

What this changes

You can now tell the delegate which CUDA stream to run on, with a small RAII helper, CudaStreamGuard. Scope it around your inference call and the engine runs on your stream. If you don't use it, nothing changes — the delegate runs on the per-thread default stream, exactly as before.

This gives the libtorch-free ExecuTorch runtime the same "run on the caller's stream" capability the libtorch TensorRT runtime got in #4232.

Usage example

#include <torch_tensorrt/executorch/TensorRTBackend.h>
namespace et = torch_tensorrt::executorch_backend;

// 1. Create a CUDA stream on the context you want the engine to run on.
//    For SM partitioning, create it on a CUDA green context:
cudaStream_t stream;
cuGreenCtxStreamCreate(&stream, green_ctx, CU_STREAM_NON_BLOCKING, /*priority=*/0);

// 2. Run inference with that stream selected:
{
  et::CudaStreamGuard guard(stream);  // selects `stream` for this thread
  method->execute();                  // the TensorRT engine runs on `stream`
}                                     // selection is restored automatically here

// 3. For outputs that stay on the GPU, sync your stream before reading them:
cudaStreamSynchronize(stream);

The engine's kernels (and any host<->device copies it needs) run on stream, so a green-context stream keeps them inside that context's SM partition.

How it works (in plain terms)

  • The delegate no longer owns a stream. While a CudaStreamGuard is active on the calling thread, the engine's GPU work runs on the stream you provided.
  • A CUDA stream remembers which context it was created on, so just running on a green-context stream keeps the work inside that partition — you don't have to make the green context "current" yourself.
  • No guard, no change. Without a guard, the delegate runs on cudaStreamPerThread and waits for the work to finish before returning, exactly like before, so existing code is unaffected.
  • When you do provide a stream and all outputs stay on the GPU, the delegate leaves the work queued on your stream (no forced wait) so it composes efficiently with your other GPU work; you synchronize when you read the results. If any input or output had to be copied to/from CPU, it waits so your data is ready on return.
  • Reusing a handle stays safe with no sync between calls. The no-wait path returns while the engine work is still in flight, and a handle reuses a single execution context. To make that safe, the delegate records a CUDA event after the enqueue and waits on it (host-side) before the next execute() reconfigures that context — and before the handle is destroyed. So you can run the same method repeatedly on your stream, and tear it down, without ever reconfiguring or freeing a context whose work is still running. The default (no-guard) path is unchanged.

Notes

  • A given method runs single-threaded (one inference at a time), and the provided stream must be on the engine's GPU.
  • Verified on an H100: with a green-context stream selected, the engine's kernels run only on that context's SM partition.

@meta-cla meta-cla Bot added the cla signed label May 30, 2026
@github-actions github-actions Bot added the component: api [C++] Issues re: C++ API label May 30, 2026
@github-actions github-actions Bot requested a review from narendasan May 30, 2026 14:18
@shoumikhin shoumikhin force-pushed the fix/et-trt-caller-cuda-stream branch 4 times, most recently from ec63f3f to 2fe2c7a Compare May 31, 2026 06:21
@narendasan narendasan requested a review from lanluo-nvidia June 1, 2026 21:20
Copy link
Copy Markdown
Collaborator

@lanluo-nvidia lanluo-nvidia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except one comments from codex:

  • High: cpp/src/torch_tensorrt/executorch/TensorRTBackend.cpp:559 lets guarded, device-only executions return without synchronization, but the handle still
    owns a single IExecutionContext. The mutex at :338 is released when execute() returns, so a later call can reuse the same context on another guarded
    stream while the previous enqueueV3() from :531 is still running. The header contract at cpp/include/torch_tensorrt/executorch/TensorRTBackend.h:96 only
    says “one thread at a time”, which does not prevent sequential async overlap. TensorRT’s docs say registered tensor memory must remain valid until stream
    sync, and concurrent use of one execution context across streams is undefined. I’d require either per-stream/per-inflight execution contexts, or track a
    completion event and wait before reusing the context; at minimum the public contract needs to say callers must synchronize the guarded stream before any
    later execute() or destroy() on that handle.

Could you please rebase the code from the latest main, so that we can see CI are passing.

@lanluo-nvidia lanluo-nvidia added the Force All Tests[L0+L1+L2] For run all the L0, L1, L2 tests label Jun 3, 2026
…tream

The delegate created and owned a private CUDA stream in init() and ran every
enqueueV3() on it, so an application could not place inference on a specific
CUDA stream or context (for example a CUDA green context for SM partitioning).

Let the caller select the stream instead, bringing the libtorch-free ExecuTorch
runtime the same caller-stream capability the libtorch TensorRT runtime has
(pytorch#4232):

- Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select,
  per calling thread, the CUDA stream the delegate runs TensorRT on. With no
  guard active the delegate runs on cudaStreamPerThread.
- execute() runs enqueueV3() and the staging copies on the selected stream;
  init() no longer creates a stream and the delegate owns none.
- To confine inference to a CUDA green context's SM partition the caller scopes a
  guard with a stream created on that green context (cuGreenCtxStreamCreate); the
  partition confinement travels with the stream, so the green context need not be
  made current. cudaStreamPerThread is invalid while a green context is current
  (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard.
- cudaSetDevice() is applied only when the engine's device differs from the
  current device and is restored on exit, so it no longer clobbers a context the
  caller established.
- execute() leaves device-resident outputs enqueued (no end sync) only while a
  guard is active; the default path and host-staged outputs still synchronize
  before returning, preserving existing behavior. The caller synchronizes the
  selected stream when it reads device-resident results.
- Make the no-sync path safe to reuse: the handle records a CUDA completion event
  after the enqueue, and the next execute() (and the destructor) waits on it
  before reconfiguring or freeing the shared IExecutionContext. A handle can thus
  be run repeatedly on a caller stream without the caller synchronizing between
  calls, and teardown never frees a context with an enqueue still in flight.

No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.
@shoumikhin shoumikhin force-pushed the fix/et-trt-caller-cuda-stream branch from 2fe2c7a to c3a5f44 Compare June 5, 2026 05:15
@shoumikhin
Copy link
Copy Markdown
Contributor Author

Thanks for the review — addressed both points.

Concurrency (single IExecutionContext reuse). The handle now records a CUDA event after the enqueue on the no-sync path, and both the next execute() and the destructor host-wait on it (cudaEventSynchronize) before reconfiguring or freeing the execution context. So a handle can be re-run on a caller stream — and torn down — without the caller synchronizing between calls, and we never mutate or destroy a context with an enqueue still in flight. The default path (no guard, or host-staged I/O) is unchanged. I also expanded the CudaStreamGuard contract comment to document this ordering.

I used a completion event rather than per-inflight execution contexts, to keep the single-context model and the compose-with-later-work behavior while making reuse and teardown safe — happy to switch to per-inflight contexts if you'd prefer.

Rebased onto latest main. The remaining red checks are the py3.10 dynamo runtime-cache tests, which ran out of memory on the runner (unrelated to this C++ delegate change, and green on main); re-running them.

@shoumikhin shoumikhin marked this pull request as ready for review June 5, 2026 06:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed component: api [C++] Issues re: C++ API Force All Tests[L0+L1+L2] For run all the L0, L1, L2 tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants