Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support) #4314
shoumikhin wants to merge 1 commit
lanluo-nvidia
left a comment
LGTM except for one comment from Codex:
- High: cpp/src/torch_tensorrt/executorch/TensorRTBackend.cpp:559 lets guarded, device-only executions return without synchronization, but the handle still
owns a single IExecutionContext. The mutex at :338 is released when execute() returns, so a later call can reuse the same context on another guarded
stream while the previous enqueueV3() from :531 is still running. The header contract at cpp/include/torch_tensorrt/executorch/TensorRTBackend.h:96 only
says “one thread at a time”, which does not prevent sequential async overlap. TensorRT’s docs say registered tensor memory must remain valid until stream
sync, and concurrent use of one execution context across streams is undefined. I’d require either per-stream/per-inflight execution contexts, or track a
completion event and wait before reusing the context; at minimum the public contract needs to say callers must synchronize the guarded stream before any
later execute() or destroy() on that handle.
Could you please rebase onto the latest main so that we can see CI passing?
…tream

The delegate created and owned a private CUDA stream in init() and ran every enqueueV3() on it, so an application could not place inference on a specific CUDA stream or context (for example a CUDA green context for SM partitioning). Let the caller select the stream instead, bringing the libtorch-free ExecuTorch runtime the same caller-stream capability the libtorch TensorRT runtime has (pytorch#4232):

- Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select, per calling thread, the CUDA stream the delegate runs TensorRT on. With no guard active the delegate runs on cudaStreamPerThread.
- execute() runs enqueueV3() and the staging copies on the selected stream; init() no longer creates a stream and the delegate owns none.
- To confine inference to a CUDA green context's SM partition the caller scopes a guard with a stream created on that green context (cuGreenCtxStreamCreate); the partition confinement travels with the stream, so the green context need not be made current. cudaStreamPerThread is invalid while a green context is current (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard.
- cudaSetDevice() is applied only when the engine's device differs from the current device and is restored on exit, so it no longer clobbers a context the caller established.
- execute() leaves device-resident outputs enqueued (no end sync) only while a guard is active; the default path and host-staged outputs still synchronize before returning, preserving existing behavior. The caller synchronizes the selected stream when it reads device-resident results.
- Make the no-sync path safe to reuse: the handle records a CUDA completion event after the enqueue, and the next execute() (and the destructor) waits on it before reconfiguring or freeing the shared IExecutionContext. A handle can thus be run repeatedly on a caller stream without the caller synchronizing between calls, and teardown never frees a context with an enqueue still in flight.

No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.
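The wait-before-reuse rule in the last bullet can be illustrated with a rough, CPU-only sketch. Everything here is a hypothetical stand-in and not the delegate's code: FakeStream models a CUDA stream whose work runs later, FakeContext models the shared IExecutionContext, and remembering the last stream plays the role of recording a completion event (in the real delegate this would presumably be a CUDA event recorded with cudaEventRecord and waited on before reuse).

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

namespace sketch {

// CPU stand-in for a CUDA stream: work is queued and only runs when drained,
// which models an enqueue that returns before the GPU has finished.
class FakeStream {
 public:
  void enqueue(std::function<void()> work) { pending_.push_back(std::move(work)); }
  void drain() {
    while (!pending_.empty()) {
      std::function<void()> work = std::move(pending_.front());
      pending_.pop_front();
      work();
    }
  }

 private:
  std::deque<std::function<void()>> pending_;
};

// Stand-in for the shared execution context; it only counts unsafe reuse.
struct FakeContext {
  int in_flight = 0;  // enqueued runs that have not completed yet
  int hazards = 0;    // reconfigurations attempted while a run was in flight
  void reconfigure() {
    if (in_flight > 0) ++hazards;
  }
};

// Hypothetical handle: one context, reused across execute() calls.
class Handle {
 public:
  Handle(FakeContext& ctx, bool wait_before_reuse)
      : ctx_(ctx), wait_(wait_before_reuse) {}
  ~Handle() { wait_for_previous(); }  // never tear down with work in flight

  void execute(FakeStream& stream) {
    wait_for_previous();  // the fix: wait on the previous run's completion
    ctx_.reconfigure();
    ++ctx_.in_flight;
    FakeContext* ctx = &ctx_;
    stream.enqueue([ctx] { --ctx->in_flight; });
    last_stream_ = &stream;  // stands in for recording a completion event
  }

 private:
  void wait_for_previous() {
    // Draining the stream is a simplification of waiting on an event that
    // was recorded right after the enqueue.
    if (wait_ && last_stream_ != nullptr) {
      last_stream_->drain();
      last_stream_ = nullptr;
    }
  }

  FakeContext& ctx_;
  bool wait_;
  FakeStream* last_stream_ = nullptr;
};

// Runs the same handle on two streams with no caller-side synchronization
// in between and reports how many unsafe reconfigurations happened.
inline int hazards_after_two_runs(bool wait_before_reuse) {
  FakeContext ctx;
  FakeStream a, b;
  {
    Handle handle(ctx, wait_before_reuse);
    handle.execute(a);
    handle.execute(b);
  }
  a.drain();
  b.drain();
  return ctx.hazards;
}

}  // namespace sketch
```

With the wait enabled, sketch::hazards_after_two_runs(true) returns 0; with it disabled, the second execute() reconfigures the context while the first run is still queued and the function returns 1, which is the overlap the review comment describes.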
Thanks for the review. I addressed both points.
Concurrency (single
I used a completion event rather than per-inflight execution contexts. That keeps the single-context model and the compose-with-later-work behavior while making reuse and teardown safe. Happy to switch to per-inflight contexts if you'd prefer.
Rebased onto latest main. The remaining red checks are the py3.10 dynamo runtime-cache tests, which ran out of memory on the runner (unrelated to this C++ delegate change, and green on main); I am re-running them.
What problem does this solve?
The ExecuTorch TensorRT delegate used to create its own private CUDA stream and run every inference on it. That left an application with no way to make the TensorRT engine run on a specific CUDA stream or context of its choosing.
This matters most for CUDA green contexts — a CUDA feature that hands a piece of work a slice of the GPU's compute units (SMs) instead of the whole GPU, so you can run several models side by side with predictable performance. To keep an engine inside a green context, its work has to run on a stream that belongs to that green context. With a delegate-owned stream, that was impossible.
What this changes
You can now tell the delegate which CUDA stream to run on, with a small RAII helper, CudaStreamGuard. Scope it around your inference call and the engine runs on your stream. If you don't use it, nothing changes for existing callers: the delegate runs on the per-thread default stream (cudaStreamPerThread) and synchronizes before returning, as before.
This gives the libtorch-free ExecuTorch runtime the same "run on the caller's stream" capability the libtorch TensorRT runtime got in #4232.
Usage example
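A minimal, CPU-only sketch of the scoping pattern, written so it compiles without CUDA or ExecuTorch. The names in namespace sketch are hypothetical stand-ins: StreamHandle replaces cudaStream_t, default_stream() replaces cudaStreamPerThread, and this CudaStreamGuard only mimics the behavior described in this PR (per-thread selection, restored on scope exit). The real class's namespace and constructor signature are assumptions here.

```cpp
#include <cassert>

namespace sketch {

// Stand-in for cudaStream_t so the sketch builds without the CUDA toolkit.
using StreamHandle = const void*;

// Stand-in for cudaStreamPerThread, used when no guard is active.
inline StreamHandle default_stream() {
  static const int tag = 0;
  return &tag;
}

// Stream selected for the calling thread; nullptr means "no guard active"
// in this sketch only.
inline StreamHandle& selected_stream() {
  thread_local StreamHandle stream = nullptr;
  return stream;
}

// Hypothetical RAII guard: selects a stream for the calling thread and
// restores the previous selection when it goes out of scope.
class CudaStreamGuard {
 public:
  explicit CudaStreamGuard(StreamHandle stream) : previous_(selected_stream()) {
    selected_stream() = stream;
  }
  ~CudaStreamGuard() { selected_stream() = previous_; }
  CudaStreamGuard(const CudaStreamGuard&) = delete;
  CudaStreamGuard& operator=(const CudaStreamGuard&) = delete;

 private:
  StreamHandle previous_;
};

// Stand-in for the delegate's execute(): reports the stream it would use.
inline StreamHandle stream_for_execute() {
  StreamHandle stream = selected_stream();
  return stream != nullptr ? stream : default_stream();
}

}  // namespace sketch

// Usage: scope the guard around the inference call.
inline bool run_guarded_inference_demo() {
  int green_ctx_stream_tag = 0;  // pretend this came from cuGreenCtxStreamCreate
  sketch::StreamHandle stream = &green_ctx_stream_tag;

  bool ran_on_caller_stream = false;
  {
    sketch::CudaStreamGuard guard(stream);
    // The real call to the method's execute() would go here.
    ran_on_caller_stream = (sketch::stream_for_execute() == stream);
  }
  // Outside the scope the default (per-thread) stream is selected again.
  return ran_on_caller_stream &&
         sketch::stream_for_execute() == sketch::default_stream();
}
```

In a real application, stream would be a cudaStream_t, for example one created on a green context with cuGreenCtxStreamCreate, and the guarded scope would contain the ExecuTorch method call.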
The engine's kernels (and any host<->device copies it needs) run on stream, so a green-context stream keeps them inside that context's SM partition.
How it works (in plain terms)
- When a CudaStreamGuard is active on the calling thread, the engine's GPU work runs on the stream you provided.
- With no guard active, the delegate runs on cudaStreamPerThread and waits for the work to finish before returning, exactly like before, so existing code is unaffected.
- The handle reuses a single IExecutionContext across calls. On the guarded path it records a completion event after each enqueue and waits on that event before the next execute() reconfigures that context, and before the handle is destroyed. So you can run the same method repeatedly on your stream, and tear it down, without ever reconfiguring or freeing a context whose work is still running. The default (no-guard) path is unchanged.
Notes