fix: respect caller's CUDA stream in TRT runtime (Green Context support) #4232
shoumikhin wants to merge 1 commit
Conversation
Force-pushed from bfa0fea to a0434e4
Long-term plan: upstream PyTorch PR. Opened pytorch/pytorch#182149 to add a `get_custom_objs()` accessor so the AOTI path can reach the embedded engines and bind a stream on each:

```cpp
torch::inductor::AOTIModelPackageLoader loader("model.pt2");
for (auto& [name, ivalue] : loader.get_custom_objs()) {
  if (ivalue.isCustomClass()) {
    if (auto engine = ivalue.toCustomClass<torch_tensorrt::TRTEngine>()) {
      engine->set_external_stream(reinterpret_cast<int64_t>(my_green_stream));
    }
  }
}
loader.run(inputs);
```

This PR (#4232) ships the runtime-side change independently of that upstream accessor.
Force-pushed from 08ab339 to 405c706
Friendly ping @narendasan @cehongwang. This has been open for a week. The upstream PyTorch dependency (pytorch/pytorch#182149) is landing today, which unblocks the full end-to-end path. Could one of you take a first pass when you get a chance? Happy to address feedback or split it up if that helps review.
@shoumikhin I will leave the more detailed feedback to @cehongwang, but one thing I wanted to raise: would it make sense for your use case to pull stream management out of the execute-engine op altogether? The lazy stream from the stream pool would then be a bail-out for the case where the user is not managing streams externally. My thought here is that we operationalize stream management in the FX graph with a set of operators that serve as a guard/interface to prevent computation on the default stream. The concern I have is that if you are mixing PyTorch and TensorRT operations and using, say, green contexts, would you be able to express everything correctly? Something to this effect in the green context case: I can upload a prototype if this makes sense. I think we would still use the AOTI hooks you created, just in these stream ops rather than engine by engine.
…82149)

## Summary

Adds a public C++ accessor `get_custom_objs()` on `torch::aot_inductor::ProxyExecutor`, `torch::inductor::AOTIModelContainerRunner`, and `torch::inductor::AOTIModelPackageLoader` (plus a Python binding on the loader) that returns a snapshot of the torchbind custom-class constants embedded in a loaded `.pt2` model.

The `IValue` payloads alias the live entries inside the proxy executor's `custom_objs_` map (the existing private storage already populated during `OSSProxyExecutor` construction by `pickle_load_obj`). Downcasting to a `CustomClassHolder` subclass and mutating its state (e.g. binding a stream, attaching a communicator, toggling profiling) affects subsequent `run()` invocations.

## Motivation

Backends that store live state on torchbind constants embedded in `.pt2` packages cannot reach those objects after load via any public PyTorch API. The constants live in `OSSProxyExecutor::custom_objs_` (private, `std::unordered_map<std::string, c10::IValue>`) and the loader/runner only expose tensor-typed constant accessors (`get_constant_fqns`, `extract_constants_map`, etc.).

Concrete user: **torch-tensorrt** ([pytorch/TensorRT#4232](pytorch/TensorRT#4232)) needs to bind a CUDA Green Context stream onto each `TRTEngine` torchbind embedded in a torch-tensorrt `.pt2` model loaded via `AOTIModelPackageLoader`. The `ExportedProgram` path already works (Python walks `named_modules()` and binds each engine), but the AOTI path has no equivalent because the engines never become attributes of an `nn.Module`; they live only inside the proxy executor's private map.

This is a **purely additive** change: the new methods are optional to call, have a no-op default on the base `ProxyExecutor`, and do not change any existing semantics.

## API

```cpp
// torch/csrc/inductor/aoti_torch/proxy_executor.h
class ProxyExecutor {
  // ...
  virtual std::unordered_map<std::string, c10::IValue> get_custom_objs() const {
    return {};
  }
};

// torch/csrc/inductor/aoti_torch/oss_proxy_executor.h
class OSSProxyExecutor : public ProxyExecutor {
  // ...
  std::unordered_map<std::string, c10::IValue> get_custom_objs() const override {
    return custom_objs_;
  }
};

// torch/csrc/inductor/aoti_runner/model_container_runner.h
class TORCH_API AOTIModelContainerRunner {
  // ...
  std::unordered_map<std::string, c10::IValue> get_custom_objs() const;
};

// torch/csrc/inductor/aoti_package/model_package_loader.h
class TORCH_API AOTIModelPackageLoader {
  // ...
  std::unordered_map<std::string, c10::IValue> get_custom_objs() const;
};
```

Python:

```python
loader = torch._C._aoti.AOTIModelPackageLoader("model.pt2", "model")
custom_objs = loader.get_custom_objs()  # Dict[str, ScriptObject]
```

## Usage example (from torch-tensorrt)

```cpp
torch::inductor::AOTIModelPackageLoader loader("model.pt2");
for (auto& [name, ivalue] : loader.get_custom_objs()) {
  if (ivalue.isCustomClass()) {
    if (auto engine = ivalue.toCustomClass<torch_tensorrt::TRTEngine>()) {
      engine->set_external_stream(reinterpret_cast<int64_t>(my_green_stream));
    }
  }
}
auto outputs = loader.run(inputs);  // each engine now uses its bound stream
```

## API design rationale

Three options were considered:

| Option | Pros | Cons | Picked |
|---|---|---|---|
| Return `const std::unordered_map<std::string, c10::IValue>&` | Zero copy | Leaks STL container layout; ABI-fragile if `custom_objs_` is renamed/reshaped | ❌ |
| Visitor: `for_each_custom_obj(std::function<...>)` | Hides container | Cumbersome, no random access by name | ❌ |
| **Return `std::unordered_map<std::string, c10::IValue>` by value** | ABI-safe; copy is cheap (single-digit entries × pointer-sized IValue); IValue copies share intrusive_ptr so callers still affect live state | One copy of the map | ✅ |

We deliberately do **not** filter by torchbind type or otherwise interpret the constants: the `c10::IValue` interface is the right abstraction, and backend filtering is the caller's job.

## Compatibility

- **Source compatibility**: additive only. Existing callers of `ProxyExecutor`, `AOTIModelContainerRunner`, and `AOTIModelPackageLoader` are unaffected.
- **Binary compatibility**: adds a new virtual to `ProxyExecutor`, which changes its vtable layout. Downstream subclasses of `ProxyExecutor` compiled against the old header will need a rebuild, but the default implementation handles them transparently once recompiled. No method removed or signature changed.
- **API stability**: returns by value (no internal-layout exposure). The map type is part of the public API; if we ever change the executor's storage type we'd convert at the boundary.

## Tests

`test/inductor/test_aoti_torchbind_constants.py` (new):

- Compile a model with a `_TorchScriptTesting._Foo` torchbind attribute via `aoti_compile_and_package`.
- Load via `torch._C._aoti.AOTIModelPackageLoader`.
- Assert `get_custom_objs()` returns the expected non-empty map.
- Negative test: a model without torchbind constants returns an empty map.

## Suggested reviewers

- @desertfire: AOTI runner / package loader owner
- @chenyang78: proxy executor / OSS proxy executor
- @angelayi: `.pt2` package format

Pull Request resolved: #182149
Approved by: https://github.com/desertfire
When a caller installs a non-default CUDA stream (custom user stream, CUDA Green Context-derived stream, NCCL stream, etc.), the runtime now runs the engine on that stream instead of substituting one from PyTorch's pool. Default-stream callers see no change: they still get a pool stream so we don't serialize the device on the default stream. Mirrored in the Python runtime modules for parity.

The cudagraph cache re-records when the chosen stream identity changes between calls. The pre/post caller<->engine event fences are skipped when both are the same stream, since fencing a stream against itself is a no-op.

No new public API. No .pt2 format change. No global flags. No FX graph changes. No engine state additions.

Test plan:
- Existing test suite passes (no API surface changed).
- Manual: load a .pt2 inside c10::cuda::CUDAStreamGuard(green_stream), capture an nsys trace, verify enqueueV3 launches on green_stream and kernels stay inside the partition's SMs.
Force-pushed from 405c706 to 7bee902
@narendasan many thanks for the feedback and prototype! I rewrote this e2e on top of latest main. Could you take another look please?

What I dropped: the process-wide flag, the per-engine setter, the Python multi-stream facade. No new public API, no .pt2 format change, no torchbind additions, no FX graph changes, no header changes.

What is in: stream selection moved to the top of the lambda. If the caller is on the default stream, behavior is unchanged (grab a pool stream). If the caller is on any other stream, the engine runs on that stream directly. When the chosen stream changes between calls, set `runtime_states.context_changed = true` so any captured CUDA Graph re-records. The pre/post sync events are skipped when the engine stream and the caller's stream are the same. Mirrored in `_PythonTorchTensorRTModule.py` and `_CudaGraphsTorchTensorRTModule.py`. This matches the shape of `execute_engine.cpp` on your stream_gaurds branch; I borrowed the `caller_on_default` local and the fence-skip pattern from there.

On your "pull stream management out entirely" suggestion: with this change that is what happens for any caller who installs a non-default stream. The pool-stream path stays only as the fallback for the default-stream case, which matches your wording: the lazy stream from the pool would be a fallback when the user is not managing this externally.

On mixed PyTorch + TensorRT in a green context: this PR does not push stream awareness into the FX graph. For the common case (one model per .pt2, one green context, caller wraps the whole forward in a CUDAStreamGuard), every CUDA op inside that guard inherits the green stream from the calling thread, both PyTorch ops and the TensorRT engine. So the slim change covers the "uniform green context for the whole graph" case without graph-level plumbing.

Where I think the two approaches fit together: this PR is the runtime contract (the engine runs on whatever stream is current at enqueueV3 time). stream_gaurds is the layer above, for cases where the caller wants intra-graph parallelism across multiple streams (different branches on different green contexts, fan-out, etc.). Single-stream callers do not need the FX rewrite to get green contexts working, and graph-level ops still need the runtime to honor the chosen stream, so the slim runtime fix is a prerequisite for stream_gaurds rather than a competing design.

One thought I want to share, since it shaped the design: stream identity is imo a runtime scheduling property, not a model property. A cudaStream_t does not survive serialization, and the same .pt2 may run in different deployment topologies (single-tenant, multi-tenant green contexts, NCCL, etc.). Even with apply_stream_plan taking actual streams at runtime, baking the plan structure into the graph means re-exporting whenever the topology changes shape (different number of streams, different fan-out). For the multi-tenant on-device case where the caller picks the partition per process, host-level stream selection feels like the right granularity. I think stream_gaurds is the right design for the intra-graph multi-stream case, but I would be careful about making it the default path for users who just want "engine respects my stream guard."

Happy to discuss any of this. Mind taking another pass?

What this fixes
Callers commonly pin work to a specific CUDA stream via `torch.cuda.stream(...)` (Python) or `c10::cuda::CUDAStreamGuard(...)` (C++) for NCCL, AMP overlap, custom scheduling, and most recently CUDA Green Contexts (CUDA 12.4+) for SM partitioning. A green-context-derived stream is the only way to confine a TensorRT engine to a partition.

Today the Torch-TensorRT runtime ignores the caller's stream. It reads `getCurrentCUDAStream()`, then substitutes one from PyTorch's pool. Pool streams are not bound to any green context, so the partition is silently bypassed. The same pattern lives in both Python runtime modules. Net effect: Torch-TensorRT cannot be used for GPU partitioning at all today.

What this PR does
The runtime now picks the engine's stream like this: if the caller is on the default stream, behavior is unchanged and the engine runs on a stream from PyTorch's pool; if the caller has installed any non-default stream, the engine runs directly on that stream.
Two small pieces of bookkeeping fall out of this change: when the chosen stream identity changes between calls, `runtime_states.context_changed` is set so any captured CUDA Graph re-records on the new stream; and the pre/post caller<->engine event fences are skipped when the engine stream and the caller's stream are the same.
The same logic is mirrored in `_PythonTorchTensorRTModule.py` (hoisted to the outer `forward` so both the standard and output-allocator paths see the resolved stream) and `_CudaGraphsTorchTensorRTModule.py`.

What stays the same for existing users
Anyone who calls `model(x)` without setting a stream sees the exact same behavior as before. No new public API. No new arguments. No `.pt2` file format change. No new fields on engine state.

Behavior change for existing non-default-stream callers
Users who already wrap inference in `torch.cuda.stream(my_stream)` (or `c10::cuda::CUDAStreamGuard(my_stream)`) for non-green reasons (custom scheduling, NCCL, AMP) will see the engine run on their stream instead of a separate pool stream. Results are identical; the timing characteristics may differ, since the previous separate-stream + event-fence pattern is replaced by direct execution on the caller's stream. For CUDA Graphs users who alternate between several non-default streams, the captured graph re-records on each stream change.

What a Green Context user does
Test plan
Load a `.pt2`, run inside a `CUDAStreamGuard` over a Green-Context-derived stream, capture an `nsys` trace, confirm kernels run on the green stream and stay inside the partition's SMs.