fix: respect caller's CUDA stream in TRT runtime (Green Context support) #4232
shoumikhin wants to merge 1 commit
Conversation
Force-pushed from bfa0fea to a0434e4
Long-term plan: upstream PyTorch PR. Opened pytorch/pytorch#182149 to add a `get_custom_objs()` accessor so the AOTI path can reach the embedded engines and bind a stream on each:

```cpp
torch::inductor::AOTIModelPackageLoader loader("model.pt2");
for (auto& [name, ivalue] : loader.get_custom_objs()) {
  if (ivalue.isCustomClass()) {
    if (auto engine = ivalue.toCustomClass<torch_tensorrt::TRTEngine>()) {
      engine->set_external_stream(reinterpret_cast<int64_t>(my_green_stream));
    }
  }
}
loader.run(inputs);
```

This PR (#4232) ships the runtime-side change independently of that upstream accessor.
Force-pushed from 08ab339 to 405c706
Friendly ping @narendasan @cehongwang. This has been open for a week. The upstream PyTorch dependency (pytorch/pytorch#182149) is landing today, which unblocks the full end-to-end path. Could one of you take a first pass when you get a chance? Happy to address feedback or split it up if that helps review.
@shoumikhin I will leave the more detailed feedback to @cehongwang, but one thing I wanted to raise: would it make sense for your use case to pull stream management out of the execute-engine op altogether? The lazy stream from the stream pool would then be a bail-out for the case where the user is not managing streams externally. My thought here is that we operationalize stream management in the FX graph with a set of operators that serve as a guard/interface to prevent computation on the default stream. The concern I have is that if you are mixing PyTorch and TensorRT operations and using, say, green contexts, would you be able to express everything correctly? Something to this effect in the green context case: I can upload a prototype if this makes sense. I think we would still use the AOTI hooks you created, just in these stream ops rather than engine by engine.
…82149)

## Summary

Adds a public C++ accessor `get_custom_objs()` on `torch::aot_inductor::ProxyExecutor`, `torch::inductor::AOTIModelContainerRunner`, and `torch::inductor::AOTIModelPackageLoader` (plus a Python binding on the loader) that returns a snapshot of the torchbind custom-class constants embedded in a loaded `.pt2` model.

The `IValue` payloads alias the live entries inside the proxy executor's `custom_objs_` map (the existing private storage already populated during `OSSProxyExecutor` construction by `pickle_load_obj`). Downcasting to a `CustomClassHolder` subclass and mutating its state (e.g. binding a stream, attaching a communicator, toggling profiling) affects subsequent `run()` invocations.

## Motivation

Backends that store live state on torchbind constants embedded in `.pt2` packages cannot reach those objects after load via any public PyTorch API. The constants live in `OSSProxyExecutor::custom_objs_` (private, `std::unordered_map<std::string, c10::IValue>`) and the loader/runner only expose tensor-typed constant accessors (`get_constant_fqns`, `extract_constants_map`, etc.).

Concrete user: **torch-tensorrt** ([pytorch/TensorRT#4232](pytorch/TensorRT#4232)) needs to bind a CUDA Green Context stream onto each `TRTEngine` torchbind embedded in a torch-tensorrt `.pt2` model loaded via `AOTIModelPackageLoader`. The `ExportedProgram` path already works (Python walks `named_modules()` and binds each engine), but the AOTI path has no equivalent because the engines never become attributes of an `nn.Module`; they live only inside the proxy executor's private map.

This is a **purely additive** change: the new methods are optional to call, have a no-op default on the base `ProxyExecutor`, and do not change any existing semantics.

## API

```cpp
// torch/csrc/inductor/aoti_torch/proxy_executor.h
class ProxyExecutor {
  // ...
  virtual std::unordered_map<std::string, c10::IValue> get_custom_objs() const {
    return {};
  }
};

// torch/csrc/inductor/aoti_torch/oss_proxy_executor.h
class OSSProxyExecutor : public ProxyExecutor {
  // ...
  std::unordered_map<std::string, c10::IValue> get_custom_objs() const override {
    return custom_objs_;
  }
};

// torch/csrc/inductor/aoti_runner/model_container_runner.h
class TORCH_API AOTIModelContainerRunner {
  // ...
  std::unordered_map<std::string, c10::IValue> get_custom_objs() const;
};

// torch/csrc/inductor/aoti_package/model_package_loader.h
class TORCH_API AOTIModelPackageLoader {
  // ...
  std::unordered_map<std::string, c10::IValue> get_custom_objs() const;
};
```

Python:

```python
loader = torch._C._aoti.AOTIModelPackageLoader("model.pt2", "model")
custom_objs = loader.get_custom_objs()  # Dict[str, ScriptObject]
```

## Usage example (from torch-tensorrt)

```cpp
torch::inductor::AOTIModelPackageLoader loader("model.pt2");
for (auto& [name, ivalue] : loader.get_custom_objs()) {
  if (ivalue.isCustomClass()) {
    if (auto engine = ivalue.toCustomClass<torch_tensorrt::TRTEngine>()) {
      engine->set_external_stream(reinterpret_cast<int64_t>(my_green_stream));
    }
  }
}
auto outputs = loader.run(inputs);  // each engine now uses its bound stream
```

## API design rationale

Three options were considered:

| Option | Pros | Cons | Picked |
|---|---|---|---|
| Return `const std::unordered_map<std::string, c10::IValue>&` | Zero copy | Leaks STL container layout; ABI-fragile if `custom_objs_` is renamed/reshaped | ❌ |
| Visitor: `for_each_custom_obj(std::function<...>)` | Hides container | Cumbersome, no random access by name | ❌ |
| **Return `std::unordered_map<std::string, c10::IValue>` by value** | ABI-safe; copy is cheap (single-digit entries × pointer-sized IValue); IValue copies share intrusive_ptr so callers still affect live state | One copy of the map | ✅ |

We deliberately do **not** filter by torchbind type or otherwise interpret the constants: the `c10::IValue` interface is the right abstraction, and backend filtering is the caller's job.

## Compatibility

- **Source compatibility**: additive only. Existing callers of `ProxyExecutor`, `AOTIModelContainerRunner`, and `AOTIModelPackageLoader` are unaffected.
- **Binary compatibility**: adds a new virtual to `ProxyExecutor`, which changes its vtable layout. Downstream subclasses of `ProxyExecutor` compiled against the old header will need a rebuild, but the default implementation handles them transparently once recompiled. No method removed or signature changed.
- **API stability**: returns by value (no internal-layout exposure). The map type is part of the public API; if we ever change the executor's storage type we'd convert at the boundary.

## Tests

`test/inductor/test_aoti_torchbind_constants.py` (new):

- Compile a model with a `_TorchScriptTesting._Foo` torchbind attribute via `aoti_compile_and_package`.
- Load via `torch._C._aoti.AOTIModelPackageLoader`.
- Assert `get_custom_objs()` returns the expected non-empty map.
- Negative test: a model without torchbind constants returns an empty map.

## Suggested reviewers

- @desertfire: AOTI runner / package loader owner
- @chenyang78: proxy executor / OSS proxy executor
- @angelayi: `.pt2` package format

Pull Request resolved: #182149
Approved by: https://github.com/desertfire
When a caller installs a non-default CUDA stream (custom user stream, CUDA Green Context-derived stream, NCCL stream, etc.), the runtime now runs the engine on that stream instead of substituting one from PyTorch's pool. Default-stream callers see no change: they still get a pool stream so we don't serialize the device on the default stream. Mirrored in the Python runtime modules for parity.

The cudagraph cache re-records when the chosen stream identity changes between calls. The pre/post caller<->engine event fences are skipped when both are the same stream, since fencing a stream against itself is a no-op.

No new public API. No .pt2 format change. No global flags. No FX graph changes. No engine state additions.

Test plan:
- Existing test suite passes (no API surface changed).
- Manual: load a .pt2 inside c10::cuda::CUDAStreamGuard(green_stream), capture an nsys trace, verify enqueueV3 launches on green_stream and kernels stay inside the partition's SMs.
Force-pushed from 405c706 to 7bee902
@narendasan many thanks for the feedback and prototype! I rewrote this e2e on top of latest main. Could you take another look please?

What I dropped: the process-wide flag, the per-engine setter, the Python multi-stream facade. No new public API, no .pt2 format change, no torchbind additions, no FX graph changes, no header changes.

What is in: stream selection moved to the top of the lambda. If the caller is on the default stream, behavior is unchanged (grab a pool stream). If the caller is on any other stream, the engine runs on that stream directly. When the chosen stream changes between calls, set `runtime_states.context_changed = true` so any captured CUDA Graph re-records. The pre/post sync events are skipped when the engine stream and the caller's stream are the same. Mirrored in `_PythonTorchTensorRTModule.py` and `_CudaGraphsTorchTensorRTModule.py`. This matches the shape of `execute_engine.cpp` on your stream_gaurds branch; I borrowed the `caller_on_default` local and the fence-skip pattern from there.

On your "pull stream management out entirely" suggestion: with this change that is what happens for any caller who installs a non-default stream. The pool-stream path stays only as the fallback for the default-stream case, which matches your wording: the lazy stream from the pool would be a fallback when the user is not managing this externally.

On mixed PyTorch + TensorRT in a green context: this PR does not push stream awareness into the FX graph. For the common case (one model per .pt2, one green context, caller wraps the whole forward in a CUDAStreamGuard), every CUDA op inside that guard inherits the green stream from the calling thread, both PyTorch ops and the TensorRT engine. So the slim change covers the "uniform green context for the whole graph" case without graph-level plumbing.

Where I think the two approaches fit together: this PR is the runtime contract (the engine runs on whatever stream is current at enqueueV3 time). stream_gaurds is the layer above, for cases where the caller wants intra-graph parallelism across multiple streams (different branches on different green contexts, fan-out, etc.). Single-stream callers do not need the FX rewrite to get green contexts working, and graph-level ops still need the runtime to honor the chosen stream, so the slim runtime fix is a prerequisite for stream_gaurds rather than a competing design.

One thought I want to share, since it shaped the design: stream identity is imo a runtime scheduling property, not a model property. A cudaStream_t does not survive serialization, and the same .pt2 may run in different deployment topologies (single-tenant, multi-tenant green contexts, NCCL, etc.). Even with apply_stream_plan taking actual streams at runtime, baking the plan structure into the graph means re-exporting whenever the topology changes shape (different number of streams, different fan-out). For the multi-tenant on-device case where the caller picks the partition per process, host-level stream selection feels like the right granularity. I think stream_gaurds is the right design for the intra-graph multi-stream case, but I would be careful about making it the default path for users who just want "engine respects my stream guard."

Happy to discuss any of this. Mind taking another pass?

What this fixes
Callers commonly pin work to a specific CUDA stream via `torch.cuda.stream(...)` (Python) or `c10::cuda::CUDAStreamGuard(...)` (C++) for NCCL, AMP overlap, custom scheduling, and most recently CUDA Green Contexts (CUDA 12.4+) for SM partitioning. A green-context-derived stream is the only way to confine a TensorRT engine to a partition.

Today the Torch-TensorRT runtime ignores the caller's stream. It reads `getCurrentCUDAStream()`, then substitutes one from PyTorch's pool. Pool streams are not bound to any green context, so the partition is silently bypassed. The same pattern lives in both Python runtime modules. Net effect: Torch-TensorRT cannot be used for GPU partitioning at all today.

What this PR does
The runtime now picks the engine's stream like this: if the caller is on the default stream, behavior is unchanged and the engine runs on a stream from PyTorch's pool; if the caller has installed any non-default stream, the engine runs directly on that stream.
Two small pieces of bookkeeping fall out of this change: when the chosen stream identity changes between calls, `runtime_states.context_changed` is set so any captured CUDA Graph re-records on the new stream; and the pre/post caller<->engine event fences are skipped when the engine stream and the caller's stream are the same.
The same logic is mirrored in `_PythonTorchTensorRTModule.py` (hoisted to the outer `forward` so both the standard and output-allocator paths see the resolved stream) and `_CudaGraphsTorchTensorRTModule.py`.

What stays the same for existing users
Anyone who calls `model(x)` without setting a stream sees the exact same behavior as before. No new public API. No new arguments. No `.pt2` file format change. No new fields on engine state.

Behavior change for existing non-default-stream callers
Users who already wrap inference in `torch.cuda.stream(my_stream)` (or `c10::cuda::CUDAStreamGuard(my_stream)`) for non-green reasons (custom scheduling, NCCL, AMP) will see the engine run on their stream instead of a separate pool stream. Results are identical; the timing characteristics may differ, since the previous separate-stream + event-fence pattern is replaced by direct execution on the caller's stream. For CUDA Graphs users who alternate between several non-default streams, the captured graph re-records on each stream change.

What a Green Context user does
Test plan
Load a `.pt2`, run inside a `CUDAStreamGuard` over a Green-Context-derived stream, capture an `nsys` trace, confirm kernels run on the green stream and stay inside the partition's SMs.