WIP: Enable zero-copy for QNN GPU#2105
Conversation
qti-mattsinc
commented
Apr 27, 2026
- Use the newly added "enable_dx12_shared_memory_allocator" provided option to allocate the KV cache on CPU-accessible GPU memory.
- This provides a large speedup by eliminating unnecessary copy overhead.
* Use the newly added "enable_dx12_shared_memory_allocator" provided option to allocate the KV cache on CPU-accessible GPU memory. * This provides a large speedup by eliminating unnecessary copy overhead. > Co-authored-by: qti-mattsinc <mattsinc@qti.qualcomm.com>
There was a problem hiding this comment.
I had to modify this file and onnxruntime_inline.h to be able to access the CreateMemoryInfo_V2 API. It looks like these files are manually checked-in?
| } | ||
| if (use_dx12_shared_memory) { | ||
| provider_options_list.back().options.emplace_back("enable_dx12_shared_memory_allocator", "1"); | ||
| provider_options_list.back().device_filtering_options = Config::DeviceFilteringOptions { OrtHardwareDeviceType_GPU }; |
There was a problem hiding this comment.
note to self: would rather not touch device_filtering_options here if possible. Needs to be cleaned up in a way that still selects the correct allocator
There was a problem hiding this comment.
This change does not look ideal to me, we need to discuss why it deviates from the norm.
Ideally what I like to see is no change to genai at all.
When OrtMemoryInfo is of type QnnShared can we just return the QnnGpuAllocator based on the device selection made by the user ie GPU ?
Everything else should work transparently, just like how it works for the other EPs cuda, openvino etc.
I don't see a need for enable_dx12_shared_memory_allocator option, why not enable it always in the EP ?
Also we cannot have variable names like use_dx12_shared_memory etc which then turns out to be only a Qnn specific option. Ideally, I would want to avoid any code in genai that are EP specific.