Skip to content

WIP: Enable zero-copy for QNN GPU#2105

Draft
qti-mattsinc wants to merge 1 commit intomicrosoft:mainfrom
CodeLinaro:dev/mattsinc/gpu-zero-copy
Draft

WIP: Enable zero-copy for QNN GPU#2105
qti-mattsinc wants to merge 1 commit intomicrosoft:mainfrom
CodeLinaro:dev/mattsinc/gpu-zero-copy

Conversation

@qti-mattsinc
Copy link
Copy Markdown

  • Use the newly added "enable_dx12_shared_memory_allocator" provided option to allocate the KV cache on CPU-accessible GPU memory.
  • This provides a large speedup by eliminating unnecessary copy overhead.

* Use the newly added "enable_dx12_shared_memory_allocator"
  provided option to allocate the KV cache on CPU-accessible
  GPU memory.
* This provides a large speedup by eliminating unnecessary
  copy overhead.

> Co-authored-by: qti-mattsinc <mattsinc@qti.qualcomm.com>
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to modify this file and onnxruntime_inline.h to be able to access the CreateMemoryInfo_V2 API. It looks like these files are manually checked-in?

Comment thread src/models/model.cpp
}
if (use_dx12_shared_memory) {
provider_options_list.back().options.emplace_back("enable_dx12_shared_memory_allocator", "1");
provider_options_list.back().device_filtering_options = Config::DeviceFilteringOptions { OrtHardwareDeviceType_GPU };
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: would rather not touch device_filtering_options here if possible. Needs to be cleaned up in a way that still selects the correct allocator

Copy link
Copy Markdown

@johnpaultaken johnpaultaken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not look ideal to me, we need to discuss why it deviates from the norm.
Ideally what I like to see is no change to genai at all.
When OrtMemoryInfo is of type QnnShared can we just return the QnnGpuAllocator based on the device selection made by the user ie GPU ?
Everything else should work transparently, just like how it works for the other EPs cuda, openvino etc.
I don't see a need for enable_dx12_shared_memory_allocator option, why not enable it always in the EP ?
Also we cannot have variable names like use_dx12_shared_memory etc which then turns out to be only a Qnn specific option. Ideally, I would want to avoid any code in genai that are EP specific.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants