
Rework save area caching, improve stepping performance#36

Open
ammallya wants to merge 2 commits into amd-staging from
users/pedalves/save-area-cache

Conversation


@ammallya ammallya commented Feb 16, 2026

No description provided.

This patch massively improves the debugging experience on Windows,
and substantially improves it on Linux.

For example, on my Linux box with a gfx1030, using the
gdb.rocm/step-many-waves.exp ROCgdb testcase, before the patch, I see:

- Before patch, Linux, debug dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" testsuite/gdb.log
 Command execution time: 0.614215 (cpu), 0.627181 (wall)
 Command execution time: 0.020337 (cpu), 0.020627 (wall)
 Command execution time: 1.212774 (cpu), 1.235801 (wall)
 Command execution time: 0.020677 (cpu), 0.020959 (wall)
 Command execution time: 1.195660 (cpu), 1.218044 (wall)

- Before patch, Linux, release dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" testsuite/gdb.log
 Command execution time: 0.196843 (cpu), 0.206161 (wall)
 Command execution time: 0.009141 (cpu), 0.009376 (wall)
 Command execution time: 0.355500 (cpu), 0.373839 (wall)
 Command execution time: 0.011062 (cpu), 0.011291 (wall)
 Command execution time: 0.345165 (cpu), 0.363483 (wall)

After the patch, on the same Linux machine I get around the following
times:

- After patch, Linux, debug dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" testsuite/gdb.log
 Command execution time: 0.194576 (cpu), 0.204046 (wall)
 Command execution time: 0.009021 (cpu), 0.009235 (wall)
 Command execution time: 0.382576 (cpu), 0.419615 (wall)
 Command execution time: 0.009179 (cpu), 0.009406 (wall)
 Command execution time: 0.379717 (cpu), 0.397690 (wall)

- After patch, Linux, release dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" testsuite/gdb.log
 Command execution time: 0.064394 (cpu), 0.073941 (wall)
 Command execution time: 0.005430 (cpu), 0.005668 (wall)
 Command execution time: 0.115358 (cpu), 0.134396 (wall)
 Command execution time: 0.005047 (cpu), 0.005300 (wall)
 Command execution time: 0.115153 (cpu), 0.134178 (wall)

That is roughly a 3x speedup for the slower steps, in both the debug
and release builds.

On my Windows machine with a gfx1201, just the first "next" command
takes between 200 and 440 seconds (sic!), resulting in
gdb.rocm/step-many-waves.exp timeouts.

With this patch, on the same Windows machine, I get around the
following times:

Windows, debug dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" testsuite/gdb.log
 Command execution time: 1.961000 (cpu), 1.961469 (wall)
 Command execution time: 0.129000 (cpu), 0.129094 (wall)
 Command execution time: 1.522000 (cpu), 1.522097 (wall)
 Command execution time: 2.065000 (cpu), 2.064715 (wall)
 Command execution time: 0.127000 (cpu), 0.127054 (wall)

Windows, release dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" gdb.log
 Command execution time: 1.391000 (cpu), 1.391136 (wall)
 Command execution time: 0.111000 (cpu), 0.111063 (wall)
 Command execution time: 1.267000 (cpu), 1.267022 (wall)
 Command execution time: 1.783000 (cpu), 1.783264 (wall)
 Command execution time: 0.127000 (cpu), 0.126915 (wall)

===================

The problem in the current code is that we do thousands of accesses
out of the save area region, and memory accesses on Windows are slower
than on Linux.  While on Linux reading a few bytes out of the inferior
takes on the order of 10 microseconds, on Windows it takes from 2 to
20 milliseconds, with most accesses falling in the 3 ms region.  It is
mostly a latency problem -- a larger transfer does not take
proportionally more time than a smaller one.  Since each memory access
on Windows pays this high latency, it is important to batch as much as
we can into fewer, larger accesses.

We already have a caching mechanism in dbgapi that caches a small
portion of the save area whenever we update waves.

Based on a suggestion from Lancelot, this patch starts by making us
cache the whole save area instead, reducing thousands of xfers down to
one.

That's just the beginning, though.

If we do that, then we still see many memory writes, most 128 bytes
long each.  Those come from the memory cache write back code.  128
bytes is a multiple of the size of the cache lines, which is 64 bytes.
We see many writes because even though the code already coalesces
contiguous cache lines, it only does so for contiguous DIRTY cache
lines.  I.e., given the following contiguous cache lines 1 through 9,
with D for dirty and C for clean:

  1 2 3 4 5 6 7 8 9
  D D C D C D C D C

then we write back cache lines 1 and 2 in a single memory write call,
and then three more separate calls for 4, 6 and 8.
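As a sketch (not the actual dbgapi code), the old coalescing behavior can be modeled like this: only runs of contiguous dirty lines are merged, so the D D C D C D C D C pattern above produces four separate memory writes:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model of the old write-back coalescing: only runs of
// contiguous DIRTY cache lines are merged into a single memory write.
// Returns (first_line, line_count) pairs, one per write call.
std::vector<std::pair<std::size_t, std::size_t>>
coalesce_dirty_runs (const std::vector<bool> &dirty)
{
  std::vector<std::pair<std::size_t, std::size_t>> writes;
  std::size_t i = 0;
  while (i < dirty.size ())
    {
      if (!dirty[i])
        {
          ++i;
          continue;
        }
      std::size_t start = i;
      while (i < dirty.size () && dirty[i])
        ++i;
      /* One memory write covering lines [start, i).  */
      writes.emplace_back (start, i - start);
    }
  return writes;
}
```

With the pattern above, this returns { {0, 2}, {3, 1}, {5, 1}, {7, 1} }: lines 1 and 2 go out in one write, then lines 4, 6 and 8 each need their own call.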

I tried improving that by making the write back code write lines 1
through 8 in one go.  That did improve timings on Windows
substantially already, from the 440s for one single "next" in the
gdb.rocm/step-many-waves.exp testcase down to between 6s and 20s.
Still far from ideal, but way better.  However, the resulting patch
degraded the performance on Linux by some 10x, which is unacceptable.

The reason for that slowdown on Linux was simply the cache lines
allocation.  I.e., with a prefetch save area size of 12134416 on my
machine, and 64 bytes long cache lines, we end up allocating some 190k
cache lines, each a separate heap allocation...  On Windows, for a
single prefetch call, we would now be spending between 3 and 8 ms
reading memory from the inferior, and then still spend 200ms
allocating the cache lines, in the loop at the bottom of
memory_cache_t<AddressType>::prefetch.  The times on Linux were lower,
but of similar shape -- the memory xfer is fast, and then the
allocation step can also take up to 100ms.  This completely explains
the 10x slowdown on Linux.
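A back-of-the-envelope check of those numbers, using the save area size and cache line size quoted above (figures from the text, not a measurement):

```cpp
#include <cstddef>

// Sanity-check of the cache line count quoted above, using the save
// area size and line size from the commit message.
constexpr std::size_t save_area_size = 12134416; /* bytes, this machine */
constexpr std::size_t cache_line_size = 64;      /* bytes */

/* Number of 64-byte cache lines needed to cover the save area,
   rounding up -- each one a separate heap allocation in the old
   design.  */
constexpr std::size_t num_cache_lines
  = (save_area_size + cache_line_size - 1) / cache_line_size;

static_assert (num_cache_lines == 189601,
               "~190k heap allocations per prefetch call");
```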

Now, I considered several ways to address this, including increasing
the cache line size, reusing cache lines across discard/prefill calls,
and others.  However, there is one important detail -- there is ONE
single prefetch call in the whole codebase.  The memory cache
mechanism is solely used for the save areas, nothing else.  There is
no code that wants to cache random areas without a clear access
pattern.  And as I looked deeper into the memory cache, I saw
unnecessary complication:

 - The memory cache mechanism supports several caching policies, but
   only one is ever used.

 - There are some entry points in the memory_cache_t class that are
   not used at all anywhere:

     void fetch_cache_line (cache_line_t &cache_line, AddressType address) const;
     void commit_cache_line (cache_line_t &cache_line, AddressType address) const;
     void allocate_0_cache_line (cache_line_t &cache_line) const;

 - memory_cache_t is a template that is templated on address type,
   host or agent.  However, the host instantiation isn't used for
   anything at all.

This suggests to me that we can redesign the class, make it more tuned
to save area caching, and simplify it at the same time.  That's what
this patch does.

 - Make memory_cache_t no longer be a template.  Make it only work
   with agent addresses.

 - Remove the unneeded entry points listed above.

 - Remove the concept of cache lines, and replace it with cache
   entries, the size of the whole cache request area.

 - Since we now have a full contiguous block of storage per cache
   entry, we can return the pointer to the cache entry directly to the
   caller of prefetch, avoiding one redundant memory allocation in
   update_waves.  This improves performance a bit.

 - Reuse cache entries.  This also improves performance a bit.

 - Writing back an entry is a single memory xfer.  Track a range of
   dirty bytes per entry instead of a boolean, so that the write-back
   avoids transferring non-dirty bytes unnecessarily.  E.g., the
   control stack region doesn't need to be written back.  This also
   improves performance.

 - Assume (and document, and assert) that discarding never tries to
   discard a partial cached entry.
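A minimal sketch of the dirty-range idea from the list above (illustrative names, not the actual dbgapi types): each cache entry owns one contiguous buffer and records the smallest byte range covering all writes, so write-back is a single xfer of just that range:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of a cache entry that tracks a dirty byte range
// instead of per-line dirty bits.  Not the actual dbgapi class.
struct cache_entry_t
{
  std::vector<uint8_t> m_data; /* whole save area, one allocation */
  std::size_t m_dirty_begin = 0;
  std::size_t m_dirty_end = 0; /* empty range => entry is clean */

  void write (std::size_t offset, const uint8_t *src, std::size_t size)
  {
    std::copy (src, src + size, m_data.begin () + offset);
    if (m_dirty_begin == m_dirty_end)
      {
        /* First write: the dirty range is exactly this write.  */
        m_dirty_begin = offset;
        m_dirty_end = offset + size;
      }
    else
      {
        /* Grow the dirty range to cover this write too.  */
        m_dirty_begin = std::min (m_dirty_begin, offset);
        m_dirty_end = std::max (m_dirty_end, offset + size);
      }
  }

  /* Bytes the single write-back xfer would transfer.  Untouched
     regions, e.g. the control stack, are skipped entirely.  */
  std::size_t dirty_size () const { return m_dirty_end - m_dirty_begin; }
};
```

Two writes at offsets 100 and 500 of sizes 10 and 4 leave a dirty range of [100, 504), so the write-back transfers 404 bytes instead of the full buffer.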

If we ever want to go back to having a more generalized memory cache
while keeping the efficiency gains of this commit, I'd suggest
reintroducing aligned cache lines, but making the lines a
variable-width multiple of the cache line size, and adding a way to
force aggregating smaller blocks into one larger contiguous block.
That way, the semantics of the memory_cache_t::new_entry entry point
in the new code -- returning a pointer to a contiguous block -- can be
preserved for the save area case.  For now, since we don't need it,
that would just be extra unnecessary complication.

Finally, the memory_cache_t::contains_all call in
wave_t::read_register is no longer necessary as the cache is
guaranteed to always have the whole save area.

Tested on both Linux and Windows.

Change-Id: Iffa425669dbd989535a2f1c2945bea4720a9589f
commit-id:13cd2c42
"xfer_global_memory" is kind of a reminiscent of the Linux-only time
where the cache was dealing with global addresses.  Even if yes, from
the GPU perspective it is called "global", it is not necessarily
global in the "unified global" sense.

Since the memory cache is now specialized to agent_address_t, rename:

 xfer_global_memory => xfer_agent_memory
 read_global_memory => read_agent_memory
 write_global_memory => write_agent_memory

Pass the agent to the memory_cache_t ctor, so we can use it in the
memory_access_error_t throw in memory_cache_t::write_back.  This fixes
a FIXME: instead of using address_space_t::global(), we now use the
agent's address space.

Based on idea from Lancelot.

Change-Id: Ib93ea39dcb9648a509a80c1dbd9f8bf79905587c
commit-id:c509fa86