
Rework save area caching, improve stepping performance#36

Open
ammallya wants to merge 2 commits into amd-staging from
users/pedalves/save-area-cache

Conversation


@ammallya ammallya commented Feb 16, 2026

No description provided.

This patch massively improves the debugging experience on Windows,
and substantially improves it on Linux.

For example, on my Linux box with a gfx1030, using the
gdb.rocm/step-many-waves.exp ROCgdb testcase, before the patch, I see:

- Before patch, Linux, debug dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" testsuite/gdb.log
 Command execution time: 0.614215 (cpu), 0.627181 (wall)
 Command execution time: 0.020337 (cpu), 0.020627 (wall)
 Command execution time: 1.212774 (cpu), 1.235801 (wall)
 Command execution time: 0.020677 (cpu), 0.020959 (wall)
 Command execution time: 1.195660 (cpu), 1.218044 (wall)

- Before patch, Linux, release dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" testsuite/gdb.log
 Command execution time: 0.196843 (cpu), 0.206161 (wall)
 Command execution time: 0.009141 (cpu), 0.009376 (wall)
 Command execution time: 0.355500 (cpu), 0.373839 (wall)
 Command execution time: 0.011062 (cpu), 0.011291 (wall)
 Command execution time: 0.345165 (cpu), 0.363483 (wall)

After the patch, on the same Linux machine I get around the following
times:

- After patch, Linux, debug dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" testsuite/gdb.log
 Command execution time: 0.194576 (cpu), 0.204046 (wall)
 Command execution time: 0.009021 (cpu), 0.009235 (wall)
 Command execution time: 0.382576 (cpu), 0.419615 (wall)
 Command execution time: 0.009179 (cpu), 0.009406 (wall)
 Command execution time: 0.379717 (cpu), 0.397690 (wall)

- After patch, Linux, release dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" testsuite/gdb.log
 Command execution time: 0.064394 (cpu), 0.073941 (wall)
 Command execution time: 0.005430 (cpu), 0.005668 (wall)
 Command execution time: 0.115358 (cpu), 0.134396 (wall)
 Command execution time: 0.005047 (cpu), 0.005300 (wall)
 Command execution time: 0.115153 (cpu), 0.134178 (wall)

That is roughly a 3x speedup for the slower steps, in both the debug
and release builds.

On my Windows machine with a gfx1201, just the first "next" command
takes between 200 and 440 seconds (sic!), resulting in
gdb.rocm/step-many-waves.exp timeouts.

With this patch, on the same Windows machine, I get around the
following times:

Windows, debug dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" testsuite/gdb.log
 Command execution time: 1.961000 (cpu), 1.961469 (wall)
 Command execution time: 0.129000 (cpu), 0.129094 (wall)
 Command execution time: 1.522000 (cpu), 1.522097 (wall)
 Command execution time: 2.065000 (cpu), 2.064715 (wall)
 Command execution time: 0.127000 (cpu), 0.127054 (wall)

Windows, release dbgapi build (+ debug rocgdb):

 $ grep "Command execution time" gdb.log
 Command execution time: 1.391000 (cpu), 1.391136 (wall)
 Command execution time: 0.111000 (cpu), 0.111063 (wall)
 Command execution time: 1.267000 (cpu), 1.267022 (wall)
 Command execution time: 1.783000 (cpu), 1.783264 (wall)
 Command execution time: 0.127000 (cpu), 0.126915 (wall)

===================

The problem in the current code is that we do thousands of accesses
out of the save area region, and memory accesses on Windows are slower
than on Linux.  While on Linux reading a few bytes out of the inferior
takes on the order of 10 microseconds, on Windows it takes from 2 to
20 milliseconds, with most accesses falling in the 3 ms region.  It is
mostly a latency problem -- a larger transfer does not take
proportionally more time than a smaller one.  Since each memory access
on Windows pays this high latency, it is important to batch as much as
we can into fewer, larger accesses.

We already have a caching mechanism in dbgapi that caches a small
portion of the save area whenever we update waves.

Based on a suggestion from Lancelot, this patch starts by making us
cache the whole save area instead, reducing thousands of xfers down to
one.

That's just the beginning, though.

If we do that, then we still see many memory writes, most 128 bytes
long each.  Those come from the memory cache write back code.  128
bytes is a multiple of the size of the cache lines, which is 64 bytes.
We see many writes because even though the code already coalesces
contiguous cache lines, it only does so for contiguous DIRTY cache
lines.  I.e., given the following contiguous cache lines 1 through 9,
with D for dirty and C for clean:

  1 2 3 4 5 6 7 8 9
  D D C D C D C D C

then we write back cache lines 1 and 2 in a single memory write call,
and then three more separate calls for 4, 6 and 8.
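As a sketch (not the actual dbgapi code), the old coalescing behavior can be modeled like this: only runs of contiguous dirty lines are merged, so the D D C D C D C D C pattern above produces four separate memory writes:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model of the old write-back coalescing: only runs of
// contiguous DIRTY cache lines are merged into a single memory write.
// Returns (first_line, line_count) pairs, one per write call.
std::vector<std::pair<std::size_t, std::size_t>>
coalesce_dirty_runs (const std::vector<bool> &dirty)
{
  std::vector<std::pair<std::size_t, std::size_t>> writes;
  std::size_t i = 0;
  while (i < dirty.size ())
    {
      if (!dirty[i])
        {
          ++i;
          continue;
        }
      std::size_t start = i;
      while (i < dirty.size () && dirty[i])
        ++i;
      /* One memory write covering lines [start, i).  */
      writes.emplace_back (start, i - start);
    }
  return writes;
}
```

With the pattern above, this returns { {0, 2}, {3, 1}, {5, 1}, {7, 1} }: lines 1 and 2 go out in one write, then lines 4, 6 and 8 each need their own call.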

I tried improving that by making the write back code write lines 1
through 8 in one go.  That did improve timings on Windows
substantially already, from the 440s for one single "next" in the
gdb.rocm/step-many-waves.exp testcase down to between 6s and 20s.
Still far from ideal, but way better.  However, the resulting patch
degraded the performance on Linux by some 10x, which is unacceptable.

The reason for that slowdown on Linux was simply the cache lines
allocation.  I.e., with a prefetch save area size of 12134416 on my
machine, and 64 bytes long cache lines, we end up allocating some 190k
cache lines, each a separate heap allocation...  On Windows, for a
single prefetch call, we would now be spending between 3 and 8 ms
reading memory from the inferior, and then still spend 200ms
allocating the cache lines, in the loop at the bottom of
memory_cache_t<AddressType>::prefetch.  The times on Linux were lower,
but of similar shape -- the memory xfer is fast, and then the
allocation step can also take up to 100ms.  This completely explains
the 10x slowdown on Linux.
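A back-of-the-envelope check of those numbers, using the save area size and cache line size quoted above (figures from the text, not a measurement):

```cpp
#include <cstddef>

// Sanity-check of the cache line count quoted above, using the save
// area size and line size from the commit message.
constexpr std::size_t save_area_size = 12134416; /* bytes, this machine */
constexpr std::size_t cache_line_size = 64;      /* bytes */

/* Number of 64-byte cache lines needed to cover the save area,
   rounding up -- each one a separate heap allocation in the old
   design.  */
constexpr std::size_t num_cache_lines
  = (save_area_size + cache_line_size - 1) / cache_line_size;

static_assert (num_cache_lines == 189601,
               "~190k heap allocations per prefetch call");
```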

Now, I considered several ways to address this, including increasing
the cache line size, reusing cache lines across discard/prefill calls,
and others.  However, there is one important detail -- there is ONE
single prefetch call in the whole codebase.  The memory cache
mechanism is solely used for the save areas, nothing else.  There is
no code that wants to cache random areas without a clear access
pattern.  And as I looked deeper into the memory cache, I saw
unnecessary complication:

 - The memory cache mechanism supports several caching policies, but
   only one is ever used.

 - There are some entry points in the memory_cache_t class that are
   not used at all anywhere:

     void fetch_cache_line (cache_line_t &cache_line, AddressType address) const;
     void commit_cache_line (cache_line_t &cache_line, AddressType address) const;
     void allocate_0_cache_line (cache_line_t &cache_line) const;

 - memory_cache_t is a template that is templated on address type,
   host or agent.  However, the host instantiation isn't used for
   anything at all.

This suggests to me that we can redesign the class, make it more tuned
to save area caching, and simplify it at the same time.  That's what
this patch does.

 - Make memory_cache_t no longer be a template.  Make it only work
   with agent addresses.

 - Remove the unneeded entry points listed above.

 - Remove the concept of cache lines, and replace it with cache
   entries, the size of the whole cache request area.

 - Since we now have a full contiguous block of storage per cache
   entry, we can return the pointer to the cache entry directly to the
   caller of prefetch, avoiding one redundant memory allocation in
   update_waves.  This improves performance a bit.

 - Reuse cache entries.  This also improves performance a bit.

 - Writing back an entry is a single memory xfer.  Track a range of
   dirty bytes per entry instead of a boolean, so that the write-back
   avoids transferring non-dirty bytes unnecessarily.  E.g., the
   control stack region doesn't need to be written back.  This also
   improves performance.

 - Assume (and document, and assert) that discarding never tries to
   discard a partial cached entry.
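A minimal sketch of the dirty-range idea from the list above (illustrative names, not the actual dbgapi types): each cache entry owns one contiguous buffer and records the smallest byte range covering all writes, so write-back is a single xfer of just that range:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of a cache entry that tracks a dirty byte range
// instead of per-line dirty bits.  Not the actual dbgapi class.
struct cache_entry_t
{
  std::vector<uint8_t> m_data; /* whole save area, one allocation */
  std::size_t m_dirty_begin = 0;
  std::size_t m_dirty_end = 0; /* empty range => entry is clean */

  void write (std::size_t offset, const uint8_t *src, std::size_t size)
  {
    std::copy (src, src + size, m_data.begin () + offset);
    if (m_dirty_begin == m_dirty_end)
      {
        /* First write: the dirty range is exactly this write.  */
        m_dirty_begin = offset;
        m_dirty_end = offset + size;
      }
    else
      {
        /* Grow the dirty range to cover this write too.  */
        m_dirty_begin = std::min (m_dirty_begin, offset);
        m_dirty_end = std::max (m_dirty_end, offset + size);
      }
  }

  /* Bytes the single write-back xfer would transfer.  Untouched
     regions, e.g. the control stack, are skipped entirely.  */
  std::size_t dirty_size () const { return m_dirty_end - m_dirty_begin; }
};
```

Two writes at offsets 100 and 500 of sizes 10 and 4 leave a dirty range of [100, 504), so the write-back transfers 404 bytes instead of the full buffer.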

If we ever want to go back to having a more generalized memory cache
while keeping the efficiency gains of this commit, I'd suggest
reintroducing aligned cache lines, but making the lines a
variable-width multiple of the cache line size, and adding a way to
force aggregating smaller blocks into one larger contiguous block.
That way, the semantics of the memory_cache_t::new_entry entry point
in the new code -- returning a pointer to a contiguous block -- can be
preserved for the save area case.  For now, since we don't need it,
that would just be extra unnecessary complication.

Finally, the memory_cache_t::contains_all call in
wave_t::read_register is no longer necessary as the cache is
guaranteed to always have the whole save area.

Tested on both Linux and Windows.

Change-Id: Iffa425669dbd989535a2f1c2945bea4720a9589f
commit-id:13cd2c42
"xfer_global_memory" is kind of a reminiscent of the Linux-only time
where the cache was dealing with global addresses.  Even if yes, from
the GPU perspective it is called "global", it is not necessarily
global in the "unified global" sense.

Since the memory cache is now specialized to agent_address_t, rename:

 xfer_global_memory => xfer_agent_memory
 read_global_memory => read_agent_memory
 write_global_memory => write_agent_memory

Pass the agent to the memory_cache_t ctor, so we can use it in the
memory_access_error_t throw in memory_cache_t::write_back.  This fixes
a FIXME: instead of using address_space_t::global(), we now use the
agent's address space.

Based on idea from Lancelot.

Change-Id: Ib93ea39dcb9648a509a80c1dbd9f8bf79905587c
commit-id:c509fa86