Rework save area caching, improve stepping performance #36
Open
ammallya wants to merge 2 commits into amd-staging
This patch improves debug experience on Windows massively, and on
Linux substantially.
For example, on my Linux box with a gfx1030, using the
gdb.rocm/step-many-waves.exp ROCgdb testcase, before the patch, I see:
- Before patch, Linux, debug dbgapi build (+ debug rocgdb):
$ grep "Command execution time" testsuite/gdb.log
Command execution time: 0.614215 (cpu), 0.627181 (wall)
Command execution time: 0.020337 (cpu), 0.020627 (wall)
Command execution time: 1.212774 (cpu), 1.235801 (wall)
Command execution time: 0.020677 (cpu), 0.020959 (wall)
Command execution time: 1.195660 (cpu), 1.218044 (wall)
- Before patch, Linux, release dbgapi build (+ debug rocgdb):
$ grep "Command execution time" testsuite/gdb.log
Command execution time: 0.196843 (cpu), 0.206161 (wall)
Command execution time: 0.009141 (cpu), 0.009376 (wall)
Command execution time: 0.355500 (cpu), 0.373839 (wall)
Command execution time: 0.011062 (cpu), 0.011291 (wall)
Command execution time: 0.345165 (cpu), 0.363483 (wall)
After the patch, on the same Linux machine I get around the following
times:
- After patch, Linux, debug dbgapi build (+ debug rocgdb):
$ grep "Command execution time" testsuite/gdb.log
Command execution time: 0.194576 (cpu), 0.204046 (wall)
Command execution time: 0.009021 (cpu), 0.009235 (wall)
Command execution time: 0.382576 (cpu), 0.419615 (wall)
Command execution time: 0.009179 (cpu), 0.009406 (wall)
Command execution time: 0.379717 (cpu), 0.397690 (wall)
- After patch, Linux, release dbgapi build (+ debug rocgdb):
$ grep "Command execution time" testsuite/gdb.log
Command execution time: 0.064394 (cpu), 0.073941 (wall)
Command execution time: 0.005430 (cpu), 0.005668 (wall)
Command execution time: 0.115358 (cpu), 0.134396 (wall)
Command execution time: 0.005047 (cpu), 0.005300 (wall)
Command execution time: 0.115153 (cpu), 0.134178 (wall)
So roughly a 3x speedup for the slower steps, in both debug and
release builds.
On my Windows machine with a gfx1201, just the first "next" command
takes between 200 and 440 seconds (sic!), resulting in
gdb.rocm/step-many-waves.exp timeouts.
With this patch, on the same Windows machine, I get around the
following times:
Windows, debug dbgapi build (+ debug rocgdb):
$ grep "Command execution time" testsuite/gdb.log
Command execution time: 1.961000 (cpu), 1.961469 (wall)
Command execution time: 0.129000 (cpu), 0.129094 (wall)
Command execution time: 1.522000 (cpu), 1.522097 (wall)
Command execution time: 2.065000 (cpu), 2.064715 (wall)
Command execution time: 0.127000 (cpu), 0.127054 (wall)
Windows, release dbgapi build (+ debug rocgdb):
$ grep "Command execution time" gdb.log
Command execution time: 1.391000 (cpu), 1.391136 (wall)
Command execution time: 0.111000 (cpu), 0.111063 (wall)
Command execution time: 1.267000 (cpu), 1.267022 (wall)
Command execution time: 1.783000 (cpu), 1.783264 (wall)
Command execution time: 0.127000 (cpu), 0.126915 (wall)
===================
The problem in the current code is that we do thousands of accesses
out of the save area region, and memory accesses on Windows are slower
than on Linux. While on Linux reading a few bytes out of the inferior
takes in the order of 10 microseconds, on Windows, it takes from 2 to
20 milliseconds, with most accesses falling in the 3ms region.  It's
mostly a latency problem -- larger transfers take only marginally
longer than smaller transfers.  Since each memory access carries a
high latency on Windows, it's important to batch as much as we can
into fewer, larger accesses.
We already have a caching mechanism in dbgapi that caches a small
portion of the save area whenever we update waves.
Based on a suggestion from Lancelot, this patch starts by making us
cache the whole save area instead, reducing thousands of xfers down to
one.
That's just the beginning, though.
If we do that, then we still see many memory writes, most 128 bytes
long each. Those come from the memory cache write back code. 128
bytes is a multiple of the size of the cache lines, which is 64 bytes.
We see many writes because even though the code already coalesces
contiguous cache lines, it only does so for contiguous DIRTY cache
lines.  I.e., given the following contiguous cache lines 1 through 9,
with D for dirty and C for clean:
1 2 3 4 5 6 7 8 9
D D C D C D C D C
then we write back cache lines 1 and 2 in a single memory write call,
and then three more separate calls for 4, 6 and 8.
I tried improving that by making the write back code write lines 1
through 8 in one go. That did improve timings on Windows
substantially already, from the 440s for one single "next" in the
gdb.rocm/step-many-waves.exp testcase down to between 6s and 20s.
Still far from ideal, but way better. However, the resulting patch
degraded the performance on Linux by some 10x, which is unacceptable.
The reason for that slowdown on Linux was simply the cache lines
allocation. I.e., with a prefetch save area size of 12134416 on my
machine, and 64 bytes long cache lines, we end up allocating some 190k
cache lines, each a separate heap allocation... On Windows, for a
single prefetch call, we would now be spending between 3 and 8 ms
reading memory from the inferior, and then still spend 200ms
allocating the cache lines, in the loop at the bottom of
memory_cache_t<AddressType>::prefetch.  The times on Linux were lower,
but of similar shape -- the memory xfer is fast, and then the
allocation step can also take up to 100ms. This completely explains
the 10x slowdown on Linux.
Now, I considered several ways to address this, including increasing
the cache line size, reusing cache lines across discard/prefill calls,
and others. However, there is one important detail -- there is ONE
single prefetch call in the whole codebase. The memory cache
mechanism is solely used for the save areas, nothing else. There is
no code that wants to cache random areas without a clear access
pattern. And as I looked deeper into the memory cache, I saw
unnecessary complication:
- The memory cache mechanism supports several caching policies, but
only one is ever used.
- There are some entry points in the memory_cache_t class that are
not used at all anywhere:
void fetch_cache_line (cache_line_t &cache_line, AddressType address) const;
void commit_cache_line (cache_line_t &cache_line, AddressType address) const;
void allocate_0_cache_line (cache_line_t &cache_line) const;
- memory_cache_t is a template that is templated on address type,
host or agent. However, the host instantiation isn't used for
anything at all.
This suggests to me that we can redesign the class, make it more tuned
to save area caching, and simplify it at the same time. That's what
this patch does.
- Make memory_cache_t no longer be a template. Make it only work
with agent addresses.
- Remove the unneeded entry points listed above.
- Remove the concept of cache lines, and replace it with cache
entries, the size of the whole cache request area.
- Since we now have a full contiguous block of storage per cache
entry, we can return the pointer to the cache entry directly to the
caller of prefetch, avoiding one redundant memory allocation in
update_waves. This improves performance a bit.
- Reuse cache entries. This also improves performance a bit.
- Writing back an entry is a single memory xfer. Track a range of
dirty bytes per entry instead of a boolean, so that the write
avoids writing non-dirty bytes unnecessarily. E.g., the control
stack region doesn't need to be written back. This also improves
performance.
- Assume (and document, and assert) that discarding never tries to
discard a partial cached entry.
If we ever want to go back to having a more generalized memory cache,
to keep the efficiency gains of this commit, I'd suggest doing it by
reintroducing aligned cache lines, but making the lines have a
variable width that is a multiple of the cache line size, and
introducing a way to force
aggregating smaller blocks into one larger contiguous block, so that
the semantics of the memory_cache_t::new_entry entry point in the new
code, of returning a pointer to a contiguous block, can be preserved
in the save area cases. For now, since we don't need it, that would
just seem like extra unnecessary complication to me.
Finally, the memory_cache_t::contains_all call in
wave_t::read_register is no longer necessary as the cache is
guaranteed to always have the whole save area.
Tested on both Linux and Windows.
Change-Id: Iffa425669dbd989535a2f1c2945bea4720a9589f
commit-id:13cd2c42
"xfer_global_memory" is kind of a reminiscent of the Linux-only time where the cache was dealing with global addresses. Even if yes, from the GPU perspective it is called "global", it is not necessarily global in the "unified global" sense. Since the memory cache is now specialized to agent_address_t, rename: xfer_global_memory => xfer_agent_memory read_global_memory => read_agent_memory write_global_memory => write_agent_memory Pass the agent to the memory_cache_t ctor, so we can use it in the memory_access_error_t throw in memory_cache_t::write_back, fixing a FIXME: instead of using address_space_t::global(), use the agent's address space. Based on idea from Lancelot. Change-Id: Ib93ea39dcb9648a509a80c1dbd9f8bf79905587c commit-id:c509fa86