@gpshead gpshead commented Jan 22, 2026

Move the equality check out of the hot loop to allow better compiler
optimization. Instead of checking each byte during translation, perform
a single memcmp at the end to determine if the input can be returned
unchanged.

This allows compilers to unroll and pipeline the loops, resulting in ~2x
throughput improvement for medium-to-large inputs (tested on an AMD zen2).
No change observed on small inputs.

It will also be faster for bytes subclasses as those do not need change
detection.
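
As an illustration of the two loop shapes described above, here is a minimal standalone sketch. It is not the CPython source: the function names, signatures, and the plain `unsigned char` buffers are assumptions made for the example, and it assumes `in` and `out` do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the CPython function: translate `len` bytes of
 * `in` through a 256-entry `table` into `out`, checking for a change on
 * every byte (the shape of the loop before this change).
 * Returns 1 if any output byte differs from its input byte, else 0. */
static int
translate_check_in_loop(const unsigned char *in, unsigned char *out,
                        size_t len, const unsigned char table[256])
{
    int changed = 0;
    assert(in != NULL && out != NULL && table != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        out[i] = table[c];
        if (out[i] != c) {
            changed = 1;  /* comparison and branch inside the hot loop */
        }
    }
    return changed;
}

/* Same return value, but the loop body is only a table lookup and the
 * change detection is a single memcmp afterwards (the shape after this
 * change). `in` and `out` must be distinct, non-overlapping buffers. */
static int
translate_then_memcmp(const unsigned char *in, unsigned char *out,
                      size_t len, const unsigned char table[256])
{
    assert(in != NULL && out != NULL && table != NULL);
    for (size_t i = 0; i < len; i++) {
        out[i] = table[in[i]];
    }
    /* memcmp may return as soon as it finds the first differing byte. */
    return memcmp(in, out, len) != 0;
}
```

Both functions produce the same output buffer and the same changed/unchanged answer; only the second leaves the compiler a loop body with no comparison in it. A caller that never needs the answer (the bytes-subclass case mentioned above) could skip the `memcmp` entirely.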


@vstinner vstinner left a comment

LGTM

@vstinner vstinner left a comment

The loop already ran `*output++ = table_chars[c];` before the change; the change only moves the `changed = 1` logic outside the loop.

@gpshead gpshead merged commit a966d94 into python:main Jan 22, 2026
53 checks passed
@gpshead gpshead self-assigned this Jan 22, 2026
@sergey-miryanov

May I ask what input lengths you tested (medium-to-large inputs)?

gpshead commented Jan 22, 2026

> May I ask what input lengths you tested (medium-to-large inputs)?

I tested sizes from 64 bytes to 256k with a microbenchmark, https://github.com/gpshead/cpython/blob/6d1b11ac1d84228f5ee7b5d4f3ab0c7fb77b7719/Tools/scripts/translate_bench.py#L454-L457, run with --bytes_only. Claude wrote that and I didn't spend much time looking it over. I'd have written it a bit differently myself to reduce overhead further given it's a microbenchmark, but it works and demonstrates the change and the lack of a tiny-data regression regardless.

Skimming my data, the result was already a clear 10-15% improvement at 64 bytes and approached 2x as the size got larger on my zen2.

I didn't spend time looking at the asm generated, but it makes sense in this case: that "changed" test was being done in the loop for every byte despite being something that only needs to short-circuit. This way it is removed from the loop, and the hot paths of translation and (when needed) change detection are both parallelizable memory-streaming operations. Change detection still short-circuits: the memcmp exits upon the first changed byte. (Thus an identity translation with no changes sees a slightly lower performance gain than the others, since its memcmp has to scan the whole buffer.)

Roughly a 2x speedup for large inputs. For smaller inputs (64-127 bytes), the gains are more modest at 8-25% faster where the fixed overhead of the call dominates. I neglected to measure smaller than that, but I do not expect any meaningfully measurable regression.

Detailed table (x86_64 zen2, gcc 15.2). The two builds being compared are shown side by side; the left column has the lower ns/call at every size.
bytes: nibble swap (no del)                                                       |bytes: nibble swap (no del)
------------------------------------------------------------                      |------------------------------------------------------------
      Size      ns/call       GB/s    Out len                                     |      Size      ns/call       GB/s    Out len
------------------------------------------------------------                      |------------------------------------------------------------
        64         88.2       0.73         64                                     |        64         95.9       0.67         64
       100        106.2       0.94        100                                     |       100        127.8       0.78        100
       127        109.5       1.16        127                                     |       127        142.2       0.89        127
       256        147.2       1.74        256                                     |       256        221.1       1.16        256
       500        230.3       2.17        500                                     |       500        373.0       1.34        500
      1000        427.1       2.34       1000                                     |      1000        723.8       1.38       1000
      1024        441.6       2.32       1024                                     |      1024        729.2       1.40       1024
      4096       1324.8       3.09       4096                                     |      4096       2590.0       1.58       4096
     16384       4978.0       3.29      16384                                     |     16384       9922.2       1.65      16384
     65536      19391.2       3.38      65536                                     |     65536      39363.0       1.66      65536
    262144      77818.4       3.37     262144                                     |    262144     155573.2       1.69     262144
                                                                                  |
bytes: identity (no del)                                                          |bytes: identity (no del)
------------------------------------------------------------                      |------------------------------------------------------------
      Size      ns/call       GB/s    Out len                                     |      Size      ns/call       GB/s    Out len
------------------------------------------------------------                      |------------------------------------------------------------
        64         89.2       0.72         64                                     |        64         98.8       0.65         64
       100        109.5       0.91        100                                     |       100        131.9       0.76        100
       127        110.4       1.15        127                                     |       127        146.3       0.87        127
       256        145.8       1.76        256                                     |       256        232.9       1.10        256
       500        234.8       2.13        500                                     |       500        380.8       1.31        500
      1000        421.6       2.37       1000                                     |      1000        706.9       1.41       1000
      1024        433.8       2.36       1024                                     |      1024        724.9       1.41       1024
      4096       1335.4       3.07       4096                                     |      4096       2526.7       1.62       4096
     16384       5109.1       3.21      16384                                     |     16384       9808.6       1.67      16384
     65536      20334.0       3.22      65536                                     |     65536      39629.5       1.65      65536
    262144      82627.2       3.17     262144                                     |    262144     156007.6       1.68     262144
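
To relate the ns/call columns to the quoted "~2x", one can divide the slower timing by the faster one for the same size. A small sketch, assuming the left column is the build with this change and the right column is the build without it (the table itself does not label them); `speedup` is a hypothetical helper name:

```c
#include <assert.h>

/* Hypothetical helper: how many times faster the "after" timing is
 * than the "before" timing for the same input size. */
static double
speedup(double before_ns, double after_ns)
{
    assert(before_ns > 0.0 && after_ns > 0.0);
    return before_ns / after_ns;
}
```

For the 262144-byte nibble swap row, `speedup(155573.2, 77818.4)` is roughly 2.0; for the 64-byte row, `speedup(95.9, 88.2)` is roughly 1.09.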

Other platforms?

Rerunning the benchmark on 32-bit raspbian (arm32) on an rpi5, there are still gains, but the overall result is less impressive: 2%-30% at most for 64 bytes on up. I included smaller sizes (8, 20, and 32 bytes) in this run; those were slightly slower, but close enough that it could be in the noise. This lower-spec arm probably doesn't pipeline as well or coalesce writes.

Rerunning it on 64-bit raspbian (arm64) on an rpi4 (wow, those feel slow these days...) gave much better gains than the arm32 run above, closer to what x86_64 zen2 saw: 10%-170% for 64 bytes through 256k, and insignificant for 32 bytes and below.
