verify Bitpacker4x SSE compression is actually useful #60

@trinity-1686a

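Benchmarking bitpacking on a laptop with a Ryzen PRO 5850U, it seems the handwritten SSE code is vital to decompression but detrimental to compression. The effect is concentrated in the delta and strict-delta variants: without the SSE code, delta decompression takes roughly 5x to 8x as long, while delta compression gets about 5% to 9% faster. The plain (non-delta) benchmarks move by only a few percent either way.
Below is a run of cargo bench: the reference is current main, and the results of this run are with anything "sse"- or "x86_64"-specific removed.
This suggests we may be better off ditching or rewriting the compression part, and that the fallback implementation of decompression could be made friendlier to auto-vectorisation, so that it is faster on platforms without hand-optimized code.
It would be interesting if someone could reproduce this on other x86 devices.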
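
To illustrate what "friendlier to auto-vectorisation" could mean for delta decompression, here is a minimal sketch. It is not the crate's code: the function names, the fixed block of 128 values, and the assumption that deltas are taken between consecutive integers are all hypothetical choices for this example. The idea is to replace the serial running sum with a per-chunk prefix sum over four lanes, which gives the compiler independent additions to work with. Whether the compiler actually vectorises it would need to be checked in the generated assembly and in the benchmarks.

```rust
/// Straightforward reference: a serial running sum.
/// Every output depends on the previous one, which is hard to vectorise.
/// (Hypothetical helper, not part of the bitpacking crate.)
fn integrate_deltas_scalar(initial: u32, deltas: &[u32; 128], out: &mut [u32; 128]) {
    let mut acc = initial;
    for i in 0..128 {
        acc = acc.wrapping_add(deltas[i]);
        out[i] = acc;
    }
}

/// Same result, computed in chunks of four lanes.
/// Inside a chunk the prefix sum takes two shift-and-add steps,
/// so only the carry between chunks remains serial.
/// (Hypothetical helper, not part of the bitpacking crate.)
fn integrate_deltas_chunked(initial: u32, deltas: &[u32; 128], out: &mut [u32; 128]) {
    let mut carry = initial;
    for (d, o) in deltas.chunks_exact(4).zip(out.chunks_exact_mut(4)) {
        // Step 1: add each lane to its left neighbour.
        let s1 = [
            d[0],
            d[1].wrapping_add(d[0]),
            d[2].wrapping_add(d[1]),
            d[3].wrapping_add(d[2]),
        ];
        // Step 2: add each lane to the lane two positions to its left.
        let s2 = [
            s1[0],
            s1[1],
            s1[2].wrapping_add(s1[0]),
            s1[3].wrapping_add(s1[1]),
        ];
        // Add the last value of the previous chunk to all four lanes.
        for i in 0..4 {
            o[i] = s2[i].wrapping_add(carry);
        }
        carry = o[3];
    }
}
```

Both functions should produce identical output for any input, so the scalar version can serve as a test oracle while experimenting with the chunked one.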

bench results
BitPacker4x/decompress-1
                        time:   [81.526 ns 81.683 ns 81.884 ns]
                        thrpt:  [15.632 Gelem/s 15.670 Gelem/s 15.701 Gelem/s]
                 change:
                        time:   [+5.9196% +6.2317% +6.5342%] (p = 0.00 < 0.05)
                        thrpt:  [-6.1334% -5.8661% -5.5888%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmarking BitPacker4x/decompress-delta-1: Collecting 100 samples in
estimated 5.0005 s (2.9M iterations

BitPacker4x/decompress-delta-1
                        time:   [1.6949 µs 1.6966 µs 1.6986 µs]
                        thrpt:  [753.58 Melem/s 754.47 Melem/s 755.21 Melem/s]
                 change:
                        time:   [+661.96% +665.09% +668.56%] (p = 0.00 < 0.05)
                        thrpt:  [-86.989% -86.930% -86.876%]
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) high mild
  14 (14.00%) high severe

Benchmarking BitPacker4x/decompress-strict-delta-1: Collecting 100
samples in estimated 5.0092 s (2.8M ite

BitPacker4x/decompress-strict-delta-1
                        time:   [1.8215 µs 1.8243 µs 1.8275 µs]
                        thrpt:  [700.42 Melem/s 701.65 Melem/s 702.71 Melem/s]
                 change:
                        time:   [+647.48% +650.01% +652.44%] (p = 0.00 < 0.05)
                        thrpt:  [-86.710% -86.667% -86.622%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

BitPacker4x/compress-1  time:   [141.59 ns 141.72 ns 141.88 ns]
                        thrpt:  [9.0220 Gelem/s 9.0318 Gelem/s 9.0403 Gelem/s]
                 change:
                        time:   [+3.1783% +3.4735% +3.7609%] (p = 0.00 < 0.05)
                        thrpt:  [-3.6246% -3.3569% -3.0804%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

BitPacker4x/compress-delta-1
                        time:   [219.40 ns 219.70 ns 220.06 ns]
                        thrpt:  [5.8165 Gelem/s 5.8261 Gelem/s 5.8342 Gelem/s]
                 change:
                        time:   [-7.2601% -6.6117% -5.9881%] (p = 0.00 < 0.05)
                        thrpt:  [+6.3695% +7.0798% +7.8284%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

Benchmarking BitPacker4x/compress-strict-delta-1: Collecting 100
samples in estimated 5.0012 s (20M iterat

BitPacker4x/compress-strict-delta-1
                        time:   [246.39 ns 246.82 ns 247.32 ns]
                        thrpt:  [5.1755 Gelem/s 5.1860 Gelem/s 5.1950 Gelem/s]
                 change:
                        time:   [-9.8551% -9.1131% -8.3893%] (p = 0.00 < 0.05)
                        thrpt:  [+9.1576% +10.027% +10.932%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

BitPacker4x/decompress-2
                        time:   [80.723 ns 80.803 ns 80.891 ns]
                        thrpt:  [15.824 Gelem/s 15.841 Gelem/s 15.857 Gelem/s]
                 change:
                        time:   [+1.7119% +2.4564% +3.0279%] (p = 0.00 < 0.05)
                        thrpt:  [-2.9389% -2.3975% -1.6831%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Benchmarking BitPacker4x/decompress-delta-2: Collecting 100 samples in
estimated 5.0083 s (2.9M iterations

BitPacker4x/decompress-delta-2
                        time:   [1.7146 µs 1.7182 µs 1.7227 µs]
                        thrpt:  [743.03 Melem/s 744.96 Melem/s 746.53 Melem/s]
                 change:
                        time:   [+717.28% +720.67% +724.08%] (p = 0.00 < 0.05)
                        thrpt:  [-87.865% -87.815% -87.764%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Benchmarking BitPacker4x/decompress-strict-delta-2: Collecting 100
samples in estimated 5.0073 s (2.7M ite

BitPacker4x/decompress-strict-delta-2
                        time:   [1.8311 µs 1.8353 µs 1.8402 µs]
                        thrpt:  [695.56 Melem/s 697.42 Melem/s 699.04 Melem/s]
                 change:
                        time:   [+631.17% +633.52% +635.97%] (p = 0.00 < 0.05)
                        thrpt:  [-86.413% -86.367% -86.323%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

BitPacker4x/compress-2  time:   [140.01 ns 140.24 ns 140.53 ns]
                        thrpt:  [9.1084 Gelem/s 9.1270 Gelem/s 9.1423 Gelem/s]
                 change:
                        time:   [+0.3380% +0.5743% +0.8256%] (p = 0.00 < 0.05)
                        thrpt:  [-0.8189% -0.5710% -0.3369%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

BitPacker4x/compress-delta-2
                        time:   [223.31 ns 223.66 ns 224.07 ns]
                        thrpt:  [5.7125 Gelem/s 5.7231 Gelem/s 5.7319 Gelem/s]
                 change:
                        time:   [-5.6500% -5.0903% -4.5382%] (p = 0.00 < 0.05)
                        thrpt:  [+4.7539% +5.3633% +5.9883%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe

Benchmarking BitPacker4x/compress-strict-delta-2: Collecting 100
samples in estimated 5.0004 s (20M iterat

BitPacker4x/compress-strict-delta-2
                        time:   [250.31 ns 250.79 ns 251.37 ns]
                        thrpt:  [5.0921 Gelem/s 5.1039 Gelem/s 5.1138 Gelem/s]
                 change:
                        time:   [-6.7716% -6.0884% -5.4438%] (p = 0.00 < 0.05)
                        thrpt:  [+5.7572% +6.4831% +7.2634%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

BitPacker4x/decompress-24
                        time:   [98.654 ns 98.799 ns 98.986 ns]
                        thrpt:  [12.931 Gelem/s 12.956 Gelem/s 12.975 Gelem/s]
                 change:
                        time:   [+4.1156% +4.3488% +4.5934%] (p = 0.00 < 0.05)
                        thrpt:  [-4.3917% -4.1675% -3.9529%]
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

Benchmarking BitPacker4x/decompress-delta-24: Collecting 100 samples
in estimated 5.0056 s (4.3M iteration

BitPacker4x/decompress-delta-24
                        time:   [1.1541 µs 1.1571 µs 1.1608 µs]
                        thrpt:  [1.1027 Gelem/s 1.1062 Gelem/s 1.1090 Gelem/s]
                 change:
                        time:   [+435.63% +436.93% +438.26%] (p = 0.00 < 0.05)
                        thrpt:  [-81.422% -81.376% -81.330%]
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  17 (17.00%) high severe

Benchmarking BitPacker4x/decompress-strict-delta-24: Collecting 100
samples in estimated 5.0008 s (3.8M it

BitPacker4x/decompress-strict-delta-24
                        time:   [1.3251 µs 1.3279 µs 1.3315 µs]
                        thrpt:  [961.33 Melem/s 963.90 Melem/s 965.98 Melem/s]
                 change:
                        time:   [+451.99% +453.78% +455.70%] (p = 0.00 < 0.05)
                        thrpt:  [-82.005% -81.942% -81.884%]
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  9 (9.00%) high mild
  12 (12.00%) high severe

BitPacker4x/compress-24 time:   [153.40 ns 153.55 ns 153.74 ns]
                        thrpt:  [8.3258 Gelem/s 8.3360 Gelem/s 8.3445 Gelem/s]
                 change:
                        time:   [-0.5924% -0.4011% -0.2184%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2189% +0.4028% +0.5959%]
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

BitPacker4x/compress-delta-24
                        time:   [190.28 ns 190.61 ns 190.99 ns]
                        thrpt:  [6.7020 Gelem/s 6.7152 Gelem/s 6.7271 Gelem/s]
                 change:
                        time:   [-5.6890% -5.4981% -5.2935%] (p = 0.00 < 0.05)
                        thrpt:  [+5.5894% +5.8179% +6.0322%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

Benchmarking BitPacker4x/compress-strict-delta-24: Collecting 100
samples in estimated 5.0004 s (21M itera

BitPacker4x/compress-strict-delta-24
                        time:   [234.94 ns 235.30 ns 235.75 ns]
                        thrpt:  [5.4296 Gelem/s 5.4399 Gelem/s 5.4482 Gelem/s]
                 change:
                        time:   [-7.5641% -7.3679% -7.1672%] (p = 0.00 < 0.05)
                        thrpt:  [+7.7205% +7.9539% +8.1831%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

BitPacker4x/decompress-31
                        time:   [119.15 ns 119.39 ns 119.63 ns]
                        thrpt:  [10.699 Gelem/s 10.722 Gelem/s 10.743 Gelem/s]
                 change:
                        time:   [-0.6080% -0.3993% -0.1623%] (p = 0.00 < 0.05)
                        thrpt:  [+0.1625% +0.4009% +0.6117%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

Benchmarking BitPacker4x/decompress-delta-31: Collecting 100 samples
in estimated 5.0038 s (2.8M iteration

BitPacker4x/decompress-delta-31
                        time:   [1.7672 µs 1.7729 µs 1.7799 µs]
                        thrpt:  [719.15 Melem/s 721.97 Melem/s 724.33 Melem/s]
                 change:
                        time:   [+609.60% +611.46% +613.44%] (p = 0.00 < 0.05)
                        thrpt:  [-85.983% -85.944% -85.908%]
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) high mild
  12 (12.00%) high severe

Benchmarking BitPacker4x/decompress-strict-delta-31: Collecting 100
samples in estimated 5.0033 s (2.6M it

BitPacker4x/decompress-strict-delta-31
                        time:   [1.9287 µs 1.9305 µs 1.9327 µs]
                        thrpt:  [662.30 Melem/s 663.02 Melem/s 663.67 Melem/s]
                 change:
                        time:   [+593.50% +598.06% +601.80%] (p = 0.00 < 0.05)
                        thrpt:  [-85.751% -85.675% -85.580%]
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  2 (2.00%) high mild
  16 (16.00%) high severe

BitPacker4x/compress-31 time:   [167.44 ns 167.73 ns 168.06 ns]
                        thrpt:  [7.6165 Gelem/s 7.6315 Gelem/s 7.6447 Gelem/s]
                 change:
                        time:   [-1.2386% -1.0825% -0.9272%] (p = 0.00 < 0.05)
                        thrpt:  [+0.9358% +1.0943% +1.2541%]
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe

BitPacker4x/compress-delta-31
                        time:   [175.49 ns 175.73 ns 176.01 ns]
                        thrpt:  [7.2724 Gelem/s 7.2841 Gelem/s 7.2940 Gelem/s]
                 change:
                        time:   [-5.4460% -5.2640% -5.0936%] (p = 0.00 < 0.05)
                        thrpt:  [+5.3670% +5.5565% +5.7597%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  7 (7.00%) high mild
  11 (11.00%) high severe

Benchmarking BitPacker4x/compress-strict-delta-31: Collecting 100
samples in estimated 5.0007 s (22M itera

BitPacker4x/compress-strict-delta-31
                        time:   [225.08 ns 225.44 ns 225.88 ns]
                        thrpt:  [5.6668 Gelem/s 5.6779 Gelem/s 5.6869 Gelem/s]
                 change:
                        time:   [-8.5164% -8.2743% -8.0374%] (p = 0.00 < 0.05)
                        thrpt:  [+8.7398% +9.0207% +9.3092%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
