verify Bitpacker4x Neon implementation is actually useful #59

@trinity-1686a

When benchmarking bitpacking on an Apple M3 Max-powered laptop, the handwritten Neon code appears to be detrimental to performance.
Below is a run of cargo bench. The reference is current main, and the results of this run are with anything "neon"- or "aarch"-specific removed. There is little impact on plain bitpacking, but the delta and strict-delta variants improve across the board, taking roughly 6% to 53% less time.

It would be interesting if someone could reproduce this on a different Arm-powered device.

bench results
BitPacker4x/decompress-1                                                                            
                        time:   [52.879 ns 52.905 ns 52.941 ns]
                        thrpt:  [24.178 Gelem/s 24.194 Gelem/s 24.206 Gelem/s]
                 change:
                        time:   [-1.0221% -0.8966% -0.7573%] (p = 0.00 < 0.05)
                        thrpt:  [+0.7631% +0.9047% +1.0326%]
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  5 (5.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

BitPacker4x/decompress-delta-1                                                                             
                        time:   [665.92 ns 666.13 ns 666.32 ns]
                        thrpt:  [1.9210 Gelem/s 1.9215 Gelem/s 1.9222 Gelem/s]
                 change:
                        time:   [-53.512% -53.470% -53.426%] (p = 0.00 < 0.05)
                        thrpt:  [+114.71% +114.92% +115.11%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  4 (4.00%) high severe

BitPacker4x/decompress-strict-delta-1                                                                             
                        time:   [921.91 ns 923.87 ns 926.28 ns]
                        thrpt:  [1.3819 Gelem/s 1.3855 Gelem/s 1.3884 Gelem/s]
                 change:
                        time:   [-29.550% -29.334% -29.123%] (p = 0.00 < 0.05)
                        thrpt:  [+41.090% +41.511% +41.945%]
                        Performance has improved.

BitPacker4x/compress-1  time:   [104.32 ns 104.44 ns 104.59 ns]                                   
                        thrpt:  [12.239 Gelem/s 12.255 Gelem/s 12.270 Gelem/s]
                 change:
                        time:   [-0.9624% -0.7354% -0.5054%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5080% +0.7408% +0.9718%]
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

BitPacker4x/compress-delta-1                                                                            
                        time:   [154.11 ns 154.40 ns 154.66 ns]
                        thrpt:  [8.2765 Gelem/s 8.2900 Gelem/s 8.3057 Gelem/s]
                 change:
                        time:   [-20.762% -20.633% -20.503%] (p = 0.00 < 0.05)
                        thrpt:  [+25.791% +25.997% +26.202%]
                        Performance has improved.

BitPacker4x/compress-strict-delta-1                                                                            
                        time:   [176.85 ns 177.13 ns 177.49 ns]
                        thrpt:  [7.2117 Gelem/s 7.2264 Gelem/s 7.2377 Gelem/s]
                 change:
                        time:   [-7.9960% -7.7879% -7.5765%] (p = 0.00 < 0.05)
                        thrpt:  [+8.1976% +8.4456% +8.6909%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

BitPacker4x/decompress-2                                                                            
                        time:   [52.683 ns 52.861 ns 53.024 ns]
                        thrpt:  [24.140 Gelem/s 24.214 Gelem/s 24.296 Gelem/s]
                 change:
                        time:   [-1.9552% -1.7086% -1.4512%] (p = 0.00 < 0.05)
                        thrpt:  [+1.4726% +1.7383% +1.9942%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

BitPacker4x/decompress-delta-2                                                                             
                        time:   [661.34 ns 662.67 ns 664.36 ns]
                        thrpt:  [1.9267 Gelem/s 1.9316 Gelem/s 1.9355 Gelem/s]
                 change:
                        time:   [-47.969% -47.723% -47.464%] (p = 0.00 < 0.05)
                        thrpt:  [+90.346% +91.289% +92.192%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe

BitPacker4x/decompress-strict-delta-2                                                                             
                        time:   [917.22 ns 921.65 ns 926.67 ns]
                        thrpt:  [1.3813 Gelem/s 1.3888 Gelem/s 1.3955 Gelem/s]
                 change:
                        time:   [-25.226% -24.839% -24.468%] (p = 0.00 < 0.05)
                        thrpt:  [+32.394% +33.048% +33.737%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

BitPacker4x/compress-2  time:   [102.32 ns 102.80 ns 103.41 ns]                                   
                        thrpt:  [12.378 Gelem/s 12.452 Gelem/s 12.510 Gelem/s]
                 change:
                        time:   [-2.3590% -1.9400% -1.4818%] (p = 0.00 < 0.05)
                        thrpt:  [+1.5041% +1.9784% +2.4160%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

BitPacker4x/compress-delta-2                                                                            
                        time:   [155.74 ns 156.14 ns 156.57 ns]
                        thrpt:  [8.1755 Gelem/s 8.1980 Gelem/s 8.2190 Gelem/s]
                 change:
                        time:   [-8.8859% -8.6692% -8.4502%] (p = 0.00 < 0.05)
                        thrpt:  [+9.2302% +9.4921% +9.7524%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

BitPacker4x/compress-strict-delta-2                                                                            
                        time:   [177.30 ns 177.97 ns 178.71 ns]
                        thrpt:  [7.1623 Gelem/s 7.1923 Gelem/s 7.2196 Gelem/s]
                 change:
                        time:   [-5.9149% -5.6137% -5.3176%] (p = 0.00 < 0.05)
                        thrpt:  [+5.6163% +5.9476% +6.2867%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

BitPacker4x/decompress-24                                                                            
                        time:   [60.974 ns 61.069 ns 61.173 ns]
                        thrpt:  [20.924 Gelem/s 20.960 Gelem/s 20.993 Gelem/s]
                 change:
                        time:   [-0.6847% -0.4837% -0.2973%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2982% +0.4860% +0.6895%]
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

BitPacker4x/decompress-delta-24                                                                             
                        time:   [533.76 ns 534.65 ns 535.61 ns]
                        thrpt:  [2.3898 Gelem/s 2.3941 Gelem/s 2.3981 Gelem/s]
                 change:
                        time:   [-52.624% -52.456% -52.290%] (p = 0.00 < 0.05)
                        thrpt:  [+109.60% +110.33% +111.08%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild

BitPacker4x/decompress-strict-delta-24                                                                             
                        time:   [795.94 ns 799.04 ns 802.41 ns]
                        thrpt:  [1.5952 Gelem/s 1.6019 Gelem/s 1.6082 Gelem/s]
                 change:
                        time:   [-27.538% -27.277% -27.006%] (p = 0.00 < 0.05)
                        thrpt:  [+36.997% +37.509% +38.004%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe

BitPacker4x/compress-24 time:   [105.41 ns 105.60 ns 105.81 ns]                                    
                        thrpt:  [12.097 Gelem/s 12.121 Gelem/s 12.143 Gelem/s]
                 change:
                        time:   [+0.3718% +0.6612% +0.9934%] (p = 0.00 < 0.05)
                        thrpt:  [-0.9836% -0.6568% -0.3705%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

BitPacker4x/compress-delta-24                                                                            
                        time:   [130.25 ns 130.53 ns 130.85 ns]
                        thrpt:  [9.7819 Gelem/s 9.8061 Gelem/s 9.8272 Gelem/s]
                 change:
                        time:   [-28.061% -27.890% -27.711%] (p = 0.00 < 0.05)
                        thrpt:  [+38.333% +38.677% +39.006%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

BitPacker4x/compress-strict-delta-24                                                                            
                        time:   [147.13 ns 147.37 ns 147.60 ns]
                        thrpt:  [8.6721 Gelem/s 8.6853 Gelem/s 8.6997 Gelem/s]
                 change:
                        time:   [-21.304% -21.093% -20.890%] (p = 0.00 < 0.05)
                        thrpt:  [+26.407% +26.731% +27.072%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

BitPacker4x/decompress-31                                                                            
                        time:   [71.861 ns 71.909 ns 71.960 ns]
                        thrpt:  [17.788 Gelem/s 17.800 Gelem/s 17.812 Gelem/s]
                 change:
                        time:   [-1.0571% -0.7500% -0.4758%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4781% +0.7556% +1.0684%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

BitPacker4x/decompress-delta-31                                                                             
                        time:   [635.70 ns 636.28 ns 636.88 ns]
                        thrpt:  [2.0098 Gelem/s 2.0117 Gelem/s 2.0135 Gelem/s]
                 change:
                        time:   [-49.275% -49.152% -49.047%] (p = 0.00 < 0.05)
                        thrpt:  [+96.258% +96.664% +97.142%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

BitPacker4x/decompress-strict-delta-31                                                                             
                        time:   [941.63 ns 944.23 ns 947.10 ns]
                        thrpt:  [1.3515 Gelem/s 1.3556 Gelem/s 1.3593 Gelem/s]
                 change:
                        time:   [-24.421% -24.203% -23.991%] (p = 0.00 < 0.05)
                        thrpt:  [+31.563% +31.932% +32.312%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

BitPacker4x/compress-31 time:   [123.71 ns 124.11 ns 124.57 ns]                                    
                        thrpt:  [10.275 Gelem/s 10.313 Gelem/s 10.347 Gelem/s]
                 change:
                        time:   [-1.2531% -0.9205% -0.5680%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5712% +0.9290% +1.2690%]
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

BitPacker4x/compress-delta-31                                                                            
                        time:   [112.19 ns 112.36 ns 112.53 ns]
                        thrpt:  [11.374 Gelem/s 11.392 Gelem/s 11.409 Gelem/s]
                 change:
                        time:   [-38.633% -38.431% -38.246%] (p = 0.00 < 0.05)
                        thrpt:  [+61.932% +62.420% +62.955%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

BitPacker4x/compress-strict-delta-31                                                                            
                        time:   [128.39 ns 128.65 ns 128.94 ns]
                        thrpt:  [9.9272 Gelem/s 9.9495 Gelem/s 9.9699 Gelem/s]
                 change:
                        time:   [-30.464% -30.310% -30.148%] (p = 0.00 < 0.05)
                        thrpt:  [+43.160% +43.494% +43.811%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
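
The time and thrpt change lines in the results above describe the same measurement from two sides, which is why a time reduction of about 53% shows up as a throughput gain of about 115%. The following is a minimal sketch of that conversion, not part of the bitpacking crate or of Criterion; the function names `throughput_change` and `speedup` are hypothetical and exist only for illustration.

```rust
/// Converts a relative change in time per iteration into the matching
/// relative change in throughput. Both values are fractions, so -0.5
/// means "-50%". Throughput is inversely proportional to time.
/// (Hypothetical helper, for illustration only.)
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

/// Speedup factor implied by a relative change in time per iteration.
/// (Hypothetical helper, for illustration only.)
fn speedup(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change)
}

// Example with the decompress-delta-1 point estimate from the results:
// throughput_change(-0.53470) is about 1.149, i.e. roughly +114.9%,
// and speedup(-0.53470) is about 2.15x.
```

Read this way, the decompress-delta cases run roughly twice as fast without the Neon-specific code, while the plain compress and decompress cases stay within a couple of percent.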
