bench(fsst): add alloc-free decompress+match baselines for fair comparison#6903
Open
joseph-isaacs wants to merge 20 commits into develop from
Conversation
Add 13+ benchmark variants for FSST substring matching to compare optimization strategies for the contains DFA kernel:

- Split table (production baseline) vs fused 256-wide table
- Early exit vs no-early-exit variants
- Safe vs unsafe (bounds-check elimination)
- Branchless escape handling
- Interleaved batch processing (4/8/16 strings)
- SIMD gather (8 strings, u32 table, AVX2)
- Enumerated DFA (speculative all-start-states)
- Multi-string early exit with bitmask
- collect_bool chunk-of-64 alignment
- ClickBench-style long URL workload

Key findings (100K strings, needle "google"):

- Fused table + collect_bool + unsafe: 1.55ms (1.40x faster than prod)
- Fused table + collect_bool: 1.63ms (1.33x faster)
- Fused table one-at-a-time: 1.82ms (1.19x faster)
- Split table (production): 2.16ms (baseline)
- Interleaved batching: slower at all batch sizes
- Decompress then search: 11.85ms (5.5x slower)

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_01FtpYUQXvGND6mUHASHC614
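The fused-table variant above can be sketched at byte level: instead of a split per-symbol table plus an escape branch, every state owns a full 256-entry row and the accept state is sticky, so the scan is one unconditional table load per byte. This is a minimal illustrative sketch over plain bytes, not FSST codes; `build_dfa` and `contains` are hypothetical names, not the Vortex API:

```rust
/// Build a KMP-style substring automaton as a fused table: one full
/// 256-entry row per state, u8 states. Sketch only; the real kernel
/// runs over FSST codes, not raw bytes.
fn build_dfa(needle: &[u8]) -> Vec<[u8; 256]> {
    let n = needle.len();
    // States 0..n mean "matched i bytes so far"; state n is accept.
    let mut table = vec![[0u8; 256]; n + 1];
    for state in 0..n {
        for b in 0..256usize {
            // Next state = longest suffix of needle[..state] + b that is
            // also a prefix of the needle (naive computation; fine here).
            let mut cand = needle[..state].to_vec();
            cand.push(b as u8);
            let mut next = 0;
            for k in (1..=cand.len().min(n)).rev() {
                if cand[cand.len() - k..] == needle[..k] {
                    next = k;
                    break;
                }
            }
            table[state][b] = next as u8;
        }
    }
    // Accept is sticky: every byte transitions back to it, so no
    // early-exit branch is needed in the hot loop.
    table[n] = [n as u8; 256];
    table
}

fn contains(table: &[[u8; 256]], accept: u8, haystack: &[u8]) -> bool {
    let mut state = 0u8;
    for &b in haystack {
        // One load per byte, no branches in the loop body.
        state = table[state as usize][b as usize];
    }
    state == accept
}
```

Because the accept state is sticky, checking once at the end is equivalent to early exit, which is what makes the no-early-exit variants viable.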
…a to 4-bit states

- Add 4 new data generators: log lines, JSON strings, file paths, emails
- Add benchmarks for each data type with split_table, shift_dfa, compact, fused
- Add memchr::memmem benchmarks for SIMD-accelerated substring search comparison
- Bump ShiftDfa from 3-bit to 4-bit states (supports needles up to 14 chars)
- Add memchr as workspace dev-dependency
Add rare_* benchmarks with random alphanumeric strings where only ~0.001% contain the needle "xyzzy". These test DFA performance when almost nothing matches, which is the common case for selective predicates on large datasets. Includes a prefilter benchmark to measure code-level bitmap skip effectiveness.
New approach: precompute which codes keep the DFA in state 0, then skip leading trivial codes before starting the full DFA scan. Effective when the needle is rare (most codes map state 0 → 0).

Results on rare-match data (0.001%):

- rare_prefilter: 3.33ms (best for rare matches)
- rare_state_zero_skip: 3.86ms
- rare_shift_dfa: 6.94ms
- rare_compact_no_exit: 7.51ms
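The state-zero-skip idea reduces, at byte level, to skipping everything that cannot move the automaton out of state 0. In this simplified sketch the "trivial" set is exactly the bytes different from the needle's first byte, and a naive scan stands in for the full DFA; names are illustrative, not the Vortex implementation:

```rust
/// Skip the leading run of bytes that keep state 0 at state 0, then run
/// the full matcher on the remainder. Sketch only: over raw bytes the
/// trivial set is `b != needle[0]`; over FSST codes it is a precomputed
/// bitmap of codes whose symbols never start a match.
fn skip_then_search(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        return true;
    }
    // Cheap prefilter: find the first byte that can leave state 0.
    let start = match haystack.iter().position(|&b| b == needle[0]) {
        Some(p) => p,
        None => return false, // nothing can start a match
    };
    // Full scan on the survivor suffix (naive scan stands in for the DFA).
    haystack[start..].windows(needle.len()).any(|w| w == needle)
}
```

When matches are rare (the 0.001% case above), most strings exit in the prefilter without ever touching the transition table.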
…rison

Add decompress_no_alloc and decompress_no_alloc_memmem benchmarks that reuse a pre-allocated buffer instead of allocating per string. This gives a fair comparison against DFA approaches that also avoid allocation.

Key results (100K short URLs, needle "google"):

- shift_dfa_no_exit: 1.52ms (best DFA)
- decompress_no_alloc_memmem: 6.88ms (best decompress, 4.5x slower)
- decompress_no_alloc: 13.58ms (sliding window, 8.9x slower)
- decompress_then_search: 11.26ms (old baseline with allocs)

Key results (100K ClickBench URLs, needle "yandex"):

- cb_shift_dfa: 6.00ms (best DFA)
- cb_decompress_no_alloc_memmem: 22.33ms (3.7x slower)
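The alloc-free pattern these baselines measure is generic: keep one scratch buffer alive across the whole column and clear it per string instead of allocating per string. A hedged sketch with the FSST decompress step stubbed out by a copy (`search_all` is a hypothetical name):

```rust
/// Count strings containing `needle`, reusing a single scratch buffer.
/// The `extend_from_slice` stands in for decompressing one FSST string
/// into the scratch buffer; no per-string allocation occurs after the
/// buffer reaches its high-water capacity.
fn search_all(strings: &[&[u8]], needle: &[u8]) -> usize {
    assert!(!needle.is_empty());
    let mut scratch: Vec<u8> = Vec::with_capacity(1024);
    let mut hits = 0;
    for s in strings {
        scratch.clear(); // keeps capacity, so no realloc on the hot path
        scratch.extend_from_slice(s); // stand-in for FSST decompress
        if scratch.windows(needle.len()).any(|w| w == needle) {
            hits += 1;
        }
    }
    hits
}
```

This is the "fair" baseline: it isolates the cost of decompression itself from the cost of per-string allocation, which the old decompress_then_search baseline conflated.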
Hybrid approaches:

- PrefilterShiftDfa: code-level bitmap skip + ShiftDfa for survivors
- StateZeroShiftDfa: skip leading trivial codes + ShiftDfa for remainder

External crate benchmarks (on decompressed data):

- regex-automata: dense DFA and sparse DFA
- jetscii: PCMPESTRI-based substring search
- daachorse: double-array Aho-Corasick
Polar Signals Profiling Results
Benchmarks: PolarSignals Profiling Summary
datafusion / vortex-file-compressed (0.975x ➖, 1↑ 0↓)
Benchmarks: TPC-H SF=1 on NVMe Summary
datafusion / vortex-file-compressed (0.963x ➖, 6↑ 3↓)
datafusion / vortex-compact (0.972x ➖, 5↑ 1↓)
datafusion / parquet (0.973x ➖, 3↑ 0↓)
datafusion / arrow (0.931x ➖, 4↑ 2↓)
duckdb / vortex-file-compressed (1.017x ➖, 4↑ 5↓)
duckdb / vortex-compact (0.999x ➖, 7↑ 6↓)
duckdb / parquet (1.000x ➖, 4↑ 6↓)
duckdb / duckdb (0.986x ➖, 1↑ 0↓)
Benchmarks: FineWeb NVMe Summary
datafusion / vortex-file-compressed (1.020x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.049x ➖, 0↑ 2↓)
datafusion / parquet (1.034x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.938x ➖, 1↑ 0↓)
duckdb / vortex-compact (1.032x ➖, 0↑ 0↓)
duckdb / parquet (1.029x ➖, 0↑ 0↓)
Benchmarks: TPC-DS SF=1 on NVMe Summary
datafusion / vortex-file-compressed (1.060x ➖, 0↑ 13↓)
datafusion / vortex-compact (1.052x ➖, 0↑ 9↓)
datafusion / parquet (1.047x ➖, 0↑ 12↓)
duckdb / vortex-file-compressed (1.203x ❌, 15↑ 51↓)
duckdb / vortex-compact (1.253x ❌, 11↑ 65↓)
duckdb / parquet (1.025x ➖, 1↑ 6↓)
duckdb / duckdb (1.049x ➖, 0↑ 10↓)
Benchmarks: TPC-H SF=10 on NVMe Summary
datafusion / vortex-file-compressed (0.936x ➖, 2↑ 0↓)
datafusion / vortex-compact (0.946x ➖, 0↑ 0↓)
datafusion / parquet (0.955x ➖, 0↑ 0↓)
datafusion / arrow (0.932x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (1.097x ➖, 4↑ 7↓)
duckdb / vortex-compact (1.072x ➖, 4↑ 7↓)
duckdb / parquet (0.986x ➖, 0↑ 0↓)
duckdb / duckdb (0.978x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=1 on S3 Summary
datafusion / vortex-file-compressed (1.005x ➖, 0↑ 3↓)
datafusion / vortex-compact (0.876x ➖, 1↑ 0↓)
datafusion / parquet (0.944x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.971x ➖, 0↑ 1↓)
duckdb / vortex-compact (0.986x ➖, 0↑ 2↓)
duckdb / parquet (0.937x ➖, 0↑ 0↓)
Benchmarks: FineWeb S3 Summary
datafusion / vortex-file-compressed (0.937x ➖, 1↑ 0↓)
datafusion / vortex-compact (1.050x ➖, 0↑ 0↓)
datafusion / parquet (0.993x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.046x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.921x ➖, 0↑ 0↓)
duckdb / parquet (0.994x ➖, 0↑ 0↓)
Benchmarks: Random Access Summary
unknown / unknown (0.959x ➖, 8↑ 1↓)
Benchmarks: Statistical and Population Genetics Summary
duckdb / vortex-file-compressed (0.991x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.002x ➖, 0↑ 0↓)
duckdb / parquet (1.003x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=10 on S3 Summary
datafusion / vortex-file-compressed (0.890x ➖, 2↑ 0↓)
datafusion / vortex-compact (1.012x ➖, 0↑ 1↓)
datafusion / parquet (1.005x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.003x ➖, 1↑ 4↓)
duckdb / vortex-compact (0.994x ➖, 1↑ 1↓)
duckdb / parquet (0.937x ➖, 0↑ 0↓)
Benchmarks: ClickBench on NVMe Summary
datafusion / vortex-file-compressed (0.958x ➖, 5↑ 0↓)
datafusion / vortex-compact (1.029x ➖, 1↑ 7↓)
datafusion / parquet (0.975x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.916x ➖, 13↑ 2↓)
duckdb / vortex-compact (0.966x ➖, 8↑ 5↓)
duckdb / parquet (1.002x ➖, 0↑ 0↓)
duckdb / duckdb (0.982x ➖, 0↑ 1↓)
…-only benchmarks

Upgrade FsstContainsDfa in the production LIKE kernel from a split n_symbols-wide table with u16 states to a fused 256-entry table with u8 states. The fused table eliminates the ESCAPE_CODE branch from the hot path (handled via a sentinel), and u8 states halve the table size for better cache utilization.

Add decompress-only benchmarks (no search) for all 7 datasets to measure the raw cost of FSST decompression. DFA search on compressed codes is 2.3-4.9x faster than decompression alone.
Replace the fused u8 table DFA with a shift-based DFA that packs all state transitions into a u64 per code byte. The table load depends only on the code byte (not on the current state), breaking the load-use dependency chain that makes traditional table-lookup DFAs slow. For needles > 14 chars, it falls back to the fused u8 table.

Benchmarks show the shift DFA is fastest on most datasets:

- URLs: 1.6ms (shift) vs 1.8ms (fused)
- ClickBench: 5.9ms (shift) vs 6.5ms (fused)
- Log lines: 8.3ms (shift) vs 9.9ms (fused)
- JSON: 4.1ms (shift) vs 4.1ms (fused)
- Emails: 1.1ms (shift) vs 1.1ms (fused)
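The shift-based transition can be sketched at byte level: each input byte owns one u64 holding the next state for every current state in 4-bit fields, so the memory load depends only on the byte and successive loads are independent; only a cheap shift-and-mask stays on the serial chain. With 4-bit fields a u64 holds 16 states, matching the "needles up to 14 chars" limit above. This is a hypothetical stand-in over raw bytes, not the Vortex ShiftDfa:

```rust
/// Pack a substring automaton into one u64 per byte value: the 4-bit
/// field at offset `state * 4` holds the next state for that state.
/// Sketch only; needle length + 1 states must fit in 16 slots.
fn build_shift_dfa(needle: &[u8]) -> [u64; 256] {
    let n = needle.len();
    assert!(n + 1 <= 16, "4-bit states support at most 15-state automata");
    let mut table = [0u64; 256];
    for b in 0..256usize {
        let mut packed = 0u64;
        for state in 0..=n {
            let next = if state == n {
                n // sticky accept state
            } else {
                // Longest suffix of needle[..state] + b that is a prefix.
                let mut cand = needle[..state].to_vec();
                cand.push(b as u8);
                let mut k = cand.len().min(n);
                while k > 0 && cand[cand.len() - k..] != needle[..k] {
                    k -= 1;
                }
                k
            };
            packed |= (next as u64) << (state * 4);
        }
        table[b] = packed;
    }
    table
}

fn shift_dfa_contains(table: &[u64; 256], accept: u64, haystack: &[u8]) -> bool {
    let mut state = 0u64;
    for &b in haystack {
        // The load `table[b]` does not depend on `state`; only the
        // shift+mask is on the loop-carried dependency chain.
        state = (table[b as usize] >> (state * 4)) & 0xF;
    }
    state == accept
}
```

Contrast with a conventional table DFA, where `table[state][b]` puts a full load-to-use latency on every iteration of the dependency chain.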
Add end-to-end benchmarks that exercise the full Vortex execution framework (Like -> ScalarFn -> FSSTVTable::like -> ShiftDfa) for all 7 datasets. These measure the production code path, including kernel dispatch and result materialization.

Results show a 2.0-3.5x speedup over decompression alone across all datasets, confirming the DFA-on-compressed-codes approach is effective through the full stack.
…op early-exit

Three optimizations to the FSST LIKE kernel:

1. Upgrade FsstPrefixDfa from a split n_symbols-wide table to a shift-based DFA (same approach as the contains DFA). Packs all state transitions into [u64; 256], breaking the load-use dependency chain.
2. Fix an unnecessary array clone: validity was obtained via `Validity::copy_from_array(&array.clone().into_array())`, which cloned the entire FSSTArray. Now reads validity directly from the codes array.
3. Remove the early-exit branch from the ShiftDfa::matches hot loop. The accept state is sticky (transitions to itself), so we just check at the end. Removes one branch per iteration from the critical path.
… kernel

Replace the BitBufferMut::collect_bool closure with a dedicated dfa_scan_to_bitbuf helper that packs match results into u64 words directly. This eliminates the cross-crate closure indirection and ensures the compiler can see the full loop body (DFA transition + bit packing) for better optimization.

Benchmark results show the LIKE kernel is now at parity with the raw shift DFA, and 3-4x faster than FSST decompression alone.
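The direct bit-packing half of that helper can be sketched as: accumulate one result bit per string into a u64, flushing every 64 strings (hypothetical free-standing function; in the real kernel the DFA scan and this packing live in one fused loop):

```rust
/// Pack a stream of per-string match results into u64 bitmap words,
/// 64 results per word, low bit first. Illustrative sketch, not the
/// Vortex BitBufferMut API.
fn pack_matches(results: impl Iterator<Item = bool>) -> Vec<u64> {
    let mut words = Vec::new();
    let mut word = 0u64;
    let mut bit = 0u32;
    for m in results {
        word |= (m as u64) << bit;
        bit += 1;
        if bit == 64 {
            words.push(word); // flush a full word
            word = 0;
            bit = 0;
        }
    }
    if bit > 0 {
        words.push(word); // partial trailing word
    }
    words
}
```

Keeping this loop in the same crate (and same function) as the DFA transition is what lets the optimizer fuse the scan and the packing instead of calling through a closure boundary.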
Two optimizations to the LIKE kernel:

1. Copy 65 offsets to a stack array per 64-string chunk for spatial locality, eliminating aliasing concerns in the inner loop.
2. Use iterator-based traversal in ShiftDfa::matches with early exit on the accept state, skipping remaining code bytes once a match is found.
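The offset-staging trick can be sketched as follows: 64 strings are bounded by 65 fence-post offsets, which are copied into a fixed stack array so the inner loop reads a local, provably non-aliased buffer. The names and the length computation here are illustrative assumptions, not Vortex's layout:

```rust
/// Compute per-string lengths from a fence-post offset array, staging
/// 65 offsets per 64-string chunk in a stack array. Sketch only.
fn chunk_lengths(offsets: &[u32]) -> Vec<u32> {
    let n_strings = offsets.len().saturating_sub(1);
    let mut lens = Vec::with_capacity(n_strings);
    for chunk_start in (0..n_strings).step_by(64) {
        let n = (n_strings - chunk_start).min(64);
        // 64 strings need 65 fence posts; copy them onto the stack so the
        // inner loop never touches the (potentially aliased) source slice.
        let mut local = [0u32; 65];
        local[..n + 1].copy_from_slice(&offsets[chunk_start..chunk_start + n + 1]);
        for i in 0..n {
            lens.push(local[i + 1] - local[i]);
        }
    }
    lens
}
```

The 64-string chunk size also lines up with the one-bit-per-string u64 result words, so offsets and result bits advance in lockstep.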
Three optimizations to the FSST LIKE kernel:

1. BranchlessShiftDfa: fold escape handling into the DFA state space (2N+1 states: N normal + 1 accept + N escape), eliminating the escape-code branch entirely from the inner loop. Used for needles <= 7 characters. The matches() function is a single branchless loop: one table load + shift + mask per code byte.
2. Running offset: track prev_end instead of loading offsets[i] twice per string, saving one offset load per iteration.
3. Iterator-based ShiftDfa::matches: use iter.next() instead of manual pos indexing to help the compiler eliminate bounds checks.

Benchmark results (fastest, no native):

- ClickBench: 5.5ms -> 3.1ms (44% faster)
- Rare: 6.6ms -> 3.3ms (50% faster)
- JSON: 4.0ms -> 3.6ms (10% faster)
- Log: 8.2ms -> 7.8ms (5% faster)
Add two new benchmark suites comparing our FSST DFA-based LIKE kernel against Arrow's memchr::memmem-based LIKE implementation:

- arrow_like_*: Arrow LIKE on pre-decompressed data (measures memmem speed)
- e2e_arrow_*: full decompress + Arrow LIKE (measures end-to-end cost)

Results show our DFA wins end-to-end on 4/5 datasets (1.1-2.2x faster) by avoiding decompression overhead, even though Arrow's memmem is faster per string on already-decompressed data.
… claude/check-listing-L7l0k
…ng-L7l0k

# Conflicts:
#   encodings/fsst/Cargo.toml
Merging this PR will improve performance by ×3.4
Summary
Closes: #000
Testing