[optimize](be) optimize floating fmod fast path#64161
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
|
run buildall |
There was a problem hiding this comment.
Reviewed PR 64161 at head f23d171.
No blocking findings were identified in the authoritative GitHub PR diff.
Critical checkpoint conclusions:
- Goal and proof: The PR optimizes floating
mod/fmodby routing float/double vector paths throughfmod_fast; the added BE unit test covers scalar, vector/vector, const-shape, zero-divisor null semantics, corner cases, and random/representative distributions againststd::fmod. - Scope and clarity: The actual PR scope is focused on the fmod implementation, benchmark, and targeted unit tests.
- Concurrency and lifecycle: No new shared mutable state, locks, threads, or non-trivial lifecycle management were introduced. The new helper has no namespace-scope state beyond compile-time constants/functions.
- Configuration and compatibility: No new config, storage format, protocol, or FE/BE serialized variable was introduced.
- Parallel paths:
modfloat/double vector-constant, constant-vector, vector-vector, and scalar paths are covered; integer and decimal modulo paths remain on existing logic. - Error handling and memory: No ignored
Statuspaths were introduced. Allocations are normal column/test/benchmark allocations, not new background or cache ownership paths. - Performance: The optimization is targeted at hot float/double modulo paths, with benchmark coverage added. I did not see an obvious redundant hot-path operation introduced by the PR.
- Tests: The PR adds a BE unit test for the new helper and Doris function semantics. I did not run the test suite in this review environment.
- Observability and persistence: Not applicable; no new runtime service, transaction, persistence, or user-facing diagnostic path was introduced.
User focus: No additional user-provided review focus was present.
Root cause: clang lowers Doris floating mod/fmod to libm std::fmod for every non-null row. On the arithmetic operator benchmark distribution, especially large lhs with sub-unit or small rhs and float inputs promoted to double, that libm path dominates expression execution while SR/GCC-like code uses x87 fprem plus cheap finite and magnitude fast paths. Fix: add fmod_fast helpers for double/float on x86_64 that inline the x87 fprem sequence, keep std::fmod fallback for zero divisor and non-finite inputs, and preserve Doris null semantics by still writing the null map for vector-vector and constant/vector shapes. Wire only floating mod/fmod paths to the helpers; integer, decimal, and pmod behavior stay unchanged. Validation: ran build-support/run_clang_format.py with the pinned clang-format; ran git diff --check; ran ./run-be-ut.sh --run --filter=FunctionFmodFastTest.* -j 90, 6 tests passed; built and ran benchmark_test for fmod cases; deployed a temporary 1FE/1BE cluster with 5M rows, parallel_pipeline_task_num=1 and SQL/query cache disabled, then compared baseline vs patched on the same storage. Representative e2e speedups: Double_VectorConst 4.03x, Double_DB_IN_ONE 3.92x, Double_Nullable 3.19x, Float_DB_IN_ONE 4.58x, Float_VectorConst 3.89x.
f23d171 to
6ac4a03
Compare
|
run buildall |
TPC-H: Total hot run time: 28525 ms |
TPC-DS: Total hot run time: 169020 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Review result: no blocking issues found in the actual PR diff for the floating fmod fast path.
Critical checkpoint conclusions:
- Goal and proof: The PR targets BE floating / performance by using an x87 fast path with fallback. The implementation wires only FLOAT/DOUBLE modulo paths and includes focused BE unit tests comparing scalar and vector helpers bit-for-bit with , plus SQL-level / null semantics cases.
- Scope: The code changes are focused on the fmod helper, modulo dispatch, benchmark coverage, and a dedicated unit test. Integer, decimal, and paths remain on the existing implementation after inspection of the overload dispatch.
- Concurrency/lifecycle: No new shared mutable state, threads, locks, static initialization dependencies, or lifecycle-sensitive ownership were introduced.
- Configuration/compatibility/protocol: No new configuration, storage format, persistence, or FE/BE protocol changes.
- Parallel paths: FLOAT and DOUBLE / vector-constant, constant-vector, vector-vector, and scalar paths are covered. , integer, and decimal paths are intentionally unchanged.
- Error/null handling: Division-by-zero null-map behavior is preserved, including constant-zero short-circuiting. Non-finite and zero divisor cases fall back to where the helper is used directly.
- Tests: Added covers scalar corner cases, representative/random finite distributions, vector shapes, constant shapes, and SQL-level Doris null semantics for FLOAT64. I did not run tests locally in this review; I reviewed the submitted test coverage and PR validation notes.
- Observability: No new operational path requiring additional logging or metrics.
- Performance: The change targets a hot vectorized expression path and avoids applying the x87 path to non-finite/zero-divisor cases. Benchmarks are added for the stated distributions.
User focus: no additional user-provided review focus was present.
There was a problem hiding this comment.
Review result: no blocking issues found in the actual PR diff for the floating fmod fast path.
Critical checkpoint conclusions:
- Goal and proof: The PR targets BE floating mod/fmod performance by using an x87 fprem fast path with std::fmod fallback. The implementation wires only FLOAT/DOUBLE modulo paths and includes focused BE unit tests comparing scalar and vector helpers bit-for-bit with std::fmod, plus SQL-level mod/fmod null semantics cases.
- Scope: The code changes are focused on the fmod helper, modulo dispatch, benchmark coverage, and a dedicated unit test. Integer, decimal, and pmod paths remain on the existing implementation after inspection of the overload dispatch.
- Concurrency/lifecycle: No new shared mutable state, threads, locks, static initialization dependencies, or lifecycle-sensitive ownership were introduced.
- Configuration/compatibility/protocol: No new configuration, storage format, persistence, or FE/BE protocol changes.
- Parallel paths: FLOAT and DOUBLE mod/fmod vector-constant, constant-vector, vector-vector, and scalar paths are covered. pmod, integer, and decimal paths are intentionally unchanged.
- Error/null handling: Division-by-zero null-map behavior is preserved, including constant-zero short-circuiting. Non-finite and zero-divisor cases fall back to std::fmod where the helper is used directly.
- Tests: Added FunctionFmodFastTest covers scalar corner cases, representative/random finite distributions, vector shapes, constant shapes, and SQL-level Doris null semantics for FLOAT64. I did not run tests locally in this review; I reviewed the submitted test coverage and PR validation notes.
- Observability: No new operational path requiring additional logging or metrics.
- Performance: The change targets a hot vectorized expression path and avoids applying the x87 path to non-finite/zero-divisor cases. Benchmarks are added for the stated distributions.
User focus: no additional user-provided review focus was present.
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
Root cause: clang lowers Doris floating mod/fmod to libm std::fmod for every non-null row. On the arithmetic operator benchmark distribution, especially large lhs with sub-unit or small rhs and float inputs promoted to double, that libm path dominates expression execution while SR/GCC-like code uses x87 fprem plus cheap finite and magnitude fast paths.
Fix: add fmod_fast helpers for double/float on x86_64 that inline the x87 fprem sequence, keep std::fmod fallback for zero divisor and non-finite inputs, and preserve Doris null semantics by still writing the null map for vector-vector and constant/vector shapes. Wire only floating mod/fmod paths to the helpers; integer, decimal, and pmod behavior stay unchanged.