Skip to content

[optimize](be) optimize floating fmod fast path#64161

Open
zclllyybb wants to merge 1 commit into
apache:masterfrom
zclllyybb:codex-fmod-gcc-fast
Open

[optimize](be) optimize floating fmod fast path#64161
zclllyybb wants to merge 1 commit into
apache:masterfrom
zclllyybb:codex-fmod-gcc-fast

Conversation

@zclllyybb
Copy link
Copy Markdown
Contributor

Root cause: clang lowers Doris floating mod/fmod to libm std::fmod for every non-null row. On the arithmetic operator benchmark distribution, especially large lhs with sub-unit or small rhs and float inputs promoted to double, that libm path dominates expression execution while SR/GCC-like code uses x87 fprem plus cheap finite and magnitude fast paths.

Fix: add fmod_fast helpers for double/float on x86_64 that inline the x87 fprem sequence, keep std::fmod fallback for zero divisor and non-finite inputs, and preserve Doris null semantics by still writing the null map for vector-vector and constant/vector shapes. Wire only floating mod/fmod paths to the helpers; integer, decimal, and pmod behavior stay unchanged.

case                  baseline mean    patched mean    speedup
━━━━━━━━━━━━━━━━━━━━  ━━━━━━━━━━━━━━━  ━━━━━━━━━━━━━━  ━━━━━━━━━
  Double_DB_DB                91.2 ms         76.1 ms      1.20x
────────────────────  ───────────────  ──────────────  ─────────
  Double_IN_ONE_DB           102.2 ms         87.1 ms      1.17x
────────────────────  ───────────────  ──────────────  ─────────
  Double_DB_IN_ONE           473.7 ms        120.7 ms      3.92x
────────────────────  ───────────────  ──────────────  ─────────
  Double_DB_IN_TEN           205.4 ms        108.5 ms      1.89x
────────────────────  ───────────────  ──────────────  ─────────
  Double_MIXED_ZERO          302.1 ms        141.7 ms      2.13x
────────────────────  ───────────────  ──────────────  ─────────
  Double_VectorConst         413.6 ms        102.5 ms      4.03x
────────────────────  ───────────────  ──────────────  ─────────
  Double_ConstVector          80.0 ms         79.1 ms      1.01x
────────────────────  ───────────────  ──────────────  ─────────
  Double_Nullable            488.4 ms        153.2 ms      3.19x
────────────────────  ───────────────  ──────────────  ─────────
  Float_DB_DB                 75.2 ms         70.7 ms      1.06x
────────────────────  ───────────────  ──────────────  ─────────
  Float_DB_IN_ONE            525.7 ms        114.7 ms      4.58x
────────────────────  ───────────────  ──────────────  ─────────
  Float_VectorConst          414.2 ms        106.4 ms      3.89x

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@zclllyybb
Copy link
Copy Markdown
Contributor Author

/review

@zclllyybb
Copy link
Copy Markdown
Contributor Author

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed PR 64161 at head f23d171.

No blocking findings were identified in the authoritative GitHub PR diff.

Critical checkpoint conclusions:

  • Goal and proof: The PR optimizes floating mod/fmod by routing float/double vector paths through fmod_fast; the added BE unit test covers scalar, vector/vector, const-shape, zero-divisor null semantics, corner cases, and random/representative distributions against std::fmod.
  • Scope and clarity: The actual PR scope is focused on the fmod implementation, benchmark, and targeted unit tests.
  • Concurrency and lifecycle: No new shared mutable state, locks, threads, or non-trivial lifecycle management were introduced. The new helper has no namespace-scope state beyond compile-time constants/functions.
  • Configuration and compatibility: No new config, storage format, protocol, or FE/BE serialized variable was introduced.
  • Parallel paths: mod float/double vector-constant, constant-vector, vector-vector, and scalar paths are covered; integer and decimal modulo paths remain on existing logic.
  • Error handling and memory: No ignored Status paths were introduced. Allocations are normal column/test/benchmark allocations, not new background or cache ownership paths.
  • Performance: The optimization is targeted at hot float/double modulo paths, with benchmark coverage added. I did not see an obvious redundant hot-path operation introduced by the PR.
  • Tests: The PR adds a BE unit test for the new helper and Doris function semantics. I did not run the test suite in this review environment.
  • Observability and persistence: Not applicable; no new runtime service, transaction, persistence, or user-facing diagnostic path was introduced.

User focus: No additional user-provided review focus was present.

Root cause: clang lowers Doris floating mod/fmod to libm std::fmod for every non-null row. On the arithmetic operator benchmark distribution, especially large lhs with sub-unit or small rhs and float inputs promoted to double, that libm path dominates expression execution while SR/GCC-like code uses x87 fprem plus cheap finite and magnitude fast paths.

Fix: add fmod_fast helpers for double/float on x86_64 that inline the x87 fprem sequence, keep std::fmod fallback for zero divisor and non-finite inputs, and preserve Doris null semantics by still writing the null map for vector-vector and constant/vector shapes. Wire only floating mod/fmod paths to the helpers; integer, decimal, and pmod behavior stay unchanged.

Validation: ran build-support/run_clang_format.py with the pinned clang-format; ran git diff --check; ran ./run-be-ut.sh --run --filter=FunctionFmodFastTest.* -j 90, 6 tests passed; built and ran benchmark_test for fmod cases; deployed a temporary 1FE/1BE cluster with 5M rows, parallel_pipeline_task_num=1 and SQL/query cache disabled, then compared baseline vs patched on the same storage. Representative e2e speedups: Double_VectorConst 4.03x, Double_DB_IN_ONE 3.92x, Double_Nullable 3.19x, Float_DB_IN_ONE 4.58x, Float_VectorConst 3.89x.
@zclllyybb zclllyybb force-pushed the codex-fmod-gcc-fast branch from f23d171 to 6ac4a03 Compare June 5, 2026 16:43
@zclllyybb
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 28525 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6ac4a0356fb2800f96ec1ead75838513f06e360c, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17753	3948	3928	3928
q2	q3	10759	1367	790	790
q4	4700	477	341	341
q5	7657	864	580	580
q6	189	171	137	137
q7	801	875	620	620
q8	9945	1662	1534	1534
q9	6940	4529	4483	4483
q10	6780	1837	1527	1527
q11	436	272	240	240
q12	630	435	289	289
q13	18149	3376	2779	2779
q14	290	255	250	250
q15	q16	818	774	711	711
q17	1387	1043	803	803
q18	6795	5742	5517	5517
q19	1421	1380	1005	1005
q20	518	398	263	263
q21	5534	2568	2424	2424
q22	431	357	304	304
Total cold run time: 101933 ms
Total hot run time: 28525 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4359	4260	4257	4257
q2	q3	4498	4928	4324	4324
q4	2078	2178	1378	1378
q5	4435	4277	4296	4277
q6	227	173	127	127
q7	1754	1860	1773	1773
q8	2560	2220	2113	2113
q9	8068	8104	7865	7865
q10	4772	4692	4316	4316
q11	596	436	381	381
q12	735	746	536	536
q13	3451	3533	3074	3074
q14	325	310	286	286
q15	q16	751	717	638	638
q17	1327	1331	1397	1331
q18	7972	7405	6831	6831
q19	1070	1069	1135	1069
q20	2204	2216	1940	1940
q21	5239	4560	4414	4414
q22	529	460	412	412
Total cold run time: 56950 ms
Total hot run time: 51342 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169020 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6ac4a0356fb2800f96ec1ead75838513f06e360c, data reload: false

query5	4336	629	486	486
query6	441	205	177	177
query7	4874	555	298	298
query8	367	231	213	213
query9	8734	3981	4003	3981
query10	440	303	259	259
query11	5885	2323	2155	2155
query12	158	104	101	101
query13	1282	598	448	448
query14	6392	5432	5093	5093
query14_1	4543	4430	4389	4389
query15	211	197	180	180
query16	1005	464	432	432
query17	1139	718	592	592
query18	2515	489	360	360
query19	205	192	180	180
query20	114	109	104	104
query21	216	141	115	115
query22	13657	13577	13518	13518
query23	17410	16440	16148	16148
query23_1	16241	16285	16479	16285
query24	7558	1795	1338	1338
query24_1	1313	1340	1318	1318
query25	547	450	383	383
query26	1321	315	166	166
query27	2721	577	335	335
query28	4432	2028	2009	2009
query29	1074	611	473	473
query30	308	229	198	198
query31	1127	1072	954	954
query32	114	61	60	60
query33	518	316	245	245
query34	1177	1150	630	630
query35	745	777	677	677
query36	1406	1372	1200	1200
query37	150	105	89	89
query38	3202	3154	3057	3057
query39	923	925	915	915
query39_1	878	885	882	882
query40	225	123	103	103
query41	67	66	63	63
query42	96	95	95	95
query43	319	324	285	285
query44	
query45	197	190	182	182
query46	1091	1192	727	727
query47	2344	2351	2243	2243
query48	405	422	295	295
query49	616	471	365	365
query50	987	358	262	262
query51	4346	4317	4270	4270
query52	89	92	78	78
query53	248	299	192	192
query54	288	244	195	195
query55	78	76	78	76
query56	238	258	240	240
query57	1439	1390	1316	1316
query58	240	214	215	214
query59	1566	1652	1414	1414
query60	288	246	222	222
query61	160	157	148	148
query62	695	677	584	584
query63	233	186	187	186
query64	2578	782	609	609
query65	
query66	1808	477	364	364
query67	29736	29644	29482	29482
query68	
query69	411	322	266	266
query70	977	951	960	951
query71	299	229	200	200
query72	2985	2680	2333	2333
query73	866	777	434	434
query74	5121	4942	4791	4791
query75	2644	2564	2229	2229
query76	2336	1174	790	790
query77	351	368	280	280
query78	12257	12404	11909	11909
query79	1420	1108	699	699
query80	585	498	383	383
query81	451	278	244	244
query82	567	157	122	122
query83	360	271	244	244
query84	261	142	112	112
query85	859	551	425	425
query86	373	295	276	276
query87	3377	3383	3193	3193
query88	3604	2787	2753	2753
query89	435	384	331	331
query90	1893	195	186	186
query91	187	188	155	155
query92	71	67	61	61
query93	1529	1477	882	882
query94	567	384	339	339
query95	711	496	361	361
query96	1038	807	350	350
query97	2765	2678	2538	2538
query98	250	206	204	204
query99	1147	1183	1047	1047
Total cold run time: 251166 ms
Total hot run time: 169020 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 72.34% (102/141) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.80% (21080/39183)
Line Coverage 37.48% (200336/534455)
Region Coverage 33.48% (156926/468784)
Branch Coverage 34.54% (68714/198944)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 85.11% (120/141) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.84% (28266/38279)
Line Coverage 57.85% (307438/531410)
Region Coverage 54.79% (258073/470989)
Branch Coverage 56.14% (111849/199227)

@zclllyybb
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: no blocking issues found in the actual PR diff for the floating fmod fast path.

Critical checkpoint conclusions:

  • Goal and proof: The PR targets BE floating / performance by using an x87 fast path with fallback. The implementation wires only FLOAT/DOUBLE modulo paths and includes focused BE unit tests comparing scalar and vector helpers bit-for-bit with , plus SQL-level / null semantics cases.
  • Scope: The code changes are focused on the fmod helper, modulo dispatch, benchmark coverage, and a dedicated unit test. Integer, decimal, and paths remain on the existing implementation after inspection of the overload dispatch.
  • Concurrency/lifecycle: No new shared mutable state, threads, locks, static initialization dependencies, or lifecycle-sensitive ownership were introduced.
  • Configuration/compatibility/protocol: No new configuration, storage format, persistence, or FE/BE protocol changes.
  • Parallel paths: FLOAT and DOUBLE / vector-constant, constant-vector, vector-vector, and scalar paths are covered. , integer, and decimal paths are intentionally unchanged.
  • Error/null handling: Division-by-zero null-map behavior is preserved, including constant-zero short-circuiting. Non-finite and zero divisor cases fall back to where the helper is used directly.
  • Tests: Added covers scalar corner cases, representative/random finite distributions, vector shapes, constant shapes, and SQL-level Doris null semantics for FLOAT64. I did not run tests locally in this review; I reviewed the submitted test coverage and PR validation notes.
  • Observability: No new operational path requiring additional logging or metrics.
  • Performance: The change targets a hot vectorized expression path and avoids applying the x87 path to non-finite/zero-divisor cases. Benchmarks are added for the stated distributions.

User focus: no additional user-provided review focus was present.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: no blocking issues found in the actual PR diff for the floating fmod fast path.

Critical checkpoint conclusions:

  • Goal and proof: The PR targets BE floating mod/fmod performance by using an x87 fprem fast path with std::fmod fallback. The implementation wires only FLOAT/DOUBLE modulo paths and includes focused BE unit tests comparing scalar and vector helpers bit-for-bit with std::fmod, plus SQL-level mod/fmod null semantics cases.
  • Scope: The code changes are focused on the fmod helper, modulo dispatch, benchmark coverage, and a dedicated unit test. Integer, decimal, and pmod paths remain on the existing implementation after inspection of the overload dispatch.
  • Concurrency/lifecycle: No new shared mutable state, threads, locks, static initialization dependencies, or lifecycle-sensitive ownership were introduced.
  • Configuration/compatibility/protocol: No new configuration, storage format, persistence, or FE/BE protocol changes.
  • Parallel paths: FLOAT and DOUBLE mod/fmod vector-constant, constant-vector, vector-vector, and scalar paths are covered. pmod, integer, and decimal paths are intentionally unchanged.
  • Error/null handling: Division-by-zero null-map behavior is preserved, including constant-zero short-circuiting. Non-finite and zero-divisor cases fall back to std::fmod where the helper is used directly.
  • Tests: Added FunctionFmodFastTest covers scalar corner cases, representative/random finite distributions, vector shapes, constant shapes, and SQL-level Doris null semantics for FLOAT64. I did not run tests locally in this review; I reviewed the submitted test coverage and PR validation notes.
  • Observability: No new operational path requiring additional logging or metrics.
  • Performance: The change targets a hot vectorized expression path and avoids applying the x87 path to non-finite/zero-divisor cases. Benchmarks are added for the stated distributions.

User focus: no additional user-provided review focus was present.

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 6, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 6, 2026

PR approved by anyone and no changes requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.1.x reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants