Skip to content

[fix](be) Fix QuantileState protobuf serialization in write/read_column_to_pb#64152

Open
heguanhui wants to merge 1 commit into
apache:masterfrom
heguanhui:fix/quantile-state-serde-pb-serialization
Open

[fix](be) Fix QuantileState protobuf serialization in write/read_column_to_pb#64152
heguanhui wants to merge 1 commit into
apache:masterfrom
heguanhui:fix/quantile-state-serde-pb-serialization

Conversation

@heguanhui

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: close #64149

Problem Summary: DataTypeQuantileStateSerDe::write_column_to_pb used get_data_at() to copy sizeof(QuantileState) bytes of raw memory, but QuantileState contains heap-allocated members (shared_ptr<TDigest>, vector<double>) whose data is not stored inline. read_column_from_pb then used insert_data() to reinterpret this incomplete memory as a QuantileState object, resulting in dangling pointers and data corruption.

Additionally, write_column_to_pb was missing the PGenericType::QUANTILE_STATE type id assignment.

The test also had a bug: it compared QuantileState objects via get_data_at() (raw memory comparison), which compares pointer values rather than logical equality.

Release note

Fixed QuantileState type corruption in RPC UDF / RPC aggregate function / fold constant paths. Previously, QuantileState values could be silently corrupted or lost when transmitted via protobuf.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. Fixes broken behavior — QuantileState values were corrupted during protobuf serialization/deserialization.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@heguanhui

Copy link
Copy Markdown
Contributor Author

/review

@heguanhui

Copy link
Copy Markdown
Contributor Author

run buildall

@heguanhui heguanhui force-pushed the fix/quantile-state-serde-pb-serialization branch from 78d120b to c3f87fe Compare June 6, 2026 18:26
…mn_to_pb

### What problem does this PR solve?

Issue Number: close apache#64149

Problem Summary: DataTypeQuantileStateSerDe::write_column_to_pb used get_data_at() to copy sizeof(QuantileState) bytes of raw memory, but QuantileState contains heap-allocated members (shared_ptr<TDigest>, vector<double>) whose data is not stored inline. read_column_from_pb then used insert_data() to reinterpret this incomplete memory as a QuantileState object, resulting in dangling pointers and data corruption. In DEBUG/ASAN builds this caused test failures (crash or assertion errors); in RELEASE builds it appeared to work by accident (undefined behavior accessing stale heap pointers).

Root cause: The serialization was fundamentally wrong — copying sizeof(QuantileState) bytes never captures the heap data. The correct approach (matching HLL/Bitmap implementations) is to call QuantileState::serialize()/deserialize() which properly handle the full object state.

Additionally, write_column_to_pb was missing the PGenericType::QUANTILE_STATE type id assignment that HLL and Bitmap both set.

The test also had a bug: it compared QuantileState objects via get_data_at() (raw memory comparison), which compares pointer values rather than logical equality. Fixed to compare via serialize() output.

### Release note

Fixed QuantileState type corruption in RPC UDF / RPC aggregate function / fold constant paths. Previously, QuantileState values could be silently corrupted or lost when transmitted via protobuf.

### Check List (For Author)

- Test: Unit Test (QuantileStateSerdeTest.writeColumnToPb)
- Behavior changed: No (fixes broken behavior)
- Does this need documentation: No
@heguanhui heguanhui force-pushed the fix/quantile-state-serde-pb-serialization branch from c3f87fe to 9f6743d Compare June 7, 2026 10:59
@heguanhui

Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen

Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (13/13) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.87% (28291/38298)
Line Coverage 57.91% (307985/531862)
Region Coverage 54.82% (258395/471332)
Branch Coverage 56.14% (111967/199431)

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-H: Total hot run time: 29375 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9f6743df49796f8d7f612ad091f0d5305e50d560, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17689	3985	3958	3958
q2	q3	10831	1418	796	796
q4	4686	510	370	370
q5	7561	918	585	585
q6	185	176	135	135
q7	801	853	648	648
q8	9339	1598	1574	1574
q9	5883	4580	4567	4567
q10	6743	1819	1572	1572
q11	433	272	250	250
q12	626	425	290	290
q13	18132	3447	2794	2794
q14	270	264	244	244
q15	q16	826	781	720	720
q17	926	965	909	909
q18	6883	5900	5507	5507
q19	1315	1276	1123	1123
q20	505	410	264	264
q21	6422	2835	2757	2757
q22	470	373	312	312
Total cold run time: 100526 ms
Total hot run time: 29375 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5151	4813	4687	4687
q2	q3	4982	5415	4682	4682
q4	2165	2212	1381	1381
q5	4769	4991	4746	4746
q6	228	171	129	129
q7	1865	1839	1536	1536
q8	2431	2102	2118	2102
q9	7896	7544	7398	7398
q10	4735	4675	4204	4204
q11	534	386	358	358
q12	732	756	527	527
q13	3008	3377	2827	2827
q14	280	272	258	258
q15	q16	701	697	627	627
q17	1318	1264	1261	1261
q18	7433	6801	6892	6801
q19	1139	1073	1094	1073
q20	2225	2220	1945	1945
q21	5282	4551	4397	4397
q22	525	454	428	428
Total cold run time: 57399 ms
Total hot run time: 51367 ms

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-DS: Total hot run time: 169788 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9f6743df49796f8d7f612ad091f0d5305e50d560, data reload: false

query5	4348	632	473	473
query6	453	216	192	192
query7	4946	548	309	309
query8	383	219	226	219
query9	8762	4140	4098	4098
query10	461	317	266	266
query11	5931	2306	2237	2237
query12	156	103	99	99
query13	1280	616	473	473
query14	6408	5433	5107	5107
query14_1	4398	4416	4581	4416
query15	209	195	174	174
query16	986	467	356	356
query17	917	697	557	557
query18	2469	467	339	339
query19	213	186	140	140
query20	108	109	104	104
query21	218	141	124	124
query22	13664	13624	13481	13481
query23	17361	16535	16203	16203
query23_1	16297	16303	16335	16303
query24	7728	1793	1343	1343
query24_1	1338	1351	1353	1351
query25	580	478	401	401
query26	1309	329	172	172
query27	2672	553	340	340
query28	4521	2028	2035	2028
query29	1092	636	502	502
query30	318	245	203	203
query31	1129	1080	978	978
query32	118	64	62	62
query33	536	331	273	273
query34	1192	1120	655	655
query35	763	789	700	700
query36	1374	1395	1268	1268
query37	161	104	101	101
query38	3222	3142	3085	3085
query39	941	921	916	916
query39_1	878	880	872	872
query40	230	131	107	107
query41	74	71	70	70
query42	99	106	97	97
query43	329	331	286	286
query44	
query45	205	192	196	192
query46	1102	1210	762	762
query47	2380	2400	2264	2264
query48	394	430	319	319
query49	657	485	381	381
query50	1021	358	269	269
query51	4394	4312	4239	4239
query52	92	93	79	79
query53	244	273	195	195
query54	300	237	220	220
query55	82	81	73	73
query56	254	231	276	231
query57	1415	1425	1346	1346
query58	243	212	215	212
query59	1603	1726	1447	1447
query60	279	269	230	230
query61	153	157	155	155
query62	705	651	594	594
query63	231	187	185	185
query64	2566	774	618	618
query65	
query66	1782	467	338	338
query67	29747	29700	29476	29476
query68	
query69	431	307	266	266
query70	957	966	931	931
query71	298	229	213	213
query72	2971	2764	2394	2394
query73	876	781	411	411
query74	5150	5010	4785	4785
query75	2690	2605	2227	2227
query76	2288	1152	793	793
query77	359	374	287	287
query78	12505	12514	11879	11879
query79	1417	1051	735	735
query80	596	485	402	402
query81	456	283	247	247
query82	832	160	122	122
query83	362	274	253	253
query84	
query85	925	525	463	463
query86	397	310	283	283
query87	3421	3364	3187	3187
query88	3655	2750	2730	2730
query89	413	388	325	325
query90	1940	179	188	179
query91	180	164	139	139
query92	65	63	60	60
query93	1412	1460	866	866
query94	549	341	311	311
query95	702	383	346	346
query96	1090	860	349	349
query97	2705	2693	2564	2564
query98	214	204	206	204
query99	1187	1181	1065	1065
Total cold run time: 251749 ms
Total hot run time: 169788 ms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

QuantileState protobuf serialization corrupts data in write/read_column_to_pb

2 participants