[opt](memory) lazy-allocate PrefetchBuffer backing buffer to reduce peak memory#61482

Open
sollhui wants to merge 1 commit into apache:master from sollhui:opt_prefetch_buffer

sollhui (Contributor) commented Mar 18, 2026

Problem

When doing a TVF scan over many small S3/HDFS files, each CsvReader::init_reader()
creates a PrefetchBufferedReader, which in its constructor immediately allocates
buffer_num (typically 4) PrefetchBuffer objects, each pre-allocating a
s_max_pre_buffer_size (4 MB) backing buffer. This costs 16 MB per file reader
at construction time, regardless of whether the reader ever performs any I/O.
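The per-reader figure follows directly from the two constants named above. A minimal sketch of that arithmetic, assuming "MB" means MiB and using illustrative constant names (`kBufferNum`, `kMaxPreBufferSize`) that are not taken from the Doris source:

```cpp
#include <cstdint>

// Values from the PR description; the names are illustrative only.
constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;
constexpr std::uint64_t kBufferNum = 4;                // buffer_num, "typically 4"
constexpr std::uint64_t kMaxPreBufferSize = 4 * kMiB;  // s_max_pre_buffer_size

// Bytes reserved up front by `readers` readers that allocate eagerly.
constexpr std::uint64_t eager_bytes(std::uint64_t readers) {
    return readers * kBufferNum * kMaxPreBufferSize;
}

static_assert(eager_bytes(1) == 16 * kMiB, "one reader reserves 16 MiB");
static_assert(eager_bytes(1000) == 16000 * kMiB, "1000 readers reserve 16000 MiB");
```

At this rate roughly a thousand live readers are enough to account for a double-digit number of gigabytes.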

When the scanner finishes a file and calls close(), the corresponding
PrefetchBuffer objects are not immediately freed: reset_offset() submits prefetch
tasks to a thread pool via a shared_from_this() lambda, which keeps the
PrefetchBuffer alive until the task actually runs. Under high concurrency (many
small files, busy prefetch thread pool), thousands of such tasks can queue up,
each holding a live PrefetchBuffer with its pre-allocated 4 MB _buf. In a
heap profile from a real production workload, 18 GB (33.8%) of total BE memory
was attributed to PrefetchBuffer construction in this path.
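The lifetime effect described above can be reproduced in a few lines. This is a simplified model, not the Doris code: `EagerBuffer`, `g_live_bytes`, and the `std::queue` standing in for the thread pool are all hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <queue>
#include <vector>

// Bytes currently held by live EagerBuffer instances (illustrative counter).
std::size_t g_live_bytes = 0;

// Hypothetical stand-in for a buffer that allocates in its constructor.
struct EagerBuffer : std::enable_shared_from_this<EagerBuffer> {
    explicit EagerBuffer(std::size_t size) : buf(size) { g_live_bytes += buf.size(); }
    ~EagerBuffer() { g_live_bytes -= buf.size(); }

    // Queue a task whose lambda owns a shared_ptr to this object.
    void submit(std::queue<std::function<void()>>& pool) {
        auto self = shared_from_this();
        pool.push([self] {
            if (!self->closed) self->buf[0] = 1;  // pretend to prefetch
        });
    }

    bool closed = false;
    std::vector<char> buf;
};

// Opens a reader, queues its prefetch task, then closes the reader.
void open_and_close_reader(std::queue<std::function<void()>>& pool, std::size_t size) {
    auto buffer = std::make_shared<EagerBuffer>(size);
    buffer->submit(pool);
    buffer->closed = true;
}  // the local owner is gone here, but the queued lambda still owns the buffer

// Runs and discards every queued task, releasing the captured buffers.
void drain(std::queue<std::function<void()>>& pool) {
    while (!pool.empty()) {
        pool.front()();
        pool.pop();
    }
}
```

After several calls to `open_and_close_reader`, `g_live_bytes` stays at the full eager total until `drain` runs, which mirrors closed readers pinning memory while their tasks wait in a busy pool.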

Fix

Defer the allocation of _buf from the PrefetchBuffer constructor to the first
time prefetch_buffer() actually runs. This is safe because:

  1. _buf is only written by prefetch_buffer() (one writer).
  2. read_buffer() only accesses _buf after waiting on the condition variable for
    PREFETCHED status, which provides the required happens-before guarantee.
  3. For already-CLOSED buffers, prefetch_buffer() returns early before the
    allocation site — so queued tasks on closed readers never allocate _buf at all.
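The three conditions above can be sketched as a small state machine. This is a hedged illustration rather than the actual patch: `LazyBuffer`, its `Status` values other than PREFETCHED and CLOSED, and the method names are assumptions, and for brevity the sketch holds the mutex across the simulated read, which a real implementation would probably avoid.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical, simplified model; not the actual Doris PrefetchBuffer.
class LazyBuffer {
public:
    enum class Status { PENDING, PREFETCHED, CLOSED };

    // No backing allocation happens here. `size` is assumed to be > 0.
    explicit LazyBuffer(std::size_t size) : _size(size) {}

    // Runs on a pool thread; the only writer of _buf.
    void prefetch() {
        std::unique_lock<std::mutex> lock(_mutex);
        if (_status == Status::CLOSED) return;  // early return before the allocation site
        if (_buf.empty()) _buf.resize(_size);   // allocate on first use
        _buf[0] = 'x';                          // stand-in for the real read
        _status = Status::PREFETCHED;
        lock.unlock();
        _cv.notify_all();
    }

    // Waits until data is ready or the buffer is closed; returns false if closed.
    bool read(char* out) {
        std::unique_lock<std::mutex> lock(_mutex);
        _cv.wait(lock, [this] { return _status != Status::PENDING; });
        if (_status == Status::CLOSED) return false;
        *out = _buf[0];  // _buf is only touched after PREFETCHED is observed
        return true;
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _status = Status::CLOSED;
        }
        _cv.notify_all();
    }

    // Size of the backing buffer, 0 until the first prefetch runs.
    std::size_t allocated_bytes() {
        std::lock_guard<std::mutex> lock(_mutex);
        return _buf.size();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    Status _status = Status::PENDING;
    std::size_t _size;
    std::vector<char> _buf;
};
```

A buffer that is closed before its task runs reports `allocated_bytes() == 0` even after `prefetch()` is called, while the normal path allocates exactly once, when the first prefetch executes.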

Impact

| Scenario | Before | After |
|---|---|---|
| Reader created, never reads (closed before prefetch runs) | 16 MB allocated, held until task drains | 0 MB allocated |
| N tasks queued on closed buffers in thread pool | N × 16 MB stuck in memory | ~0 MB (empty shell objects only) |
| Normal read path | 16 MB allocated when prefetch runs | 16 MB allocated when prefetch runs (unchanged) |

Comparison of total load memory (the earlier run is before the optimization):
[image]

sollhui (Contributor, Author) commented Mar 18, 2026

run buildall

hello-stephen (Contributor) commented

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

doris-robot commented

TPC-H: Total hot run time: 27313 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9730d803344350998a58eadec2413a2d8a774a8d, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17654	4430	4356	4356
q2	q3	10647	777	524	524
q4	4685	359	257	257
q5	7640	1244	1043	1043
q6	187	176	152	152
q7	823	849	678	678
q8	10202	1500	1325	1325
q9	5220	4803	4759	4759
q10	6325	1957	1682	1682
q11	466	259	255	255
q12	735	582	489	489
q13	18080	3010	2203	2203
q14	224	243	218	218
q15	q16	753	761	669	669
q17	744	846	429	429
q18	5924	5311	5321	5311
q19	1328	990	627	627
q20	555	501	391	391
q21	4517	2138	1664	1664
q22	406	324	281	281
Total cold run time: 97115 ms
Total hot run time: 27313 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4838	4662	4819	4662
q2	q3	3899	4398	3852	3852
q4	946	1212	800	800
q5	4045	4366	4346	4346
q6	196	182	153	153
q7	1814	1741	1533	1533
q8	2522	2733	2630	2630
q9	7726	7397	7294	7294
q10	3761	4020	3728	3728
q11	538	471	429	429
q12	492	608	438	438
q13	2753	3218	2331	2331
q14	291	312	282	282
q15	q16	729	766	762	762
q17	1212	1451	1383	1383
q18	7335	6979	6768	6768
q19	913	903	1035	903
q20	2085	2233	2090	2090
q21	4203	3574	3458	3458
q22	483	440	388	388
Total cold run time: 50781 ms
Total hot run time: 48230 ms

doris-robot commented

TPC-DS: Total hot run time: 168599 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9730d803344350998a58eadec2413a2d8a774a8d, data reload: false

query5	4326	672	515	515
query6	347	235	217	217
query7	4205	472	266	266
query8	349	276	239	239
query9	8753	2750	2684	2684
query10	525	397	348	348
query11	6960	5179	4908	4908
query12	192	139	131	131
query13	1281	477	356	356
query14	5746	3779	3478	3478
query14_1	2853	2842	2883	2842
query15	206	199	186	186
query16	988	471	457	457
query17	1037	739	626	626
query18	2460	470	358	358
query19	234	218	195	195
query20	136	132	133	132
query21	220	134	111	111
query22	13293	14114	15054	14114
query23	16318	15895	15598	15598
query23_1	15828	15694	15594	15594
query24	7170	1632	1222	1222
query24_1	1253	1236	1261	1236
query25	567	523	437	437
query26	1221	260	150	150
query27	2771	492	296	296
query28	4435	1870	1858	1858
query29	858	569	472	472
query30	295	228	188	188
query31	986	956	877	877
query32	86	72	73	72
query33	524	358	290	290
query34	891	866	530	530
query35	676	691	595	595
query36	1062	1109	1005	1005
query37	139	95	81	81
query38	2977	2898	2931	2898
query39	863	836	834	834
query39_1	791	802	810	802
query40	240	157	138	138
query41	61	62	60	60
query42	263	260	262	260
query43	245	257	233	233
query44	
query45	198	191	226	191
query46	878	1021	612	612
query47	2648	2123	2052	2052
query48	319	314	236	236
query49	638	457	383	383
query50	699	274	216	216
query51	4112	4102	4034	4034
query52	266	271	257	257
query53	298	352	289	289
query54	309	276	282	276
query55	96	86	92	86
query56	318	332	323	323
query57	1883	1710	1533	1533
query58	287	278	274	274
query59	2794	2956	2803	2803
query60	361	342	334	334
query61	156	148	170	148
query62	639	597	531	531
query63	309	274	284	274
query64	5064	1272	1013	1013
query65	
query66	1468	453	355	355
query67	24349	24475	24474	24474
query68	
query69	408	326	292	292
query70	980	951	920	920
query71	341	309	298	298
query72	2811	2636	1980	1980
query73	555	554	319	319
query74	9628	9600	9425	9425
query75	2872	2759	2506	2506
query76	2273	1037	677	677
query77	359	362	320	320
query78	10919	11119	10488	10488
query79	1772	784	568	568
query80	1312	624	567	567
query81	542	261	222	222
query82	999	154	118	118
query83	330	260	241	241
query84	273	121	100	100
query85	912	511	465	465
query86	429	318	299	299
query87	3176	3150	3053	3053
query88	3623	2690	2675	2675
query89	438	373	361	361
query90	2018	185	173	173
query91	176	170	138	138
query92	74	79	72	72
query93	989	823	493	493
query94	629	326	292	292
query95	609	397	322	322
query96	645	528	225	225
query97	2489	2459	2406	2406
query98	247	221	227	221
query99	999	1014	918	918
Total cold run time: 250625 ms
Total hot run time: 168599 ms
