Fix memory reservation starvation in sort-merge#20642
Fix memory reservation starvation in sort-merge#20642xudong963 wants to merge 4 commits intoapache:mainfrom
Conversation
|
Update: I added an end-to-end test which fails on main |
0f2140c to
651e9c9
Compare
| // Transfer the pre-reserved merge memory to the streaming merge | ||
| // using `take()` instead of `new_empty()`. This ensures the merge | ||
| // stream starts with `sort_spill_reservation_bytes` already | ||
| // allocated, preventing starvation when concurrent sort partitions | ||
| // compete for pool memory. `take()` moves the bytes atomically | ||
| // without releasing them back to the pool, so other partitions | ||
| // cannot race to consume the freed memory. |
There was a problem hiding this comment.
The pre reserved merge memory should be used as part of the sort merge stream.
I mean that if x pre reserved merge memory was reserved the sort merge stream should know about that so it wont think it starting from 0, otherwise this just reserve for unaccounted memory
There was a problem hiding this comment.
Thanks! Good point — just take() alone wouldn't be enough if the merge stream doesn't know about the pre-reserved bytes.
The PR does address this in the other changed files:
- In builder.rs: BatchBuilder now tracks batches_mem_used separately and only calls try_grow() when actual usage exceeds the current reservation size. It also records initial_reservation so
it never shrinks below that during build_output. This way the pre-reserved bytes are used as the initial budget rather than requesting from the pool on top of them. - In multi_level_merge.rs: get_sorted_spill_files_to_merge now tracks total_needed and only requests additional pool memory when total_needed > reservation.size(), so spill file buffers
covered by the pre-reserved bytes don't trigger extra pool allocations.
So the merge stream is aware of the pre-reserved bytes and uses them as its starting budget — it doesn't think it's starting from 0.
DataFusion 52.1.0 has a TOCTOU race in ExternalSorter where merge reservations are freed and re-created empty, letting other partitions steal the memory (apache/datafusion#20642). Until the upstream fix lands, compute a data-aware sort_spill_reservation_bytes by sampling actual Arrow row sizes from the input, estimating spill file count, and reserving enough for the merge phase.
* Sample-based sort spill reservation to mitigate merge OOM DataFusion 52.1.0 has a TOCTOU race in ExternalSorter where merge reservations are freed and re-created empty, letting other partitions steal the memory (apache/datafusion#20642). Until the upstream fix lands, compute a data-aware sort_spill_reservation_bytes by sampling actual Arrow row sizes from the input, estimating spill file count, and reserving enough for the merge phase. * Add more tests * formatting * Allow some truncation here * Handle when budgets are tight * Better estimate in-memory size using row count and avg row size * Remove dead code * Lint fix * linting * Fix low memory situations
01ae92d to
acd7caa
Compare
Which issue does this PR close?
Rationale for this change
This PR fixes memory reservation starvation in sort-merge when multiple sort partitions share a GreedyMemoryPool.
When multiple
ExternalSorterinstances run concurrently and share a single memory pool, the merge phase starves:What changes are included in this PR?
Are these changes tested?
I can't find a deterministic way to reproduce the bug, but it occurs in our production.Add an end-to-end test to verify the fixAre there any user-facing changes?