perf: Optimize some decimal expressions#3619
Open
andygrove wants to merge 4 commits intoapache:mainfrom
Open
Conversation
Replace the 4-node expression tree (Cast→BinaryExpr→Cast→Cast) used for Decimal128 arithmetic that may overflow with a single fused expression that performs i256 register arithmetic directly. This reduces per-batch allocation from 4 intermediate arrays (112 bytes/elem) to 1 output array (16 bytes/elem). The new WideDecimalBinaryExpr evaluates children, performs add/sub/mul using i256 intermediates via try_binary, applies scale adjustment with HALF_UP rounding, checks precision bounds, and outputs a single Decimal128 array. Follows the same pattern as decimal_div.
Add benchmark comparing old Cast->BinaryExpr->Cast chain vs fused WideDecimalBinaryExpr for Decimal128 add/sub/mul. Covers four cases: add with same scale, add with different scales, multiply, and subtract.
cb52636 to
d7495bd
Compare
Member
Author
|
@sqlbenchmark run tpch --iterations 3 |
1 similar comment
Member
Author
|
@sqlbenchmark run tpch --iterations 3 |
Eliminate redundant CheckOverflow when wrapping WideDecimalBinaryExpr (which already handles overflow). Fuse Cast(Decimal128→Decimal128) + CheckOverflow into a single DecimalRescaleCheckOverflow expression that rescales and validates precision in one pass.
5a21500 to
91092a6
Compare
Member
Author
|
@sqlbenchmark run tpch --iterations 3 |
|
Benchmark job |
Contributor
|
This is awesome ! |
parthchandra
reviewed
Mar 5, 2026
| let rescale_divisor = if need_rescale { | ||
| i256_pow10((max_scale - s_out) as u32) | ||
| } else { | ||
| i256::ONE |
Contributor
There was a problem hiding this comment.
Don't we need to scale up to s_out if s_out > max_scale?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
For some decimal operations, we cast the inputs to Decimal256, perform the math operation, and then cast the result back to Decimal128:
This causes intermediate allocations and can be optimized with a custom
WideDecimalBinaryExprthat performs i256 register arithmetic directly, reducing per-batch allocation from 4 intermediate arrays (3 Decimal256 @ 32 bytes/elem + 1 Decimal128 @ 16 bytes/elem = 112 bytes/elem) to 1 output array (16 bytes/elem)TPC-H q1
Before:
After:
Criterion benchmark results (8192 element batches)
How it works
WideDecimalBinaryExprevaluates left/right children, performs add/sub/mul usingi256intermediates viaarrow::compute::kernels::arity::try_binary, applies scale adjustment with HALF_UP rounding, checks precision bounds, and outputs a singleDecimal128array. Follows the same pattern asdecimal_divindiv.rs.Overflow handling matches existing behavior:
ArrowError::ComputeErrori128::MAXsentinel +null_if_overflow_precision