
Numba CAReduce: reorder loops based on strides #1961

Open
ricardoV94 wants to merge 2 commits into pymc-devs:v3 from ricardoV94:numba_careduce_loop_ordering


@ricardoV94 ricardoV94 commented Mar 9, 2026

Contains commits from #1993

Disclaimer: Heavily assisted by Claude, with me in the loop. The worst part is the codegen. We're just reordering the reduction loops based on the input strides.
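
As a rough illustration of the idea (not the code in this PR), the loop order can be chosen by sorting axes by absolute stride so that the smallest stride ends up in the innermost loop. The names `loop_order_from_strides` and `strided_sum` are hypothetical. The sketch uses a flat Python list in place of an array buffer, counts strides in elements rather than bytes, and reduces over every axis, whereas the PR also has to handle partial reductions.

```python
from itertools import product


def loop_order_from_strides(strides):
    """Return axes ordered outermost-to-innermost: largest |stride| first."""
    # sorted() is stable, so axes with equal strides keep their original order.
    return sorted(range(len(strides)), key=lambda ax: -abs(strides[ax]))


def strided_sum(buf, shape, strides, offset=0):
    """Sum every element of a strided view over a flat buffer,
    visiting memory in the order chosen by loop_order_from_strides."""
    order = loop_order_from_strides(strides)
    total = 0
    # product() varies its last argument fastest, so the last axis in
    # `order` (the one with the smallest stride) acts as the innermost loop.
    for idx in product(*(range(shape[ax]) for ax in order)):
        pos = offset + sum(i * strides[ax] for i, ax in zip(idx, order))
        total += buf[pos]
    return total


# A 2x3x4 C-contiguous buffer viewed as its transpose (shape 4x3x2).
buf = list(range(24))
shape, strides = (4, 3, 2), (1, 4, 12)
order = loop_order_from_strides(strides)  # axis 2 outermost, axis 0 innermost
total = strided_sum(buf, shape, strides)
```

For a C-contiguous input the chosen order is simply the natural axis order, which is consistent with the contiguous benchmarks being essentially unchanged.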

We should do something similar for Elemwise/Blockwise (and even copy/ascontiguous/asfcontiguous); for all of these it is clear that numba/LLVM does no loop-reordering optimization.

I went with string codegen for this, but we could consider an intrinsic instead.
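
For readers unfamiliar with the term, string codegen here means building the Python source of the loop nest as text and compiling it. The following is a minimal sketch of that pattern under simplifying assumptions, not the generator used in this PR: `make_sum_source` and `compile_sum` are hypothetical names, the reduction is a full sum, and the buffer is a flat list indexed with element strides.

```python
def make_sum_source(order):
    """Build source for a full-sum loop nest whose loops are nested in `order`
    (outermost axis first)."""
    lines = ["def reduce_sum(buf, shape, strides):", "    total = 0"]
    indent = "    "
    for ax in order:
        lines.append(f"{indent}for i{ax} in range(shape[{ax}]):")
        indent += "    "
    # The flat position does not depend on the loop order, only on the indices.
    pos = " + ".join(f"i{ax} * strides[{ax}]" for ax in sorted(order))
    lines.append(f"{indent}total += buf[{pos}]")
    lines.append("    return total")
    return "\n".join(lines)


def compile_sum(order):
    """Compile the generated source and return the resulting function."""
    namespace = {}
    exec(make_sum_source(order), namespace)
    return namespace["reduce_sum"]


src = make_sum_source([2, 1, 0])
reduce_sum = compile_sum([2, 1, 0])
result = reduce_sum(list(range(24)), (4, 3, 2), (1, 4, 12))
```

In a Numba backend the generated function would presumably be handed to the JIT compiler instead of being run as plain Python. The appeal of this approach is that the loop order is baked into the source, so the compiler sees an ordinary fixed loop nest.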

Benchmarks

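On large inputs we get roughly 6x to 25x speedups for the strided and transposed layouts (C-contiguous is essentially unchanged), at a cost of about 0.3 to 1.1 µs on very small non-contiguous inputs.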

Large reduction

Before

Name (time in ms)                                                             Min
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(0, 1)]       7.1365
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(0, 2)]       5.9966
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(1, 2)]       6.0055
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=0]            8.3376
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=1]            7.0675
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=2]            6.2248
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=None]         5.4307
test_careduce_benchmark_numba_large[layout=strided-axis=(0, 1)]          106.3384
test_careduce_benchmark_numba_large[layout=strided-axis=(0, 2)]          112.3177
test_careduce_benchmark_numba_large[layout=strided-axis=(1, 2)]           97.0718
test_careduce_benchmark_numba_large[layout=strided-axis=0]               106.9464
test_careduce_benchmark_numba_large[layout=strided-axis=1]               127.8707
test_careduce_benchmark_numba_large[layout=strided-axis=2]               115.8513
test_careduce_benchmark_numba_large[layout=strided-axis=None]            111.8030
test_careduce_benchmark_numba_large[layout=transposed-axis=(0, 1)]       100.7692
test_careduce_benchmark_numba_large[layout=transposed-axis=(0, 2)]        87.0125
test_careduce_benchmark_numba_large[layout=transposed-axis=(1, 2)]        94.7885
test_careduce_benchmark_numba_large[layout=transposed-axis=0]             85.3745
test_careduce_benchmark_numba_large[layout=transposed-axis=1]            102.7683
test_careduce_benchmark_numba_large[layout=transposed-axis=2]             89.3570
test_careduce_benchmark_numba_large[layout=transposed-axis=None]         139.1564

After

Name (time in ms)                                                            Min
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(0, 1)]      6.3777
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(0, 2)]      6.0822
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(1, 2)]      5.9218
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=0]           7.8687
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=1]           6.4912
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=2]           5.5795
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=None]        5.3910
test_careduce_benchmark_numba_large[layout=strided-axis=(0, 1)]          13.8629
test_careduce_benchmark_numba_large[layout=strided-axis=(0, 2)]          14.6061
test_careduce_benchmark_numba_large[layout=strided-axis=(1, 2)]           8.9945
test_careduce_benchmark_numba_large[layout=strided-axis=0]               13.9739
test_careduce_benchmark_numba_large[layout=strided-axis=1]                9.2601
test_careduce_benchmark_numba_large[layout=strided-axis=2]                9.1156
test_careduce_benchmark_numba_large[layout=strided-axis=None]             7.0391
test_careduce_benchmark_numba_large[layout=transposed-axis=(0, 1)]       13.8146
test_careduce_benchmark_numba_large[layout=transposed-axis=(0, 2)]       14.5665
test_careduce_benchmark_numba_large[layout=transposed-axis=(1, 2)]        9.0589
test_careduce_benchmark_numba_large[layout=transposed-axis=0]            13.9295
test_careduce_benchmark_numba_large[layout=transposed-axis=1]             9.2425
test_careduce_benchmark_numba_large[layout=transposed-axis=2]             9.0242
test_careduce_benchmark_numba_large[layout=transposed-axis=None]          5.4913

Small reduction

Before

Name (time in us)                                                           Min
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(0, 1)]     3.6470
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(0, 2)]     3.6170
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(1, 2)]     3.6370
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=0]          3.6570
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=1]          3.6370
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=2]          3.7070
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=None]       3.4960
test_careduce_benchmark_numba_small[layout=strided-axis=(0, 1)]          3.6670
test_careduce_benchmark_numba_small[layout=strided-axis=(0, 2)]          3.6670
test_careduce_benchmark_numba_small[layout=strided-axis=(1, 2)]          3.6870
test_careduce_benchmark_numba_small[layout=strided-axis=0]               3.6560
test_careduce_benchmark_numba_small[layout=strided-axis=1]               3.6670
test_careduce_benchmark_numba_small[layout=strided-axis=2]               3.7170
test_careduce_benchmark_numba_small[layout=strided-axis=None]            3.5170
test_careduce_benchmark_numba_small[layout=transposed-axis=(0, 1)]       3.6470
test_careduce_benchmark_numba_small[layout=transposed-axis=(0, 2)]       3.7070
test_careduce_benchmark_numba_small[layout=transposed-axis=(1, 2)]       3.7070
test_careduce_benchmark_numba_small[layout=transposed-axis=0]            3.6870
test_careduce_benchmark_numba_small[layout=transposed-axis=1]            3.7770
test_careduce_benchmark_numba_small[layout=transposed-axis=2]            3.7070
test_careduce_benchmark_numba_small[layout=transposed-axis=None]         3.6270

After

Name (time in us)                                                           Min
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(0, 1)]     3.5370
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(0, 2)]     3.4370
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(1, 2)]     3.4060
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=0]          3.6770
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=1]          3.5760
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=2]          3.4870
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=None]       3.3460
test_careduce_benchmark_numba_small[layout=strided-axis=(0, 1)]          4.1670
test_careduce_benchmark_numba_small[layout=strided-axis=(0, 2)]          4.2880
test_careduce_benchmark_numba_small[layout=strided-axis=(1, 2)]          4.2880
test_careduce_benchmark_numba_small[layout=strided-axis=0]               4.6890
test_careduce_benchmark_numba_small[layout=strided-axis=1]               4.8090
test_careduce_benchmark_numba_small[layout=strided-axis=2]               4.7490
test_careduce_benchmark_numba_small[layout=strided-axis=None]            4.2380
test_careduce_benchmark_numba_small[layout=transposed-axis=(0, 1)]       4.2080
test_careduce_benchmark_numba_small[layout=transposed-axis=(0, 2)]       4.0480
test_careduce_benchmark_numba_small[layout=transposed-axis=(1, 2)]       4.1680
test_careduce_benchmark_numba_small[layout=transposed-axis=0]            4.7290
test_careduce_benchmark_numba_small[layout=transposed-axis=1]            4.8890
test_careduce_benchmark_numba_small[layout=transposed-axis=2]            4.7090
test_careduce_benchmark_numba_small[layout=transposed-axis=None]         3.9470

@ricardoV94 ricardoV94 force-pushed the numba_careduce_loop_ordering branch 2 times, most recently from 438b421 to dc8f2e4 Compare March 9, 2026 15:15
@ricardoV94 ricardoV94 requested a review from aseyboldt March 9, 2026 16:06
@ricardoV94 ricardoV94 force-pushed the numba_careduce_loop_ordering branch from dc8f2e4 to 9f35e08 Compare March 20, 2026 09:16
@ricardoV94 ricardoV94 mentioned this pull request Mar 20, 2026
@ricardoV94 ricardoV94 force-pushed the numba_careduce_loop_ordering branch from 9f35e08 to 9e50bec Compare March 20, 2026 17:58
@ricardoV94 ricardoV94 marked this pull request as draft March 20, 2026 17:59
@ricardoV94 ricardoV94 force-pushed the numba_careduce_loop_ordering branch 3 times, most recently from 2cb3eca to 88d844e Compare March 24, 2026 14:38
@ricardoV94 ricardoV94 marked this pull request as ready for review March 24, 2026 14:42
@ricardoV94 ricardoV94 force-pushed the numba_careduce_loop_ordering branch from 88d844e to 740d711 Compare March 25, 2026 10:03
ricardoV94 and others added 2 commits March 25, 2026 11:03
Co-authored-by: Adrian Seyboldt <aseyboldt@users.noreply.github.com>
@ricardoV94 ricardoV94 force-pushed the numba_careduce_loop_ordering branch from 740d711 to bdc2156 Compare March 25, 2026 10:05