Numba CAReduce: reorder loops based on strides by ricardoV94 · Pull Request #1961 · pymc-devs/pytensor

ricardoV94 · 2026-03-09T14:37:04Z

Contains commits from #1993

Disclaimer: heavily assisted by Claude, with me in the loop. The worst part is the codegen. We're just reordering the reduction loops based on the input strides.
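To make the idea concrete, here is a minimal pure-NumPy sketch of stride-based loop ordering. It is not the code in this PR: the helper names `loop_order_from_strides` and `careduce_sum` are hypothetical, and the sketch only handles a sum. It sorts the axes so that the axis with the smallest absolute stride ends up in the innermost loop, then accumulates into the output in that order.

```python
import numpy as np


def loop_order_from_strides(strides):
    """Return axes ordered from outermost to innermost loop.

    The axis with the largest absolute stride goes outermost, so the
    innermost loop walks the smallest stride. (Hypothetical helper.)
    """
    return sorted(range(len(strides)), key=lambda ax: -abs(strides[ax]))


def careduce_sum(x, axes):
    """Sum `x` over `axes`, visiting elements in stride order."""
    order = loop_order_from_strides(x.strides)
    keep = [ax for ax in range(x.ndim) if ax not in axes]
    out = np.zeros([x.shape[ax] for ax in keep], dtype=x.dtype)
    # Permuted view: its last axis has the smallest stride.
    xt = x.transpose(order)
    # np.ndindex varies the last index fastest, i.e. the smallest stride.
    for idx_t in np.ndindex(*xt.shape):
        # Map the permuted index back to the original axis numbering.
        idx = [0] * x.ndim
        for pos, ax in enumerate(order):
            idx[ax] = idx_t[pos]
        out[tuple(idx[ax] for ax in keep)] += xt[idx_t]
    return out


# A transposed (non C-contiguous) input with byte strides (8, 96, 32).
x = np.arange(24, dtype="float64").reshape(2, 3, 4).transpose(2, 0, 1)
order = loop_order_from_strides(x.strides)
```

For this input the permuted view `x.transpose(order)` is C-contiguous, so the innermost loop reads memory sequentially regardless of which axes are reduced.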

We should do something similar for Elemwise/Blockwise (and even copy/ascontiguous/asfcontiguous), since it is clear that numba/LLVM does no loop-reordering optimization for any of them.

I went with string codegen for this, but we could consider an intrinsic instead.
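For readers unfamiliar with the approach, string codegen here means building the source of a loop nest as text and compiling it. The sketch below is an illustration only and does not reproduce the generator in this PR: `make_sum_source` and `compile_source` are hypothetical names, it covers only a full sum, and it stops at plain Python `exec` rather than going through numba.

```python
import numpy as np


def make_sum_source(loop_order, name="careduce_sum_all"):
    """Build source for a full-sum function whose loops nest in `loop_order`.

    `loop_order` lists axes from the outermost to the innermost loop.
    (Hypothetical helper, for illustration only.)
    """
    ndim = len(loop_order)
    lines = [f"def {name}(x):", "    acc = 0.0"]
    for depth, ax in enumerate(loop_order):
        indent = "    " * (depth + 1)
        lines.append(f"{indent}for i{ax} in range(x.shape[{ax}]):")
    # Index with the original axis numbering, whatever the loop nesting is.
    index = ", ".join(f"i{ax}" for ax in range(ndim))
    lines.append("    " * (ndim + 1) + f"acc += x[{index}]")
    lines.append("    return acc")
    return "\n".join(lines)


def compile_source(source, name="careduce_sum_all"):
    """Execute the generated source and return the resulting function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


# Loop order for an input with byte strides (8, 96, 32): axis 0 innermost.
src = make_sum_source((1, 2, 0))
fn = compile_source(src)
x = np.arange(24, dtype="float64").reshape(2, 3, 4).transpose(2, 0, 1)
result = fn(x)
```

Printing `src` shows the generated loop nest, which makes this style easy to debug but awkward to maintain. That trade-off is the reason an intrinsic is floated as an alternative.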

Benchmarks

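On the large benchmark, strided and transposed inputs get roughly 6x to 25x faster (minimum times below), while C-contiguous inputs are essentially unchanged. On very small inputs the overhead is minimal: the non-contiguous cases get about 0.3 to 1.1 us slower.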

Large reduction

Before

Name (time in ms)                                                             Min
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(0, 1)]       7.1365
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(0, 2)]       5.9966
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(1, 2)]       6.0055
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=0]            8.3376
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=1]            7.0675
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=2]            6.2248
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=None]         5.4307
test_careduce_benchmark_numba_large[layout=strided-axis=(0, 1)]          106.3384
test_careduce_benchmark_numba_large[layout=strided-axis=(0, 2)]          112.3177
test_careduce_benchmark_numba_large[layout=strided-axis=(1, 2)]           97.0718
test_careduce_benchmark_numba_large[layout=strided-axis=0]               106.9464
test_careduce_benchmark_numba_large[layout=strided-axis=1]               127.8707
test_careduce_benchmark_numba_large[layout=strided-axis=2]               115.8513
test_careduce_benchmark_numba_large[layout=strided-axis=None]            111.8030
test_careduce_benchmark_numba_large[layout=transposed-axis=(0, 1)]       100.7692
test_careduce_benchmark_numba_large[layout=transposed-axis=(0, 2)]        87.0125
test_careduce_benchmark_numba_large[layout=transposed-axis=(1, 2)]        94.7885
test_careduce_benchmark_numba_large[layout=transposed-axis=0]             85.3745
test_careduce_benchmark_numba_large[layout=transposed-axis=1]            102.7683
test_careduce_benchmark_numba_large[layout=transposed-axis=2]             89.3570
test_careduce_benchmark_numba_large[layout=transposed-axis=None]         139.1564

After

Name (time in ms)                                                            Min
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(0, 1)]      6.3777
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(0, 2)]      6.0822
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=(1, 2)]      5.9218
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=0]           7.8687
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=1]           6.4912
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=2]           5.5795
test_careduce_benchmark_numba_large[layout=c_contiguous-axis=None]        5.3910
test_careduce_benchmark_numba_large[layout=strided-axis=(0, 1)]          13.8629
test_careduce_benchmark_numba_large[layout=strided-axis=(0, 2)]          14.6061
test_careduce_benchmark_numba_large[layout=strided-axis=(1, 2)]           8.9945
test_careduce_benchmark_numba_large[layout=strided-axis=0]               13.9739
test_careduce_benchmark_numba_large[layout=strided-axis=1]                9.2601
test_careduce_benchmark_numba_large[layout=strided-axis=2]                9.1156
test_careduce_benchmark_numba_large[layout=strided-axis=None]             7.0391
test_careduce_benchmark_numba_large[layout=transposed-axis=(0, 1)]       13.8146
test_careduce_benchmark_numba_large[layout=transposed-axis=(0, 2)]       14.5665
test_careduce_benchmark_numba_large[layout=transposed-axis=(1, 2)]        9.0589
test_careduce_benchmark_numba_large[layout=transposed-axis=0]            13.9295
test_careduce_benchmark_numba_large[layout=transposed-axis=1]             9.2425
test_careduce_benchmark_numba_large[layout=transposed-axis=2]             9.0242
test_careduce_benchmark_numba_large[layout=transposed-axis=None]          5.4913
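As a quick check of the ratios, the snippet below divides a few of the minimum times from the Before table by the matching rows of the After table. The numbers are copied from the two tables above, and the row selection is arbitrary.

```python
# Minimum times in ms, copied from the "Large reduction" tables above.
before_ms = {
    "strided-axis=None": 111.8030,
    "transposed-axis=None": 139.1564,
    "transposed-axis=(0, 2)": 87.0125,
    "c_contiguous-axis=None": 5.4307,
}
after_ms = {
    "strided-axis=None": 7.0391,
    "transposed-axis=None": 5.4913,
    "transposed-axis=(0, 2)": 14.5665,
    "c_contiguous-axis=None": 5.3910,
}

# Speedup = time before / time after, per benchmark case.
speedups = {name: before_ms[name] / after_ms[name] for name in before_ms}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```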

Small reduction

Before

Name (time in us)                                                           Min
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(0, 1)]     3.6470
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(0, 2)]     3.6170
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(1, 2)]     3.6370
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=0]          3.6570
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=1]          3.6370
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=2]          3.7070
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=None]       3.4960
test_careduce_benchmark_numba_small[layout=strided-axis=(0, 1)]          3.6670
test_careduce_benchmark_numba_small[layout=strided-axis=(0, 2)]          3.6670
test_careduce_benchmark_numba_small[layout=strided-axis=(1, 2)]          3.6870
test_careduce_benchmark_numba_small[layout=strided-axis=0]               3.6560
test_careduce_benchmark_numba_small[layout=strided-axis=1]               3.6670
test_careduce_benchmark_numba_small[layout=strided-axis=2]               3.7170
test_careduce_benchmark_numba_small[layout=strided-axis=None]            3.5170
test_careduce_benchmark_numba_small[layout=transposed-axis=(0, 1)]       3.6470
test_careduce_benchmark_numba_small[layout=transposed-axis=(0, 2)]       3.7070
test_careduce_benchmark_numba_small[layout=transposed-axis=(1, 2)]       3.7070
test_careduce_benchmark_numba_small[layout=transposed-axis=0]            3.6870
test_careduce_benchmark_numba_small[layout=transposed-axis=1]            3.7770
test_careduce_benchmark_numba_small[layout=transposed-axis=2]            3.7070
test_careduce_benchmark_numba_small[layout=transposed-axis=None]         3.6270

After

Name (time in us)                                                           Min
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(0, 1)]     3.5370
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(0, 2)]     3.4370
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=(1, 2)]     3.4060
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=0]          3.6770
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=1]          3.5760
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=2]          3.4870
test_careduce_benchmark_numba_small[layout=c_contiguous-axis=None]       3.3460
test_careduce_benchmark_numba_small[layout=strided-axis=(0, 1)]          4.1670
test_careduce_benchmark_numba_small[layout=strided-axis=(0, 2)]          4.2880
test_careduce_benchmark_numba_small[layout=strided-axis=(1, 2)]          4.2880
test_careduce_benchmark_numba_small[layout=strided-axis=0]               4.6890
test_careduce_benchmark_numba_small[layout=strided-axis=1]               4.8090
test_careduce_benchmark_numba_small[layout=strided-axis=2]               4.7490
test_careduce_benchmark_numba_small[layout=strided-axis=None]            4.2380
test_careduce_benchmark_numba_small[layout=transposed-axis=(0, 1)]       4.2080
test_careduce_benchmark_numba_small[layout=transposed-axis=(0, 2)]       4.0480
test_careduce_benchmark_numba_small[layout=transposed-axis=(1, 2)]       4.1680
test_careduce_benchmark_numba_small[layout=transposed-axis=0]            4.7290
test_careduce_benchmark_numba_small[layout=transposed-axis=1]            4.8890
test_careduce_benchmark_numba_small[layout=transposed-axis=2]            4.7090
test_careduce_benchmark_numba_small[layout=transposed-axis=None]         3.9470

Co-authored-by: Adrian Seyboldt <aseyboldt@users.noreply.github.com>

ricardoV94 added numba compilation labels Mar 9, 2026

ricardoV94 force-pushed the numba_careduce_loop_ordering branch 2 times, most recently from 438b421 to dc8f2e4 on March 9, 2026 15:15

ricardoV94 requested a review from aseyboldt March 9, 2026 16:06

ricardoV94 mentioned this pull request Mar 10, 2026

Remove need for explicit left expand_dims in inputs of Elemwise #1967

Draft

ricardoV94 force-pushed the numba_careduce_loop_ordering branch from dc8f2e4 to 9f35e08 on March 20, 2026 09:16

ricardoV94 mentioned this pull request Mar 20, 2026

Refactor benchmarks #1993

Merged

ricardoV94 force-pushed the numba_careduce_loop_ordering branch from 9f35e08 to 9e50bec on March 20, 2026 17:58

ricardoV94 marked this pull request as draft March 20, 2026 17:59

ricardoV94 force-pushed the numba_careduce_loop_ordering branch 3 times, most recently from 2cb3eca to 88d844e on March 24, 2026 14:38

ricardoV94 marked this pull request as ready for review March 24, 2026 14:42

ricardoV94 added the performance label Mar 24, 2026

ricardoV94 force-pushed the numba_careduce_loop_ordering branch from 88d844e to 740d711 on March 25, 2026 10:03

ricardoV94 and others added 2 commits March 25, 2026 11:03

Benchmark strided and small CAReduce

0539236

Numba CAReduce: reorder loops based on strides

bdc2156


ricardoV94 force-pushed the numba_careduce_loop_ordering branch from 740d711 to bdc2156 on March 25, 2026 10:05

ricardoV94 wants to merge 2 commits into pymc-devs:v3 from ricardoV94:numba_careduce_loop_ordering
