321 commits
dd6a442
feat(llama-cpp): per-model max_prefill_tokens option (chunked-prefill…
mudler Jun 23, 2026
a3abd60
docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM)
mudler Jun 23, 2026
8925c00
docs(paged): scope durable grouped FP4-MMA MoE GEMM port for GB10
mudler Jun 23, 2026
010067d
feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap
mudler Jun 23, 2026
acb22a6
feat(paged): mirror MoE token-tile density-aware auto-select (patch 0…
mudler Jun 23, 2026
ee78ae4
docs(paged): Qwen3.6 NVFP4 h2h bench doc - MoE llama.cpp table
mudler Jun 23, 2026
2975a74
docs(paged): Qwen3.6 NVFP4 apples-to-apples scorecard (llama vs vLLM,…
mudler Jun 23, 2026
c8b1f16
docs(paged): dense NVFP4 fair re-run with max_prefill_tokens budget s…
mudler Jun 23, 2026
c7075fb
docs(paged): MoE 35B-A3B NVFP4 fair re-run with max_prefill_tokens bu…
mudler Jun 23, 2026
362eea9
docs(paged): fair re-run verdict - synthesize NVFP4 llama vs vLLM sco…
mudler Jun 23, 2026
ed17fc8
docs(paged): scope token-granular continuous-batch scheduler for llam…
mudler Jun 23, 2026
5a38dd3
docs(paged): adversarial review of the continuous-batch scheduler scope
mudler Jun 23, 2026
fccbb40
docs(paged): ground vLLM 0.23.0 eager-decode architecture vs llama.cpp
mudler Jun 24, 2026
24ce7d0
feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 001…
mudler Jun 24, 2026
f7500df
docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget
mudler Jun 24, 2026
e4c6317
docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.…
mudler Jun 24, 2026
ea634ee
docs(paged): scope track B - FP4-MMA decode-GEMM roofline + parity go…
mudler Jun 24, 2026
c1d7f33
docs(paged): enrich track-B scope with code-level FP4-GEMM inefficien…
mudler Jun 24, 2026
7434d64
docs(paged): build-ready track-B FP4-GEMM scope - kernel decision + p…
mudler Jun 24, 2026
39e16cc
docs(paged): adversarial review of track-B FP4-GEMM parity go/no-go
mudler Jun 24, 2026
40f019e
docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate re…
mudler Jun 24, 2026
da67fd8
docs(paged): A.2 CUDA-graph decode lever measurement and gap diagnosis
mudler Jun 24, 2026
2dd5d68
docs(paged): A.2 Phase 2 - locate the real decode lever (gated-DeltaN…
mudler Jun 24, 2026
34cadb6
docs(paged): A.2 final synthesis - CUDA-graph decode verdict
mudler Jun 24, 2026
5ce2f1d
feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (pat…
mudler Jun 24, 2026
6f0792c
feat(paged): qwen35 SSM decode fused recurrent-state gather (patch 0019)
mudler Jun 24, 2026
ee13fd1
docs(paged): profile-both-engines post-SSM ground-truth decode decomp…
mudler Jun 25, 2026
c0e0ed3
docs(paged): synthesize decode-parity exploration - the o_proj MMVQ l…
mudler Jun 25, 2026
b895f4d
feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020)
mudler Jun 25, 2026
e597a8a
docs(paged): vLLM GDN decode = 2 fused kernels under CUDA graph vs ll…
mudler Jun 25, 2026
2b57997
docs(paged): cudagraph-coverage - GDN serial chain IS graph-covered a…
mudler Jun 25, 2026
a723852
docs(paged): decisive node-level decode timeline gap - bubbles refuted
mudler Jun 25, 2026
5825b07
docs(paged): SYNTHESIS - validated decode-parity picture, ranked plan…
mudler Jun 25, 2026
fd4332e
docs(paged): GDN recurrence byte-gate SETTLED - re-stream ~1.0x, buil…
mudler Jun 25, 2026
2a8103c
docs(paged): FINAL DECISION - NO-BUILD fused recurrence, BUILD conv f…
mudler Jun 25, 2026
1785573
docs(paged): bf16 SSM-state build plan (PART C synthesis: edits, KL g…
mudler Jun 25, 2026
5cec1a6
docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correc…
mudler Jun 25, 2026
8f8777e
feat(paged): qwen35 decode conv-state in-place fusion (patch 0021)
mudler Jun 25, 2026
3c1ed67
feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune…
mudler Jun 25, 2026
02cbae5
feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)
mudler Jun 25, 2026
64766ec
Merge branch 'master' into worktree-feat+paged-attention
mudler Jun 25, 2026
634c0e5
docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling a…
mudler Jun 25, 2026
24833f0
docs(paged): bf16 SSM-state NO-SHIP - fails f32 KL gate (= vLLM's own…
mudler Jun 26, 2026
7c45447
docs(paged): FUTURE_LEVERS - parked decode-parity exploration trail
mudler Jun 26, 2026
aaaa90a
bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs …
mudler Jun 26, 2026
ae0042f
docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode…
mudler Jun 26, 2026
7dd3431
docs(paged): promote TTFT/prefill + paged-pool burst-degradation bug …
mudler Jun 26, 2026
00f9265
docs(paged): correct vLLM recurrent-state precision (f32, not bf16)
mudler Jun 26, 2026
001d833
docs(paged): f16/bf16 glue probe - dense decode residual ceiling
mudler Jun 26, 2026
89e62fc
docs(paged): finalize f16 glue probe - cost analysis + build verdict
mudler Jun 26, 2026
b061e4a
docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick…
mudler Jun 26, 2026
125d10a
feat(paged): paged-pool burst-reclaim (truncate + defrag + slot relea…
mudler Jun 26, 2026
167768c
feat(backend): llama-cpp-localai-paged variant + NVFP4 Qwen3.6 gallery
mudler Jun 26, 2026
30a2b59
Merge branch 'master' into worktree-feat+paged-attention (llama.cpp p…
mudler Jun 26, 2026
ec7c1b1
feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 pat…
mudler Jun 26, 2026
4d3fecd
docs(paged): MoE decode re-graph lever (patch 0025) + speedup-hunt B …
mudler Jun 26, 2026
6bfca14
docs(paged): speedup-hunt C section + final RANK + PLAN synthesis
mudler Jun 26, 2026
fe5bd3f
feat(paged): qwen35 hybrid per-head f32/bf16 SSM state (patch 0026)
mudler Jun 26, 2026
33dfe7f
feat(paged): qwen35 hybrid per-head f32/bf16 SSM state - carry fix + …
mudler Jun 26, 2026
1f857f1
docs(paged): B-2 down_proj act-quant retune RESULT - negative (no hea…
mudler Jun 26, 2026
9c1c2a6
docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceili…
mudler Jun 26, 2026
b3d3323
feat(paged): wire ssm_bf16_tau model option for hybrid SSM-state fast…
mudler Jun 26, 2026
3b59571
docs(paged): both-engine MoE decode decomposition - the 15% is NOT th…
mudler Jun 26, 2026
6c6a925
docs(paged): MoE-vs-vLLM DECIDE synthesis - reject W4A16 Marlin, the …
mudler Jun 26, 2026
b1667b4
feat(paged): qwen35 recurrent-state gather fusion (patch 0028)
mudler Jun 26, 2026
bf9b4fa
feat(gallery): NVFP4-MTP Qwen3.6 entries for the LocalAI paged backend
mudler Jun 26, 2026
79edfd2
feat(gallery): -paged suffix rename + qwopus NVFP4-MTP paged variants
mudler Jun 26, 2026
6dd8a3d
docs(gallery): NVFP4 GGUFs published to mudler/ - update header note
mudler Jun 26, 2026
c1f1d1e
Merge remote-tracking branch 'origin/master' into worktree-feat+paged…
mudler Jun 26, 2026
62c407e
docs(paged): lever1 gather-fusion bench landed - checkpoint + attribu…
mudler Jun 26, 2026
9a1be79
docs(paged): lever-4 scope - NVFP4 the still-bf16 MoE GDN/attn projec…
mudler Jun 26, 2026
e3f8149
docs(paged): lever-4 KL-gate FAIL - NVFP4 MoE projections cost ~6% PP…
mudler Jun 26, 2026
9b0e4e5
docs(paged): residual-assess FINAL - MoE at bit-exact ceiling, hunt DONE
mudler Jun 27, 2026
db6ebc5
feat(paged): block-table within-step host cache (patch 0029)
mudler Jun 27, 2026
683e225
docs(paged): arch-generality audit - build-targeting (CUDA arch fan +…
mudler Jun 27, 2026
34abf39
docs(paged): ARCH audit - NVFP4 GGUF off-Blackwell portability + gall…
mudler Jun 27, 2026
5667dfe
docs(paged): arch-generality audit - optimization classification (001…
mudler Jun 27, 2026
2a2de1d
docs(paged): patch-arch-safety classification for patches 0018-0029
mudler Jun 27, 2026
87cfd1f
docs(paged): quant-generality audit - SSM/serving opts are quant-agno…
mudler Jun 27, 2026
af6e133
docs(paged): cross-arch synthesis - ship verdict + minimum non-Blackw…
mudler Jun 27, 2026
2332587
fix(gallery): scope NVFP4-paged entries to Blackwell + consistent tags
mudler Jun 27, 2026
621a20d
feat(paged): backend-gate fused GDN/discriminated SSM_CONV emission (…
mudler Jun 27, 2026
202a29f
feat(paged): Metal/darwin build availability for llama-cpp-localai-paged
mudler Jun 27, 2026
400930d
Merge remote-tracking branch 'origin/master' into worktree-feat+paged…
mudler Jun 27, 2026
e160041
chore(paged): decouple paged llama.cpp pin from the nightly auto-bumper
mudler Jun 27, 2026
2bee7a5
ci(paged): add early-warning canary for vendored llama.cpp paged patches
mudler Jun 27, 2026
7e1832b
fix(paged): strip stray dev-doc hunks so patch series applies on a cl…
mudler Jun 27, 2026
a5a5b2a
feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verif…
mudler Jun 27, 2026
fb2dc33
docs(paged): consolidate the dev-trail docs into one canonical README
mudler Jun 27, 2026
78fac9a
refactor(paged): stock llama-cpp is patch-free; paged backend owns it…
mudler Jun 27, 2026
4a9a1dd
docs(paged): Mac stock-vs-patched bench + Vulkan note + cross-backend…
mudler Jun 27, 2026
984c8fc
docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Met…
mudler Jun 27, 2026
9115c2c
docs(paged): correct Vulkan/SYCL note (GDN op IS upstream) + CUDA-onl…
mudler Jun 27, 2026
a4e7309
feat(paged): restrict llama-cpp-localai-paged to CUDA-only build targets
mudler Jun 27, 2026
db14006
docs(agents): add paged-backend maintenance + vLLM-parity methodology…
mudler Jun 27, 2026
08b754f
chore(paged): keep patches/ patch-only; README to backend root, docs …
mudler Jun 27, 2026
53f66a6
fix(paged): revert pin to 9d5d882d (== stock); c299a92c broke grpc-se…
mudler Jun 27, 2026
ed5eb70
docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7
mudler Jun 27, 2026
3466094
docs(paged): re-measure DGX benchmarks on one harness (stock/patched/…
mudler Jun 27, 2026
266fcc7
docs(agents): fix A/B-bench gotcha - env-toggle != stock for compiled…
mudler Jun 27, 2026
1431f72
docs(paged): regenerate decode plots (3-way) from re-measured data + …
mudler Jun 27, 2026
0b84fda
docs(paged): add the bf16-tau opt-in line to the decode plots
mudler Jun 27, 2026
9bb8994
chore(paged): drop CUDA-12 variants of llama-cpp-localai-paged, keep …
mudler Jun 28, 2026
23b11a5
paged-kv-manager.h: add missing <cstddef> for size_t
mudler Jun 28, 2026
4da769c
paged headers: self-include <cstddef>/<cstdint> for size_t/uintN_t (f…
mudler Jun 28, 2026
1f3e5ba
fix(paged): serialize both SSM partitions in hybrid bf16-tau state sa…
mudler Jun 28, 2026
ea72a56
Merge origin/master + pin-sync paged backend to 0ed235ea
mudler Jun 28, 2026
c51ff4c
docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan…
mudler Jun 28, 2026
2c59805
fix(paged): rpc cmake target renamed rpc-server -> ggml-rpc-server at…
mudler Jun 28, 2026
4cd90bf
paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=10…
mudler Jun 28, 2026
11128cb
docs(paged): scope the large-M NVFP4 prefill GEMM lever (design only)
mudler Jun 28, 2026
e610347
feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031)
mudler Jun 28, 2026
9a28f23
docs(paged): scope the continuous-serving decode gap (host-bound, des…
mudler Jun 28, 2026
4bdd26a
docs(paged): scope tensor-core (mma) chunked GDN prefill kernel
mudler Jun 28, 2026
0007053
feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch…
mudler Jun 28, 2026
d706980
feat(paged): close the continuous-serving decode gap (S1+S3, patches …
mudler Jun 28, 2026
2fa8ef8
fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 o…
mudler Jun 28, 2026
b028c81
docs(paged): record padded/fixed-slot decode shape as tested-and-reje…
mudler Jun 28, 2026
f1c98ff
fix(paged): revert S3 decode-stable scheduler to default-OFF (A/B reg…
mudler Jun 29, 2026
c4058eb
feat(paged): tail-fusion (0042) + full-step decode CUDA graph default…
mudler Jun 29, 2026
042deab
docs(paged): vLLM-parity lever map + tensor-core GDN build plan (both…
mudler Jun 29, 2026
7b38c6b
feat(paged): GDN M5 tensor-core chunked-scan prefill, default-on unde…
mudler Jun 29, 2026
be65438
docs(paged): record MoE-prefill engine-gap decomposition + GEMM-port …
mudler Jun 29, 2026
bd100dd
fix(paged): repair the patch series, sync to the fork branch (drop de…
mudler Jun 30, 2026
6edbb56
docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED)
mudler Jun 30, 2026
baf1025
docs(paged): correct decode-serving record to ~86% GPU-steady parity …
mudler Jun 30, 2026
2431090
docs(paged): future-agent vLLM-parity HANDOFF guide (GB10, how-to com…
mudler Jun 30, 2026
8bb47e5
docs(paged): correct PARITY_HANDOFF ahead/behind + note dense CDEF ga…
mudler Jun 30, 2026
2033086
patches(paged): track 0044 GatedRMSNorm patch, sync LocalAI series to…
mudler Jun 30, 2026
1b9176c
docs(paged): codify fork-first patch workflow as mandatory policy
mudler Jun 30, 2026
de34cd5
docs(paged): refresh parity handoff state
mudler Jun 30, 2026
f8d7b02
docs(paged): scope GB10 parity reopen plan
mudler Jun 30, 2026
d288a03
docs(paged): add GB10 parity implementation plan
mudler Jun 30, 2026
6ac0673
docs(paged): start GB10 parity phase0 record
mudler Jun 30, 2026
b3cfdfa
docs(paged): record GB10 parity source provenance
mudler Jun 30, 2026
b1a1b72
docs(paged): record GB10 parity artifact gaps
mudler Jun 30, 2026
a9a2efb
docs(paged): record phase0 clean build gates
mudler Jun 30, 2026
ef5d4af
docs(paged): record phase0 prefill baseline
mudler Jun 30, 2026
337ebb8
docs(paged): record phase0 decode repro
mudler Jun 30, 2026
1c0709b
docs(paged): record W4A16 phase1 kill gate
mudler Jun 30, 2026
d8edc61
patches(paged): mirror W4A16 packed metadata
mudler Jun 30, 2026
c5f2545
patches(paged): tune W4A16 grouped tile shape
mudler Jun 30, 2026
8b413d1
docs(paged): record W4A16 scale broadcast rejection
mudler Jun 30, 2026
85c8832
patches(paged): pad W4A16 A shared tile stride
mudler Jun 30, 2026
f9e015d
docs(paged): record W4A16 Wq padding rejection
mudler Jun 30, 2026
b647460
docs(paged): record phase6 serving classifier
mudler Jun 30, 2026
34c4b5c
docs(paged): scope phase7 serving candidates
mudler Jun 30, 2026
d0fa463
test(paged): mirror MoE swiglu down gate
mudler Jun 30, 2026
3cf7fa1
docs(paged): reject swiglu down fusion candidate
mudler Jun 30, 2026
22a93ce
docs(paged): select weighted combine candidate
mudler Jun 30, 2026
4b6fc0f
test(paged): mirror MoE weighted combine gate
mudler Jun 30, 2026
b6885aa
docs(paged): reject weighted combine fusion candidate
mudler Jul 1, 2026
ef14748
docs(paged): scope ragged MoE dispatch phase
mudler Jul 1, 2026
89ef3a4
docs(paged): record ragged MoE profile gate
mudler Jul 1, 2026
b009de0
test(paged): mirror ragged MoE dispatch gate
mudler Jul 1, 2026
b862e2c
docs(paged): stop ragged dispatch source shortcut
mudler Jul 1, 2026
9bbe02c
fix(paged): gate MTP backend sampling
mudler Jul 1, 2026
ff3ad84
docs(paged): record GDN C32 slab baseline
mudler Jul 1, 2026
3da3b16
docs(paged): reject GDN C32 slab phase
mudler Jul 1, 2026
24e778d
docs(paged): scope GDN M5 state-boundary phase
mudler Jul 1, 2026
1b5ae22
docs(paged): reject GDN M5 QS-early phase
mudler Jul 1, 2026
adabd11
docs(paged): scope GDN global Ai32 prototype
mudler Jul 1, 2026
2074b4f
docs(paged): reject GDN global Ai32 prototype
mudler Jul 1, 2026
abc70c2
docs(paged): close ragged MoE dispatch shortcut
mudler Jul 1, 2026
ede23df
docs(paged): close W4A16 pad checklist
mudler Jul 1, 2026
e169058
chore(paged): add DGX inference gate runner
mudler Jul 1, 2026
7039436
docs(paged): gate MTP rollback safety
mudler Jul 1, 2026
4d171e6
docs(paged): reject MTP serving lever
mudler Jul 1, 2026
ae76d42
docs(paged): profile MTP graph reuse loss
mudler Jul 1, 2026
6e35476
docs(paged): scope MTP graph-shape follow-up
mudler Jul 1, 2026
cced07c
docs(paged): add MTP shape trace patch
mudler Jul 1, 2026
310eb3c
docs(paged): reject MTP draft-shape scheduler
mudler Jul 1, 2026
c99678d
docs(paged): refresh current serving snapshot
mudler Jul 1, 2026
ff3f062
chore(paged): add current serving snapshot harness
mudler Jul 1, 2026
6c16574
docs(paged): verify patch-series mirror invariant
mudler Jul 1, 2026
7aa15ce
docs(paged): refresh parity handoff coordinates
mudler Jul 1, 2026
7108b68
chore(paged): record snapshot hardware class
mudler Jul 1, 2026
a019412
chore(paged): summarize snapshot inference gates
mudler Jul 1, 2026
ace1ffa
docs(paged): record audited current snapshot
mudler Jul 1, 2026
3c2cb9f
docs(paged): record graph-node serving profile
mudler Jul 1, 2026
3b9ec3e
docs(paged): record mmq occupancy rejection
mudler Jul 1, 2026
2148fa4
feat(paged): add moe mmq shape trace patch
mudler Jul 1, 2026
b28b448
docs(paged): record mmq shape serving profile
mudler Jul 1, 2026
e189e5a
feat(paged): add moe mmq launch trace patch
mudler Jul 1, 2026
70a4c31
feat(paged): add moe small-m mmq candidate trace
mudler Jul 1, 2026
7665422
feat(paged): add moe small-m mmq tile policy gate
mudler Jul 1, 2026
ba1979a
feat(paged): add moe mmid route trace patch
mudler Jul 1, 2026
49cce0b
feat(paged): add mul mat route trace patch
mudler Jul 1, 2026
fbdc200
feat(paged): add cublas route trace patch
mudler Jul 1, 2026
9f75da0
feat(paged): add cublas tensor-name trace patch
mudler Jul 1, 2026
5354adc
docs(paged): scope gate projection policy
mudler Jul 1, 2026
52c11b1
docs(paged): reject graph-time gate fusion shortcut
mudler Jul 1, 2026
d44e164
docs(paged): record max-concurrency parity check
mudler Jul 1, 2026
aa848d5
docs(paged): record low-concurrency serving check
mudler Jul 1, 2026
b9eff5b
docs(paged): reconcile next parity target
mudler Jul 1, 2026
ecaf406
docs(paged): reject persistent gate fusion shortcut
mudler Jul 1, 2026
ae8284f
feat(paged): parameterize vllm serving snapshot
mudler Jul 1, 2026
2a0fc0f
docs(paged): record inference gate guard
mudler Jul 1, 2026
e69ee0e
feat(paged): parameterize served model name
mudler Jul 1, 2026
440129c
fix(paged): harden serving snapshot readiness
mudler Jul 1, 2026
96825a2
docs(paged): record dense serving snapshot
mudler Jul 1, 2026
cd59e5d
fix(paged): scrub harness vars for vllm serve
mudler Jul 1, 2026
c299dcd
docs(paged): record dense true decode profile
mudler Jul 1, 2026
b5f6515
docs(paged): record serving admission trace
mudler Jul 1, 2026
2aa7670
docs(paged): record dense admission trace
mudler Jul 1, 2026
347a5c0
docs(paged): reject admission budget sweep
mudler Jul 1, 2026
3dbf34e
docs(paged): record admission histogram trace
mudler Jul 1, 2026
999cf09
docs(paged): record TTFT prefill-first A/B
mudler Jul 1, 2026
902bcc7
docs(paged): validate TTFT prefill-first A/B
mudler Jul 1, 2026
9be291e
docs(paged): reject capped TTFT defer sweep
mudler Jul 1, 2026
c41d1a5
docs(paged): record waiting-threshold TTFT defer
mudler Jul 1, 2026
ef7dbfa
docs(paged): compare MoE min32 against vLLM
mudler Jul 1, 2026
fc5d5e4
docs(paged): profile current W4A16 prefill
mudler Jul 1, 2026
ef57886
docs(paged): scope W4A16 direct activation experiment
mudler Jul 1, 2026
b425d8c
docs(paged): mark W4A16 direct policy tests done
mudler Jul 1, 2026
4645935
docs(paged): mark W4A16 direct routing stub done
mudler Jul 1, 2026
f7d7638
docs(paged): record W4A16 direct activation rejection
mudler Jul 1, 2026
6a2618b
docs(paged): record MTP verify-cost rejection
mudler Jul 1, 2026
2e19e5c
docs(paged): record prefill bucket attribution phase
mudler Jul 1, 2026
55df910
docs(paged): record layout trace phase
mudler Jul 1, 2026
3fbdfc2
docs(paged): record quant trace phase
mudler Jul 1, 2026
60954d4
docs(paged): record quant kernel timing phase
mudler Jul 1, 2026
e67b329
docs(paged): record BF16 cuBLAS F32 output phase
mudler Jul 1, 2026
2b2b1f0
docs(paged): record BF16 F32 output dense serving phase
mudler Jul 1, 2026
e573194
docs(paged): record patch mirror readiness phase
mudler Jul 1, 2026
6cf8b78
docs(paged): record BF16 F32 output broader serving phase
mudler Jul 1, 2026
e5c5746
docs(paged): record GDN tensor-core revalidation phase
mudler Jul 1, 2026
2efb0ec
docs(paged): record TTFT min32 serving phase
mudler Jul 1, 2026
eb82ff1
docs(paged): record datacenter Blackwell readiness phase
mudler Jul 1, 2026
5369219
docs(paged): record GDN blocked-solve PoC phase
mudler Jul 1, 2026
26a41fa
docs(paged): record post-PoC GDN audit phase
mudler Jul 1, 2026
f21b393
docs(paged): record current MoE graph profile phase
mudler Jul 1, 2026
a9454b4
docs(paged): record MoE decode-only profile phase
mudler Jul 1, 2026
04ed7fe
docs(paged): record GDN launch sweep phase
mudler Jul 1, 2026
bbfaa66
docs(paged): record GDN BV32 decode A/B phase
mudler Jul 1, 2026
d091eb3
docs(paged): record GDN identity shortcut phase
mudler Jul 1, 2026
67d2c4c
docs(paged): record BF16 GDN state cache phase
mudler Jul 1, 2026
1aba410
docs(paged): record phases 112-140 + series trim decision
mudler Jul 2, 2026
b529cc5
patches(paged): trim series to Phase135 routed-FFN line, sync to fork…
mudler Jul 2, 2026
bf61db6
docs(paged): scope vLLM-class execution re-architecture (additive pro…
mudler Jul 2, 2026
b2784cc
docs(paged): fix EXECUTION_REARCH_SCOPE seam citations to fork 1edddc8fe
mudler Jul 2, 2026
500d653
feat(paged): regenerate patch series 0053-0055 (P1 bf16-stream)
mudler Jul 2, 2026
ccf75d1
docs(paged): record P1 bf16-stream landing (GO)
mudler Jul 2, 2026
586639d
docs(paged): record P2 MoE-region NO-GO (kill-gate flat + seam-shape …
mudler Jul 2, 2026
865e77c
docs(paged): record P4 CBv2 NO-GO at the perf kill-gate
mudler Jul 2, 2026
7b129a5
docs(paged): finalize P4 CBv2 record with the measured A/B verdict
mudler Jul 2, 2026
5b8b33a
docs(paged): record P5 FLA GDN NO-GO - GDN prefill bucket is a confir…
mudler Jul 2, 2026
ac2b021
docs(paged): record P6 fp8-KV BLOCKED-ON-INFRA + the analytical decod…
mudler Jul 2, 2026
a1a3b99
docs(paged): record P3 W4A16 direct-A NO-GO + write program-level pre…
mudler Jul 2, 2026
3159ed0
docs(paged): record P6 fp8-KV measured NO-GO - throughput dead end, c…
mudler Jul 2, 2026
143 changes: 143 additions & 0 deletions .agents/llama-cpp-localai-paged-backend.md
@@ -0,0 +1,143 @@
# llama-cpp-localai-paged Backend (paged attention + Blackwell NVFP4 decode)

`llama-cpp-localai-paged` is LocalAI's **CUDA-only** paged-attention variant of the
llama.cpp backend. It targets high-concurrency decode for the Qwen3.6 hybrid
gated-DeltaNet (SSM) models on Blackwell (GB10 / DGX Spark). It reuses the stock
`llama-cpp` backend's sources and applies a vendored patch series on top at build
time. It does **not** vendor a forked source tree: it ships a source-only `*.patch` stack plus one canonical doc.

**Canonical reference:** `backend/cpp/llama-cpp-localai-paged/README.md`
(architecture, the patch series 0001-0030, benchmarks, dev notes, generality,
pin/canary policy). Read it for any technical detail; this guide is the maintenance
how-to.

## Where things live

- `backend/cpp/llama-cpp-localai-paged/Makefile` - the thin wrapper. It copies the
stock `backend/cpp/llama-cpp/` build infra into a build dir, clones llama.cpp at
this backend's **own** pin (`LLAMA_VERSION`), applies the paged series via the
`apply-paged-patches` define (strict `git apply`), then builds `grpc-server`.
- `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch`
series (0001-0030), nothing else.
- `backend/cpp/llama-cpp-localai-paged/README.md` - the canonical doc. The
operational docs (`PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`) and
dev artifacts live in
`backend/cpp/llama-cpp-localai-paged/docs/`.
- `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh`
- the CUDA build entry points.
- `backend/cpp/llama-cpp/` - the **stock** backend, pure upstream. It carries no
paged patches.

## Invariants (do not break these)

- **Stock stays pure.** The paged patches live ONLY in this backend. Never add a
`patches/paged/` dir or `LLAMA_PAGED` logic to `backend/cpp/llama-cpp/`.
- **CUDA-only.** Ship cublas/cuda targets only. Off CUDA, the fusions are gated off
(patch 0030) and NVFP4 falls back to dequant, so the backend is neutral to slightly
negative there; non-CUDA users should use the stock `llama-cpp`. Do not add
cpu/vulkan/sycl/metal rows for this backend in `.github/backend-matrix.yml`.
(Those builds also fail to link `grpc-server` on darwin/arm64 against upstream
`stream_*` server symbols, which is another reason it is CUDA-only.)
- **Source-only patches.** A `.patch` may touch only llama.cpp source - never a
dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A
stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.)
- **Bit-exact by default.** Every shipped patch keeps the output byte-identical to the
f32 baseline. (The one opt-in precision trade, `ssm_bf16_tau` / patch 0026, was
DROPPED: it went flat once the decode fusions landed. Forcing all gated-DeltaNet
heads to bf16 gave 780.6 vs 780.0 t/s, i.e. zero benefit, so the series is now
bit-exact end to end. Do not reintroduce a per-head SSM-precision lever; see the
rejected-levers note in the backend README section 5.)
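
The source-only invariant can be checked mechanically before a series is committed. The following is a minimal sketch, not the project's actual tooling: the function name `non_source_paths` is hypothetical, and the allow-list is an assumption borrowed from the pathspec example in the pin-sync section of this guide. It only reads the `diff --git` header lines of a patch and reports any path that is a `*.md` file or falls outside the allowed source directories (paths containing spaces are not handled).

```python
import re

# Assumption: allowed source directories, taken from the pin-sync pathspec example.
ALLOWED_PREFIXES = ("src/", "ggml/", "common/", "include/", "tools/", "tests/", "cmake/")

# Matches the per-file header that `git format-patch` writes for each touched file.
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def non_source_paths(patch_text: str) -> list[str]:
    """Return each touched path that is a markdown doc or lies outside the allow-list."""
    offenders: list[str] = []
    for a_path, b_path in DIFF_HEADER.findall(patch_text):
        for path in (a_path, b_path):
            is_doc = path.endswith(".md")
            is_source = path.startswith(ALLOWED_PREFIXES)
            if (is_doc or not is_source) and path not in offenders:
                offenders.append(path)
    return offenders
```

An empty result means the patch passes this particular check. It does not replace the strict `git apply` on a clean checkout, which remains the real gate.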

## Fork-first workflow (MANDATORY)

The fork **`mudler/llama.cpp` branch `localai-paged`** is the CANONICAL source
of truth for ALL paged-backend kernel and patch work. The vendored
`patches/paged/*.patch` series is a **derivative**: the fork is the source, the
series is a generated mirror of it.

**Always update the fork FIRST, in this exact order:**

1. **Commit the change on the `localai-paged` branch and push it.** Every
kernel or patch change lands as a fork commit first.
2. **Then regenerate the LocalAI series from the fork** via `git format-patch`
(one patch per fork commit, source-only) into
`backend/cpp/llama-cpp-localai-paged/patches/paged/`, so the series stays a
**1:1, drift-free mirror** of the branch.

Hard rules, no exceptions:

- **NEVER edit the `patches/paged/*.patch` files directly.** They are generated
output, not source.
- **NEVER add a patch to the series that has no corresponding fork-branch
commit.** Every `.patch` must be the `git format-patch` of a real commit on
`localai-paged`.
- The fork branch is **where the build and the per-path bit-exact md5 gate
actually run**, so it is the **only** place a change is truly validated. A
patch living only in the LocalAI series has never been built or gated.

Verify the mirror by tree hash: applying the full on-disk series on the pin
must reproduce the fork branch tree byte-for-byte. (The patch maintenance
detail is in `backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md`;
the hard-gate is section 2.5 of `docs/PARITY_HANDOFF.md`.)

## Maintaining the pin against new llama.cpp

The pin (`LLAMA_VERSION` in the wrapper Makefile) is advanced ONLY by the manual
pin-sync. It is deliberately **excluded from the nightly auto-bumper**
(`bump_deps.yaml`): a naive bump would shift the tree out from under the patches
and break `git apply` at build time.

1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml`
runs weekly: it applies + builds the series against the latest upstream tip and
goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync.
2. **The pin-sync** (recorded in the README section 7 and git history): rebase the series onto the new
tip (resolve conflicts; re-export **source-only** with a pathspec like
`-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box,
pass the bit-exact gate on **every** path + `test-backend-ops`, **and confirm
the full grpc-server build/link is green on CI**, then bump `LLAMA_VERSION`.

**Hard constraint: keep the pin == the stock `llama-cpp` pin.** `grpc-server.cpp`
is shared with the stock backend and tracks the stock pin. A paged pin that
diverges PAST an upstream server-API refactor breaks the grpc-server LINK even
when the patches are byte-for-byte bit-exact - the bit-exact gate alone does NOT
catch it. The `c299a92c` bump did exactly this (patches applied + greedy-md5
bit-exact, but `grpc-server.cpp` failed to link with undefined `stream_*` server
helpers the refactor pulled into its headers), so it was reverted to `9d5d882d`.
A pin bump is shippable only once the full CI grpc-server build is green, which in
practice means moving in lockstep with the stock pin (or vendoring a
pin-matched grpc-server.cpp, which we deliberately do not, to keep stock pure).
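
The lockstep constraint lends itself to a cheap pre-merge check. The sketch below assumes that both Makefiles assign the pin on a line of the form `LLAMA_VERSION?=<commit>` (the exact assignment syntax and the variable name in the stock Makefile are assumptions, not confirmed by this guide); `read_pin` and `pins_in_lockstep` are hypothetical helper names.

```python
import re

# Assumption: the pin is assigned as `LLAMA_VERSION?=<commit>`, `LLAMA_VERSION:=...`
# or `LLAMA_VERSION=...`; adjust the pattern if the Makefiles differ.
PIN_LINE = re.compile(r"^LLAMA_VERSION\s*[?:]?=\s*(\S+)", re.MULTILINE)


def read_pin(makefile_text: str) -> str:
    """Extract the pinned llama.cpp commit from the text of a Makefile."""
    match = PIN_LINE.search(makefile_text)
    if match is None:
        raise ValueError("no LLAMA_VERSION assignment found")
    return match.group(1)


def pins_in_lockstep(stock_makefile: str, paged_makefile: str) -> bool:
    """True when the paged backend pins the same commit as the stock backend."""
    return read_pin(stock_makefile) == read_pin(paged_makefile)
```

A check like this only catches pin divergence. It says nothing about whether the patches still apply or link, so it complements the canary and the full CI grpc-server build rather than replacing them.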

## The bit-exact gate (run for every change)

- greedy md5: `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1 </dev/null | md5sum`,
paged paths prefixed `LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged
MoE). Must match the recorded baseline. Redirect stdin from `/dev/null` or
`llama-completion` hangs in conversation mode.
- `test-backend-ops` (CUDA0 vs CPU oracle) for every touched op (`SSM_CONV*`,
`GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
- **The gate is per-path.** The paged-MoE md5 differs from the non-paged md5 - a
benign, KL-validated FP-accumulation-order difference (see `docs/PAGED_BITEXACT_NOTE.md`).
Compare a paged-MoE change to the **paged** reference, not the non-paged one.
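
The per-path rule is easy to get wrong by hand, so here is a minimal sketch of the comparison logic. It assumes the raw stdout bytes of each `llama-completion` run have already been captured; the path labels (such as `"paged-moe"`) and the helper names are hypothetical, and the baselines are plain hex digests rather than `md5sum` output lines.

```python
import hashlib


def md5_hex(output: bytes) -> str:
    """Hex md5 digest of the captured completion output."""
    return hashlib.md5(output).hexdigest()


def gate_failures(outputs: dict[str, bytes], baselines: dict[str, str]) -> list[str]:
    """Return the paths whose output md5 differs from the baseline recorded for that same path.

    A path with no recorded baseline is reported as a failure instead of being
    silently compared against some other path's reference.
    """
    failures: list[str] = []
    for path, output in outputs.items():
        expected = baselines.get(path)
        if expected is None or md5_hex(output) != expected:
            failures.append(path)
    return failures
```

Keying both dictionaries by path is the design point: a paged-MoE output is only ever compared with the paged-MoE baseline, never with the non-paged one.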

## Encapsulating your work

- When you change a kernel, follow the **Fork-first workflow** above: commit and
push on the `localai-paged` branch first, then regenerate the `.patch`
(source-only) from the fork so this worktree mirrors the branch byte-for-byte.
Commit with sign-off.
- New optimization -> next patch number (gaps 0005/0027 are intentional). Update
the README's patch table and dev notes - keep the README the single doc; do not
scatter `*_RESULTS.md` files.
- Record rejected/flat levers in the README too (they stop the next person from
re-running dead ends).

## Follow-ups (Metal / SYCL / Vulkan)

The decode fusions are implemented for **CUDA + CPU only**. The base
gated-DeltaNet + SSM_CONV ops already exist upstream on Metal, SYCL, and Vulkan,
so the models **run** there via the non-fused path - what is missing is the
fusion speedup. Porting it (strictly mirroring the CUDA kernels, since we have no
Metal/SYCL/Vulkan hardware to test on here) is scoped in `docs/UPSTREAM_LAYER2_SCOPE.md`
(recommended order: Metal, then SYCL, then Vulkan; ops-first upstream PR, then one
PR per backend, each gated by `test-backend-ops` on the target hardware). The
methodology for that work is in [.agents/vllm-parity-methodology.md](vllm-parity-methodology.md).
101 changes: 101 additions & 0 deletions .agents/vllm-parity-methodology.md
@@ -0,0 +1,101 @@
# Methodology: Closing the vLLM Decode-Throughput Gap in llama.cpp

This is the playbook that took the paged backend
([.agents/llama-cpp-localai-paged-backend.md](llama-cpp-localai-paged-backend.md))
from ~38% of vLLM decode to **parity-to-ahead on dense** (and a proven, honest
ceiling on MoE) on GB10. Use it for any "make llama.cpp match or beat engine X on
accelerator Y" effort. The *levers* are model- and hardware-specific; the
*discipline* below is not. The worked example, with all numbers, is the paged
backend README.

## The core loop

1. **Establish a bit-exact baseline and gate FIRST.** Record the greedy md5 (per
path) and an f32 reference. Every optimization must stay byte-identical to it -
or ship as an explicit, default-off precision opt-in. This is what lets you
optimize aggressively without silently regressing quality. Gate two ways:
greedy md5, and `test-backend-ops` against the CPU oracle.

2. **Profile - do not assume.** Profile the steady-state decode step with nsys, broken
down per *kernel* AND per *memcpy*, and find the dominant cost. "It's the GEMM" was wrong
here: on hybrid gated-DeltaNet models the bottleneck was the recurrent-state
**plumbing** (state memcpy + gathers, ~67% of the step), not the weight GEMM.
Also sanity-check GPU-busy %: an early "low utilization" reading was an artifact of the
profiling window (decode was 96-99% GPU-busy), not real idle time.

3. **Ground-truth BOTH engines.** Decompose *your* decode step AND the
competitor's, side by side, per bucket, and compute the per-bucket delta. This
tells you WHERE the gap actually is - not where you would guess. It overturned
premises here: e.g. vLLM does NOT run the GDN/attn projections as NVFP4 (it
keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.

4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
same-harness A/B bench. Use a runtime env-toggle (flag off vs on) ONLY for
levers that are actually runtime-gated; a lever **compiled into** the binary
(e.g. the SSM decode fusions here) is NOT isolated by a runtime flag, so measure
it build-vs-build. The full-patchset "stock" baseline likewise needs a
**separately-built unpatched binary at the same pin** - toggling the runtime
flag on the patched binary does not reproduce stock (it measures only the gated
part; here that was ~neutral, which is exactly how this gotcha hides). Bank only
what lifts AND gates. **Record every rejected or flat lever with the reason** -
over time this is the most valuable part: it stops the next person re-running
dead ends.

5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
lever measured, not assumed). What remains is physical - the memory-bandwidth
floor, the irreducible serial-SSM host loop (sampling can't start until logits
land). Name it; do not claim more than you measured.
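
The md5 half of the gate in step 1 can be sketched as a small shell helper. This is a minimal sketch, not the backend's actual gate script: `md5_of` and `gate_md5` are hypothetical names, and it assumes the greedy-decode output has already been written to a file by whatever harness is in use.

```bash
# Hypothetical helpers for a byte-identical (md5) gate.
# Assumption: the greedy-decode output was captured to a file beforehand.

# md5_of FILE -> prints the md5 hex digest of FILE
md5_of() {
    md5sum "$1" | awk '{ print $1 }'
}

# gate_md5 BASELINE_MD5 OUTPUT_FILE
# Returns 0 when OUTPUT_FILE hashes to BASELINE_MD5, 1 otherwise.
gate_md5() {
    local baseline="$1" output="$2" actual
    actual="$(md5_of "$output")"
    if [[ "$actual" == "$baseline" ]]; then
        echo "GATE PASS: $output"
    else
        echo "GATE FAIL: $output (expected $baseline, got $actual)" >&2
        return 1
    fi
}
```

Record one baseline digest per path (for example paged and non-paged) and call `gate_md5` after every candidate lever, so a lever that changes even one output byte is caught before it is benched.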

## Hard rules learned

- **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
(`llama-batched-bench`) is exact - lead with it. But "stock" must be a
separately-built unpatched binary at the SAME pin, NOT the patched binary with
the runtime flag off (compiled-in wins survive the toggle). Cross-engine "% of vLLM"
(batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
and config (context length alone shifted the MoE figure 76% <-> 86%).
- **Re-measure a "win" after later levers land - it may evaporate.** bf16 SSM
state (the `ssm_bf16_tau` lever) benched +12% early but failed the f32 KL gate
(vLLM keeps f32 too), so it was kept as a default-off opt-in. Once the decode fusions
(recurrent-state gather-fusion + block-table cache) landed, a clean re-measure
forcing ALL gated-DeltaNet heads to bf16 (`tau=100000`) went **flat** - 780.6 vs
780.0 t/s. The "+12%" was subsumed by the fusions: the lever bought nothing, so
it was **dropped** (precision trade + bug surface + extra CUDA template-instantiation
compile cost, zero benefit). A win measured before the rest of the series is not a
win after it.
- **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the
critical path benches FLAT (the freed time becomes idle). Quantizing the bf16
projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason.
Always measure before believing; a plausible mechanism is not a result.
- **The gate can be per-path.** Paged vs non-paged attention legitimately produces
different (equivalent) FP-reduction orders; validate the difference is benign
(KLD to f32) and then gate each path against its own reference.
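
The "went flat" call in the re-measure rule is just a relative delta compared against a noise threshold. A minimal sketch of that arithmetic follows. `pct_delta` and `is_flat` are hypothetical helper names, the 1% threshold is an assumed noise floor rather than a figure from this work, and treating 780.0 t/s as the baseline is also an assumption (the difference is under 0.1% either way).

```bash
# Hypothetical helpers for judging an A/B throughput result.

# pct_delta BASE NEW -> prints the signed percent change from BASE to NEW
pct_delta() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%+.2f\n", (new - base) / base * 100 }'
}

# is_flat BASE NEW THRESHOLD_PCT
# Returns 0 when the absolute percent change is below THRESHOLD_PCT.
is_flat() {
    awk -v base="$1" -v new="$2" -v thr="$3" 'BEGIN {
        d = (new - base) / base * 100
        if (d < 0) d = -d
        exit !(d < thr)
    }'
}

pct_delta 780.0 780.6      # prints +0.08
```

With these numbers `is_flat 780.0 780.6 1.0` succeeds, which matches the decision recorded above: a 0.08% change is well inside any plausible run-to-run noise, so the lever is not banked.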

## Orchestration (multi-agent)

- **One GPU profiler/bencher at a time** (the GPU-contention rule). Parallel
design/analysis/read agents are fine; concurrent GPU benches pollute each other's
numbers.
- **Adversarial verify.** Before banking a finding, spawn skeptics prompted to
*refute* it; majority-refute kills it. Prevents plausible-but-wrong results.
- **Anti-punt.** Use foreground, blocking ssh loops with short benches and a
progress-file checkpoint. Agents that background work and "wait for the monitor
event" stall - forbid that pattern.
- **GPU coexistence.** On a shared host, stop the user's deployments for a clean
benchmark window (with their OK) and ALWAYS restore them (wrap the bench so a
failure cannot strand them).
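
The "ALWAYS restore" rule can be sketched with a shell `trap`. This is an illustrative pattern, not the tooling used for this work: `run_bench_window`, `stop_deployments`, and `restore_deployments` are hypothetical names, and the two placeholder functions must be replaced with whatever actually stops and restarts the user's services on the host.

```bash
# Placeholder hooks (assumptions): replace with the real stop/start commands.
stop_deployments()    { echo "stopping user deployments (placeholder)"; }
restore_deployments() { echo "restoring user deployments (placeholder)"; }

# run_bench_window CMD [ARGS...]
# Runs CMD inside a subshell whose EXIT trap restores the deployments,
# so the restore happens whether the bench succeeds or fails.
run_bench_window() {
    (
        # Register the restore BEFORE stopping anything.
        trap restore_deployments EXIT
        stop_deployments || exit 1
        "$@"
        exit $?
    )
}
```

Usage would look like `run_bench_window ./my-bench.sh`, where `./my-bench.sh` stands in for the real bench command. The wrapper returns the bench's exit status, so a calling loop can still detect the failure after the deployments are back.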

## What generalizes (and what doesn't)

The *speedups* may be hardware-specific (here: CUDA/Blackwell - the SSM fusions,
NVFP4 FP4-MMA, the occupancy tune), which is why other accelerators did not
benefit. But the *findings* often generalize and are worth upstreaming: the
"decode is plumbing-bound, not GEMM-bound" insight and the bit-exact, CPU-mirrored
fusion ops help any backend running these models. Separate "ship our tuned backend"
from "upstream the portable op" - they are different deliverables.

## The closing record

Write up the result HONESTLY: the shipped wins, the rejected levers (with reasons),
the structural ceiling, and the cross-backend / cross-quant generality. Negative
results are as valuable as wins. The paged backend README is the template.
39 changes: 39 additions & 0 deletions .docker/llama-cpp-localai-paged-compile.sh
@@ -0,0 +1,39 @@
#!/usr/bin/env bash
# Shared compile logic for backend/Dockerfile.llama-cpp-localai-paged.
# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.

set -euxo pipefail

export CCACHE_DIR=/root/.ccache
ccache --max-size=5G || true
ccache -z || true

export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"

if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
    CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
    export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
    echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
    rm -rf /LocalAI/backend/cpp/llama-cpp-localai-paged-*-build
fi

cd /LocalAI/backend/cpp/llama-cpp-localai-paged

if [ -z "${BUILD_TYPE:-}" ]; then
    # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
    # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
    if [ "${TARGETARCH}" = "arm64" ]; then
        apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
        export CC=gcc-14 CXX=g++-14
    fi
    make llama-cpp-localai-paged-cpu-all
else
    # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
    # does the compute. Keeps the GPU compile from also building the CPU variant matrix and
    # avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
    make llama-cpp-localai-paged-fallback
fi
make llama-cpp-localai-paged-grpc
make llama-cpp-localai-paged-rpc-server

ccache -s || true
33 changes: 33 additions & 0 deletions .github/backend-matrix.yml
@@ -5177,6 +5177,39 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  # llama-cpp-localai-paged: the LocalAI paged-attention llama.cpp variant. Each
  # row mirrors the corresponding llama-cpp row with backend/dockerfile/tag-suffix
  # swapped; builder-base-image is left UNCHANGED so these reuse the same
  # base-grpc-* prebuilt bases (same gRPC + same toolchain), needing no new
  # base-images.yml variant.
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    skip-drivers: 'false'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-arm64'
    base-image: "ubuntu:24.04"
    runs-on: 'ubuntu-24.04-arm'
    ubuntu-version: '2404'
    backend: "llama-cpp-localai-paged"
    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
    context: "./"

# Darwin matrix (consumed by backend-jobs-darwin).
includeDarwin: