219 commits
9ec13f9
Add classifier training support
runwangdl Mar 17, 2025
f1a0491
Fix L3 DMA and Maxpool Bugs
runwangdl Mar 3, 2025
29baf2c
WIP Static Memory Allocation of IOs
Victor-Jung Mar 3, 2025
25be229
Temporary fix broken float softmax
Victor-Jung Mar 4, 2025
da56cbe
Fix lifetime of aliased input buffers
Victor-Jung Mar 4, 2025
721f747
Fix output buffer lifetime
Victor-Jung Mar 4, 2025
78685e5
Linting
Victor-Jung Mar 4, 2025
02b5435
WIP fix output buffer lifetime
Victor-Jung Mar 4, 2025
a2d67a0
Change RQHardswish dim due to compiler bug
Victor-Jung Mar 4, 2025
bdd92de
Fix typo
Victor-Jung Mar 4, 2025
20b1f8b
Fix duplicated IO in memory allocation visualization
Victor-Jung Mar 4, 2025
c708069
Fix the Constant Tensor offset to not take into account IO since they…
Victor-Jung Mar 4, 2025
b6e2448
Add new attribute to Variable and Transient buffer to annotate if the…
Victor-Jung Mar 7, 2025
7e96f18
Adapt calculateLifetime to use buffer I/O annotation
Victor-Jung Mar 7, 2025
b923520
Fix typo
Victor-Jung Mar 7, 2025
f4cb9e0
Remove IO buffer name and refactor var name
Victor-Jung Mar 13, 2025
435cc9d
Linting
Victor-Jung Mar 13, 2025
731f39f
Test the correctness of the memory map after memory allocation
Victor-Jung Mar 17, 2025
dd1370c
Allocate memory arena first
Victor-Jung Mar 17, 2025
8bfdb13
correct DMA lengh of copy assertion
runwangdl Mar 18, 2025
f01eb7f
Align memory allocation test
Victor-Jung Mar 18, 2025
031dc79
delete redundant shell scripts
runwangdl Mar 19, 2025
58e18da
Merge branch 'devel' into PULPCCTL3_16_16_64
runwangdl Mar 19, 2025
ac2d879
Update node with multioutput to single output
runwangdl Mar 19, 2025
6a7198b
add softmaxcrossentropygrad tiling
runwangdl Mar 19, 2025
360aef7
Add softmaxcrossentropylossgrad tiling
runwangdl Mar 20, 2025
bc48582
Merge branch 'PULPCCTL3_16_16_64' into GEMM_training_tiled
runwangdl Mar 20, 2025
b6542ba
Fix CI issue
runwangdl Mar 20, 2025
fe208d0
Fix CI bugs
runwangdl Mar 20, 2025
4a21359
update CI
runwangdl Mar 20, 2025
a0dcb6d
Improve memory alloc visualization
Victor-Jung Mar 20, 2025
91f12f0
Add and pass test for CCT gemmtraining 1_16_16_8 to 128
runwangdl Mar 20, 2025
d1e1ebf
update CI with 8-128 dim CCT last gemm training test
runwangdl Mar 20, 2025
86a2e99
Add SGD support for PULP Open
runwangdl Mar 20, 2025
bdacd2f
Update CCT training test with sgd
runwangdl Mar 20, 2025
b5421cc
Multi-level profiling + Linting
Victor-Jung Mar 21, 2025
99035f0
Update Changelog
runwangdl Mar 23, 2025
62e87d3
Merge branch 'devel' into GEMM_training_tiled
runwangdl Mar 23, 2025
15ea3ec
Solved issues caused by merging conflicts
runwangdl Mar 23, 2025
a644fdf
Solved Review Comments
runwangdl Mar 28, 2025
643e160
Resolving conflicts
runwangdl Mar 28, 2025
80a9518
Reresolve the conflict
runwangdl Mar 28, 2025
501775d
Solving CI issues
runwangdl Mar 28, 2025
65a56b7
fix linting errors
runwangdl Mar 28, 2025
03c3f4a
gelu sigmoid approximation
runwangdl Mar 24, 2025
7e141fd
gelu parallel + unroll
runwangdl Mar 24, 2025
c3ee783
Float Matmul Parallel on M
runwangdl Mar 24, 2025
47d8c19
Softmax Parallel and Softmax Op Support
runwangdl Mar 24, 2025
ccba380
conv parallel without im2col
runwangdl Mar 25, 2025
fafcedf
PULP Layernorm Parallel
runwangdl Mar 25, 2025
147e68f
Fixed CI issues
runwangdl Mar 28, 2025
6e07dc9
fixing linting
runwangdl Mar 28, 2025
8b2f685
Merge branch 'devel' into devel_CCT_Optim
runwangdl Apr 8, 2025
9c0b8f6
Enlarge CI floatconv tiling L1 size for 8 core and delete CCT 128 tes…
runwangdl Apr 8, 2025
4c36de2
matmul 1*4 unrolling
runwangdl Apr 24, 2025
28ec2ca
Add computeOp support for CCT necessary kernels
runwangdl Apr 24, 2025
bf1f8ae
Add openlibm expf
runwangdl Apr 13, 2025
deac9ce
add relu, mul, maxpool ops num
runwangdl May 4, 2025
3b12187
Optimize parallel for multiple kernels
runwangdl May 4, 2025
49da947
Merge branch 'devel' into devel_CCT_Optim
runwangdl May 4, 2025
47961b9
Merge branch 'devel' into devel_CCT_Optim
runwangdl May 6, 2025
8907532
Change ConvTileConstraint to only tile on outchannel
runwangdl May 6, 2025
133f9ae
Fix error in gelu
runwangdl May 6, 2025
f25127d
Fix Linting Issues
runwangdl May 6, 2025
6f3f585
Merge branch 'devel' into devel_CCT_Optim
runwangdl May 8, 2025
4ffea9b
Change CI tests
runwangdl May 8, 2025
81c3460
profilling string change to const static
runwangdl May 8, 2025
4af69de
Fix profiling dual loop issue
runwangdl May 8, 2025
e819626
Add RV32IMF Picolibc support for Siracusa platform
runwangdl May 8, 2025
fa0cc37
Build Docker for new gvsoc for testing
runwangdl May 8, 2025
ac56ca2
Gvsoc Small test
runwangdl May 8, 2025
fd6c99d
Add Redmule Platform, Engline, Tiler, and Deployer
runwangdl May 8, 2025
2862f29
Add rv32imf.txt to build docker
runwangdl May 8, 2025
9ef9cc2
Update GVSOC hash
runwangdl May 9, 2025
10de9f6
matmul delicate constraints for Redmule
runwangdl May 9, 2025
efab54c
Merge branch 'devel_CCT_Optim' into redmule_platform
runwangdl May 9, 2025
37670e6
conv with redmule
runwangdl May 9, 2025
08b7e23
Add CCT 32 test
runwangdl May 9, 2025
e42b3d6
xtensor gvsoc docker build
runwangdl May 9, 2025
823d847
add softmaxgrad tileconstraint
runwangdl May 10, 2025
212ff3c
LayernormGrad and CCT MLP Training Graph
runwangdl May 11, 2025
d7346a5
Merge branch 'devel' into exp/heterogeneous-memory-placement
runwangdl May 12, 2025
c51694b
Fix Layernormgrad
runwangdl May 12, 2025
3efa661
Add Gelugrad
runwangdl May 16, 2025
aee7651
Merge branch 'exp/heterogeneous-memory-placement' into AttentionTraining
runwangdl May 16, 2025
b40cbd7
GEMM with Redmule
runwangdl May 18, 2025
203f095
Efficient GEMM
runwangdl May 18, 2025
7835c5a
reducesum tileconstraint
runwangdl Jun 9, 2025
21294bb
temporary deactiate transposesplit otherwise kq training failed
runwangdl Jun 9, 2025
90689e2
merge devel
runwangdl Jun 13, 2025
5c3f287
gemm no bias + input in name issue for codegenerate
runwangdl Jun 19, 2025
3271c3a
Parallelization and Optimization of CCT Inference and Training Kernel…
runwangdl Jun 12, 2025
7f99f2c
Adapation for Merging Devel
runwangdl Jun 20, 2025
2b46d2d
AttentionTraining Support
runwangdl Nov 16, 2025
bc3f951
clean unwanted deeplotest
runwangdl Nov 16, 2025
fe13842
Merge branch 'devel' into AttentionTraining
runwangdl Nov 16, 2025
d867f73
Remove Redmule Content from this branch
runwangdl Nov 16, 2025
a7d6903
Fix Bugs after merge
runwangdl Nov 16, 2025
0806442
Update CCT training testcases
runwangdl Nov 25, 2025
df6e698
Add CCT2 Training to CI
runwangdl Nov 25, 2025
c28300a
Fix CI errors
runwangdl Nov 25, 2025
06fa447
Remove redundant files
runwangdl Nov 25, 2025
3b2af3f
Decrease unneccessary changes compared with devel
runwangdl Nov 26, 2025
caa751a
Fix transposesplit samenaming issue & update GEMM no bias for tiling
runwangdl Nov 26, 2025
179262e
Fixing Linting
runwangdl Nov 26, 2025
0e16453
Remove redundant changes
runwangdl Nov 26, 2025
3f5a042
Fix linting again
runwangdl Nov 26, 2025
5502d04
Merge latest devel including TinyViT tiling support
runwangdl Nov 26, 2025
d7517e2
Add GlobalAveragePool for MI_BMInet
runwangdl Nov 26, 2025
21777bd
[CNNTrain] Add averagepool
runwangdl Dec 7, 2025
04b81ac
[CNNTrain] Add AveragepoolGrad
runwangdl Dec 7, 2025
8e3bbe7
[CNNTrain] Add ConGradX
runwangdl Dec 7, 2025
26b1e1b
[CNNTraining] Stash
runwangdl Dec 8, 2025
ce5491e
[CNNTraining] ConvGradX, W,B and DW
runwangdl Dec 11, 2025
5f6813a
[CNNTraining] Convdw gradw
runwangdl Dec 11, 2025
b387b14
[CNNTraining] ReluGrad
runwangdl Dec 12, 2025
9f3e712
Link PULPTrainlib
runwangdl Dec 19, 2025
625a388
CNN-Training: CONVGradX,W,DWConvGradX,w
runwangdl Dec 20, 2025
7e2c33d
decouple convgrad with conv
runwangdl Jan 4, 2026
b0017f9
convgradw untiled nonim2col
runwangdl Jan 4, 2026
ed46217
convgradw im2col untiled
runwangdl Jan 4, 2026
78f320c
convgradw im2col tiling but need manual memset to change output data …
runwangdl Jan 4, 2026
13ba9d8
dwconvgradwhw tile and remove convbias
runwangdl Jan 5, 2026
c328687
pwconvgradw untiled and tiled
runwangdl Jan 5, 2026
313952a
Convgradx im2col tiling
runwangdl Jan 6, 2026
5b91672
dwconvgradx tiling pass
runwangdl Jan 6, 2026
0b7e165
refactor parser and template, pwconvgradx untiled
runwangdl Jan 6, 2026
8a300b4
refactor convgrad tileconstraints
runwangdl Jan 6, 2026
ee3fb9c
add convgrad layer
runwangdl Jan 8, 2026
02e8bb5
change relugrad naming to formal grad
runwangdl Jan 8, 2026
2d3f56d
sgd relu some changes
runwangdl Jan 8, 2026
025b39f
zero initialization of pwconv
runwangdl Jan 8, 2026
342a162
zero initialization of convgrad
runwangdl Jan 8, 2026
55f14ef
[CNN Training] Some missing pieces
runwangdl Jan 12, 2026
0ad72a6
change pulptrainlib to my personal repo
runwangdl Jan 12, 2026
4a44bd1
Add GroupNormalization and its gradient
runwangdl Jan 23, 2026
027d206
groupnormgradB
runwangdl Jan 25, 2026
0faead0
Fix forktransformer bug for convgradx
runwangdl Jan 25, 2026
7460074
WCCI 4 e2e trainging graph
runwangdl Jan 25, 2026
f0f7f70
fix groupnormgradxstat tileconstraint surpass l1 limit bug
runwangdl Jan 26, 2026
49cddd2
Deeploy Microbenchmark with GVSoC CSR and Demo on GEMM
runwangdl Feb 12, 2026
b260e4e
Add float concat and Change padding pattern of ConV
runwangdl Feb 13, 2026
9803232
Merge remote-tracking branch 'upstream/devel' into sleepvit
runwangdl Feb 15, 2026
c38a72a
Initial Training platform
runwangdl Feb 25, 2026
9e5957b
Updated training update with gradient accumulation and optimizer update
runwangdl Feb 26, 2026
0c4cfd7
Add MLP_Train Test
runwangdl Feb 26, 2026
78bd0df
Merge branch 'sleepvit' into TrainingPlatform
runwangdl Feb 26, 2026
36d145d
Temporal Changes for Multi-Ouput Kernels to fit the new testtraining …
runwangdl Mar 2, 2026
a89c533
Add Small Conv+Transformer Test for training untiled platform
runwangdl Mar 2, 2026
9428468
Avoid generation redundant memory copy for the same input during mult…
runwangdl Mar 2, 2026
b0e4fc2
Merge branch 'CNNTraining' into TrainingPlatform
runwangdl Mar 2, 2026
78e567c
Fix missing GroupNorm imports in PULPOpen/Bindings.py after CNNTraini…
runwangdl Mar 2, 2026
f5c8d00
Fix merge: restore CNNTraining additions dropped during conflict reso…
runwangdl Mar 3, 2026
41512e9
Fix additional merge conflicts: AveragePool bindings and duplicate de…
runwangdl Mar 3, 2026
b4a90a0
Fix duplicate LayerNormGradParser: keep 5-input TrainingPlatform version
runwangdl Mar 3, 2026
a495d3e
Wrong Free of aliased_input
runwangdl Mar 3, 2026
bf837e3
RISCV-SUMMIT Demo
runwangdl Mar 3, 2026
c2f14b2
LATEST DEMO for RISCV SUBMIT
runwangdl Mar 3, 2026
40179c2
Add training pytest
runwangdl Mar 4, 2026
b0b9c10
Pass tiled "python deeployTrainingRunner_tiled_siracusa.py \
runwangdl Mar 4, 2026
71c36ac
Pass "deeployTrainingRunner_tiled_siracusa.py -t Tests/Models/SmallTr…
runwangdl Mar 4, 2026
50ab6ff
Temporary change for sleepvit BP and Add tile traning pytest
runwangdl Mar 4, 2026
fee90a2
Reafactoring training operators
runwangdl Mar 10, 2026
022b087
Add MaxPoolGrad operator support for PULPOpen platform
runwangdl Mar 11, 2026
528a8b1
Update grad kernels
runwangdl Mar 11, 2026
4d297bb
Update Conv Bias for Train Platform
runwangdl Mar 11, 2026
7b04a97
Update pulp-trainlib submodule: ConvGrad padding/stride support
runwangdl Mar 12, 2026
8f194d2
generateTrainingNetwork: auto-infer n_accum from inputs.npz
runwangdl Mar 12, 2026
a0188d0
Add batchnormgrad, globalaveragepoolgrad
runwangdl Mar 12, 2026
5ab0fe4
Add MSELoss/Grad op, ConvGradB tiling, MaxPoolGrad ORT fix, autoencod…
runwangdl Mar 12, 2026
0229dfa
Transferring TrainingPlatform to Gap9
runwangdl Mar 16, 2026
febd2af
Clean redundant files
runwangdl Mar 18, 2026
709a4fd
Add training off cmake config
runwangdl Mar 18, 2026
ce267c7
Clean redundant pytest files for training
runwangdl Mar 18, 2026
b0d468a
Orgnaize all kernels onnx and npz
runwangdl Mar 18, 2026
ec668f3
Remove groupnorm
runwangdl Mar 18, 2026
7ba871c
Move gap9 binary to /bin
runwangdl Mar 18, 2026
f6b1a3b
Add transpose node to cnngrad weight hwc layout change
runwangdl Mar 18, 2026
8a2e8fd
Add back deeploytrainingrunner
runwangdl Mar 18, 2026
983ecca
Revert "Move gap9 binary to /bin"
runwangdl Mar 18, 2026
8fb22e2
Siracusa cmake path fix
runwangdl Mar 18, 2026
ef32057
statci tileID not initiated across trainingstep
runwangdl Mar 19, 2026
9856417
Extend training platform to L3 (defaultMemLevel=L3) support
runwangdl Mar 19, 2026
42f59ce
Fix bugs for slice and L3 test
runwangdl Mar 19, 2026
04c1760
Decrease Maxpool second output
runwangdl Mar 23, 2026
82af3d9
Remove unnecessay change of generic platform
runwangdl Mar 23, 2026
6cabbd1
remove training from pytest
runwangdl Mar 23, 2026
8db9601
Remove unnecessay kernel tests
runwangdl Mar 23, 2026
9e9df9b
revert gemmtileconstraint
runwangdl Mar 23, 2026
9767ca6
revert unneccessary changes
runwangdl Mar 23, 2026
1373d1d
Merge globaleveragepool header to avgpool header
runwangdl Mar 23, 2026
8f3df5a
remove generic avgpool
runwangdl Mar 23, 2026
8e78519
Add trainingscheduler to keep input forever alive
runwangdl Mar 23, 2026
1706e79
revert memorylevelannotationpass
runwangdl Mar 23, 2026
26731e7
revert MchanDMA async for gap9
runwangdl Mar 23, 2026
28f7a5e
Move typechecker to generic
runwangdl Mar 23, 2026
fd8657a
Reset singlebuffering tilingcode generation and tiler extension
runwangdl Mar 23, 2026
57ecf11
Revert tileconstraint
runwangdl Mar 23, 2026
f8f9e50
revert deeploytypes
runwangdl Mar 23, 2026
ff758d6
delete redundant onnx
runwangdl Mar 23, 2026
290b5d4
revert tiler
runwangdl Mar 23, 2026
75551d7
Merge branch 'devel' into TrainingPlatform
runwangdl Mar 23, 2026
c2d833d
revert tilingvariablereplacement
runwangdl Mar 24, 2026
8aada09
Update new inplace accumulator
runwangdl Mar 24, 2026
5cb6dde
Add tiling for optimizer
runwangdl Mar 24, 2026
138f119
clean execution.py
runwangdl Mar 24, 2026
24d8316
ALiased input liveness wrong
runwangdl Mar 24, 2026
a3ac629
Aligned Size for L3 ram copy
runwangdl Mar 24, 2026
cfc14b7
Share memoryarena between trainingnetwork and optimziernetwork
runwangdl Mar 24, 2026
4dc036a
Add plotmemalloc for training
runwangdl Mar 24, 2026
9f8c89c
optimizer tiling with defaultmemory level option
runwangdl Mar 24, 2026
f6076fc
UPdate aliased workaround for gradient accumulation buffer
runwangdl Mar 25, 2026
e693efe
Fix minimalloc for alias tensor
runwangdl Mar 25, 2026
358d158
Reset alias workaround but change tileextension minimalloc for alias …
runwangdl Mar 25, 2026
d260975
Correct tiling Bug
runwangdl Mar 25, 2026
ccdfdd2
Fix minimalloc alias fix for input with transpose
runwangdl Mar 25, 2026
218fcc1
Fix L3 DMA ram_coppy template of GAP9
runwangdl Mar 25, 2026
2a45d69
Add batchnorm splited kernels
runwangdl Mar 25, 2026
0d8226e
Share traingraph and optimizer graph L1 arena
runwangdl Mar 26, 2026
3 changes: 3 additions & 0 deletions .gitmodules
@@ -11,3 +11,6 @@
[submodule "CMSIS-NN"]
path = TargetLibraries/CMSIS/third_party/CMSIS-NN
url = https://github.com/ARM-software/CMSIS-NN.git
[submodule "pulp-trainlib"]
path = TargetLibraries/PULPOpen/third_party/pulp-trainlib
url = https://github.com/runwangdl/pulp-trainlib.git
@@ -229,13 +229,14 @@ def _NCHWtoNHWC_fun(graph: gs.Graph, match: Match, name: str, default_channels_f

if node.op in ["RequantizedConv", "Conv"]:
spatialDims = len(node.inputs[1].shape) - 2
elif node.op == "MaxPool":
elif node.op in ["MaxPool", "AveragePool", "AveragePoolGrad"]:
spatialDims = len(node.attrs["kernel_shape"])
elif node.op == "Pad":
spatialDims = 2 # Hack based on current status
else:
raise ValueError(f"Cannot determine spatialDims for node {node.name} with operator {node.op}")

# Insert Transpose nodes around the op's activation input/output to convert the data layout.
permuteIn = _transformLayoutPermutation(len(tensorIn.shape), spatialDims, default_channels_first)
graph.nodes.append(_appendTranspose(tensorIn, node, permuteIn))

@@ -245,7 +246,14 @@
if node.op in ["Conv", "RequantizedConv"]:
# In the case of Conv: [weights, opt. bias], RequantizedConv: [weights, mul, add, opt. shift]
for tensor in node.inputs[1:]:
_transformLayoutConst(tensor, spatialDims, default_channels_first)
if isinstance(tensor, gs.Constant):
# Inference graph: weight is a fixed constant — permute its data in-place.
_transformLayoutConst(tensor, spatialDims, default_channels_first)
elif isinstance(tensor, gs.Variable) and tensor.shape is not None and len(tensor.shape) >= 2:
# Training graph: weight is a Variable (updated by the optimizer) — cannot permute
# in-place, so insert an explicit Transpose node that will run at inference/forward time.
perm = _transformLayoutPermutation(len(tensor.shape), spatialDims, default_channels_first)
graph.nodes.append(_appendTranspose(tensor, node, perm))

node.attrs["channels_first"] = default_channels_first

@@ -261,6 +269,24 @@ def __init__(self, default_channels_first: bool = True):
super().__init__(graph, partial(_NCHWtoNHWC_fun, default_channels_first = default_channels_first), name)


@contextagnostic
class NCHWtoNHWCAveragePoolPass(ReplaceSequentialPatternPass):

def __init__(self, default_channels_first: bool = True):
graph = _singleNodePattern(op = "AveragePool")
name = "_NCHW_TO_NHWC_AVERAGEPOOL_PASS"
super().__init__(graph, partial(_NCHWtoNHWC_fun, default_channels_first = default_channels_first), name)


@contextagnostic
class NCHWtoNHWCAveragePoolGradPass(ReplaceSequentialPatternPass):

def __init__(self, default_channels_first: bool = True):
graph = _singleNodePattern(op = "AveragePoolGrad")
name = "_NCHW_TO_NHWC_AVERAGEPOOLGRAD_PASS"
super().__init__(graph, partial(_NCHWtoNHWC_fun, default_channels_first = default_channels_first), name)


@contextagnostic
class NCHWtoNHWCConvPass(ReplaceSequentialPatternPass):

@@ -363,6 +389,8 @@ def __init__(self, default_channels_first: bool = True):
passes = [
NCHWtoNHWCPadPass(default_channels_first),
NCHWtoNHWCMaxPoolPass(default_channels_first),
NCHWtoNHWCAveragePoolPass(default_channels_first),
NCHWtoNHWCAveragePoolGradPass(default_channels_first),
NCHWtoNHWCDwConvPass(default_channels_first),
NCHWtoNHWCConvPass(default_channels_first),
]
@@ -376,6 +404,8 @@ def __init__(self, default_channels_first: bool = True):
passes = [
NCHWtoNHWCPadPass(default_channels_first),
NCHWtoNHWCMaxPoolPass(default_channels_first),
NCHWtoNHWCAveragePoolPass(default_channels_first),
NCHWtoNHWCAveragePoolGradPass(default_channels_first),
PULPNCHWtoNHWCDwConvPass(default_channels_first),
NCHWtoNHWCConvPass(default_channels_first),
]
@@ -533,8 +563,10 @@ def _remove_only_singleton_reduce_mean(graph: gs.Graph, match: Match, name: str)
# Delete node if only reduction over singleton dimensions
if 'axis' in node.attrs:
axis = node.attrs['axis']
else:
elif len(node.inputs) > 1 and node.inputs[1] is not None and hasattr(node.inputs[1], 'values') and node.inputs[1].values is not None:
axis = node.inputs[1].values
else:
return graph # axis unknown, skip

# Check if shape information is available
if node.inputs[0].shape is not None and all(node.inputs[0].shape[ax] == 1 for ax in axis):
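The layout pass in this file treats constant weights (permute the data in place at compile time) differently from trainable `Variable` weights (insert an explicit Transpose node that runs in the forward pass). The permutation both paths rely on can be sketched as below; `nchw_to_nhwc_perm` is a hypothetical standalone helper for illustration, not the repository's `_transformLayoutPermutation` itself:

```python
import numpy as np

def nchw_to_nhwc_perm(spatial_dims: int) -> list:
    # Move the channel axis (index 1) behind the spatial axes:
    # NCHW -> NHWC for 2 spatial dims, NCDHW -> NDHWC for 3, and so on.
    return [0] + list(range(2, 2 + spatial_dims)) + [1]

# A constant conv weight can be permuted in place at compile time ...
w = np.zeros((8, 3, 5, 5), dtype = np.float32)    # (out_ch, in_ch, kH, kW)
w_perm = np.transpose(w, nchw_to_nhwc_perm(2))    # (out_ch, kH, kW, in_ch)

# ... while a trainable weight keeps its stored layout, because the optimizer
# updates it between steps; instead, a Transpose node with the same
# permutation is added to the graph.
```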
4 changes: 2 additions & 2 deletions Deeploy/DeeployTypes.py
@@ -336,14 +336,14 @@ def has_live_aliases(self, ctxt: NetworkContext) -> bool:
True if this VariableBuffer has any live aliases, False otherwise
"""
# Do a breadth-first search across the aliasing double-linked list
live = self._live
live = self._live or self.is_input or self.is_output
queue = set(self.aliases)
visited = set(self.name)
while len(queue) > 0:
next = queue.pop()
buffNext = ctxt.lookup(next)
assert isinstance(buffNext, VariableBuffer)
live |= buffNext._live
live |= buffNext._live or buffNext.is_input or buffNext.is_output
visited.add(next)
queue |= buffNext.aliases - visited
return live
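The `has_live_aliases` change above widens liveness so that graph inputs and outputs always count as live, preventing their storage from being recycled while aliases still point at them. A minimal standalone sketch of the same breadth-first walk over the aliasing links (hypothetical `Buffer` class; in the repository this is a method on `VariableBuffer`):

```python
from dataclasses import dataclass, field

@dataclass
class Buffer:
    name: str
    live: bool = False
    is_input: bool = False
    is_output: bool = False
    aliases: set = field(default_factory = set)

def has_live_aliases(ctxt: dict, name: str) -> bool:
    """True if the buffer, or any buffer reachable through the aliasing
    links, is live. I/O buffers count as live unconditionally: the caller
    owns them, so their memory must never be reused."""
    start = ctxt[name]
    live = start.live or start.is_input or start.is_output
    queue = set(start.aliases)
    visited = {name}
    while queue:
        buff = ctxt[queue.pop()]
        live |= buff.live or buff.is_input or buff.is_output
        visited.add(buff.name)
        queue |= buff.aliases - visited
    return live
```

With this rule, a dead intermediate that aliases a graph output is still reported as live, which is exactly the case the fix targets.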
223 changes: 213 additions & 10 deletions Deeploy/Targets/GAP9/Bindings.py
@@ -23,25 +23,33 @@
# Import templates from PULPOpen and Generic
from Deeploy.Targets.Generic.Templates import AddTemplate, ConcatTemplate, DequantTemplate, FloatReduceMeanTemplate, \
FloatReduceSumTemplate, GatherTemplate, QuantTemplate, RQSiGELUTemplate, SliceTemplate, iHardswishTemplate
from Deeploy.Targets.Generic.TypeCheckers import AddChecker, ConcatChecker, ConvChecker, DequantChecker, \
GatherChecker, GELUChecker, GEMMChecker, HardswishChecker, LayerNormChecker, MatMulChecker, MulChecker, \
QuantChecker, ReduceMeanChecker, ReluChecker, ReshapeChecker, RQAddChecker, RQHardswishChecker, SGDChecker, \
SliceChecker, SoftmaxChecker, SoftmaxCrossEntropyLossChecker, TransposeChecker
from Deeploy.Targets.Generic.TypeCheckers import AddChecker, BatchNormInternalChecker, \
BatchNormalizationGradChecker, BNGradNormalizeChecker, BNGradReduceChecker, \
ChannelNormalizeChecker, ConcatChecker, ConvChecker, DequantChecker, \
WelfordReduceChecker, \
GatherChecker, GELUChecker, GEMMChecker, GlobalAveragePoolChecker, GlobalAveragePoolGradChecker, \
HardswishChecker, InPlaceAccumulatorV2Checker, LayerNormChecker, MatMulChecker, MaxPoolGradChecker, MulChecker, \
MSELossChecker, QuantChecker, ReduceMeanChecker, ReluChecker, ReshapeChecker, RQAddChecker, RQHardswishChecker, \
SGDChecker, SliceChecker, SoftmaxChecker, SoftmaxCrossEntropyLossChecker, TransposeChecker, \
PULPConvGradBChecker
from Deeploy.Targets.PULPOpen.Bindings import ForkClosure, L3MemoryAwareFunctionCallClosure, \
MemoryAwareForkTransformer, MemoryAwareFunctionCallClosure, TilingCallClosure
from Deeploy.Targets.PULPOpen.CodeTransformationPasses.PULPClusterSynch import PULPSynchCoresPass
from Deeploy.Targets.PULPOpen.CodeTransformationPasses.PULPClusterTiling import PULPClusterTiling
from Deeploy.Targets.PULPOpen.CodeTransformationPasses.PULPL3Tiling import PULPL3Tiling
from Deeploy.Targets.PULPOpen.CodeTransformationPasses.PULPProfileUntiled import PULPProfileUntiled
from Deeploy.Targets.PULPOpen.DataTypes import PULPDMAFuture
from Deeploy.Targets.PULPOpen.Templates import ConvTemplate, DMASliceTemplate, FloatAddTemplate, FloatConvTemplate, \
FloatGELUTemplate, FloatGemmTemplate, FloatLayernormTemplate, FloatMatMulTemplate, FloatMaxPoolTemplate, \
FloatMulTemplate, FloatReluTemplate, FloatSoftmaxTemplate, GEMMTemplate, MatrixVectorTemplate, MaxPoolTemplate, \
MulTemplate, ReduceMeanTemplate, RequantShiftTemplate, ReshapeTemplate, RQAddTemplate, RQSiHardswishTemplate, \
from Deeploy.Targets.PULPOpen.Templates import ConvTemplate, DMASliceTemplate, FloatAddTemplate, \
FloatAveragePoolTemplate, FloatBatchNormTemplate, FloatConvGradTemplate, FloatConvTemplate, \
FloatGELUTemplate, FloatGemmTemplate, FloatGlobalAveragePoolTemplate, \
FloatInPlaceAccumulatorV2Template, FloatLayernormTemplate, FloatMatMulTemplate, \
FloatMaxPoolTemplate, FloatMulTemplate, FloatReluTemplate, FloatSoftmaxTemplate, GEMMTemplate, \
MatrixVectorTemplate, MaxPoolTemplate, MSELossTemplate, MulTemplate, ReduceMeanTemplate, \
RequantShiftTemplate, ReshapeTemplate, RQAddTemplate, RQSiHardswishTemplate, \
SGDTemplate, SoftmaxCrossEntropyLossTemplate, TallGEMMTemplate, TransposeTemplate, UniformRequantShiftTemplate, \
iRMSNormTemplate, iSoftmaxTemplate
from Deeploy.Targets.PULPOpen.TypeCheckers import PULPConvChecker, PULPLinearChecker, PULPMaxPoolChecker, \
PULPRequantShiftChecker
from Deeploy.Targets.PULPOpen.TypeCheckers import PULPConvChecker, PULPLinearChecker, \
PULPMaxPoolChecker, PULPRequantShiftChecker
from Deeploy.TilingExtension.CodeTransformationPasses.TilingVariableReplacement import TilingVariableReplacement, \
TilingVariableReplacementUpdate

@@ -306,6 +314,14 @@
SoftmaxCrossEntropyLossTemplate.referenceTemplate, GAP9Transformer) for type in IntegerDataTypes
]

# Dual-output binding: outputs[0]=loss (scalar), outputs[1]=log_prob
GAP9SoftmaxCrossEntropyLossDualOutputBindings = [
NodeBinding(
SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(type)],
[PointerClass(float32_t), PointerClass(float32_t)]),
SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, GAP9Transformer) for type in IntegerDataTypes
]

GAP9SoftmaxCrossEntropyLossGradBindings = [
NodeBinding(
SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(type)], [PointerClass(float32_t)]),
@@ -317,6 +333,179 @@
SGDTemplate.referenceTemplate, GAP9Transformer)
]

# ── Training / Gradient bindings ─────────────────────────────────────────

GAP9ReluGradBinding = NodeBinding(
ReluChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
FloatReluTemplate.referenceGradTemplate, GAP9Transformer)

GAP9FloatGELUGradBinding = NodeBinding(
GELUChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
FloatGELUTemplate.referenceGradTemplate, GAP9Transformer)

GAP9LayernormGradBinding = NodeBinding(
LayerNormChecker(
[PointerClass(float32_t),
PointerClass(float32_t),
PointerClass(float32_t),
PointerClass(float32_t),
PointerClass(float32_t)],
[PointerClass(float32_t),
PointerClass(float32_t),
PointerClass(float32_t)]), FloatLayernormTemplate.referenceGradTemplate,
GAP9Transformer)

GAP9FloatConvGradW2DBindings = [
NodeBinding(
ConvChecker([PointerClass(float32_t), PointerClass(float32_t)],
[PointerClass(float32_t)]), FloatConvGradTemplate.referenceConvGradW2DIm2ColTemplate,
GAP9ClusterTransformer)
]

GAP9FloatConvGradX2DBindings = [
NodeBinding(
ConvChecker([PointerClass(float32_t), PointerClass(float32_t)],
[PointerClass(float32_t)]), FloatConvGradTemplate.referenceConvGradX2DIm2ColTiledTemplate,
GAP9Transformer)
]

GAP9FloatDWConvGradX2DBindings = [
NodeBinding(
ConvChecker([PointerClass(float32_t), PointerClass(float32_t)],
[PointerClass(float32_t)]), FloatConvGradTemplate.referenceDWConvGradX2DTiledTemplate,
GAP9Transformer)
]

GAP9FloatDWConvGradW2DBindings = [
NodeBinding(
ConvChecker([PointerClass(float32_t), PointerClass(float32_t)],
[PointerClass(float32_t)]), FloatConvGradTemplate.referenceDWConvGradW2DTemplate,
GAP9ClusterTransformer)
]

GAP9FloatPWConvGradW2DBindings = [
NodeBinding(
ConvChecker([PointerClass(float32_t), PointerClass(float32_t)],
[PointerClass(float32_t)]), FloatConvGradTemplate.referencePWConvGradW2DTemplate,
GAP9ClusterTransformer)
]

GAP9FloatPWConvGradX2DBindings = [
NodeBinding(
ConvChecker([PointerClass(float32_t), PointerClass(float32_t)],
[PointerClass(float32_t)]), FloatConvGradTemplate.referencePWConvGradX2DTemplate,
GAP9ClusterTransformer)
]

GAP9FloatConvGradBBindings = [
NodeBinding(
PULPConvGradBChecker([PointerClass(float32_t)],
[PointerClass(float32_t)]), FloatConvGradTemplate.referenceConvGradB2DTemplate,
GAP9ClusterTransformer)
]

GAP9MaxPoolGrad2DBindings = [
NodeBinding(
MaxPoolGradChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
FloatMaxPoolTemplate.referenceGradTemplate, GAP9Transformer)
]

GAP9AveragePool2DBindings = [
NodeBinding(PULPMaxPoolChecker([PointerClass(float32_t)], [PointerClass(float32_t)]),
FloatAveragePoolTemplate.referenceTemplate, GAP9Transformer)
]

GAP9AveragePoolGrad2DBindings = [
NodeBinding(PULPMaxPoolChecker([PointerClass(float32_t)], [PointerClass(float32_t)]),
FloatAveragePoolTemplate.referenceGradTemplate, GAP9Transformer)
]

GAP9GlobalAveragePool2DBindings = [
NodeBinding(
GlobalAveragePoolChecker([PointerClass(float32_t)], [PointerClass(float32_t)]),
FloatGlobalAveragePoolTemplate.globalAveragePoolTemplate,
GAP9Transformer)
]

GAP9GlobalAveragePoolGrad2DBindings = [
NodeBinding(
GlobalAveragePoolGradChecker([PointerClass(float32_t)], [PointerClass(float32_t)]),
FloatGlobalAveragePoolTemplate.globalAveragePoolGradTemplate,
GAP9Transformer)
]

GAP9MSELossBindings = [
NodeBinding(MSELossChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
MSELossTemplate.referenceTemplate, GAP9Transformer)
]

GAP9MSELossGradBindings = [
NodeBinding(MSELossChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
MSELossTemplate.referenceGradientTemplate, GAP9Transformer)
]

GAP9InPlaceAccumulatorV2Bindings = [
NodeBinding(
InPlaceAccumulatorV2Checker(
[PointerClass(float32_t), PointerClass(float32_t), PointerClass(uint8_t)], [PointerClass(float32_t)]),
FloatInPlaceAccumulatorV2Template.referenceTemplate, GAP9Transformer)
]

GAP9InPlaceAccumulatorV2TiledBindings = [
NodeBinding(
InPlaceAccumulatorV2Checker(
[PointerClass(float32_t), PointerClass(float32_t), PointerClass(uint8_t)], [PointerClass(float32_t)]),
FloatInPlaceAccumulatorV2Template.tiledReferenceTemplate, GAP9Transformer)
]

GAP9BatchNormInternalBindings = [
NodeBinding(
BatchNormInternalChecker(
[PointerClass(float32_t)] * 5,
[PointerClass(float32_t)] * 5), FloatBatchNormTemplate.batchNormInternalTemplate,
GAP9Transformer)
]

GAP9BatchNormalizationGradBindings = [
NodeBinding(
BatchNormalizationGradChecker(
[PointerClass(float32_t)] * 5,
[PointerClass(float32_t)] * 3), FloatBatchNormTemplate.batchNormGradTemplate,
GAP9Transformer)
]

GAP9WelfordReduceBindings = [
NodeBinding(
WelfordReduceChecker(
[PointerClass(float32_t)] * 1,
[PointerClass(float32_t)] * 2), FloatBatchNormTemplate.welfordReduceTemplate,
GAP9Transformer)
]

GAP9ChannelNormalizeBindings = [
NodeBinding(
ChannelNormalizeChecker(
[PointerClass(float32_t)] * 5,
[PointerClass(float32_t)] * 1), FloatBatchNormTemplate.channelNormalizeTemplate,
GAP9Transformer)
]

GAP9BNGradReduceBindings = [
NodeBinding(
BNGradReduceChecker(
[PointerClass(float32_t)] * 4,
[PointerClass(float32_t)] * 2), FloatBatchNormTemplate.bnGradReduceTemplate,
GAP9Transformer)
]

GAP9BNGradNormalizeBindings = [
NodeBinding(
BNGradNormalizeChecker(
[PointerClass(float32_t)] * 7,
[PointerClass(float32_t)] * 1), FloatBatchNormTemplate.bnGradNormalizeTemplate,
GAP9Transformer)
]

GAP9TransposeBindings = [
NodeBinding(TransposeChecker([PointerClass(type)], [PointerClass(type)]), TransposeTemplate.referenceTemplate,
GAP9Transformer) for type in IntegerDataTypes
@@ -328,6 +517,9 @@
GAP9ConcatBindings = [
NodeBinding(ConcatChecker([PointerClass(type), PointerClass(type)], [PointerClass(type)]),
ConcatTemplate.referenceTemplate, GAP9ClusterTransformer) for type in IntegerDataTypes
] + [
NodeBinding(ConcatChecker([PointerClass(float_type), PointerClass(float_type)], [PointerClass(float_type)]),
ConcatTemplate.referenceTemplate, GAP9ClusterTransformer) for float_type in FloatDataTypes
]

GAP9iRMSNormBindings = [
Expand Down Expand Up @@ -370,7 +562,18 @@
GAP9ReluBinding = NodeBinding(ReluChecker([PointerClass(float32_t)], [PointerClass(float32_t)]),
FloatReluTemplate.referenceTemplate, GAP9Transformer)

# Forward LayerNorm: 3 inputs (data, weight, bias), 3 outputs (Y, mean_stash, inv_std_stash)
# The 3-output version is needed for training (backward pass needs mean and inv_std stashes).
GAP9LayernormBinding = NodeBinding(
LayerNormChecker(
[PointerClass(float32_t), PointerClass(float32_t),
PointerClass(float32_t)],
[PointerClass(float32_t), PointerClass(float32_t),
PointerClass(float32_t)]), FloatLayernormTemplate.referenceTemplate,
GAP9Transformer)

# Inference-only LayerNorm: 3 inputs, 1 output (Y only, no stashes)
GAP9LayernormInferenceBinding = NodeBinding(
LayerNormChecker(
[PointerClass(float32_t), PointerClass(float32_t),
PointerClass(float32_t)], [PointerClass(float32_t)]), FloatLayernormTemplate.referenceTemplate,
2 changes: 1 addition & 1 deletion Deeploy/Targets/GAP9/DMA/L3Dma.py
@@ -29,7 +29,7 @@ class GAP9L3Dma(AsyncDma):
_transferTemplates = {
2:
NodeTemplate(
"pi_cl_ram_copy_2d(get_ram_ptr(), ${ext}, ${loc}, ${transfer_size}, ${stride}, ${length}, ${ext2loc}, &${future});"
"pi_cl_ram_copy_2d(get_ram_ptr(), (uint32_t)${ext}, ${loc}, ${transfer_size}, ${stride}, ${length}, ${ext2loc}, &${future});"
)
}
_waitingStrategy = PerTensorWaitingStrategy(GAP9L3DmaFuture)