An open-source, MLIR-based machine-learning compiler for TPUs.
English · 简体中文 · Quick Start · Docs · Issues
TPU-MLIR converts pre-trained neural networks from mainstream frameworks into bmodel files that run efficiently on TPUs. Built on top of MLIR, it provides a unified IR, a clean lowering pipeline, and a rich set of tools for quantization, calibration, and deployment.
```
┌───────────────────┐  model_transform.py  ┌─────────┐  model_deploy.py   ┌──────────┐
│ ONNX / PyTorch /  │ ───────────────────► │  MLIR   │ ─────────────────► │  bmodel  │
│ TFLite / Caffe /  │  (front-end import)  │ (TOP →  │ (lowering, quant,  │  on TPU  │
│ HuggingFace       │                      │  TPU)   │  layer-group, …)   │          │
└───────────────────┘                      └─────────┘                    └──────────┘
```
- Multi-framework front-ends — PyTorch, ONNX, TFLite, Caffe (other frameworks via ONNX).
- LLM-ready — one-shot conversion of HuggingFace LLMs (Qwen, MiniCPM-V, …) via `llm_convert.py`.
- Full quantization toolchain — F32 / BF16 / F16 / INT8 (symmetric & asymmetric), AWQ / GPTQ / AutoRound passthrough, calibration, QAT.
- MLIR-based pipeline — clean dialects (Top / Tpu), pattern rewrites, layer-group memory planning.
- Production tooling — `model_runner`, `model_tool`, accuracy validation, simulator, visualizer.
- Bilingual docs & active community — English / 中文 manuals, papers, and video tutorials.
- Installation
- Quick Start (LLM — Qwen)
- Quick Start (Vision — YOLOv5)
- Auxiliary Tools
- Resources
- Citation
- Contributing
- License
## Installation

TPU-MLIR runs inside a prebuilt Docker image. After the container is running you can either install the Python wheel or build from source.
```bash
docker pull sophgo/tpuc_dev:latest
```

If the pull fails, download the tarball and load it manually:
```bash
wget https://sophon-assets.sophon.cn/sophon-prod-s3/drive/25/04/15/16/tpuc_dev_v3.4.tar.gz
docker load -i tpuc_dev_v3.4.tar.gz
```

Create and enter the container:
```bash
docker run --privileged --name tpu-mlir -v $PWD:/workspace -it sophgo/tpuc_dev:latest
```

To install the Python wheel (requires Python ≥ 3.10 on Ubuntu 22.04, already satisfied inside the Docker image):

```bash
pip install tpu_mlir
```

To build from source instead:

```bash
cd /workspace/tpu-mlir
pip install -r requirements.txt
source ./envsetup.sh
./build.sh
```

## Quick Start (LLM — Qwen)

Convert and run a HuggingFace LLM (here: Qwen) on a TPU.
A pre-quantized AWQ / GPTQ / AutoRound build is recommended.
```bash
git lfs install
git clone https://huggingface.co/Intel/Qwen3.5-2B-int4-AutoRound
```

```bash
# If you encounter transformers/torch version issues:
# pip3 install transformers torchvision -U

# --max_input_length sets the max prefill length; if omitted it defaults to -s.
llm_convert.py \
  -m /workspace/Qwen3.5-2B-int4-AutoRound \
  --max_input_length 1024 \
  -s 2048 \
  -c bm1684x \
  --max_pixels 768,768 \
  -o qwen3.5_2b
```

Main arguments of `llm_convert.py`:
| Parameter | Short | Required | Description |
|---|---|---|---|
| `model_path` | `-m` | ✅ | Path to the model weights |
| `seq_length` | `-s` | ✅ | Maximum sequence length |
| `max_input_length` | — | — | Maximum input length; defaults to `seq_length` (`-s`) when omitted |
| `quantize` | `-q` | ✅ | Quantization type: `auto` / `w4bf16` / `w4f16` / `bf16` / `f16`, etc. (omit if the source model is already quantized) |
| `q_group_size` | `-g` | — | Group size for quantization (default 64) |
| `chip` | `-c` | ✅ | Target platform: `bm1684x` / `bm1688` / `cv186ah` |
| `max_pixels` | — | — | Multi-modal max resolution `width,height`. Defaults vary by `model_type` (`qwen2_5_vl`: 672,896; `minicpmv`: 980,980; otherwise 768,768) |
| `out_dir` | `-o` | ✅ | Output directory |
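If the source model is not pre-quantized, pass `--quantize` (and optionally `--q_group_size`) explicitly. A minimal sketch using only the arguments documented above; the checkpoint path is a hypothetical placeholder:

```bash
# Sketch: convert an unquantized checkpoint to 4-bit weights with BF16 activations.
llm_convert.py \
  -m /workspace/Qwen2.5-1.5B-Instruct \
  -s 2048 \
  -q w4bf16 \
  -g 64 \
  -c bm1684x \
  -o qwen2.5_1.5b
```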
Copy the `python_demo` folder onto your device and build it:
```bash
mkdir build && cd build
cmake ..
make
cp *cpython*.so ..
cd ..
```

Then run the bmodel:
```bash
python3 pipeline.py -m xxxx.bmodel -c config
```
## Quick Start (Vision — YOLOv5)

Compile and run `yolov5s.onnx` on the BM1684X TPU. The model is bundled at `regression/model/yolov5s.onnx`.
1. Prepare the working directory
```bash
mkdir model_yolov5s && cd model_yolov5s
cp ${REGRESSION_PATH}/model/yolov5s.onnx .
cp -rf ${REGRESSION_PATH}/dataset/COCO2017 .
cp -rf ${REGRESSION_PATH}/image .
mkdir workspace && cd workspace
```

2. Convert the model to MLIR
If the model takes images as input, the preprocessing must be specified. The preprocessing formula is:

`y = (x − mean) × scale`

where x is the raw pixel value. YOLOv5's official input is RGB scaled by 1/255, so mean = 0,0,0 and scale = 0.0039216,0.0039216,0.0039216 (0.0039216 ≈ 1/255; e.g. a raw pixel value of 128 maps to (128 − 0) × 0.0039216 ≈ 0.502).
```bash
model_transform.py \
  --model_name yolov5s \
  --model_def ../yolov5s.onnx \
  --input_shapes [[1,3,640,640]] \
  --mean 0.0,0.0,0.0 \
  --scale 0.0039216,0.0039216,0.0039216 \
  --keep_aspect_ratio \
  --pixel_format rgb \
  --output_names 350,498,646 \
  --test_input ../image/dog.jpg \
  --test_result yolov5s_top_outputs.npz \
  --mlir yolov5s.mlir
```

Main arguments of `model_transform.py`:
| Argument | Required | Description |
|---|---|---|
| `model_name` | ✅ | Model name |
| `model_def` | ✅ | Model definition file (`.onnx`, `.pt`, `.tflite`, `.prototxt`) |
| `model_data` | — | Caffe weight file (`.caffemodel`) |
| `input_shapes` | — | Input shape, e.g. `[[1,3,640,640]]`; supports multiple inputs |
| `resize_dims` | — | Image resize size before feeding into the model |
| `keep_aspect_ratio` | — | Keep aspect ratio (pads with 0). Off by default |
| `mean` | — | Per-channel mean (default 0,0,0) |
| `scale` | — | Per-channel scale (default 1,1,1) |
| `pixel_format` | — | `rgb` / `bgr` / `gray` / `rgbd` |
| `output_names` | — | Output tensor names. Defaults to model outputs |
| `test_input` | — | Validation input (image / npy / npz). Skipped if not specified |
| `test_result` | — | Output file for validation |
| `excepts` | — | Comma-separated list of layers excluded from validation |
| `debug` | — | Keep intermediate files |
| `mlir` | ✅ | Output MLIR file path |
A `${model_name}_in_f32.npz` file containing the preprocessed input (here `yolov5s_in_f32.npz`) is generated after this step.
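To sanity-check the preprocessed tensor before deploying, you can list its contents; a minimal sketch assuming numpy is available inside the container:

```bash
python3 -c "
import numpy as np
d = np.load('yolov5s_in_f32.npz')
for name in d.files:
    print(name, d[name].shape, d[name].dtype)  # expect (1, 3, 640, 640) float32
"
```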
3. MLIR → F16 bmodel
```bash
model_deploy.py \
  --mlir yolov5s.mlir \
  --quantize F16 \
  --processor bm1684x \
  --test_input yolov5s_in_f32.npz \
  --test_reference yolov5s_top_outputs.npz \
  --model yolov5s_1684x_f16.bmodel
```

Main arguments of `model_deploy.py`:
| Argument | Required | Description |
|---|---|---|
| `mlir` | ✅ | Input MLIR file |
| `quantize` | ✅ | `F32` / `BF16` / `F16` / `INT8` |
| `processor` | ✅ | Target chip |
| `calibration_table` | — | Calibration table (required for INT8) |
| `tolerance` | — | Min similarity between MLIR-quantized and MLIR-fp32 inference |
| `correctness` | — | Min similarity between simulator and MLIR-quantized inference (default 0.99,0.90) |
| `excepts` | — | Comma-separated layers excluded from validation |
| `debug` | — | Keep intermediate files |
| `model` | ✅ | Output bmodel path |
| `dynamic` | — | Dynamic codegen for dynamic shapes |
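The same invocation pattern covers the other precisions; only `--quantize` changes. A minimal F32 build, for instance (a sketch; the optional test inputs are omitted here, so no similarity check is run):

```bash
# Sketch: full-precision bmodel from the same MLIR file.
model_deploy.py \
  --mlir yolov5s.mlir \
  --quantize F32 \
  --processor bm1684x \
  --model yolov5s_1684x_f32.bmodel
```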
4. MLIR → INT8 bmodel
Run calibration first (typically 100–1000 images). Prefer symmetric quantization unless accuracy demands asymmetric.
```bash
run_calibration.py yolov5s.mlir \
  --dataset ../COCO2017 \
  --input_num 100 \
  -o yolov5s_cali_table
```

```bash
model_deploy.py \
  --mlir yolov5s.mlir \
  --quantize INT8 \
  --calibration_table yolov5s_cali_table \
  --processor bm1684x \
  --test_input yolov5s_in_f32.npz \
  --test_reference yolov5s_top_outputs.npz \
  --tolerance 0.85,0.45 \
  --model yolov5s_1684x_int8.bmodel
```

5. Verify the results
The sample script lives at `python/samples/detect_yolov5.py`.
```bash
# ONNX
detect_yolov5.py --input ../image/dog.jpg --model ../yolov5s.onnx --output dog_origin.jpg
# F16 bmodel
detect_yolov5.py --input ../image/dog.jpg --model yolov5s_1684x_f16.bmodel --output dog_f16.jpg
# INT8 bmodel
detect_yolov5.py --input ../image/dog.jpg --model yolov5s_1684x_int8.bmodel --output dog_int8.jpg
```

Compare `dog_origin.jpg`, `dog_f16.jpg`, and `dog_int8.jpg`; the detections should be nearly identical across the three.
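For a numeric check rather than a visual one, you can run the bmodel on the preprocessed input with `model_runner.py` (documented under Auxiliary Tools below) and compare its tensors against the reference dumped by `model_transform.py`. A sketch, assuming the bmodel keeps the same output names as the reference npz and that numpy is available:

```bash
# Run the F16 bmodel on the preprocessed input.
model_runner.py \
  --input yolov5s_in_f32.npz \
  --model yolov5s_1684x_f16.bmodel \
  --output yolov5s_f16_outputs.npz
# Cosine similarity per shared output tensor.
python3 -c "
import numpy as np
ref = np.load('yolov5s_top_outputs.npz')
out = np.load('yolov5s_f16_outputs.npz')
for name in out.files:
    if name in ref.files:
        a, b = ref[name].ravel(), out[name].ravel()
        cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        print(name, 'cosine similarity:', round(cos, 4))
"
```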
## Auxiliary Tools

`model_runner.py` runs inference on bmodel / MLIR / PyTorch / ONNX / TFLite / Caffe models:
```bash
model_runner.py \
  --input resnet18_in_f32.npz \
  --model resnet18_1684x_f32.bmodel \
  --output resnet18_output.npz
```

`model_tool` inspects and manipulates bmodel files:
```
--info model_file                                  : show brief model info
--print model_file                                 : show detailed model info
--extract model_file                               : split a multi-net bmodel into single-net bmodels
--combine file1 .. fileN -o new_file               : merge bmodels by file path
--combine_dir dir1 .. dirN -o new_dir              : merge bmodels by directory
--dump model_file start_offset byte_size out_file  : dump raw bytes from a bmodel
```
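For example, to merge two single-net bmodels into one multi-net bmodel with `--combine` (a sketch; the file names are placeholders):

```bash
# Sketch: combine two bmodels produced by separate model_deploy.py runs.
model_tool --combine net_a.bmodel net_b.bmodel -o combined.bmodel
```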
To show brief info for a model:

```bash
model_tool --info resnet18_1684x_f32.bmodel
```

## Resources

| Type | Link |
|---|---|
| Paper | TPU-MLIR (arXiv 2210.15016) |
| Manual | Technical Reference Manual |
| Guide | Quick Start |
Video tutorials:
| # | Topic | Links |
|---|---|---|
| 01 | What is a Deep Learning Compiler? | Intro |
| 02 | MLIR Intro | Syntax 1 · Syntax 2 · Syntax 3 · Dialect Conversion · Pattern Rewriting |
| 03 | TPU-MLIR Intro | Overview · Front-end · Lowering |
| 04 | Quantization | Overview · Formula · Calibration · QAT |
| 05 | TPU Memory | Ep1 · Ep2 |
| 06 | TPU-MLIR Practice | To ONNX · Graph Optimization · Operator Support · Model Support · Fuse Preprocess · Accuracy Validation |
## Citation

If TPU-MLIR helps your research, please cite:
```bibtex
@misc{tpumlir2022,
  title         = {TPU-MLIR: A Compiler For TPU Using MLIR},
  author        = {Hu, Pengchao and Lu, Man and Wang, Lei and Jiang, Guoyue},
  year          = {2022},
  eprint        = {2210.15016},
  archivePrefix = {arXiv},
  primaryClass  = {cs.PL}
}
```

## Contributing

Bug reports, feature requests and pull requests are welcome! Before you start:
- Search existing issues to avoid duplicates.
- For non-trivial changes, open an issue first to discuss the design.
- Run the regression tests under `regression/` before sending a PR.
## License

This project is licensed under the terms of the LICENSE file in the root of this repository.


