
Add Olmo-3-7B-Instruct, SmolLM3-3B, and Mistral-NeMo-12B-Instruct recipes #391

Open

hanbitmyths wants to merge 2 commits into microsoft:main from hanbitmyths:sunghcho/add-olmo3-smollm3-mistral-nemo

Conversation

@hanbitmyths
Contributor

Adds Olive recipes for three new instruction-tuned LLMs, each with cpu / cuda / webgpu backends and a PyTorch baseline directory for MMLU comparison.

Models

  • allenai/Olmo-3-7B-Instruct (arch: olmo3)
  • HuggingFaceTB/SmolLM3-3B (arch: smollm3)
  • nvidia/Mistral-NeMo-12B-Instruct -> mistralai/Mistral-Nemo-Instruct-2407 (arch: mistral)

Per-backend layout (matches existing Olive recipe convention)

Each cpu/, cuda/, webgpu/ subfolder contains:

  • <model>_<backend>_fp32.json / _fp16.json (fp baseline export)
  • <model>_<backend>_int4.json (canonical INT4 export)
  • *_with_eval.json variants that add an MMLU evaluator
  • info.yaml, README.md, requirements.txt

A baseline/ folder per model carries the PyTorch+MMLU evaluation recipe used to compare INT4 vs FP baselines.
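
Concretely, the resulting tree for one model looks like this (paths taken from the file list reviewed in this PR; cuda/ and webgpu/ are abbreviated):

allenai-Olmo-3-7B-Instruct/
├── LICENSE
├── baseline/
│   ├── allenai-Olmo-3-7B-Instruct_pytorch_with_eval.json
│   ├── info.yaml
│   ├── README.md
│   └── requirements.txt
├── cpu/
│   ├── allenai-Olmo-3-7B-Instruct_cpu_fp32.json
│   ├── allenai-Olmo-3-7B-Instruct_cpu_fp32_with_eval.json
│   ├── allenai-Olmo-3-7B-Instruct_cpu_int4.json
│   ├── allenai-Olmo-3-7B-Instruct_cpu_int4_with_eval.json
│   ├── info.yaml
│   ├── README.md
│   └── requirements.txt
├── cuda/    (same set, with fp16 in place of fp32)
└── webgpu/  (same set, with fp16 in place of fp32)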

INT4 quantization pipeline

SelectiveMixedPrecision (kld_gradient for the 3B/7B models; k_quant_last for the 12B model on WebGPU so it fits in 80 GB) → GPTQRTN (lm_head and embeddings kept at 8-bit via overrides) → ModelBuilder (precision=int4). Group size is 128 for cpu/cuda and 32 for webgpu.
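
To make the chain concrete, here is a minimal sketch of what a cpu/cuda INT4 "passes" section would look like under this scheme. Pass names are taken from the description above; the "algorithm", "bits", "group_size", and "overrides" keys are illustrative assumptions, not copied from the actual recipe JSONs:

"passes": {
  "p": {
    "type": "SelectiveMixedPrecision",
    "algorithm": "kld_gradient"
  },
  "q": {
    "type": "GPTQRTN",
    "bits": 4,
    "group_size": 128,
    "overrides": {
      "lm_head": { "bits": 8 },
      "embed_tokens": { "bits": 8 }
    }
  },
  "m": {
    "type": "ModelBuilder",
    "precision": "int4"
  }
}

Per the description above, the webgpu variants would use group size 32, and the 12B-on-WebGPU recipe would swap kld_gradient for k_quant_last.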

Validation

All 9 INT4 artifacts (3 models × 3 backends) were rebuilt from clean caches and validated:

  • 8/9 pass the onnxruntime_genai repetition heuristic on Linux (cpu/cuda); the WebGPU EP is browser-only, so its repetition test is skipped on Linux.
  • 1 known regression: the SmolLM3 CUDA INT4 build produces a repeated 4-gram on one synthesis prompt (model-specific behavior under CUDA INT4; tracked separately).

Top-level Apache 2.0 LICENSE added per model, matching the Qwen3 recipe pattern.

Commit: …ipes

Adds Olive recipes for three new models with cpu/cuda/webgpu backends and a PyTorch baseline:

- allenai/Olmo-3-7B-Instruct
- HuggingFaceTB/SmolLM3-3B
- nvidia/Mistral-NeMo-12B-Instruct (mistralai/Mistral-Nemo-Instruct-2407)

Each backend dir includes fp32/fp16 and int4 configs (with and without MMLU eval).
Copilot AI review requested due to automatic review settings May 3, 2026 02:24

Copilot AI left a comment


Pull request overview

Adds Olive optimization + evaluation recipes for three new instruction-tuned LLMs (OLMo-3-7B-Instruct, SmolLM3-3B, and Mistral-NeMo-12B-Instruct) across CPU/CUDA/WebGPU backends, plus PyTorch baseline evaluation configs for MMLU comparison.

Changes:

  • Introduces per-backend (cpu/cuda/webgpu) FP baseline and INT4 export workflows, including _with_eval MMLU evaluator variants.
  • Adds per-model baseline PyTorch+MMLU evaluation recipes for accuracy comparison.
  • Adds per-model Apache-2.0 LICENSE files and per-backend documentation/requirements.

Reviewed changes

Copilot reviewed 78 out of 78 changed files in this pull request and generated 21 comments.

Per-file summary (file path followed by description):
nvidia-Mistral-NeMo-12B-Instruct/webgpu/requirements.txt WebGPU dependency set for Mistral-NeMo recipes
nvidia-Mistral-NeMo-12B-Instruct/webgpu/nvidia-Mistral-NeMo-12B-Instruct_webgpu_int4_with_eval.json WebGPU INT4 workflow with MMLU evaluator
nvidia-Mistral-NeMo-12B-Instruct/webgpu/nvidia-Mistral-NeMo-12B-Instruct_webgpu_int4.json WebGPU INT4 workflow (no artifacts)
nvidia-Mistral-NeMo-12B-Instruct/webgpu/nvidia-Mistral-NeMo-12B-Instruct_webgpu_fp16_with_eval.json WebGPU FP16 workflow with MMLU evaluator
nvidia-Mistral-NeMo-12B-Instruct/webgpu/nvidia-Mistral-NeMo-12B-Instruct_webgpu_fp16.json WebGPU FP16 workflow (no artifacts)
nvidia-Mistral-NeMo-12B-Instruct/webgpu/info.yaml Registers WebGPU recipes for discovery/metadata
nvidia-Mistral-NeMo-12B-Instruct/webgpu/README.md WebGPU build/eval usage + notes/results
nvidia-Mistral-NeMo-12B-Instruct/cuda/requirements.txt CUDA dependency set for Mistral-NeMo recipes
nvidia-Mistral-NeMo-12B-Instruct/cuda/nvidia-Mistral-NeMo-12B-Instruct_cuda_int4_with_eval.json CUDA INT4 workflow with MMLU evaluator
nvidia-Mistral-NeMo-12B-Instruct/cuda/nvidia-Mistral-NeMo-12B-Instruct_cuda_int4.json CUDA INT4 workflow (no artifacts)
nvidia-Mistral-NeMo-12B-Instruct/cuda/nvidia-Mistral-NeMo-12B-Instruct_cuda_fp16_with_eval.json CUDA FP16 workflow with MMLU evaluator
nvidia-Mistral-NeMo-12B-Instruct/cuda/nvidia-Mistral-NeMo-12B-Instruct_cuda_fp16.json CUDA FP16 workflow (no artifacts)
nvidia-Mistral-NeMo-12B-Instruct/cuda/info.yaml Registers CUDA recipes for discovery/metadata
nvidia-Mistral-NeMo-12B-Instruct/cuda/README.md CUDA build/eval usage + notes/results
nvidia-Mistral-NeMo-12B-Instruct/cpu/requirements.txt CPU dependency set for Mistral-NeMo recipes
nvidia-Mistral-NeMo-12B-Instruct/cpu/nvidia-Mistral-NeMo-12B-Instruct_cpu_int4_with_eval.json CPU INT4 workflow with MMLU evaluator
nvidia-Mistral-NeMo-12B-Instruct/cpu/nvidia-Mistral-NeMo-12B-Instruct_cpu_int4.json CPU INT4 workflow (no artifacts)
nvidia-Mistral-NeMo-12B-Instruct/cpu/nvidia-Mistral-NeMo-12B-Instruct_cpu_fp32_with_eval.json CPU FP32 workflow with MMLU evaluator
nvidia-Mistral-NeMo-12B-Instruct/cpu/nvidia-Mistral-NeMo-12B-Instruct_cpu_fp32.json CPU FP32 workflow (no artifacts)
nvidia-Mistral-NeMo-12B-Instruct/cpu/info.yaml Registers CPU recipes for discovery/metadata
nvidia-Mistral-NeMo-12B-Instruct/cpu/README.md CPU build/eval usage + notes/results
nvidia-Mistral-NeMo-12B-Instruct/baseline/requirements.txt Baseline (PyTorch eval) dependency set
nvidia-Mistral-NeMo-12B-Instruct/baseline/nvidia-Mistral-NeMo-12B-Instruct_pytorch_with_eval.json PyTorch MMLU baseline evaluation config
nvidia-Mistral-NeMo-12B-Instruct/baseline/info.yaml Registers baseline recipe for discovery/metadata
nvidia-Mistral-NeMo-12B-Instruct/baseline/README.md Baseline eval usage + reported results
nvidia-Mistral-NeMo-12B-Instruct/LICENSE Per-model Apache-2.0 license file
allenai-Olmo-3-7B-Instruct/webgpu/requirements.txt WebGPU dependency set for OLMo recipes
allenai-Olmo-3-7B-Instruct/webgpu/info.yaml Registers WebGPU recipes for discovery/metadata
allenai-Olmo-3-7B-Instruct/webgpu/allenai-Olmo-3-7B-Instruct_webgpu_int4_with_eval.json WebGPU INT4 workflow with MMLU evaluator
allenai-Olmo-3-7B-Instruct/webgpu/allenai-Olmo-3-7B-Instruct_webgpu_int4.json WebGPU INT4 workflow (no artifacts)
allenai-Olmo-3-7B-Instruct/webgpu/allenai-Olmo-3-7B-Instruct_webgpu_fp16_with_eval.json WebGPU FP16 workflow with MMLU evaluator
allenai-Olmo-3-7B-Instruct/webgpu/allenai-Olmo-3-7B-Instruct_webgpu_fp16.json WebGPU FP16 workflow (no artifacts)
allenai-Olmo-3-7B-Instruct/webgpu/README.md WebGPU build/eval usage + notes/results
allenai-Olmo-3-7B-Instruct/cuda/requirements.txt CUDA dependency set for OLMo recipes
allenai-Olmo-3-7B-Instruct/cuda/info.yaml Registers CUDA recipes for discovery/metadata
allenai-Olmo-3-7B-Instruct/cuda/allenai-Olmo-3-7B-Instruct_cuda_int4_with_eval.json CUDA INT4 workflow with MMLU evaluator
allenai-Olmo-3-7B-Instruct/cuda/allenai-Olmo-3-7B-Instruct_cuda_int4.json CUDA INT4 workflow (no artifacts)
allenai-Olmo-3-7B-Instruct/cuda/allenai-Olmo-3-7B-Instruct_cuda_fp16_with_eval.json CUDA FP16 workflow with MMLU evaluator
allenai-Olmo-3-7B-Instruct/cuda/allenai-Olmo-3-7B-Instruct_cuda_fp16.json CUDA FP16 workflow (no artifacts)
allenai-Olmo-3-7B-Instruct/cuda/README.md CUDA build/eval usage + notes/results
allenai-Olmo-3-7B-Instruct/cpu/requirements.txt CPU dependency set for OLMo recipes
allenai-Olmo-3-7B-Instruct/cpu/info.yaml Registers CPU recipes for discovery/metadata
allenai-Olmo-3-7B-Instruct/cpu/allenai-Olmo-3-7B-Instruct_cpu_int4_with_eval.json CPU INT4 workflow with MMLU evaluator
allenai-Olmo-3-7B-Instruct/cpu/allenai-Olmo-3-7B-Instruct_cpu_int4.json CPU INT4 workflow (no artifacts)
allenai-Olmo-3-7B-Instruct/cpu/allenai-Olmo-3-7B-Instruct_cpu_fp32_with_eval.json CPU FP32 workflow with MMLU evaluator
allenai-Olmo-3-7B-Instruct/cpu/allenai-Olmo-3-7B-Instruct_cpu_fp32.json CPU FP32 workflow (no artifacts)
allenai-Olmo-3-7B-Instruct/cpu/README.md CPU build/eval usage + notes/results
allenai-Olmo-3-7B-Instruct/baseline/requirements.txt Baseline (PyTorch eval) dependency set
allenai-Olmo-3-7B-Instruct/baseline/info.yaml Registers baseline recipe for discovery/metadata
allenai-Olmo-3-7B-Instruct/baseline/allenai-Olmo-3-7B-Instruct_pytorch_with_eval.json PyTorch MMLU baseline evaluation config
allenai-Olmo-3-7B-Instruct/baseline/README.md Baseline eval usage + reported results
allenai-Olmo-3-7B-Instruct/LICENSE Per-model Apache-2.0 license file
HuggingFaceTB-SmolLM3-3B/webgpu/requirements.txt WebGPU dependency set for SmolLM3 recipes
HuggingFaceTB-SmolLM3-3B/webgpu/info.yaml Registers WebGPU recipes for discovery/metadata
HuggingFaceTB-SmolLM3-3B/webgpu/README.md WebGPU build/eval usage + notes/results
HuggingFaceTB-SmolLM3-3B/webgpu/HuggingFaceTB-SmolLM3-3B_webgpu_int4_with_eval.json WebGPU INT4 workflow with MMLU evaluator
HuggingFaceTB-SmolLM3-3B/webgpu/HuggingFaceTB-SmolLM3-3B_webgpu_int4.json WebGPU INT4 workflow (no artifacts)
HuggingFaceTB-SmolLM3-3B/webgpu/HuggingFaceTB-SmolLM3-3B_webgpu_fp16_with_eval.json WebGPU FP16 workflow with MMLU evaluator
HuggingFaceTB-SmolLM3-3B/webgpu/HuggingFaceTB-SmolLM3-3B_webgpu_fp16.json WebGPU FP16 workflow (no artifacts)
HuggingFaceTB-SmolLM3-3B/cuda/requirements.txt CUDA dependency set for SmolLM3 recipes
HuggingFaceTB-SmolLM3-3B/cuda/info.yaml Registers CUDA recipes for discovery/metadata
HuggingFaceTB-SmolLM3-3B/cuda/README.md CUDA build/eval usage + notes/results
HuggingFaceTB-SmolLM3-3B/cuda/HuggingFaceTB-SmolLM3-3B_cuda_int4_with_eval.json CUDA INT4 workflow with MMLU evaluator
HuggingFaceTB-SmolLM3-3B/cuda/HuggingFaceTB-SmolLM3-3B_cuda_int4.json CUDA INT4 workflow (no artifacts)
HuggingFaceTB-SmolLM3-3B/cuda/HuggingFaceTB-SmolLM3-3B_cuda_fp16_with_eval.json CUDA FP16 workflow with MMLU evaluator
HuggingFaceTB-SmolLM3-3B/cuda/HuggingFaceTB-SmolLM3-3B_cuda_fp16.json CUDA FP16 workflow (no artifacts)
HuggingFaceTB-SmolLM3-3B/cpu/requirements.txt CPU dependency set for SmolLM3 recipes
HuggingFaceTB-SmolLM3-3B/cpu/info.yaml Registers CPU recipes for discovery/metadata
HuggingFaceTB-SmolLM3-3B/cpu/README.md CPU build/eval usage + notes/results
HuggingFaceTB-SmolLM3-3B/cpu/HuggingFaceTB-SmolLM3-3B_cpu_int4_with_eval.json CPU INT4 workflow with MMLU evaluator
HuggingFaceTB-SmolLM3-3B/cpu/HuggingFaceTB-SmolLM3-3B_cpu_int4.json CPU INT4 workflow (no artifacts)
HuggingFaceTB-SmolLM3-3B/cpu/HuggingFaceTB-SmolLM3-3B_cpu_fp32_with_eval.json CPU FP32 workflow with MMLU evaluator
HuggingFaceTB-SmolLM3-3B/cpu/HuggingFaceTB-SmolLM3-3B_cpu_fp32.json CPU FP32 workflow (no artifacts)
HuggingFaceTB-SmolLM3-3B/baseline/requirements.txt Baseline (PyTorch eval) dependency set
HuggingFaceTB-SmolLM3-3B/baseline/info.yaml Registers baseline recipe for discovery/metadata
HuggingFaceTB-SmolLM3-3B/baseline/README.md Baseline eval usage + reported results
HuggingFaceTB-SmolLM3-3B/baseline/HuggingFaceTB-SmolLM3-3B_pytorch_with_eval.json PyTorch MMLU baseline evaluation config
HuggingFaceTB-SmolLM3-3B/LICENSE Per-model Apache-2.0 license file


Excerpt from the per-backend requirements.txt files (the same base set is repeated verbatim across several recipe folders):

sentencepiece
tiktoken
torch
transformers>=4.57.1
Review excerpt (file lines 2–54) from what appears to be the SmolLM3 CUDA FP16 with-eval config:

"input_model": {
  "type": "HfModel",
  "model_path": "HuggingFaceTB/SmolLM3-3B",
  "load_kwargs": {
    "torch_dtype": "float16"
  }
},
"systems": {
  "local_system": {
    "type": "LocalSystem",
    "accelerators": [
      {
        "device": "gpu",
        "execution_providers": [ "CUDAExecutionProvider" ]
      }
    ]
  }
},
"passes": {
  "m": {
    "type": "ModelBuilder",
    "precision": "fp16"
  },
  "t": {
    "type": "GraphSurgeries",
    "surgeries": [
      { "surgeon": "TieWordEmbeddings" }
    ]
  }
},
"target": "local_system",
"log_severity_level": 0,
"output_dir": "model_cuda_fp16",
"cache_dir": "cache_cuda_fp16",
"evaluators": {
  "mmlu": {
    "type": "LMEvaluator",
    "tasks": [ "mmlu" ],
    "batch_size": 1,
    "max_length": 2048,
    "provider_options": {
      "enable_skip_layer_norm_strict_mode": "1"
    }
  }
},
"evaluator": "mmlu",
"evaluate_input_model": false
Review excerpt (file lines 2–54) from what appears to be the SmolLM3 CPU FP32 with-eval config:

"input_model": {
  "type": "HfModel",
  "model_path": "HuggingFaceTB/SmolLM3-3B",
  "load_kwargs": {
    "torch_dtype": "float16"
  }
},
"systems": {
  "local_system": {
    "type": "LocalSystem",
    "accelerators": [
      {
        "device": "cpu",
        "execution_providers": [ "CPUExecutionProvider" ]
      }
    ]
  }
},
"passes": {
  "m": {
    "type": "ModelBuilder",
    "precision": "fp32"
  },
  "t": {
    "type": "GraphSurgeries",
    "surgeries": [
      { "surgeon": "TieWordEmbeddings" }
    ]
  }
},
"target": "local_system",
"log_severity_level": 0,
"output_dir": "model_cpu_fp32",
"cache_dir": "cache_cpu_fp32",
"evaluators": {
  "mmlu": {
    "type": "LMEvaluator",
    "tasks": [ "mmlu" ],
    "batch_size": 1,
    "max_length": 2048,
    "provider_options": {
      "enable_skip_layer_norm_strict_mode": "1"
    }
  }
},
"evaluator": "mmlu",
"evaluate_input_model": false
Review excerpt (file lines 2–54) from what appears to be the SmolLM3 WebGPU FP16 with-eval config:

"input_model": {
  "type": "HfModel",
  "model_path": "HuggingFaceTB/SmolLM3-3B",
  "load_kwargs": {
    "torch_dtype": "float16"
  }
},
"systems": {
  "local_system": {
    "type": "LocalSystem",
    "accelerators": [
      {
        "device": "gpu",
        "execution_providers": [ "WebGpuExecutionProvider" ]
      }
    ]
  }
},
"passes": {
  "m": {
    "type": "ModelBuilder",
    "precision": "fp16"
  },
  "t": {
    "type": "GraphSurgeries",
    "surgeries": [
      { "surgeon": "TieWordEmbeddings" }
    ]
  }
},
"target": "local_system",
"log_severity_level": 0,
"output_dir": "model_webgpu_fp16",
"cache_dir": "cache_webgpu_fp16",
"evaluators": {
  "mmlu": {
    "type": "LMEvaluator",
    "tasks": [ "mmlu" ],
    "batch_size": 1,
    "max_length": 2048,
    "provider_options": {
      "enable_skip_layer_norm_strict_mode": "1"
    }
  }
},
"evaluator": "mmlu",
"evaluate_input_model": false
Further requirements.txt excerpts, with duplicates collapsed. The base set again:

sentencepiece
tiktoken
torch
transformers>=4.57.1

And a variant that adds the ONNX Runtime GenAI/WebGPU packages:

torch
transformers>=4.57.1
onnxruntime-genai
onnxruntime-webgpu