Add Olmo-3-7B-Instruct, SmolLM3-3B, and Mistral-NeMo-12B-Instruct recipes #391

Open

hanbitmyths wants to merge 2 commits into microsoft:main
Conversation
…ipes

Adds Olive recipes for three new models with cpu/cuda/webgpu backends and a PyTorch baseline:

- allenai/Olmo-3-7B-Instruct
- HuggingFaceTB/SmolLM3-3B
- nvidia/Mistral-NeMo-12B-Instruct (mistralai/Mistral-Nemo-Instruct-2407)

Each backend dir includes fp32/fp16 and int4 configs (with and without MMLU eval).
Pull request overview
Adds Olive optimization + evaluation recipes for three new instruction-tuned LLMs (OLMo-3-7B-Instruct, SmolLM3-3B, and Mistral-NeMo-12B-Instruct) across CPU/CUDA/WebGPU backends, plus PyTorch baseline evaluation configs for MMLU comparison.
Changes:
- Introduces per-backend (cpu/cuda/webgpu) FP baseline and INT4 export workflows, including `_with_eval` MMLU evaluator variants (see the fragment after this list).
- Adds per-model baseline PyTorch+MMLU evaluation recipes for accuracy comparison.
- Adds per-model Apache-2.0 LICENSE files and per-backend documentation/requirements.
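Judging by the file pairs listed below, the `_with_eval` variants differ from their base workflows mainly by the evaluator wiring. The fragment is abridged from the configs excerpted later on this page (the `provider_options` key is omitted here):

```json
"evaluators": {
    "mmlu": {
        "type": "LMEvaluator",
        "tasks": [
            "mmlu"
        ],
        "batch_size": 1,
        "max_length": 2048
    }
},
"evaluator": "mmlu"
```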
Reviewed changes
Copilot reviewed 78 out of 78 changed files in this pull request and generated 21 comments.
| File | Description |
|---|---|
| nvidia-Mistral-NeMo-12B-Instruct/webgpu/requirements.txt | WebGPU dependency set for Mistral-NeMo recipes |
| nvidia-Mistral-NeMo-12B-Instruct/webgpu/nvidia-Mistral-NeMo-12B-Instruct_webgpu_int4_with_eval.json | WebGPU INT4 workflow with MMLU evaluator |
| nvidia-Mistral-NeMo-12B-Instruct/webgpu/nvidia-Mistral-NeMo-12B-Instruct_webgpu_int4.json | WebGPU INT4 workflow (no artifacts) |
| nvidia-Mistral-NeMo-12B-Instruct/webgpu/nvidia-Mistral-NeMo-12B-Instruct_webgpu_fp16_with_eval.json | WebGPU FP16 workflow with MMLU evaluator |
| nvidia-Mistral-NeMo-12B-Instruct/webgpu/nvidia-Mistral-NeMo-12B-Instruct_webgpu_fp16.json | WebGPU FP16 workflow (no artifacts) |
| nvidia-Mistral-NeMo-12B-Instruct/webgpu/info.yaml | Registers WebGPU recipes for discovery/metadata |
| nvidia-Mistral-NeMo-12B-Instruct/webgpu/README.md | WebGPU build/eval usage + notes/results |
| nvidia-Mistral-NeMo-12B-Instruct/cuda/requirements.txt | CUDA dependency set for Mistral-NeMo recipes |
| nvidia-Mistral-NeMo-12B-Instruct/cuda/nvidia-Mistral-NeMo-12B-Instruct_cuda_int4_with_eval.json | CUDA INT4 workflow with MMLU evaluator |
| nvidia-Mistral-NeMo-12B-Instruct/cuda/nvidia-Mistral-NeMo-12B-Instruct_cuda_int4.json | CUDA INT4 workflow (no artifacts) |
| nvidia-Mistral-NeMo-12B-Instruct/cuda/nvidia-Mistral-NeMo-12B-Instruct_cuda_fp16_with_eval.json | CUDA FP16 workflow with MMLU evaluator |
| nvidia-Mistral-NeMo-12B-Instruct/cuda/nvidia-Mistral-NeMo-12B-Instruct_cuda_fp16.json | CUDA FP16 workflow (no artifacts) |
| nvidia-Mistral-NeMo-12B-Instruct/cuda/info.yaml | Registers CUDA recipes for discovery/metadata |
| nvidia-Mistral-NeMo-12B-Instruct/cuda/README.md | CUDA build/eval usage + notes/results |
| nvidia-Mistral-NeMo-12B-Instruct/cpu/requirements.txt | CPU dependency set for Mistral-NeMo recipes |
| nvidia-Mistral-NeMo-12B-Instruct/cpu/nvidia-Mistral-NeMo-12B-Instruct_cpu_int4_with_eval.json | CPU INT4 workflow with MMLU evaluator |
| nvidia-Mistral-NeMo-12B-Instruct/cpu/nvidia-Mistral-NeMo-12B-Instruct_cpu_int4.json | CPU INT4 workflow (no artifacts) |
| nvidia-Mistral-NeMo-12B-Instruct/cpu/nvidia-Mistral-NeMo-12B-Instruct_cpu_fp32_with_eval.json | CPU FP32 workflow with MMLU evaluator |
| nvidia-Mistral-NeMo-12B-Instruct/cpu/nvidia-Mistral-NeMo-12B-Instruct_cpu_fp32.json | CPU FP32 workflow (no artifacts) |
| nvidia-Mistral-NeMo-12B-Instruct/cpu/info.yaml | Registers CPU recipes for discovery/metadata |
| nvidia-Mistral-NeMo-12B-Instruct/cpu/README.md | CPU build/eval usage + notes/results |
| nvidia-Mistral-NeMo-12B-Instruct/baseline/requirements.txt | Baseline (PyTorch eval) dependency set |
| nvidia-Mistral-NeMo-12B-Instruct/baseline/nvidia-Mistral-NeMo-12B-Instruct_pytorch_with_eval.json | PyTorch MMLU baseline evaluation config |
| nvidia-Mistral-NeMo-12B-Instruct/baseline/info.yaml | Registers baseline recipe for discovery/metadata |
| nvidia-Mistral-NeMo-12B-Instruct/baseline/README.md | Baseline eval usage + reported results |
| nvidia-Mistral-NeMo-12B-Instruct/LICENSE | Per-model Apache-2.0 license file |
| allenai-Olmo-3-7B-Instruct/webgpu/requirements.txt | WebGPU dependency set for OLMo recipes |
| allenai-Olmo-3-7B-Instruct/webgpu/info.yaml | Registers WebGPU recipes for discovery/metadata |
| allenai-Olmo-3-7B-Instruct/webgpu/allenai-Olmo-3-7B-Instruct_webgpu_int4_with_eval.json | WebGPU INT4 workflow with MMLU evaluator |
| allenai-Olmo-3-7B-Instruct/webgpu/allenai-Olmo-3-7B-Instruct_webgpu_int4.json | WebGPU INT4 workflow (no artifacts) |
| allenai-Olmo-3-7B-Instruct/webgpu/allenai-Olmo-3-7B-Instruct_webgpu_fp16_with_eval.json | WebGPU FP16 workflow with MMLU evaluator |
| allenai-Olmo-3-7B-Instruct/webgpu/allenai-Olmo-3-7B-Instruct_webgpu_fp16.json | WebGPU FP16 workflow (no artifacts) |
| allenai-Olmo-3-7B-Instruct/webgpu/README.md | WebGPU build/eval usage + notes/results |
| allenai-Olmo-3-7B-Instruct/cuda/requirements.txt | CUDA dependency set for OLMo recipes |
| allenai-Olmo-3-7B-Instruct/cuda/info.yaml | Registers CUDA recipes for discovery/metadata |
| allenai-Olmo-3-7B-Instruct/cuda/allenai-Olmo-3-7B-Instruct_cuda_int4_with_eval.json | CUDA INT4 workflow with MMLU evaluator |
| allenai-Olmo-3-7B-Instruct/cuda/allenai-Olmo-3-7B-Instruct_cuda_int4.json | CUDA INT4 workflow (no artifacts) |
| allenai-Olmo-3-7B-Instruct/cuda/allenai-Olmo-3-7B-Instruct_cuda_fp16_with_eval.json | CUDA FP16 workflow with MMLU evaluator |
| allenai-Olmo-3-7B-Instruct/cuda/allenai-Olmo-3-7B-Instruct_cuda_fp16.json | CUDA FP16 workflow (no artifacts) |
| allenai-Olmo-3-7B-Instruct/cuda/README.md | CUDA build/eval usage + notes/results |
| allenai-Olmo-3-7B-Instruct/cpu/requirements.txt | CPU dependency set for OLMo recipes |
| allenai-Olmo-3-7B-Instruct/cpu/info.yaml | Registers CPU recipes for discovery/metadata |
| allenai-Olmo-3-7B-Instruct/cpu/allenai-Olmo-3-7B-Instruct_cpu_int4_with_eval.json | CPU INT4 workflow with MMLU evaluator |
| allenai-Olmo-3-7B-Instruct/cpu/allenai-Olmo-3-7B-Instruct_cpu_int4.json | CPU INT4 workflow (no artifacts) |
| allenai-Olmo-3-7B-Instruct/cpu/allenai-Olmo-3-7B-Instruct_cpu_fp32_with_eval.json | CPU FP32 workflow with MMLU evaluator |
| allenai-Olmo-3-7B-Instruct/cpu/allenai-Olmo-3-7B-Instruct_cpu_fp32.json | CPU FP32 workflow (no artifacts) |
| allenai-Olmo-3-7B-Instruct/cpu/README.md | CPU build/eval usage + notes/results |
| allenai-Olmo-3-7B-Instruct/baseline/requirements.txt | Baseline (PyTorch eval) dependency set |
| allenai-Olmo-3-7B-Instruct/baseline/info.yaml | Registers baseline recipe for discovery/metadata |
| allenai-Olmo-3-7B-Instruct/baseline/allenai-Olmo-3-7B-Instruct_pytorch_with_eval.json | PyTorch MMLU baseline evaluation config |
| allenai-Olmo-3-7B-Instruct/baseline/README.md | Baseline eval usage + reported results |
| allenai-Olmo-3-7B-Instruct/LICENSE | Per-model Apache-2.0 license file |
| HuggingFaceTB-SmolLM3-3B/webgpu/requirements.txt | WebGPU dependency set for SmolLM3 recipes |
| HuggingFaceTB-SmolLM3-3B/webgpu/info.yaml | Registers WebGPU recipes for discovery/metadata |
| HuggingFaceTB-SmolLM3-3B/webgpu/README.md | WebGPU build/eval usage + notes/results |
| HuggingFaceTB-SmolLM3-3B/webgpu/HuggingFaceTB-SmolLM3-3B_webgpu_int4_with_eval.json | WebGPU INT4 workflow with MMLU evaluator |
| HuggingFaceTB-SmolLM3-3B/webgpu/HuggingFaceTB-SmolLM3-3B_webgpu_int4.json | WebGPU INT4 workflow (no artifacts) |
| HuggingFaceTB-SmolLM3-3B/webgpu/HuggingFaceTB-SmolLM3-3B_webgpu_fp16_with_eval.json | WebGPU FP16 workflow with MMLU evaluator |
| HuggingFaceTB-SmolLM3-3B/webgpu/HuggingFaceTB-SmolLM3-3B_webgpu_fp16.json | WebGPU FP16 workflow (no artifacts) |
| HuggingFaceTB-SmolLM3-3B/cuda/requirements.txt | CUDA dependency set for SmolLM3 recipes |
| HuggingFaceTB-SmolLM3-3B/cuda/info.yaml | Registers CUDA recipes for discovery/metadata |
| HuggingFaceTB-SmolLM3-3B/cuda/README.md | CUDA build/eval usage + notes/results |
| HuggingFaceTB-SmolLM3-3B/cuda/HuggingFaceTB-SmolLM3-3B_cuda_int4_with_eval.json | CUDA INT4 workflow with MMLU evaluator |
| HuggingFaceTB-SmolLM3-3B/cuda/HuggingFaceTB-SmolLM3-3B_cuda_int4.json | CUDA INT4 workflow (no artifacts) |
| HuggingFaceTB-SmolLM3-3B/cuda/HuggingFaceTB-SmolLM3-3B_cuda_fp16_with_eval.json | CUDA FP16 workflow with MMLU evaluator |
| HuggingFaceTB-SmolLM3-3B/cuda/HuggingFaceTB-SmolLM3-3B_cuda_fp16.json | CUDA FP16 workflow (no artifacts) |
| HuggingFaceTB-SmolLM3-3B/cpu/requirements.txt | CPU dependency set for SmolLM3 recipes |
| HuggingFaceTB-SmolLM3-3B/cpu/info.yaml | Registers CPU recipes for discovery/metadata |
| HuggingFaceTB-SmolLM3-3B/cpu/README.md | CPU build/eval usage + notes/results |
| HuggingFaceTB-SmolLM3-3B/cpu/HuggingFaceTB-SmolLM3-3B_cpu_int4_with_eval.json | CPU INT4 workflow with MMLU evaluator |
| HuggingFaceTB-SmolLM3-3B/cpu/HuggingFaceTB-SmolLM3-3B_cpu_int4.json | CPU INT4 workflow (no artifacts) |
| HuggingFaceTB-SmolLM3-3B/cpu/HuggingFaceTB-SmolLM3-3B_cpu_fp32_with_eval.json | CPU FP32 workflow with MMLU evaluator |
| HuggingFaceTB-SmolLM3-3B/cpu/HuggingFaceTB-SmolLM3-3B_cpu_fp32.json | CPU FP32 workflow (no artifacts) |
| HuggingFaceTB-SmolLM3-3B/baseline/requirements.txt | Baseline (PyTorch eval) dependency set |
| HuggingFaceTB-SmolLM3-3B/baseline/info.yaml | Registers baseline recipe for discovery/metadata |
| HuggingFaceTB-SmolLM3-3B/baseline/README.md | Baseline eval usage + reported results |
| HuggingFaceTB-SmolLM3-3B/baseline/HuggingFaceTB-SmolLM3-3B_pytorch_with_eval.json | PyTorch MMLU baseline evaluation config |
| HuggingFaceTB-SmolLM3-3B/LICENSE | Per-model Apache-2.0 license file |
requirements.txt (identical copies appear in several of the per-model recipe folders):

```
sentencepiece
tiktoken
torch
transformers>=4.57.1
```
Comment on lines +2 to +54 (the SmolLM3-3B CUDA FP16 `_with_eval` workflow):

```json
"input_model": {
    "type": "HfModel",
    "model_path": "HuggingFaceTB/SmolLM3-3B",
    "load_kwargs": {
        "torch_dtype": "float16"
    }
},
"systems": {
    "local_system": {
        "type": "LocalSystem",
        "accelerators": [
            {
                "device": "gpu",
                "execution_providers": [
                    "CUDAExecutionProvider"
                ]
            }
        ]
    }
},
"passes": {
    "m": {
        "type": "ModelBuilder",
        "precision": "fp16"
    },
    "t": {
        "type": "GraphSurgeries",
        "surgeries": [
            {
                "surgeon": "TieWordEmbeddings"
            }
        ]
    }
},
"target": "local_system",
"log_severity_level": 0,
"output_dir": "model_cuda_fp16",
"cache_dir": "cache_cuda_fp16",
"evaluators": {
    "mmlu": {
        "type": "LMEvaluator",
        "tasks": [
            "mmlu"
        ],
        "batch_size": 1,
        "max_length": 2048,
        "provider_options": {
            "enable_skip_layer_norm_strict_mode": "1"
        }
    }
},
"evaluator": "mmlu",
"evaluate_input_model": false
```
Comment on lines +2 to +54 (the SmolLM3-3B CPU FP32 `_with_eval` workflow):

```json
"input_model": {
    "type": "HfModel",
    "model_path": "HuggingFaceTB/SmolLM3-3B",
    "load_kwargs": {
        "torch_dtype": "float16"
    }
},
"systems": {
    "local_system": {
        "type": "LocalSystem",
        "accelerators": [
            {
                "device": "cpu",
                "execution_providers": [
                    "CPUExecutionProvider"
                ]
            }
        ]
    }
},
"passes": {
    "m": {
        "type": "ModelBuilder",
        "precision": "fp32"
    },
    "t": {
        "type": "GraphSurgeries",
        "surgeries": [
            {
                "surgeon": "TieWordEmbeddings"
            }
        ]
    }
},
"target": "local_system",
"log_severity_level": 0,
"output_dir": "model_cpu_fp32",
"cache_dir": "cache_cpu_fp32",
"evaluators": {
    "mmlu": {
        "type": "LMEvaluator",
        "tasks": [
            "mmlu"
        ],
        "batch_size": 1,
        "max_length": 2048,
        "provider_options": {
            "enable_skip_layer_norm_strict_mode": "1"
        }
    }
},
"evaluator": "mmlu",
"evaluate_input_model": false
```
Comment on lines +2 to +54 (the SmolLM3-3B WebGPU FP16 `_with_eval` workflow):

```json
"input_model": {
    "type": "HfModel",
    "model_path": "HuggingFaceTB/SmolLM3-3B",
    "load_kwargs": {
        "torch_dtype": "float16"
    }
},
"systems": {
    "local_system": {
        "type": "LocalSystem",
        "accelerators": [
            {
                "device": "gpu",
                "execution_providers": [
                    "WebGpuExecutionProvider"
                ]
            }
        ]
    }
},
"passes": {
    "m": {
        "type": "ModelBuilder",
        "precision": "fp16"
    },
    "t": {
        "type": "GraphSurgeries",
        "surgeries": [
            {
                "surgeon": "TieWordEmbeddings"
            }
        ]
    }
},
"target": "local_system",
"log_severity_level": 0,
"output_dir": "model_webgpu_fp16",
"cache_dir": "cache_webgpu_fp16",
"evaluators": {
    "mmlu": {
        "type": "LMEvaluator",
        "tasks": [
            "mmlu"
        ],
        "batch_size": 1,
        "max_length": 2048,
        "provider_options": {
            "enable_skip_layer_norm_strict_mode": "1"
        }
    }
},
"evaluator": "mmlu",
"evaluate_input_model": false
```
A WebGPU requirements.txt:

```
torch
transformers>=4.57.1
onnxruntime-genai
onnxruntime-webgpu
```
Adds Olive recipes for three new instruction-tuned LLMs, each with cpu / cuda / webgpu backends and a PyTorch baseline directory for MMLU comparison.
Models
- `allenai/Olmo-3-7B-Instruct` (arch: `olmo3`)
- `HuggingFaceTB/SmolLM3-3B` (arch: `smollm3`)
- `nvidia/Mistral-NeMo-12B-Instruct` (`mistralai/Mistral-Nemo-Instruct-2407`, arch: `mistral`)

Per-backend layout (matches existing Olive recipe convention)
Each `cpu/`, `cuda/`, `webgpu/` subfolder contains:

- `<model>_<backend>_fp32.json` / `_fp16.json` (FP baseline export)
- `<model>_<backend>_int4.json` (canonical INT4 export)
- `*_with_eval.json` variants that add an MMLU evaluator
- `info.yaml`, `README.md`, `requirements.txt`
A `baseline/` folder per model carries the PyTorch+MMLU evaluation recipe used to compare INT4 vs. FP baselines.
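The baseline JSON itself is not excerpted on this page, so the following is only a rough sketch of what such a recipe plausibly looks like, reusing the `HfModel` and `LMEvaluator` shapes from the ONNX configs above; the pass-free structure and `evaluate_input_model: true` are assumptions, not confirmed contents of this PR:

```json
{
    "_comment": "Illustrative sketch only, not the PR's actual baseline file",
    "input_model": {
        "type": "HfModel",
        "model_path": "HuggingFaceTB/SmolLM3-3B"
    },
    "evaluators": {
        "mmlu": {
            "type": "LMEvaluator",
            "tasks": [
                "mmlu"
            ],
            "batch_size": 1,
            "max_length": 2048
        }
    },
    "evaluator": "mmlu",
    "evaluate_input_model": true
}
```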
INT4 quantization pipeline

`SelectiveMixedPrecision` (`kld_gradient` for 3B/7B, `k_quant_last` for 12B-on-WebGPU to fit in 80 GB) → `GPTQ` → `RTN` (8-bit lm_head/embeds via overrides) → `ModelBuilder` (`precision=int4`). Group size 128 for cpu/cuda, 32 for webgpu.
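For concreteness, here is a hedged sketch of how that chain could be spelled in one of the cpu/cuda INT4 configs. The pass names come from the description above, but the exact `type` strings and option keys (`algorithm`, `group_size`, `overrides`) are assumptions, not this PR's verified schema:

```json
"passes": {
    "_comment": "Sketch only: option keys and type spellings are assumptions",
    "p": {
        "type": "SelectiveMixedPrecision",
        "algorithm": "kld_gradient"
    },
    "q": {
        "type": "Gptq",
        "group_size": 128
    },
    "r": {
        "type": "Rtn",
        "overrides": {
            "lm_head": { "bits": 8 },
            "embed_tokens": { "bits": 8 }
        }
    },
    "m": {
        "type": "ModelBuilder",
        "precision": "int4"
    }
}
```

Per the description, the webgpu variants would use group size 32, and the 12B WebGPU config would select `k_quant_last` instead of `kld_gradient`.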
All 9 INT4 artifacts (3 models × 3 backends) were rebuilt from clean caches and validated:
onnxruntime_genairepetition heuristic on Linux (cpu/cuda); WebGPU EP is browser-only so its repetition test on Linux is skipped.Top-level Apache 2.0
LICENSEadded per model, matching the Qwen3 recipe pattern.