Add MMLU benchmark evaluation to evals by CarlG0123 · Pull Request #1183 · TransformerLensOrg/TransformerLens

CarlG0123 · 2026-02-24T00:43:50Z

Summary

Add mmlu_eval() function for evaluating models on the MMLU (Massive Multitask Language Understanding) benchmark
Add make_mmlu_data_loader() for loading MMLU data from HuggingFace (cais/mmlu dataset)
Add MMLU_SUBJECTS (all 57 subjects) and MMLU_ANSWER_LETTERS module-level constants
Uses standard MMLU zero-shot letter-prediction format: show all choices (A-D) in prompt, compare log probabilities for each answer letter token

Changes

transformer_lens/evals.py: Added MMLU_SUBJECTS, MMLU_ANSWER_LETTERS, make_mmlu_data_loader(), and mmlu_eval()
tests/acceptance/test_evals.py: Added 5 tests covering data loading (single/multiple/invalid subjects) and evaluation

Testing

Verified with multiple models: Qwen1.5-7B (~40%), Mistral-7B (~45%), GPT-2, Pythia-1.4b
Results align with published zero-shot MMLU benchmarks for these model sizes
All existing tests continue to pass

Test plan

pytest tests/acceptance/test_evals.py -v
Verify MMLU data loading for single and multiple subjects
Verify evaluation produces reasonable accuracy scores

🤖 Generated with Claude Code

Skip MMLU docstring examples in doctest runs since they require network access (HuggingFace dataset download) and may require GPU. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Carl Gross and others added 2 commits February 12, 2026 16:00

adding MMLU to evals, updating corresponding tests

98113e2

Fix docstring tests for MMLU functions

14a4fb1

Skip MMLU docstring examples in doctest runs since they require network access (HuggingFace dataset download) and may require GPU. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add MMLU benchmark evaluation to evals#1183

Add MMLU benchmark evaluation to evals#1183
CarlG0123 wants to merge 2 commits intoTransformerLensOrg:devfrom
CarlG0123:add-mmlu-eval

CarlG0123 commented Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

CarlG0123 commented Feb 24, 2026

Summary

Changes

Testing

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant