Skip to content

Add MMLU benchmark evaluation to evals#1183

Open
CarlG0123 wants to merge 2 commits intoTransformerLensOrg:devfrom
CarlG0123:add-mmlu-eval
Open

Add MMLU benchmark evaluation to evals#1183
CarlG0123 wants to merge 2 commits intoTransformerLensOrg:devfrom
CarlG0123:add-mmlu-eval

Conversation

@CarlG0123
Copy link

Summary

  • Add mmlu_eval() function for evaluating models on the MMLU (Massive Multitask Language Understanding) benchmark
  • Add make_mmlu_data_loader() for loading MMLU data from HuggingFace (cais/mmlu dataset)
  • Add MMLU_SUBJECTS (all 57 subjects) and MMLU_ANSWER_LETTERS module-level constants
  • Uses standard MMLU zero-shot letter-prediction format: show all choices (A-D) in prompt, compare log probabilities for each answer letter token

Changes

  • transformer_lens/evals.py: Added MMLU_SUBJECTS, MMLU_ANSWER_LETTERS, make_mmlu_data_loader(), and mmlu_eval()
  • tests/acceptance/test_evals.py: Added 5 tests covering data loading (single/multiple/invalid subjects) and evaluation

Testing

  • Verified with multiple models: Qwen1.5-7B (~40%), Mistral-7B (~45%), GPT-2, Pythia-1.4b
  • Results align with published zero-shot MMLU benchmarks for these model sizes
  • All existing tests continue to pass

Test plan

  • pytest tests/acceptance/test_evals.py -v
  • Verify MMLU data loading for single and multiple subjects
  • Verify evaluation produces reasonable accuracy scores

🤖 Generated with Claude Code

Carl Gross and others added 2 commits February 12, 2026 16:00
Skip MMLU docstring examples in doctest runs since they require
network access (HuggingFace dataset download) and may require GPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant