MMLU
Description
MMLU (Measuring Massive Multitask Language Understanding) is an environment for evaluating broad academic knowledge across 57 subjects. Based on the MMLU benchmark by Hendrycks et al., it covers subjects ranging from elementary mathematics and US history to professional-level law, medicine, and computer science. Each task is a four-option multiple-choice question evaluated by exact letter match.
Capabilities
- Broad academic knowledge across 57 subjects
- Multiple-choice question answering (A/B/C/D)
- STEM, humanities, social sciences, and professional domains
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
MIT.
Tasks
There are four splits in this environment:
- test: 14,042 tasks
- validation: 1,531 tasks
- dev: 285 tasks (5 per subject, for few-shot)
- auxiliary_train: 99,842 tasks
Questions span 57 subjects including abstract algebra, anatomy, astronomy, clinical knowledge, college-level courses, high school subjects, professional law, medicine, and more.
Reward Structure
This is a single-turn environment. The agent submits a single letter (A, B, C, or D) via the answer tool. After stripping whitespace and punctuation, the submitted letter is compared by exact match to the letter of the correct option. Reward is binary: 1.0 if the letters match, 0.0 otherwise. No LLM grading is used.
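The grading rule above can be sketched in a few lines. This is a minimal illustration of binary exact-match scoring, not the environment's actual code; the function name and signature are hypothetical:

```python
import string

def grade(submitted: str, correct_index: int) -> float:
    """Binary exact-match reward (hypothetical sketch): 1.0 if the
    submitted letter matches the correct option, 0.0 otherwise."""
    letters = "ABCD"
    # Strip surrounding whitespace and punctuation before comparing.
    cleaned = submitted.strip().strip(string.punctuation).strip().upper()
    return 1.0 if cleaned == letters[correct_index] else 0.0
```

Note that a submission like " c. " still scores 1.0 for a correct index of 2, since whitespace and punctuation are stripped before the comparison.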
Data
Data is loaded from HuggingFace cais/mmlu at module import time using the datasets library. Each row contains a question, subject, four choices, and the index of the correct answer.
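Given that row schema, a question can be rendered into the A/B/C/D prompt format the agent sees. The helper below is a hypothetical sketch assuming the cais/mmlu field names (`question`, `choices`, `answer`), not the environment's own formatting code:

```python
def format_question(row: dict) -> str:
    """Render a cais/mmlu-style row into a four-option
    multiple-choice prompt (hypothetical sketch)."""
    lines = [row["question"]]
    # Pair each of the four choices with its option letter.
    for letter, choice in zip("ABCD", row["choices"]):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)
```

The correct letter for a row is then simply `"ABCD"[row["answer"]]`.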
Tools
| Tool | Description |
|---|---|
| answer | Submit your answer as a single letter (A, B, C, or D). Ends the episode. |
Time Horizon
Single-turn. The agent reads the question with options and submits one answer.
Environment Difficulty
MMLU is one of the most widely used LLM benchmarks. Frontier models have largely saturated it, with scores clustering above 88%.
| Model | MMLU Score |
|---|---|
| Gemini 3 Pro | 91.8% |
| GPT-5 | 91.4% |
| Human Expert | 89.8% |
| DeepSeek R1 | 88.9% |
Other Environment Requirements
There are no further environment requirements.
Safety
Agents in MMLU answer multiple-choice academic questions in a standard environment. The environment does not present direct safety risks.
Citation
@inproceedings{hendrycks2021measuring,
title={Measuring Massive Multitask Language Understanding},
author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}