MMLU


Description

MMLU (Measuring Massive Multitask Language Understanding) is an environment for evaluating broad academic knowledge across 57 subjects. Based on the MMLU benchmark by Hendrycks et al., it covers subjects ranging from elementary mathematics and US history to professional-level law, medicine, and computer science. Each task is a four-option multiple-choice question evaluated by exact letter match.

Capabilities

  • Broad academic knowledge across 57 subjects
  • Multiple-choice question answering (A/B/C/D)
  • STEM, humanities, social sciences, and professional domains

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT.

Tasks

There are four splits in this environment:

  • test: 14,042 tasks
  • validation: 1,531 tasks
  • dev: 285 tasks (5 per subject, intended for few-shot prompting; see the sketch below)
  • auxiliary_train: 99,842 tasks

Questions span 57 subjects including abstract algebra, anatomy, astronomy, clinical knowledge, college-level courses, high school subjects, professional law, medicine, and more.
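
The dev split is intended for building few-shot prompts. A minimal sketch of how that might be done, assuming the cais/mmlu field names (question, choices, answer) described under Data; format_question and build_few_shot_prompt are hypothetical helpers, not part of the environment:

from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_question(row):
    # Render one MMLU row as a multiple-choice prompt block.
    options = "\n".join(f"{letter}. {choice}"
                        for letter, choice in zip(LETTERS, row["choices"]))
    return f"{row['question']}\n{options}\nAnswer:"

def build_few_shot_prompt(subject, test_row, k=5):
    # Prepend up to k solved dev examples for the subject, then the test question.
    dev = load_dataset("cais/mmlu", subject, split="dev")
    shots = [format_question(row) + f" {LETTERS[row['answer']]}"
             for row in dev.select(range(min(k, len(dev))))]
    return "\n\n".join(shots + [format_question(test_row)])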

Reward Structure

This is a single-turn environment. The agent submits a single letter (A, B, C, or D) via the answer tool. After whitespace and punctuation are stripped, the submission is compared by exact match to the letter of the correct option. Reward is binary: 1.0 if the letter matches the correct answer, 0.0 otherwise. No LLM grading is used.
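
A minimal sketch of grading consistent with this description (grade is a hypothetical name, not the environment's actual implementation):

import string

STRIP_CHARS = string.whitespace + string.punctuation

def grade(submission: str, answer_index: int) -> float:
    # Binary exact-match reward: 1.0 if the submitted letter matches
    # the correct option, 0.0 otherwise. No LLM grading is involved.
    cleaned = submission.strip(STRIP_CHARS).upper()
    correct_letter = "ABCD"[answer_index]
    return 1.0 if cleaned == correct_letter else 0.0

For example, grade("(b).", 1) returns 1.0, while grade("E", 1) returns 0.0.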

Data

Data is loaded from the Hugging Face dataset cais/mmlu at module import time using the datasets library. Each row contains the question text, its subject, four answer choices, and the index of the correct choice.
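
The rows can be inspected directly with the datasets library; the field names below match the published cais/mmlu schema:

from datasets import load_dataset

# Load the combined "all" config; individual subjects are also available
# as configs, e.g. load_dataset("cais/mmlu", "abstract_algebra").
mmlu = load_dataset("cais/mmlu", "all")

row = mmlu["test"][0]
print(row["question"])  # question text
print(row["subject"])   # e.g. "abstract_algebra"
print(row["choices"])   # list of four option strings
print(row["answer"])    # integer index (0-3) of the correct choice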

Tools

Tool      Description
answer    Submit your answer as a single letter (A, B, C, or D). Ends the episode.
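
The tool's exact schema is not published in this README; a sketch of what it might look like, expressed as a Python dict in JSON-Schema style (names and structure are assumptions):

# Illustrative only: field names here are assumptions, not the
# environment's published schema.
ANSWER_TOOL = {
    "name": "answer",
    "description": "Submit your answer as a single letter (A, B, C, or D). "
                   "Ends the episode.",
    "parameters": {
        "type": "object",
        "properties": {
            "letter": {"type": "string", "enum": ["A", "B", "C", "D"]},
        },
        "required": ["letter"],
    },
}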

Time Horizon

Single-turn. The agent reads the question with options and submits one answer.

Environment Difficulty

MMLU is one of the most widely used LLM benchmarks. Frontier models have largely saturated it, with scores clustering above 88%.

Model           MMLU Score
Gemini 3 Pro    91.8%
GPT-5           91.4%
Human Expert    89.8%
DeepSeek R1     88.9%

Other Environment Requirements

There are no further environment requirements.

Safety

Agents in MMLU answer multiple-choice academic questions in a standard environment. The environment does not present direct safety risks.

Citation

@inproceedings{hendrycks2021measuring,
  title={Measuring Massive Multitask Language Understanding},
  author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
  booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}