MMLU


Description

MMLU (Measuring Massive Multitask Language Understanding) is an environment for evaluating broad academic knowledge across 57 subjects. Based on the MMLU benchmark by Hendrycks et al., it covers subjects ranging from elementary mathematics and US history to professional-level law, medicine, and computer science. Each task is a four-option multiple-choice question evaluated by exact letter match.

Capabilities

  • Broad academic knowledge across 57 subjects
  • Multiple-choice question answering (A/B/C/D)
  • STEM, humanities, social sciences, and professional domains

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT.

Tasks

There are four splits in this environment:

  • test: 14,042 tasks
  • validation: 1,531 tasks
  • dev: 285 tasks (5 per subject, intended for few-shot prompting; see the sketch below)
  • auxiliary_train: 99,842 tasks

Questions span 57 subjects including abstract algebra, anatomy, astronomy, clinical knowledge, college-level courses, high school subjects, professional law, medicine, and more.
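
The dev split is intended for building few-shot prompts. A minimal sketch of how that might be done, assuming the cais/mmlu field names (question, choices, answer) described under Data; format_question and build_few_shot_prompt are hypothetical helpers, not part of the environment:

from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_question(row):
    # Render one MMLU row as a multiple-choice prompt block.
    options = "\n".join(f"{letter}. {choice}"
                        for letter, choice in zip(LETTERS, row["choices"]))
    return f"{row['question']}\n{options}\nAnswer:"

def build_few_shot_prompt(subject, test_row, k=5):
    # Prepend up to k solved dev examples for the subject, then the test question.
    dev = load_dataset("cais/mmlu", subject, split="dev")
    shots = [format_question(row) + f" {LETTERS[row['answer']]}"
             for row in dev.select(range(min(k, len(dev))))]
    return "\n\n".join(shots + [format_question(test_row)])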

Reward Structure

This is a single-turn environment. The agent submits a single letter (A, B, C, or D) via the answer tool. After whitespace and punctuation are stripped, the submission is compared by exact match to the letter of the correct option. Reward is binary: 1.0 if the letter matches the correct answer, 0.0 otherwise. No LLM grading is used.
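
A minimal sketch of grading consistent with this description (grade is a hypothetical name, not the environment's actual implementation):

import string

STRIP_CHARS = string.whitespace + string.punctuation

def grade(submission: str, answer_index: int) -> float:
    # Binary exact-match reward: 1.0 if the submitted letter matches
    # the correct option, 0.0 otherwise. No LLM grading is involved.
    cleaned = submission.strip(STRIP_CHARS).upper()
    correct_letter = "ABCD"[answer_index]
    return 1.0 if cleaned == correct_letter else 0.0

For example, grade("(b).", 1) returns 1.0, while grade("E", 1) returns 0.0.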

Data

Data is loaded from the Hugging Face dataset cais/mmlu at module import time using the datasets library. Each row contains the question text, its subject, four answer choices, and the index of the correct choice.
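
The rows can be inspected directly with the datasets library; the field names below match the published cais/mmlu schema:

from datasets import load_dataset

# Load the combined "all" config; individual subjects are also available
# as configs, e.g. load_dataset("cais/mmlu", "abstract_algebra").
mmlu = load_dataset("cais/mmlu", "all")

row = mmlu["test"][0]
print(row["question"])  # question text
print(row["subject"])   # e.g. "abstract_algebra"
print(row["choices"])   # list of four option strings
print(row["answer"])    # integer index (0-3) of the correct choice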

Tools

Tool      Description
answer    Submit your answer as a single letter (A, B, C, or D). Ends the episode.
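
The tool's exact schema is not published in this README; a sketch of what it might look like, expressed as a Python dict in JSON-Schema style (names and structure are assumptions):

# Illustrative only: field names here are assumptions, not the
# environment's published schema.
ANSWER_TOOL = {
    "name": "answer",
    "description": "Submit your answer as a single letter (A, B, C, or D). "
                   "Ends the episode.",
    "parameters": {
        "type": "object",
        "properties": {
            "letter": {"type": "string", "enum": ["A", "B", "C", "D"]},
        },
        "required": ["letter"],
    },
}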

Time Horizon

Single-turn. The agent reads the question with options and submits one answer.

Environment Difficulty

MMLU is one of the most widely used LLM benchmarks. Frontier models have largely saturated it, with scores clustering above 88%.

Model           MMLU Score
Gemini 3 Pro    91.8%
GPT-5           91.4%
Human Expert    89.8%
DeepSeek R1     88.9%

Other Environment Requirements

There are no further environment requirements.

Safety

Agents in MMLU answer multiple-choice academic questions in a standard environment. The environment does not present direct safety risks.

Citation

@inproceedings{hendrycks2021measuring,
  title={Measuring Massive Multitask Language Understanding},
  author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
  booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}