MATH

Description

MATH is an environment for evaluating mathematical reasoning across algebra, geometry, number theory, counting & probability, and calculus. It is based on the MATH dataset of Hendrycks et al. (2021); agents solve competition-level mathematics problems, optionally with Python code execution for computational assistance. Answer verification uses the math-verify library for semantic equivalence checking.
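
For reference, a minimal equivalence check with math-verify looks roughly like this (a sketch assuming the library's documented parse/verify interface; install with pip install math-verify):

from math_verify import parse, verify

# Parse the reference solution and a candidate answer into symbolic
# form; the parser handles LaTeX such as \frac{1}{2}.
gold = parse("$\\frac{1}{2}$")
answer = parse("$0.5$")

# verify() checks semantic equivalence rather than string equality,
# so 0.5 and 1/2 match. Argument order is (gold, answer).
print(verify(gold, answer))  # True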

Capabilities

  • Competition-level mathematical problem solving
  • Multi-step mathematical reasoning
  • Python code execution for computational assistance
  • Support for LaTeX answer formatting
  • Symbolic answer verification

Compute Requirements

The Math environment provides a sandbox with Python code execution (0.5 CPU, 1GB RAM). The MathNoCode variant requires no sandbox.

License

MIT

Tasks

There are two splits in this environment:

  • train: 7,500 training problems
  • test: 5,000 test problems

Problems span 8 categories: Algebra, Intermediate Algebra, Prealgebra, Geometry, Number Theory, Counting & Probability, Arithmetic, and Precalculus. Each problem is assigned a difficulty level from 1 to 5.

Reward Structure

This is a sparse, verifiable reward environment. The agent calls answer to submit a solution:

  • 1.0: Answer is mathematically equivalent to the reference solution
  • 0.0: Answer is incorrect

Answer verification uses the math-verify library to check semantic equivalence, handling LaTeX formatting and mathematical expressions.
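
The reward computation can be sketched as a thin wrapper around math-verify; compute_reward below is a hypothetical helper for illustration, not the environment's actual implementation:

from math_verify import parse, verify

def compute_reward(submitted: str, reference: str) -> float:
    # Sparse reward: 1.0 on semantic equivalence with the
    # reference solution, 0.0 otherwise.
    try:
        gold = parse(reference)
        answer = parse(submitted)
        return 1.0 if verify(gold, answer) else 0.0
    except Exception:
        # Treat unparseable submissions as incorrect.
        return 0.0

# Equivalent forms score 1.0 despite different LaTeX formatting.
print(compute_reward("$0.5$", "$\\frac{1}{2}$"))  # 1.0
print(compute_reward("$0.3$", "$\\frac{1}{2}$"))  # 0.0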

Data

Data is sourced from the DigitalLearningGmbH/MATH-lighteval HuggingFace dataset.
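
The source data can be inspected directly with the datasets library. A sketch, assuming the dataset's default config exposes the train/test splits and the usual MATH record fields (problem, solution, level, type):

from datasets import load_dataset

# Load the source dataset from the Hugging Face Hub.
ds = load_dataset("DigitalLearningGmbH/MATH-lighteval")

print(len(ds["train"]), len(ds["test"]))  # 7500, 5000 per the splits above

# Assumed record fields: problem statement, worked solution,
# difficulty ("Level 1".."Level 5"), and category ("type").
sample = ds["train"][0]
print(sample["problem"])
print(sample["level"], sample["type"])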

Tools

Tool          Description
answer        Submit final answer for verification
execute_code  Execute Python code (Math environment only)
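
Schematically, one Math-environment episode chains the two tools: compute in the sandbox, then submit. The JSON below is illustrative only, not the environment's actual wire format:

import json

# Hypothetical tool calls for a counting problem: use the sandbox
# to compute C(10, 3), then submit the result for verification.
calls = [
    {"tool": "execute_code",
     "arguments": {"code": "from math import comb\nprint(comb(10, 3))"}},
    {"tool": "answer",
     "arguments": {"answer": "$120$"}},
]

for call in calls:
    print(json.dumps(call, indent=2))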

Time Horizon

Single-turn for MathNoCode, multi-turn for Math (with code execution).

Environment Difficulty

MATH Level 5 (hardest problems):

Model              Accuracy
GPT-5 (high)       98.1%
GPT-5 (medium)     97.9%
o4-mini (high)     97.8%
o3 (high)          97.8%
Claude Sonnet 4.5  97.7%

Other Environment Requirements

There are no further environment requirements; MATH works out of the box with the OpenReward endpoint.

Safety

Agents in MATH solve mathematical problems with optional code execution in an isolated sandbox. The environment does not present direct safety risks.

Citation

@article{hendrycks2021measuring,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob},
  journal={NeurIPS},
  year={2021}
}