MATH

Description

MATH is an environment for evaluating mathematical reasoning across algebra, geometry, number theory, counting & probability, and calculus. It is based on the MATH dataset of Hendrycks et al. (2021); agents solve competition-level mathematics problems, optionally with Python code execution for computational assistance. Answer verification uses the math-verify library for semantic equivalence checking.
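
For reference, a minimal equivalence check with math-verify looks roughly like this (a sketch assuming the library's documented parse/verify interface; install with pip install math-verify):

from math_verify import parse, verify

# Parse the reference solution and a candidate answer into symbolic
# form; the parser handles LaTeX such as \frac{1}{2}.
gold = parse("$\\frac{1}{2}$")
answer = parse("$0.5$")

# verify() checks semantic equivalence rather than string equality,
# so 0.5 and 1/2 match. Argument order is (gold, answer).
print(verify(gold, answer))  # True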

Capabilities

  • Competition-level mathematical problem solving
  • Multi-step mathematical reasoning
  • Python code execution for computational assistance
  • Support for LaTeX answer formatting
  • Symbolic answer verification

Compute Requirements

The Math environment provides a sandbox with Python code execution (0.5 CPU, 1GB RAM). The MathNoCode variant requires no sandbox.

License

MIT

Tasks

There are two splits in this environment:

  • train: 7,500 training problems
  • test: 5,000 test problems

Problems span 8 categories: Algebra, Intermediate Algebra, Prealgebra, Geometry, Number Theory, Counting & Probability, Arithmetic, and Precalculus. Each problem is assigned a difficulty level from 1 to 5.

Reward Structure

This is a sparse, verifiable reward environment. The agent calls answer to submit a solution:

  • 1.0: Answer is mathematically equivalent to the reference solution
  • 0.0: Answer is incorrect

Answer verification uses the math-verify library to check semantic equivalence, handling LaTeX formatting and mathematical expressions.
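
The reward computation can be sketched as a thin wrapper around math-verify; compute_reward below is a hypothetical helper for illustration, not the environment's actual implementation:

from math_verify import parse, verify

def compute_reward(submitted: str, reference: str) -> float:
    # Sparse reward: 1.0 on semantic equivalence with the
    # reference solution, 0.0 otherwise.
    try:
        gold = parse(reference)
        answer = parse(submitted)
        return 1.0 if verify(gold, answer) else 0.0
    except Exception:
        # Treat unparseable submissions as incorrect.
        return 0.0

# Equivalent forms score 1.0 despite different LaTeX formatting.
print(compute_reward("$0.5$", "$\\frac{1}{2}$"))  # 1.0
print(compute_reward("$0.3$", "$\\frac{1}{2}$"))  # 0.0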

Data

Data is sourced from the DigitalLearningGmbH/MATH-lighteval HuggingFace dataset.
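
The source data can be inspected directly with the datasets library. A sketch, assuming the dataset's default config exposes the train/test splits and the usual MATH record fields (problem, solution, level, type):

from datasets import load_dataset

# Load the source dataset from the Hugging Face Hub.
ds = load_dataset("DigitalLearningGmbH/MATH-lighteval")

print(len(ds["train"]), len(ds["test"]))  # 7500, 5000 per the splits above

# Assumed record fields: problem statement, worked solution,
# difficulty ("Level 1".."Level 5"), and category ("type").
sample = ds["train"][0]
print(sample["problem"])
print(sample["level"], sample["type"])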

Tools

Tool          Description
answer        Submit final answer for verification
execute_code  Execute Python code (Math environment only)
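
Schematically, one Math-environment episode chains the two tools: compute in the sandbox, then submit. The JSON below is illustrative only, not the environment's actual wire format:

import json

# Hypothetical tool calls for a counting problem: use the sandbox
# to compute C(10, 3), then submit the result for verification.
calls = [
    {"tool": "execute_code",
     "arguments": {"code": "from math import comb\nprint(comb(10, 3))"}},
    {"tool": "answer",
     "arguments": {"answer": "$120$"}},
]

for call in calls:
    print(json.dumps(call, indent=2))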

Time Horizon

Single-turn for MathNoCode, multi-turn for Math (with code execution).

Environment Difficulty

MATH Level 5 (hardest problems):

Model              Accuracy
GPT-5 (high)       98.1%
GPT-5 (medium)     97.9%
o4-mini (high)     97.8%
o3 (high)          97.8%
Claude Sonnet 4.5  97.7%

Other Environment Requirements

There are no further environment requirements; MATH works out of the box with the OpenReward endpoint.

Safety

Agents in MATH solve mathematical problems with optional code execution in an isolated sandbox. The environment does not present direct safety risks.

Citation

@article{hendrycks2021measuring,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob},
  journal={NeurIPS},
  year={2021}
}