HorizonMath
Description
HorizonMath is an environment for evaluating AI systems on mathematical research problems with automatic verification. Based on the HorizonMath benchmark, it contains 101 mathematical problems across 8 domains, spanning calibration tasks with known solutions to open problems at the frontier of mathematical research. Problems cover lattice models, integrals, mathematical constants, combinatorics, number theory, coding theory, discrete geometry, and continuum physics.
Capabilities
- High-precision numerical computation (100+ digit precision with mpmath)
- Mathematical reasoning and closed-form expression discovery
- Constructive mathematics (producing valid mathematical objects such as graph colorings, point configurations, Hadamard matrices)
- Optimization (improving on known mathematical bounds and constructions)
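The high-precision capability above can be illustrated with a minimal mpmath sketch: computing a known constant at the 110-digit working precision the environment uses. The choice of ζ(3) here is illustrative, not one of the benchmark problems.

```python
from mpmath import mp

# Set working precision to 110 significant digits, matching the
# precision at which the environment executes submissions.
mp.dps = 110

# Apery's constant zeta(3) as an example high-precision target.
value = mp.zeta(3)
print(mp.nstr(value, 50))
```

With `mp.dps = 110`, every subsequent mpmath operation carries roughly 110 significant digits, comfortably above the 20-digit threshold used for full credit.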
Compute Requirements
Submitted code is executed in an OpenReward sandbox (isolated container) with a 5-minute timeout and network access disabled. The sandbox is provisioned with python:3.11-slim and mpmath, numpy, scipy, networkx installed at session start.
License
Tasks
There are 101 tasks across 5 splits:
| Split | Description | Tasks |
|---|---|---|
| test | All problems | 101 |
| level_0 | Calibration (known solutions) | 10 |
| level_1 | Likely solvable | 23 |
| level_2 | Challenging | 60 |
| level_3 | Possibly unsolvable | 8 |
Problems are classified by evaluation mode:
- Ground truth computable (59 problems): Problems with known numerical answers verified at 100+ digit precision.
- Benchmark best known (33 problems): Construction/optimization problems compared against published state-of-the-art baselines.
- New construction (9 problems): Open problems where any valid construction is a success.
And by output type:
- Constant (54): Return a high-precision numerical constant (e.g., Watson integrals, irrationality measures).
- Function (5): Return a function evaluated at specific test points.
- Construction (39): Return a mathematical object (graph coloring, lattice packing, Hadamard matrix, etc.).
- Formula discovery (3): Discover a closed-form expression for a quantity.
Reward Structure
This is a sparse, verifiable reward environment. The agent calls submit_solution once per task.
For constants and functions (ground_truth_computable): The submitted code is executed with mpmath at 110-digit precision, and its output is compared digit-by-digit against the ground truth. Reward is the fraction of the 20 required digits that match (reward = matched_digits / 20), so 20 or more matching digits earns a full pass.
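The exact verifier code is not reproduced here; the following is a sketch of the digit-by-digit rule described above, assuming both values arrive as decimal strings and that edge cases (signs, exponents, rounding of the final digit) are handled separately by the real verifier.

```python
def digit_match_reward(submitted: str, truth: str, required: int = 20) -> float:
    """Fraction of the required leading digits that match.

    Sketch of the scoring rule: strip decimal points, compare digit
    strings from the left, and award reward = matched / required,
    capped at 1.0. The production verifier may differ in details.
    """
    a = submitted.replace(".", "")
    b = truth.replace(".", "")
    matched = 0
    for x, y in zip(a, b):
        if x != y:
            break
        matched += 1
    return min(matched / required, 1.0)
```

For example, comparing "3.14159" against "3.14200" matches three leading digits and yields a reward of 0.15 under this sketch.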
For constructions (benchmark_best_known): The construction is validated against mathematical constraints using problem-specific validators, then compared to the state-of-the-art baseline:
- Beats baseline: reward = 1.0
- Matches or falls below the baseline: reward = 0.0
- Invalid: reward = 0.0
For new constructions (new_construction): Binary validation only. Valid = 1.0, invalid = 0.0.
We do not use LLM graders for this task. All scoring is deterministic mathematical verification.
Data
Problem definitions and baselines are sourced from the HorizonMath GitHub repository. Data files are stored on the OpenReward platform. 44 problem-specific validators check mathematical validity of construction submissions.
Tools
Agents are given a single tool:
submit_solution: Submit a proposed_solution() Python function as a code string. The code is executed server-side with mpmath at 110-digit precision. For constants and functions, output is compared against ground truth. For constructions, the output is validated and compared against baselines.
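A minimal example of the expected submission shape for a constant-type problem. The zero-argument calling convention and the use of a built-in constant are assumptions for illustration; a real task would require original computation rather than returning a library constant.

```python
from mpmath import mp

mp.dps = 110  # the server executes submissions at 110-digit precision

def proposed_solution():
    # Example constant-type answer: Catalan's constant, which mpmath
    # provides built in. Shown only to illustrate the return shape.
    return mp.catalan
```

The returned mpmath value is then stringified and compared digit-by-digit against the ground truth, so precision set below 110 digits would silently cap the achievable reward.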
Time Horizon
HorizonMath is a single-turn environment. The agent receives a problem description and submits one solution. Each task requires exactly one tool call.
Environment Difficulty
Most state-of-the-art models score near 0% on the full benchmark, as the problems are genuinely open research questions. The calibration split (level_0) contains problems with known solutions for pipeline verification.
Other Environment Requirements
There are no further environment requirements beyond the OpenReward API key used for sandbox provisioning.
Safety
Agents in HorizonMath submit mathematical solutions as Python code strings. The code is executed in an isolated OpenReward sandbox container with a 5-minute timeout and network access disabled. The environment does not present direct safety risks beyond standard code execution concerns.
Citations
@article{wang2026horizonmath,
  title={HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification},
  author={Wang, Erik Y. and Motwani, Sumeet and Roggeveen, James V. and Hodges, Eliot and Jayalath, Dulhan and London, Charles and Ramakrishnan, Kalyan and Cipcigan, Flaviu and Torr, Philip and Abate, Alessandro},
  journal={arXiv preprint arXiv:2603.15617},
  year={2026}
}