HorizonMath
Description
HorizonMath is an environment for evaluating AI systems on mathematical research problems with automatic verification. Based on the HorizonMath benchmark, it contains 101 mathematical problems across 8 domains, spanning calibration tasks with known solutions to open problems at the frontier of mathematical research. Problems cover lattice models, integrals, mathematical constants, combinatorics, number theory, coding theory, discrete geometry, and continuum physics.
Capabilities
- High-precision numerical computation (100+ digit precision with mpmath)
- Mathematical reasoning and closed-form expression discovery
- Constructive mathematics (producing valid mathematical objects such as graph colorings, point configurations, Hadamard matrices)
- Optimization (improving on known mathematical bounds and constructions)
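The high-precision capability above can be illustrated with a minimal mpmath sketch: computing a known constant at the 110-digit working precision the environment uses. The choice of ζ(3) here is illustrative, not one of the benchmark problems.

```python
from mpmath import mp

# Set working precision to 110 significant digits, matching the
# precision at which the environment executes submissions.
mp.dps = 110

# Apery's constant zeta(3) as an example high-precision target.
value = mp.zeta(3)
print(mp.nstr(value, 50))
```

With `mp.dps = 110`, every subsequent mpmath operation carries roughly 110 significant digits, comfortably above the 20-digit threshold used for full credit.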
Compute Requirements
Submitted code is executed in an OpenReward sandbox (isolated container) with a 5-minute timeout and network access disabled. The sandbox is provisioned with python:3.11-slim and mpmath, numpy, scipy, networkx installed at session start.
License
Tasks
There are 101 tasks across 5 splits:
| Split | Description | Tasks |
|---|---|---|
| test | All problems | 101 |
| level_0 | Calibration (known solutions) | 10 |
| level_1 | Likely solvable | 23 |
| level_2 | Challenging | 60 |
| level_3 | Possibly unsolvable | 8 |
Problems are classified by evaluation mode:
- Ground truth computable (59 problems): Problems with known numerical answers verified at 100+ digit precision.
- Benchmark best known (33 problems): Construction/optimization problems compared against published state-of-the-art baselines.
- New construction (9 problems): Open problems where any valid construction is a success.
And by output type:
- Constant (54): Return a high-precision numerical constant (e.g., Watson integrals, irrationality measures).
- Function (5): Return a function evaluated at specific test points.
- Construction (39): Return a mathematical object (graph coloring, lattice packing, Hadamard matrix, etc.).
- Formula discovery (3): Discover a closed-form expression for a quantity.
Reward Structure
This is a sparse, verifiable reward environment. The agent calls submit_solution once per task.
For constants and functions (ground_truth_computable): The submitted code is executed with mpmath at 110-digit precision, and its output is compared digit-by-digit against the ground truth. Reward is the fraction of the 20 required digits that match (reward = matched_digits / 20), so 20 or more matching digits earns a full pass.
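The exact verifier code is not reproduced here; the following is a sketch of the digit-by-digit rule described above, assuming both values arrive as decimal strings and that edge cases (signs, exponents, rounding of the final digit) are handled separately by the real verifier.

```python
def digit_match_reward(submitted: str, truth: str, required: int = 20) -> float:
    """Fraction of the required leading digits that match.

    Sketch of the scoring rule: strip decimal points, compare digit
    strings from the left, and award reward = matched / required,
    capped at 1.0. The production verifier may differ in details.
    """
    a = submitted.replace(".", "")
    b = truth.replace(".", "")
    matched = 0
    for x, y in zip(a, b):
        if x != y:
            break
        matched += 1
    return min(matched / required, 1.0)
```

For example, comparing "3.14159" against "3.14200" matches three leading digits and yields a reward of 0.15 under this sketch.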
For constructions (benchmark_best_known): The construction is validated against mathematical constraints using problem-specific validators, then compared to the state-of-the-art baseline:
- Beats baseline: reward = 1.0
- Matches or falls below the baseline: reward = 0.0
- Invalid: reward = 0.0
For new constructions (new_construction): Binary validation only. Valid = 1.0, invalid = 0.0.
We do not use LLM graders for this task. All scoring is deterministic mathematical verification.
Data
Problem definitions and baselines are sourced from the HorizonMath GitHub repository. Data files are stored on the OpenReward platform. 44 problem-specific validators check mathematical validity of construction submissions.
Tools
Agents are given a single tool:
submit_solution: Submit a proposed_solution() Python function as a code string. The code is executed server-side with mpmath at 110-digit precision. For constants and functions, output is compared against ground truth. For constructions, the output is validated and compared against baselines.
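A minimal example of the expected submission shape for a constant-type problem. The zero-argument calling convention and the use of a built-in constant are assumptions for illustration; a real task would require original computation rather than returning a library constant.

```python
from mpmath import mp

mp.dps = 110  # the server executes submissions at 110-digit precision

def proposed_solution():
    # Example constant-type answer: Catalan's constant, which mpmath
    # provides built in. Shown only to illustrate the return shape.
    return mp.catalan
```

The returned mpmath value is then stringified and compared digit-by-digit against the ground truth, so precision set below 110 digits would silently cap the achievable reward.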
Time Horizon
HorizonMath is a single-turn environment. The agent receives a problem description and submits one solution. Each task requires exactly one tool call.
Environment Difficulty
Most state-of-the-art models score near 0% on the full benchmark, as the problems are genuinely open research questions. The calibration split (level_0) contains problems with known solutions for pipeline verification.
Other Environment Requirements
There are no further environment requirements beyond the OpenReward API key used for sandbox provisioning.
Safety
Agents in HorizonMath submit mathematical solutions as Python code strings. The code is executed in an isolated OpenReward sandbox container with a 5-minute timeout and network access disabled. The environment does not present direct safety risks beyond standard code execution concerns.
Citations
@article{wang2026horizonmath,
  title={HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification},
  author={Wang, Erik Y. and Motwani, Sumeet and Roggeveen, James V. and Hodges, Eliot and Jayalath, Dulhan and London, Charles and Ramakrishnan, Kalyan and Cipcigan, Flaviu and Torr, Philip and Abate, Alessandro},
  journal={arXiv preprint arXiv:2603.15617},
  year={2026}
}