# FrontierScience
## Description
FrontierScience is an environment for evaluating expert-level scientific reasoning. It contains 160 expert-level problems in physics, chemistry, and biology, designed to probe frontier model capabilities through two distinct evaluation tracks: Olympiad (short-answer format) and Research (open-ended, PhD-level problems).
## Capabilities
- Expert-level scientific reasoning across physics, chemistry, and biology
- Short-answer problem solving (Olympiad track)
- Open-ended research subtask completion (Research track)
- Multi-criterion rubric-based evaluation
## Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
## License
## Tasks
There are three splits in this environment:
- `test`: 160 tasks (all problems)
- `olympic`: ~92 tasks (short-answer Olympiad-style problems)
- `research`: ~68 tasks (open-ended PhD-level research problems)
Problems span Physics (70), Chemistry (60), and Biology (30).
## Reward Structure
This is a single-turn environment with two grading methodologies:
Olympiad Track: An LLM grader (gpt-5.2 with high reasoning effort) checks whether the submitted answer matches the reference answer, allowing for algebraic equivalence, numeric tolerance, chemically equivalent forms, and unit conversions. Reward is binary: 1.0 if correct, 0.0 if incorrect.
Research Track: An LLM grader (gpt-5.2 with high reasoning effort) evaluates the submission against a multi-criterion rubric parsed from the answer field. Each criterion is graded independently, the scores are aggregated, and the reward is normalized as total earned / total possible. The success threshold is 7+ points out of 10 (a reward of 0.7); a sketch of this arithmetic follows.
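As a concrete illustration, here is a minimal sketch of the Research-track reward arithmetic, assuming each rubric criterion carries a point value and an LLM-assigned score. The class and function names are illustrative, not the actual grader implementation:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    max_points: float  # points this criterion is worth
    earned: float      # points awarded by the LLM grader

def research_reward(criteria: list[Criterion]) -> float:
    """Normalize earned rubric points to a reward in [0.0, 1.0]."""
    total_possible = sum(c.max_points for c in criteria)
    total_earned = sum(c.earned for c in criteria)
    return total_earned / total_possible if total_possible else 0.0

# Example: 7 of 10 rubric points earned -> reward 0.7, the success threshold.
rubric = [
    Criterion("States the governing equation", 3, 3),
    Criterion("Carries out the derivation correctly", 4, 3),
    Criterion("Gives a defensible numerical estimate", 3, 1),
]
assert abs(research_reward(rubric) - 0.7) < 1e-9
```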
## Data
The data consists of a Parquet file (`frontierscience.parquet`) sourced from the HuggingFace dataset openai/frontierscience. Each row contains a problem, an answer (short answer or rubric), a subject, and a task group ID. The data is stored on the OpenReward platform.
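For local inspection, one way to load the file, assuming the Parquet sits at the top level of the dataset repo (the exact layout may differ):

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Download the Parquet file from the dataset repo, then load it with pandas.
# Adjust the filename or path if the repo layout differs.
path = hf_hub_download(
    repo_id="openai/frontierscience",
    filename="frontierscience.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)
print(len(df), df.columns.tolist())  # expect 160 rows: problem, answer, subject, task group ID
```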
## Tools
| Tool | Description |
|---|---|
| `submit_answer` | Submit your final answer for grading. Ends the episode. |
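For example, a hypothetical tool-call payload in function-calling style; the argument name `answer` is an assumption, since the actual schema depends on the harness:

```python
# Hypothetical payload; field names are illustrative, not the harness's schema.
tool_call = {
    "name": "submit_answer",
    "arguments": {"answer": "The activation energy is approximately 52 kJ/mol."},
}
```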
## Time Horizon
Single-turn. The agent reads the scientific problem and submits one answer.
## Environment Difficulty
FrontierScience problems are written to challenge frontier AI systems at expert-level scientific reasoning. Reported pass rates:
| Track | Model | Pass Rate |
|---|---|---|
| Olympiad | GPT-5.2 | 77% |
| Olympiad | Gemini 3 Pro | 76% |
| Research | GPT-5.2 | 25% |
| Research | GPT-5 | 25% |
## Other Environment Requirements
An OpenAI API key is required for LLM-based grading. Pass it via `secrets={"openai_api_key": "..."}`.
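For example, the key can be read from the process environment rather than hard-coded (assuming it is exported as `OPENAI_API_KEY`):

```python
import os

# Build the secrets mapping from an environment variable instead of a literal key.
secrets = {"openai_api_key": os.environ["OPENAI_API_KEY"]}
```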
## Safety
Agents in FrontierScience solve expert-level scientific problems in a standard environment. The environment does not present direct safety risks.
## Citation
```bibtex
@article{frontierscience2025,
  title={FrontierScience: Measuring Expert-Level Scientific Reasoning in AI},
  author={OpenAI},
  journal={arXiv preprint arXiv:2601.21165},
  year={2025}
}
```