SimpleQAVerified
Description
SimpleQAVerified is an environment for evaluating short-form factuality and parametric knowledge in LLMs. It is based on the SimpleQA Verified benchmark from Google DeepMind and Google Research: agents answer factual questions across 10 topic categories, and an LLM grader evaluates correctness against gold-standard answers backed by supporting URLs.
Capabilities
- Short-form factual question answering
- Parametric knowledge evaluation (no tool use)
- 10 topic categories: Politics, Art, Sports, Music, History, Geography, Science and Technology, TV Shows, Video Games, Other
- LLM-graded correctness with three-way classification
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
Split: eval (1,000 tasks)
Each task presents a factual question with a single gold standard answer. Questions span 10 topic categories and 5 answer types (Number, Person, Place, Date, Other). Some questions require multi-step reasoning or complex inference.
Reward Structure
This is a sparse-reward environment with LLM-based grading. The agent calls the `answer` tool once with its response, and the environment grades the response as one of three outcomes:
- CORRECT: The answer fully contains the important information from the gold target with no contradictions. Reward: 1.0
- INCORRECT: The answer contains a factual statement that contradicts the gold target. Reward: 0.0
- NOT_ATTEMPTED: Important information from the gold target is missing, but nothing contradicts it. Reward: 0.0
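The three-way grading above collapses to a binary reward. A minimal sketch of that mapping, assuming a `reward_for` helper and label strings as listed (the function name and structure are illustrative, not the environment's actual implementation):

```python
# Grade labels come from the reward structure above; only CORRECT pays out.
REWARDS = {
    "CORRECT": 1.0,        # fully contains the gold target, no contradictions
    "INCORRECT": 0.0,      # contradicts the gold target
    "NOT_ATTEMPTED": 0.0,  # missing key information, but no contradiction
}

def reward_for(grade: str) -> float:
    """Map an LLM grader label to the environment's sparse reward."""
    if grade not in REWARDS:
        raise ValueError(f"unexpected grade label: {grade!r}")
    return REWARDS[grade]
```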
Data
Data is sourced from the google/simpleqa-verified HuggingFace dataset. Each entry includes a question, gold answer, topic category, answer type, and supporting URLs.
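A hypothetical record type mirroring the fields listed above; the class and field names are illustrative assumptions, and the actual dataset column names may differ:

```python
from dataclasses import dataclass, field

@dataclass
class SimpleQATask:
    question: str
    gold_answer: str
    topic: str                                     # one of the 10 topic categories
    answer_type: str                               # Number, Person, Place, Date, or Other
    urls: list[str] = field(default_factory=list)  # supporting source URLs

# Example entry (the question/answer pair is illustrative, not from the dataset).
task = SimpleQATask(
    question="In what year was the Eiffel Tower completed?",
    gold_answer="1889",
    topic="History",
    answer_type="Date",
    urls=["https://en.wikipedia.org/wiki/Eiffel_Tower"],
)
```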
Tools
| Tool | Description |
|---|---|
| answer | Submit answer for LLM-based grading |
Time Horizon
Single-turn. The agent receives a question and submits one answer.
Environment Difficulty
| Model | Accuracy |
|---|---|
| Gemini 3 Pro | 72.1% |
| Gemini 2.5 Pro | 55.6% |
| Kimi K2.5 | 36.9% |
Other Environment Requirements
OpenAI API key required for LLM-based grading. Pass via `secrets={"openai_api_key": "..."}`.
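One way to supply the key is to read it from the process environment and build the secrets mapping shown above. A sketch, assuming a hypothetical `build_secrets` helper; the loader call that consumes the mapping depends on your harness and is omitted:

```python
import os

def build_secrets() -> dict[str, str]:
    """Assemble the secrets dict the environment expects for grading."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; grading requires it")
    return {"openai_api_key": key}
```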
Safety
Agents in SimpleQAVerified answer factual questions in a standard environment. The environment does not present direct safety risks.
Citation
@misc{haas2025simpleqaverifiedreliablefactuality,
title={SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge},
author={Lukas Haas and Gal Yona and Giovanni D'Antonio and Sasha Goldshtein and Dipanjan Das},
year={2025},
eprint={2509.07968},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.07968}
}