SimpleQAVerified

OpenReward Environment · Hugging Face Dataset

Description

SimpleQAVerified is an environment for evaluating short-form factuality and parametric knowledge in LLMs. It is based on the SimpleQA Verified benchmark from Google DeepMind and Google Research: agents answer factual questions across 10 topic categories, and an LLM grader evaluates correctness against gold-standard answers backed by supporting URLs.

Capabilities

  • Short-form factual question answering
  • Parametric knowledge evaluation (no tool use)
  • 10 topic categories: Politics, Art, Sports, Music, History, Geography, Science and Technology, TV Shows, Video Games, Other
  • LLM-graded correctness with three-way classification

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT

Tasks

Split: eval (1,000 tasks)

Each task presents a factual question with a single gold standard answer. Questions span 10 topic categories and 5 answer types (Number, Person, Place, Date, Other). Some questions require multi-step reasoning or complex inference.

Reward Structure

This is a sparse reward environment with LLM-based grading. The agent calls the answer tool once with its response, and the environment grades it:

  • CORRECT: The answer fully contains the important information from the gold target with no contradictions. Reward: 1.0
  • INCORRECT: The answer contains a factual statement that contradicts the gold target. Reward: 0.0
  • NOT_ATTEMPTED: Important information from the gold target is missing, but nothing contradicts it. Reward: 0.0
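The three-way classification above maps to a sparse reward as follows; a minimal sketch where the grade labels come from the environment but the function itself is illustrative, not the environment's actual API:

```python
# Sketch of the grading-to-reward mapping described above.
# The grade labels are the environment's; the function is illustrative.
GRADE_REWARDS = {
    "CORRECT": 1.0,        # fully contains the gold target, no contradictions
    "INCORRECT": 0.0,      # contradicts the gold target
    "NOT_ATTEMPTED": 0.0,  # key information missing, but no contradiction
}

def reward_for_grade(grade: str) -> float:
    """Map the LLM grader's three-way classification to a sparse reward."""
    if grade not in GRADE_REWARDS:
        raise ValueError(f"unknown grade: {grade!r}")
    return GRADE_REWARDS[grade]
```

Note that NOT_ATTEMPTED earns the same reward as INCORRECT, so hedging or declining to answer is not rewarded.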

Data

Data is sourced from the google/simpleqa-verified Hugging Face dataset. Each entry includes a question, gold answer, topic category, answer type, and supporting URLs.
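The per-entry structure described above can be sketched as a small record type; the field names here are illustrative and may differ from the dataset's actual column names:

```python
from dataclasses import dataclass, field

# Illustrative record for one entry; actual column names in
# google/simpleqa-verified may differ.
@dataclass
class SimpleQAEntry:
    question: str
    gold_answer: str
    topic: str          # one of the 10 topic categories
    answer_type: str    # Number, Person, Place, Date, or Other
    urls: list[str] = field(default_factory=list)  # supporting URLs

# With the `datasets` library installed, rows could be loaded via
# datasets.load_dataset("google/simpleqa-verified") and mapped into this type.
entry = SimpleQAEntry(
    question="What is the capital of France?",
    gold_answer="Paris",
    topic="Geography",
    answer_type="Place",
    urls=["https://en.wikipedia.org/wiki/Paris"],
)
```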

Tools

Tool      Description
answer    Submit answer for LLM-based grading
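A single grading turn might look like the following tool-call payload; the JSON shape shown is an assumption for illustration, not the environment's documented wire format:

```python
import json

# Hypothetical payload for the single `answer` tool call;
# the argument name "answer" is an assumption.
tool_call = {
    "name": "answer",
    "arguments": {"answer": "Paris"},
}

payload = json.dumps(tool_call)
```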

Time Horizon

Single-turn. The agent receives a question and submits one answer.

Environment Difficulty

Model             Accuracy
Gemini 3 Pro      72.1%
Gemini 2.5 Pro    55.6%
Kimi K2.5         36.9%

Other Environment Requirements

OpenAI API key required for LLM-based grading. Pass via secrets={"openai_api_key": "..."}.
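One way to assemble that secrets mapping, assuming the key is exported as the OPENAI_API_KEY environment variable; the helper function itself is illustrative, not part of OpenReward:

```python
import os

# Build the secrets mapping described above from an environment
# variable; this helper is illustrative, not an OpenReward API.
def build_secrets() -> dict[str, str]:
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {"openai_api_key": key}
```

Reading the key from the environment avoids hard-coding it in source files or notebooks.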

Safety

Agents in SimpleQAVerified answer factual questions in a standard environment. The environment does not present direct safety risks.

Citation

@misc{haas2025simpleqaverifiedreliablefactuality,
  title={SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge},
  author={Lukas Haas and Gal Yona and Giovanni D'Antonio and Sasha Goldshtein and Dipanjan Das},
  year={2025},
  eprint={2509.07968},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.07968}
}