SimpleQAVerified
Description
SimpleQAVerified is an environment for evaluating short-form factuality and parametric knowledge in LLMs. It is based on the SimpleQA Verified benchmark from Google DeepMind and Google Research: agents answer factual questions across 10 topic categories, and an LLM grader evaluates correctness against gold-standard answers backed by supporting URLs.
Capabilities
- Short-form factual question answering
- Parametric knowledge evaluation (no tool use)
- 10 topic categories: Politics, Art, Sports, Music, History, Geography, Science and Technology, TV Shows, Video Games, Other
- LLM-graded correctness with three-way classification
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
Split: eval (1,000 tasks)
Each task presents a factual question with a single gold standard answer. Questions span 10 topic categories and 5 answer types (Number, Person, Place, Date, Other). Some questions require multi-step reasoning or complex inference.
Reward Structure
This is a sparse-reward environment with LLM-based grading. The agent calls the `answer` tool once with its response, and the environment grades the response as one of three outcomes:
- CORRECT: The answer fully contains the important information from the gold target with no contradictions. Reward: 1.0
- INCORRECT: The answer contains a factual statement that contradicts the gold target. Reward: 0.0
- NOT_ATTEMPTED: Important information from the gold target is missing, but nothing contradicts it. Reward: 0.0
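The three-way grading above collapses to a binary reward. A minimal sketch of that mapping, assuming a `reward_for` helper and label strings as listed (the function name and structure are illustrative, not the environment's actual implementation):

```python
# Grade labels come from the reward structure above; only CORRECT pays out.
REWARDS = {
    "CORRECT": 1.0,        # fully contains the gold target, no contradictions
    "INCORRECT": 0.0,      # contradicts the gold target
    "NOT_ATTEMPTED": 0.0,  # missing key information, but no contradiction
}

def reward_for(grade: str) -> float:
    """Map an LLM grader label to the environment's sparse reward."""
    if grade not in REWARDS:
        raise ValueError(f"unexpected grade label: {grade!r}")
    return REWARDS[grade]
```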
Data
Data is sourced from the google/simpleqa-verified HuggingFace dataset. Each entry includes a question, gold answer, topic category, answer type, and supporting URLs.
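A hypothetical record type mirroring the fields listed above; the class and field names are illustrative assumptions, and the actual dataset column names may differ:

```python
from dataclasses import dataclass, field

@dataclass
class SimpleQATask:
    question: str
    gold_answer: str
    topic: str                                     # one of the 10 topic categories
    answer_type: str                               # Number, Person, Place, Date, or Other
    urls: list[str] = field(default_factory=list)  # supporting source URLs

# Example entry (the question/answer pair is illustrative, not from the dataset).
task = SimpleQATask(
    question="In what year was the Eiffel Tower completed?",
    gold_answer="1889",
    topic="History",
    answer_type="Date",
    urls=["https://en.wikipedia.org/wiki/Eiffel_Tower"],
)
```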
Tools
| Tool | Description |
|---|---|
| answer | Submit answer for LLM-based grading |
Time Horizon
Single-turn. The agent receives a question and submits one answer.
Environment Difficulty
| Model | Accuracy |
|---|---|
| Gemini 3 Pro | 72.1% |
| Gemini 2.5 Pro | 55.6% |
| Kimi K2.5 | 36.9% |
Other Environment Requirements
OpenAI API key required for LLM-based grading. Pass via `secrets={"openai_api_key": "..."}`.
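One way to supply the key is to read it from the process environment and build the secrets mapping shown above. A sketch, assuming a hypothetical `build_secrets` helper; the loader call that consumes the mapping depends on your harness and is omitted:

```python
import os

def build_secrets() -> dict[str, str]:
    """Assemble the secrets dict the environment expects for grading."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; grading requires it")
    return {"openai_api_key": key}
```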
Safety
Agents in SimpleQAVerified answer factual questions in a standard environment. The environment does not present direct safety risks.
Citation
@misc{haas2025simpleqaverifiedreliablefactuality,
title={SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge},
author={Lukas Haas and Gal Yona and Giovanni D'Antonio and Sasha Goldshtein and Dipanjan Das},
year={2025},
eprint={2509.07968},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.07968}
}