ScholarSearch
Description
ScholarSearch is an environment for evaluating academic question answering. Based on the ScholarSearch benchmark from Peking University, it presents agents with academic research questions in Chinese spanning 12 disciplines; agents must provide concise, accurate answers, and an LLM grader evaluates correctness.
Capabilities
- Academic research question answering across multiple disciplines
- Cross-disciplinary knowledge spanning 12 fields including Computer Science, Biology, Economics, Physics, and more
- Chinese-language academic comprehension
- Domain-specific knowledge evaluation
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
Split: test (223 tasks)
Tasks span 12 academic disciplines covering diverse fields of study. Questions are curated by undergraduate and graduate students and faculty at Peking University.
Reward Structure
ScholarSearch follows a single-turn evaluation paradigm:
- Agent receives an academic research question
- Agent submits an answer via the `answer` tool
- An LLM grader (gpt-4.1) evaluates the answer against the reference answer
- Binary reward: 1.0 if correct, 0.0 if incorrect
Data
File: ScholarSearch.json (223 entries)
Data sourced from HuggingFace PKU-DS-LAB/ScholarSearch. Task data is stored on the OpenReward platform.
Tools
Tool: answer
Submit a text answer for LLM-based grading.
Parameters:
text (string): The answer to the academic question
Returns:
reward (float): 1.0 if correct, 0.0 if incorrect
finished (bool): True (single-turn environment)
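The tool interface can be illustrated with a JSON-schema-style parameter check. The schema dict below is inferred from the parameter listing above, not the environment's published schema, and `validate_call` is a hypothetical helper for the sketch.

```python
# Illustrative schema for the `answer` tool, inferred from the parameter
# listing above; the exact published schema is an assumption.
ANSWER_TOOL_SCHEMA = {
    "name": "answer",
    "description": "Submit a text answer for LLM-based grading.",
    "parameters": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "The answer to the academic question",
            }
        },
        "required": ["text"],
    },
}


def validate_call(arguments: dict) -> None:
    """Reject malformed tool calls before they reach the grader."""
    for field in ANSWER_TOOL_SCHEMA["parameters"]["required"]:
        if field not in arguments:
            raise ValueError(f"missing required parameter: {field}")
    if not isinstance(arguments["text"], str):
        raise TypeError("'text' must be a string")


validate_call({"text": "光合作用发生在叶绿体中"})  # a well-formed call passes
```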
Time Horizon
Single-turn. Each task is evaluated in a single interaction.
Environment Difficulty
| Model | Accuracy |
|---|---|
| gpt-4o-search-preview | 18.83% |
| gpt-4o-mini-search-preview | 10.31% |
| deepseek-r1-0528 | 8.52% |
| gpt-4.1 | 7.17% |
| gpt-4o-2024-11-20 | 3.59% |
| gpt-4o-mini | 2.24% |
Other Environment Requirements
An OpenAI API key is required for LLM-based grading. Pass it via `secrets={"openai_api_key": "..."}` when creating an environment session.
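A minimal sketch of wiring up the secret, assuming the key is read from the `OPENAI_API_KEY` environment variable; the commented-out `create_session` call is hypothetical, so consult the OpenReward platform documentation for the actual session API.

```python
import os

# Build the secrets mapping expected by the environment. Reading from
# OPENAI_API_KEY is an assumption for this sketch; supply the key however
# your deployment manages credentials.
secrets = {"openai_api_key": os.environ.get("OPENAI_API_KEY", "")}

# session = create_session("ScholarSearch", secrets=secrets)  # hypothetical API
print(sorted(secrets))  # ['openai_api_key']
```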
Safety
Agents in ScholarSearch answer academic questions in a standard environment. The environment does not present direct safety risks.
Citation
@misc{zhou2025scholarsearchbenchmarkingscholarsearching,
title={ScholarSearch: Benchmarking Scholar Searching Ability of LLMs},
author={Junting Zhou and Wang Li and Yiyan Liao and Nengyuan Zhang and Tingjia Miao and Zhihui Qi and Yuhan Wu and Tong Yang},
year={2025},
eprint={2506.13784},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2506.13784}
}