ScholarSearch
Description
ScholarSearch is an environment for evaluating academic question answering. Based on the ScholarSearch benchmark from Peking University, it presents agents with academic research questions in Chinese spanning 12 disciplines; agents must provide concise, accurate answers, and an LLM grader evaluates correctness.
Capabilities
- Academic research question answering across multiple disciplines
- Cross-disciplinary knowledge spanning 12 fields including Computer Science, Biology, Economics, Physics, and more
- Chinese-language academic comprehension
- Domain-specific knowledge evaluation
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
Split: test (223 tasks)
Tasks span 12 academic disciplines covering diverse fields of study. Questions are curated by undergraduate and graduate students and faculty at Peking University.
Reward Structure
ScholarSearch follows a single-turn evaluation paradigm:
- Agent receives an academic research question
- Agent submits an answer via the `answer` tool
- An LLM grader (gpt-4.1) evaluates the answer against the reference answer
- Binary reward: 1.0 if correct, 0.0 if incorrect
Data
File: ScholarSearch.json (223 entries)
Data sourced from HuggingFace PKU-DS-LAB/ScholarSearch. Task data is stored on the OpenReward platform.
Tools
Tool: answer
Submit a text answer for LLM-based grading.
Parameters:
text (string): The answer to the academic question
Returns:
reward (float): 1.0 if correct, 0.0 if incorrect
finished (bool): True (single-turn environment)
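The tool interface can be illustrated with a JSON-schema-style parameter check. The schema dict below is inferred from the parameter listing above, not the environment's published schema, and `validate_call` is a hypothetical helper for the sketch.

```python
# Illustrative schema for the `answer` tool, inferred from the parameter
# listing above; the exact published schema is an assumption.
ANSWER_TOOL_SCHEMA = {
    "name": "answer",
    "description": "Submit a text answer for LLM-based grading.",
    "parameters": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "The answer to the academic question",
            }
        },
        "required": ["text"],
    },
}


def validate_call(arguments: dict) -> None:
    """Reject malformed tool calls before they reach the grader."""
    for field in ANSWER_TOOL_SCHEMA["parameters"]["required"]:
        if field not in arguments:
            raise ValueError(f"missing required parameter: {field}")
    if not isinstance(arguments["text"], str):
        raise TypeError("'text' must be a string")


validate_call({"text": "光合作用发生在叶绿体中"})  # a well-formed call passes
```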
Time Horizon
Single-turn. Each task is evaluated in a single interaction.
Environment Difficulty
| Model | Accuracy |
|---|---|
| gpt-4o-search-preview | 18.83% |
| gpt-4o-mini-search-preview | 10.31% |
| deepseek-r1-0528 | 8.52% |
| gpt-4.1 | 7.17% |
| gpt-4o-2024-11-20 | 3.59% |
| gpt-4o-mini | 2.24% |
Other Environment Requirements
An OpenAI API key is required for LLM-based grading. Pass it via `secrets={"openai_api_key": "..."}` when creating an environment session.
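A minimal sketch of wiring up the secret, assuming the key is read from the `OPENAI_API_KEY` environment variable; the commented-out `create_session` call is hypothetical, so consult the OpenReward platform documentation for the actual session API.

```python
import os

# Build the secrets mapping expected by the environment. Reading from
# OPENAI_API_KEY is an assumption for this sketch; supply the key however
# your deployment manages credentials.
secrets = {"openai_api_key": os.environ.get("OPENAI_API_KEY", "")}

# session = create_session("ScholarSearch", secrets=secrets)  # hypothetical API
print(sorted(secrets))  # ['openai_api_key']
```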
Safety
Agents in ScholarSearch answer academic questions in a standard environment. The environment does not present direct safety risks.
Citation
@misc{zhou2025scholarsearchbenchmarkingscholarsearching,
title={ScholarSearch: Benchmarking Scholar Searching Ability of LLMs},
author={Junting Zhou and Wang Li and Yiyan Liao and Nengyuan Zhang and Tingjia Miao and Zhihui Qi and Yuhan Wu and Tong Yang},
year={2025},
eprint={2506.13784},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2506.13784}
}