
Terminal-Bench-2-Verified


Description

Terminal-Bench-2-Verified is an environment by Z.ai for evaluating agents on challenging terminal-based software engineering tasks. It is based on Terminal-Bench 2.0 by Merrill et al.; agents work in containerized environments to complete realistic programming challenges spanning compilation, configuration, data processing, cryptography, and more.

Capabilities

  • Command-line software engineering
  • Building and compiling complex projects
  • System configuration and debugging
  • Data processing and analysis
  • Working with diverse programming languages and tools

Compute Requirements

Agents are given a sandboxed Docker environment. Default sandbox size is 1 CPU and 2 GB RAM.
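The default limits correspond to a container launched with explicit CPU and memory caps. A minimal sketch of building such an invocation, assuming a plain `docker run` sandbox (the image name is hypothetical, not the environment's actual image):

```python
def sandbox_run_command(image, cpus=1, memory_gb=2):
    """Build a `docker run` invocation with the default sandbox limits."""
    return [
        "docker", "run", "--rm",
        "--cpus", str(cpus),          # default: 1 CPU
        "--memory", f"{memory_gb}g",  # default: 2 GB RAM
        image,
    ]

cmd = sandbox_run_command("terminal-bench/task-image:latest")
# ["docker", "run", "--rm", "--cpus", "1", "--memory", "2g",
#  "terminal-bench/task-image:latest"]
```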

License

Apache 2.0.

Tasks

There is one split in this environment:

  • Test: 89 terminal-based software engineering tasks

Tasks span diverse domains including:

  • Compilation: Building Caffe, POV-Ray, CompCert, Cython extensions
  • Cryptography: Hash cracking, cryptanalysis (FEAL), password recovery
  • Data Processing: Token counting, data merging, log summarization
  • Machine Learning: Model inference, training FastText, PyTorch parallelism
  • System Administration: Git operations, QEMU setup, Nginx configuration
  • Scientific Computing: Eigenvalue computation, MCMC sampling, Raman fitting

Each task provides a detailed instruction file describing the problem. The agent must use terminal commands to implement a solution that passes the task's test suite.

Reward Structure

This is a multi-turn environment with binary reward:

  • 1.0 — All tests pass
  • 0.0 — Tests fail

Verification runs the task's test script which checks for correct output files, expected behavior, and proper implementation according to task specifications.
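The binary reward can be modeled as a function of the test script's exit status. A minimal sketch, assuming the harness is a bash script whose exit code signals pass/fail (the script path and runner are assumptions, not the environment's actual harness):

```python
import subprocess

def reward_from_exit_code(code: int) -> float:
    """All tests pass (exit 0) -> 1.0; any failure -> 0.0."""
    return 1.0 if code == 0 else 0.0

def verify(test_script: str = "/tests/run_tests.sh", timeout: int = 600) -> float:
    """Run the task's test script and map its exit status to the binary reward."""
    result = subprocess.run(["bash", test_script], timeout=timeout)
    return reward_from_exit_code(result.returncode)
```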

Data

Data consists of 89 task directories, each containing an instruction file, a pre-configured Docker image, and a test harness. Tasks are derived from Terminal-Bench 2.0 with environment fixes for reproducibility.

Tools

Tool            Description
bash            Run bash commands in the sandbox container.
str_replace     Replace a unique string in a file with another string.
view            View file contents or directory listings.
create_file     Create a new file with specified content.
submit_answer   Submit work for verification. Runs the test harness and returns reward.
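The str_replace tool requires the target string to be unique in the file. A hedged reconstruction of that described behavior (a sketch, not the environment's actual implementation):

```python
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> None:
    """Replace `old` with `new` in `path`; fail unless `old` occurs exactly once."""
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        raise ValueError(f"{old!r} not found in {path}")
    if count > 1:
        raise ValueError(f"{old!r} occurs {count} times; must be unique")
    Path(path).write_text(text.replace(old, new))
```

Requiring uniqueness makes edits unambiguous: the agent must quote enough surrounding context to pin down a single location.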

Time Horizon

Terminal-Bench-2-Verified is a multi-turn environment. Agents read task instructions, explore the environment, implement solutions using terminal commands, and submit for verification.
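The multi-turn interaction above can be sketched as a simple episode driver; the tool names match the table in the Tools section, but the policy interface, transcript format, and turn limit are assumptions:

```python
def run_episode(policy, tools, max_turns=50):
    """Drive a multi-turn episode: each turn the policy picks a tool call,
    until it calls submit_answer, which returns the binary reward."""
    transcript = []
    for _ in range(max_turns):
        name, args = policy(transcript)         # policy sees the history so far
        observation = tools[name](**args)       # execute the chosen tool
        transcript.append((name, args, observation))
        if name == "submit_answer":
            return observation                  # reward from the test harness
    return 0.0  # ran out of turns without submitting
```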

Environment Difficulty

Evaluation results on the verified version with environment and instruction fixes:

Model               Accuracy
Claude Opus 4.5     61.80%
GLM-5               61.12%
Claude Sonnet 4.5   50.34%

Other Environment Requirements

There are no external API key requirements; Terminal-Bench-2-Verified works out of the box with the OpenReward endpoint.

Safety

Agents in Terminal-Bench-2-Verified operate within isolated Docker containers. The environment does not involve production systems or external network access beyond the sandbox.

Citation

@article{merrill2025terminalbench,
  author    = {Mike A. Merrill and Alexander G. Shaw and Nicholas Carlini and Boxuan Li and Harsh Raj and Ivan Bercovich and Lin Shi and Jeong Yeon Shin and Thomas Walshe and E. Kelly Buchanan and Junhong Shen and Guanghao Ye and Haowei Lin and Jason Poulos and Maoyu Wang and Marianna Nezhurina and Jenia Jitsev and Di Lu and Orfeas Menis Mastromichalakis and Zhiwei Xu and Zizhao Chen and Yue Liu and Robert Zhang and Leon Liangyu Chen and Anurag Kashyap and Jan-Lucas Uslu and Jeffrey Li and Jianbo Wu and Minghao Yan and Song Bian and Vedang Sharma and Ke Sun and Steven Dillmann and Akshay Anand and Andrew Lanpouthakoun and Bardia Koopah and Changran Hu and Etash Guha and Gabriel H. S. Dreiman and Jiacheng Zhu and Karl Krauth and Li Zhong and Niklas Muennighoff and Robert Amanfu and Shangyin Tan and Shreyas Pimpalgaonkar and Tushar Aggarwal and Xiangning Lin and Xin Lan and Xuandong Zhao and Yiqing Liang and Yuanli Wang and Zilong Wang and Changzhi Zhou and David Heineman and Hange Liu and Harsh Trivedi and John Yang and Junhong Lin and Manish Shetty and Michael Yang and Nabil Omi and Negin Raoof and Shanda Li and Terry Yue Zhuo and Wuwei Lin and Yiwei Dai and Yuxin Wang and Wenhao Chai and Shang Zhou and Dariush Wahdany and Ziyu She and Jiaming Hu and Zhikang Dong and Yuxuan Zhu and Sasha Cui and Ahson Saiyed and Arinbj{\"o}rn Kolbeinsson and Jesse Hu and Christopher Michael Rytting and Ryan Marten and Yixin Wang and Alex Dimakis and Andy Konwinski and Ludwig Schmidt},
  title     = {Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces},
  journal   = {arXiv preprint arXiv:2601.11868},
  year      = {2025},
  url       = {https://arxiv.org/abs/2601.11868}
}