
Terminal-Bench-2-Verified


Description

Terminal-Bench-2-Verified is an environment by Z.ai for evaluating agents on challenging terminal-based software engineering tasks. It is based on Terminal-Bench 2.0 by Merrill et al.; agents work in containerized environments to complete realistic programming challenges spanning compilation, configuration, data processing, cryptography, and more.

Capabilities

  • Command-line software engineering
  • Building and compiling complex projects
  • System configuration and debugging
  • Data processing and analysis
  • Working with diverse programming languages and tools

Compute Requirements

Agents are given a sandboxed Docker environment. Default sandbox size is 1 CPU and 2 GB RAM.
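The default limits correspond to a container launched with explicit CPU and memory caps. A minimal sketch of building such an invocation, assuming a plain `docker run` sandbox (the image name is hypothetical, not the environment's actual image):

```python
def sandbox_run_command(image, cpus=1, memory_gb=2):
    """Build a `docker run` invocation with the default sandbox limits."""
    return [
        "docker", "run", "--rm",
        "--cpus", str(cpus),          # default: 1 CPU
        "--memory", f"{memory_gb}g",  # default: 2 GB RAM
        image,
    ]

cmd = sandbox_run_command("terminal-bench/task-image:latest")
# ["docker", "run", "--rm", "--cpus", "1", "--memory", "2g",
#  "terminal-bench/task-image:latest"]
```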

License

Apache 2.0.

Tasks

There is one split in this environment:

  • Test: 89 terminal-based software engineering tasks

Tasks span diverse domains including:

  • Compilation: Building Caffe, POV-Ray, CompCert, Cython extensions
  • Cryptography: Hash cracking, cryptanalysis (FEAL), password recovery
  • Data Processing: Token counting, data merging, log summarization
  • Machine Learning: Model inference, training FastText, PyTorch parallelism
  • System Administration: Git operations, QEMU setup, Nginx configuration
  • Scientific Computing: Eigenvalue computation, MCMC sampling, Raman fitting

Each task provides a detailed instruction file describing the problem. The agent must use terminal commands to implement a solution that passes the task's test suite.

Reward Structure

This is a multi-turn environment with binary reward:

  • 1.0 — All tests pass
  • 0.0 — Tests fail

Verification runs the task's test script which checks for correct output files, expected behavior, and proper implementation according to task specifications.
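The binary reward can be modeled as a function of the test script's exit status. A minimal sketch, assuming the harness is a bash script whose exit code signals pass/fail (the script path and runner are assumptions, not the environment's actual harness):

```python
import subprocess

def reward_from_exit_code(code: int) -> float:
    """All tests pass (exit 0) -> 1.0; any failure -> 0.0."""
    return 1.0 if code == 0 else 0.0

def verify(test_script: str = "/tests/run_tests.sh", timeout: int = 600) -> float:
    """Run the task's test script and map its exit status to the binary reward."""
    result = subprocess.run(["bash", test_script], timeout=timeout)
    return reward_from_exit_code(result.returncode)
```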

Data

Data consists of 89 task directories, each containing an instruction file, a pre-configured Docker image, and a test harness. Tasks are derived from Terminal-Bench 2.0 with environment fixes for reproducibility.

Tools

Tool            Description
bash            Run bash commands in the sandbox container.
str_replace     Replace a unique string in a file with another string.
view            View file contents or directory listings.
create_file     Create a new file with specified content.
submit_answer   Submit work for verification. Runs the test harness and returns reward.
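The str_replace tool requires the target string to be unique in the file. A hedged reconstruction of that described behavior (a sketch, not the environment's actual implementation):

```python
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> None:
    """Replace `old` with `new` in `path`; fail unless `old` occurs exactly once."""
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        raise ValueError(f"{old!r} not found in {path}")
    if count > 1:
        raise ValueError(f"{old!r} occurs {count} times; must be unique")
    Path(path).write_text(text.replace(old, new))
```

Requiring uniqueness makes edits unambiguous: the agent must quote enough surrounding context to pin down a single location.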

Time Horizon

Terminal-Bench-2-Verified is a multi-turn environment. Agents read task instructions, explore the environment, implement solutions using terminal commands, and submit for verification.
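The multi-turn interaction above can be sketched as a simple episode driver; the tool names match the table in the Tools section, but the policy interface, transcript format, and turn limit are assumptions:

```python
def run_episode(policy, tools, max_turns=50):
    """Drive a multi-turn episode: each turn the policy picks a tool call,
    until it calls submit_answer, which returns the binary reward."""
    transcript = []
    for _ in range(max_turns):
        name, args = policy(transcript)         # policy sees the history so far
        observation = tools[name](**args)       # execute the chosen tool
        transcript.append((name, args, observation))
        if name == "submit_answer":
            return observation                  # reward from the test harness
    return 0.0  # ran out of turns without submitting
```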

Environment Difficulty

Evaluation results on the verified version with environment and instruction fixes:

Model               Accuracy
Claude Opus 4.5     61.80%
GLM-5               61.12%
Claude Sonnet 4.5   50.34%

Other Environment Requirements

There are no external API key requirements; Terminal-Bench-2-Verified works out of the box with the OpenReward endpoint.

Safety

Agents in Terminal-Bench-2-Verified operate within isolated Docker containers. The environment does not involve production systems or external network access beyond the sandbox.

Citation

@article{merrill2025terminalbench,
  author    = {Mike A. Merrill and Alexander G. Shaw and Nicholas Carlini and Boxuan Li and Harsh Raj and Ivan Bercovich and Lin Shi and Jeong Yeon Shin and Thomas Walshe and E. Kelly Buchanan and Junhong Shen and Guanghao Ye and Haowei Lin and Jason Poulos and Maoyu Wang and Marianna Nezhurina and Jenia Jitsev and Di Lu and Orfeas Menis Mastromichalakis and Zhiwei Xu and Zizhao Chen and Yue Liu and Robert Zhang and Leon Liangyu Chen and Anurag Kashyap and Jan-Lucas Uslu and Jeffrey Li and Jianbo Wu and Minghao Yan and Song Bian and Vedang Sharma and Ke Sun and Steven Dillmann and Akshay Anand and Andrew Lanpouthakoun and Bardia Koopah and Changran Hu and Etash Guha and Gabriel H. S. Dreiman and Jiacheng Zhu and Karl Krauth and Li Zhong and Niklas Muennighoff and Robert Amanfu and Shangyin Tan and Shreyas Pimpalgaonkar and Tushar Aggarwal and Xiangning Lin and Xin Lan and Xuandong Zhao and Yiqing Liang and Yuanli Wang and Zilong Wang and Changzhi Zhou and David Heineman and Hange Liu and Harsh Trivedi and John Yang and Junhong Lin and Manish Shetty and Michael Yang and Nabil Omi and Negin Raoof and Shanda Li and Terry Yue Zhuo and Wuwei Lin and Yiwei Dai and Yuxin Wang and Wenhao Chai and Shang Zhou and Dariush Wahdany and Ziyu She and Jiaming Hu and Zhikang Dong and Yuxuan Zhu and Sasha Cui and Ahson Saiyed and Arinbj{\"o}rn Kolbeinsson and Jesse Hu and Christopher Michael Rytting and Ryan Marten and Yixin Wang and Alex Dimakis and Andy Konwinski and Ludwig Schmidt},
  title     = {Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces},
  journal   = {arXiv preprint arXiv:2601.11868},
  year      = {2025},
  url       = {https://arxiv.org/abs/2601.11868}
}