# Terminal-Bench-2-Verified

## Description
Terminal-Bench-2-Verified is an environment by Z.ai for evaluating agents on challenging terminal-based software engineering tasks. Based on Terminal-Bench 2.0 by Merrill et al., it places agents in containerized environments where they must complete realistic programming challenges spanning compilation, configuration, data processing, cryptography, and more.
## Capabilities
- Command-line software engineering
- Building and compiling complex projects
- System configuration and debugging
- Data processing and analysis
- Working with diverse programming languages and tools
## Compute Requirements
Agents are given a sandboxed Docker environment; the default sandbox allocation is 1 CPU and 2 GB of RAM.
## License
## Tasks
There is one split in this environment:
- Test: 89 terminal-based software engineering tasks
Tasks span diverse domains including:
- Compilation: Building Caffe, POV-Ray, CompCert, Cython extensions
- Cryptography: Hash cracking, cryptanalysis (FEAL), password recovery
- Data Processing: Token counting, data merging, log summarization
- Machine Learning: Model inference, training FastText, PyTorch parallelism
- System Administration: Git operations, QEMU setup, Nginx configuration
- Scientific Computing: Eigenvalue computation, MCMC sampling, Raman fitting
Each task provides a detailed instruction file describing the problem. The agent must use terminal commands to implement a solution that passes the task's test suite.
## Reward Structure
This is a multi-turn environment with binary reward:
- 1.0 — All tests pass
- 0.0 — One or more tests fail
Verification runs the task's test script, which checks for correct output files, expected behavior, and proper implementation according to the task specification.
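Concretely, the binary reward can be sketched as a mapping from the test script's exit status to a score. This is a hypothetical sketch for illustration; the environment's actual harness code is not shown here:

```python
import subprocess

def compute_reward(test_script: str) -> float:
    """Hypothetical sketch: run a task's test script and map its
    exit status to the binary reward (1.0 if all tests pass, else 0.0)."""
    result = subprocess.run(["bash", test_script], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```

Because the reward is all-or-nothing, a solution that passes most but not all tests still scores 0.0.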
## Data
Data consists of 89 task directories, each containing an instruction file, a pre-configured Docker image, and a test harness. Tasks are derived from Terminal-Bench 2.0 with environment fixes for reproducibility.
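A single task directory might be loaded as follows. This is a minimal sketch: the file names `instructions.md` and `tests/run.sh` are illustrative assumptions, not the environment's actual layout.

```python
from pathlib import Path

def load_task(task_dir: str) -> dict:
    """Hypothetical sketch of reading one task directory: the
    instruction file plus the path to the test harness entry point.
    File names here are assumptions for illustration only."""
    task = Path(task_dir)
    return {
        "instructions": (task / "instructions.md").read_text(),
        "test_script": str(task / "tests" / "run.sh"),
    }
```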
## Tools
| Tool | Description |
|---|---|
| bash | Run bash commands in the sandbox container. |
| str_replace | Replace a unique string in a file with another string. |
| view | View file contents or directory listings. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for verification. Runs the test harness and returns reward. |
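As an illustration, the uniqueness requirement behind `str_replace` could be implemented as follows. This is a hypothetical sketch, not the environment's actual tool code:

```python
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> None:
    """Hypothetical sketch of the str_replace tool: the target string
    must occur exactly once in the file so the edit is unambiguous."""
    text = Path(path).read_text()
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}, found {count}")
    Path(path).write_text(text.replace(old, new))
```

Requiring exactly one occurrence prevents an agent from accidentally editing the wrong location when the same snippet appears multiple times in a file.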
## Time Horizon
Terminal-Bench-2-Verified is a multi-turn environment. Agents read task instructions, explore the environment, implement solutions using terminal commands, and submit for verification.
## Environment Difficulty
Evaluation results on the verified version with environment and instruction fixes:
| Model | Accuracy |
|---|---|
| Claude Opus 4.5 | 61.80% |
| GLM-5 | 61.12% |
| Claude Sonnet 4.5 | 50.34% |
## Other Environment Requirements
There are no external API key requirements; Terminal-Bench-2-Verified works out of the box with the OpenReward endpoint.
## Safety
Agents in Terminal-Bench-2-Verified operate within isolated Docker containers. The environment does not involve production systems or external network access beyond the sandbox.
## Citation
```bibtex
@article{merrill2025terminalbench,
  author  = {Mike A. Merrill and Alexander G. Shaw and Nicholas Carlini and Boxuan Li and Harsh Raj and Ivan Bercovich and Lin Shi and Jeong Yeon Shin and Thomas Walshe and E. Kelly Buchanan and Junhong Shen and Guanghao Ye and Haowei Lin and Jason Poulos and Maoyu Wang and Marianna Nezhurina and Jenia Jitsev and Di Lu and Orfeas Menis Mastromichalakis and Zhiwei Xu and Zizhao Chen and Yue Liu and Robert Zhang and Leon Liangyu Chen and Anurag Kashyap and Jan-Lucas Uslu and Jeffrey Li and Jianbo Wu and Minghao Yan and Song Bian and Vedang Sharma and Ke Sun and Steven Dillmann and Akshay Anand and Andrew Lanpouthakoun and Bardia Koopah and Changran Hu and Etash Guha and Gabriel H. S. Dreiman and Jiacheng Zhu and Karl Krauth and Li Zhong and Niklas Muennighoff and Robert Amanfu and Shangyin Tan and Shreyas Pimpalgaonkar and Tushar Aggarwal and Xiangning Lin and Xin Lan and Xuandong Zhao and Yiqing Liang and Yuanli Wang and Zilong Wang and Changzhi Zhou and David Heineman and Hange Liu and Harsh Trivedi and John Yang and Junhong Lin and Manish Shetty and Michael Yang and Nabil Omi and Negin Raoof and Shanda Li and Terry Yue Zhuo and Wuwei Lin and Yiwei Dai and Yuxin Wang and Wenhao Chai and Shang Zhou and Dariush Wahdany and Ziyu She and Jiaming Hu and Zhikang Dong and Yuxuan Zhu and Sasha Cui and Ahson Saiyed and Arinbj{\"o}rn Kolbeinsson and Jesse Hu and Christopher Michael Rytting and Ryan Marten and Yixin Wang and Alex Dimakis and Andy Konwinski and Ludwig Schmidt},
  title   = {Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces},
  journal = {arXiv preprint arXiv:2601.11868},
  year    = {2025},
  url     = {https://arxiv.org/abs/2601.11868}
}
```