SWE-Bench Verified

OpenReward Environment

Description

SWE-Bench Verified is an environment for evaluating software engineering capabilities on real-world GitHub issues. It is based on the SWE-bench Verified dataset, a human-validated subset of SWE-bench: agents are given a problem statement describing a real GitHub issue and must navigate the codebase, understand the bug, and produce a correct patch that passes the test suite.

Capabilities

  • Software engineering and bug fixing
  • Navigating real-world Python codebases
  • Understanding and resolving GitHub issues
  • Writing patches that pass existing and withheld test suites

Compute Requirements

Agents are given a sandbox with 4 CPUs and 8 GB RAM, with repository-specific Docker images and conda environments.

License

MIT

Tasks

Three splits are available:

  • all: 500 tasks
  • mini: 50 tasks (curated subset)
  • hard_subset: 45 tasks (estimated >1 hour of SWE work)

Tasks span popular Python repositories including Django, scikit-learn, sympy, matplotlib, and more.

Reward Structure

SWE-Bench Verified uses a multi-turn reward structure. Agents edit code using bash commands and call the answer tool when they are done. The environment then runs the test suite, including withheld tests. The reward is binary (sketched below):

  • 1.0: All FAIL_TO_PASS tests now pass and PASS_TO_PASS tests still pass
  • 0.0: Otherwise
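
Illustratively, the check reduces to the following sketch. This is not the environment's actual grading code: the real harness runs and parses the tests inside the task's Docker image, and the result parsing is assumed here to yield a mapping from test id to pass/fail.

# Illustrative sketch only; `results` is assumed to map test id -> passed (bool).
def compute_reward(results: dict[str, bool],
                   fail_to_pass: list[str],
                   pass_to_pass: list[str]) -> float:
    resolved = all(results.get(test, False) for test in fail_to_pass)
    unbroken = all(results.get(test, False) for test in pass_to_pass)
    return 1.0 if resolved and unbroken else 0.0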

Data

Task specifications are sourced from the HuggingFace dataset princeton-nlp/SWE-bench_Verified. Repository snapshots are pre-loaded in Docker images and stored on the OpenReward platform.
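
For reference, the raw task specifications can be inspected directly with the HuggingFace datasets library, independently of the environment; the field names below follow the public SWE-bench schema.

# Sketch: inspect the raw task specifications outside the environment.
from datasets import load_dataset

tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(tasks))                        # 500 verified tasks
task = tasks[0]
print(task["repo"])                      # source repository, e.g. django/django
print(task["instance_id"])               # unique task identifier
print(task["problem_statement"][:300])   # GitHub issue text shown to the agent
print(task["FAIL_TO_PASS"])              # tests that must newly pass after the fix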

Tools

  • bash: Execute bash commands in the testbed conda environment
  • answer: Submit work for test execution and grading
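
Purely as an illustration of the two tools above, an agent turn might emit payloads like the following; the argument schema and the /testbed path are assumptions, and the exact call format on the OpenReward platform may differ.

# Hypothetical tool-call payloads; only the tool names come from the list above.
bash_call = {
    "tool": "bash",
    "arguments": {"command": "cd /testbed && git status && python -m pytest -q --collect-only | head"},
}
answer_call = {
    "tool": "answer",   # submits the current repository state for grading
    "arguments": {},
}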

Time Horizon

Multi-turn. Agents explore the codebase, identify the root cause, write a fix, and submit for evaluation.
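
A minimal sketch of that loop, assuming a hypothetical env client with reset, step, and submit methods; the real OpenReward client API may differ.

# Hypothetical interaction loop; `agent` and `env` are stand-ins for illustration.
def run_episode(agent, env, max_turns: int = 50) -> float:
    observation = env.reset()                       # problem statement + fresh sandbox
    for _ in range(max_turns):
        tool, arguments = agent.act(observation)
        if tool == "answer":
            return env.submit()                     # run withheld tests, return 1.0 or 0.0
        observation = env.step("bash", arguments)   # execute a bash command in the sandbox
    return 0.0                                      # no submission within the turn budget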

Environment Difficulty

Model               Accuracy
Claude Opus 4.5     80.9%
Claude Opus 4.6     80.8%
Gemini 3.1 Pro      80.6%
MiniMax M2.5        80.2%
GPT-5.2 Thinking    80.0%

Other Environment Requirements

There are no further environment requirements; SWE-Bench Verified works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in SWE-Bench Verified edit code in sandboxed Docker containers. The environment does not present direct safety risks.

Citation

@inproceedings{jimenez2024swebench,
  title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
  author={Jimenez, Carlos E and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik R},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}