SWE-Bench-Multilingual
Description
SWE-Bench-Multilingual is an environment for evaluating code repair capabilities across multiple programming languages. Following the SWE-bench methodology, agents are given real GitHub issues from repositories written in a variety of programming languages and must produce patches that resolve the issues while keeping existing tests passing. The environment extends SWE-bench beyond its original Python-only scope.
Capabilities
- Multi-language code understanding and repair
- GitHub issue resolution
- Test-driven development
- Codebase navigation and modification
- Patch generation and validation
Compute Requirements
Agents are given a sandboxed environment with 4 CPUs and 8 GB RAM. Each task runs in a Docker container with the target repository pre-installed.
License
MIT.
Tasks
There is one split in this environment:
- test: Validated multilingual SWE-bench instances
Tasks span multiple programming languages from real GitHub repositories.
Reward Structure
This is a multi-turn environment. The agent explores the codebase, makes code modifications, and calls the answer tool to submit. The environment then runs the SWE-bench evaluation harness to check whether:
- All fail-to-pass tests now pass
- All pass-to-pass tests still pass
Reward is binary: 1.0 if the issue is resolved (all tests pass), 0.0 otherwise.
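The binary reward rule above can be sketched as a small helper. This is an illustrative sketch, not the actual harness API: the `resolve` function name and the test-status dictionaries are assumptions made for clarity.

```python
# Hedged sketch of the binary reward rule: reward is 1.0 only when every
# fail-to-pass test now passes AND every pass-to-pass test still passes.
def resolve(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> float:
    """Map per-test pass/fail status to the environment's binary reward."""
    if all(fail_to_pass.values()) and all(pass_to_pass.values()):
        return 1.0
    return 0.0
```

Note that a patch which fixes the issue but breaks even one previously passing test receives 0.0, not partial credit.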
Data
Data consists of SWE-bench instances sourced from HuggingFace SWE-bench/SWE-bench_Multilingual. Each task includes a problem statement, repository information, base commit, and test specifications.
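For illustration, one task instance can be modeled with a record like the one below. The field names follow the common SWE-bench schema and are an assumption here; verify them against the actual dataset card for SWE-bench/SWE-bench_Multilingual.

```python
# Illustrative shape of one task record (field names assumed from the
# standard SWE-bench schema, not confirmed against this dataset).
from dataclasses import dataclass


@dataclass
class TaskInstance:
    instance_id: str            # unique task identifier
    repo: str                   # e.g. "owner/project"
    base_commit: str            # commit the agent's patch is applied on top of
    problem_statement: str      # the GitHub issue text shown to the agent
    FAIL_TO_PASS: list[str]     # tests that must flip from failing to passing
    PASS_TO_PASS: list[str]     # tests that must keep passing
```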
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox |
| view | View file contents with optional line range |
| str_replace | Replace strings in files |
| insert | Insert content at a specific line |
| create | Create new files |
| answer | Submit final patch for evaluation. Ends the episode. |
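The core of the str_replace tool can be sketched as a pure function over file text. This is a minimal sketch of the usual semantics (the target string must occur exactly once); the real tool's error handling and interface may differ.

```python
# Minimal sketch of str_replace semantics (assumed, not the tool's actual
# implementation): require a unique match, then substitute it once.
def str_replace(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new` in `text`."""
    if text.count(old) != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}, "
                         f"found {text.count(old)}")
    return text.replace(old, new, 1)
```

Requiring a unique match prevents an ambiguous edit from silently landing in the wrong place, which is why agents typically include surrounding context in the `old` string.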
Time Horizon
Multi-turn. The agent reads the problem statement, explores the codebase, implements fixes, and submits for evaluation.
Environment Difficulty
SWE-Bench-Multilingual evaluates real-world software engineering capabilities across multiple programming languages.
| Model | Resolve Rate |
|---|---|
| MiniMax M2.5 | 74.1% |
| GLM-5 | 73.3% |
| Kimi K2.5 | 73.0% |
| Gemini 3.1 Pro | 72.0% |
| Qwen 3 Coder Next (OpenHands) | 64.3% |
Other Environment Requirements
No secrets are required other than an OpenReward API key.
Safety
Agents in SWE-Bench-Multilingual work within sandboxed Docker containers. Code execution is isolated and the environment does not present direct safety risks.
Citation
@article{yang2025swesmith,
  title={SWE-smith: Scaling Data for Software Engineering Agents},
  author={Yang, John and Lieret, Kilian and Jimenez, Carlos E. and Wettig, Alexander and Khandpur, Kabir and Zhang, Yanzhe and Hui, Binyuan and Press, Ofir and Schmidt, Ludwig and Yang, Diyi},
  journal={arXiv preprint arXiv:2504.21798},
  year={2025}
}