SWE-Bench-Multilingual
Description
SWE-Bench-Multilingual is an environment for evaluating code repair capabilities across multiple programming languages. Following the SWE-bench methodology, agents are given real GitHub issues from repositories written in a variety of programming languages and must produce patches that resolve the issues while keeping existing tests passing. The environment extends SWE-bench beyond its original Python-only scope.
Capabilities
- Multi-language code understanding and repair
- GitHub issue resolution
- Test-driven development
- Codebase navigation and modification
- Patch generation and validation
Compute Requirements
Agents are given a sandboxed environment with 4 CPUs and 8 GB RAM. Each task runs in a Docker container with the target repository pre-installed.
License
MIT.
Tasks
There is one split in this environment:
- test: Validated multilingual SWE-bench instances
Tasks span multiple programming languages from real GitHub repositories.
Reward Structure
This is a multi-turn environment. The agent explores the codebase, makes code modifications, and calls the answer tool to submit. The environment then runs the SWE-bench evaluation harness to check whether:
- All fail-to-pass tests now pass
- All pass-to-pass tests still pass
Reward is binary: 1.0 if the issue is resolved (all tests pass), 0.0 otherwise.
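The binary reward rule above can be sketched as a small helper. This is an illustrative sketch, not the actual harness API: the `resolve` function name and the test-status dictionaries are assumptions made for clarity.

```python
# Hedged sketch of the binary reward rule: reward is 1.0 only when every
# fail-to-pass test now passes AND every pass-to-pass test still passes.
def resolve(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> float:
    """Map per-test pass/fail status to the environment's binary reward."""
    if all(fail_to_pass.values()) and all(pass_to_pass.values()):
        return 1.0
    return 0.0
```

Note that a patch which fixes the issue but breaks even one previously passing test receives 0.0, not partial credit.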
Data
Data consists of SWE-bench instances sourced from HuggingFace SWE-bench/SWE-bench_Multilingual. Each task includes a problem statement, repository information, base commit, and test specifications.
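For illustration, one task instance can be modeled with a record like the one below. The field names follow the common SWE-bench schema and are an assumption here; verify them against the actual dataset card for SWE-bench/SWE-bench_Multilingual.

```python
# Illustrative shape of one task record (field names assumed from the
# standard SWE-bench schema, not confirmed against this dataset).
from dataclasses import dataclass


@dataclass
class TaskInstance:
    instance_id: str            # unique task identifier
    repo: str                   # e.g. "owner/project"
    base_commit: str            # commit the agent's patch is applied on top of
    problem_statement: str      # the GitHub issue text shown to the agent
    FAIL_TO_PASS: list[str]     # tests that must flip from failing to passing
    PASS_TO_PASS: list[str]     # tests that must keep passing
```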
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox |
| view | View file contents with optional line range |
| str_replace | Replace strings in files |
| insert | Insert content at a specific line |
| create | Create new files |
| answer | Submit final patch for evaluation. Ends the episode. |
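The core of the str_replace tool can be sketched as a pure function over file text. This is a minimal sketch of the usual semantics (the target string must occur exactly once); the real tool's error handling and interface may differ.

```python
# Minimal sketch of str_replace semantics (assumed, not the tool's actual
# implementation): require a unique match, then substitute it once.
def str_replace(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new` in `text`."""
    if text.count(old) != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}, "
                         f"found {text.count(old)}")
    return text.replace(old, new, 1)
```

Requiring a unique match prevents an ambiguous edit from silently landing in the wrong place, which is why agents typically include surrounding context in the `old` string.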
Time Horizon
Multi-turn. The agent reads the problem statement, explores the codebase, implements fixes, and submits for evaluation.
Environment Difficulty
SWE-Bench-Multilingual evaluates real-world software engineering capabilities across multiple programming languages.
| Model | Resolve Rate |
|---|---|
| MiniMax M2.5 | 74.1% |
| GLM-5 | 73.3% |
| Kimi K2.5 | 73.0% |
| Gemini 3.1 Pro | 72.0% |
| Qwen 3 Coder Next (OpenHands) | 64.3% |
Other Environment Requirements
No secrets are required other than an OpenReward API key.
Safety
Agents in SWE-Bench-Multilingual work within sandboxed Docker containers. Code execution is isolated and the environment does not present direct safety risks.
Citation
@article{yang2025swesmith,
  title={SWE-smith: Scaling Data for Software Engineering Agents},
  author={Yang, John and Lieret, Kilian and Jimenez, Carlos E. and Wettig, Alexander and Khandpur, Kabir and Zhang, Yanzhe and Hui, Binyuan and Press, Ofir and Schmidt, Ludwig and Yang, Diyi},
  journal={arXiv preprint arXiv:2504.21798},
  year={2025}
}