AMO-Bench
Description
AMO-Bench (Advanced Mathematical reasoning benchmark) evaluates large language models' mathematical reasoning at and above International Mathematical Olympiad (IMO) difficulty using 50 human-crafted, entirely original problems. Each problem requires only a final answer, which enables automatic, robust grading and guards against memorization. Evaluations across 26 LLMs show substantial room for improvement: the best model reaches only 52.4% accuracy, while performance scales promisingly with increased test-time compute.
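Final-answer-only grading can be automated by normalizing both the model's answer and the reference before comparison. The sketch below is a minimal, hypothetical illustration of that idea (it is not AMO-Bench's actual grader, and the function names are invented for this example):

```python
from fractions import Fraction


def normalize(ans: str):
    """Normalize a final answer so equivalent forms compare equal.

    Numeric answers are parsed as exact fractions (so "3/4" and "0.75"
    match); anything non-numeric falls back to a case-insensitive string.
    """
    s = ans.strip().replace(" ", "")
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return s.lower()


def grade(model_answer: str, reference: str) -> bool:
    """Return True if the model's final answer matches the reference."""
    return normalize(model_answer) == normalize(reference)


def accuracy(pairs) -> float:
    """Fraction of (model_answer, reference) pairs graded correct."""
    return sum(grade(m, r) for m, r in pairs) / len(pairs)
```

For example, `grade("3/4", "0.75")` is `True`, so a score over a set of problems reduces to a simple count of exact matches; a production grader would add more normalization (units, symbolic forms), but the principle is the same.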
Implementations (1)
| Environment | Stars | Last Updated |
|---|---|---|
| | 0 | 1 month ago |