AMO-Bench
Description
AMO-Bench (Advanced Mathematical reasoning benchmark) evaluates large language models' mathematical reasoning at and above International Mathematical Olympiad (IMO) difficulty using 50 human-crafted, entirely original problems. Each problem requires only a final answer, which enables automatic, robust grading and guards against memorization. Evaluations across 26 LLMs show substantial room for improvement: the best model reaches only 52.4% accuracy, while performance scales promisingly with increased test-time compute.
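Final-answer-only grading can be automated by normalizing both the model's answer and the reference before comparison. The sketch below is a minimal, hypothetical illustration of that idea (it is not AMO-Bench's actual grader, and the function names are invented for this example):

```python
from fractions import Fraction


def normalize(ans: str):
    """Normalize a final answer so equivalent forms compare equal.

    Numeric answers are parsed as exact fractions (so "3/4" and "0.75"
    match); anything non-numeric falls back to a case-insensitive string.
    """
    s = ans.strip().replace(" ", "")
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return s.lower()


def grade(model_answer: str, reference: str) -> bool:
    """Return True if the model's final answer matches the reference."""
    return normalize(model_answer) == normalize(reference)


def accuracy(pairs) -> float:
    """Fraction of (model_answer, reference) pairs graded correct."""
    return sum(grade(m, r) for m, r in pairs) / len(pairs)
```

For example, `grade("3/4", "0.75")` is `True`, so a score over a set of problems reduces to a simple count of exact matches; a production grader would add more normalization (units, symbolic forms), but the principle is the same.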
Implementations (1)
| Environment | Stars | Last Updated |
|---|---|---|
| | 0 | 1 month ago |