SATBench

OpenReward Environment

Description

SATBench is an environment for evaluating LLM logical reasoning through natural language puzzles derived from Boolean satisfiability (SAT) formulas. Agents must determine whether a set of conditions can be simultaneously satisfied and, if so, provide a valid variable assignment.

Capabilities

  • Search-based logical reasoning (backtracking, constraint satisfaction)
  • Formal verification of satisfying assignments against CNF clauses
  • Natural language understanding of logical constraints

Compute Requirements

No special compute requirements. Lightweight text-based grading with no external API calls.

License

Apache 2.0

Tasks

  • Single split: train with 2,100 tasks
  • SAT/UNSAT balance: 1,050 satisfiable + 1,050 unsatisfiable (50/50)
  • Difficulty: Easy (4–15 clauses), Medium (20–30 clauses), Hard (31–50 clauses)
  • Variable count: 5–90 variables (mean 36), with 1D/2D/3D structure

Each task presents a natural language scenario with numbered conditions and asks whether all conditions can be satisfied simultaneously.
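To make the condition-to-clause correspondence concrete, here is a toy example in the spirit of the generated puzzles. The scenario, variable names, and clause are invented for illustration; they are not taken from the actual dataset.

```python
from itertools import product

# Toy mapping from a natural language condition to a CNF clause.
# Variables: 1 = "plants tomatoes", 2 = "plants peppers".
# Condition: "If she plants tomatoes, she does not plant peppers."
# As a clause in DIMACS-style signed integers: (NOT x1 OR NOT x2).
clause = [-1, -2]

def clause_satisfied(clause, assignment):
    """A clause holds if at least one of its literals is satisfied."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

# Enumerate all assignments; only "tomatoes AND peppers" violates it.
violations = [
    (a1, a2)
    for a1, a2 in product([False, True], repeat=2)
    if not clause_satisfied(clause, {1: a1, 2: a2})
]
print(violations)  # -> [(True, True)]
```

A full task conjoins many such clauses, and the question is whether any single assignment satisfies all of them at once.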

Reward Structure

Sparse, fully verifiable rewards (no LLM judge):

Prediction   Ground Truth   Condition                    Reward
UNSAT        UNSAT          -                            1.0
UNSAT        SAT            -                            0.0
SAT          UNSAT          -                            0.0
SAT          SAT            Valid assignment provided    1.0
SAT          SAT            Invalid/missing assignment   0.0

For SAT predictions, the agent must provide a satisfying variable assignment, which is verified clause by clause against the underlying CNF formula. This prevents guessing: to earn the reward, the agent must actually solve the problem.
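The grading rule described above can be sketched as follows. This is a minimal sketch, not the environment's actual grader: the DIMACS-style signed-integer clause encoding and the function names are assumptions.

```python
def clause_satisfied(clause, assignment):
    """A clause holds if any of its literals is satisfied.

    clause: DIMACS-style signed ints, e.g. [1, -3] means (x1 OR NOT x3).
    assignment: dict mapping 1-based variable index to bool.
    A variable missing from the assignment satisfies no literal.
    """
    return any(assignment.get(abs(lit)) == (lit > 0) for lit in clause)

def grade(prediction, ground_truth, clauses, assignment=None):
    """Sparse reward: 1.0 for a correct label, and for SAT only when
    the provided assignment satisfies every clause of the formula."""
    if prediction != ground_truth:
        return 0.0
    if prediction == "UNSAT":
        return 1.0
    # SAT/SAT: verify the assignment clause by clause.
    if not assignment:
        return 0.0  # missing assignment earns nothing
    if all(clause_satisfied(c, assignment) for c in clauses):
        return 1.0
    return 0.0
```

Note that a correct SAT label with an unverifiable assignment still scores 0.0, which is what makes the reward fully verifiable without an LLM judge.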

Data

  • Source: HuggingFace LLM4Code/SATBench
  • Size: ~7 MB (Parquet)
  • Format: Parquet with scenario, conditions, question, variable mapping, CNF clauses, and ground truth labels
  • Production path: /orwd_data/satbench.parquet

Tools

Tool     Description
submit   Submit a SAT/UNSAT determination with an optional variable assignment
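A submission might look like the following. The argument names ("answer", "assignment") and the overall payload shape are assumptions for illustration; the README only states that the assignment is optional.

```python
import json

# Hypothetical payloads for the environment's single `submit` tool.
unsat_submission = {"tool": "submit", "arguments": {"answer": "UNSAT"}}

sat_submission = {
    "tool": "submit",
    "arguments": {
        "answer": "SAT",
        # Variable names would come from the task's variable mapping.
        "assignment": {"x1": True, "x2": False, "x3": True},
    },
}

print(json.dumps(sat_submission, indent=2))
```

An UNSAT claim needs no assignment, while a SAT claim without one scores zero under the reward structure above.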

Time Horizon

Single-turn. The agent receives the puzzle prompt and makes one submit call.

Environment Difficulty

  • Easy (4–15 clauses): Most models achieve >80% accuracy
  • Medium (20–30 clauses): Significant model performance degradation
  • Hard (31–50 clauses): Best model (o4-mini) achieves only 65% on hard UNSAT

Other Environment Requirements

No other requirements. No external API keys needed for grading.

Safety

SAT puzzles are abstract logical reasoning tasks with no safety concerns. Scenarios are fictional and benign (e.g., superheroes choosing missions, gardeners planting vegetables).

Citations

@inproceedings{wei-etal-2025-satbench,
    title = "{SATB}ench: Benchmarking {LLM}s' Logical Reasoning via Automated Puzzle Generation from {SAT} Formulas",
    author = "Wei, Anjiang  and
      Wu, Yuheng  and
      Wan, Yingjia  and
      Suresh, Tarun  and
      Tan, Huanmi  and
      Zhou, Zhanke  and
      Koyejo, Sanmi  and
      Wang, Ke  and
      Aiken, Alex",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1716/",
    doi = "10.18653/v1/2025.emnlp-main.1716",
    pages = "33832--33849",
}