taubench

τ-Bench

OpenReward Environment

Description

τ-Bench is an environment for evaluating language agents in dynamic, multi-turn conversations with simulated human users. It is based on the τ-bench benchmark by Sierra: agents converse with a user simulated by GPT-4o while following domain-specific policies and calling API tools correctly. The environment includes two domains: airline reservation management (τ-airline) and retail order management (τ-retail).

Capabilities

  • Multi-turn conversational interaction with simulated users
  • Following complex domain-specific policies and business rules
  • Using API tools for flight booking, order management, and user authentication
  • Stateful database operations with consistent rule-following

Compute Requirements

Agents are given a standard environment with no special compute requirements.

License

MIT.

Tasks

There is one split in this environment:

  • Test: 165 tasks total
    • τ-airline: 50 airline reservation tasks
    • τ-retail: 115 retail order management tasks

Tasks are annotated with ground-truth action sequences and expected database states.

Reward Structure

This is a multi-turn environment with a user simulator. The agent converses with a GPT-4o simulated user and makes tool calls as needed. When the conversation ends (signaled by "###STOP###"), the environment replays the ground-truth actions against the original database state and compares the resulting state with the agent's final database state by SHA-256 hash. The reward is 1.0 if the two states match and 0.0 otherwise.
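The grading step can be illustrated with a minimal Python sketch. It is not the environment's actual grader: the canonical JSON serialization, the `apply_action` helper, and the toy database layout are assumptions made for illustration. Only the idea of replaying the ground-truth actions and comparing SHA-256 hashes comes from the description above.

```python
import copy
import hashlib
import json


def state_hash(db: dict) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys) of the database.

    The serialization scheme is an assumption, not the environment's own.
    """
    encoded = json.dumps(db, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def apply_action(db: dict, action: dict) -> None:
    """Hypothetical stand-in for a state-changing tool: cancel a reservation."""
    if action["name"] == "cancel_reservation":
        db["reservations"][action["reservation_id"]]["status"] = "cancelled"


def reward(initial_db: dict, ground_truth_actions: list, agent_final_db: dict) -> float:
    """Replay ground-truth actions on a copy of the initial state, then compare hashes."""
    expected_db = copy.deepcopy(initial_db)
    for action in ground_truth_actions:
        apply_action(expected_db, action)
    return 1.0 if state_hash(expected_db) == state_hash(agent_final_db) else 0.0


# Toy data: two active reservations; the ground truth cancels only R1.
initial = {"reservations": {"R1": {"status": "active"}, "R2": {"status": "active"}}}
truth = [{"name": "cancel_reservation", "reservation_id": "R1"}]

correct = copy.deepcopy(initial)
correct["reservations"]["R1"]["status"] = "cancelled"

wrong = copy.deepcopy(initial)
wrong["reservations"]["R2"]["status"] = "cancelled"

print(reward(initial, truth, correct))  # 1.0
print(reward(initial, truth, wrong))    # 0.0
```

Because the comparison is on the whole database state, any extra or missing write by the agent changes the hash and yields a reward of 0.0, even if the user's main request was handled.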

Data

Task data includes user profiles, flight databases, product catalogs, and order histories stored as JSON files. Airline data includes flights, reservations, and users. Retail data includes products, orders, and users. Task data is stored on the OpenReward platform.

Tools

The environment exposes two classes (Airline and Retail), each with its own set of domain-specific tools plus a shared respond tool for conversing with the simulated user.

Airline tools (15):

| Tool | Description |
| --- | --- |
| book_reservation | Book a new flight reservation with passenger and payment details. |
| calculate | Perform mathematical calculations. |
| cancel_reservation | Cancel an existing reservation. |
| get_reservation_details | Retrieve reservation information. |
| get_user_details | Retrieve user profile and payment methods. |
| list_all_airports | List available airports with IATA codes. |
| respond | Send a message to the simulated user and receive their reply. |
| search_direct_flight | Search for direct flights between airports on a date. |
| search_onestop_flight | Search for flights with one stopover. |
| send_certificate | Send a travel certificate/voucher to a user. |
| think | Internal reasoning (no state change). |
| transfer_to_human_agents | Escalate to human support. |
| update_reservation_baggages | Modify baggage count on a reservation. |
| update_reservation_flights | Change flights in a reservation. |
| update_reservation_passengers | Modify passenger list on a reservation. |

Retail tools (17):

| Tool | Description |
| --- | --- |
| calculate | Perform mathematical calculations. |
| cancel_pending_order | Cancel a pending order. |
| exchange_delivered_order_items | Exchange items in a delivered order. |
| find_user_id_by_email | Authenticate user by email. |
| find_user_id_by_name_zip | Authenticate user by name and zip code. |
| get_order_details | Retrieve order information. |
| get_product_details | Retrieve product details with variants and pricing. |
| get_user_details | Retrieve user profile. |
| list_all_product_types | List all product categories. |
| modify_pending_order_address | Change shipping address for pending order. |
| modify_pending_order_items | Modify items in a pending order (one-time only). |
| modify_pending_order_payment | Change payment method for pending order. |
| modify_user_address | Update user's default address. |
| respond | Send a message to the simulated user and receive their reply. |
| return_delivered_order_items | Return items from a delivered order. |
| think | Internal reasoning (no state change). |
| transfer_to_human_agents | Escalate to human support. |

Time Horizon

τ-Bench is a multi-turn environment. Agents engage in dynamic conversations with a simulated user, making API calls as needed to fulfill the user's request while following domain policies.
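The conversational loop can be sketched as follows. `ScriptedUser`, `run_episode`, the opening user message, and the echo policy are hypothetical stand-ins, not part of the environment. The sketch also assumes that the simulated user is the side that emits "###STOP###", as in the original τ-bench.

```python
STOP_TOKEN = "###STOP###"


class ScriptedUser:
    """Hypothetical stand-in for the GPT-4o user simulator: replays fixed replies."""

    def __init__(self, replies):
        self._replies = iter(replies)

    def respond(self, agent_message: str) -> str:
        # When the script runs out, end the conversation.
        return next(self._replies, STOP_TOKEN)


def run_episode(user, agent_policy, max_turns: int = 10):
    """Alternate agent and user messages until the user sends the stop token."""
    transcript = []
    user_message = "Hi, I need help with my order."  # assumed opening message
    for _ in range(max_turns):
        agent_message = agent_policy(user_message)
        transcript.append(("agent", agent_message))
        user_message = user.respond(agent_message)
        transcript.append(("user", user_message))
        if STOP_TOKEN in user_message:
            break
    return transcript


def echo_policy(user_message: str) -> str:
    """Trivial policy; a real agent would call tools before replying."""
    return f"Understood: {user_message}"


user = ScriptedUser(["My email is a@example.com.", "Thanks, that is all. ###STOP###"])
transcript = run_episode(user, echo_policy)
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

In the real environment the agent's side of each turn would go through the respond tool, with domain tool calls interleaved between messages. The `max_turns` cap here is only a safeguard for the sketch.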

Environment Difficulty

| Model | τ-airline | τ-retail |
| --- | --- | --- |
| Claude Opus 4 | 81.4% | 59.6% |
| o3-high | 52.0% | 73.9% |
| o4-mini-high | 49.2% | 71.8% |
| gpt-oss-120b |  | 67.8% |
| o3-mini-high | 32.4% | 57.6% |

Other Environment Requirements

  • OpenAI API key: Required for the GPT-4o user simulator and grading. Pass via secrets={"openai_api_key": "..."}.
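One way to build the secrets mapping without hard-coding the key is to read it from an environment variable. The variable name `OPENAI_API_KEY` and the helper `build_secrets` are assumptions; how the resulting dict is handed to the environment depends on the client you use.

```python
import os


def build_secrets(env=os.environ) -> dict:
    """Build the secrets mapping, failing early if the key is missing.

    `env` defaults to the process environment; pass a plain dict for testing.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before starting a τ-Bench session.")
    return {"openai_api_key": api_key}


# Demonstration with an explicit mapping so the sketch runs without a real key.
demo = build_secrets({"OPENAI_API_KEY": "sk-placeholder"})
print(sorted(demo))  # ['openai_api_key']
```

Failing early is useful here because the key is needed by both the user simulator and grading, so a missing key would otherwise surface only partway through an episode.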

Safety

Agents in τ-Bench interact with simulated users in fictional airline and retail domains. The environment does not present direct safety risks.

Citations

@misc{yao2024tau,
  title={$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains},
  author={Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan},
  year={2024},
  eprint={2406.12045},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2406.12045}
}