taubench
τ-Bench
Description
τ-Bench is an environment for evaluating language agents in dynamic, multi-turn conversations with simulated human users. It is based on the τ-bench benchmark by Sierra. Agents interact with a user played by a GPT-4o user simulator, and must follow domain-specific policies and use API tools correctly while doing so. The environment includes two domains: airline reservation management (τ-airline) and retail order management (τ-retail).
Capabilities
- Multi-turn conversational interaction with simulated users
- Following complex domain-specific policies and business rules
- Using API tools for flight booking, order management, and user authentication
- Stateful database operations with consistent rule-following
Compute Requirements
Agents are given a standard environment with no special compute requirements.
License
MIT.
Tasks
There is one split in this environment:
- Test: 165 tasks total
  - τ-airline: 50 airline reservation tasks
  - τ-retail: 115 retail order management tasks
Tasks are annotated with ground-truth action sequences and expected database states.
Reward Structure
This is a multi-turn environment with a user simulator. The agent converses with a GPT-4o simulated user, making tool calls as needed. When the conversation ends (signaled by "###STOP###"), the environment replays the ground-truth actions against the original database state, then compares a SHA-256 hash of the resulting database state with a hash of the agent's final database state. Reward is 1.0 if the two states match and 0.0 otherwise.
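The replay-and-compare check described above can be sketched as follows. This is a minimal illustration, not the environment's actual grader: the function names (`state_hash`, `apply_action`, `compute_reward`), the action format, the toy database layout, and the choice of sorted-key JSON as the canonical serialization are all assumptions made for the example.

```python
import copy
import hashlib
import json


def state_hash(db: dict) -> str:
    """Return a SHA-256 digest of a database state.

    Assumption: the state is JSON-serializable. Sorting keys makes the
    digest independent of dictionary insertion order.
    """
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_action(db: dict, action: dict) -> None:
    """Hypothetical tool dispatcher; only one toy action is implemented."""
    if action["name"] == "cancel_reservation":
        reservation_id = action["kwargs"]["reservation_id"]
        db["reservations"][reservation_id]["status"] = "cancelled"
    else:
        raise ValueError(f"unknown action: {action['name']}")


def compute_reward(initial_db: dict, ground_truth_actions: list, agent_final_db: dict) -> float:
    """Replay the annotated actions on a fresh copy, then compare digests."""
    expected_db = copy.deepcopy(initial_db)  # never mutate the original state
    for action in ground_truth_actions:
        apply_action(expected_db, action)
    return 1.0 if state_hash(expected_db) == state_hash(agent_final_db) else 0.0


# Toy data, invented for illustration only.
initial = {"reservations": {"R1": {"status": "active"}}}
truth = [{"name": "cancel_reservation", "kwargs": {"reservation_id": "R1"}}]
agent_done = {"reservations": {"R1": {"status": "cancelled"}}}

print(compute_reward(initial, truth, agent_done))  # 1.0: states match
print(compute_reward(initial, truth, initial))     # 0.0: agent changed nothing
```

Because the comparison is on the final state only, any sequence of tool calls that ends in the same database contents earns the same reward under this sketch, and any stray write turns a 1.0 into a 0.0.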
Data
Task data includes user profiles, flight databases, product catalogs, and order histories stored as JSON files. Airline data includes flights, reservations, and users. Retail data includes products, orders, and users. Task data is stored on the OpenReward platform.
Tools
The environment exposes two classes (Airline and Retail), each with its own set of domain-specific tools plus a shared respond tool for conversing with the simulated user.
Airline tools (15):
| Tool | Description |
|---|---|
| book_reservation | Book a new flight reservation with passenger and payment details. |
| calculate | Perform mathematical calculations. |
| cancel_reservation | Cancel an existing reservation. |
| get_reservation_details | Retrieve reservation information. |
| get_user_details | Retrieve user profile and payment methods. |
| list_all_airports | List available airports with IATA codes. |
| respond | Send a message to the simulated user and receive their reply. |
| search_direct_flight | Search for direct flights between airports on a date. |
| search_onestop_flight | Search for flights with one stopover. |
| send_certificate | Send a travel certificate/voucher to a user. |
| think | Internal reasoning (no state change). |
| transfer_to_human_agents | Escalate to human support. |
| update_reservation_baggages | Modify baggage count on a reservation. |
| update_reservation_flights | Change flights in a reservation. |
| update_reservation_passengers | Modify passenger list on a reservation. |
Retail tools (17):
| Tool | Description |
|---|---|
| calculate | Perform mathematical calculations. |
| cancel_pending_order | Cancel a pending order. |
| exchange_delivered_order_items | Exchange items in a delivered order. |
| find_user_id_by_email | Authenticate user by email. |
| find_user_id_by_name_zip | Authenticate user by name and zip code. |
| get_order_details | Retrieve order information. |
| get_product_details | Retrieve product details with variants and pricing. |
| get_user_details | Retrieve user profile. |
| list_all_product_types | List all product categories. |
| modify_pending_order_address | Change shipping address for pending order. |
| modify_pending_order_items | Modify items in a pending order (one-time only). |
| modify_pending_order_payment | Change payment method for pending order. |
| modify_user_address | Update user's default address. |
| respond | Send a message to the simulated user and receive their reply. |
| return_delivered_order_items | Return items from a delivered order. |
| think | Internal reasoning (no state change). |
| transfer_to_human_agents | Escalate to human support. |
Time Horizon
τ-Bench is a multi-turn environment. Agents engage in dynamic conversations with a simulated user, making API calls as needed to fulfill the user's request while following domain policies.
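The turn-taking loop can be illustrated with a small stand-in. In this sketch the GPT-4o user simulator is replaced by a scripted user, and `ScriptedUser`, `run_episode`, the `max_turns` cap, and the opening message are hypothetical names and values chosen for the example; only the "###STOP###" signal and the idea of a respond call that returns the user's reply come from this document.

```python
STOP_TOKEN = "###STOP###"


class ScriptedUser:
    """Stand-in for the simulated user: replies from a fixed script."""

    def __init__(self, replies):
        self._replies = iter(replies)

    def respond(self, agent_message: str) -> str:
        # Once the script is exhausted, end the conversation.
        return next(self._replies, STOP_TOKEN)


def run_episode(agent_policy, respond, first_user_message: str, max_turns: int = 30):
    """Alternate agent messages and user replies until the stop signal."""
    transcript = [("user", first_user_message)]
    user_message = first_user_message
    for _ in range(max_turns):
        agent_message = agent_policy(user_message)
        transcript.append(("agent", agent_message))
        user_message = respond(agent_message)
        transcript.append(("user", user_message))
        if STOP_TOKEN in user_message:
            break
    return transcript


def echo_agent(user_message: str) -> str:
    """Trivial policy; a real agent would also call domain tools here."""
    return f"Understood: {user_message}"


user = ScriptedUser(["My reservation ID is R1.", "Thanks. ###STOP###"])
transcript = run_episode(echo_agent, user.respond, "I want to cancel my flight.")
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

The `max_turns` cap is a design choice of the sketch rather than a documented limit: it keeps an episode from running forever if the user never emits the stop signal.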
Environment Difficulty
| Model | τ-airline | τ-retail |
|---|---|---|
| Claude Opus 4 | 81.4% | 59.6% |
| o3-high | 52.0% | 73.9% |
| o4-mini-high | 49.2% | 71.8% |
| gpt-oss-120b | — | 67.8% |
| o3-mini-high | 32.4% | 57.6% |
Other Environment Requirements
- OpenAI API key: Required for the GPT-4o user simulator and grading. Pass via secrets={"openai_api_key": "..."}.
Safety
Agents in τ-Bench interact with simulated users in fictional airline and retail domains. The environment does not present direct safety risks.
Citations
@misc{yao2024tau,
title={$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains},
author={Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan},
year={2024},
eprint={2406.12045},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2406.12045}
}