τ-Bench

Description

τ-Bench is an environment for evaluating language agents in dynamic, multi-turn conversations with simulated human users. It is based on the τ-bench benchmark by Sierra: agents interact with a user played by a GPT-4o user simulator while following domain-specific policies and using API tools correctly. The environment includes two domains: airline reservation management (τ-airline) and retail order management (τ-retail).

Capabilities

Multi-turn conversational interaction with simulated users
Following complex domain-specific policies and business rules
Using API tools for flight booking, order management, and user authentication
Stateful database operations with consistent rule-following

Compute Requirements

Agents are given a standard environment with no special compute requirements.

License

MIT.

Tasks

There is one split in this environment:

Test: 165 tasks total
- τ-airline: 50 airline reservation tasks
- τ-retail: 115 retail order management tasks

Tasks are annotated with ground-truth action sequences and expected database states.

Reward Structure

This is a multi-turn environment with a user simulator. The agent converses with a GPT-4o simulated user, making tool calls as needed. When the conversation ends (signaled by "###STOP###"), the environment replays the ground-truth actions against the original database state and compares a SHA-256 hash of the resulting state with a SHA-256 hash of the agent's final database state. Reward is 1.0 if the hashes match and 0.0 otherwise, so there is no partial credit.
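The replay-and-compare check described above can be sketched as follows. This is a minimal illustration, not the environment's actual grader: the canonical JSON serialization, the `replay` and `compute_reward` helpers, and the toy `cancel_reservation` implementation are all assumptions made for the example.

```python
import copy
import hashlib
import json


def state_hash(db: dict) -> str:
    """SHA-256 of a canonical JSON dump (assumed serialization, keys sorted)."""
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay(initial_db: dict, actions: list, tools: dict) -> dict:
    """Apply the ground-truth actions to a fresh copy of the initial database."""
    db = copy.deepcopy(initial_db)
    for action in actions:
        tools[action["name"]](db, **action["kwargs"])
    return db


def compute_reward(initial_db, ground_truth_actions, agent_final_db, tools) -> float:
    """Binary reward: 1.0 only if both final states hash identically."""
    expected_db = replay(initial_db, ground_truth_actions, tools)
    return 1.0 if state_hash(expected_db) == state_hash(agent_final_db) else 0.0


# Hypothetical toy tool, standing in for the real airline implementation.
def cancel_reservation(db, reservation_id):
    db["reservations"][reservation_id]["status"] = "cancelled"


initial = {"reservations": {"R1": {"status": "booked"}}}
truth = [{"name": "cancel_reservation", "kwargs": {"reservation_id": "R1"}}]
tools = {"cancel_reservation": cancel_reservation}

agent_ok = {"reservations": {"R1": {"status": "cancelled"}}}
agent_bad = copy.deepcopy(initial)  # the agent never cancelled anything

reward_ok = compute_reward(initial, truth, agent_ok, tools)
reward_bad = compute_reward(initial, truth, agent_bad, tools)
```

Because only the final state is hashed, read-only calls such as `get_user_details` or `think` do not affect the reward in this sketch, while any extra or missing write does.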

Data

Task data includes user profiles, flight databases, product catalogs, and order histories stored as JSON files. Airline data includes flights, reservations, and users. Retail data includes products, orders, and users. Task data is stored on the OpenReward platform.

Tools

The environment exposes two classes (Airline and Retail), each with its own set of domain-specific tools plus a shared `respond` tool for conversing with the simulated user.

Airline tools (15):

Tool	Description
`book_reservation`	Book a new flight reservation with passenger and payment details.
`calculate`	Perform mathematical calculations.
`cancel_reservation`	Cancel an existing reservation.
`get_reservation_details`	Retrieve reservation information.
`get_user_details`	Retrieve user profile and payment methods.
`list_all_airports`	List available airports with IATA codes.
`respond`	Send a message to the simulated user and receive their reply.
`search_direct_flight`	Search for direct flights between airports on a date.
`search_onestop_flight`	Search for flights with one stopover.
`send_certificate`	Send a travel certificate/voucher to a user.
`think`	Internal reasoning (no state change).
`transfer_to_human_agents`	Escalate to human support.
`update_reservation_baggages`	Modify baggage count on a reservation.
`update_reservation_flights`	Change flights in a reservation.
`update_reservation_passengers`	Modify passenger list on a reservation.

Retail tools (17):

Tool	Description
`calculate`	Perform mathematical calculations.
`cancel_pending_order`	Cancel a pending order.
`exchange_delivered_order_items`	Exchange items in a delivered order.
`find_user_id_by_email`	Authenticate user by email.
`find_user_id_by_name_zip`	Authenticate user by name and zip code.
`get_order_details`	Retrieve order information.
`get_product_details`	Retrieve product details with variants and pricing.
`get_user_details`	Retrieve user profile.
`list_all_product_types`	List all product categories.
`modify_pending_order_address`	Change shipping address for pending order.
`modify_pending_order_items`	Modify items in a pending order (one-time only).
`modify_pending_order_payment`	Change payment method for pending order.
`modify_user_address`	Update user's default address.
`respond`	Send a message to the simulated user and receive their reply.
`return_delivered_order_items`	Return items from a delivered order.
`think`	Internal reasoning (no state change).
`transfer_to_human_agents`	Escalate to human support.

Time Horizon

τ-Bench is a multi-turn environment. Agents engage in dynamic conversations with a simulated user, making API calls as needed to fulfill the user's request while following domain policies.
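A rough sketch of one episode, assuming a simple loop in which the agent either calls a domain tool or calls `respond`, and the episode ends when the user's reply contains the stop token. The `ScriptedUser` class, `run_episode` function, toy policy, and toy tool are hypothetical stand-ins; the real environment uses a GPT-4o simulator and its own tool implementations.

```python
STOP_TOKEN = "###STOP###"


class ScriptedUser:
    """Hypothetical stand-in for the user simulator: replays canned replies."""

    def __init__(self, replies):
        self._replies = iter(replies)

    def reply(self, agent_message: str) -> str:
        # Once the script runs out, end the conversation.
        return next(self._replies, STOP_TOKEN)


def run_episode(policy, user, tools, first_message, max_steps=30):
    """Run agent steps until the user's reply contains the stop token."""
    transcript = []
    observation = first_message
    for _ in range(max_steps):
        name, kwargs = policy(observation, transcript)
        if name == "respond":
            observation = user.reply(kwargs["message"])
        else:
            observation = tools[name](**kwargs)
        transcript.append((name, observation))
        if name == "respond" and STOP_TOKEN in observation:
            break
    return transcript


# Toy tool and policy, for illustration only.
def get_order_details(order_id):
    return {"order_id": order_id, "status": "pending"}


def toy_policy(observation, transcript):
    if not transcript:
        return "get_order_details", {"order_id": "#W1"}
    status = transcript[0][1]["status"]
    return "respond", {"message": f"Your order is {status}."}


user = ScriptedUser(["Thanks, that is all. " + STOP_TOKEN])
transcript = run_episode(
    toy_policy,
    user,
    {"get_order_details": get_order_details},
    first_message="Hi, what is the status of order #W1?",
)
```

The `max_steps` cap is a design choice of the sketch, guarding against a conversation that never reaches the stop token.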

Environment Difficulty

Model	τ-airline	τ-retail
Claude Opus 4	59.6%	81.4%
o3-high	52.0%	73.9%
o4-mini-high	49.2%	71.8%
gpt-oss-120b	—	67.8%
o3-mini-high	32.4%	57.6%

Other Environment Requirements

OpenAI API key: Required for the GPT-4o user simulator and grading. Pass it via `secrets={"openai_api_key": "..."}`.
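One way to build that `secrets` mapping without hard-coding the key is to read it from the process environment. This is a small sketch: the `build_secrets` helper is hypothetical, and reading from `OPENAI_API_KEY` is an assumption based on the variable name conventionally used by the OpenAI SDK. Only the `openai_api_key` dictionary key comes from this document.

```python
import os


def build_secrets(env=None) -> dict:
    """Return the secrets mapping, failing early if the key is missing."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")  # assumed variable name
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {"openai_api_key": key}


# Example with an explicit mapping instead of the real process environment.
secrets = build_secrets({"OPENAI_API_KEY": "sk-example"})
```

The resulting dictionary is what would be passed as the `secrets=` argument when creating the environment session.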

Safety

Agents in τ-Bench interact with simulated users in fictional airline and retail domains. The environment does not present direct safety risks.

Citations

@misc{yao2024tau,
  title={$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains},
  author={Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan},
  year={2024},
  eprint={2406.12045},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2406.12045}
}

Compute Configuration

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320
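As a worked example of the per-second rate above, the following sketch converts it into a cost for a given wall-clock duration. The `episode_cost` helper and the example durations are illustrative assumptions; the figures cover only the environment server rate listed here, and any OpenAI API usage for the user simulator is presumably billed separately.

```python
from decimal import Decimal

# Per-second environment rate from the table above (sandbox not configured).
ENVIRONMENT_COST_PER_SECOND = Decimal("0.0000320")


def episode_cost(seconds: int) -> Decimal:
    """Environment cost in dollars for a run lasting `seconds` seconds."""
    return ENVIRONMENT_COST_PER_SECOND * seconds


ten_minutes = episode_cost(600)   # 0.0192 dollars
one_hour = episode_cost(3600)     # 0.1152 dollars
```

`Decimal` is used instead of `float` so that small per-second rates multiply out without binary rounding noise.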
