In an increasingly complex digital landscape, conversational agents offer a way to bridge the gap between human needs and machine capabilities. These agents can assist us in navigating digital interfaces, support decision-making, and help accomplish goals that require contextual understanding.
Most existing approaches operate in a one-sided way, executing commands without checking back for clarification. This can lead to errors and unmet expectations, especially when precise details matter. Conversational agents add a layer of communication, creating a back-and-forth that mirrors human conversation and lets users guide the agent toward the outcome they actually want.
ReSpAct, or "Reason, Speak, Act," is a novel framework that builds on the ReAct approach to make interactions more human-centered. ReSpAct takes agent systems a step further by embedding conversation directly into the decision-making process, enabling agents to act autonomously while remaining responsive to user needs. It combines reasoning with conversational prompts so that agents can clarify uncertainties, seek user feedback, and adapt strategies dynamically. For example, if an agent faces ambiguity, it can proactively ask, "Should I look for a blue sweater in the men's section, or do you have another preference?" This human-in-the-loop approach not only reduces mistakes but also builds user trust and satisfaction by keeping users engaged in real time.
The ReSpAct framework empowers LLM-based agents to execute tasks more accurately by blending autonomous reasoning with natural language interaction. When an agent is unclear about a user's request, it can ask a question or share its reasoning rather than acting on assumptions. This dynamic engagement means fewer misunderstandings and more effective task completion, as seen in tasks like booking accommodations or navigating e-commerce platforms.
ReSpAct expands the typical action space of LLMs by adding "speak" actions, allowing agents to ask for guidance, confirm details, and receive feedback during tasks. This means that instead of assuming and potentially making errors, the agent can ask, “Is this what you had in mind?” before proceeding. This interactive loop lets agents adapt on the fly, making ReSpAct ideal for complex tasks in interactive environments.
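To make this concrete, here is a minimal sketch of what such an interaction loop could look like. This is not the authors' implementation: the llm call, the env.step interface, and the "think:" / "speak:" / "act:" step prefixes are assumptions made purely for illustration.

def respact_episode(task, env, llm, max_steps=30):
    # Sketch of a ReSpAct-style loop: at each step the model chooses to
    # think (reason privately), speak (ask the user), or act (use the environment).
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm("\n".join(history)).strip()
        history.append(step)
        if step.startswith("think:"):
            # Internal reasoning only; nothing is sent to the environment or the user.
            continue
        if step.startswith("speak:"):
            # A "speak" action: pose the question to the user and feed the
            # answer back into the context for the next step.
            reply = input(step[len("speak:"):].strip() + " ")
            history.append(f"user: {reply}")
        elif step.startswith("act:"):
            # A regular environment action, e.g. search[...] or take[...].
            observation, done = env.step(step[len("act:"):].strip())
            history.append(f"observation: {observation}")
            if done:
                break
    return history

Here llm stands for any function that maps the current trajectory to the model's next step, and env.step is assumed to return an observation string plus a done flag.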
The result is a task-oriented conversational agent that asks questions, requests feedback, and adapts its strategy based on user input.
We test ReSpAct in three settings: AlfWorld (Embodied Decision-Making), WebShop (Web Agent Decision-Making), and MultiWOZ (Task-Oriented Dialogue). Each of these environments presents unique challenges, requiring the agent to balance reasoning, dialogue, and action in different ways. To evaluate ReSpAct effectively, we design human-annotated trajectories with sparse occurrences of reasoning traces and dialogues, allowing the language model to autonomously decide when to "think," "speak," or "act." This design simulates a real-world scenario where agents must not only follow instructions but also navigate uncertainties and dynamically adapt to the user's needs.
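As an illustration of what such a trajectory can look like, below is a hypothetical, hand-written WebShop-style exemplar (the actual annotated trajectories and prompt formats used in the paper may differ). Reasoning and dialogue appear only sparsely, at the points where they are genuinely useful, and the model is expected to imitate this behavior when deciding on its own next step.

Task: Buy a machine-washable wool sweater under $50.
act: search[machine washable wool sweater]
observation: 1. merino crewneck, $45 (navy, gray)   2. wool-blend cardigan, $62
think: Item 1 fits the budget, but the user did not specify a color.
speak: The $45 merino crewneck comes in navy or gray. Which would you like?
user: Navy, please.
act: click[item 1]
act: click[navy]
act: click[buy now]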
We evaluate ReSpAct in AlfWorld, a synthetic environment based on the TextWorld framework and aligned with the embodied ALFRED benchmark. AlfWorld presents six task categories, such as finding hidden objects, moving items, and using objects together (e.g., heating a potato in a microwave). In this setting, ReSpAct enables agents to ask contextually relevant questions, provide status updates, and seek clarification to navigate tasks more effectively. Compared to a baseline ReAct agent, which relies on internal reasoning without user feedback, ReSpAct demonstrates notable improvements. It achieves an 87.3% success rate across tasks, outperforming ReAct's 80.6% by effectively integrating dialogue into its decision-making process.
In our analysis, we observed how ReSpAct changes the way agents handle tasks. While ReAct agents tend to jump straight into actions, ReSpAct introduces more "thinking" and "speaking" moments. ReSpAct agents show a sharp reduction in invalid actions, down to just 3% from ReAct's 13%. Invalid actions are wasted steps, like trying to open an already open door, that don't contribute to task success.
By embedding conversational checkpoints within its reasoning process, ReSpAct addresses longstanding issues like error propagation, task ambiguity, and rigid agent behavior. In interactive environments like WebShop and AlfWorld, ReSpAct's approach not only reduces errors during task completion but also enables agents to align more closely with user intent by actively involving users in decision-making. This responsiveness transforms the agent into a collaborative partner rather than a passive tool.
@misc{dongre2024respactharmonizingreasoningspeaking,
  title={ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents},
  author={Vardhan Dongre and Xiaocheng Yang and Emre Can Acikgoz and Suvodip Dey and Gokhan Tur and Dilek Hakkani-Tür},
  year={2024},
  eprint={2411.00927},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.00927},
}
This work was supported in part by Other Transaction award HR0011249XXX from the U.S. Defense Advanced Research Projects Agency (DARPA) Friction for Accountability in Conversational Transactions (FACT) program and has benefited from the Microsoft Accelerate Foundation Models Research (AFMR) grant program, through which leading foundation models hosted by Microsoft Azure and access to Azure credits were provided to conduct the research.