ReSpAct: Harmonizing Reasoning, Speaking, and Acting

Towards Building Large Language Model-Based Conversational AI Agents

Conversational AI Lab
University of Illinois Urbana-Champaign

Conversational Agents: Why Do They Matter?

In an increasingly complex digital landscape, conversational agents offer a way to bridge the gap between human needs and machine capabilities. These agents can assist us in navigating digital interfaces, support decision-making, and help accomplish goals that require contextual understanding.


Most existing approaches operate in a one-sided way: they execute commands without checking back for clarification. This can lead to errors and unmet expectations, especially in cases where precise details matter. Conversational agents, by contrast, add a layer of communication, creating a back-and-forth that mirrors human conversation and allows users to guide the agent toward optimal outcomes.

Why ReSpAct?

ReSpAct, or "Reason, Speak, Act," is a novel framework that builds on the ReAct approach to make interactions more human-centered. ReSpAct takes agent systems a step further by embedding conversation directly into the decision-making process, enabling agents to act autonomously while remaining responsive to user needs. It combines reasoning with conversational prompts, allowing agents to clarify uncertainties, seek user feedback, and adapt strategies dynamically. For example, if an agent faces ambiguity, it can proactively ask, "Should I look for a blue sweater in the men's section, or do you have another preference?" This human-in-the-loop approach not only reduces mistakes but also builds user trust and satisfaction by keeping users engaged in real time.


The ReSpAct framework empowers LLM-based Agents to execute tasks more accurately by blending autonomous reasoning with natural language interaction. When an agent is unclear on a user request, it can ask a question or share its reasoning, preventing it from acting based on assumptions. This dynamic engagement means fewer misunderstandings and more effective completion of tasks, as seen in tasks like booking accommodations or navigating e-commerce platforms.

ReSpAct Approach

ReSpAct expands the typical action space of LLMs by adding "speak" actions, allowing agents to ask for guidance, confirm details, and receive feedback during tasks. This means that instead of assuming and potentially making errors, the agent can ask, “Is this what you had in mind?” before proceeding. This interactive loop lets agents adapt on the fly, making ReSpAct ideal for complex tasks in interactive environments.
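To make the control flow concrete, here is a minimal sketch of what a ReSpAct-style loop could look like. It is illustrative only: the Think:/Speak:/Act: prefix convention and the query_llm, ask_user, and env helpers are assumptions of this sketch, not the paper's actual implementation.

# Minimal sketch of a ReSpAct-style agent loop (illustrative; not the
# authors' code). `query_llm`, `ask_user`, and `env` are hypothetical
# stand-ins for an LLM call, a user channel, and an interactive
# environment such as AlfWorld or WebShop.

def respact_episode(env, query_llm, ask_user, max_steps=50):
    context = env.reset()          # task instruction + initial observation
    for _ in range(max_steps):
        # The model freely chooses to think, speak, or act at each step.
        step = query_llm(context)  # e.g. "Think: ...", "Speak: ...", "Act: ..."
        if step.startswith("Think:"):
            # Reasoning traces are appended to the context but have no
            # external effect.
            context += "\n" + step
        elif step.startswith("Speak:"):
            # "Speak" actions route a question or status update to the
            # user and fold the reply back into the context.
            reply = ask_user(step.removeprefix("Speak:").strip())
            context += f"\n{step}\nUser: {reply}"
        elif step.startswith("Act:"):
            # Environment actions behave exactly as in ReAct; here we
            # assume env.step returns an observation and a done flag.
            obs, done = env.step(step.removeprefix("Act:").strip())
            context += f"\n{step}\nObservation: {obs}"
            if done:
                break
        else:
            context += "\nObservation: Invalid step format."
    return context

The only structural difference from a plain ReAct loop is the Speak branch, which sends a message to the user and folds the reply back into the context before the next step.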


Task-oriented conversational agents that ask questions, request feedback, and adapt their strategies based on user input.

Performance

We test ReSpAct in three settings: AlfWorld (embodied decision-making), WebShop (web agent decision-making), and MultiWOZ (task-oriented dialogue). Each of these environments presents unique challenges, requiring the agent to balance reasoning, dialogue, and action in different ways. To evaluate ReSpAct effectively, we design human-annotated trajectories with sparse occurrences of reasoning traces and dialogues, allowing the language model to autonomously decide when to "think," "speak," or "act." This design simulates a real-world scenario where agents must not only follow instructions but also navigate uncertainties and dynamically adapt to the user's needs.
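To illustrate what such sparse annotation might look like, the snippet below shows a hypothetical AlfWorld-style demonstration in which most steps are plain actions and only a few carry explicit Think or Speak annotations. The exact prompt format here is assumed for illustration, not copied from the paper.

# Hypothetical sparsely annotated few-shot trajectory (format assumed
# for illustration). Most steps are plain actions; "Think" and "Speak"
# appear only where they add value.
EXAMPLE_TRAJECTORY = """\
Task: Put a clean apple on the dining table.
Act: go to countertop 1
Observation: On the countertop 1, you see an apple 1 and a knife 1.
Act: take apple 1 from countertop 1
Observation: You pick up the apple 1.
Think: The apple must be cleaned before it goes on the table.
Act: go to sinkbasin 1
Act: clean apple 1 with sinkbasin 1
Observation: You clean the apple 1 using the sinkbasin 1.
Speak: I washed the apple; should I place it on the dining table now?
User: Yes, please.
Act: go to diningtable 1
Act: put apple 1 in/on diningtable 1
Observation: You put the apple 1 in/on the diningtable 1.
"""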

AlfWorld

We evaluate ReSpAct in AlfWorld, a synthetic environment based on the TextWorld framework and aligned with the embodied ALFRED benchmark. AlfWorld presents six task categories, such as finding hidden objects, moving items, and using objects together (e.g., heating a potato in a microwave). In this setting, ReSpAct enables agents to ask contextually relevant questions, provide status updates, and seek clarification to navigate tasks more effectively. Compared to a baseline ReAct agent, which relies on internal reasoning without user feedback, ReSpAct demonstrates notable improvements. It achieves an 87.3% success rate across tasks, outperforming ReAct's 80.6% by effectively integrating dialogue into its decision-making process.


Comparison of (a) ReAct and (b) ReSpAct task-specific success rates (%) in AlfWorld


In our analysis, we observed how ReSpAct changes the way agents handle tasks. While ReAct agents tend to jump straight into actions, ReSpAct introduces more "thinking" and "speaking" moments. ReSpAct agents also show a sharp reduction in invalid actions, down to just 3% from ReAct's 13%. Invalid actions are wasted steps, like trying to open an already open door, that don't contribute to task success.

MultiWOZ

We evaluate ReSpAct in the MultiWOZ environment, where the agent assists with booking hotels, restaurants, and taxis by understanding user preferences, asking clarifying questions, and making informed decisions. Compared to a ReAct baseline that operates without user feedback, ReSpAct achieves higher scores on two metrics: Inform Rate, which checks whether all requested information (e.g., hotel address and price) was correctly provided, and Success Rate, which further requires that all user goals, such as completing a booking, are fully met. Using GPT-4o-mini, ReSpAct scores 72.2% on Inform and 51.8% on Success, compared to the baseline's 66.7% and 48.8%, respectively.
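Read as checks, the two metrics are nested: Success implies Inform. A rough per-dialogue sketch, with assumed data structures (the official MultiWOZ evaluator is more involved), might look like this:

# Rough sketch of per-dialogue Inform/Success checks. The set-based
# data structures are assumptions of this sketch, not the benchmark's
# actual evaluation code.

def inform_ok(provided: set, requested: set) -> bool:
    # Inform: every slot the user asked about (e.g., address, price)
    # was actually provided.
    return requested <= provided

def success_ok(provided: set, requested: set,
               goals_met: list[bool]) -> bool:
    # Success: Inform holds AND every user goal (e.g., a confirmed
    # booking) was satisfied.
    return inform_ok(provided, requested) and all(goals_met)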



Comparison of Inform and Success scores for MultiWOZ using GPT-4o-mini and Llama-405B-instruct models.

WebShop

We evaluate ReSpAct in the WebShop environment, where the agent navigates a simulated online shopping platform containing over 1.18 million real-world products to identify and purchase items based on user preferences. The agent actively searches for products matching user requirements, asks clarifying questions to refine search results, and makes informed purchasing decisions. To assess performance, we use two metrics: (1) average score, which reflects the percentage of desired attributes met by the chosen product across all episodes, and (2) success rate, which measures the percentage of episodes in which the chosen product fully satisfies the user's requirements. Our results show that ReSpAct achieves an average score of 32.7% and a success rate of 12% with a user simulator. With human interaction, ReSpAct reaches an 85.8% average score and a 50% success rate, underscoring the critical role of interactive questioning and feedback in improving accuracy and user satisfaction with task-oriented web-shopping agents.
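A back-of-the-envelope version of these two metrics, under the simplifying assumption that an episode reduces to the set of desired attributes and the set the purchased product satisfies (the real WebShop reward also accounts for options and price), might look like:

# Sketch of the two WebShop metrics under assumed inputs: each episode
# is a (wanted, got) pair of attribute sets. Illustrative only.

def webshop_metrics(episodes):
    scores = [len(got & wanted) / len(wanted) for wanted, got in episodes]
    avg_score = 100 * sum(scores) / len(scores)           # average score (%)
    success_rate = 100 * sum(s == 1.0 for s in scores) / len(scores)
    return avg_score, success_rate

# Example: one episode fully satisfied, one partially.
eps = [({"blue", "cotton", "medium"}, {"blue", "cotton", "medium"}),
       ({"blue", "cotton"}, {"blue"})]
print(webshop_metrics(eps))  # (75.0, 50.0)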



Comparison of average score and success rate (SR) on 100 WebShop test trajectories using GPT-4o-mini

Implications and the Future of Conversational Agents

By embedding conversational checkpoints within its reasoning process, ReSpAct addresses longstanding issues like error propagation, task ambiguity, and rigid agent behavior. In interactive environments like WebShop and AlfWorld, ReSpAct's approach not only reduces errors during task completion but also enables agents to align more closely with user intent by actively involving users in decision-making. This responsiveness transforms the agent into a collaborative partner rather than a passive tool.

BibTeX

@misc{dongre2024respactharmonizingreasoningspeaking,
  title={ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents},
  author={Vardhan Dongre and Xiaocheng Yang and Emre Can Acikgoz and Suvodip Dey and Gokhan Tur and Dilek Hakkani-Tür},
  year={2024},
  eprint={2411.00927},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.00927},
}

Acknowledgements

This work was supported in part by Other Transaction award HR0011249XXX from the U.S. Defense Advanced Research Projects Agency (DARPA) under the Friction for Accountability in Conversational Transactions (FACT) program. It also benefited from the Microsoft Accelerate Foundation Models Research (AFMR) grant program, through which access to leading foundation models hosted by Microsoft Azure and to Azure credits was provided to conduct the research.