Devil’s Advocate:
Anticipatory Reflection for LLM Agents
Abstract
In this work, we introduce a novel approach that equips LLM agents with introspection, enhancing consistency and adaptability in solving complex tasks.
Our approach prompts LLM agents to decompose a given task into manageable subtasks (i.e., to make a plan), and to continuously introspect upon the suitability and results of their actions. We implement a three-fold introspective intervention:
1) anticipatory reflection on potential failures and alternative remedies before action execution,
2) post-action alignment with subtask objectives and backtracking with remedy to ensure utmost effort in plan execution, and
3) comprehensive review upon plan completion for future strategy refinement.
By deploying and experimenting with this methodology, a zero-shot approach, within WebArena for practical tasks in web environments, our agent achieves a success rate of 23.5%, outperforming existing zero-shot methods by 3.5%.
The experimental results suggest that our introspection-driven approach not only enhances the agent’s ability to navigate unanticipated challenges through a robust mechanism of plan execution, but also improves efficiency, reducing by 45% the number of trials and plan revisions needed to complete a task.
1 Introduction
Two roads diverged in a yellow wood,
And sorry I could not travel both
Then took the other, as just as fair,
And having perhaps the better claim

Robert Frost
The enduring appeal of Frost’s emblematic poem, “The Road Not Taken,” resides not just in its poetic elegance, but also in the profound lesson it imparts about decision-making.
As we stand at the crossroads of a choice, it is a daunting challenge to assess probable outcomes and choose a course that best aligns with our objectives.
This task becomes even more formidable when Large Language Model (LLM) agents [9, 29, 18] have to navigate complex scenarios unfolding in real time, e.g., solving tasks in web environments [11, 27, 2, 32], conducting simulated science experiments [23], and solving embodied household tasks [17].
Indeed, LLM agent decision-making has witnessed enhancement by post-hoc reflection and correction [16, 19], coupled with adaptive planning [20, 15], where the agents learn from past successes and failures while concurrently mapping out flexible strategies.
However, frequent shifts in plans, albeit a mere inconvenience for humans, can lead to disorientation for AI agents.
This may produce confusion, a standstill, or even an infinite loop of failure, which substantiates the importance of thoroughly executing a set plan with utmost effort before resorting to a plan revision.
Therefore, this paper puts forward a methodology aimed at achieving an optimal balance between consistency and adaptability. This critical equilibrium mirrors the resilience and agility expected of a capable system: prepared for curveballs, but unwavering in the execution of its plan.
In this paper, we introduce a novel approach that integrates introspection into the fabric of LLM agents. This approach enables agents to continuously reflect on their actions, thereby stimulating a learning process that dynamically optimizes exploration paths and enhances robust decision-making under uncertainty. Our introspective intervention focuses on three principal dimensions:
1. Anticipatory reflection before action execution (similar to a devil’s advocate);
2. Post-action evaluation and backtracking with remedy when necessary, to ensure the outcome aligns with subtask objectives;
3. An extensive review upon plan completion to generate finer plans for subsequent trials.
We implement this introspective methodology within WebArena [32], a comprehensive web environment featuring 812 tasks in five scenarios: online shopping, e-commerce management, social discussion forums, maps, and software development platforms. Experimental results demonstrate that our approach, which is zero-shot, significantly outperforms existing zero-shot methods while improving efficiency, paving the way for a new paradigm of intelligent systems that are more consistent, adaptable, and effective at solving problems. (Code to reproduce our results will be released.)
2 Related Works
In this paper, we develop and expand upon several key themes within the realm of natural language processing, with a specific focus on the integration of action generation, planning, and reflection in the construction of LLM agents.
Action Generation: LLMs have been employed in tasks requiring decision-making or action generation and have proven useful as agent-controlling policies in embodied environments [9, 8, 3, 21, 33]. They have also demonstrated effectiveness in text-based environments [11, 17, 12], where techniques like ReAct [29] have shown notable benefits. Despite its success, ReAct’s limitation lies in its inability to adjust to changes in the environment. Several improvements [13, 16] have been proposed to counter these limitations, advocating for self-reflection to enhance decision-making and reasoning. However, these techniques primarily aim to improve single plans or trajectories without considering alternative actions, and can therefore steer the plan in the wrong direction.
Position Bias Mitigation: While comparing answer choices is generally effective, large language models used for action generation are not without flaws. They can exhibit bias, especially towards the first (or sometimes second) answer they see, regardless of its quality. This is known as position bias [30, 22]. Our method mitigates this bias by asking follow-up questions that challenge the model’s own answer.
Planning: Extensive research has explored the potential of LLMs in task planning [4, 15, 20, 25, 5, 6]. The concept of decoupling planning and execution in formulating LLM agents has been validated through numerous paradigms such as ReWOO [26], ADaPT [15], Structured Self-Reflection [10], and DEFS [24]. Nonetheless, these methods exhibit a deficiency in establishing a resilient mechanism for plan execution, with agents frequently revisiting and revising their plans following each instance of adverse environmental feedback, often due to inaccurately executed actions. Our approach, conversely, emphasizes executing a previously defined plan with unwavering effort before considering any modifications. This guarantees a more stable and consistent problem-solving process. To implement this, the factor of tree search becomes crucial for exploring the best solutions. Past approaches, including ToT [28], RAP [7], LATS [31], AdaPlanner [20], and ToolChain* [34], have incorporated tree search techniques in identifying the optimal route to the desired solution. However, our approach distinguishes itself by engaging the LLM in preparing alternate solutions in anticipation of impending failures, ensuring more comprehensive consideration in action generation.
Reflection and Self-refinement: Reflection and refinement techniques have advanced significantly through works such as Reflexion [16], AdaPlanner [20], and AutoEval [14]. Our methodology further enhances this by incorporating an anticipatory reflection mechanism that operates before each action rather than performing post-action reflection. This approach simplifies exploration by expediting remedial action and reducing extensive backtracking and serial plan revisions, thereby improving efficiency in the overall task handling process.
3 Method
Given a task and an environment with which the LLM agent interacts, our objective is to enable the agent to systematically and adaptively complete the task through introspective methods. We first present how we decompose the task and generate actions for each state in the environment in section 3.1 and section 3.2. Then we introduce the introspection mechanism in section 3.3.
3.1 Task Decomposition and Planning
The first step involves decomposing the task into subtasks in a sequential manner, forming a plan. This decomposition is achieved through an LLM generation process. Let $\pi$ denote the agent’s plan generation function, prompted by the task $T$, a description of the initial state $s_0$, and any experience from past trials, i.e., history $H$:

$P = \pi(T, s_0, H)$  (1)

Here, the plan $P$ is parsed into a sequence of ordered subtasks:

$P = (p_1, p_2, \ldots, p_n)$  (2)

where $p_i$ represents the $i$-th subtask in the plan, and $n$ is the number of subtasks. For instance, Fig. 1 shows a plan with 5 subtasks for solving a task in WebArena. The distribution of WebArena tasks based on the number of subtasks within each task is illustrated in Fig. 2. This also reflects the difficulty of the tasks in WebArena, where most tasks take 4-8 steps to complete.
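As a concrete illustration of this parsing step, the sketch below splits an LLM-generated plan into ordered subtasks. It assumes the LLM is prompted to emit one numbered subtask per line; the function name and the example plan text are illustrative, not the paper’s implementation:

```python
import re

def parse_plan(plan_text: str) -> list[str]:
    """Parse an LLM-generated plan into an ordered list of subtasks.

    Assumes the prompt asks for one numbered subtask per line,
    e.g. "1. Navigate to the orders page" (assumed format).
    """
    subtasks = []
    for line in plan_text.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            subtasks.append(match.group(1).strip())
    return subtasks

# Hypothetical plan for a WebArena-style refund task.
plan_text = """\
1. Navigate to the orders page
2. Locate the order containing the Bluetooth speaker
3. Open the contact form
4. Draft the refund message
5. Submit the message"""

subtasks = parse_plan(plan_text)
```

Lines that do not match the numbered pattern (headers, commentary) are simply skipped, which keeps the parser robust to chatty LLM output.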
3.2 State and Action Representation
Let $s_t \in \mathcal{S}$ denote the current state of the environment at time $t$, where $\mathcal{S}$ is the set of all possible states. From state $s_t$, let $a_t \in \mathcal{A}$ denote the next action taken by the agent, where $\mathcal{A}$ is the set of all possible actions. The next action is generated based on the specific subtask $p_i$ currently being addressed, the current state $s_t$, and the action history $H_t$ up to time $t$:

$a_t = \rho(p_i, s_t, H_t)$  (3)

where $\rho$ denotes the agent’s action generation function. Let $H_t$ denote the history of actions taken up to time $t$:

$H_t = (h_1, h_2, \ldots, h_{t-1})$  (4)

where $h_t$ is a textual description of action $a_t$, along with useful information learned from this action execution, generated with function $\delta$. The history would later be used to answer questions in the task or to revise the agent’s plan. $\delta$ accepts as input the state before the action, the action itself, and the state after the action:

$h_t = \delta(s_t, a_t, s_{t+1})$  (5)

When the state observation is too long to fit in the context window of an LLM, the state is first summarized by the LLM into a shorter description before being fed to $\delta$ (e.g., this operation is commonly needed for solving web navigation tasks on content management platforms). Note that a subtask can involve several actions, and thus $t$ does not necessarily equal $i$. Given the possibility that the task can be finished at some time before the completion of all subtasks, whenever the agent arrives at a new state, we ask the agent to check two things: whether the subtask is finished ($c^{\mathrm{sub}}_t = 1$; when the agent determines that a subtask is non-essential to solving the task, we also set $c^{\mathrm{sub}}_t = 1$), and whether the task is finished ($c^{\mathrm{task}}_t = 1$):

$c^{\mathrm{sub}}_t = \nu(p_i, s_{t+1})$  (6)
$c^{\mathrm{task}}_t = \nu(T, s_{t+1})$  (7)

where $\nu$ denotes the function for checking whether an objective is fulfilled. If $c^{\mathrm{sub}}_t = 1$, the agent moves on to solve the next subtask $p_{i+1}$; whereas when the agent determines $c^{\mathrm{task}}_t = 1$, it finishes the current trial regardless of whether the plan is finished.
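The sequential execution with per-state completion checks can be sketched as a short loop. All the callables are assumed stand-ins for the paper’s LLM-backed functions (action generation, environment step, and objective check); the toy usage below treats states as integers so the control flow is visible:

```python
def run_subtasks(subtasks, generate_action, execute, check_done, task, state):
    """Sequential plan execution with completion checks at every new state.

    generate_action(subtask, state, history) -> action  # LLM action policy
    execute(state, action) -> next_state                # environment step
    check_done(objective, state) -> bool                # LLM-based check
    All three are assumed callables, not the paper's exact implementations.
    """
    history = []
    for subtask in subtasks:
        while not check_done(subtask, state):
            action = generate_action(subtask, state, history)
            state = execute(state, action)
            history.append(action)
            # The task may be finished before all subtasks complete.
            if check_done(task, state):
                return state, history
    return state, history

# Toy illustration: states are integers, each action adds 1, and an
# objective (subtask or task) means "reach this number".
final, history = run_subtasks(
    subtasks=[2, 4],
    generate_action=lambda sub, st, h: 1,
    execute=lambda st, a: st + a,
    check_done=lambda obj, st: st >= obj,
    task=10,
    state=0,
)
```

Note that the task-level check runs after every action, matching the paper’s observation that a trial can end before the plan does.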
3.3 Introspective Mechanisms
The sequential action generation above can potentially execute the plan and solve the task already. Nevertheless, without proper introspection and adaptation, the agent might be stuck at a certain unsolvable subtask or go into a loop of failure when unexpected problems emerge. Thus, we introduce three introspective mechanisms to enhance our LLM agent’s problem-solving ability below.
3.3.1 Anticipatory Reflection (Devil’s Advocate)
The first layer of introspection occurs before each action execution. The agent anticipates potential failures and comes up with $m$ alternative remedies $\bar{a}_t^1, \ldots, \bar{a}_t^m$. Each remedy action is generated by prompting the LLM with a follow-up question after the first-attempt action $a_t$:

"If your answer above is not correct, instead, the next action should be:"

We use $\psi$ to denote the generation of remedy actions, which accepts as input the subtask $p_i$, the current state $s_t$, the action history $H_t$, and the LLM-predicted next action $a_t$ at first attempt:

$(\bar{a}_t^1, \ldots, \bar{a}_t^m) = \psi(p_i, s_t, H_t, a_t)$  (8)

If later found necessary, the agent can go back to state $s_t$ to modify the original action and try a remedy action, ensuring a smooth plan execution. For example, in Fig. 3, we show a state observation where all three clicking actions align with the objective of the current subtask. The execution of any of these actions would complete the subtask; yet the agent might need to return to this state if it later determines that the action predicted at first attempt was incorrect. (The action generated at first attempt still gets the highest priority, i.e., it is the last one to be pushed to the stack so it can be popped and executed first; see line 18 in Alg. 1.)
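A minimal sketch of the devil’s-advocate step: the follow-up question wording comes from the paper, but the prompt layout, the `llm` callable, and the action strings are assumptions. The second helper shows the stack discipline in which the first-attempt action is pushed last so it executes first:

```python
def anticipate_remedies(llm, subtask, state, history, first_action, m=2):
    """Ask the LLM for m alternative actions in case the first attempt fails.

    `llm` is an assumed callable(prompt) -> str. The follow-up question is
    the one quoted in the text; the surrounding prompt layout is illustrative.
    """
    prompt = (
        f"Subtask: {subtask}\nState: {state}\nHistory: {history}\n"
        f"Proposed action: {first_action}\n"
        "If your answer above is not correct, instead, the next action should be:"
    )
    return [llm(prompt) for _ in range(m)]

def push_candidates(stack, state, first_action, remedies):
    """Push (state, action) candidates onto the DFS stack.

    The first-attempt action is pushed last so it is popped and executed
    first, mirroring the priority described for Alg. 1.
    """
    for remedy in reversed(remedies):
        stack.append((state, remedy))
    stack.append((state, first_action))

# Toy usage with a stub LLM that always proposes the same remedy.
remedies = anticipate_remedies(
    lambda prompt: "click [12]", "find order", "s0", [], "click [7]", m=2
)
stack = []
push_candidates(stack, "s0", "click [7]", remedies)
```

Popping the stack then yields the first-attempt action before any remedy, while remedies remain available for later backtracking.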
3.3.2 Post-action Evaluation and Backtracking
The second introspective mechanism kicks in after the execution of each action. Here, the agent evaluates whether the action and the resulting state align with the subtask objective. This introspective function, denoted as $\phi$, takes as input the state before the action $s_t$, the action $a_t$, the resulting state $s_{t+1}$, and the current subtask $p_i$:

$e_t = \phi(s_t, a_t, s_{t+1}, p_i)$  (9)

Here $e_t$ denotes the evaluation score reflecting how well the state aligns with the subtask objective. It is a binary signal indicating whether the agent needs to stop, backtrack to some previous state, and take an alternative action, if the execution of $a_t$ does not meet the objective of the current subtask. In our experiments with web environments, the URL of the webpage is a useful piece of information recorded as part of $h_t$. When backtracking, we can easily navigate back to the URL. However, the element information on the URL might differ from the state we first encountered upon arriving at that page. To address this, we prompt the LLM to map the recorded element in the action to the new element with which we want to interact, if necessary.
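The backtracking step for web environments can be sketched as follows. Both `browser_goto` and `llm` are assumed callables (e.g., a browser-automation navigation call and an LLM query); the record layout and the element-mapping prompt are illustrative, not the paper’s exact prompt:

```python
def backtrack(browser_goto, llm, record, new_observation):
    """Return to a previously visited state and remap the target element.

    `record` holds the URL and the element reference saved in the history
    when the state was first visited. After re-navigating, element ids in
    the fresh accessibility tree may differ, so the LLM is asked to map
    the recorded element onto the new observation. Illustrative sketch.
    """
    browser_goto(record["url"])  # URLs in the history make backtracking cheap
    prompt = (
        f"Previously recorded element: {record['element']}\n"
        f"Current page observation:\n{new_observation}\n"
        "Return the id of the element on the current page that corresponds "
        "to the recorded element."
    )
    return llm(prompt).strip()

# Toy usage with stubs standing in for the browser and the LLM.
visited = []
element = backtrack(
    visited.append,
    lambda prompt: " [42] ",
    {"url": "http://shop.example/orders", "element": "[17] link 'Orders'"},
    "<new accessibility tree>",
)
```

The remapping query is only issued when the remedy action actually references a page element, so in practice this call is conditional.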
3.3.3 Plan Revision
The third introspective mechanism occurs upon plan failure, i.e., when the stack is empty and $c^{\mathrm{task}}_t = 0$. Now the agent performs a thorough review of the actions executed and the notes taken, and refines its future plan based on identified problems:

$P' = \pi(T, s_0, H)$  (10)

Here, $P'$ is the new plan after reflecting on the past failed trials, whose records are appended to the history $H$. The agent then re-enters the plan execution phase and starts a new episode.
Through these three layers of introspection, our agent is more capable of navigating the complexities of unforeseen circumstances and addressing tasks, bringing us a significant stride closer to achieving truly autonomous, adaptable, and intelligent systems. By structuring the problem in this manner, we have established a clear framework for enabling LLM agents to perform tasks autonomously and adaptively through introspection. Alg. 1 shows a pseudo code demonstration of our approach.
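The three mechanisms combine into a depth-first execution loop. The compact sketch below is an illustration of the overall control flow, not the paper’s exact Alg. 1: every LLM-backed function is passed in as an assumed callable, and `execute(state, action)` is assumed to restore `state` (e.g., by re-navigating to its URL) before acting:

```python
def run_trial(plan, propose, anticipate, execute, evaluate, check_done, task, s0):
    """Depth-first plan execution with anticipatory remedies and backtracking.

    propose / anticipate / evaluate / check_done are assumed LLM-backed
    callables; returning (None, history) signals plan failure, which would
    trigger plan revision in the full method.
    """
    stack, state, history = [], s0, []
    for subtask in plan:
        while not check_done(subtask, state):
            action = propose(subtask, state, history)
            for remedy in reversed(anticipate(subtask, state, history, action)):
                stack.append((state, remedy))
            stack.append((state, action))  # first attempt is popped first
            while stack:
                src, act = stack.pop()
                nxt = execute(src, act)
                if evaluate(src, act, nxt, subtask):  # post-action alignment
                    state = nxt
                    history.append(act)
                    break
            else:
                return None, history  # stack exhausted: plan failed
            if check_done(task, state):
                return state, history
    return state, history

# Toy run: integer states, actions add their value, evaluation always passes.
final, history = run_trial(
    plan=[1, 2],
    propose=lambda sub, st, h: 1,
    anticipate=lambda sub, st, h, a: [2],
    execute=lambda st, a: st + a,
    evaluate=lambda src, act, nxt, sub: True,
    check_done=lambda obj, st: st >= obj,
    task=100,
    s0=0,
)
```

Because remedies from earlier states stay on the stack, a failed evaluation can fall back not just to the current state’s alternatives but to earlier branch points, which is the backtracking behavior described above.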
4 Experiments
In this section, we demonstrate how introspection enhances consistency and adaptability of LLM agents in solving complex tasks in web environments. We first introduce the experimental setup for evaluation (section 4.1), followed by evaluation results (section 4.2). Detailed error analysis is provided in section 4.3, which highlights the directions for future endeavor.
4.1 Experimental Setup
Live environments: We evaluate our proposed method in the simulated web environments of WebArena [32], a dataset of human-annotated web browsing tasks designed to evaluate the ability of LLMs to perform complex, real-world actions on the internet. The 812 tasks in WebArena involve five websites: an online shopping website, a software development website, a social forum platform, a map, and an e-commerce management platform; and these tasks can be categorized into three classes: information seeking tasks, site navigation and content & config tasks, and unachievable tasks. Though WebArena provides visual observation (screenshots), in this work we use the text observation only. The observation at each step is the accessibility tree of the webpage, and the elements in the accessibility tree are all within the current viewport of a 1280×720 screen. The action space of our LLM agent includes actions that interact with the environment: click, type, scroll, goto, go_back, go_forward, and also a note_down action that takes down useful snippets/summaries for answering information-seeking questions.
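The action space above can be represented as a small typed parser. The action-string grammar used here (`click [42]`, `type [15] refund request`, etc.) is an assumed simplification, not WebArena’s exact format:

```python
from dataclasses import dataclass

# The seven action kinds listed in the text.
VALID_KINDS = {"click", "type", "scroll", "goto",
               "go_back", "go_forward", "note_down"}

@dataclass
class Action:
    kind: str
    arg: str = ""

def parse_action(text: str) -> Action:
    """Parse an action string like 'click [42]' into a typed Action.

    Actions without arguments (go_back, go_forward) get an empty arg.
    """
    kind, _, arg = text.strip().partition(" ")
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown action: {kind}")
    return Action(kind, arg)

a = parse_action("click [42]")
b = parse_action("go_back")
```

Validating against a fixed action vocabulary also gives a cheap guard against malformed LLM output before it reaches the environment.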
Baselines: We employ gpt-4-0613 (https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4) [1] with a context window of 8k tokens to build the agents and compare our method with three other agent construction strategies: planning and sequential decision making (Plan + Act w/o reflexion), similar to ReWOO [26]; planning and sequential decision making with reflection (Plan + Act), similar to AdaPlanner [20]; and tree-search-based planning with reflection, similar to LATS [31]. In all methods, we set the upper limit on the number of actions to 30, i.e., after the agent executes 30 actions for a given task, it has to stop. In all three methods, we adopt the same prompts for action generation, plan generation, and evaluation to ensure a fair comparison (detailed prompts are shown in the Appendix). In our experiments, we set the LLM temperature to 1.0 and max_tokens to 512, and keep all other parameters as default.
Metrics: We follow the evaluation metric Success Rate in [32], and count the number of actions per trial and the number of plan revisions per task. To determine whether a task is successfully completed, the exact_match metric is used for some site navigation and information seeking tasks (URL string matching is used to determine if a task is finished). However, this can sometimes be overly stringent. For instance, two URLs can display the same content (under ‘electronics’, the category id of ‘headphones’ is 60): both are correct answers, but only an exact match with the gold answer is considered correct. To address this issue, we manually review the evaluation process and correct such cases in our results.
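To illustrate why exact string matching is overly stringent, two URLs can render the same page while differing as strings. The sketch below implements a more forgiving comparison, normalizing query-parameter order and trailing slashes; the example URLs are illustrative, not taken from WebArena’s gold answers:

```python
from urllib.parse import urlsplit, parse_qsl

def same_page(url_a: str, url_b: str) -> bool:
    """Compare two URLs up to query-parameter order and trailing slashes.

    A sketch of a laxer match than exact string equality; real evaluation
    would also need site-specific equivalences (e.g. category-id aliases).
    """
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (
        (a.netloc, a.path.rstrip("/")) == (b.netloc, b.path.rstrip("/"))
        and sorted(parse_qsl(a.query)) == sorted(parse_qsl(b.query))
    )

# Same page, different string: query order and trailing slash differ.
match = same_page(
    "http://shop.example/electronics/headphones?cat=60&page=1",
    "http://shop.example/electronics/headphones/?page=1&cat=60",
)
```

Exact string matching would reject the pair above even though both URLs show identical content, which is precisely the failure mode the manual review corrects.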
4.2 Results
The experimental results, depicted in Fig. 5, demonstrate the efficacy of our introspection-driven approach in enhancing the consistency and adaptability of LLM agents in web environments. We compare the success rates of various agent construction strategies across multiple episodes. Our method, anticipatory reflection (AR), consistently outperforms the others, achieving a success rate of 23.5% after seven episodes, closely followed by LATS with 22.7%. In contrast, the Plan + Act method shows gradual improvement, reaching 19.8%, but remains significantly lower than the tree-search-based AR and LATS methods. Taking a closer look at the last few rounds of LATS reveals marginal improvements due to the homogeneous generated actions through direct sampling. In comparison, AR benefits from the “devil’s advocate” approach, enabling more thorough planning and execution due to introspective follow-up questions. This trend underscores the importance of incorporating introspection mechanisms for both plan execution and revision, highlighting their critical role in achieving more consistent and efficient results.
Table 1: Average number of actions per trial (first and last) and plan revisions per task.

| Method | # of Actions (First Trial) | # of Actions (Last Trial) | # of Plan Revisions |
|---|---|---|---|
| Plan+Act | 4.01 | 4.47 | 2.03 |
| LATS | 6.08 | 6.45 | 1.16 |
| AR | 6.39 | 7.07 | 0.64 |
Further insights can be gleaned from Table 1, which compares the average number of actions in the first and last trials across different methods. Our AR method shows an increase in the average number of actions from 6.39 in the first trial to 7.07 in the last trial, indicating a robust learning and adaptation process. In comparison, the average number of actions in the first trial of the Plan+Act method is only 4.01, suggesting that it stops at an early stage without completing full plan execution. Thus, our method effectively leverages a greater number of actions to achieve better outcomes, thereby reducing the number of plan revisions by 45% and improving overall efficiency.
4.3 Error Analysis
The subsequent sections shed light on an analysis of errors we observed from the agent’s behavior when executing tasks. Two key areas have been identified for detailed discussion: an agent’s occasional inability to fully learn from past failures, and inefficiencies in solving specific kinds of tasks due to a sequential planning scheme.
4.3.1 Agent takes only partial lessons from past failures
One category of errors we notice is that the agent does not take the full lesson from past failures when generating a new plan. As illustrated in Fig. 6(a), the agent is at the final step of drafting a refund message for a Bluetooth speaker, after a series of steps taken to seek information about the order. From the screen, we know that the agent should consolidate all the gathered information and type one piece of text into the (only) box titled “What’s on your mind?”. However, as depicted in Fig. 6(b), while some improvements were made by adding the date of purchase and a more detailed explanation in the revised plan, the agent still failed to optimize the input process, repeating separate typing actions for fields that do not exist. This inefficiency in the agent’s behavior showcases the need for either an LLM with stronger reasoning ability or a better mechanism to solicit more comprehensive and accurate reflection.
4.3.2 Sequential planning cannot solve all tasks
In our analysis, we observed a recurrent error pertaining to the design of the agent’s planning process. Currently, the proposed methodology structures a plan as a sequence of tasks that are executed in a specific order. This approach, effective in a decent amount of use cases, seems to falter when faced with tasks necessitating more sophisticated logic. Specifically, tasks that mandate implementing a function encapsulating several actions, employing a loop construct, or those executed repetitively, tend to challenge the model’s current configuration. For example:
- List out reviewers, if exist, who mention about average print quality
- Give me the SKU of the products that have 1-3 units left.
- Like all submissions created by CameronKelsey in subreddit earthporn.
The ability to process these tasks effectively would necessitate the incorporation of additional cognitive constructs into the planning model—e.g., loops, repetitive actions, or encapsulation of a group of actions into callable functions. Though taking notes can help the agent eliminate wrong choices, these systemic extensions would add crucial capabilities to the web agent, significantly enhancing its navigation and problem-solving competence in realistic web environments. Moreover, while the current agent can succeed in the limited search space of simple tasks, it often struggles to review and introspect upon more extensive descriptive tasks requiring dynamic problem-solving. By addressing these limitations in future work, i.e., effectively converting textual description of a plan into robust execution of callable functions and loops, we believe that the reasoning capability of our agent can be substantially improved, leading to better outcomes in understanding and solving tasks that involve dynamic cognition in web environments.
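One of the constructs suggested above, a loop over a collection of items, can be sketched by expanding a looping subtask into concrete per-item steps. This is purely illustrative of the proposed extension, not an implemented feature of the agent; the step templates and item names are hypothetical:

```python
def for_each(items, subplan):
    """Expand a looping subtask into concrete per-item subtasks.

    E.g. "for every product with 1-3 units left, note its SKU" becomes one
    open/note pair per product. This loop construct is exactly what the
    current flat, sequential planner lacks. Illustrative sketch only.
    """
    return [step.format(item=item) for item in items for step in subplan]

# Hypothetical expansion for an inventory task.
steps = for_each(
    ["SKU-1001", "SKU-1002"],
    ["open product {item}", "note_down stock for {item}"],
)
```

Encapsulating such expansions as callable plan nodes would let the planner express repetition without enumerating every step up front.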
5 Conclusion
In this work, we introduce a novel introspective methodology that significantly enhances the problem-solving capabilities of LLMs in complex environments, as demonstrated through comprehensive evaluations in the WebArena setting. Our approach strategically decomposes tasks into actionable subtasks and incorporates a three-tiered introspection process, which includes anticipatory reflection, robust post-action evaluation, and episode-level plan revision. This setup not only allows LLM agents to adapt their strategies in real time but also fosters long-term learning, reducing the need for frequent interventions as experience accumulates. The successful application of our introspective methodology in the WebArena suggests its potential transferability to other domains that demand dynamic decision-making such as autonomous driving, healthcare, and interactive customer services. By enabling LLM agents to proactively contemplate potential failures, evaluate actions post-execution, and continuously refine their strategy based on experiential insights, our approach equips AI systems with a human-like strategic thinking capability.
Broader Impact
Looking forward, the integration of multi-modal data inputs could further enhance the contextual understanding and decision-making accuracy of these agents. The principles and findings from our approach provide a robust foundation for future research in AI, particularly in aspects of autonomous decision-making, learning efficiency, and adaptability. As AI continues to integrate into diverse aspects of decision-making, embedding introspective capabilities will be essential to ensure these systems operate not only with precision but with an understanding akin to strategic human cognition.
Ethics Statement
As the capabilities of LLM agents enhance and their deployment in real-world applications increases, it is crucial to address potential ethical concerns, particularly regarding data privacy, bias, and transparency. Our work focuses on improving agent introspection to enhance task performance and decision-making explanations, aiming to develop more transparent and trustworthy AI systems. We emphasize the importance of human oversight to monitor and mitigate unforeseen consequences and encourage the responsible use of this technology for societal benefit. By promoting continuous evaluation and fair practices, we seek to minimize biases and ensure that the deployment of these agents does not exacerbate social inequalities. Furthermore, we are committed to optimizing computational resources to reduce the environmental impact, advocating for sustainable AI practices.
Limitations
Despite substantial progress made with our current design, limitations persist that inhibit optimal performance. Notably, the agent lacks a full learning mechanism to capitalize on past failures when generating a new plan, resulting in inefficient execution and recurring mistakes. Furthermore, while the sequential planning approach is effective for simpler tasks, it falls short for more sophisticated operations, such as those requiring encapsulated actions or loop constructs. Additionally, the agent struggles with tasks that expand beyond a simple search space, suggesting obstacles in handling dynamic problem-solving. Last but not least, our agent requires a significant amount of LLM generation (i.e., API calls), and consequently substantial time and computational resources, which dents its efficiency. Therefore, future work needs to concentrate on improving the agent’s ability to fully learn from prior shortcomings, adapt to handle complex tasks, enhance dynamic problem-solving capabilities, and optimize time and resource utilization with more efficient LLM calling.
References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Deng et al. [2023] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web.
- Driess et al. [2023] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. 2023. PaLM-E: An Embodied Multimodal Language Model. In arXiv preprint arXiv:2303.03378.
- Dror et al. [2023] Rotem Dror, Haoyu Wang, and Dan Roth. 2023. Zero-Shot On-the-Fly Event Schema Induction. In Findings of the Association for Computational Linguistics: EACL 2023, pages 705–725, Dubrovnik, Croatia. Association for Computational Linguistics.
- Guan et al. [2023] Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. 2023. Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning.
- Gur et al. [2024] Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. In The Twelfth International Conference on Learning Representations.
- Hao et al. [2023] Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. 2023. Reasoning with Language Model is Planning with World Model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, Singapore. Association for Computational Linguistics.
Appendix
Prompt for Plan Generation
Imagine that you are imitating humans doing a task on a website step by step. You can click an element with the mouse, scroll up or down, go to a certain URL or go back to previous page, or type some text with the keyboard (e.g., click(), scroll(), goto(), go_back(), and type() functions in playwright). One step means one operation within any of the mentioned actions.
You are within a sandbox and only have access to the following websites to work with:
- An online shopping website (OneStopShop): {webarena_root}:7770
- An e-commerce management website (Magento): {webarena_root}:7780/admin
- A Reddit website (Postmill): {webarena_root}:9999
- A GitLab website: {webarena_root}:8023
- A map website (OpenStreetMap): http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:3000
Notes:
1. If you want to use the search function, you don’t need to click on the search bar. You can directly use “type [element_id] [things_to_type]”, and generally afterwards, you don’t need to click the search button (by default, the command contains an ENTER at the end).
2. You can assume that you have signed in to your account (we have set up the cookies, so login is not needed).
The website that you will be working with is:
{WEBSITE INTRO}
Please follow these specific instructions to solve tasks:
{INSTRUCTION}
Here is a more detailed description of the starting screen:
{STARTING SCREEN DESCRIPTION}
Now, based on the information above, what should be the steps to achieve the following goal (please give me a list of textual description of playwright actions, starting with ‘List’):
{TASK}
For your reference, here are some experiences from previous failed trials (please consider the following information to generate a better plan):
{FAILED PLAN}
Past experience:
{HISTORY}
To be successful in generating a new plan, you need to provide a list (1, 2, 3, …), in which each item is a natural language description of one playwright action that is necessary to complete the task (e.g., click on the ‘Account’ button; scroll down; use the search bar to search for iPhone 13). You should use the information from past experiences to avoid unnecessary steps!
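The prompt above asks the model for a reply that begins with ‘List’ and enumerates one action per numbered item. A minimal sketch of how such a reply might be split into individual plan steps (the helper name and the exact reply format handled here are assumptions for illustration, not the paper’s code):

```python
import re

def parse_plan(reply: str) -> list[str]:
    """Extract numbered plan steps ("1. ...", "2) ...") from a model reply."""
    steps = []
    for line in reply.splitlines():
        # Accept "1." or "1)" style numbering; keep only the step text.
        m = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if m:
            steps.append(m.group(1))
    return steps

reply = (
    "List\n"
    "1. use the search bar to search for iPhone 13\n"
    "2. click on the first result\n"
    "3. scroll down\n"
)
print(parse_plan(reply))
```

Each returned step then becomes the {STEP} placeholder fed to the action-generation prompt.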
Prompt for Action Generation
I am in a sandbox and only have access to the following websites (i.e., no access to external websites like www.reddit.com):
- An online shopping website (OneStopShop): {webarena_root}:7770
- An e-commerce management website (Magento): {webarena_root}:7780/admin
- A Reddit website (Postmill): {webarena_root}:9999
- A GitLab website: {webarena_root}:8023
- A map website (OpenStreetMap): http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:3000
Now I’m trying to complete a task on a website.
The task is:
{TASK}
The plan to complete this task is:
{PLAN}
I have executed the following actions:
{HISTORY}
And now I’m at this step: {STEP}
Here is the screen I am looking at:
{OBS}
I have taken down the following notes:
{NOTES}
What should be the next action to complete this step in my plan (only give one action)?
Note:
- If the next action is to click, please indicate the element id in [] (format: click [element_id]).
- If the next action is to scroll, please indicate the direction in [] (format: scroll [up or down]).
- If you need to navigate to a URL, please indicate the URL in [] (format: goto [url]).
- If you need to go back to the previous page, please use go_back.
- If the next action is to type, please indicate both element id and the things to type in [] (format: type [element_id] [things to type]).
- If you want to note down something, use this format: note_down [things to note down].
The next action is:
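The action formats enumerated above define a small, regular grammar, so the model’s reply can be parsed before execution. The following is an illustrative sketch of such a parser (not the paper’s implementation; the function name and return shape are assumptions):

```python
import re

# Patterns follow the action formats listed in the prompt above.
ACTION_PATTERNS = {
    "click":     re.compile(r"click \[(?P<element_id>[^\]]+)\]"),
    "scroll":    re.compile(r"scroll \[(?P<direction>up|down)\]"),
    "goto":      re.compile(r"goto \[(?P<url>[^\]]+)\]"),
    "type":      re.compile(r"type \[(?P<element_id>[^\]]+)\] \[(?P<text>[^\]]+)\]"),
    "note_down": re.compile(r"note_down \[(?P<note>[^\]]+)\]"),
}

def parse_action(raw: str):
    """Map a model reply like 'click [1234]' to a (verb, kwargs) pair."""
    raw = raw.strip()
    if raw == "go_back":
        return "go_back", {}
    for verb, pattern in ACTION_PATTERNS.items():
        m = pattern.search(raw)
        if m:
            return verb, m.groupdict()
    raise ValueError(f"unrecognized action: {raw!r}")

print(parse_action("type [305] [iPhone 13]"))
# → ('type', {'element_id': '305', 'text': 'iPhone 13'})
```

Dispatching each parsed verb to the corresponding Playwright call (e.g., `page.click`, `page.goto`, `page.go_back`) would then be a small switch on the returned verb.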
Prompt for Objective Alignment
Imagine that you are imitating humans doing a task on a website step by step.
You are currently working on this step:
{STEP}.
The step above is one of the steps in the following plan:
{PLAN}.
From Screen 1, you executed an action and then arrived at Screen 2.
The action you executed was:
{ACTION}.
Screen 1:
{OBS1}.
Screen 2:
{OBS2}.
Now describe what this action is about in one sentence, starting with ‘The action is to’.
Does this action align with the goal of the following step (i.e., are we moving in the right direction; answer YES or NO)?
{STEP}
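The reply to this prompt carries both a one-sentence description of the action and a YES/NO verdict; the verdict is what decides whether the agent proceeds or backtracks with a remedy. A minimal, hypothetical way to screen the reply (assuming the verdict appears as an uppercase standalone token, as the prompt requests):

```python
import re

def alignment_verdict(reply: str) -> bool:
    """Return True if the model judged the action aligned (YES), False for NO."""
    # Take the last standalone uppercase YES/NO token, since the description
    # sentence preceding the verdict may contain either word in other casing.
    tokens = re.findall(r"\b(YES|NO)\b", reply)
    if not tokens:
        raise ValueError("no YES/NO verdict found in reply")
    return tokens[-1] == "YES"

print(alignment_verdict("The action is to click the 'Account' button. YES"))
# → True
```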
Prompt for Task / Subtask Completion Evaluation
Imagine that you are imitating humans doing a task on a website step by step.
You are asked to solve the following task:
{TASK}
You made the following plan to solve it:
{PLAN}
To reach the current screen, you have previously executed the following actions:
{HISTORY}
You have taken down a few notes after each action as follows:
{NOTES}
And here is the accessibility tree of the current screen you are looking at:
{OBS}
Look at the screen, the task, and the actions you executed, and think thoroughly: is the task completed now?
If the task is completed, answer YES.
If the task is not yet completed (meaning further actions are yet to be executed), answer NO.
Prompt for Answer Delivery
Imagine that you are imitating humans doing a task on a website step by step.
You are asked to solve the following task:
{TASK}
To reach the current screen, you have previously executed the following actions:
{HISTORY}
You have taken down the following notes (to help you answer the question eventually) after each action:
{NOTES}
And here is the accessibility tree of the current screen you are looking at:
{OBS}
Based on the above information, answer the question in the task (starting with ###Answer).
Prompt for Element Mapping
I want to interact with an element with element id: {element_id} in the following screen:
{OBS1}
Now if I want to click on the same element in the following screen, what should be the element id now?
{OBS2}
New element id is:
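Since the prompt ends with “New element id is:”, the model’s completion should contain little more than the id itself, possibly wrapped in brackets or trailed by punctuation. A hypothetical extraction helper (the completion formats handled here are assumptions):

```python
import re

def parse_element_id(completion: str) -> str:
    """Extract the element id from the model's completion of 'New element id is:'."""
    # Strip brackets so "[1743]" and "1743" are handled the same way.
    cleaned = completion.replace("[", " ").replace("]", " ")
    m = re.search(r"\w+", cleaned)
    if m is None:
        raise ValueError(f"no element id in completion: {completion!r}")
    return m.group(0)

print(parse_element_id("[1743]"))  # → 1743
```

The remapped id can then be substituted into the pending action before it is re-executed on the new screen.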