License: CC BY 4.0
arXiv:2405.16334v3 [cs.AI] 29 May 2024

Devil’s Advocate:
Anticipatory Reflection for LLM Agents

Haoyu Wang¹   Tao Li²   Zhiwei Deng²   Dan Roth¹   Yang Li²
¹UPenn   ²Google DeepMind
{why16gzl, danroth}@seas.upenn.edu   {tlinlp, zhiweideng, liyang}@google.com
Work done during internship at Google DeepMind.
Abstract

In this work, we introduce a novel approach that equips LLM agents with introspection, enhancing consistency and adaptability in solving complex tasks. Our approach prompts LLM agents to decompose a given task into manageable subtasks (i.e., to make a plan), and to continuously introspect upon the suitability and results of their actions. We implement a three-fold introspective intervention: 1) anticipatory reflection on potential failures and alternative remedies before action execution, 2) post-action alignment with subtask objectives and backtracking with remedy to ensure utmost effort in plan execution, and 3) comprehensive review upon plan completion for future strategy refinement. By deploying and experimenting with this zero-shot methodology within WebArena for practical tasks in web environments, our agent achieves a success rate of 23.5%, outperforming existing zero-shot methods by 3.5%. The experimental results suggest that our introspection-driven approach not only enhances the agent’s ability to navigate unanticipated challenges through a robust mechanism of plan execution, but also improves efficiency, reducing by 45% the number of trials and plan revisions needed to complete a task.

1 Introduction

Two roads diverged in a yellow wood,

And sorry I could not travel both

…

Then took the other, as just as fair,

And having perhaps the better claim

Robert Frost

The enduring appeal of Frost’s emblematic poem, “The Road Not Taken,” resides not just in its poetic elegance, but also in the profound lesson it imparts about decision-making. As we stand at the crossroads of a choice, it is a daunting challenge to assess probable outcomes and choose a course that best aligns with our objectives. This task becomes even more formidable when Large Language Model (LLM) agents [9, 29, 18] have to navigate complex scenarios unfolding in real time, e.g., solving tasks in web environments [11, 27, 2, 32], conducting simulated science experiments [23], and solving embodied household tasks [17].

Indeed, LLM agent decision-making has witnessed enhancement by post-hoc reflection and correction [16, 19], coupled with adaptive planning [20, 15], where the agents learn from past successes and failures while concurrently mapping out flexible strategies. However, frequent shifts in plans, albeit a mere inconvenience for humans, can lead to disorientation for AI agents. This may produce confusion, a standstill, or even an infinite loop of failure, which substantiates the importance of thoroughly executing a set plan with utmost effort before resorting to a plan revision. Therefore, this paper puts forward a methodology aimed at achieving an optimal balance between consistency and adaptability. This critical equilibrium mirrors the resilience and agility that is anticipated of a capable system that is prepared for curveballs but unwavering in the execution of its plan.

In this paper, we introduce a novel approach that integrates introspection into the fabric of LLM agents. This approach enables agents to continuously reflect on their actions, thereby stimulating a learning process that dynamically optimizes exploration paths and enhances robust decision-making under uncertainty. Our introspective intervention focuses on three principal dimensions:

  1. Anticipatory reflection before action execution (similar to a devil’s advocate);

  2. Post-action evaluation and backtracking with remedy when necessary, to ensure the outcome aligns with subtask objectives;

  3. An extensive review upon plan completion to generate finer plans for subsequent trials.

We implement this introspective methodology within WebArena [32], a comprehensive web environment featuring 812 tasks in five scenarios: online shopping, e-commerce management, social discussion forums, maps, and software development platforms. Experimental results demonstrate that our approach, which is zero-shot, significantly outperforms existing zero-shot methods while improving efficiency, paving the way for a new paradigm of intelligent systems that are more consistent, adaptable, and effective at solving problems. (Code to reproduce our results will be released.)

2 Related Works

In this paper, we develop and expand upon several key themes within the realm of natural language processing, with a specific focus on the integration of action generation, planning, and reflection in the construction of LLM agents.

Action Generation: LLMs have been employed in tasks requiring decision-making or action generation and have proven useful as agent-controlling policies in embodied environments [9, 8, 3, 21, 33]. They have also demonstrated effectiveness in text-based environments [11, 17, 12], where techniques like ReAct [29] have shown notable benefits. Despite its success, ReAct’s limitation lies in its inability to adjust to changes in the environment. Several improvements [13, 16] have been proposed to counter these limitations, advocating for self-reflection to enhance decision-making and reasoning. However, these techniques primarily aim to improve single plans or trajectories without considering alternative actions, and may therefore modify the plan in the wrong direction.

Position Bias Mitigation: While comparing answer choices is generally effective, large language models used for action generation are not without flaws. They can exhibit bias, especially towards the first (or sometimes second) answer they see, regardless of its quality. This is known as position bias [30, 22]. Our method mitigates this bias by asking follow-up questions that challenge its own answer.

Planning: Extensive research has explored the potential of LLMs in task planning [4, 15, 20, 25, 5, 6]. The concept of decoupling planning and execution in formulating LLM agents has been validated through numerous paradigms such as ReWOO [26], ADaPT [15], Structured Self-Reflection [10], and DEFS [24]. Nonetheless, these methods exhibit a deficiency in establishing a resilient mechanism for plan execution, with agents frequently revisiting and revising their plans following each instance of adverse environmental feedback, often due to inaccurately executed actions. Our approach, conversely, emphasizes executing a previously defined plan with unwavering effort before considering any modifications. This guarantees a more stable and consistent problem-solving process. To implement this, the factor of tree search becomes crucial for exploring the best solutions. Past approaches, including ToT [28], RAP [7], LATS [31], AdaPlanner [20], and ToolChain* [34], have incorporated tree search techniques in identifying the optimal route to the desired solution. However, our approach distinguishes itself by engaging the LLM in preparing alternate solutions in anticipation of impending failures, ensuring more comprehensive consideration in action generation.

Reflection and Self-refinement: Reflection and refinement techniques have advanced significantly through works such as Reflexion [16], AdaPlanner [20], and AutoEval [14]. Our methodology further enhances this by incorporating an anticipatory reflection mechanism that operates before each action rather than performing post-action reflection. This approach simplifies exploration by expediting remedial action and reducing extensive backtracking and serial plan revisions, thereby improving efficiency in the overall task handling process.

3 Method

Given a task $\mathcal{T}$ and an environment $\mathcal{E}$ with which the LLM agent $G$ interacts, our objective is to enable the agent to systematically and adaptively complete the task through introspective methods. We first present how we decompose the task and generate an action for each state in the environment in section 3.1 and section 3.2. We then introduce the introspection mechanisms in section 3.3.

3.1 Task Decomposition and Planning

The first step involves decomposing the task $\mathcal{T}$ into subtasks in a sequential manner, forming a plan. This decomposition is achieved through an LLM generation process. Let $G_{\text{plan}}$ denote the agent’s plan generation function, prompted by the task $\mathcal{T}$, a description of the initial state $S_0$, and any experience from past trials, i.e., history $\mathcal{H}$:

$$\mathcal{P} \sim G_{\text{plan}}(\mathcal{T}, S_0, \mathcal{H}). \tag{1}$$

Here, the plan $\mathcal{P}$ is parsed into a sequence of ordered subtasks:

$$\mathcal{P} = (\tau_1, \tau_2, \ldots, \tau_N), \tag{2}$$

where $\tau_i$ represents the $i$-th subtask in the plan, and $N$ is the number of subtasks. For instance, Fig. 1 shows a plan with 5 subtasks for solving a task in WebArena. The distribution of WebArena tasks based on the number of subtasks within each task is illustrated in Fig. 2. This also reflects the difficulty of the tasks in WebArena, where most tasks take 4-8 steps to complete.

Plan for task: What is the color configuration of the picture frame I bought in Nov 2022:
1. Click on the ‘My Account’ link to access your account details.
2. Click on the ‘Order History’ link to view your past orders.
3. Scroll down the page until you find the order from November 2022.
4. Click on the order details link for the order from November 2022.
5. Scroll down to the product details section to find the color configuration of the picture frame.

Figure 1: An example plan with 5 subtasks, generated by GPT-4.
Figure 2: Distribution of WebArena tasks based on the number of subtasks within each task.
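To make the planning step concrete, the following is a minimal Python sketch of how $G_{\text{plan}}$ could be realized: the task, a description of the starting screen, and any history from past trials are composed into a prompt, and the LLM’s numbered-list reply is parsed into the ordered subtasks $(\tau_1, \ldots, \tau_N)$. The `call_llm` helper and the prompt wording are illustrative placeholders rather than the exact implementation.

```python
import re
from typing import Callable, List

def generate_plan(task: str, initial_state: str, history: List[str],
                  call_llm: Callable[[str], str]) -> List[str]:
    """Sketch of G_plan: prompt the LLM and parse its numbered list into subtasks."""
    prompt = (
        "You are completing a task on a website step by step.\n"
        f"Task: {task}\n"
        f"Starting screen description: {initial_state}\n"
        + ("Experience from previous failed trials:\n" + "\n".join(history) + "\n" if history else "")
        + "Provide a numbered list of actions (the plan) needed to achieve the goal."
    )
    reply = call_llm(prompt)
    # Parse lines such as "1. Click on the 'My Account' link." into subtasks tau_1..tau_N.
    subtasks = [m.group(1).strip()
                for m in re.finditer(r"^\s*\d+\.\s*(.+)$", reply, flags=re.MULTILINE)]
    return subtasks
```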

3.2 State and Action Representation

Let $S_t \in \mathcal{S}$ denote the current state of the environment at time $t$, where $\mathcal{S}$ is the set of all possible states. From state $S_t$, let $a_t \in \mathcal{A}$ denote the next action taken by the agent, where $\mathcal{A}$ is the set of all possible actions. The next action is generated based on the specific subtask $\tau_i$ currently being addressed, the current state $S_t$, and the action history $\mathcal{H}_{t-1}$ up to time $t$:

$$a_t \sim G_{\text{action}}(\tau_i, S_t, \mathcal{H}_{t-1}), \tag{3}$$

where $G_{\text{action}}$ denotes the agent’s action generation function. Let $\mathcal{H}_t$ denote the history of actions taken up to time $t$:

$$\mathcal{H}_t = \{\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_t\}, \tag{4}$$

where $\hat{a}_t$ is a textual description of action $a_t$, along with useful information learned from executing it, generated with the function $G_{\text{describe}}$. The history is later used to answer questions posed in the task or to revise the agent’s plan. $G_{\text{describe}}$ accepts as input the state before the action, the action itself, and the state after the action:

$$\hat{a}_t \sim G_{\text{describe}}(S_t, a_t, S_{t+1}). \tag{5}$$

When the state observation is too long to fit in the context window of an LLM, the state is first summarized by the LLM into a shorter description before being fed to $G_{\text{describe}}$ (e.g., this operation is commonly needed for solving web navigation tasks on content management platforms). Note that a subtask can involve several actions, so $i$ does not necessarily equal $t$. Given that the task can be finished at some time $t$ before the completion of all subtasks, whenever the agent arrives at a new state we ask it to check two things: whether the subtask is finished, $\mathcal{C}_{\tau_i} \in \{0, 1\}$ (when the agent determines that a subtask is non-essential to solving the task, we also set $\mathcal{C}_{\tau_i} = 1$), and whether the task is finished, $\mathcal{C}_{\mathcal{T}} \in \{0, 1\}$:

$$\mathcal{C}_{\tau_i} \sim G_{\text{completed}}(\tau_i, S_{t+1}, \mathcal{H}_t), \tag{6}$$
$$\mathcal{C}_{\mathcal{T}} \sim G_{\text{completed}}(\mathcal{T}, S_{t+1}, \mathcal{H}_t), \tag{7}$$

where $G_{\text{completed}}$ denotes the function for checking whether an objective is fulfilled. If $\mathcal{C}_{\tau_i} = 1$, the agent moves on to solve the next subtask $\tau_{i+1}$; when the agent determines $\mathcal{C}_{\mathcal{T}} = 1$, it finishes the current trial regardless of whether the plan $\mathcal{P}$ has been completed.
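Putting Eqs. (3)-(7) together, a single environment step could look like the sketch below: the agent proposes an action for the current subtask, executes it, records a textual description of the transition in the history, and then queries the LLM for the two binary completion checks. The `env.step` interface, the `call_llm` helper, and the prompt strings are illustrative stand-ins for the actual WebArena interface and prompts.

```python
def one_step(task, subtask, state, history, env, call_llm):
    """Sketch of one act/describe/check cycle (Eqs. 3-7)."""
    # G_action: propose the next action for the current subtask.
    action = call_llm(f"Subtask: {subtask}\nState: {state}\nHistory: {history}\n"
                      "What is the next action?")
    next_state = env.step(action)  # execute the action in the environment

    # G_describe: summarize what the action did and what was learned.
    history.append(call_llm(f"Before: {state}\nAction: {action}\nAfter: {next_state}\n"
                            "Describe this action and any useful information learned."))

    # G_completed: binary checks for subtask and task completion.
    subtask_done = call_llm(f"Subtask: {subtask}\nState: {next_state}\nHistory: {history}\n"
                            "Is the subtask finished? Answer yes or no.").lower().startswith("yes")
    task_done = call_llm(f"Task: {task}\nState: {next_state}\nHistory: {history}\n"
                         "Is the task finished? Answer yes or no.").lower().startswith("yes")
    return next_state, subtask_done, task_done
```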

Algorithm 1 Introspective Agent

Input: task $\mathcal{T}$; initial observation $S_{\text{initial}}$; environment $\mathcal{E}$;
time $t = 0$; state $S_t = S_{\text{initial}}$; action $a_t = \emptyset$; plan $\mathcal{P} = \emptyset$; subtask $\tau = \emptyset$; history $\mathcal{H} = \emptyset$

1:  while $\neg G_{\text{completed}}(\mathcal{T}, \cdot)$ do
2:      $\mathcal{P} \sim G_{\text{plan}}(\mathcal{T}, S_t, \mathcal{H})$   ▷ Plan Revision
3:      $\mathrm{Stack} = [(S_t, a_t, \tau)]$
4:      while $\mathrm{Stack}$ do
5:          $(S_t', a_t, \tau) = \mathrm{Stack}.\text{pop}()$
6:          if $S_t \neq S_t'$ then go_back($S_t'$); $S_t = S_t'$   ▷ Backtracking
7:          if $\tau$ is $\emptyset$ then $\mathcal{C}_\tau = 1$; $\tau = \mathcal{P}.\text{next}()$
8:          else $S_{t+1} = \mathcal{E}(a_t)$; $\mathcal{H}.\text{add}(G_{\text{describe}}(S_t, a_t, S_{t+1}))$   ▷ Grounding
9:              $\mathcal{C}_\tau \sim G_{\text{align}}(S_t, a_t, S_{t+1}, \tau)$   ▷ Alignment with Subtask Objective
10:             if $\mathcal{C}_\tau$ then
11:                 if $G_{\text{completed}}(\mathcal{T}, S_{t+1})$ then Finished   ▷ Early Stop
12:                 if $G_{\text{completed}}(\tau, S_{t+1})$ then $\tau = \mathcal{P}.\text{next}()$   ▷ Next Subtask
13:         $t$++
14:         if $\mathcal{C}_\tau$ then $a_t \sim G_{\text{action}}(\tau, S_t)$
15:             for $r = 1$ to $R$ do
16:                 $a_t^{(r)} \sim G_{\text{remedy}}(\tau, S_t, a_t)$   ▷ Anticipatory Reflection
17:                 $\mathrm{Stack}.\text{push}((S_t, a_t^{(r)}, \tau))$
18:             $\mathrm{Stack}.\text{push}((S_t, a_t, \tau))$   ▷ Placing $a_t$ at the top of $\mathrm{Stack}$

3.3 Introspective Mechanisms

The sequential action generation above can potentially execute the plan and solve the task already. Nevertheless, without proper introspection and adaptation, the agent might be stuck at a certain unsolvable subtask or go into a loop of failure when unexpected problems emerge. Thus, we introduce three introspective mechanisms to enhance our LLM agent’s problem-solving ability below.

3.3.1 Anticipatory Reflection (Devil’s Advocate)

The first layer of introspection occurs before each action execution. The agent anticipates potential failures and comes up with $R$ alternative remedies $[a_t^1, a_t^2, \cdots, a_t^R]$. Each remedy action is generated by prompting the LLM with a follow-up question after $G_{\text{action}}$:

"If your answer above is not correct, instead, the next action should be:"

We use $G_{\text{remedy}}$ to denote the generation of remedy actions, which accepts as input the subtask $\tau_i$, the current state $S_t$, the action history $\mathcal{H}_{t-1}$, and the next action $a_t$ predicted by the LLM at the first attempt:

$$a_t^r \sim G_{\text{remedy}}(\tau_i, S_t, \mathcal{H}_{t-1}, a_t). \tag{8}$$

If later found necessary, the agent can go back to state $S_t$ and replace the original action $a_t$ with a remedy action $a_t^r$, to ensure smooth plan execution. For example, Fig. 3 shows a state observation in which all three clicking actions align with the objective of the current subtask. Executing any of them would complete the subtask; yet the agent might need to return to this state if it later determines that the action predicted at the first attempt was incorrect. (The action generated at the first attempt still gets the highest priority, i.e., $a_t$ is the last one pushed onto the stack, so it is popped and executed first; see line 18 in Alg. 1.)
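A minimal sketch of this anticipatory-reflection step is given below: after $G_{\text{action}}$ proposes $a_t$, the same context is re-prompted with the follow-up question above to obtain $R$ alternative remedy actions, which are later pushed onto the stack below $a_t$ (Alg. 1, lines 15-18). The prompt assembly and the `call_llm` helper are assumptions for illustration.

```python
def anticipatory_reflection(subtask, state, history, action, call_llm, R=2):
    """Sketch of G_remedy: ask the devil's-advocate follow-up question R times."""
    remedies = []
    for _ in range(R):
        followup = (f"Subtask: {subtask}\nState: {state}\nHistory: {history}\n"
                    f"Proposed next action: {action}\n"
                    "If your answer above is not correct, instead, the next action should be:")
        remedies.append(call_llm(followup))  # a_t^(r), an alternative to a_t
    return remedies
```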

3.3.2 Post-action Evaluation and Backtracking

The second introspective mechanism kicks in after the execution of each action. Here, the agent evaluates whether the action and the resulting state align with the subtask objective. This introspective function, denoted as $G_{\text{align}}$, is prompted with the state before the action $S_t$, the action $a_t$, the resulting state $S_{t+1}$, and the current subtask $\tau_i$:

$$\theta_t \sim G_{\text{align}}(S_t, a_t, S_{t+1}, \tau_i). \tag{9}$$

Here $\theta_t \in \{0, 1\}$ denotes the evaluation score reflecting how well the state $S_{t+1}$ aligns with the subtask objective $\tau_i$. It is a binary signal indicating whether the agent needs to stop, backtrack to some previous state, and take an alternative action $a_k^r, k \leq t$, if the execution of $a_t$ does not meet the objective of the current subtask. In our experiments with web environments, the URL of the webpage is a useful piece of information recorded as part of $S_t$. When backtracking, we can simply navigate back to that URL. However, the element information on that page might differ from the state we first encountered upon arriving at it. To address this, we prompt the LLM to map the element recorded in the action to the new element with which we want to interact, if necessary.

Figure 3: Screen observation at one step in solving the subtask: Click on the order details link for the order from November 2022. The agent might decide to click ($a_t$) on the “View Order” button of any one of the three Nov 2022 orders to see whether a picture frame was purchased in that order, and it is highly probable that backtracking will be needed to view the details of the other two orders (if the first order chosen does not contain a picture frame). In our proposed approach, the two alternative clicking actions $[a_t^1, a_t^2]$ would be pushed onto the stack before the agent executes action $a_t$.
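In practice, backtracking to a stored state can be approximated by navigating back to that state’s recorded URL and, if the page’s accessibility-tree element IDs have shifted since the state was recorded, asking the LLM to remap the element referenced in the stored action onto the freshly loaded page. The sketch below illustrates this idea; the state fields and the `env`/`call_llm` helpers are assumptions, not the exact interface.

```python
def backtrack_and_remap(env, stored_state, remedy_action, call_llm):
    """Sketch of backtracking: revisit the stored URL, then remap stale element references."""
    env.goto(stored_state["url"])          # navigate back to the recorded page
    fresh_tree = env.observe()             # the current accessibility tree may have changed
    if fresh_tree != stored_state["accessibility_tree"]:
        remedy_action = call_llm(
            f"This action was written for an earlier version of the page:\n{remedy_action}\n"
            f"Current accessibility tree:\n{fresh_tree}\n"
            "Rewrite the action so it refers to the matching element on the current page.")
    return remedy_action
```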

3.3.3 Plan Revision

The third introspective mechanism occurs upon plan failure, i.e., when the stack is empty and $\mathcal{C}_{\mathcal{T}} = 0$. Now the agent performs a thorough review of the actions executed and the notes taken, and refines its future plan based on identified problems:

$$\mathcal{P}_{\text{new}} \sim G_{\text{plan}}(\mathcal{T}, S_0, \mathcal{H}_t). \tag{10}$$

Here, $\mathcal{P}_{\text{new}}$ is the new plan after reflecting on the past failed trials. The agent then re-enters the plan execution phase and starts a new episode.

Through these three layers of introspection, our agent is better able to navigate the complexities of unforeseen circumstances and address tasks, bringing us a significant step closer to truly autonomous, adaptable, and intelligent systems. By structuring the problem in this manner, we establish a clear framework for enabling LLM agents to perform tasks autonomously and adaptively through introspection. Alg. 1 gives pseudocode for our approach.
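For readers who prefer code, the control flow of Alg. 1 can be compressed into the following Python-style sketch of the outer plan-revision loop and the inner stack-driven execution loop. The `env` helpers and the `g` bundle of LLM-backed functions ($G_{\text{plan}}$, $G_{\text{action}}$, $G_{\text{remedy}}$, $G_{\text{align}}$, $G_{\text{describe}}$, $G_{\text{completed}}$) are structural placeholders rather than a runnable agent.

```python
def run_introspective_agent(task, env, g):
    """Structural sketch of Alg. 1; `g` bundles the LLM-backed G_* functions."""
    state, history = env.reset(), []
    while not g.completed(task, state, history):              # one trial per plan
        plan = iter(g.plan(task, state, history))             # plan revision
        stack = [(state, None, None)]                         # (state, action, subtask)
        while stack:
            saved_state, action, subtask = stack.pop()
            if state != saved_state:                          # backtracking
                env.go_back_to(saved_state)
                state = saved_state
            if subtask is None:                               # first iteration of a trial
                aligned, subtask = True, next(plan, None)
            else:
                next_state = env.step(action)                 # grounding
                history.append(g.describe(state, action, next_state))
                aligned = g.align(state, action, next_state, subtask)
                if aligned:
                    if g.completed(task, next_state, history):
                        return history                        # early stop
                    if g.completed(subtask, next_state, history):
                        subtask = next(plan, None)            # next subtask
                state = next_state
            if aligned and subtask is not None:
                a = g.action(subtask, state, history)
                for r in g.remedy(subtask, state, history, a):  # anticipatory reflection
                    stack.append((state, r, subtask))
                stack.append((state, a, subtask))             # a_t is executed first
    return history
```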

Figure 4: Decision-making process of our agent in solving the task: What is the color configuration of the picture frame that I bought in Sep 2022? Before executing the predicted action, the agent asks itself a follow-up question about its decision: what if the picture frame is not in order #179? What should the alternative remedy be? After finding that order #179 contains no picture frame at all, the agent backtracks to the previous state to view order #175 and continues.

4 Experiments

In this section, we demonstrate how introspection enhances the consistency and adaptability of LLM agents in solving complex tasks in web environments. We first introduce the experimental setup (section 4.1), followed by evaluation results (section 4.2). A detailed error analysis is provided in section 4.3, which highlights directions for future work.

4.1 Experimental Setup

Live environments: We evaluate our proposed method in the simulated web environments of WebArena [32], a dataset of human-annotated web browsing tasks designed to evaluate the ability of LLMs to perform complex, real-world actions on the internet. The 812 tasks in WebArena involve five websites: an online shopping website, a software development website, a social forum platform, a map, and an e-commerce management platform. These tasks can be categorized into three classes: information-seeking tasks, site navigation and content & config tasks, and unachievable tasks. Though WebArena provides visual observations (screenshots), in this work we use the text observation only. The observation at each step is the accessibility tree of the webpage, and the elements in the accessibility tree are all within the current viewport of a 1280×720 screen. The action space of our LLM agent includes actions that interact with the environment: click, type, scroll, goto, go_back, and go_forward, plus a note_down action that records a useful snippet or summary for answering information-seeking questions.
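As an illustration of this action space, each action can be represented as a small command record like the sketch below (an illustrative format only; element IDs come from the accessibility tree, and note_down writes to the agent’s history rather than to the page):

```python
from dataclasses import dataclass

@dataclass
class Action:
    """Illustrative container for the agent's action space (not the exact command format)."""
    kind: str             # one of: click, type, scroll, goto, go_back, go_forward, note_down
    element_id: str = ""  # accessibility-tree element id, used by click/type
    text: str = ""        # text to type, URL to visit, scroll direction, or note to record

examples = [
    Action("click", element_id="1582"),
    Action("type", element_id="171", text="picture frame"),
    Action("scroll", text="down"),
    Action("goto", text="https://example.com/orders"),  # placeholder URL
    Action("note_down", text="Order #179 contains no picture frame."),
]
```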

Baselines: We employ gpt-4-0613 [1] (https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4) with a context window of 8k tokens to build the agents, and compare our method with three other agent construction strategies: planning and sequential decision making (Plan + Act w/o reflexion), similar to ReWOO [26]; planning and sequential decision making with reflection (Plan + Act), similar to AdaPlanner [20]; and tree-search-based planning with reflection, similar to LATS [31]. In all methods, we set the upper limit on the number of actions to 30, i.e., after the agent executes 30 actions for a given task, it has to stop. All methods use the same prompts for action generation $G_{\text{action}}$, plan generation $G_{\text{plan}}$, and the evaluators $G_{\text{align}}$ and $G_{\text{completed}}$ to ensure a fair comparison (detailed prompts are shown in the Appendix). In our experiments, we set the LLM temperature to 1.0 and max_tokens to 512, and keep all other parameters at their defaults.
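For reference, the decoding configuration above (gpt-4-0613, temperature 1.0, max_tokens 512, other parameters at their defaults) corresponds to an OpenAI chat-completion call along the lines of the sketch below; the surrounding helper is illustrative rather than the exact code used.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str) -> str:
    """One LLM call with the decoding settings described above."""
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        max_tokens=512,
    )
    return response.choices[0].message.content
```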

Metrics: We follow the Success Rate evaluation metric of [32], and additionally count the number of actions per trial and the number of plan revisions per task. To determine whether a task is successfully completed, the exact_match metric is used for some site navigation and information-seeking tasks. However, this can sometimes be overly stringent. For instance, two different URLs can display the same content (under ‘electronics’, the category id of ‘headphones’ is 60): both are correct answers, but only the one that exactly matches the gold answer is counted as correct, since URL string matching is used to determine whether a task is finished. To address this issue, we manually review the evaluation process and correct such cases in our results.

Figure 5: Results of different agent construction strategies on WebArena. AR is short for our method, anticipatory reflection; LATS represents our in-house implementation of the approach proposed by Zhou et al. [31]; Plan + Act is a method of decomposition of task and execution of each subtask, similar to ReWOO [26]. All three methods are equipped with plan revision (post-failure reflection).

4.2 Results

The experimental results, depicted in Fig. 5, demonstrate the efficacy of our introspection-driven approach in enhancing the consistency and adaptability of LLM agents in web environments. We compare the success rates of various agent construction strategies across multiple episodes. Our method, anticipatory reflection (AR), consistently outperforms the others, achieving a success rate of 23.5% after seven episodes, closely followed by LATS with 22.7%. In contrast, the Plan + Act method shows gradual improvement, reaching 19.8%, but remains significantly lower than the tree-search-based AR and LATS methods. A closer look at the last few rounds of LATS reveals only marginal improvements, because the actions generated through direct sampling are largely homogeneous. In comparison, AR benefits from the “devil’s advocate” approach, whose introspective follow-up questions enable more thorough planning and execution. This trend underscores the importance of incorporating introspection mechanisms for both plan execution and revision, highlighting their critical role in achieving more consistent and efficient results.

              # of Actions                  # of Plan Revisions
              First Trial    Last Trial
Plan+Act         4.01           4.47                2.03
LATS             6.08           6.45                1.16
AR               6.39           7.07                0.64

Table 1: Statistics of the trajectories of different agents solving tasks on WebArena. We report the number of actions in the first and last trial, and also the number of plan revisions, i.e., trials.

Further insights can be gleaned from Table 1, which compares the average number of actions in the first and last trials across different methods. Our AR method shows an increase in the average number of actions from 6.39 in the first trial to 7.07 in the last trial, indicating a robust learning and adaptation process. In comparison, the average number of actions in the first trial of the Plan+Act method is only 4.01, suggesting that it stops at an early stage without completing full plan execution. Thus, our method effectively leverages a greater number of actions to achieve better outcomes, thereby reducing the number of plan revisions by 45% and improving overall efficiency.

4.3 Error Analysis

The subsequent sections present an analysis of the errors we observed in the agent’s behavior when executing tasks. Two key areas are identified for detailed discussion: the agent’s occasional inability to fully learn from past failures, and inefficiencies in solving specific kinds of tasks due to the sequential planning scheme.

4.3.1 Agent takes only partial lessons from past failures

One category of errors we notice is that the agent does not take full lessons from past failures when generating a new plan. As illustrated in Fig. 6(a), the agent is at the final step of drafting a refund message for a Bluetooth speaker, after a series of steps taken to gather information about the order. From the screen, we know that the agent should consolidate all the gathered information and type one piece of text into the (only) box titled “What’s on your mind?”. However, as depicted in Fig. 6(b), while some improvements were made by adding the date of purchase and a more detailed explanation in the revised plan, the agent still failed to optimize the input process, repeating separate typing actions for fields that do not exist. This inefficiency showcases the need for either an LLM with stronger reasoning ability or a better mechanism to solicit more comprehensive and accurate reflection.

(a) Screen observation at the last step to solve the task: Draft a refund message via their "contact us" form for the bluetooth speaker I bought Feb 2023. It broke after three days of use. The shop requires the order id, the reason and the amount to refund in the message. Don’t submit yet.

Initial Plan:
1. Type the order id into the appropriate field in the contact form.
2. Type "The bluetooth speaker I bought broke after three days of use" into the ‘Reason’ field of the contact form.
3. Type the amount to refund into the appropriate field in the contact form.

Revised Plan:
1. Type the order id ‘000000161’ into the appropriate field in the contact form.
2. Type "The bluetooth speaker I bought in Feb 2023 broke after three days of use. I would like to request a refund." into the ‘Reason’ field of the contact form.
3. Type the amount to refund ‘$56.35’ into the appropriate field in the contact form.

(b) Comparison between the last few steps of the initial plan and the revised plan for the same task.
Figure 6: Illustration of the first type of error: the agent does not take full lessons from past experience during plan revision. It fails to capture the root cause of its failure, i.e., typing each piece of information separately into fields that do not even exist.

4.3.2 Sequential planning cannot solve all tasks

In our analysis, we observed a recurrent error stemming from the design of the agent’s planning process. Currently, the proposed methodology structures a plan as a sequence of subtasks executed in a fixed order. This approach, effective in a good number of use cases, falters when faced with tasks that require more sophisticated logic. Specifically, tasks that require implementing a function encapsulating several actions, employing a loop construct, or repeating a group of actions tend to challenge the model’s current configuration. For example:

  • List out reviewers, if exist, who mention about average print quality

  • Give me the SKU of the products that have 1-3 units left.

  • Like all submissions created by CameronKelsey in subreddit earthporn.

The ability to process these tasks effectively would necessitate the incorporation of additional cognitive constructs into the planning model—e.g., loops, repetitive actions, or encapsulation of a group of actions into callable functions. Though taking notes can help the agent eliminate wrong choices, these systemic extensions would add crucial capabilities to the web agent, significantly enhancing its navigation and problem-solving competence in realistic web environments. Moreover, while the current agent can succeed in the limited search space of simple tasks, it often struggles to review and introspect upon more extensive descriptive tasks requiring dynamic problem-solving. By addressing these limitations in future work, i.e., effectively converting textual description of a plan into robust execution of callable functions and loops, we believe that the reasoning capability of our agent can be substantially improved, leading to better outcomes in understanding and solving tasks that involve dynamic cognition in web environments.
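As a purely hypothetical illustration of what such a construct could look like, a task like “Like all submissions created by CameronKelsey in subreddit earthporn” would need to expand into a loop over dynamically discovered items rather than a fixed sequence of subtasks; none of the helper functions below exist in the current agent.

```python
def like_all_submissions(env, author, find_submissions, like):
    """Hypothetical loop construct: apply the same group of actions to every matching item."""
    for submission in find_submissions(env, author=author):  # items discovered at run time
        like(env, submission)                                 # repeated, encapsulated sub-plan

# The current agent emits a fixed, ordered list of subtasks, so it cannot express
# "repeat these actions for every item found" without such a construct.
```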

5 Conclusion

In this work, we introduce a novel introspective methodology that significantly enhances the problem-solving capabilities of LLMs in complex environments, as demonstrated through comprehensive evaluations in the WebArena setting. Our approach strategically decomposes tasks into actionable subtasks and incorporates a three-tiered introspection process, which includes anticipatory reflection, robust post-action evaluation, and episode-level plan revision. This setup not only allows LLM agents to adapt their strategies in real time but also fosters long-term learning, reducing the need for frequent interventions as experience accumulates. The successful application of our introspective methodology in the WebArena suggests its potential transferability to other domains that demand dynamic decision-making such as autonomous driving, healthcare, and interactive customer services. By enabling LLM agents to proactively contemplate potential failures, evaluate actions post-execution, and continuously refine their strategy based on experiential insights, our approach equips AI systems with a human-like strategic thinking capability.

Broader Impact

Looking forward, the integration of multi-modal data inputs could further enhance the contextual understanding and decision-making accuracy of these agents. The principles and findings from our approach provide a robust foundation for future research in AI, particularly in aspects of autonomous decision-making, learning efficiency, and adaptability. As AI continues to integrate into diverse aspects of decision-making, embedding introspective capabilities will be essential to ensure these systems operate not only with precision but with an understanding akin to strategic human cognition.

Ethics Statement

As the capabilities of LLM agents enhance and their deployment in real-world applications increases, it is crucial to address potential ethical concerns, particularly regarding data privacy, bias, and transparency. Our work focuses on improving agent introspection to enhance task performance and decision-making explanations, aiming to develop more transparent and trustworthy AI systems. We emphasize the importance of human oversight to monitor and mitigate unforeseen consequences and encourage the responsible use of this technology for societal benefit. By promoting continuous evaluation and fair practices, we seek to minimize biases and ensure that the deployment of these agents does not exacerbate social inequalities. Furthermore, we are committed to optimizing computational resources to reduce the environmental impact, advocating for sustainable AI practices.

Limitations

Despite substantial progress made with our current design, limitations persist that inhibit optimal performance. Notably, the agent lacks a full learning mechanism to capitalize on past failures when generating a new plan, resulting in inefficient execution and recurring mistakes. Furthermore, while the sequential planning approach is effective for simpler tasks, it falls short for more sophisticated operations, such as those requiring encapsulated actions or loop constructs. Additionally, the agent struggles with tasks that expand beyond a simple search space, suggesting obstacles in handling dynamic problem-solving. Last but not least, our agent requires a large number of LLM generations (i.e., API calls), and consequently substantial time and computational resources, which limits its efficiency. Therefore, future work should concentrate on improving the agent’s ability to fully learn from prior shortcomings, adapt to handle complex tasks, enhance dynamic problem-solving capabilities, and optimize time and resource utilization through more efficient LLM calling.

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Deng et al. [2023] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web.
  • Driess et al. [2023] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. 2023. PaLM-E: An Embodied Multimodal Language Model. In arXiv preprint arXiv:2303.03378.
  • Dror et al. [2023] Rotem Dror, Haoyu Wang, and Dan Roth. 2023. Zero-Shot On-the-Fly Event Schema Induction. In Findings of the Association for Computational Linguistics: EACL 2023, pages 705–725, Dubrovnik, Croatia. Association for Computational Linguistics.
  • Guan et al. [2023] Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. 2023. Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning.
  • Gur et al. [2024] Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. In The Twelfth International Conference on Learning Representations.
  • Hao et al. [2023] Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. 2023. Reasoning with Language Model is Planning with World Model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, Singapore. Association for Computational Linguistics.
  • Huang et al. [2022a] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022a. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. arXiv preprint arXiv:2201.07207.
  • Huang et al. [2022b] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. 2022b. Inner Monologue: Embodied Reasoning through Planning with Language Models. In arXiv preprint arXiv:2207.05608.
  • Li et al. [2023] Tao Li, Gang Li, Zhiwei Deng, Bryan Wang, and Yang Li. 2023. A Zero-Shot Language Agent for Computer Control with Structured Reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11261–11274, Singapore. Association for Computational Linguistics.
  • Liu et al. [2018] Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 2018. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802.
  • Liu et al. [2023] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2023. AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv: 2308.03688.
  • Madaan et al. [2023] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback.
  • Pan et al. [2024] Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. 2024. Autonomous Evaluation and Refinement of Digital Agents.
  • Prasad et al. [2023] Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. 2023. ADaPT: As-Needed Decomposition and Planning with Language Models. arXiv.
  • Shinn et al. [2023] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning.
  • Shridhar et al. [2021] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Song et al. [2023] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. 2023. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  • Song et al. [2024] Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. 2024. Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents. arXiv preprint arXiv:2403.02502.
  • Sun et al. [2023] Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. 2023. AdaPlanner: Adaptive Planning from Feedback with Language Models.
  • Wang et al. [2023a] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023a. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv: Arxiv-2305.16291.
  • Wang et al. [2023b] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023b. Large Language Models are not Fair Evaluators.
  • Wang et al. [2022] Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. 2022. ScienceWorld: Is your Agent Smarter than a 5th Grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Wang et al. [2023c] Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. 2023c. Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Wu et al. [2023] Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, and Haibin Yan. 2023. Embodied Task Planning with Large Language Models. arXiv preprint arXiv:2305.03716.
  • Xu et al. [2023] Binfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, and Dongkuan Xu. 2023. ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models.
  • Yao et al. [2022] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. arXiv preprint.
  • Yao et al. [2023a] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. 2023a. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Yao et al. [2023b] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023b. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
  • Zheng et al. [2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Zhou et al. [2024a] Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2024a. Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models.
  • Zhou et al. [2024b] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024b. WebArena: A Realistic Web Environment for Building Autonomous Agents. In The Twelfth International Conference on Learning Representations.
  • Zhu et al. [2023] Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. 2023. Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory. arXiv preprint arXiv:2305.17144.
  • Zhuang et al. [2024] Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A. Rossi, Somdeb Sarkhel, and Chao Zhang. 2024. ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search. In The Twelfth International Conference on Learning Representations.

Appendix

Prompt for Plan Generation ($G_{\text{plan}}$)

Imagine that you are imitating humans doing a task on a website step by step. You can click an element with the mouse, scroll up or down, go to a certain URL or go back to the previous page, or type some text with the keyboard (e.g., click(), scroll(), goto(), go_back(), and type() functions in playwright). One step means one operation within any of the mentioned actions.

You are within a sandbox and only have access to the following websites to work with:

Notes:

  1. If you want to use the search function, you don’t need to click on the search bar. You can directly use “type [element_id] [things_to_type]”, and generally afterwards, you don’t need to click the search button (by default, the command contains an ENTER at the end).

  2. You can assume that you have signed in to your account (we have set up the cookies, so login is not needed).

The website that you will be working with is:

{WEBSITE INTRO}

Please follow these specific instructions to solve tasks:

{INSTRUCTION}

Here is a more detailed description of the starting screen:

{STARTING SCREEN DESCRIPTION}

Now, based on the information above, what should be the steps to achieve the following goal (please give me a list of textual descriptions of playwright actions, starting with ‘List’):

{TASK}

For your reference, here are some experiences from previous failed trials (please consider the following information to generate a better plan):

{FAILED PLAN}

Past experience:

{HISTORY}

To be successful in generating a new plan, you need to provide a list (1, 2, 3, …), in which each item is a natural language description of one playwright action that is necessary to complete the task (e.g., click on the ‘Account’ button; scroll down; use the search bar to search for iPhone 13). You should use the information from past experiences to avoid unnecessary steps!
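The template above is instantiated by substituting the bracketed placeholders and sending the result to the LLM. As an editorial illustration (not the paper's released code), the following Python sketch shows one way to fill the template and parse the requested numbered list into subtask strings; the `call_llm` helper is hypothetical.

```python
import re

# Hypothetical stand-in for the LLM call; any chat-completion client could be plugged in here.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client")

def generate_plan(template: str, fields: dict[str, str]) -> list[str]:
    """Fill the G_plan template and parse its numbered-list reply into subtask strings."""
    prompt = template
    for key, value in fields.items():
        prompt = prompt.replace("{" + key + "}", value)
    reply = call_llm(prompt)
    # The prompt asks for a list such as "1. click on the 'Account' button";
    # keep only the text after each item number.
    return re.findall(r"^\s*\d+\.\s*(.+?)\s*$", reply, flags=re.MULTILINE)

# Example (template abbreviated; placeholder names match the prompt above):
# plan = generate_plan(PLAN_TEMPLATE, {"TASK": "Find the cheapest iPhone 13 case", ...})
```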

Prompt for Action Generation ($G_{\text{action}}$)

I am in a sandbox and only have access to the following websites (i.e., no access to external websites like www.reddit.com):

Now I’m trying to complete a task on a website.

The task is:

{TASK}

The plan to complete this task is:

{PLAN}

I have executed the following actions:

{HISTORY}

And now I’m at this step: {STEP}

Here is the screen I am looking at:

{OBS}

I have taken down the following notes:

{NOTES}

What should be the next action to complete this step in my plan (only give one action)?

Note:

  • If the next action is to click, please indicate the element id in [] (format: click [element_id]).

  • If the next action is to scroll, please indicate the direction in [] (format: scroll [up or down]).

  • If you need to navigate to a URL, please indicate the URL in [] (format: goto [url]).

  • If you need to go back to the previous page, please use go_back.

  • If the next action is to type, please indicate both element id and the things to type in [] (format: type [element_id] [things to type]).

  • If you want to note down something, use this format: note_down [things to note down].

The next action is:
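The bracketed action formats requested above are easy to parse mechanically. The sketch below is our own illustration (not part of the prompt or the paper's code) of turning a single reply into a structured command that a Playwright-style executor could dispatch on.

```python
import re
from dataclasses import dataclass

@dataclass
class Command:
    name: str          # "click", "scroll", "goto", "go_back", "type", or "note_down"
    args: list[str]    # arguments recovered from the bracketed fields

def parse_action(reply: str) -> Command:
    """Parse one G_action reply written in the bracketed formats listed above."""
    reply = reply.strip()
    if reply.startswith("go_back"):
        return Command("go_back", [])
    m = re.match(r"type\s*\[([^\]]+)\]\s*\[([^\]]*)\]", reply)
    if m:  # type takes two bracketed arguments: the element id and the text to type
        return Command("type", [m.group(1).strip(), m.group(2)])
    m = re.match(r"(click|scroll|goto|note_down)\s*\[([^\]]+)\]", reply)
    if m:  # the remaining actions take a single bracketed argument
        return Command(m.group(1), [m.group(2).strip()])
    raise ValueError(f"unrecognized action: {reply!r}")

# parse_action("type [1582] [iPhone 13]") -> Command(name="type", args=["1582", "iPhone 13"])
```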

Prompt for Objective Alignment ($G_{\text{align}}$)

Imagine that you are imitating humans doing a task on a website step by step.

You are currently working on this step:

{STEP}.

The step above is one of the steps in the following plan:

{PLAN}.

From Screen 1, you executed an action and then arrived at Screen 2.

The action you executed was:

{ACTION}.

Screen 1:

{OBS1}.

Screen 2:

{OBS2}.

Now describe what this action is about in one sentence, starting with ‘The action is to’.

Does this action align with the goal of the following step (i.e., are we moving in the right direction; answer YES or NO)?

{STEP}
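A reply to this prompt contains a one-sentence description of the action followed by a YES/NO verdict. One possible way to extract both, shown only as a hedged sketch rather than the paper's implementation:

```python
import re

def parse_alignment(reply: str) -> tuple[str, bool]:
    """Split a G_align reply into the one-sentence description and the YES/NO verdict."""
    description = next(
        (line.strip() for line in reply.splitlines()
         if line.strip().lower().startswith("the action is to")),
        "",
    )
    # Use the last standalone YES/NO token as the verdict; a missing verdict is
    # treated conservatively as NO (i.e., the action is assumed to be misaligned).
    verdicts = re.findall(r"\b(YES|NO)\b", reply.upper())
    aligned = bool(verdicts) and verdicts[-1] == "YES"
    return description, aligned
```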

Prompt for Task / Subtask Completion Evaluation ($G_{\text{completed}}$)

Imagine that you are imitating humans doing a task on a website step by step.

You are asked to solve the following task:

{TASK}

You made the following plan to solve it:

{PLAN}

To reach the current screen, you have previously executed the following actions:

{HISTORY}

You have taken down a few notes after each action as follows:

{NOTES}

And here is the accessibility tree of the current screen you are looking at:

{OBS}

Look at the screen, the task, and the actions you executed, and think thoroughly: is the task completed now?

If the task is completed, answer YES.

If the task is not yet completed (meaning further actions are yet to be executed), answer NO.
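Because the reply is constrained to YES or NO, this completion check can gate the agent's control flow with a few lines of glue code. A minimal sketch, assuming the same hypothetical `call_llm` helper as above:

```python
def is_completed(template: str, fields: dict[str, str], call_llm) -> bool:
    """Fill G_completed and interpret the model's YES/NO answer conservatively."""
    prompt = template
    for key, value in fields.items():
        prompt = prompt.replace("{" + key + "}", value)
    reply = call_llm(prompt).strip().upper()
    # Only an explicit leading YES counts as completed; anything else keeps the agent acting.
    return reply.startswith("YES")
```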

Prompt for Answer Delivery ($G_{\text{answer}}$)

Imagine that you are imitating humans doing a task on a website step by step.

You are asked to solve the following task:

{TASK}

To reach the current screen, you have previously executed the following actions:

{HISTORY}

You have taken down the following notes (to help you answer the question eventually) after each action:

{NOTES}

And here is the accessibility tree of the current screen you are looking at:

{OBS}

Based on the above information, answer the question in the task (starting with ###Answer).
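Since the reply is asked to begin its answer with ‘###Answer’, the final answer can be recovered by splitting on that marker. A small illustrative helper (not from the paper):

```python
def extract_answer(reply: str) -> str:
    """Return the text following the '###Answer' marker that G_answer asks for."""
    marker = "###Answer"
    idx = reply.find(marker)
    if idx == -1:
        return reply.strip()  # fall back to the whole reply if the marker is missing
    return reply[idx + len(marker):].lstrip(" :\n").strip()
```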

Prompt for Element Mapping ($G_{\text{map}}$)

I want to interact with an element with element id: {element_id} in the following screen:

{OBS1}

Now if I want to click on the same element in the following screen, what should be the element id now?

{OBS2}

New element id is:
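Because accessibility-tree element ids can change between screens, the reply to this prompt is used to remap a previously chosen element onto the new observation. A hedged sketch of that plumbing, again with a hypothetical `call_llm` helper and the assumption that element ids are numeric:

```python
import re
from typing import Optional

def map_element_id(template: str, element_id: str, obs1: str, obs2: str, call_llm) -> Optional[str]:
    """Fill G_map and read back the remapped element id, or None if none is found."""
    prompt = (template
              .replace("{element_id}", element_id)
              .replace("{OBS1}", obs1)
              .replace("{OBS2}", obs2))
    reply = call_llm(prompt)
    # Assumes numeric element ids, as in WebArena-style accessibility trees.
    match = re.search(r"\d+", reply)
    return match.group(0) if match else None
```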