StreamBench: Towards Benchmarking Continuous Improvement of Language Agents

Cheng-Kuang Wu¹,²,*, Zhi Rui Tam¹,*, Chieh-Yen Lin¹, Yun-Nung Chen¹,², Hung-yi Lee²
¹Appier AI Research
²National Taiwan University
{brian.wu, ray.tam}@appier.com
*Equal contribution

Abstract

Recent works have shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous stream of feedback and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios. Source code: https://github.com/stream-bench/stream-bench. Benchmark website: https://stream-bench.github.io.

1 Introduction

Recently, large-scale pretraining [1] and instruction fine-tuning [2] have driven paradigm shifts in how we interact with language models. These advancements allow us to use them out-of-the-box to solve problems. Consequently, many benchmarks have emerged to evaluate the general capabilities of these models. Some notable examples include MMLU [3], GSM8K [4], and BIG-Bench-Hard [5]. All these benchmarks aim to assess LLMs’ innate capabilities, which we define as the general knowledge or reasoning abilities demonstrated when used out-of-the-box.

In addition to LLMs’ strong innate capabilities, recent works have shown that LLM agents, which are LLMs augmented with extra components such as memory, retrievers, or tools, are able to improve themselves from experience. MemPrompt [6] shows that memory-enhanced GPT-3 can improve over time by storing past user feedback and retrieving it later. Reflexion [7] demonstrates that, through self-reflection over repeated trials on the same dataset, LLM agents can perform better in later trials. ExpeL [8] further shows that LLM agents can learn from cross-task experience and improve performance without executing repeated trials on the target task.

Given LLM agents’ self-improvement abilities, there remains a missing piece in the current evaluation landscape. Beyond measuring LLMs’ innate capabilities with the aforementioned offline benchmarks [3, 4, 5], it is important to assess their capacity to improve over time, since we would like our systems to gradually improve after deployment. This gap motivated us to develop a new evaluation scenario: an online setting to measure LLM agents’ ability to continuously enhance their performance over time.

This online setting focuses on scenarios where LLM agents attempt to solve a specific downstream task and improve themselves from an input-feedback sequence, with the goal of maximizing the accuracy over the whole sequence of the agent’s predictions.

Given these rationales, we introduce StreamBench, a benchmark designed to evaluate LLM agents’ ability to improve themselves over an input-feedback sequence. StreamBench simulates an environment where LLM agents are exposed to a sequence of users’ natural language requirements and feedback. To the best of our knowledge, StreamBench is the first benchmark to evaluate LLM agents in streaming scenarios with a diverse range of tasks. StreamBench aims to inspire further efforts to develop more adaptive LLM agents, thereby enhancing their practical effectiveness. Our contributions can be summarized as follows:

• We introduce StreamBench, the first benchmark designed to evaluate LLM agents’ ability to improve over an input-feedback sequence in an online setting across a wide range of tasks.

• We propose several simple yet effective baselines for enhancing LLM agents’ performance in streaming scenarios, including a cost-effective multi-agent method that outperforms other baselines while maintaining the average cost of a single agent.

• We conduct analysis on the advantages and potential pitfalls of the proposed methods, providing insights into effective streaming strategies of LLMs.

Figure 1: (Left) A schematic diagram showing the online evaluation setting of StreamBench, where agents update their components ($p$, $r$, $\mathcal{M}$, or $\theta$) from an input-feedback sequence to achieve the highest final accuracy (refer to Section 3.1 for details). (Right) Performance curve on the DDXPlus dataset on StreamBench. Agents are able to gradually improve with our proposed streaming baselines.

2 Formulation

Consider a streaming scenario involving an agent, an external environment, and a sequence of inputs:

Agent. We define an agent as an LLM parameterized by $\theta$ and augmented with additional components to enhance the agent’s capabilities, such as the external memory $\mathcal{M}$ and a retriever $r(\cdot)$ to store and retrieve useful information. Given an instance $x$ in natural language, a prompting template $p(\cdot)$ , and a retrieval function $r(\cdot)$ , the agent’s output is denoted as $\hat{y}=f(p(x,r(\mathcal{M}))|\theta)$ .

Environment. The external environment, denoted as $g(\cdot)$ , provides feedback to the agent. The nature of $g(\cdot)$ varies depending on the specific downstream task and the type of feedback being collected. Potential roles for $g(\cdot)$ include human users, code execution environments, and API responses.

Input-feedback sequence. Consider an input stream in which each input is denoted by $x_{t}$, where $t$ is the time step. After the agent provides the output $\hat{y}_{t}$, the environment provides a feedback signal $fb_{t}=g(x_{t},\hat{y}_{t})$. Figure 1 shows an overview of the streaming scenario.

Algorithm 1 presents a simple framework for language agents to continuously learn from feedback. Benchmark users can adapt Algorithm 1 or develop their own algorithms to update components of their language agents, with the goal of maximizing the accuracy over the entire sequence.

Algorithm 1 Framework for Language Agents to Continuously Learn from Feedback on StreamBench

1: Initialize agent $f(\cdot|\theta)$, prompting template $p(\cdot)$, retriever $r(\cdot)$, and external memory $\mathcal{M}$;
2: for $t=1,2,\ldots,T$ do
3:     Receive instance $x_{t}$ from the data stream;
4:     The agent predicts $\hat{y}_{t}=f(p(x_{t},r(\mathcal{M}))|\theta)$;
5:     Receive feedback signal $fb_{t}=g(x_{t},\hat{y}_{t})$;
6:     Update $p(\cdot)$, $r(\cdot)$, $\mathcal{M}$, or $\theta$ using $x_{t}$, $\hat{y}_{t}$, and $fb_{t}$;   $\triangleright$ Benchmark users can develop their own algorithms for updating their language agents to learn continuously
7: end for
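
The loop in Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the benchmark's actual code: the names `run_stream`, `toy_predict`, `toy_feedback`, `toy_update`, and the `GOLD` lookup table are hypothetical stand-ins for $f$, $g$, and the update rule, and a plain list plays the role of $\mathcal{M}$.

```python
from typing import Callable, List, Tuple

# Each memory record is an (x_t, y_hat_t, fb_t) triple.
Memory = List[Tuple[str, str, int]]


def run_stream(
    stream: List[str],
    predict: Callable[[str, Memory], str],
    feedback: Callable[[str, str], int],
    update: Callable[[Memory, str, str, int], None],
) -> List[int]:
    """Run the predict -> feedback -> update loop and return all feedback signals."""
    memory: Memory = []
    feedback_log: List[int] = []
    for x_t in stream:                      # step 3: receive x_t
        y_hat_t = predict(x_t, memory)      # step 4: agent predicts y_hat_t
        fb_t = feedback(x_t, y_hat_t)       # step 5: environment returns fb_t
        update(memory, x_t, y_hat_t, fb_t)  # step 6: update a component (here: memory)
        feedback_log.append(fb_t)
    return feedback_log


# Hypothetical toy environment: the "gold" answers are known only to the environment.
GOLD = {"2+2": "4", "3+3": "6"}


def toy_predict(x: str, memory: Memory) -> str:
    # Reuse an answer that was confirmed correct for the same input, else guess "4".
    for past_x, past_y, fb in memory:
        if past_x == x and fb == 1:
            return past_y
    return "4"


def toy_feedback(x: str, y_hat: str) -> int:
    return int(GOLD[x] == y_hat)


def toy_update(memory: Memory, x: str, y_hat: str, fb: int) -> None:
    memory.append((x, y_hat, fb))


log = run_stream(["2+2", "3+3", "2+2"], toy_predict, toy_feedback, toy_update)
```

Passing the update rule in as a function mirrors the comment on step 6: the loop itself stays fixed while benchmark users swap in their own strategy.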

Traditionally, updating the agent at each time step $t$ involves updating the model parameters $\theta$. However, as foundation models grow larger, frequently updating the agent’s network parameters has become computationally expensive. Recent advancements offer promising alternatives for iterative improvement by updating other components of the agent. For example, one can adapt existing iterative prompt refinement strategies to refine $p(\cdot)$ [9, 10, 11], update the weights of the retriever $r(\cdot)$ [12, 13, 14], expand the agent’s memory $\mathcal{M}$ [6, 15], or use parameter-efficient fine-tuning techniques for augmenting $\theta$ [16]. These different strategies open new possibilities for continuous adaptation of agents without relying solely on network parameter updates. In this work, we develop several baselines for improving agents over time, with a particular focus on updating $p(\cdot)$ and $\mathcal{M}$. The baselines demonstrate both simplicity and effectiveness. We leave methods for updating $r(\cdot)$ and $\theta$, which require computationally expensive network parameter updates, for future research.

3 StreamBench

3.1 General setup

Streaming sequence

Most public datasets are inherently static, meaning each instance does not have a time-related dependency. To adapt them for our streaming setup, we serialize each selected dataset in Section 3.2 by assigning a time step to each instance. To avoid arbitrary sequence assignment in the original datasets, we randomly shuffle each dataset using a fixed random seed. We release each dataset’s assigned sequence obtained by this random seed in the supplementary materials to ensure reproducibility on StreamBench. Additionally, to ensure the robustness of our evaluation, we conduct experiments on different shuffled sequences with five random seeds, as discussed in Section 5.2. We also discuss the effects of distributional shifts in Appendix C.2.
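
The serialization step can be illustrated with a seeded shuffle. The sketch below is an assumption about how such a step might look, not the released code: `serialize_dataset` is a hypothetical helper, and the seed value 0 is arbitrary rather than the seed used for the released sequences.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def serialize_dataset(instances: Sequence[T], seed: int) -> List[T]:
    """Return a shuffled copy; position i in the result is time step t = i + 1."""
    rng = random.Random(seed)   # private generator, so global random state is untouched
    ordered = list(instances)   # copy, leaving the original dataset order intact
    rng.shuffle(ordered)
    return ordered


dataset = [f"instance-{i}" for i in range(10)]
stream_a = serialize_dataset(dataset, seed=0)
stream_b = serialize_dataset(dataset, seed=0)
```

Using a dedicated `random.Random(seed)` instance keeps the sequence reproducible regardless of any other random calls made elsewhere in an experiment.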

Feedback signals

Choosing an appropriate type of feedback signal is a crucial consideration in StreamBench. Cost and practicality play a significant role: in practice, obtaining the ground truth $y_{t}$ at each time step can be prohibitively expensive. For example, providing the exact code in programming tasks or the complete schema of each API call in tool use tasks is often impractical. In contrast, partial feedback, such as the helpfulness or correctness of the agent’s output, is relatively easy to obtain, for example through the “thumbs up” or “thumbs down” buttons commonly found in the user interfaces of LLM applications. Given these rationales, we formalize the type of $fb_{t}$ as follows:

$$fb_{t}=g(x_{t},\hat{y}_{t}),\quad fb_{t}\in\{0,1\}$$

where $fb_{t}$ is a scalar serving as a proxy for the correctness of $\hat{y}_{t}$ with respect to $x_{t}$, determined by the environment $g(\cdot)$ of the given downstream task. The feedback $fb_{t}\in\{0,1\}$ is binary, indicating whether the agent’s output $\hat{y}_{t}$ is correct. This simplified feedback setting aims to offer a unified evaluation framework for ensuring consistency and practicality across diverse tasks. We leave other designs of $fb_{t}$, such as ground truth or natural language feedback, for future work.

Evaluation

In practice, an agent’s goal is to satisfy as many user requirements as possible over a time sequence. We thus evaluate an agent by its aggregate metric at the final time step ( $T$ ). For example, the final metric on a given dataset can be calculated as $\frac{\sum_{t=1}^{T}h(\hat{y}_{t},y_{t})}{T}$ , where $h$ is the function for calculating the corresponding metric on a given dataset. Table 1 shows metrics for each dataset.
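
As a small worked example of the aggregate metric $\frac{1}{T}\sum_{t=1}^{T}h(\hat{y}_{t},y_{t})$, the sketch below uses a whitespace-insensitive exact match as a hypothetical choice of $h$; the function names and the toy predictions are illustrative assumptions. The running average is included because it corresponds to the kind of performance curve shown in Figure 1 (Right).

```python
from itertools import accumulate
from typing import Callable, List, Sequence


def exact_match(y_hat: str, y: str) -> float:
    """Hypothetical per-instance metric h: 1.0 if the strings match, else 0.0."""
    return float(y_hat.strip() == y.strip())


def per_step_scores(
    preds: Sequence[str], golds: Sequence[str], h: Callable[[str, str], float]
) -> List[float]:
    if len(preds) != len(golds):
        raise ValueError("predictions and ground truths must have equal length")
    return [h(p, g) for p, g in zip(preds, golds)]


def final_metric(scores: Sequence[float]) -> float:
    """Aggregate metric at the final time step T: the mean of h over t = 1..T."""
    if not scores:
        raise ValueError("empty sequence")
    return sum(scores) / len(scores)


def running_metric(scores: Sequence[float]) -> List[float]:
    """Average of h over the first t steps, for every t (a performance curve)."""
    return [total / (i + 1) for i, total in enumerate(accumulate(scores))]


scores = per_step_scores(["a", "b", "c", "d"], ["a", "x", "c", "d"], exact_match)
final = final_metric(scores)
curve = running_metric(scores)
```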

3.2 Datasets

To measure LLM agents’ capacity for continuous improvement post-deployment, we select a diverse set of downstream tasks with potential real-world applications. Following the setting in Section 3.1, these tasks share the property that their ground truth output $y_{t}$ is costly to obtain at each time step.

Text-to-SQL

For text-to-SQL tasks, the agent has to convert users’ natural language queries into SQL code to meet their data requirements. StreamBench integrates three prominent datasets: Spider [17], CoSQL [18], and BIRD [19]. These datasets represent a progressive difficulty curve, allowing for evaluation of how well agents improve when faced with data of varying difficulties.

Python programming

To evaluate coding ability improvement, we use the DS-1000 [20] dataset, which consists of real-world Python programming questions from StackOverflow. To successfully solve a given question, the agent must provide a solution and pass the associated test cases.

Tool use

The ability to use external tools is a significant milestone in the development of LLMs, as it compensates for certain limitations, such as performing precise arithmetic operations or conducting web searches. For this purpose, we utilize the large-scale tool usage dataset ToolBench [21], and select the subset that includes stable and low-latency tool APIs collected in a previous work [22].

Medical diagnosis

To assess LLMs’ continuous improvement in applying expert knowledge, we use the DDXPlus [23] dataset, where agents must make a medical diagnosis out of 49 diagnoses based on patient profiles detailing their symptoms. This setup mimics how medical doctors improve their diagnostic skills through accumulated patient encounters. Evaluating LLMs on this dataset helps us understand their potential for continuous improvement in a highly specialized domain.

Question answering

Question answering (QA) tasks evaluate an agent’s ability to reason over supporting facts to answer users’ questions. We adopt the distractor setting in HotpotQA [24], which requires reasoning over multiple supporting or distracting documents to answer questions. This helps measure the agent’s improved proficiency in reasoning over grounded knowledge to provide accurate answers. Given the extensive volume of questions, we sampled 1,500 out of the total 7,410 questions.

Detailed information on the aforementioned datasets is provided in Table 1 and Appendix F.

Table 1: Input, output, evaluation metrics, and number of testing instances of selected datasets.

Task	Text-to-SQL	Text-to-SQL	Text-to-SQL	Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
Input ($x_{t}$)	Data requirements	Data requirements	Data requirements	Question	User query	Symptoms	Question
Output ($y_{t}$)	SQL code	SQL code	SQL code	Code	API calls	Diagnosis	Answer
Metric	Execution accuracy	Execution accuracy	Execution accuracy	Pass@1	Accuracy	Accuracy	Exact Match
Test size ($T$)	2,147	1,007	1,534	1,000	750	1,764	1,500

4 Experiments

4.1 Baselines

A key objective of StreamBench is to compare the performance gains of LLM agents using non-streaming versus streaming methods. In non-streaming settings, methods focus on optimizing performance at a per-instance level, with improvements made independently for each testing instance. For these non-streaming methods, the overall performance boost on the testing set stems from improvements on individual testing instances. In contrast, streaming methods utilize information from past instances to improve future performance. For streaming methods, we adapt two previously proposed methods and introduce two new approaches to explore effective streaming strategies.

4.1.1 Non-streaming methods

Zero-shot

The zero-shot baseline reflects the basic instruction-following abilities of LLMs for solving a given task.

Few-shot

It involves providing several ground truth $(x,y)$ pairs in the prompting template $p(\cdot)$ . For datasets with training sets, we construct few-shot examples from the training data. For datasets without training sets, we compose few-shot examples and inspect their quality to ensure reliability. We include few-shot examples for each dataset in the supplementary materials for reproducibility.

Chain-of-thought (CoT)

Following previous work [25], we employ a trigger phrase (e.g., “Let’s think step by step.”) to instruct the LLM to generate the reasoning process before providing its final answer. We then extract the answer in the correct format from the generated reasoning text.

Self-Refine

Self-Refine [26] is a technique where the LLM iteratively improves its output based on self-feedback. The model generates an initial response and then refines it over multiple iterations. It leverages LLMs’ ability to self-evaluate and adjust their responses at a per-instance level.

4.1.2 Streaming methods

GrowPrompt

We adapt the previously proposed method GrowPrompt [6], in which the $(x_{t},\hat{y}_{t},fb_{t})$ triples from the latest $k$ time steps are stored in a sliding window $\mathcal{W}$. The contents of $\mathcal{W}$ are incorporated into the prompt at inference time to output $\hat{y}_{t}=f(p(x_{t},\mathcal{W})|\theta)$. This provides the agent with information from the past $k$ instances, where $k$ is a hyperparameter. Since LLM agents take text input, we verbalize $fb_{t}\in\{0,1\}$ to inform the agent of whether its output $\hat{y}_{t}$ correctly satisfies the input $x_{t}$.
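
A sliding window of the last $k$ triples maps naturally onto a bounded deque. The following is a minimal sketch under stated assumptions: the class name `SlidingWindow`, the prompt layout, and the wording produced by `verbalize` are all hypothetical and are not the prompts used in the paper (those are in the appendix).

```python
from collections import deque
from typing import Deque, Tuple


def verbalize(fb: int) -> str:
    """Turn the binary feedback into text (hypothetical wording)."""
    return "correct" if fb == 1 else "incorrect"


class SlidingWindow:
    """Keeps only the k most recent (x, y_hat, fb) triples."""

    def __init__(self, k: int) -> None:
        # deque(maxlen=k) silently discards the oldest item once it is full.
        self.window: Deque[Tuple[str, str, int]] = deque(maxlen=k)

    def add(self, x: str, y_hat: str, fb: int) -> None:
        self.window.append((x, y_hat, fb))

    def build_prompt(self, x_t: str) -> str:
        blocks = [
            f"Input: {x}\nOutput: {y_hat}\nFeedback: the output was {verbalize(fb)}."
            for x, y_hat, fb in self.window
        ]
        blocks.append(f"Input: {x_t}\nOutput:")
        return "\n\n".join(blocks)


w = SlidingWindow(k=2)
w.add("q1", "a1", 1)
w.add("q2", "a2", 0)
w.add("q3", "a3", 1)   # "q1" falls out of the window here
prompt = w.build_prompt("q4")
```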

MemPrompt

As an advanced version of GrowPrompt, MemPrompt [6] incorporates an external memory $\mathcal{M}$ to store the $(x_{t},\hat{y}_{t},fb_{t})$ triples of all past time steps. During inference, a retriever $r(\cdot)$ is used to select $k$ elements from $\mathcal{M}$, and $fb_{t}$ is also verbalized to inform the agent of $\hat{y}_{t}$’s correctness.
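
To make the retrieval step concrete, the sketch below stores every triple and returns the $k$ most similar past inputs. It is deliberately simplified: the bag-of-words `embed` function and the `Memory` class are hypothetical stand-ins chosen so the example runs with the standard library only, whereas the paper's implementation uses a neural text encoder and a vector database (Section 4.2).

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts (an assumption, not the paper's encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class Memory:
    """Stores all past (x, y_hat, fb) triples, keyed by an embedding of x."""

    def __init__(self) -> None:
        self.records: List[Tuple[Counter, str, str, int]] = []

    def add(self, x: str, y_hat: str, fb: int) -> None:
        self.records.append((embed(x), x, y_hat, fb))

    def retrieve(self, x_t: str, k: int) -> List[Tuple[str, str, int]]:
        query = embed(x_t)
        ranked = sorted(self.records, key=lambda r: cosine(query, r[0]), reverse=True)
        return [(x, y_hat, fb) for _, x, y_hat, fb in ranked[:k]]


memory = Memory()
memory.add("list all singers", "SELECT * FROM singer", 1)
memory.add("count the number of cars", "SELECT name FROM car", 0)
memory.add("list all concerts", "SELECT * FROM concert", 1)
top = memory.retrieve("list all stadiums", k=2)
```

Unlike the sliding window, the memory never forgets: relevance to the current input, rather than recency, decides what reaches the prompt.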

Self-StreamICL

Previous works [27, 28] have found that incorrect examples can negatively impact in-context learning (ICL) performance, though the extent varies for different LLMs. Based on these insights, we hypothesize that while GrowPrompt and MemPrompt use verbalized $fb_{t}$ to inform the agent about the correctness of its output, incorrect $\hat{y}_{t}$ still introduces distracting signals that can hinder improvement. Therefore, we propose to save $(x_{t},\hat{y}_{t})$ pairs to memory $\mathcal{M}$ only when $fb_{t}=1$ , eliminating the need to save verbalized $fb_{t}$ . This method, called Self-StreamICL, operates similarly to regular ICL, except that the labels are now self-generated and gradually accumulate over the data stream, without the need to preconstruct few-shot examples. For more details, refer to Algorithm 2.

Multi-Agentic-Memory StreamICL

In Self-StreamICL, the agent learns exclusively from its own past experiences. However, we hypothesize that different LLM agents possess distinct strengths and weaknesses, so they can potentially benefit from the experiences of other agents. To explore this idea, we introduce Multi-Agentic-Memory StreamICL (MAM-StreamICL), which employs a multi-agent framework where multiple LLM agents share a common memory. This shared memory incorporates the past outputs of all agents, allowing each agent to learn from the diverse experiences of the others.

We implement a simple round-robin-like schedule, outlined in Algorithm 2, to switch between the LLM agents. This ensures that each agent contributes to the shared memory in a balanced manner. Our experiments show that this straightforward strategy can boost performance beyond the average performance of the individual agents. In fact, Self-StreamICL can be seen as a special case of MAM-StreamICL with only one LLM agent.

Note that the high cost associated with scaling is the most critical drawback of the multi-agent methods proposed in previous works, and the key advantage of MAM-StreamICL is its cost-effectiveness. Unlike methods such as Multiagent Debate [29] and RECONCILE [30], the cost of MAM-StreamICL does not scale proportionally with the number of agents. Instead, the cost is equivalent to the average cost of a single agent, since only one agent is assigned to answer at each time step.
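
The cost argument can be written out as a short calculation. The symbols here are our own assumptions for illustration: $c_k$ denotes a fixed per-call cost of agent $k$, and $T$ is taken to be a multiple of $K$ so that each agent answers exactly $T/K$ times. The second expression describes a generic scheme in which every agent answers at every step, used only as a point of comparison.

```latex
% Assumptions (ours): agent k has a constant per-call cost c_k; K divides T.
\underbrace{\frac{1}{T}\sum_{t=1}^{T} c_{\,t \bmod K}}_{\text{round-robin, one agent per step}}
  \;=\; \frac{1}{T}\cdot\frac{T}{K}\sum_{k=0}^{K-1} c_k
  \;=\; \frac{1}{K}\sum_{k=0}^{K-1} c_k,
\qquad
\underbrace{\frac{1}{T}\sum_{t=1}^{T}\sum_{k=0}^{K-1} c_k}_{\text{all agents answer every step}}
  \;=\; \sum_{k=0}^{K-1} c_k .
```

Under these assumptions the per-step cost of the round-robin schedule is the mean of the individual agent costs, while the all-agents scheme costs $K$ times that mean.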

Algorithm 2 Round-Robin Algorithm for MAM-StreamICL

1: Initialize $K$ agents $f_{0}(\cdot|\theta_{0})$, $f_{1}(\cdot|\theta_{1})$, …, $f_{K-1}(\cdot|\theta_{K-1})$;   $\triangleright$ $K=1$ in the Self-StreamICL baseline
2: Initialize prompt $p(\cdot)$, retriever $r(\cdot)$, and external memory $\mathcal{M}_{0}$, all shared between agents;
3: for $t=1,2,\ldots,T$ do
4:     Receive instance $x_{t}$ from the data stream;
5:     Select the next agent by $k=t \bmod K$;
6:     The $k$-th agent predicts $\hat{y}_{t}=f_{k}(p(x_{t},r(\mathcal{M}_{t-1}))|\theta_{k})$;
7:     Receive feedback signal $fb_{t}=g(x_{t},\hat{y}_{t})$;   $\triangleright$ $fb_{t}\in\{0,1\}$ under the StreamBench setting
8:     if $fb_{t}=1$ then   $\triangleright$ which means the self-output $\hat{y}_{t}$ is correct
9:         $\mathcal{M}_{t}\leftarrow\mathcal{M}_{t-1}\cup\{(x_{t},\hat{y}_{t})\}$;
10:     else
11:         $\mathcal{M}_{t}\leftarrow\mathcal{M}_{t-1}$;
12:     end if
13: end for
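
Algorithm 2 translates almost line by line into Python. The sketch below is illustrative only: `mam_stream_icl`, the two toy agents, the `toy_feedback` environment, and the "last two examples" retriever are hypothetical, and real agents would be LLM calls that place the retrieved pairs into the prompt. Passing a single agent reproduces the $K=1$ case, i.e. Self-StreamICL.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                  # (x, y_hat) with fb = 1
Agent = Callable[[str, List[Pair]], str]


def mam_stream_icl(
    stream: List[str],
    agents: List[Agent],
    feedback: Callable[[str, str], int],
    retrieve: Callable[[List[Pair], str], List[Pair]],
) -> Tuple[List[Pair], List[Tuple[int, str, int]]]:
    """Round-robin over agents that share one memory of correct outputs."""
    memory: List[Pair] = []                       # shared memory M_0 = {}
    trace: List[Tuple[int, str, int]] = []
    for t, x_t in enumerate(stream, start=1):
        k = t % len(agents)                       # step 5: k = t mod K
        y_hat = agents[k](x_t, retrieve(memory, x_t))  # step 6
        fb = feedback(x_t, y_hat)                 # step 7
        if fb == 1:                               # steps 8-9: keep only correct outputs
            memory.append((x_t, y_hat))
        trace.append((k, y_hat, fb))
    return memory, trace


# Toy task: the correct output is the upper-cased input.
def toy_feedback(x: str, y_hat: str) -> int:
    return int(y_hat == x.upper())


def strong_agent(x: str, examples: List[Pair]) -> str:
    return x.upper()


def weak_agent(x: str, examples: List[Pair]) -> str:
    # Only succeeds once it has seen at least one correct example in memory.
    return x.upper() if examples else x.lower()


def last_two(memory: List[Pair], x: str) -> List[Pair]:
    return memory[-2:]


memory, trace = mam_stream_icl(
    ["a", "b", "c", "d"], [strong_agent, weak_agent], toy_feedback, last_two
)
```

In this toy run the weak agent fails at $t=1$ with an empty memory, then succeeds at $t=3$ using an example contributed by the strong agent, which is the kind of cross-agent benefit the shared memory is meant to provide.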

4.2 Implementation details

We conduct experiments using three LLM families: GPT [31, 32], Gemini [33, 34], and Claude [35]. For our main experiments, we use the endpoints gpt-3.5-turbo-0125, gemini-1.0-pro-001, and claude-3-haiku-20240307. These models represent cost-effective LLMs, balancing performance and affordability. The same three models serve as the $K=3$ agents in MAM-StreamICL. For methods with $\mathcal{M}$ (MemPrompt, Self-StreamICL, and MAM-StreamICL), we implement $\mathcal{M}$ as a vector database. We use BAAI/bge-base-en-v1.5 to encode $x_{t}$ as key embeddings and save $x_{t}$, $\hat{y}_{t}$ (and $fb_{t}$ for MemPrompt) as values. For hyperparameters and prompts, refer to Appendices A and F.

4.3 Main results

The main results are shown in Table 2, which lists the averaged performance of the three LLM agents. The only exception is MAM-StreamICL, which runs only once on the streaming sequence, with the three agents taking turns across time steps. For the results of each individual model, refer to Appendix B.

Table 2: Averaged performance of three LLM agents across different baselines and datasets.

Task	Text-to-SQL	Text-to-SQL	Text-to-SQL	Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
Non-streaming
Zero-Shot	67.89	50.55	29.60	37.70	61.38	52.85	48.49
Few-Shot	68.55	50.61	30.40	33.33	68.58	60.98	53.11
CoT	61.53	46.01	27.23	25.93	58.98	58.20	52.47
Self-Refine	67.75	49.49	29.62	36.30	60.67	52.89	43.53
Streaming
GrowPrompt	69.90	51.97	30.35	33.77	65.07	55.10	51.38
MemPrompt	70.78	53.29	31.99	35.47	64.31	54.02	52.62
Self-StreamICL	74.63	55.05	35.31	41.30	71.33	70.56	54.80
MAM-StreamICL	75.69	55.17	36.38	43.10	75.87	83.50	55.20

Overall, streaming methods outperform non-streaming methods, though the extent of improvement varies across different datasets. These results demonstrate the value of leveraging input-feedback streams to enhance agent performance on downstream tasks. In addition, as demonstrated by the robust performance of Self-StreamICL compared to GrowPrompt and MemPrompt, leveraging feedback as simple as correctness can enable agents to improve even more through self-generated outputs $\hat{y}_{t}$ . This provides an important insight: rather than solely focusing on prompting pipelines to boost per-instance performance, adopting simple yet effective streaming approaches, such as collecting correctness feedback, could potentially lead to notable improvements on LLM agents. Lastly, MAM-StreamICL shows the most notable and consistent performance boost across all datasets.

5 Discussion

5.1 What makes effective streaming strategies?

This subsection provides insights into the key aspects that contribute to successful streaming strategies. We identify two factors that drive streaming improvements:

5.1.1 Collecting correct self-output

To investigate whether incorrect self-output hinders agents’ improvement, we conducted ablation studies on GrowPrompt and MemPrompt. In the default setting in Table 2, both methods use all $k$ retrieved $(x_{t},\hat{y}_{t},fb_{t})$ triples during inference (use all). In contrast, the ablations either use only the triples where $fb_{t}=0$ (only incorrect), or use only the triples where $fb_{t}=1$ (only correct).
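The three ablation settings differ only in which retrieved triples reach the prompt. A minimal sketch of that selection step follows; the `Triple` container and the `select_triples` function are hypothetical names used for illustration, and how the selected triples are verbalized in the prompt is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    x: str       # past input x_t
    y_hat: str   # the agent's own output for x_t
    fb: int      # correctness feedback fb_t: 1 = correct, 0 = incorrect


def select_triples(retrieved: list[Triple], mode: str = "use_all") -> list[Triple]:
    """Filter the k retrieved triples according to the ablation setting."""
    if mode == "use_all":
        return list(retrieved)
    if mode == "only_correct":
        return [t for t in retrieved if t.fb == 1]
    if mode == "only_incorrect":
        return [t for t in retrieved if t.fb == 0]
    raise ValueError(f"unknown ablation mode: {mode!r}")


if __name__ == "__main__":
    retrieved = [
        Triple("q1", "answer 1", 1),
        Triple("q2", "answer 2", 0),
        Triple("q3", "answer 3", 1),
    ]
    for mode in ("use_all", "only_correct", "only_incorrect"):
        print(mode, [t.x for t in select_triples(retrieved, mode)])
```

Note that under the two filtered settings the prompt may contain fewer than $k$ examples, since filtering happens after retrieval in this sketch; whether the original experiments retrieve additional triples to fill the gap is not stated here.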

The ablation results in Figure 2 reveal several findings. First, using incorrect self-output degrades performance, sometimes to below the zero-shot baseline. In contrast, using only correct self-output consistently improves performance over the zero-shot baseline, and the gains are especially stable for MemPrompt (only correct). An important observation is that, even though $fb_{t}$ is verbalized in GrowPrompt and MemPrompt to tell the agent whether its $\hat{y}_{t}$ correctly satisfies $x_{t}$, simply informing the agent that its self-output is incorrect does not help it learn from past mistakes. Conversely, telling the LLM agent what it did correctly is very effective in facilitating improvement. These findings underscore the importance of collecting and utilizing correct self-output in streaming. They also explain the intuition behind Self-StreamICL and its effectiveness: input-output pairs are saved to $\mathcal{M}$ only when the self-output is correct.
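The memory rule behind Self-StreamICL can be sketched as a short loop. This is an illustrative outline, not the released implementation: `agent`, `feedback`, and `retrieve` are hypothetical callables supplied by the caller, and the retrieval strategy (for example, similarity search over $\mathcal{M}$) is abstracted away.

```python
from typing import Callable, Iterable

Pair = tuple  # (x, y_hat) pair stored in the memory M


def run_self_stream_icl(
    stream: Iterable,
    agent: Callable,      # agent(x, demos) -> y_hat
    feedback: Callable,   # feedback(x, y_hat) -> 1 if correct else 0
    retrieve: Callable,   # retrieve(memory, x, k) -> list of stored pairs
    k: int = 4,
) -> tuple[list[Pair], int]:
    """Process the stream once, growing M only with correct self-outputs."""
    memory: list[Pair] = []
    n_correct = 0
    for x in stream:
        demos = retrieve(memory, x, k)   # past correct pairs used as in-context examples
        y_hat = agent(x, demos)
        fb = feedback(x, y_hat)
        if fb == 1:                      # incorrect outputs are never stored
            memory.append((x, y_hat))
        n_correct += fb
    return memory, n_correct


if __name__ == "__main__":
    # Toy stand-ins: the "agent" only answers even inputs correctly.
    toy_agent = lambda x, demos: 2 * x if x % 2 == 0 else -1
    toy_feedback = lambda x, y_hat: int(y_hat == 2 * x)
    most_recent = lambda memory, x, k: memory[-k:]
    print(run_self_stream_icl([1, 2, 3, 4], toy_agent, toy_feedback, most_recent))
```

The design choice to drop incorrect outputs entirely, rather than store them with a negative label, mirrors the ablation finding that showing the agent its past mistakes does not help.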

5.1.2 Sharing memory across multiple agents

Another key insight is the benefit of sharing memory across multiple agents, as demonstrated in MAM-StreamICL. To analyze why memory sharing works, we use DDXPlus as an example and visualize the confusion matrices for a subset of diagnoses related to upper respiratory tract diseases.

Figure 3 presents the confusion matrices for three different LLM agents: gpt-3.5-turbo-0125, gemini-1.0-pro, and claude-3-haiku-20240307, along with the matrix of MAM-StreamICL. Each matrix illustrates the proficiency of an agent across various medical diagnosis categories. It is evident that each model excels in certain areas while struggling in others. For instance, gpt-3.5-turbo-0125 shows high accuracy in predicting “acute rhinosinusitis” and “allergic sinusitis” but struggles with “chronic rhinosinusitis” and “URTI”. In contrast, gemini-1.0-pro performs well on “URTI”, and claude-3-haiku handles “chronic rhinosinusitis”.

The diversity in performance across models suggests that their collective past experiences provide complementary strengths, so sharing these experiences enhances overall performance. Since the agents take turns solving the incoming $x_{t}$, with only one agent acting at each time point $t$, the shared memory system lets each agent benefit from the others while keeping the cost similar to that of a single agent. We conduct further ablation studies in Appendix D to discuss the importance of sharing memory.
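The turn-taking scheme can be sketched as a round-robin loop over a single shared memory. The code below is an illustration under stated assumptions rather than the authors' implementation: the callables `feedback` and `retrieve` are hypothetical, and storing only correct outputs is assumed to carry over from Self-StreamICL.

```python
from typing import Callable, Iterable, Sequence


def run_mam_stream_icl(
    stream: Iterable,
    agents: Sequence[Callable],   # each agent(x, demos) -> y_hat
    feedback: Callable,           # feedback(x, y_hat) -> 1 if correct else 0
    retrieve: Callable,           # retrieve(memory, x, k) -> list of stored pairs
    k: int = 4,
) -> tuple[list, list]:
    """Round-robin over agents with one shared memory; one agent call per time step."""
    shared_memory: list = []
    outputs: list = []
    for t, x in enumerate(stream):
        agent = agents[t % len(agents)]          # agents take turns
        demos = retrieve(shared_memory, x, k)    # may contain other agents' outputs
        y_hat = agent(x, demos)
        if feedback(x, y_hat) == 1:              # assumption: keep correct outputs only
            shared_memory.append((x, y_hat))
        outputs.append(y_hat)
    return shared_memory, outputs


if __name__ == "__main__":
    # Toy agents that tag their answers so the turn order is visible.
    agents = [lambda x, d, tag=tag: (tag, x) for tag in "abc"]
    always_correct = lambda x, y_hat: 1
    most_recent = lambda memory, x, k: memory[-k:]
    memory, outputs = run_mam_stream_icl(range(6), agents, always_correct, most_recent)
    print([tag for tag, _ in outputs])
```

Because exactly one agent is called per time step, the number of model calls equals the stream length, which is why the cost stays close to that of a single agent.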

5.2 Robustness to different streaming sequences

Given the time-variant nature of streaming, evaluating each method’s robustness across different data streams is essential. Therefore, we rerun the streaming baselines with 5 random seeds on five tasks. Figure 4 presents the averaged performance and standard errors of claude-3-haiku across 5 shuffled sequences, with results for gpt-3.5-turbo and gemini-1.0-pro provided in Appendix C. The performance ranking of streaming baselines remains mostly consistent across datasets, with Self-StreamICL and MAM-StreamICL being the top performers. Due to the high cost of running all 5 sequences on StreamBench, we select a fixed sequence for fair comparison among future benchmark users. However, we also release all 5 sequences for those who wish to conduct a thorough evaluation.
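The robustness protocol amounts to shuffling the same instances under several seeds and summarizing the per-sequence scores. A minimal sketch follows; the function names are illustrative, the seed values are arbitrary, and the use of the sample standard deviation for the standard error is an assumption, since the exact estimator is not stated here.

```python
import math
import random
import statistics


def shuffled_sequences(instances, seeds):
    """Return one independently shuffled copy of the instances per seed."""
    sequences = []
    for seed in seeds:
        rng = random.Random(seed)    # local RNG keeps each sequence reproducible
        sequence = list(instances)
        rng.shuffle(sequence)
        sequences.append(sequence)
    return sequences


def mean_and_standard_error(scores):
    """Mean and standard error of the mean (sample standard deviation / sqrt(n))."""
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


if __name__ == "__main__":
    sequences = shuffled_sequences(range(10), seeds=[0, 1, 2, 3, 4])
    print(len(sequences), "sequences of length", len(sequences[0]))
    # Hypothetical per-sequence accuracies, for illustration only.
    print(mean_and_standard_error([1.0, 2.0, 3.0, 4.0, 5.0]))
```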

5.3 Would stronger LLMs still benefit from streaming?

To evaluate whether stronger LLMs still benefit from streaming, we tested two newer models: gpt-4o-2024-08-06 and gemini-1.5-flash-001. Due to the high cost, we only ran the methods shown in Table 3. These stronger models still improved with Self-StreamICL, which achieves the best score for both models on every dataset in Table 3; for example, gpt-4o-2024-08-06 rises from 70.64 (zero-shot) to 92.01 on DDXPlus. This suggests that even stronger models can leverage the information in streaming data to further enhance their performance across diverse tasks.

Table 3: Performance of gpt-4o-2024-08-06 and gemini-1.5-flash-001 on StreamBench.

Task	Text-to-SQL			Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
gemini-1.5-flash
Zero-shot	69.63	48.26	33.83	50.20	69.47	58.90	60.60
Few-shot	71.40	49.35	37.03	50.60	72.13	73.58	64.87
CoT	72.52	52.73	35.14	44.80	68.00	64.06	63.13
Self-StreamICL	77.83	56.21	41.20	52.20	75.07	86.34	65.20
gpt-4o-2024-08-06
Zero-shot	73.54	53.33	34.42	54.90	72.40	70.64	65.53
Few-shot	76.85	57.60	36.25	52.30	71.47	83.45	66.87
CoT	72.52	54.82	31.16	41.90	66.80	73.02	62.80
Self-StreamICL	80.58	59.19	42.63	59.40	76.27	92.01	67.00

5.4 Cost analysis

For benchmark users to estimate the cost, the token usage of all baselines is listed in Appendix E.

6 Related work

6.1 Online learning

Online learning [36] explores the incremental updating of models as new data arrives, making it valuable for dynamic improvement in downstream applications. Traditionally, it focuses on updating network weights, as in methods for training recurrent neural networks [37], online representation learning for image classification [38], and adapting language models to learn new world knowledge [39]. Recent advancements have introduced strategies for improving LLM agents by updating prompts [9, 10, 11], memory [6, 15], or retrievers [12, 13, 14]. These strategies are promising for designing new algorithms that adapt LLMs in the online setting. However, there is no standard testbed for this setup. To address this gap, we propose StreamBench, the first benchmark to pave the way for developing more dynamic adaptation methods for LLM agents.

6.2 Improvement from feedback with LLMs

Recent works have shown that LLMs can improve from feedback when augmented with prompting pipelines or memory mechanisms, forming two main research branches. One is instance-level improvement methods, such as ReAct [40], Self-ICL [41], and Self-Refine [26]. These methods focus on boosting performance on each input instance without leveraging information from past instances. The other is time-sequence-level improvement methods. For example, MemPrompt [6] enhances GPT-3 by storing past user feedback and retrieving it in the future. Reflexion [7] shows that LLM agents can perform better in future trials by running repeated trials on the same dataset, but this is not always practical in real-world scenarios. ExpeL [8] shows that LLM agents can benefit from cross-task experience without needing repeated trials on the target task. However, these works use different datasets and lack a standardized evaluation setting. StreamBench bridges this gap by providing a consistent empirical testbed across diverse tasks to evaluate LLM agents’ improvement.

7 Conclusion

In this work, we introduce a new evaluation setting to measure LLM agents’ performance improvement on downstream tasks, and propose StreamBench as an instance of this setting. There are two major findings in our experiments. First, collecting correct self-generated outputs improves performance consistently, while informing agents of their incorrect outputs sometimes degrades performance. Second, sharing memory across multiple agents is a promising cost-effective technique, as MAM-StreamICL achieves robust performance while maintaining the average cost of a single agent.

StreamBench serves as a stepping stone towards more adaptive AI systems. Future directions include exploring online active learning, where agents request feedback only when necessary, and viewing multi-agent collaboration as a multi-armed bandit (MAB) problem to develop more sophisticated methods for selecting agents and sharing memory at each time point. It is also practical to investigate feedback signals beyond correctness, such as users’ natural language feedback. We hope that this work inspires the development of adaptive methodology for improving LLM agents.
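As a purely illustrative sketch of the bandit view mentioned above, and not a method evaluated in this work, agent selection could replace the fixed turn order with an epsilon-greedy policy that treats each agent as an arm and the correctness feedback as the reward. The class name, the `epsilon` default, and the seeding are assumptions made for this example.

```python
import random


class EpsilonGreedySelector:
    """Pick an agent per time step; reward is the 0/1 correctness feedback."""

    def __init__(self, n_agents: int, epsilon: float = 0.1, seed: int = 0):
        self.counts = [0] * n_agents      # times each agent was chosen
        self.values = [0.0] * n_agents    # running mean reward per agent
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self) -> int:
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda i: self.values[i])

    def update(self, agent_idx: int, fb: int) -> None:
        # Incremental update of the running mean reward.
        self.counts[agent_idx] += 1
        n = self.counts[agent_idx]
        self.values[agent_idx] += (fb - self.values[agent_idx]) / n


if __name__ == "__main__":
    selector = EpsilonGreedySelector(n_agents=3, epsilon=0.0)
    selector.update(0, 0)   # agent 0 was wrong once
    selector.update(1, 1)   # agent 1 was right once
    print(selector.select())
```

A single global estimate per agent ignores the per-category strengths visible in the confusion matrices, so a contextual variant conditioned on $x_{t}$ would be a natural refinement.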

8 Limitations

Tasks and modality coverage

The current version of StreamBench includes tasks such as programming, text-to-SQL conversion, medical diagnosis, question-answering, and tool use. While diverse, they do not encompass all possible types of tasks or domains where LLMs can be applied. StreamBench is also limited to text and does not cover other modalities such as image and audio.

Sim2Real gap

Although we have attempted to simulate feedback signals as realistically as possible, there may still be a gap between the simulated correctness feedback in StreamBench and the feedback encountered in real-world applications. Real-world feedback can be more diverse, noisy, and context-dependent, and these properties may not be fully captured by the current benchmark.

Acknowledgments and Disclosure of Funding

We would like to express our gratitude to Chih-Han Yu, Wei-Lin Chen, Yi-Lin Tsai, Chao-Chung Wu, Zhen-Ting Liu, and An-Zi Yen for their valuable comments and feedback on this work. Their insights greatly contributed to the improvement of our research. We declare no competing interests related to this work.

References

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[2] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
[3] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
[4] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[5] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023.
[6] Aman Madaan, Niket Tandon, Peter Clark, and Yiming Yang. Memory-assisted prompt editing to improve gpt-3 after deployment. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2833–2861, 2022.
[7] Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Neural Information Processing Systems, 2023.
[8] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024.
[9] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022.
[10] Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.
[11] Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023.
[12] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, 2022.
[13] Jiacheng Ye, Zhiyong Wu, Jiangtao Feng, Tao Yu, and Lingpeng Kong. Compositional exemplars for in-context learning. In International Conference on Machine Learning, pages 39818–39833. PMLR, 2023.
[14] Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu, Yuan Ni, Guotong Xie, Xiaoling Wang, and Xipeng Qiu. Unified demonstration retriever for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4644–4668, 2023.
[15] Xiaonan Li and Xipeng Qiu. Mot: Memory-of-thought enables chatgpt to self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6354–6374, 2023.
[16] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
[17] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, 2018.
[18] Tao Yu, Rui Zhang, He Yang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, et al. Cosql: A conversational text-to-sql challenge towards cross-domain natural language interfaces to databases. arXiv preprint arXiv:1909.05378, 2019.
[19] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36, 2024.
[20] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023.
[21] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023.
[22] Boshi Wang, Hao Fang, Jason Eisner, Benjamin Van Durme, and Yu Su. Llms in the imaginarium: tool learning through simulated trial and error. arXiv preprint arXiv:2403.04746, 2024.
[23] Arsene Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. Ddxplus: A new dataset for automatic medical diagnosis. Advances in Neural Information Processing Systems, 35:31306–31318, 2022.
[24] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
[25] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
[26] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2024.
[27] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, 2022.
[28] Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023.
[29] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.
[30] Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007, 2023.
[31] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020.
[32] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[33] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[34] Gemini Team Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv, abs/2403.05530, 2024.
[35] AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 2024.
[36] Steven CH Hoi, Doyen Sahoo, Jing Lu, and Peilin Zhao. Online learning: A comprehensive survey. Neurocomputing, 459:249–289, 2021.
[37] Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
[38] Yanis Bahroun and Andrea Soltoggio. Online representation learning with single and multi-layer hebbian networks for image classification. arXiv preprint arXiv:1702.06456, 2017.
[39] Nathan Hu, Eric Mitchell, Christopher D Manning, and Chelsea Finn. Meta-learning online adaptation of language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4418–4432, 2023.
[40] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
[41] Wei-Lin Chen, Cheng-Kuang Wu, Yun-Nung Chen, and Hsin-Hsi Chen. Self-icl: Zero-shot in-context learning with self-generated demonstrations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15651–15662, 2023.

Checklist

1. For all authors…
  (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] See the experiment results in Section 4.3.
  (b) Did you describe the limitations of your work? [Yes] See Section 8.
  (c) Did you discuss any potential negative societal impacts of your work? [N/A]
  (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results…
  (a) Did you state the full set of assumptions of all theoretical results? [N/A]
  (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments (e.g. for benchmarks)…
  (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Please refer to the supplementary materials.
  (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Please refer to Appendix A and F.
  (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See the results in Figure 4.
  (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 5.4.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
  (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 3.2.
  (b) Did you mention the license of the assets? [Yes] See the supplementary materials.
  (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We include code in the supplementary materials.
  (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes] See the supplementary materials.
  (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects…
  (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
  (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
  (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Hyperparameters

For decoding strategies of all model endpoints used in this work, we set temperature to $0$ and top-p to $1$. The few-shot baseline and streaming baselines (GrowPrompt, MemPrompt, Self-StreamICL, and MAM-StreamICL) incorporate information from $k$ instances into the prompt $p(\cdot)$ to improve LLM agents. We use the same $k$ across these baselines for fair comparison. We set $k=16$ for Spider, CoSQL, BIRD, ToolBench, and DDXPlus. For DS-1000 and HotpotQA, we set $k=4$ to avoid exceeding the context size of gpt-3.5-turbo-0125.
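The settings above can be collected in a small configuration sketch. The names `DECODING` and `K_PER_DATASET` are hypothetical and are not taken from the StreamBench code; only the values come from this appendix.

```python
# Decoding settings shared by all model endpoints (values from Appendix A).
DECODING = {"temperature": 0.0, "top_p": 1.0}

# Number of past instances k placed in the prompt, per dataset (values from Appendix A).
# The dictionary name and layout are illustrative assumptions.
K_PER_DATASET = {
    "Spider": 16,
    "CoSQL": 16,
    "BIRD": 16,
    "ToolBench": 16,
    "DDXPlus": 16,
    "DS-1000": 4,   # smaller k to stay within the context size of gpt-3.5-turbo-0125
    "HotpotQA": 4,  # smaller k to stay within the context size of gpt-3.5-turbo-0125
}
```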

We also analyze how the choice of text encoder used for the memory affects streaming performance (Table 4). Within the same text encoder family (bge), the larger model (bge-base-en-v1.5, 109M parameters) generally performs better than the smaller one (bge-small-en-v1.5, 33.4M parameters). However, the smallest encoder (all-MiniLM-L6-v2, 22.7M parameters) can also achieve strong results, for example the best score with claude-3-haiku, indicating that each LLM may benefit from a specific encoder.

Table 4: Performance of Self-StreamICL (implemented with different text encoders and LLMs) on the DDXPlus dataset.

Text encoder / LLM	gpt-3.5-turbo-0125	claude-3-haiku	gemini-1.5-flash-001
all-MiniLM-L6-v2 (22.7M)	63.61	78.91	83.50
bge-small-en-v1.5 (33.4M)	63.55	75.51	83.90
bge-base-en-v1.5 (109M)	66.16	76.02	86.34

Appendix B Main results of each LLM endpoint

The main experimental results for models from three different LLM families are shown below:

Table 5: Performance of different baselines and datasets for gpt-3.5-turbo-0125

Task	Text-to-SQL			Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
Non-streaming
Zero-Shot	68.89	52.83	29.75	41.50	64.13	47.56	54.53
Few-Shot	69.54	52.73	28.94	33.30	70.13	54.31	54.93
CoT	65.53	47.96	29.21	32.80	57.20	53.18	57.13
Self-Refine	67.21	51.64	29.92	39.80	64.53	47.68	40.06
Streaming
GrowPrompt	70.89	53.43	30.31	27.20	67.60	44.62	55.80
MemPrompt	73.68	54.32	34.16	29.40	67.07	51.53	56.73
Self-StreamICL	75.59	54.92	35.07	41.90	74.00	66.16	55.80

Table 6: Performance of different baselines and datasets for gemini-1.0-pro-001

Task	Text-to-SQL			Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
Non-streaming
Zero-Shot	68.28	49.26	28.16	33.20	61.73	50.57	49.47
Few-Shot	68.33	49.65	31.49	29.40	66.27	57.48	54.40
CoT	52.31	40.81	21.71	21.10	58.40	59.30	49.67
Self-Refine	69.59	46.47	28.94	32.80	61.60	50.57	49.53
Streaming
GrowPrompt	71.59	52.43	28.68	33.10	61.87	51.13	52.80
MemPrompt	70.28	54.22	30.18	35.50	58.80	43.59	55.00
Self-StreamICL	76.48	55.41	33.25	35.80	68.80	69.50	57.20

Table 7: Performance of different baselines and datasets for claude-3-haiku-20240307

Task	Text-to-SQL			Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
Non-streaming
Zero-Shot	66.51	49.55	30.90	38.40	58.27	60.43	41.47
Few-Shot	67.77	49.45	30.77	37.30	69.33	71.15	50.00
CoT	66.74	49.26	30.77	23.90	61.33	62.13	50.60
Self-Refine	66.46	50.35	29.99	36.30	55.87	60.43	41.00
Streaming
GrowPrompt	67.21	50.05	32.07	41.00	65.73	69.56	45.53
MemPrompt	68.37	51.34	31.62	41.50	67.07	66.95	46.13
Self-StreamICL	71.82	54.82	37.61	46.20	71.20	76.02	51.40

Appendix C Robustness to different streaming sequences

C.1 Performance on sequences shuffled by five different random seeds

Results of averaged performance and standard errors of gpt-3.5-turbo-0125, gemini-1.0-pro-001, and claude-3-haiku-20240307 are listed below:

C.2 Performance on the sequence with distributional shifts

Since we randomly assign time steps to each instance, our main results in Table 2 simulate scenarios where each instance in the streaming sequence is drawn from the same distribution. However, it is important to investigate how well streaming methods perform on sequences with distributional shifts. To this end, we conduct experiments on BIRD [19] by arranging instances from the same database consecutively (DB1, DB2, …, DB11), resulting in 10 distributional shifts during streaming. The results are shown in Table 8. The key finding is that Self-StreamICL still outperforms non-streaming baselines, and its performance does not differ drastically when distributional shifts occur.

Table 8: Performance of Self-StreamICL with different LLMs on BIRD.

Method / LLM	gpt-3.5	gemini-1.0-pro	claude-3-haiku	gemini-1.5-flash	gpt-4o
Zero-Shot	29.75	28.16	30.90	33.83	34.42
Few-Shot	28.94	31.49	30.77	37.03	36.25
CoT	29.21	21.71	30.77	35.14	31.16
Self-StreamICL (no shifts)	35.07	33.25	37.61	41.20	42.63
Self-StreamICL (w/ shifts)	36.31	32.60	36.57	40.48	43.09
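The two sequence constructions compared in Table 8 can be sketched as follows. This is a minimal illustration, not the benchmark code: the `db_id` field name, the toy instances, and the function names are assumptions.

```python
import random

def iid_sequence(instances, seed=0):
    """Randomly assign time steps, so every step is drawn from the same distribution."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    return shuffled

def shifted_sequence(instances, key="db_id"):
    """Arrange instances from the same database consecutively (DB1, DB2, ...)."""
    # sorted() is stable, so the original order within each database is preserved.
    return sorted(instances, key=lambda inst: inst[key])

def count_shifts(sequence, key="db_id"):
    """Count the time steps at which the database differs from the previous step."""
    return sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev[key] != cur[key])

# Toy stream: 3 hypothetical databases with 4 questions each.
toy = [{"db_id": f"db{i % 3}", "question": f"q{i}"} for i in range(12)]
shifted = shifted_sequence(toy)
num_shifts = count_shifts(shifted)
```

With D databases grouped this way, the sequence contains D - 1 shifts, which matches the 10 shifts reported for the 11 BIRD databases.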

Appendix D Detailed ablation study results

Table 9: Ablation studies with MAM-StreamICL on DDXPlus. The ablated version of MAM-StreamICL only uses the memory of a single corresponding agent, while still using the round-robin algorithm for multi-agent inference. Both multi-agent memory and multi-agent inference are beneficial for performance. The detailed algorithm for this ablation study can be found in Appendix F.4.1.

Method	GPT	Gemini	Claude	Memory	Inference
Zero-Shot	47.56	50.57	60.43	x	single agent
Self-StreamICL	66.16	69.50	76.02	single agent	single agent
MAM-StreamICL (ablation)	65.31	72.05	81.52	single agent	multi agent
MAM-StreamICL	83.50	83.50	83.50	multi agent	multi agent

Appendix E Token cost breakdown for each LLM endpoint

Tables 10, 11, 12, and 13 show how many millions of input and output tokens are used by gpt-3.5-turbo-0125, gemini-1.0-pro-001, claude-3-haiku-20240307, and the latest models (gemini-1.5-flash and gpt-4o), respectively. The cost of MAM-StreamICL is simply the average cost of the three LLMs because of the round-robin algorithm. Since most LLM endpoints charge usage fees based on input and output tokens, we provide this information so that benchmark users can estimate the cost of running StreamBench.

Table 10: Cost analysis for gpt-3.5-turbo-0125. Each cell reports millions of input tokens / millions of output tokens.

Task	Text-to-SQL			Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
Non-streaming
Zero-Shot	0.734/0.064	0.392/0.023	1.338/0.070	0.522/0.053	5.202/0.019	1.131/0.013	2.182/0.014
Few-Shot	2.367/0.061	0.954/0.021	3.206/0.074	1.984/0.074	5.869/0.023	8.138/0.014	9.864/0.013
CoT	0.778/0.181	0.407/0.077	1.252/0.168	0.550/0.134	5.239/0.053	1.186/0.184	2.227/0.082
Self-Refine	2.317/0.164	1.285/0.079	4.634/0.245	1.802/0.078	45.67/0.262	2.384/0.023	10.43/0.101
Streaming
GrowPrompt	2.819/0.065	1.205/0.021	3.309/0.069	2.103/0.088	5.910/0.019	7.699/0.012	10.56/0.013
MemPrompt	2.711/0.064	1.193/0.021	3.178/0.065	2.066/0.090	5.890/0.018	7.932/0.013	10.60/0.013
Self-StreamICL	2.156/0.063	0.966/0.021	2.688/0.066	1.765/0.073	5.756/0.019	7.833/0.013	10.46/0.013

Table 11: Cost analysis for gemini-1.0-pro-001. Each cell reports millions of input tokens / millions of output tokens.

Task	Text-to-SQL			Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
Non-streaming
Zero-Shot	0.871/0.110	0.454/0.045	1.538/0.114	0.591/0.044	5.603/0.020	1.131/0.011	2.222/0.018
Few-Shot	2.639/0.108	1.062/0.043	3.668/0.120	2.215/0.157	6.301/0.023	8.553/0.010	9.863/0.015
CoT	0.930/0.322	0.477/0.130	1.443/0.319	0.626/0.158	5.626/0.873	1.191/0.310	2.273/0.083
Self-Refine	1.944/0.131	0.987/0.056	2.977/0.129	1.869/0.205	11.14/0.038	2.350/0.019	4.587/0.025
Streaming
GrowPrompt	4.307/0.152	1.774/0.053	4.103/0.109	2.146/0.040	6.334/0.020	8.077/0.010	10.82/0.016
MemPrompt	3.405/0.100	1.693/0.049	4.012/0.109	2.073/0.039	6.325/0.020	8.354/0.011	10.88/0.015
Self-StreamICL	3.402/0.142	1.212/0.034	3.299/0.106	1.880/0.037	6.175/0.019	8.250/0.010	10.79/0.014

Table 12: Cost analysis for claude-3-haiku-20240307. Each cell reports millions of input tokens / millions of output tokens.

Task	Text-to-SQL			Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
Non-streaming
Zero-Shot	0.911/0.096	0.474/0.038	1.510/0.094	0.669/0.204	5.971/0.028	1.287/0.011	2.419/0.025
Few-Shot	2.790/0.094	1.111/0.036	4.047/0.098	2.347/0.256	6.726/0.027	9.160/0.024	10.98/0.022
CoT	0.975/0.391	0.500/0.148	1.571/0.370	0.704/0.341	5.996/0.074	1.350/0.297	2.473/0.166
Self-Refine	2.088/0.194	1.109/0.089	3.753/0.195	1.624/0.225	20.75/0.252	2.948/0.298	6.427/0.076
Streaming
GrowPrompt	3.562/0.099	1.551/0.037	4.103/0.098	2.336/0.222	6.750/0.025	8.650/0.022	11.77/0.024
MemPrompt	3.495/0.103	1.527/0.038	4.067/0.101	2.275/0.214	6.763/0.025	8.924/0.024	11.82/0.024
Self-StreamICL	2.844/0.100	1.219/0.036	3.409/0.098	2.048/0.216	6.577/0.026	8.666/0.026	11.71/0.021

Table 13: Cost analysis for gemini-1.5-flash and gpt-4o. Each cell reports millions of input tokens / millions of output tokens.

Task	Text-to-SQL			Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
gemini-1.5-flash
Zero-Shot	0.851/0.103	0.439/0.045	1.386/0.110	0.690/0.048	5.603/0.023	1.119/0.019	2.222/0.018
Self-StreamICL	2.622/0.092	1.274/0.042	3.339/0.109	2.108/0.045	6.201/0.023	8.018/0.016	10.83/0.013
gpt-4o-2024-05-13
Zero-Shot	0.717/0.079	0.377/0.031	1.359/0.082	0.586/0.067	5.171/0.021	1.088/0.011	2.145/0.019
Self-StreamICL	2.233/0.076	0.978/0.026	2.596/0.074	1.846/0.060	5.705/0.019	7.533/0.011	10.42/0.013
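As a rough illustration of how the token counts in Tables 10 to 13 translate into a budget, the sketch below converts one table cell into a dollar amount and averages over the three LLMs for MAM-StreamICL. The prices are placeholders chosen for the example, not real provider prices, and the function names are hypothetical.

```python
def parse_cell(cell):
    """Split a table cell such as '0.734/0.064' into (input, output) millions of tokens."""
    inp, out = cell.split("/")
    return float(inp), float(out)

def usage_cost(input_mtok, output_mtok, price_in, price_out):
    """Cost given usage in millions of tokens and prices per million tokens."""
    return input_mtok * price_in + output_mtok * price_out

# Self-StreamICL on Spider, copied from Tables 10, 11, and 12.
cells = {
    "gpt-3.5-turbo-0125": "2.156/0.063",
    "gemini-1.0-pro-001": "3.402/0.142",
    "claude-3-haiku-20240307": "2.844/0.100",
}

# PLACEHOLDER prices (USD per million input / output tokens); substitute current provider prices.
PRICE_IN, PRICE_OUT = 1.0, 2.0

per_model = {
    name: usage_cost(*parse_cell(cell), PRICE_IN, PRICE_OUT) for name, cell in cells.items()
}

# Round-robin inference sends each instance to one of the three agents,
# so the MAM-StreamICL cost is estimated as the average of the three costs.
mam_cost = sum(per_model.values()) / len(per_model)
```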

Appendix F Supplementary materials

The supplementary materials include details such as preprocessing of datasets, prompts of each baseline method, and code to reproduce the experiments.

F.1 Code repository

The code for reproducing the experiments can be found in our GitHub repository: https://github.com/stream-bench/stream-bench.

F.2 Details for each dataset

We provide the licenses, preprocessing pipelines, calculation of evaluation metrics, and links to the datasets in StreamBench: https://huggingface.co/datasets/appier-ai-research/StreamBench. To construct the streaming sequences, one only needs to download the datasets and follow the instructions in our code repository.

Table 14: Licenses of each dataset on StreamBench.

Task	Text-to-SQL			Python	Tool Use	Medical	QA
Dataset	Spider	CoSQL	BIRD	DS-1000	ToolBench	DDXPlus	HotpotQA
License	CC BY-SA	CC BY-SA	CC BY-SA	CC BY-SA	Apache-2.0	CC-BY	CC BY-SA

F.2.1 Spider

Preprocessing

We download the Spider [17] dataset from its project website: https://yale-lily.github.io/spider, and use the original test set as our test set on StreamBench.¹

¹ The current “Spider” on StreamBench refers to Spider 1.0. Please refer to the project website for details on the upcoming version of Spider.

Evaluation metric

We adopt the commonly used Execution Accuracy (EA) for all three Text-to-SQL datasets (Spider, CoSQL, and BIRD). This metric is the proportion of instances for which the execution result of the generated SQL $\hat{y}_{t}$ is identical to that of the ground truth SQL $y_{t}$, over all instances from time step $t=1$ to $T$, where $T$ is the number of instances in the test set:

$$\mathrm{EA}=\frac{\sum_{t=1}^{T}\mathbbm{1}\left(r_{t},\hat{r}_{t}\right)}{T}$$

Here, $r_{t}$ denotes the execution result of $y_{t}$, $\hat{r}_{t}$ denotes the execution result of $\hat{y}_{t}$, and $\mathbbm{1}(\cdot,\cdot)$ is the indicator function defined by:

$$\mathbbm{1}(r_{t},\hat{r}_{t})=\begin{cases}1,&r_{t}=\hat{r}_{t}\\ 0,&r_{t}\neq\hat{r}_{t}\end{cases}$$
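The EA definition can be illustrated with a small SQLite sketch. This is a simplified stand-in for the official evaluation scripts. It makes two assumptions: results are compared as multisets of rows (row order is ignored), and a query that fails to execute counts as incorrect. The toy table and queries are invented for the example.

```python
import sqlite3
from collections import Counter

def execute(conn, sql):
    """Run a query and return its rows as a multiset, or None if execution fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def execution_accuracy(conn, gold_sqls, pred_sqls):
    """Fraction of instances whose predicted SQL yields the same result as the gold SQL."""
    hits = 0
    for gold, pred in zip(gold_sqls, pred_sqls):
        r, r_hat = execute(conn, gold), execute(conn, pred)
        hits += int(r is not None and r == r_hat)  # indicator 1(r_t, r_hat_t)
    return hits / len(gold_sqls)

# Toy database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 41), (3, 'Cy', 25);
    """
)
gold = ["SELECT name FROM singer WHERE age > 28", "SELECT COUNT(*) FROM singer"]
pred = ["SELECT name FROM singer WHERE age >= 30", "SELECT COUNT(*) FROM singer WHERE age > 28"]
ea = execution_accuracy(conn, gold, pred)
```

The first prediction differs textually from the gold query but returns the same rows, so it counts as correct; the second returns a different count, so EA is 0.5 on this toy stream.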

F.2.2 CoSQL

Preprocessing

The CoSQL [18] dataset is sourced from its official website: https://yale-lily.github.io/cosql. Due to the unavailability of the official test set, we utilize the original development set as our test set on StreamBench. CoSQL was originally formatted in a multi-turn conversation structure, including a sequence of question-SQL pairs (i.e., $(x,y)$ pairs where $x$ is the user’s natural language question and $y$ is the SQL code). To adapt CoSQL to the streaming framework of StreamBench, we extract each $(x,y)$ pair from the conversations to build a test set of size $T=1{,}007$.
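The extraction step can be sketched as a simple flattening of conversations into independent instances. The field names (`db_id`, `turns`, `question`, `sql`) are hypothetical and do not necessarily match the raw CoSQL files.

```python
def flatten_conversations(conversations):
    """Turn multi-turn conversations into a flat list of (x, y) instances."""
    pairs = []
    for conv in conversations:
        for turn in conv["turns"]:
            pairs.append(
                {
                    "db_id": conv["db_id"],  # database the question refers to
                    "x": turn["question"],   # user's natural language question
                    "y": turn["sql"],        # ground truth SQL code
                }
            )
    return pairs

# Toy conversations with an assumed structure.
conversations = [
    {
        "db_id": "concert",
        "turns": [
            {"question": "How many singers are there?", "sql": "SELECT COUNT(*) FROM singer"},
            {"question": "List their names.", "sql": "SELECT name FROM singer"},
        ],
    },
    {
        "db_id": "pets",
        "turns": [{"question": "How many pets are dogs?", "sql": "SELECT COUNT(*) FROM pets WHERE type = 'dog'"}],
    },
]
instances = flatten_conversations(conversations)
```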

F.2.3 BIRD

Preprocessing

We download the BIRD [19] dataset from its project website: https://bird-bench.github.io/. Similar to CoSQL, we use the full development set as our test set on StreamBench due to the unavailability of the original test set.

F.2.4 DS-1000

Preprocessing

We use the DS-1000 [20] dataset on Huggingface: https://huggingface.co/datasets/xlangai/DS-1000. Since this dataset only contains the test set, we manually construct few-shot examples for the few-shot baseline. The few-shot examples are available in our code repository. Please refer to Section F.1.

Evaluation metric

DS-1000 adopts pass@1 as the evaluation metric, which denotes the proportion of instances where the agent’s code solution $\hat{y}_{t}$ passes all test cases of $x_{t}$ for $t=1,2,\ldots,T$.
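With one solution per instance, pass@1 reduces to the fraction of solutions that pass every test case. The sketch below illustrates this with `exec` on toy snippets. It is not the DS-1000 test harness, the helper names are hypothetical, and running untrusted model output this way would need sandboxing in practice.

```python
def passes_all_tests(solution_code, test_cases):
    """Return True if the solution runs and satisfies every test snippet."""
    namespace = {}
    try:
        exec(solution_code, namespace)  # define the candidate solution
        for test in test_cases:
            exec(test, namespace)       # each test is an assert statement
    except Exception:
        return False                    # runtime errors and failed asserts both count as failures
    return True

def pass_at_1(solutions, tests):
    """Proportion of instances whose single solution passes all of its test cases."""
    results = [passes_all_tests(s, t) for s, t in zip(solutions, tests)]
    return sum(results) / len(results)

# Toy instances: the second solution passes its first test but fails the second.
solutions = [
    "def add(a, b):\n    return a + b",
    "def double(x):\n    return x + 2",
]
tests = [
    ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"],
    ["assert double(2) == 4", "assert double(3) == 6"],
]
score = pass_at_1(solutions, tests)
```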

F.2.5 ToolBench

Preprocessing

Since the original ToolBench [21] contains large-scale real online APIs suffering from instability, we adopt the 50 high-quality APIs curated by STE [22]. Each API has 15 test instances, so there are 750 instances in the test set.

Evaluation metric

We follow the evaluation protocol specified in STE [22] to calculate the accuracy. The agent’s output $\hat{y}_{t}$ is considered correct if and only if both the API name and the API arguments are correct. The API name is checked by exact string matching. For APIs whose arguments have deterministic values, exact string matching is also applied to the arguments. For APIs that accept natural language inputs, a judge LLM is used to evaluate the correctness of the API arguments. The implementation details can be found in our code repository (Section F.1).
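The correctness rule described above can be sketched as follows. The data layout (`api_name`, `args`), the `free_text_args` parameter, and the `judge` callable are all hypothetical; in particular, `judge` is a stub standing in for the judge LLM, whose actual prompt and implementation live in the code repository.

```python
def is_correct(pred, gold, free_text_args=(), judge=None):
    """Correct iff the API name matches exactly and every argument is correct."""
    if pred["api_name"] != gold["api_name"]:
        return False
    if set(pred["args"]) != set(gold["args"]):
        return False
    for name, gold_value in gold["args"].items():
        pred_value = pred["args"][name]
        if name in free_text_args:
            # Natural language argument: defer to the judge (an LLM in the paper).
            if judge is None or not judge(name, pred_value, gold_value):
                return False
        elif str(pred_value) != str(gold_value):
            # Deterministic argument: exact string matching.
            return False
    return True

# Stub judge for illustration only: case-insensitive comparison instead of an LLM call.
def stub_judge(name, pred_value, gold_value):
    return pred_value.strip().lower() == gold_value.strip().lower()

gold = {"api_name": "get_weather", "args": {"city": "Taipei", "note": "Rain expected today"}}
pred_ok = {"api_name": "get_weather", "args": {"city": "Taipei", "note": "rain expected today"}}
pred_bad = {"api_name": "get_weather", "args": {"city": "Tainan", "note": "Rain expected today"}}

ok = is_correct(pred_ok, gold, free_text_args={"note"}, judge=stub_judge)
bad = is_correct(pred_bad, gold, free_text_args={"note"}, judge=stub_judge)
```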

F.2.6 DDXPlus

Preprocessing

DDXPlus [23] is a large-scale dataset for medical diagnosis whose original test set contains more than 100,000 instances. Since it would be too expensive to run all test instances on LLMs, we sample an equal number of instances from each medical diagnosis to form a test set of size $T=1{,}764$ on StreamBench. The full test set is available from the link provided in Section F.2, where $x$ is the patient profile and $y$ is the pathology (i.e., diagnosis). The original dataset can be found in the repository of DDXPlus: https://github.com/mila-iqia/ddxplus.
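Sampling an equal number of instances per diagnosis can be sketched as below. The field name `pathology`, the seed, and the final shuffle are assumptions. With the 49 possible diagnoses mentioned in Section F.3.2, $T=1{,}764$ would correspond to 36 instances per diagnosis; this per-class count is inferred from the arithmetic and is not stated in the text.

```python
import random
from collections import Counter, defaultdict

def sample_equal_per_label(instances, per_label, label_key="pathology", seed=0):
    """Draw the same number of instances from every label, without replacement."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst[label_key]].append(inst)
    sampled = []
    for label in sorted(by_label):  # sorted for reproducibility
        # rng.sample raises ValueError if a label has fewer than per_label instances.
        sampled.extend(rng.sample(by_label[label], per_label))
    rng.shuffle(sampled)  # randomize the order of the resulting stream
    return sampled

# Toy pool: 3 hypothetical diagnoses with 10 patient profiles each.
pool = [{"pathology": f"dx{i % 3}", "profile": f"patient {i}"} for i in range(30)]
subset = sample_equal_per_label(pool, per_label=4)
counts = Counter(inst["pathology"] for inst in subset)
```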

Evaluation metric

We use accuracy as the evaluation metric, which is calculated as $\frac{\sum_{t=1}^{T}\mathbbm{1}(\hat{y}_{t}=y_{t})}{T}$.

F.2.7 HotpotQA

Preprocessing

We adopt the distractor setting in HotpotQA [24], where each instance contains both supporting and distracting documents for answering the question. The supporting documents have the information needed to answer the question, while the distracting documents do not. Because the test set is not available, we construct the test set by randomly sampling 1,500 instances from the dev set (distractor) downloaded from the official website: https://hotpotqa.github.io/.

Evaluation metric

Following the HotpotQA paper, we adopt exact match (EM) and F1 as two evaluation metrics. We use EM as the primary evaluation metric on StreamBench. However, we also include the calculation of F1 in our code.
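For reference, EM and token-level F1 for short answers are commonly computed as in the sketch below, which uses the widely used normalization from extractive QA evaluation (lowercasing, removing punctuation and English articles, collapsing whitespace). This is an illustrative approximation; the official HotpotQA script and the StreamBench code may contain additional special cases.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower", "Eiffel Tower")
f1 = f1_score("Eiffel Tower in Paris", "Eiffel Tower")
```

In the second example the prediction contains the gold answer plus two extra tokens, so precision is 0.5, recall is 1.0, and F1 is 2/3, while EM would be 0.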

F.3 Prompt templates

We use similar prompt templates in all tasks to minimize prompt engineering. To demonstrate, the prompt templates of the Text-to-SQL task (Spider, CoSQL, and BIRD) as well as the medical diagnosis task (DDXPlus) are provided below. Note that the Self-Refine [26] prompting pipeline uses the zero-shot prompt to generate the initial output, and then alternates between the feedback prompt and the refinement prompt to arrive at the final answer. Therefore, we provide two prompt templates for the Self-Refine baseline, one for feedback and the other for refinement. The prompt templates for other datasets (DS-1000, ToolBench, and HotpotQA) can be found in our code repository.

F.3.1 Text-to-SQL

The prompt templates for Spider, CoSQL, and BIRD are provided below.

Zero-shot

In this template, {schema} is replaced by the database schema, and {question} is replaced by the user’s data requirements.

{schema}

– Using valid SQLite, answer the following question for the tables provided above.

– Question: {question}

Now, generate the correct SQL code directly (Do NOT generate other text except the SQL code):

Figure 11: The prompt template for the zero-shot baseline for Text-to-SQL datasets.

Chain-of-thought (CoT)

It is similar to the zero-shot prompt template, except that the trigger phrase “take a deep breath and work on this problem step-by-step to derive the correct SQL code” is appended to the end.

{schema}

– Using valid SQLite, answer the following question for the tables provided above.

– Question: {question}

Now, take a deep breath and work on this problem step-by-step to derive the correct SQL code.

Provide your output in the following format:

Rationale: <your_rationale>

Answer: ```sql\n<your_SQL_code>\n```

Figure 12: The prompt template for the CoT baseline for Text-to-SQL datasets.

Self-Refine

The feedback prompt and refinement prompt are provided below.

You are performing the text-to-SQL task. Here is the database schema, user’s question, and your previously generated SQL code.

– SQL schema: {schema}
– User’s question: {question}
– Your SQL code: {model_output}

First, determine whether you need to refine your SQL code in terms of its correctness.

If you consider that your SQL code is correct, output ’NO NEED TO REFINE’ in uppercase.

Otherwise, provide a suggestion to correct the SQL code.

Figure 13: The feedback prompt template for the Self-Refine baseline for Text-to-SQL datasets.

You are performing the text-to-SQL task. Here is the database schema, user’s question, and your previous answer-feedback trajectory.

– SQL schema: {schema}
– User’s question: {question}
– Your previous answer-feedback trajectory:

{trajectory}

According to the latest feedback, provide your refined SQL code.

Provide your output in the following format:

```sql\n<your_SQL_code>\n```

Figure 14: The refinement prompt template for the Self-Refine baseline for Text-to-SQL datasets.
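The alternation between the feedback prompt and the refinement prompt can be sketched as a short loop. The stopping phrase comes from the feedback template; everything else is an assumption made for illustration: the `llm` callable, the `max_rounds` cap, the way the trajectory is rendered as text, and the shortened toy templates.

```python
def self_refine(llm, zero_shot_prompt, feedback_prompt, refine_prompt, fields, max_rounds=3):
    """Generate an initial answer, then alternate feedback and refinement."""
    output = llm(zero_shot_prompt.format(**fields))
    trajectory = []
    for _ in range(max_rounds):  # max_rounds is an assumed cap, not from the paper
        feedback = llm(feedback_prompt.format(model_output=output, **fields))
        if "NO NEED TO REFINE" in feedback:
            break  # the model judges its current answer to be correct
        trajectory.append(f"Answer: {output}\nFeedback: {feedback}")
        output = llm(refine_prompt.format(trajectory="\n\n".join(trajectory), **fields))
    return output

# Shortened toy templates that keep the placeholders of the real ones.
ZERO_SHOT = "{schema}\n-- Question: {question}"
FEEDBACK = "Schema: {schema}\nQuestion: {question}\nYour SQL code: {model_output}"
REFINE = "Schema: {schema}\nQuestion: {question}\nTrajectory:\n{trajectory}"

# Scripted stand-in for an LLM endpoint: returns canned responses in order.
responses = iter(["SELECT 1", "Use COUNT(*) on table t.", "SELECT COUNT(*) FROM t", "NO NEED TO REFINE"])
def scripted_llm(prompt):
    return next(responses)

fields = {"schema": "CREATE TABLE t (id INT);", "question": "How many rows are in t?"}
final_sql = self_refine(scripted_llm, ZERO_SHOT, FEEDBACK, REFINE, fields)
```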

Prompt template for past information

For the few-shot and streaming methods (GrowPrompt, MemPrompt, Self-StreamICL, and MAM-StreamICL), we use the following prompt template for integrating information of past instances.

You are performing the text-to-SQL task. Here are some examples:

{past_information}

Now it’s your turn.

– SQL schema: {schema}
– Using valid SQLite, answer the following question for the SQL schema provided above.

– Question: {question}

Now, generate the correct SQL code directly (Do NOT generate other text except the SQL code):

Figure 15: The prompt template for integrating past information for Text-to-SQL datasets.

Note

{past_information} is replaced by $k$ templated past instances, and the template differs across baselines. We provide the templates as follows:

Question: {question}
{sql_code}

Figure 16: The template for each past instance ($k=16$ in Text-to-SQL) for the few-shot, Self-StreamICL, and MAM-StreamICL baselines.

Question: {question}
Your SQL code: {sql_code}
User Feedback: {verbalized_feedback}

Figure 17: The template for each past instance ($k=16$ in Text-to-SQL) for the GrowPrompt and MemPrompt baselines. In this template, {verbalized_feedback} is the verbalized $fb_{t}$, which is “Your answer is correct” when $fb_{t}=1$ or “Your answer is not correct” when $fb_{t}=0$.

{past_information} is replaced by the information of $k$ templated past instances ($k$ varies with datasets; see Appendix A for details), each delimited by three newlines.
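Assembling {past_information} from the per-instance templates in Figures 16 and 17 can be sketched as follows. The template strings and the three-newline delimiter follow the text; the function names and the dictionary keys of the toy instances (`question`, `sql_code`, `fb`) are assumptions.

```python
# Per-instance templates, following Figures 16 and 17.
PLAIN_INSTANCE = "Question: {question}\n{sql_code}"
FEEDBACK_INSTANCE = (
    "Question: {question}\n"
    "Your SQL code: {sql_code}\n"
    "User Feedback: {verbalized_feedback}"
)

def verbalize(fb):
    """Map the binary feedback fb_t to its verbalized form."""
    return "Your answer is correct" if fb == 1 else "Your answer is not correct"

def build_past_information(instances, with_feedback=False):
    """Render k past instances and join them with three newlines."""
    blocks = []
    for inst in instances:
        if with_feedback:  # GrowPrompt and MemPrompt
            blocks.append(
                FEEDBACK_INSTANCE.format(
                    question=inst["question"],
                    sql_code=inst["sql_code"],
                    verbalized_feedback=verbalize(inst["fb"]),
                )
            )
        else:  # few-shot, Self-StreamICL, and MAM-StreamICL
            blocks.append(PLAIN_INSTANCE.format(question=inst["question"], sql_code=inst["sql_code"]))
    return "\n\n\n".join(blocks)

past = [
    {"question": "How many singers are there?", "sql_code": "SELECT COUNT(*) FROM singer", "fb": 1},
    {"question": "List all names.", "sql_code": "SELECT nam FROM singer", "fb": 0},
]
plain_text = build_past_information(past)
feedback_text = build_past_information(past, with_feedback=True)
```

The resulting string is then substituted for {past_information} in the template of Figure 15.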

F.3.2 Medical diagnosis

The prompt templates for DDXPlus are provided below.

Zero-shot

In the prompt template for the zero-shot baseline on DDXPlus, {profile} is replaced by the actual patient profile, and {option_text} is replaced by all 49 possible diagnoses for the agent to choose from.

Act as a medical doctor and diagnose the patient based on the following patient profile:

{profile}

All possible diagnoses for you to choose from are as follows (one diagnosis per line, in the format of <number>. <diagnosis>):

{option_text}

Now, directly provide the diagnosis for the patient in the following format: <number>. <diagnosis>

Figure 18: The prompt template for the zero-shot baseline on DDXPlus.

Chain-of-thought (CoT)

It is similar to the zero-shot prompt template, except that the trigger phrase “take a deep breath and work on this problem step-by-step to derive the most likely diagnosis” is appended to the end.

Act as a medical doctor and diagnose the patient based on the following patient profile:

{profile}

{option_text}

Now, take a deep breath and work on this problem step-by-step to derive the most likely diagnosis. Provide your output in the following valid JSON format:

{"rationale": "<your_rationale>", "answer": "<number>. <diagnosis>"}

Figure 19: The prompt template for the zero-shot chain-of-thought (CoT) baseline on DDXPlus.

Self-Refine

The feedback prompt and refinement prompt are provided below.

You are acting as medical doctor and tasked to diagnose the patient based on the provided patient profile. Here’s the patient diagnosis:

{profile}

{option_text}

Your answer : {model_output}

First, determine whether you need to refine your answer in terms of its correctness.

If you consider that your answer is correct, output ‘NO NEED TO REFINE’ in uppercase.

Otherwise, provide a suggestion to correct the diagnoses in the format of <number>. <diagnosis>.

Figure 20: The feedback prompt template for the Self-Refine baseline on DDXPlus.

You are acting as medical doctor and tasked to diagnose the patient based on the provided patient profile. Here’s the patient diagnosis:

{profile}
All possible diagnoses for you to choose from are as follows (one diagnosis per line, in the format of <number>. <diagnosis>):

{option_text}
– Your previous answer-feedback trajectory:

{trajectory}

According to the latest feedback, provide your new answer

Provide your output in the following format: one diagnosis per line, in the format of <number>. <diagnosis>

Figure 21: The refinement prompt template for the Self-Refine baseline on DDXPlus.

Prompt template for past information

Act as a medical doctor and diagnose the patient based on the provided patient profile.

{option_text}

Here are some example cases.

{past_information}

Now it’s your turn.

{profile}

Now provide the diagnosis for the patient in the following format: <number>. <diagnosis>

Figure 22: The prompt template for multiple baselines on DDXPlus.

Note 注意

The actual content of {past_information} is different for different baselines:


{profile}

Diagnosis: {diagnosis}

Figure 23: The template for each past instance ($k=16$ in DDXPlus) for the few-shot, Self-StreamICL, and MAM-StreamICL baselines.
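As an illustration of how the per-instance template in Figure 23 could fill the {past_information} slot, here is a minimal sketch. The function name `format_icl_instances` and the blank-line separator between instances are assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def format_icl_instances(instances: List[Tuple[str, str]], k: int = 16) -> str:
    """Render up to k (profile, diagnosis) pairs with the Figure 23 template."""
    blocks = [
        f"{profile}\n\nDiagnosis: {diagnosis}"
        for profile, diagnosis in instances[:k]
    ]
    # Assumed separator: one blank line between consecutive instances.
    return "\n\n".join(blocks)
```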


{profile}

Your answer: {diagnosis}

User Feedback: {verbalized_feedback}

Figure 24: The template for each past instance ($k=16$ in DDXPlus) for the GrowPrompt and MemPrompt baselines. In this template, {verbalized_feedback} is the verbalized $fb_{t}$, which is “Your answer is correct” when $fb_{t}=1$ or “Your answer is not correct” when $fb_{t}=0$.
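The mapping from the binary feedback signal to the text used in the Figure 24 template can be sketched as follows. The function names are hypothetical; the two feedback strings are the ones quoted in the caption.

```python
def verbalize_feedback(fb: int) -> str:
    """Map the binary feedback signal fb_t to the sentence shown to the model."""
    return "Your answer is correct" if fb == 1 else "Your answer is not correct"


def format_feedback_instance(profile: str, diagnosis: str, fb: int) -> str:
    """Render one past instance with the Figure 24 template."""
    return (
        f"{profile}\n\n"
        f"Your answer: {diagnosis}\n\n"
        f"User Feedback: {verbalize_feedback(fb)}"
    )
```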

F.4 Other details

F.4.1 Algorithm of ablation studies with MAM-StreamICL

In Table 9, we conduct ablation studies on MAM-StreamICL to show the importance of multi-agent memory. The algorithm of “MAM-StreamICL (ablation)” in this table is provided as follows:

Algorithm 3 Algorithm for MAM-StreamICL (ablation)

1: Initialize $K$ agents $f_{0}(\cdot|\theta_{0}), f_{1}(\cdot|\theta_{1}), \ldots, f_{K-1}(\cdot|\theta_{K-1})$; $\triangleright$ $K=1$ in the Self-StreamICL baseline
2: Initialize prompt $p(\cdot)$, retriever $r(\cdot)$, and external memory $\mathcal{M}_{0}$, all shared between agents;
3: Select an agent $f_{s}(\cdot|\theta_{s})$ as the source of single-agent memory; $\triangleright$ For example, we can choose gemini-1.0-pro-001
4: for $t=1,2,\ldots,T$ do
5:     Receive instance $x_{t}$ from the data stream;
6:     Select the next agent by $k = t \bmod K$;
7:     The $k$-th agent predicts $\hat{y}_{t}=f_{k}(p(x_{t},r(\mathcal{M}_{t-1}))|\theta_{k})$; $\triangleright$ $\hat{y}_{t}$ is used for evaluation
8:     The chosen single agent predicts $\hat{y}_{t_{s}}=f_{s}(p(x_{t},r(\mathcal{M}_{t-1}))|\theta_{s})$; $\triangleright$ Counterfactual ablation
9:     Receive feedback signal $fb_{t_{s}}=g(x_{t},\hat{y}_{t_{s}})$; $\triangleright$ $\hat{y}_{t_{s}}$ is used for receiving feedback
10:     if $fb_{t_{s}}=1$ then $\triangleright$ which means the self-output $\hat{y}_{t_{s}}$ is correct
11:         $\mathcal{M}_{t}\leftarrow\mathcal{M}_{t-1}\cup\{(x_{t},\hat{y}_{t_{s}})\}$;
12:     else
13:         $\mathcal{M}_{t}\leftarrow\mathcal{M}_{t-1}$;
14:     end if
15: end for

The parts that differ from the original MAM-StreamICL algorithm are highlighted in red. This ablated algorithm can be seen as a counterfactual experiment, where we use multiple agents for inference but only one chosen agent for the memory mechanism. The results in Table 9 show that both multi-agent memory and multi-agent inference improve performance.
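The control flow of Algorithm 3 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the code from the StreamBench repository: agents are modeled as hypothetical callables from a prompt string to an answer, `retriever` is assumed to take the current instance in addition to the memory, and `feedback_fn` stands in for $g(\cdot)$.

```python
from typing import Callable, List, Sequence, Tuple

Memory = List[Tuple[str, str]]  # stored (x_t, y_hat_{t_s}) pairs


def run_mam_ablation(
    agents: Sequence[Callable[[str], str]],         # hypothetical: prompt -> answer
    source_idx: int,                                # index s of the memory-source agent
    stream: Sequence[str],                          # instances x_1, ..., x_T
    prompt_fn: Callable[[str, Memory], str],        # p(x_t, retrieved instances)
    retriever: Callable[[Memory, str], Memory],     # r(.); taking x_t is an assumption
    feedback_fn: Callable[[str, str], int],         # g(x_t, y_hat) -> 0 or 1
) -> Tuple[List[str], Memory]:
    """Round-robin inference with memory written by a single chosen agent."""
    memory: Memory = []
    predictions: List[str] = []
    num_agents = len(agents)
    for t, x in enumerate(stream, start=1):
        k = t % num_agents                           # line 6: k = t mod K
        prompt = prompt_fn(x, retriever(memory, x))  # both agents see M_{t-1}
        predictions.append(agents[k](prompt))        # line 7: used for evaluation
        y_source = agents[source_idx](prompt)        # line 8: counterfactual output
        if feedback_fn(x, y_source) == 1:            # lines 9-11: keep correct outputs
            memory.append((x, y_source))
    return predictions, memory
```

Only the source agent's correct outputs ever enter the memory, while the evaluated predictions still rotate over all agents; this separation is what makes the ablation counterfactual.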
