

August 28, 2024


ReadAgent: Bringing Gist Memory to AI

Learn how gist memory improves long context handling for large language models. This blog post explores building an AI agent with gist memory using the Cerebras Inference SDK, demonstrating how fast inference can make complex LLM-based workflows more efficient and practical for real-world applications.

Large Language Models (LLMs) exhibit remarkable abilities in understanding natural language, but they are not without limitations. One area where LLMs can struggle is in processing long text inputs, even when these inputs don’t exceed their context length. Although the context windows for models have been continuously expanding over the last few years, research shows that longer contexts are not always utilized effectively by the model.

Nevertheless, the ability to refer to and retrieve information from large pieces of text is critical when building LLM applications. To enhance LLMs’ capabilities in this area, earlier this year researchers at Google DeepMind introduced an AI agent called ReadAgent. Inspired by how humans read and remember text, ReadAgent “reads” through a document, splits the text into “pages,” and generates a summary (gist) for each page. When prompted to answer a question or complete a task using the text, it refers to these summaries to determine which pages from the original context it should use to generate a response to the prompt.

This method has proven effective, as ReadAgent has demonstrated improved performance on long-document reading comprehension tasks (QuALITY, NarrativeQA, QMSum), increasing the effective context length by 3 to 20 times. These improvements are solely due to its prompting pattern, which operates through a series of consecutive API calls to the LLM. Despite the impressive performance gains, this method has yet to see wide adoption because it requires processing a large number of tokens, which can be both slow and expensive.

At Cerebras, we’ve designed an inference solution that stands out in these workflows with its low-latency performance. We’ve implemented ReadAgent using our Cerebras Inference SDK to showcase what’s possible when innovations in agentic workflows meet fast inference.

In this blog post, we’ll dive into the details of how ReadAgent works. If you’d like to explore the code for our implementation, you can visit our project repository.

ReadAgent’s Workflow

Step 1: Episode Pagination

ReadAgent begins by breaking down long texts into manageable episodes or ‘pages’ through a process called episode pagination. As the model reads through the text, it decides where to pause by evaluating natural break points, such as scene transitions, ends of dialogues, or narrative shifts. This process starts by providing the language model with a segment of text that begins from the previous pause point and ends when it reaches a set maximum word limit. The model is then prompted to choose a natural pause point between paragraphs, which are marked by numbered tags. The content between two consecutive pause points becomes a page. This approach allows ReadAgent to create shorter, meaningful segments of text that preserve context and coherence, rather than relying on arbitrary fixed-length chunks.
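To make this concrete, here’s a minimal sketch of the pagination loop. It is a simplification rather than the exact code or prompts from the paper or our repository: the `ask` helper, prompt wording, and word limit are illustrative, and we assume the Cerebras SDK’s OpenAI-style chat interface with a Llama 3.1 8B model.

```python
import re

from cerebras.cloud.sdk import Cerebras

client = Cerebras()  # assumes CEREBRAS_API_KEY is set in the environment

def ask(prompt, model="llama3.1-8b"):
    """Make one chat-completion call and return the text of the reply."""
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def paginate(paragraphs, max_words=600):
    """Split a list of paragraphs into pages at model-chosen pause points."""
    pages, start = [], 0
    while start < len(paragraphs):
        # Gather paragraphs from the previous pause point up to the word limit.
        end, words = start, 0
        while end < len(paragraphs) and words < max_words:
            words += len(paragraphs[end].split())
            end += 1
        if end == len(paragraphs):  # the final window needs no pause point
            pages.append("\n\n".join(paragraphs[start:]))
            break
        # Tag each paragraph boundary so the model can name a break point.
        tagged = "\n\n".join(
            f"<{i}>\n{p}" for i, p in enumerate(paragraphs[start:end], start)
        )
        reply = ask(
            "Below is part of a document with numbered tags between "
            f"paragraphs:\n\n{tagged}\n\n"
            "Reply with only the tag number of the most natural pause point, "
            "such as a scene change, the end of a dialogue, or a narrative shift."
        )
        match = re.search(r"\d+", reply)
        pause = int(match.group()) if match else end - 1
        pause = min(max(pause, start + 1), end)  # guard against bad replies
        pages.append("\n\n".join(paragraphs[start:pause]))
        start = pause
    return pages
```

Because each pause point depends on the previous one, these calls are inherently sequential, which is worth keeping in mind for the latency discussion below.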

Step 2: Memory Gisting

After pagination, ReadAgent compresses each page into a shorter “gist” through a process called memory gisting. The language model is prompted to create a summary of the page’s content, focusing on key information while removing redundant or less important details. These gists are then associated with their corresponding page numbers to maintain context. Lastly, the gists from all the pages are concatenated to form the gist memory, which serves as a compressed representation of the entire document. This step significantly reduces the overall length of the text while preserving its essential meaning and structure. This is somewhat similar to how humans remember what they read. As we go through a book or document, we create a mental outline of the material, retaining important concepts and the sequence in which we encountered them. We don’t remember the exact words of the text.
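Gisting is then a single summarization call per page, reusing the `ask` helper from the pagination sketch; the prompt below paraphrases the idea and is not the exact prompt from the paper.

```python
def gist_pages(pages):
    """Summarize each page into a gist; one LLM call per page."""
    gists = []
    for i, page in enumerate(pages):
        gists.append(ask(
            "Shorten the page of text below. Keep the key information "
            f"and drop redundant or unimportant detail.\n\nPage {i}:\n{page}"
        ))
    return gists

def as_gist_memory(gists):
    """Tag each gist with its page number and concatenate into gist memory."""
    return "\n\n".join(f"<Page {i}>\n{g}" for i, g in enumerate(gists))
```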

Step 3: Interactive Lookup

The final component of ReadAgent is the interactive lookup process, which allows the model to access and use information from the original text when needed. When faced with a specific task or question, ReadAgent first examines the gist memory to get an overview of the document. It then decides which original pages it needs to review in more detail to answer the question or complete the task accurately. The model is prompted to select one or more pages to “look up” based on their relevance to the current task. ReadAgent can use either parallel lookup (ReadAgent-P), where multiple pages are selected at once, or sequential lookup (ReadAgent-S), where pages are selected one at a time with the opportunity to see previously expanded pages before making the next selection. After lookup, the selected original pages replace their corresponding gists in the working memory, providing a mix of detailed and summarized information. This approach allows ReadAgent to efficiently handle very long contexts by focusing on the most relevant sections while maintaining awareness of the overall document structure. Similar to the previous step, we can draw analogies in Interactive Lookup to how humans engage with text. We can answer questions about the general details of a text after reading it once, but if a question is very specific, we need to refer back to the original text to provide an accurate answer.
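A parallel lookup (ReadAgent-P) round then costs two further calls: one to select pages against the gist memory, and one to answer with those pages expanded. Continuing the same sketch, with illustrative prompts of our own:

```python
def answer(question, pages, gists):
    """ReadAgent-P: select pages in one shot, then answer the question."""
    reply = ask(
        "Here is a gist summary of a document, split into numbered pages:\n\n"
        f"{as_gist_memory(gists)}\n\nQuestion: {question}\n"
        "Which page(s) should be re-read in full to answer accurately? "
        "Reply with page numbers only, e.g. 0, 3."
    )
    chosen = {int(n) for n in re.findall(r"\d+", reply) if int(n) < len(pages)}
    # The selected full pages replace their gists in the working memory,
    # mixing verbatim detail with summaries of everything else.
    working_memory = "\n\n".join(
        f"<Page {i}>\n" + (pages[i] if i in chosen else gists[i])
        for i in range(len(pages))
    )
    return ask(
        f"Use the document below to answer the question.\n\n{working_memory}\n\n"
        f"Question: {question}"
    )
```

A sequential variant (ReadAgent-S) would instead loop the selection step, expanding one page per call and showing previously expanded pages before each new choice.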

The Importance of Fast Inference for ReadAgent

One design pattern that is evident throughout ReadAgent’s workflow is that it involves multiple iterative steps, each requiring one or more API calls to an LLM. For this reason, ReadAgent’s efficiency heavily relies on low-latency LLM inference across its workflow. The pagination and summarization stages can involve hundreds of API calls for lengthy texts, and the lookup phase requires multiple, sequential LLM queries. Without a fast inference solution, processing a lengthy document could take so long that it would render the application unusable.
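One mitigation worth noting: while pagination is inherently sequential, the per-page gist calls are independent of one another and can be issued concurrently. A sketch of that, building on the earlier helpers, with an illustrative worker count:

```python
from concurrent.futures import ThreadPoolExecutor

def gist_pages_concurrent(pages, workers=8):
    """Run the per-page gist calls in a thread pool; map preserves order."""
    def gist_one(indexed_page):
        i, page = indexed_page
        return ask(
            "Shorten the page of text below, keeping the key information."
            f"\n\nPage {i}:\n{page}"
        )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gist_one, enumerate(pages)))
```

Concurrency reduces wall-clock time but not the total token count, so per-call latency and throughput remain the dominant cost across the whole workflow.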

Moreover, the benefits of faster inference extend beyond mere speed improvements. The machine learning community has observed that models often perform better when generating more tokens, as seen in techniques like chain-of-thought reasoning and self-refinement strategies. By enabling more operations within the same time frame, low-latency inference creates opportunities to implement these advanced methods, which lead to better model performance.

Conclusion

The work done by researchers at Google DeepMind on ReadAgent and gist memory showcases how innovative engineering and workflows can enhance the capabilities of existing large language models. It’s not difficult to imagine the types of applications that would benefit from the methods used in building ReadAgent. AI agents assisting lawyers in analyzing large volumes of legal text or answering questions about scientific literature are just a few examples of where this method could be effectively applied. Another domain where gist memory could be valuable is customer service, where it can help efficiently navigate large knowledge bases to answer customer queries.

We’re excited to offer a fast inference layer that enables the integration of these new workflows into AI applications and agentic systems. Remember to check out our ReadAgent repository to explore the full codebase. Lastly, if you’re interested in using the Cerebras API to build agentic workflows, please visit our documentation portal to get started!

References

K.-H. Lee, X. Chen, H. Furuta, J. Canny, and I. Fischer, “A human-inspired reading agent with gist memory of very long contexts,” arXiv preprint arXiv:2402.09727, 2024.

by Rohan Deshpande