License: arXiv.org perpetual non-exclusive license
arXiv:2403.05821v1 [cs.LG] 09 Mar 2024

Optimizing LLM Queries in Relational Workloads

Shu Liu*, Asim Biswal*, Audrey Cheng*, Xiangxi Mo, Shiyi Cao,

Joseph E. Gonzalez, Ion Stoica, Matei Zaharia
UC Berkeley lshu@berkeley.edu
Abstract.

Analytical database providers (e.g., Redshift, Databricks, BigQuery) have rapidly added support for invoking Large Language Models (LLMs) through native user-defined functions (UDFs) to help users perform natural language tasks, such as classification, entity extraction, and translation, inside analytical workloads. For instance, an analyst might want to extract customer sentiments on millions of product reviews. However, LLM inference is highly expensive in both computational and economic terms: for example, an NVIDIA L4 GPU running Llama2-7B can only process 6 KB of text per second. In this paper, we explore how to optimize LLM inference for analytical workloads that invoke LLMs within relational queries. We show that relational queries present novel opportunities for accelerating LLM inference, including reordering rows to maximize key-value (KV) cache reuse within the LLM inference engine, reordering columns within a row to further increase cache reuse, and deduplicating redundant inference requests. We implement these optimizations in Apache Spark, with vLLM as the model serving backend and achieve up to 4.4× improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets. To the best of our knowledge, this is the first work to explicitly address the problem of optimizing LLM invocations within SQL queries.

PVLDB Reference Format:
PVLDB, 14(1): XXX-XXX, 2020.
doi:XX.XX/XXX.XX This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.
doi:XX.XX/XXX.XX


PVLDB Artifact Availability:

The source code, data, and/or other artifacts have been made available at https://github.com/lynnliu030/vldb24.

1. Introduction

Large Language Models (LLMs) are making it dramatically easier to analyze textual data. In fact, a number of analytical database vendors, including AWS Redshift (aws, [n.d.]), Databricks (dat, [n.d.]b), and Google BigQuery (goo, [n.d.]), have already added LLM invocation functions to their SQL APIs. As an example, consider the following SQL query:

SELECT user, request, support_response,
  LLM("Did {support_response} address {request}?", support_response, request) AS success
FROM customer_tickets
WHERE support_response <> NULL

where the LLM is invoked to analyze whether customer service requests are effectively addressed. Increasingly, analysts wish to leverage LLMs in such queries for tasks such as classification, entity extraction, summarization, and translation (dat, [n.d.]b). We will refer to SQL queries that invoke LLMs as LLM queries.

Unfortunately, applying LLMs in this manner to real-world datasets (which can contain millions of rows) has significant computational and economic costs. For example, classifying 15,000 rows of movie reviews from the Rotten Tomatoes Movies dataset (Pang and Lee, 2005) takes 30 minutes on an NVIDIA L4 GPU instance running a Llama 7B model (Touvron et al., 2023). On a similarly sized instance, an analytical database, such as DuckDB (Raasveldt and Mühleisen, 2019), can process more than 100GB of data per second in the TPC-DS benchmark (duc, [n.d.]). Processing the equivalent amount of data via the same LLM would take 96 days, more than 8 million times longer! Thus, minimizing the cost of LLM invocations is the critical objective for LLM queries. Later in this section, we demonstrate that for the movie review workload, our optimizations can reduce the runtime from 30 minutes to 8 minutes.

Given the staggering costs of LLM inference, there has been extensive research on how to optimize this process. At a high level, LLM generation occurs sequentially for each request due to the autoregressive nature of the model (new tokens generated depend on all previous tokens), so requests are batched to improve throughput. However, all computational states of each request in a batch must be stored in memory during inference. As a result, memory management is a critical factor for LLM inference performance. Specifically, LLM inference engines store intermediate states for past prompts and generations, or prefixes of these requests, in a key-value (KV) cache (Vaswani et al., 2023; Kwon et al., 2023). Reuse of prefixes (e.g., between requests that share the same prompt) in the cache has been shown to have an outsized impact on performance (Zheng et al., 2023; Kwon et al., 2023; Ye et al., 2024; Juravsky et al., 2024). Accordingly, existing inference systems aim to maximize the hit rate of the KV cache.

Unfortunately, current LLM inference systems are designed for online serving workloads—they process requests in arrival order (first-in, first-out, or FIFO) and make no assumptions about future requests. As a result, they miss out on opportunities to improve performance by leveraging the workload information present in relational analytics. In this work, we address the problem of optimizing inference for LLM queries. We introduce a range of techniques that reduce end-to-end query latency (including LLM request time).

To optimize for relational analytics queries using LLMs, we propose prefix sharing maximization (PSM) to dynamically reorder the columns and rows of input data to significantly improve the KV cache hit rate. Our key insight is that with oracular knowledge of all requests to be sent to the LLM, we can reorder the requests as well as the fields inside each request to increase the number of cache hits. For instance, two requests that share the same prefix (which may be non-consecutive under FIFO ordering) should be served together so that the latter one can observe a cache hit. Likewise, in requests that input multiple fields of data (e.g., a product name and review) into the LLM, the fields should be ordered to maximize the number of shared prefixes. In real datasets, there can be many shared prefixes across both columns and rows of data, so we can markedly increase the KV cache hit rate by changing the order and format of requests.

Finding the optimal order and format of requests is challenging because there are an exponential number of ways to order the columns and rows of data in a query. Consequently, we present a heuristic-based column reordering algorithm that reorders fields within each row processed based on estimated cache savings. Our algorithm scores column values based on their frequency and size to determine the order priority of the entire column. We also present a row sorting optimization that groups requests with shared prefixes together to increase cache reuse.

In addition to our prefix sharing maximization techniques, we also present two optimizations to further reduce the computational costs of LLMs in relational queries. First, we observe that many real-world workloads have duplicates in textual data that lead to redundant LLM invocations. With deduplication, we can minimize the number of LLM calls without affecting the accuracy of the overall query. Second, we enhance the functionality of existing database query optimizers to estimate LLM operator costs within query expressions. This optimization allows for the strategic reordering of operations by taking into account the significant expense associated with LLM operators, thereby optimizing performance.

We implement our techniques in Apache Spark (Armbrust et al., 2015) with vLLM (Kwon et al., 2023) as the model serving backend. Given the lack of standard workloads in this area, we build a diverse benchmark suite of LLM queries on multiple real-world datasets. We construct a wide range of query types, such as multi-LLM invocations within the same query and retrieval-augmented generation (RAG) (Lewis et al., 2021) queries, across a variety of recommendation and question-answering datasets, including Amazon Product Review (He and McAuley, 2016), the Rotten Tomatoes Movie Dataset (Pang and Lee, 2005), and the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). We find that our techniques provide 1.5–4.4× improvements in end-to-end query latency while preserving query semantics. In summary, our contributions are as follows:

  • We present our prefix sharing maximization approach which applies a column reordering algorithm and row sorting optimization to improve the KV cache hit rate for LLM inference in relational analytics.


  • We introduce deduplication and cost estimation techniques to further reduce the invocation costs of LLMs from SQL.


  • We present a set of LLM query benchmarks using real-world data to represent a range of retrieval and processing tasks.


  • We evaluate our optimizations using vLLM and Spark and observe up to a 4.4× decrease in end-to-end query latency.



2. Background and Motivation

LLMs are a powerful tool for programmatic analysis of text data and are being rapidly incorporated into major analytical DBMSes (goo, [n.d.]; aws, [n.d.]; dat, [n.d.]b). LLM inference has several unique characteristics that have significant implications for performance. In this section, we provide a brief overview of the inference process and the key components of the LLM architecture to provide context. We then present opportunities to improve performance under relational analytics.

LLM inference. Today’s LLMs are autoregressive Transformer models (Vaswani et al., 2023), which generate words as tokens, one at a time, based on a given prompt and the existing sequence of tokens that have already been outputted. A token is a concise representation of a chunk of characters, typically using Byte-Paired Encoding (Zouhar et al., 2023). On average, a token is approximately four English characters.

The inference process for LLMs occurs in two stages: (i) the prefill stage in which the model processes all the input prompt tokens at once and (ii) the decoding stage during which token generation occurs. Generation for each request proceeds sequentially (i.e., new tokens must be generated one by one) since the process depends on (the product of conditional probabilities of) all previous tokens. This process continues until the model outputs a termination token.

An LLM inference engine (e.g., vLLM (Kwon et al., 2023), TGI (Huggingface, 2023), TensorRT-LLM (NVIDIA, 2023b)) runs the transformer models and schedules the prefill and decode stages. Since each request must be processed sequentially, the LLM inference engine batches multiple requests together in a continuous fashion (Yu et al., 2022) to improve throughput. However, to process multiple requests in a batch, the memory usage of each request must be efficiently managed: during the inference process, the intermediate computed state for all tokens involved must be stored in memory. This token state is cached as key and value vectors in the key-value (KV) cache, and each token can take up to 800KB for a 13B model (Kwon et al., 2023). Accordingly, an average request involving 2000 tokens can take up to 1.6GB of space. Even with batching (for online workloads, batch sizes of up to 32 are used (Yu et al., 2022)), LLM inference is computationally expensive and is currently limited to processing about 2000 tokens/s on a single GPU. As such, LLM performance is currently the bottleneck in many analytical tasks.
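As a back-of-the-envelope check of these figures (taking the reported upper bound of 800KB of KV state per token for a 13B model):

2000 tokens × 800 KB/token = 1,600,000 KB ≈ 1.6 GB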

KV cache. A crucial factor for LLM serving throughput is memory management within the KV cache. To enable maximum cache utilization (i.e., hit rate), recent work proposes sharing tokens across requests. Since many user prompts share the same prefix, or input prompt, the KV cache can store one copy of the prefix in advance to reduce redundant computation during the prefill phase. For instance, if two requests share a prefix, the first request will have already performed some computation on the input tokens and loaded these results in the KV cache. The subsequent request can directly reuse these values for further inference without having to recompute the shared tokens. An example of prefix sharing across requests to enhance KV cache efficiency is shown in Figure 1.

Existing research on KV cache management focuses only on online inference in which the LLM assumes no knowledge about future requests (Kwon et al., 2023). In contrast, invoking an LLM through a SQL query provides information about the structure and data of the full set of requests before execution. We can leverage this information through a variety of techniques to significantly improve the performance of such queries.

Figure 1. Example of prefix sharing across requests in a KV cache. Green tokens (“Classify” and “:”) are universally shared across all three requests. This means that only one set of KV needs to be computed, reducing redundant computations. Purple tokens (“Hi”) represent partial sharing between Req 1 and 2, illustrating selective reuse. Black tokens are the ones that require computation.

2.1. Optimization Opportunities in Analytics

In this paper, we optimize LLM inference in the context of relational analytics queries. We describe new opportunities to improve performance by utilizing the structure and semantics of SQL workloads.

Improving KV cache hit rate. Existing LLM inference engines, which are designed for the online setting, make no assumptions about future requests and execute them as they arrive, in first-in, first-out (FIFO) order. As a result, they miss out on opportunities to improve cache hit rate by leveraging the workload information present in relational analytics. At the query optimizer level, we have full knowledge about the structure of each request (e.g., prompt and input data) and often the format / length of the outputted response. This information can be used to determine when prefixes should be loaded into and removed from the KV cache to maximize the cache hit rate. For instance, under FIFO execution, two non-consecutive requests that share a prefix may both lead to cache misses. Instead, if we ordered these requests together, we could ensure that the latter results in a cache hit. Since all prefixes are known in batch queries, we can group shared prefixes together to increase the KV cache hit rate. Overall, we want to maximize the token hit rate, or the ratio of tokens that can be served from the KV cache.

Our Approach: Data-Aware Caching and Request Scheduling. We leverage workload information to enhance the KV cache hit rate. Specifically, we introduce algorithms that reorder requests, and fields in each request, to maximize prefix sharing across requests. We reorder requests to the LLM at both the column level and row level to enable cache reuse and efficient memory usage.

Avoiding redundant computation. Naively invoking the LLM on textual data can lead to many redundant requests. Perhaps surprisingly, structured text data often contains duplicate entries. Naive execution of the queries can lead to redundant computation. To reduce overall latency, we should minimize the number of calls made to the LLM by identifying and filtering for these requests that will lead to the same response.

Our Approach: Eliminating Redundant Requests. We leverage workload information to reduce the number of LLM requests without affecting query accuracy. Specifically, we deduplicate exact input prompts to ensure we invoke the LLM only when necessary.

Accounting for LLM operator costs. LLM operators within SQL queries are often the most expensive parts of these requests. As standard DB query optimizers rely on accurately estimating the cost of the key operators to generate efficient query plans, they should account for LLM operator costs. For instance, a predicate pushdown should be applied before invoking the LLM.

Our Approach: Cost Estimation. We incorporate LLM operator costs into the SQL query optimizer to ensure the most expensive operations are invoked only when necessary. Specifically, we model LLM computational costs during the query planning process.

3. Request Formulation

In this section, we introduce our request scheduling algorithms designed to maximize prefix sharing in the KV cache. We leverage full workload information from batch queries to ensure all requests sharing the same prefixes are executed consecutively to optimize overall LLM inference efficiency. Specifically, we present column-level and row-level reordering algorithms that group queries based on the data they access.

3.1. Request Structure

We define the LLM function within the context of structured queries: this function accepts structured expressions as input, such as one or multiple columns {a, b, c} or {pr.*}, and processes them to extract and analyze data. The design of our UDF input structure allows for dynamic reordering of fields within these expressions to optimize for cache efficiency.

3.2. Column Reordering

Since cache hits occur only for the prefix of requests, determining the right order of columns in SQL queries can have a significant impact on performance. For example, consider the following query:

SELECT
  LLM("Give an overview based on this: ", pr.*)
FROM
  (SELECT
    r.reviewText,
    r.reviewerName,
    p.description,
    p.title
  FROM reviews r
  JOIN product p ON r.asin = p.asin
  ) AS pr

This query passes in multiple columns, such as reviewText, reviewerName, and description, for each row of data sent to the LLM for summarization. Using the default column order (e.g., reviewText happens to be first in the schema) can be inefficient because there are more distinct values in this column, leading to more unique prefixes. Instead, we can place the description column first to increase the number of shared prefixes: since many reviews are associated with the same product, these prefixes can be cached and reused.

The objective of column-level input reordering is to order fields within requests such that we maximize the token hit rate of the KV cache. This task is complicated by the diversity in shared prefix values and their variable lengths. For instance, longer shared prefixes might lead to more token hits even if they arise less frequently than shorter prefixes. Thus, a reordering strategy needs to balance prefix frequency and size to maximize cache reuse.

3.2.1. Oracle Algorithm

Finding the best order of columns is challenging because there is an exponential number of possible permutations for a given number of columns. For instance, a dataset with N rows and M columns would produce N × M! possible orders. Enumerating all of them would be prohibitively expensive in practice, so we need a heuristic that enables us to efficiently find an effective ordering.

Algorithm 1 Column-Level Reordering
1:  Input: Table T
2:  Output: Reordered Columns List L
3:  Initialize L ← []
4:  Precompute ASL ← Map()    ▷ Pre-computed average string length for each column
5:  function CalculateScores(T)
6:      Initialize ColumnScores ← Map()
7:      for each column m in T do
8:          C_m ← Cardinality(T, m)
9:          ColumnScores[m] ← ASL[m] × (TableLength / C_m)
10:     end for
11:     return ColumnScores
12: end function
13: T_current ← T
14: ColumnScores ← CalculateScores(T_current)
15: while columns remain for selection in T_current do
16:     c_select ← argmax_c(ColumnScores)
17:     L.append(c_select)
18:     Remove c_select from T_current
19:     ColumnScores ← CalculateScores(T_current)    ▷ Recompute scores for remaining columns
20: end while
21: return L

3.2.2. Reordering Algorithm

Our reordering algorithm employs column statistics—specifically, the cardinality of column values—to quickly search for an effective ordering. Databases typically maintain statistics on stored data, such as the number of unique entries (i.e., cardinality) and the distribution of values for each column, which we leverage to infer the potential for prefix sharing. We simplify the problem by establishing a fixed column ordering for all rows. This allows us to avoid having to consider an exponential number of permutations. In practice, many datasets have columns with high cardinality (i.e., few shared prefixes), so their order does not affect the cache hit rate.

Concretely, our algorithm begins by precomputing two pieces of metadata for each column: the average string length (ASL) and the cardinality (C) (or the number of unique values). These metrics enable us to estimate the likelihood of each column contributing to the token hit rate. We obtain the average length of the values in a given column by running a SQL query (e.g., AVG(CHAR_LENGTH(col))) on the data. We retrieve the cardinality from existing database statistics present in most systems.

Next, the algorithm calculates a score for each column. We draw inspiration from standard caching algorithms, such as GDSF (Cherkasova and Ciardo, 2001), which give equal weight to size and frequency factors. Our algorithm uses the following formula:

Score(col) = ASL_col × (Total Number of Rows / C_col)

Under this method, higher scores correspond to columns with longer and more frequent shared data values. Our algorithm then places columns in order of descending scores.
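To make the scoring and greedy selection concrete, the following is a minimal Python sketch of Algorithm 1; the use of pandas and the helper names are illustrative assumptions, not the system's implementation:

    import pandas as pd

    def column_scores(df: pd.DataFrame) -> dict:
        """Score each column as ASL x (number of rows / cardinality)."""
        n_rows = len(df)
        scores = {}
        for col in df.columns:
            values = df[col].astype(str)
            asl = values.str.len().mean()      # average string length (ASL)
            cardinality = values.nunique()     # number of distinct values (C)
            scores[col] = asl * (n_rows / cardinality)
        return scores

    def reorder_columns(df: pd.DataFrame) -> list:
        """Greedily pick the highest-scoring remaining column, as in Algorithm 1."""
        remaining = list(df.columns)
        ordered = []
        while remaining:
            scores = column_scores(df[remaining])
            best = max(scores, key=scores.get)
            ordered.append(best)
            remaining.remove(best)
        return ordered

Because a column's score does not depend on which other columns remain, the greedy loop reduces to sorting the columns once by descending score, which matches the description above.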

3.2.3. Limitations

While our column-level input reordering algorithm demonstrates the potential for enhancing cache efficiency, it has limitations. The primary assumption under our approach is the fixed ordering of columns across all data rows. This assumption simplifies the computational complexity but may not always hold in real-world scenarios. In practice, different rows could benefit from different orderings based on the specific data they contain. Future work could explore smarter reordering strategies, as we discuss in Section 8.2.2.

3.3. Row Sorting

Once we determine a column order, we need to group the rows of data such that we maximize the number of shared prefixes across requests. To achieve this, we concatenate all column values into a string and sort these strings lexicographically (over different rows) to determine the row order.

T | Without Reordering            | With Reordering
  | Keys accessed | Cache state   | Keys accessed | Cache state
1 | 1,2,3 (miss)  | -             | 1,2,3 (miss)  | -
2 | -             | 1,2,3         | -             | 1,2,3
3 | 4,5,6 (miss)  | 1,2,3         | 1,2,3         | 1,2,3
4 | -             | 4,5,6         | -             | 1,2,3
5 | 1,2,3 (miss)  | 4,5,6         | 4,5,6 (miss)  | 1,2,3
6 | -             | 1,2,3         | -             | 4,5,6
7 | 4,5,6 (miss)  | 1,2,3         | 4,5,6         | 4,5,6
Table 1. Comparison of KV cache state with different prefix input orders. Entries marked (miss) are cache misses: without reordering, there are 12 cache misses; with reordering, there are only 6 cold misses.

To demonstrate how this technique improves cache usage, consider an example in which we have a batch size of three and six distinct prefixes that each appear three times (as illustrated in Table 1). If the requests arrive in the following order (with batches shown in brackets): [1,2,3], [4,5,6], [1,2,3], [4,5,6], the KV cache would evict and recompute all six prefixes under a caching algorithm, such as FIFO, leading to inefficient cache use. Instead, if we sort the rows of input to group common prefixes together (e.g., [1,2,3], [1,2,3], [4,5,6], [4,5,6]), we ensure that each prefix is loaded into the cache only once and reused across any requests it is involved in. Upon eviction, we know that this prefix will never need to be loaded into the cache again. This approach can greatly reduce cache eviction and recomputation, leading to faster LLM processing times.

Conveniently, the structured nature of SQL queries enables us to efficiently sort requests—we can identify and easily group shared data based on the relational schema by using existing database operations (e.g., GROUP BY and ORDER BY).
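A minimal sketch of this step, assuming as above a pandas DataFrame whose prompt columns have already been placed in the chosen order (the helper name is illustrative):

    import pandas as pd

    def sort_rows_for_prefix_sharing(df: pd.DataFrame, ordered_cols: list) -> pd.DataFrame:
        """Sort rows lexicographically on the concatenation of the (already reordered)
        prompt columns so that rows sharing long prefixes become adjacent."""
        concat = df[ordered_cols].astype(str).apply("".join, axis=1)
        return df.loc[concat.sort_values().index]

Inside the database itself, the same effect can be obtained with existing operators, e.g., an ORDER BY over the concatenated prompt columns.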

4. Deduplication

Since LLM invocations are the bottleneck in most LLM queries, we can further reduce runtime by minimizing the number of inference calls we need to make. Specifically, we can deduplicate across requests that have the same input prompts. Perhaps surprisingly, large textual datasets often contain many duplicates. For instance, in the query shown in Section 3.2, many reviews might mention similar products. Processing every review through an LLM or vector DB would lead to redundant computations. Instead, we can identify these duplicates before passing them to the LLM or vector DB to significantly reduce the computational load. In modern recommendation datasets, there are often high degrees of duplication in the table columns. For example, the Rotten Tomatoes Movie dataset (Pang and Lee, 2005), commonly used in recommendation models, has a review_type column corresponding to the category of the review (e.g., "Fresh" or "Rotten"). As a result, the selectivity ratio, or the ratio of distinct values compared to total values, is only 0.008, since there are only two distinct categories.

4.1. Exact Deduplication

For the purposes of this work, we choose to apply deduplication without affecting query results. In RDBMSes, this is commonly achieved through the use of distinct filters (e.g., DISTINCT in SQL) on the query inputs. This approach is straightforward and effective for reducing the number of LLM invocations by filtering out duplicate requests before they reach the model.
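A minimal PySpark-style sketch of this idea, where df, input_cols, and llm_udf are illustrative placeholders rather than our actual implementation:

    from pyspark.sql import functions as F

    def run_llm_with_dedup(df, input_cols, llm_udf):
        """Call the LLM UDF once per distinct input and join the answers back
        to the original rows, so duplicates never reach the model."""
        distinct_inputs = df.select(*input_cols).distinct()
        answered = distinct_inputs.withColumn(
            "llm_output", llm_udf(*[F.col(c) for c in input_cols]))
        return df.join(answered, on=input_cols, how="left")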

However, applying deduplication across many rows of data can present challenges, especially regarding memory and processing time. Furthermore, in scenarios with a vast amount of distinct data, identifying duplicates is not that helpful. We show the ablation study on deduplication with columns of different selectivity ratios in Figure 8 in Section 7.

Beyond exact matches, semantic deduplication, which identifies records that are semantically similar, represents a promising area for future research. We discuss it further in Section 8.4.

5. SQL Optimizations

Traditionally, relational databases employ cost-based query optimizers to enhance the performance of requests. These optimizers analyze query structures and operators to minimize data accesses and computational overheads. We extend these optimizers to account for LLM operator costs so that they can find query plans that minimize LLM invocations.

5.1. LLM Operator Costs

We augment cost estimation for the external functions that are used to invoke LLMs. Currently, LLMs are most commonly accessed via UDFs. For instance, Databricks offers an ai_query() function that allows users to call a model endpoint (dat, [n.d.]a). Similarly, Amazon Redshift enables users to make LLM calls via UDFs (aws, [n.d.]). We estimate LLM costs based on the token length of the input and the estimated decode length of requests. In most cases, LLM invocations are significantly more expensive than other query operators, so these calls are pulled up in the query plan.

As an example of how LLM operator costs affect query planning, consider the query in Figure 2. Without accounting for LLM inference costs, a naive execution might call the UDF on every row in the table before applying the filter c.Timestamp > '2023-10-01'. This would lead to unnecessary LLM invocations and could significantly increase overall execution time. A better execution strategy would be to apply the filter before invoking the LLM UDF, thereby reducing the dataset size that the LLM processes. This optimization would also benefit other query operators, such as joins. In general, the optimizer will pull up LLM calls as much as possible since these tend to be the most expensive parts of the query.
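To illustrate the kind of estimate involved, the following is a rough sketch of such a cost model; the linear token-based formula and the 2,000 tokens/s constant (taken from the throughput figure in Section 2) are simplifying assumptions, not our exact cost function:

    def estimate_llm_operator_cost(num_rows: int,
                                   avg_prompt_tokens: float,
                                   avg_output_tokens: float,
                                   tokens_per_second: float = 2000.0) -> float:
        """Rough latency estimate (seconds) for applying an LLM UDF to num_rows inputs,
        assuming cost scales with the total number of prompt and generated tokens."""
        total_tokens = num_rows * (avg_prompt_tokens + avg_output_tokens)
        return total_tokens / tokens_per_second

Given an estimate of this form, pushing a selective non-LLM predicate below the LLM operator directly shrinks num_rows and therefore the dominant term in the plan cost.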

-- Selection, Projection
SELECT u.UserID
FROM Users u
JOIN Comments c ON u.CommentID = c.CommentID
WHERE
  LLM("{Few_shot_examples}, sentiment on the {Text} is", c.Text) = "Negative"
  AND c.Timestamp > '2023-10-01'
Figure 2. SQL Example with one LLM and one non-LLM filter. The order in which these filters are applied affects the end-to-end query execution time.

6. Implementation

In this section, we describe the implementation of our optimizations, which consists of 4K lines of Python code integrated with Spark (Armbrust et al., 2015).

6.1. LLM UDFs and Optimizations

To invoke the LLM in SQL queries, we implement a PySpark (pys, [n.d.]) UDF that makes external calls to an LLM endpoint and a vector database. The UDF takes in a prompt for the LLM as well as data values (per column) as arguments to pass to the model. LLM output post-processing is also handled in the UDF.

Prompt construction. Our UDF converts column values to prompt strings that can be parsed by an LLM. The first part of every prompt is the system prompt, or the instructions for the LLM. Next, data values are passed in per column. If there are multiple columns, the prompts are constructed in the order specified by our column reordering algorithm. All rows of data are also ordered with our row-sorting algorithm.

Output post-processing. LLM output can be inconsistent and contain rambling answers (even if the model is explicitly told not to in its instruction prompt). Accordingly, we implement our UDF to optionally take in output pattern(s) as arguments, which the UDF searches for in the LLM output. For example, in a sentiment analysis query, we may only want our outputs to be "Positive", "Negative", or "Neutral", so we specify these output targets to the UDF.
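A condensed sketch of such a UDF is shown below; the prompt layout, endpoint call, and parameter names are illustrative assumptions rather than the exact implementation:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def make_llm_udf(system_prompt, ordered_cols, call_endpoint, output_targets=None):
        """Build a PySpark UDF that forms the prompt in the chosen column order and
        optionally matches the raw LLM output against expected answer patterns."""
        def run(*values):
            # System prompt first, then data values in the reordered column order.
            prompt = system_prompt + "\n" + "\n".join(
                f"{col}: {val}" for col, val in zip(ordered_cols, values))
            raw = call_endpoint(prompt)        # external call to the LLM endpoint
            if output_targets:
                for target in output_targets:  # e.g., ["Positive", "Negative", "Neutral"]
                    if target.lower() in raw.lower():
                        return target
            return raw
        return udf(run, StringType())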

6.2. Prefix Caching

The current vLLM implementation assumes that all prefixes of a batch of requests can fit in the KV cache, but this does not hold when there is high diversity in prefix sizes, which is typical for batch analytical workloads. We address this limitation by implementing a novel eviction strategy for the KV cache within vLLM. Our approach is reactive: when a new request is scheduled and there is insufficient GPU memory to accommodate it, the system automatically evicts prefixes in first-in, first-out (FIFO) order. If a previously evicted prefix is needed for a future request, it is recomputed and readmitted into the cache. This strategy ensures that the system dynamically adapts to changing workload patterns while managing memory constraints effectively.
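A simplified, standalone sketch of this reactive policy (independent of vLLM's internal block manager; the data structures and sizes are illustrative):

    from collections import OrderedDict

    class FIFOPrefixCache:
        """Reactive FIFO eviction: admit a prefix's KV blocks, evicting the oldest
        cached prefixes when GPU memory would otherwise be exceeded."""
        def __init__(self, capacity_blocks: int):
            self.capacity = capacity_blocks
            self.used = 0
            self.entries = OrderedDict()       # prefix_id -> number of KV blocks held

        def admit(self, prefix_id: str, num_blocks: int) -> None:
            if prefix_id in self.entries:
                return                          # prefix already cached; reuse it
            while self.used + num_blocks > self.capacity and self.entries:
                _, freed = self.entries.popitem(last=False)   # evict oldest (FIFO)
                self.used -= freed
            # A previously evicted prefix is simply recomputed and re-admitted here.
            self.entries[prefix_id] = num_blocks
            self.used += num_blocks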

7. Evaluation

In this section, we evaluate the effectiveness of our optimizations within a constructed benchmark suite of queries. We illustrate that:

  (1) LLM output generation time dominates query latency, and optimizations that reduce the number of inputs the LLM receives significantly improve performance.

  (2) Changing the order of inputs we provide to the LLM, both within a single prompt and within a batch of prompts, can noticeably reduce query latency.

7.1. Evaluation Benchmark

Given the lack of standard benchmarks for LLM queries, we construct a benchmark suite to represent real-world data retrieval and processing tasks. We create a range of query types over datasets from various sources to assess the impact of integrating LLMs into relational analytics.

7.1.1. Datasets

Field        | Type            | Example
asin         | identifier      | 0741304058
reviewText   | string          | "A favorite cd now…"
verified     | boolean         | True
overall      | double          | 5.0
summary      | string          | "Five Stars"
Format       | string          | "MP3 Music"
description  | list of strings | ["Great CD for babies…", "", ""]
Table 2. Amazon Products Schema
Field                | Type       | Example
rotten-tomatoes-link | identifier | m/10002114-dark-water
review-type          | string     | "Fresh"
review-content       | string     | "Fun, brisk and imaginative"
top-critic           | boolean    | True
movie-info           | string     | "In this moody…"
Table 3. Rotten Tomatoes Movies Schema

We build our benchmark suite on a variety of commonly used datasets for recommendation and natural language processing (NLP) models.

  (1) Amazon Product Reviews (He and McAuley, 2016) is a recommendation dataset that contains product reviews and metadata from Amazon. We select the 5-core subset of data from the digital music category, which consists of 64,706 reviews. We show several example fields of interest from this dataset in Table 2. We use 15,014 rows of this dataset in our queries.

  (2) Rotten Tomatoes Movie Reviews (Pang and Lee, 2005) is a recommendation dataset that stores critic review data along with movie metadata from the popular movie review website Rotten Tomatoes. This dataset consists of 1,130,018 reviews. We show part of the schema with fields of interest in Table 3. We use 15,018 rows of this dataset in our queries.

  (3) The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) is a reading comprehension dataset with more than 100,000 rows and consists of questions posed by crowdworkers on a set of Wikipedia articles. The context to every question is a segment of text, or span, from the corresponding reading passage.

7.1.2. LLM Queries

We incorporate a broad range of query types and use cases in our benchmark suite:

  (1) Q1: LLM projection. This query type makes calls to an LLM within a SELECT statement to process information from specified database column(s). It reflects common tasks in data analytics in which the LLM is used for summarization and interpretation based on certain data attributes.

  (2) Q2: LLM filter. This query type leverages an LLM for filtering data within a WHERE clause. The LLM processes and analyzes information to meet some specified criteria, such as identifying positive reviews. This query type illustrates typical use cases in sentiment analysis and content filtering, which are important for application tasks, such as customer feedback analysis and content moderation.

  (3) Q3: Multi-LLM invocation. This query type involves multiple LLM calls in different parts of the query and addresses scenarios in which several layers of data processing or analysis are required. It represents advanced analytical tasks, such as combining different data insights or performing sequential data transformations.

  (4) Q4: LLM aggregation. This query type incorporates LLM outputs into further query processing. For example, one such query could use LLMs to assign sentiment scores to individual reviews and then aggregate these scores to calculate an average sentiment for overall customer feedback. This query type is essential for tasks that need to extract insights from complex textual data.

  (5) Q5: Retrieval-augmented generation (RAG). This query type leverages external knowledge bases for enhanced LLM processing, enriching LLM queries with a broader context. It simulates use cases where queries need to pull in relevant information from external sources, such as document databases or knowledge graphs, to provide comprehensive answers.

We present examples for each query type below. We run Q1-Q4 on the Amazon Product Reviews and Rotten Tomatoes Movie Reviews datasets, and Q5 on SQuAD. We evaluate Q1, Q2, and Q4 on around 15,000 rows of the datasets, and Q3 with around 10,000 rows of the datasets since this query takes significantly longer to run. For Q5, we evaluate 5,000 questions, where each question retrieves K=3 contexts to augment its answer.

  • Q1: LLM Projection

    SELECT LLM("Recommend movies for the user based on {movie information} and {user review}", m.movie_info, r.review_content)
    FROM reviews r
    JOIN movies m ON r.rotten_tomatoes_link == m.rotten_tomatoes_link

  • Q2: LLM Filter

    SELECT m.movie_title
    FROM Movies m
    JOIN Reviews r ON r.rotten_tomatoes_link = m.rotten_tomatoes_link
    WHERE
      LLM("Analyze whether this movie would be suitable for kids based on {movie information} and {user review}", m.movie_info, r.review_content) == "Yes"
      AND r.review_type == "Fresh"

  • Q3: Multi-LLM Invocation

    SELECT
      LLM("Recommend movies for the user based on {movie information} and {user review}", m.movie_info, r.review_content) AS recommendations
    FROM Movies m
    JOIN Reviews r ON r.rotten_tomatoes_link = m.rotten_tomatoes_link
    WHERE
      LLM("Analyze whether this movie would be suitable for kids based on {movie information} and {user review}", m.movie_info, r.review_content) == "Yes"
      AND r.review_type == "Fresh"

  • Q4: LLM Average

    SELECT
      AVG(
        LLM("Rate a satisfaction score between 0 (bad) and 5 (good) based on {review} and {info}: ",
            r.review_content, m.movie_info)
      ) as AverageScore
    FROM reviews r
    JOIN movies m ON r.rotten_tomatoes_link = m.rotten_tomatoes_link
    GROUP BY m.movie_title

  • Q5: RAG

    SELECT
      LLM("Given the following {context}, answer this {question}", VectorDB.similarity_search(s.question), s.question)
    FROM squad s
    WHERE s.is_impossible == False

7.1.3. Evaluation Metrics

Our key evaluation metric is end-to-end query execution time. Additionally, we analyze the token hit rate (THR), which represents the ratio of prefix tokens that are served from the KV cache. This metric corresponds directly with the query latency speed-up.
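Concretely, THR is the number of prompt tokens served from the KV cache divided by the total number of tokens across all input prompts; a THR of 0.5 means that half of the prompt tokens did not need to be recomputed.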

7.2. Experimental Setup

We run experiments on a g2-standard-48 GCP instance (48 vCPUs, 192 GB RAM) with an NVIDIA L4 GPU accelerator hosted in the us-central1 region. For the LLM endpoint, we use Meta's LLaMA-2 model with 7B parameters (Touvron et al., 2023). This model is lightweight and inexpensive to host locally, making it well-suited to analytical tasks. We use vLLM (Kwon et al., 2023) as our model serving engine. As vLLM does not currently support prefix eviction and thus cannot handle workloads whose prefixes do not all fit in memory, we implemented a FIFO eviction policy for prefixes within vLLM. For RAG queries, we use the BGE embedding model (BAAI/bge-large-en-v1.5) (Xiao et al., 2023) to embed the context, and the Facebook Similarity Search Library (FAISS) (Johnson et al., 2019) to store these context embeddings in an index.
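The following is a minimal sketch of this retrieval setup, assuming the sentence-transformers wrapper for the BGE model and the faiss-cpu package; the exact embedding and indexing code used in our system may differ in its details.

    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

    # Offline: embed every context passage once and add it to a flat inner-product index.
    contexts = ["...context passages from SQuAD..."]  # placeholder corpus
    ctx_vecs = embedder.encode(contexts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(ctx_vecs.shape[1])
    index.add(np.asarray(ctx_vecs, dtype="float32"))

    # Online: embed the question and fetch the top-K (K = 3) contexts for the LLM prompt.
    def retrieve(question, k=3):
        q_vec = embedder.encode([question], normalize_embeddings=True)
        _, ids = index.search(np.asarray(q_vec, dtype="float32"), k)
        return [contexts[i] for i in ids[0]]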

Figure 3. End-to-end results on (a) the Rotten Tomatoes Movies dataset and (b) the Amazon Products dataset. Our optimizations (Cache (PSM + Dedup + SQL Opt)) achieve a 1.76–2.68× speed-up on the Movie dataset and a 1.64–3.18× speed-up on the Product dataset over Cache with FIFO ordering (Cache (FIFO)).
Figure 4. End-to-end result on the SQuAD dataset: our optimizations (Cache (PSM)) achieve a 1.5× speed-up over Cache with FIFO ordering (Cache (FIFO)).

7.3. End-to-End Benchmark Results

Overview. Figures 3 and 4 show the end-to-end latency results of our optimization techniques on the full benchmark suite. As baselines, we show the results of not using the KV cache for prefixes (No Cache) and of caching without any reordering (Cache (FIFO)). We also measure the impact of caching with our column reordering and row reordering, which we denote as Caching with Prefix Sharing Maximization, or Cache (PSM). We measure Cache (PSM) with and without deduplication and SQL optimization. The results show that our approach achieves up to a 4.4× speedup over the baselines. We discuss each query type's evaluation in detail below.

Q1: LLM projection. This query type applies the LLM to selected data for a given task. For the Movie dataset, we use the LLM to recommend movies to a user based on their review of a given movie. For the Product dataset, we use the LLM to analyze whether the product quality inferred from a user’s review matches the quality advertised in the product description.

We achieve up to a 3.6× speed-up on the Movie dataset and a 2.3× speed-up on the Product dataset for projection queries, compared to our no KV cache baseline. The significant speed-ups result from the sharing of large prefixes. Q1 has the longest system prompt out of all our query types: the total length of this prompt is 172 tokens for the Movie dataset and 141 tokens for the Product dataset. As a result, we observe significant savings by avoiding recomputation on these longer prefixes. The No Cache baseline constructs a prompt for each row in the table and thus sends as many prompts to the LLM as there are rows in the table. No computation is saved by the model itself, and as a result this method incurs the highest query runtime for each of our queries.

We analyze the impact of our optimizations in detail on the Movie dataset. Cache (FIFO) provides a 1.7× speedup over No Cache since we can now reuse computed tokens for the instruction prompt. Cache (PSM) enables the movie_info column to be ordered first, and we sort the requests to achieve a further speed-up of 1.7×. Our other techniques (deduplication and SQL optimizations) have minimal impact on this query type. For the former, the review_content column contains few duplicates (~100 rows), so the latency speed-up is marginal. The latter does not apply to Q1 because this query does not contain a filter clause.

Our optimizations achieve similar improvements on the Product dataset. Cache (FIFO) improves query latency by only 1.1× over the No Cache baseline. This is because the suffix (i.e., the differing tokens) dominates in the input prompts. Table 4 shows the average token lengths of the reviewText and description input columns to be 381.54 and 282.56 tokens, respectively. Since the instruction prompt is only 141 tokens long, most of each user prompt must be computed even if the instruction prompt is cached. Cache (PSM) achieves much higher speed-ups by ordering the description column first. Since this column contains many longer shared prefixes, we benefit much more from cached prefixes. Consequently, we achieve a 1.7× speedup over the No Cache baseline and a 1.5× speedup over default caching. Similar to the Movie dataset, the deduplication and SQL filter optimizations have less impact on this query type: there is no filter clause, and the reviewText column contains many unique values. Nonetheless, roughly 2,000 rows are deduplicated, leading to a further 1.4× speedup over our caching techniques alone.

Q2: LLM filter. This query type applies the LLM as a filtering tool. In particular, for the Movie dataset, we filter rows based on a standard SQL operator on review_type (with condition "Fresh") and an LLM operator (with condition "Negative"). For the Product dataset, we use the verified column (with condition "True") to filter, and the same LLM operator to filter for "Negative" sentiment in a review.

We observe up to a 4.4× speed-up over no caching on both the Movie and Product datasets. On the Movie dataset, Cache (FIFO) provides up to a 1.7× speed-up over No Cache, since the former saves computation by storing the instruction prompt. Cache (PSM) provides a further speed-up of 1.5×. Since we cache the movie_info column alongside the instruction prompt and sort our rows based on this column (which has many duplicates), the requests benefit from prefix sharing. For deduplication, the review_content column included in this query has few duplicates, so this technique has minimal impact. On the other hand, our SQL optimizations significantly benefit performance (a 1.8× improvement over our caching techniques alone). Since Q2 contains both a non-LLM and an LLM filter, the order of execution between these two clauses impacts query latency. Pushing the non-LLM filter down first results in only 10,461 rows being passed to the LLM after the first filter (review_type == "Fresh") out of the total 15,008 rows in the table.

For the Product dataset, Cache (FIFO) provides a 1.3× speedup over the No Cache baseline. For Cache (PSM), we order description as the first column and cache it alongside the instruction prompt to increase the cache hit rate. For deduplication, the reviewText column has few duplicates, so this optimization has limited impact. Our SQL optimization enables us to push down the non-LLM filter so that only 8,874 rows out of the 15,000 are passed to the LLM after the first filter (verified == True). As a result, we achieve a 2.2× improvement over only our caching techniques.

Q3: Multi-LLM invocation. We achieve a 3.0× speed-up on the Movie dataset and a 4.1× speed-up on the Product dataset for multi-invocation queries over the No Cache baseline. In this query, we combine Q1 and Q2 for each dataset: we first apply the LLM filter before invoking the LLM again to recommend products and movies in the SELECT statement.

On the Movie dataset, Cache (FIFO) provides a 1.7× improvement over the No Cache baseline. Cache (PSM) provides a 1.4× improvement over default caching. Our SQL optimization has the biggest impact on latency (a 1.3× speed-up) for this query type: since the non-LLM filter (review_type == "Fresh") has a selectivity ratio of 0.7, we significantly reduce the number of LLM invocations. For the Product dataset, Cache (FIFO) and Cache (PSM) provide the same speed-ups as on the Movie dataset. Our SQL optimizations result in a 2.2× speed-up over our caching techniques.

Q4: LLM aggregation. We achieve a 3.4× speed-up on the Movie dataset and a 2.2× speed-up on the Product dataset over the No Cache baseline on aggregation queries using our optimizations. We use the AVG operator to aggregate the average sentiment score on the reviews column, with the description column provided as context. For the Movie dataset, we group by movie_title and average over the LLM sentiment score output. For the Product dataset, we group by asin and average over the LLM sentiment score output. The results of this query type are similar to those of Q1, as the same columns are passed into the LLM with an instruction prompt of similar length. Specifically, the length of the instruction prompt is 166 tokens for the Movie dataset and 112 tokens for the Product dataset.

On the Movie dataset, Cache (FIFO) provides a 1.9× speed-up over the No Cache baseline. Cache (PSM) generates an additional 1.8× speed-up over default caching since the movie_info columns contain many shared values. Like Q1, there is no LLM-filter clause and few duplicates in the review_content column. As a result, the query latency with all optimizations is nearly identical to that with just the caching techniques.

For the Product dataset, Cache (FIFO) leads to a 1.4× speed-up over the No Cache baseline, and Cache (PSM) brings a 1.5× speed-up over default caching. Similar to Q1, the description column is cached with the instruction prompt. There are marginal deduplication benefits (a 1.1× improvement over our caching techniques), with roughly 2,000 rows being deduplicated.

Q5: RAG. We achieve a 1.5× speed-up on the SQuAD dataset. In this experiment, we embed all the context into a FAISS index and retrieve 5,000 questions for the question-answering task. For each question, we perform a K-nearest neighbor search on the vector index to fetch the top K relevant contexts, where we choose K = 3.

The contexts are embedded offline and stored in the vector index. At runtime, each question must be embedded to search the FAISS index; this is implemented in a batch-parallel fashion, so it takes up only 0.4% of the end-to-end time. We perform row reordering both within the contexts fetched for each question and across multiple questions, grouping similar contexts together. We achieve a 1.4× speed-up over caching with no reordering and a 1.5× speed-up over no caching. Unsurprisingly, we find that LLM invocation dominates the overall query latency, as opposed to traditional SQL operators (e.g., filters, joins, etc.). Specifically, SQL operator time makes up just 0.35% of the No Cache runtime and 1.4% of the runtime with all optimizations applied.

Token Hit Rate. We also measure the token hit rate (%) for Cache (FIFO) and Cache (PSM) for the query types in Figure 3. This metric is calculated as the ratio of tokens that can be served from the KV cache over all tokens in the input prompt. It serves as an indicator of KV cache effectiveness and is directly correlated with latency performance. On the Movie dataset, Cache (PSM) achieves a 21.6–38.0% token hit rate improvement over Cache (FIFO). On the Product dataset, Cache (PSM) achieves a 17.8–26.0% token hit rate improvement over Cache (FIFO).

Table 4. Column statistics for the Product table.
    Column Name       ASL       Cardinality    Score
    "description"     282.56    144            29460.80
    "reviewText"      381.54    12932          442.97
    "Format"          8.93      16             8379.69

Table 5. Column statistics for the Movie table.
    Column Name        ASL       Cardinality    Score
    "movie_info"       407.27    68             89946.78
    "review_content"   131.50    14977          131.86
    "review_type"      5.3       2              39797.7
Figure 5. Cache Hit Rate Ablation, on (a) the Movie dataset and (b) the Product dataset. We illustrate the cache hit rate improvements achieved by Cache (PSM) compared to Cache (FIFO), showing up to a 26% increase on the Product dataset and up to a 38% increase on the Movie dataset.
Figure 6. Column Ordering Ablation. Best Column Order is the order our system selects automatically; Default Column Order is the original order of the table. Execution with the best column order achieves a 1.8× and a 1.4× speedup in end-to-end query latency on the Movie and Product datasets, respectively.

7.4. Impact of Column Reordering

To measure the effect of column reordering, we evaluate how overall query latency changes under varying column orders on an LLM projection query similar to Q1 but with an additional column added to the LLM context. For this set of experiments, three columns are used to provide input data to the LLM invocations from both the Movie and Product datasets. We cache the first column of each request if there is sufficient sharing and cache only the instruction prompt otherwise. Figure 6 shows the query latency results for the default column order and the best column order (which we select automatically).

For the Movie dataset, we use the (1) movie_info, (2) review_type, and (3) review_content columns in the query. The default column order in execution is 3, 2, 1, whereas the best column order is 1, 2, 3. The column metadata is shown in Table 5. Unsurprisingly, caching the movie_info column, which is repeated across different reviews of the same movie, produces the largest speed-up: a 2.0× improvement over the worst ordering (review_content first). While the review_type column has many shared values (there are only two unique values across the entire dataset), each field is only one token long, since the value is either "Fresh" or "Rotten". As a result, the cached prefix is too short to yield prefix caching benefits. In contrast, the movie_info column has the longest average token length, resulting in large prefixes that improve performance when they are shared across requests.

For the Product dataset, we choose the (1) description, (2) format, and (3) reviewText columns. The default column order in execution is 3, 2, 1, whereas the best column order is 1, 2, 3. Column metadata is shown in Table 4. As expected, caching the description column, which is repeated across reviews of the same product, produces the largest speed-up: a 1.5× improvement over the worst ordering (reviewText first). While the format column has many shared values across the dataset (with only 17 unique values in total), its values are too short, containing entries such as "Audio CD" and "MP3 Music" that are two to three tokens in length. As a result, we do not see as many prefix caching benefits. The improvement on this dataset is smaller than on the Movie dataset for two main reasons: (1) the description column in the Product dataset has fewer duplicates, and (2) the reviewText column has the longest average length, so its suffixes (i.e., distinct data after shared prefixes) represent a larger portion of the input data.
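As a rough illustration of how a column-ordering heuristic can combine these statistics, the sketch below ranks columns by average token length times average duplication factor (ASL × rows / cardinality). This exact formula is an assumption on our part; it is shown only because it reproduces the Score values in Tables 4 and 5 and selects movie_info first, matching the best order found above.

    # Hypothetical column-scoring heuristic (assumed formula: ASL * rows / cardinality).
    def column_score(avg_token_len, cardinality, num_rows):
        return avg_token_len * num_rows / cardinality

    movie_columns = {  # (ASL, cardinality) taken from Table 5
        "movie_info": (407.27, 68),
        "review_content": (131.50, 14977),
        "review_type": (5.3, 2),
    }
    num_rows = 15018
    ranked = sorted(movie_columns, key=lambda c: column_score(*movie_columns[c], num_rows), reverse=True)
    # ranked == ["movie_info", "review_type", "review_content"], i.e., movie_info is ordered first.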

7.5. Impact of Deduplication

We investigate the effects of our deduplication technique in detail. We construct queries based on Q1 while changing the specific LLM column inputs. Specifically, we vary the selection of columns passed into the LLM for analysis based on the cardinality of the column. Prior to LLM invocation, we deduplicate exact input prompt matches and pass only the first occurrence of each distinct prompt to the LLM. Figure 8 shows the results on the Movie and Product datasets.
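The core of this exact-match deduplication is simple; the sketch below assumes prompts have already been rendered to strings and uses a placeholder llm_batch() for the model call.

    def dedup_invoke(prompts, llm_batch):
        # Map each distinct prompt to a single LLM call, preserving first-seen order.
        distinct = list(dict.fromkeys(prompts))
        outputs = dict(zip(distinct, llm_batch(distinct)))
        # Re-expand the results so every original row receives its answer.
        return [outputs[p] for p in prompts]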

For the Movie dataset, we pass in the movie_info column alongside either the review_type, review_score, or review_content column. Since the review_type column has only two distinct values, the query with this column has a runtime that is 8.2× faster than the baseline with no deduplication. There are also many duplicate values in the review_score column (which has only five distinct values), so the query with this column has a runtime that is 3.3× faster than the baseline. In contrast, the review_content column contains mostly unique values, with only 99 rows out of 15,018 being deduplicated. As a result, the difference in runtime for this query with and without deduplication is negligible.

Figure 7. Selectivity Ablation, on (a) the Movie dataset and (b) the Product dataset. Different columns are filtered on, with the resulting rows passed into the LLM invocation. The lowest selectivity of the non-LLM filter for each of the Movie and Product tables (0.13 and 0.14, respectively) yields 6.7× and 5.3× faster query runtimes than filtering with the LLM first.
Figure 8. Deduplication Ablation, on (a) the Movie dataset and (b) the Product dataset. Exact prompt matches are deduplicated before going into the LLM. For the lowest-cardinality column on each of the Movie (review_type: 2) and Product (title: 144) tables, deduplicating inputs leads to 8.2× and 1.9× faster query execution, respectively.

For the Product dataset, we evaluate three pairs of columns: (title, overall), (summary, description), and (reviewText, description). The overall column, which captures the review score, has only five distinct values, so the query with this column has a 1.9× faster runtime than the no-deduplication baseline. On the other hand, the summary and reviewText columns do not contain many duplicate values, as they are mostly unique to each user. As a result, deduplication achieves only a 1.1× speedup for each.

7.6. Impact of SQL Optimizations

Finally, we investigate the effects of our SQL optimizations. Specifically, we evaluate the latency impact of varying the order in which the filter clauses of LLM queries are applied. We construct queries identical in structure to Q2 and vary the column(s) to filter on (alongside the LLM predicate). A naively written query could execute the LLM predicate first on the entire table before applying the non-LLM filters to the resulting set of data. We measure overall query runtime for two scenarios: (1) the LLM filter is executed first, and (2) the non-LLM filters are executed first. We choose columns to filter on alongside the LLM based on their selectivity ratio, measured as the ratio of the LLM input size to the table size. Figure 7 shows query latency as a function of the selectivity ratio.
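To make the two orderings concrete, the PySpark sketch below expresses both plans; it is illustrative only, with a stub UDF standing in for the LLM call that our system dispatches to the serving engine, and a hypothetical input path.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("llm-filter-order").getOrCreate()

    @F.udf(returnType=StringType())
    def llm_sentiment(movie_info, review_content):
        return "Negative"  # stub: the real system invokes the LLM here

    reviews = spark.read.parquet("reviews.parquet")  # hypothetical input path

    # (1) LLM predicate first: every row triggers an LLM invocation.
    llm_first = reviews.where(llm_sentiment("movie_info", "review_content") == "Negative") \
                       .where(F.col("review_type") == "Fresh")

    # (2) Non-LLM filter pushed down first: only rows surviving the cheap filter reach the LLM.
    sql_first = reviews.where(F.col("review_type") == "Fresh") \
                       .where(llm_sentiment("movie_info", "review_content") == "Negative")

Note that a query optimizer may reorder deterministic predicates on its own, so measuring the two scenarios requires fixing the execution order explicitly.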

For the Movie dataset, we choose the columns review_type and top_critic to filter on. We construct queries filtering with each possible value in review_type ("Fresh", "Rotten") and top_critic (True, False), as well as a combination of those values (review_type = Fresh & top_critic = True). We find that, as expected, query latency decreases as selectivity decreases, since fewer inputs are passed into the LLM. At the lowest selectivity level in this experiment (0.13), applying this filter-order optimization yields a 6.7× faster overall runtime compared to executing the LLM filter first.

For the Product dataset, we choose the columns verified and overall. We construct queries filtering with each possible value in verified (True, False) and overall (1.0 … 5.0), as well as a combination of those values (verified = True & overall = 5.0). Similar to the Movie dataset, query latency decreases as fewer inputs are passed in at lower selectivity. At the lowest selectivity level in this experiment (0.14), applying this filter-order optimization yields a 5.3× faster overall runtime than executing the LLM filter first.

8. Discussion

Our techniques for optimizing LLM queries have been pragmatically motivated: we consider straightforward optimizations that apply broadly to most LLM inference systems and relational databases. In this section, we outline future directions of research to further enhance LLM inference in relational analytics, covering finer-grained column reordering and row sorting, additional prefix caching opportunities, distributed caching and request routing, and semantic deduplication. These directions are critical to improving the scalability of database systems with LLM invocations.

8.1. Column Reordering and Row Sorting

Our column reordering and row sorting algorithms leverage straightforward, practical heuristics to efficiently identify shared prefixes. We can further extend these algorithms in future work by adopting a finer-grained approach. For instance, instead of maintaining a fixed order of columns for all rows in a table, we can consider different column orderings for subsets of rows to potentially enable higher token hit rates.

To do so, one potential strategy would be to iteratively determine column orders for each group of rows that share a value in a designated “pivot” column. Once an order is established for a group, we can exclude these rows from subsequent iterations, thereby reducing the complexity and focusing on determining the orders of the remaining rows. This recursive approach would allow us to progressively refine the column order to enhance the cache hit rate without giving up computational efficiency. Moreover, we plan to introduce a pruning mechanism for columns that consistently show negligible impact for prefix sharing. By excluding these columns from the reordering process, we can further reduce the computational overhead of this technique.

8.2. Prefix Caching: Opportunities

8.2.1. Decode Length Knowledge

While our work focuses on memory management for input prompts, output decoding also has a significant impact on performance. In particular, there are potential performance benefits with guided decoding in the analytics setting. Guided decoding, also known as structured generation or controlled generation, is a popular method to deterministically enforce a schema on the output of LLM (Microsoft, 2023; Zheng et al., 2023; Willard and Louf, 2023; Beurer-Kellner et al., 2023; Research, 2023). Since guided decoding involves computing a large intermediate state, it does not typically employ batching, hurting GPU utilization. In the relational analytics scenario, we can utilize our knowledge of the output schema and decoding length to construct the output constraint ahead of time. In many cases, we can skip tokens that are known a priori and perform batching to improve utilization.

8.2.2. Batch-Aware Reordering

While the column reordering and row sorting techniques we present can significantly improve the cache hit rate, they do not account for how requests in a batch are processed in parallel during LLM inference in some systems. Some inference systems (Huggingface, 2023) currently do not support prefix sharing within the same batch due to simultaneous computation. For instance, consider a sequence of requests [1,1,1][2,2,2][3,3,3]. None of these prefixes benefit from being cached because each distinct prefix (1, 2, 3) is computed multiple times within the same batch. As a result, there are six cache misses on this workload. However, if we instead reorder these requests into the sequence [1,2,3][1,2,3][1,2,3], we observe benefits from using the prefix cache: 1, 2, and 3 are computed and cached after the first batch, allowing them to be reused in subsequent batches. Thus, there are only three cache misses on this series of requests.

To address this, we propose a batch-aware reordering strategy that optimizes prefix utilization by aligning the request order with batch execution phases. This approach requires a warm-up phase for the cache, wherein we pre-load the cache with diverse prefixes expected to be reused in subsequent batches. Our algorithm first sorts all possible prefixes by their frequency of occurrence then arranges them across batches to ensure that each batch contains a set of distinct prefixes. This method allows us to leverage the cache more effectively: once a batch is processed, any subsequent requests can reuse the cached prefixes, significantly reducing cache misses.
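A minimal sketch of this strategy is shown below: requests are grouped by shared prefix, groups are ordered by frequency, and requests are then dealt out round-robin so that each batch contains distinct prefixes. This is an illustrative reading of the approach rather than a full implementation.

    from collections import defaultdict
    from itertools import zip_longest

    def batch_aware_reorder(requests, prefix_of, batch_size):
        # Group requests by their shared prefix key.
        groups = defaultdict(list)
        for r in requests:
            groups[prefix_of(r)].append(r)
        # Most frequent prefixes first, so they enter the cache during the warm-up batches.
        buckets = sorted(groups.values(), key=len, reverse=True)
        # Interleave round-robin: [1,1,1][2,2,2][3,3,3] becomes [1,2,3][1,2,3][1,2,3],
        # so no batch computes the same prefix twice.
        interleaved = [r for round_ in zip_longest(*buckets) for r in round_ if r is not None]
        return [interleaved[i:i + batch_size] for i in range(0, len(interleaved), batch_size)]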

8.3. Distributed Cache and Request Routing

On large-scale datasets, we need to distribute the inference workload over many machines to enable queries to complete in a reasonable amount of time. However, distributed KV caching is challenging: we need to ensure that requests with similar or identical prefixes are directed to the same machine. RAG queries bring additional complexity, since the KV cache must store not only LLM prefixes but also embeddings and the associated retrieved contents, which may be distributed across nodes. As such, balancing load over different machines with varying cache contents will be a critical objective.
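One simple building block for such routing is prefix affinity: hash the shared-prefix key of a request to pick a serving replica, so that requests with identical prefixes land on the machine that already holds their KV-cache entries. The sketch below is a hedged illustration of that idea only; it does not address load imbalance across replicas or the placement of RAG embeddings.

    import hashlib

    def route_by_prefix(prefix_key, num_replicas):
        # Deterministically map a shared-prefix key to one of the serving replicas.
        digest = hashlib.sha1(prefix_key.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_replicas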

8.4. Semantic Deduplication

We can extend our exact deduplication technique to filter out more data by implementing rule-based methods. Specifically, we can leverage NLP pre-processing techniques that remove case sensitivity (e.g., 'Apple' and 'apple') and noise (e.g., '<Apple>' and 'Apple'), as well as apply stemming and lemmatization (e.g., 'running', 'ran', 'runs'), spelling correction (e.g., 'mitake' vs. 'mistake'), and stop word removal (e.g., removing 'a', 'the', etc.). These could be implemented as hard constraints (optionally specified by the user) to reduce the number of LLM invocations without impacting query accuracy, since the equivalence of text is specified via the given rules.
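A hedged sketch of such rule-based normalization is shown below; it covers only the simplest rules (case folding, markup-style noise, and stop word removal) with a hand-picked stop word list, while stemming, lemmatization, and spelling correction would come from an NLP library.

    import re

    STOP_WORDS = {"a", "an", "the"}  # illustrative subset

    def normalize(text):
        text = text.lower()                       # remove case sensitivity
        words = re.findall(r"[a-z0-9']+", text)   # tokenization also strips noise like '<' and '>'
        return " ".join(w for w in words if w not in STOP_WORDS)

    # Rows whose normalized prompts are equal can share a single LLM invocation.
    assert normalize("The <Apple>") == normalize("apple")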

We can deduplicate even more aggressively by identifying semantically similar rows. For example, "I like this restaurant" and "I love this place" have the same semantic meaning and will likely produce identical output under an LLM sentiment analysis task. As such, sending only one of these inputs to the LLM for processing should be sufficient for most purposes. One possible direction is to evaluate token similarity within the LLM to identify semantically equivalent requests. We also look forward to adapting training-dataset deduplication methods, such as SemDeDup (Abbas et al., 2023), to production workload processing.

9. Related Work

Our optimizations build on recent work in LLM inference as well as a long tradition of integrating machine learning and data management. We describe several major related areas below.

Model pipeline tools. LLM toolkits, which have grown rapidly in popularity, provide users with the ability to stitch model pipelines together from basic abstractions. Among these, LangChain (Chase, 2022) and LlamaIndex (Liu, 2022) have seen the most usage. LangChain’s framework allows for convenient abstractions for different parts of the LLM serving stack and also enables users to “batch” multiple requests into a model. However, this is accomplished through basic thread parallelism and without any model optimizations applied to handle the series of queries found in a typical analytics workload.

Inference-optimized systems. There has been a recent rise of dedicated systems for LLM inference, including FasterTransformer (NVIDIA, 2023a), Orca (Yu et al., 2022), vLLM (Kwon et al., 2023), and FlexGen (Sheng et al., 2023). Our work builds upon prior work investigating high throughput LLM inference and continuous batching for model serving. However, past systems focus on the online setting and make no assumptions about what requests will be sent to the LLM. In contrast, we leverage full workload information from batch queries to significantly improve performance for these workloads.

Optimized shared prefix kernels. Since prefix sharing is a simple yet highly effective optimization, there has been recent work on developing memory-efficient GPU kernels that perform inference while leveraging shared prefixes. SGLang’s RadixAttention (Zheng et al., 2023), Hydragen (Juravsky et al., 2024), and Cascade Inference (Ye et al., 2024) all implement optimized kernels. Our work heavily leverages these kernels to enable prefix sharing while delivering higher throughput as compared to traditional attention kernels (Dao et al., 2022).

Machine learning in databases. There is extensive work on integrating machine learning models with databases (Kang et al., 2017, 2019; Lu et al., 2018). MADLib (Hellerstein et al., 2012) is one example of many works that have focused on designing systems to train complex models on large datasets. Recent works such as Velox explore online serving using data management systems (Crankshaw et al., 2014), and Ralf optimizes machine learning feature maintenance in data pipelines (Wooders et al., 2023). However, these past works did not specifically address large language models, which have extremely high computational costs and unique architectural properties, such as the KV cache. As such, LLMs offer many new optimization opportunities in the context of relational queries. Other works, such as NoScope (Kang et al., 2017), BlazeIt (Kang et al., 2019), and Probabilistic Predicates (Lu et al., 2018), propose approximating expensive ML model calls with less expensive models for approximate query processing, but this can reduce query accuracy, and these works do not take advantage of the unique characteristics of KV cache reuse in LLM inference.

10. Conclusion

In this paper, we introduce a range of techniques to optimize LLM invocations in relational workloads. We first observe that LLM computation is the dominant cost factor in LLM-enabled SQL workloads: each byte processed by LLMs is multiple orders of magnitude more expensive than bytes processed with traditional SQL operators. By leveraging workload information in relational analytics, coupled with observations about the LLM inference process, we can significantly improve end-to-end query performance and reduce costs without affecting query semantics. Our caching, deduplication, and SQL optimization techniques enhance prefix sharing of the LLM KV cache and reduce the number of LLM invocations needed. We observe up to a 4.4× decrease in end-to-end query latency. In optimizing for LLM queries, we fill a void in existing inference systems, which only target the online setting. Our results suggest that there is a wide design space to further enhance LLM inference in the relational analytics setting and greatly improve performance.

References

  • dat ([n.d.]a) [n.d.]a. https://docs.databricks.com/en/large-language-models/how-to-ai-query.html.
  • dat ([n.d.]b) [n.d.]b. AI Functions on Databricks — docs.databricks.com. https://docs.databricks.com/en/large-language-models/ai-functions.html. [Accessed 01-03-2024].
  • duc ([n.d.]) [n.d.]. How fast is DuckDB really? — Blog — Fivetran — fivetran.com. https://www.fivetran.com/blog/how-fast-is-duckdb-really. [Accessed 01-03-2024].
  • aws ([n.d.]) [n.d.]. Large Language Models for sentiment analysis with Amazon Redshift ML (Preview) — Amazon Web Services — aws.amazon.com. https://aws.amazon.com/blogs/big-data/large-language-models-for-sentiment-analysis-with-amazon-redshift-ml-preview/. [Accessed 01-03-2024].
  • goo ([n.d.]) [n.d.]. LLM with Vertex AI only using SQL queries in BigQuery — Google Cloud Blog — cloud.google.com. https://cloud.google.com/blog/products/ai-machine-learning/llm-with-vertex-ai-only-using-sql-queries-in-bigquery. [Accessed 01-03-2024].
  • pys ([n.d.]) [n.d.]. PySpark. https://www.databricks.com/glossary/pyspark.
  • Abbas et al. (2023) Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S. Morcos. 2023. SemDeDup: Data-efficient learning at web-scale through semantic deduplication. arXiv:2303.09540 [cs.LG]
  • Armbrust et al. (2015) Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Melbourne, Victoria, Australia) (SIGMOD '15). Association for Computing Machinery, New York, NY, USA, 1383–1394. https://doi.org/10.1145/2723372.2742797
  • Beurer-Kellner et al. (2023) Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. 2023. Prompting Is Programming: A Query Language for Large Language Models. Proceedings of the ACM on Programming Languages 7, PLDI (June 2023), 1946–1969. https://doi.org/10.1145/3591300
  • Chase (2022) Harrison Chase. 2022. LangChain. https://github.com/langchain-ai/langchain
  • Cherkasova and Ciardo (2001) Ludmila Cherkasova and Gianfranco Ciardo. 2001. Role of Aging, Frequency, and Size in Web Cache Replacement Policies. In Proceedings of the 9th International Conference on High-Performance Computing and Networking (HPCN Europe 2001). Springer-Verlag, Berlin, Heidelberg, 114–123.
  • Crankshaw et al. (2014) Daniel Crankshaw, Peter Bailis, Joseph E. Gonzalez, Haoyuan Li, Zhao Zhang, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan. 2014. The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox. arXiv:1409.3809 [cs.DB]
  • Dao et al. (2022) Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG]
  • He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. In Proceedings of the 25th International Conference on World Wide Web (Montréal, Québec, Canada) (WWW '16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 507–517. https://doi.org/10.1145/2872427.2883037
  • Hellerstein et al. (2012) Joe Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. 2012. The MADlib Analytics Library or MAD Skills, the SQL. arXiv:1208.4165 [cs.DB]
  • Huggingface (2023) Huggingface. 2023. Text Generation Inference. https://huggingface.co/docs/text-generation-inference/en/index
  • Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
  • Juravsky et al. (2024) Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, and Azalia Mirhoseini. 2024. Hydragen: High-Throughput LLM Inference with Shared Prefixes. arXiv:2402.05099 [cs.LG]
  • Kang et al. (2019) Daniel Kang, Peter Bailis, and Matei Zaharia. 2019. BlazeIt: optimizing declarative aggregation and limit queries for neural network-based video analytics. Proc. VLDB Endow. 13, 4 (Dec 2019), 533–546.
  • Kang et al. (2017) Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. NoScope: optimizing neural network queries over video at scale. Proc. VLDB Endow. 10, 11 (Aug 2017), 1586–1597.
  • Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (Koblenz, Germany) (SOSP '23). Association for Computing Machinery, New York, NY, USA, 611–626. https://doi.org/10.1145/3600006.3613165
  • Lewis et al. (2021) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 [cs.CL]
  • Liu (2022) Jerry Liu. 2022. LlamaIndex. https://doi.org/10.5281/zenodo.1234
  • Lu et al. (2018) Yao Lu, Aakanksha Chowdhery, Srikanth Kandula, and Surajit Chaudhuri. 2018. Accelerating Machine Learning Inference with Probabilistic Predicates. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 1493–1508.
  • Microsoft (2023) Microsoft. 2023. Guidance. https://github.com/guidance-ai/guidance
  • NVIDIA (2023a) NVIDIA. 2023a. Faster Transformer. https://github.com/NVIDIA/FasterTransformer
  • NVIDIA (2023b) NVIDIA. 2023b. TensorRT LLM. https://github.com/NVIDIA/TensorRT-LLM
  • Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL.
  • Raasveldt and Mühleisen (2019) Mark Raasveldt and Hannes Mühleisen. 2019. DuckDB: an Embeddable Analytical Database. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 1981–1984. https://doi.org/10.1145/3299869.3320212
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250 [cs.CL]
  • Research (2023) Microsoft Research. 2023. Artificial Intelligence Controller Interface (AICI). https://github.com/microsoft/aici
  • Sheng et al. (2023) Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. arXiv:2303.06865 [cs.LG]
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
  • Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL]
  • Willard and Louf (2023) Brandon T. Willard and Rémi Louf. 2023. Efficient Guided Generation for Large Language Models. arXiv:2307.09702 [cs.CL]
  • Wooders et al. (2023) Sarah Wooders, Xiangxi Mo, Amit Narang, Kevin Lin, Ion Stoica, Joseph M. Hellerstein, Natacha Crooks, and Joseph E. Gonzalez. 2023. RALF: Accuracy-Aware Scheduling for Feature Store Maintenance. Proc. VLDB Endow. 17, 3 (Nov 2023), 563–576. https://doi.org/10.14778/3632093.3632116
  • Xiao et al. (2023) Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv:2309.07597 [cs.CL]
  • Ye et al. (2024) Zihao Ye, Ruihang Lai, Bo-Ru Lu, Chien-Yu Lin, Size Zheng, Lequn Chen, Tianqi Chen, and Luis Ceze. 2024. Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding. https://flashinfer.ai/2024/02/02/cascade-inference.html
  • Yu et al. (2022) Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521–538. https://www.usenix.org/conference/osdi22/presentation/yu
  • Zheng et al. (2023) Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2023. Efficiently Programming Large Language Models using SGLang. arXiv:2312.07104 [cs.AI]
  • Zouhar et al. (2023) Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, and Ryan Cotterell. 2023. A Formal Perspective on Byte-Pair Encoding. arXiv:2306.16837 [cs.CL]