The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems*
Linke Song*§, Zixuan Pang†, Wenhao Wang*§⊠, Zihao Wang‡, XiaoFeng Wang‡, Hongbo Chen‡, Wei Song*§, Yier Jin†, Dan Meng*, Rui Hou*
*Institute of Information Engineering, CAS; †University of Science and Technology of China; ‡Indiana University Bloomington; §School of Cyber Security, University of Chinese Academy of Sciences
Abstract
The wide deployment of Large Language Models (LLMs) has given rise to strong demands for optimizing their inference performance. Today's techniques serving this purpose primarily focus on reducing latency and improving throughput through algorithmic and hardware enhancements, while largely overlooking their privacy side effects, particularly in a multi-user environment. In our research, for the first time, we discovered a set of new timing side channels in LLM systems, arising from shared caches and GPU memory allocations, which can be exploited to infer both confidential system prompts and those issued by other users. These vulnerabilities echo security challenges observed in traditional computing systems, highlighting an urgent need to address potential information leakage in LLM serving infrastructures.
In this paper, we report novel attack strategies designed to exploit such timing side channels inherent in LLM deployments, specifically targeting the Key-Value (KV) cache and semantic cache widely used to enhance LLM inference performance. Our approach leverages timing measurements and classification models to detect cache hits, allowing an adversary to infer private prompts with high accuracy. We also propose a token-by-token search algorithm to efficiently recover shared prompt prefixes in the caches, showing the feasibility of stealing system prompts and those produced by peer users. Our experimental studies on black-box testing of popular online LLM services demonstrate that such privacy risks are completely realistic, with significant consequences. Our findings underscore the need for robust mitigation to protect LLM systems against such emerging threats.
1 Introduction
Large Language Models (LLMs) are widely used in applications such as chatbots [9, 12], search engines [26], and
coding assistants [14]. However, LLM inference is resource-intensive, requiring substantial computational power and memory due to the model's vast parameters, numerous layers, and large context sizes. Improving LLM inference performance has thus become essential, leading to solutions such as weight quantization [42,56,78,81,85], model compression [62, 76, 98], algorithm optimization [39, 53, 82], hardware advancements [72, 91], and parallel processing techniques [87]. These approaches aim to reduce latency and improve inference efficiency [60], though their privacy implications remain less clear.
In this paper, we conduct the first security analysis of performance optimization techniques employed by modern LLM systems that serve multiple users or applications concurrently. Our research reveals significant information leaks arising from distinct side channels introduced by these techniques. Specifically, current LLM performance optimizations use shared caches to reduce computation and storage overhead during inference. However, memory sharing, cache contention and eviction and task scheduling among different users and applications can interfere with user requests, creating noticeable timing side channels. Exploiting these side channels can expose private prompts from other users or applications.
LLM cache channels. In our work, we examined various caches in LLM systems, which not only reduce the computational cost of LLM inference but also improve user experience by lowering service latency. We found that these caches can be misused to infer proprietary system prompts or sensitive prompts from peer users. These prompts may contain private user information and also hold commercial value, as they enable an LLM to carry out various downstream tasks without additional fine-tuning. We identified two primary cache channels:
Leakage from the KV cache. For each inference request, the LLM maintains an in-memory state called the KV cache, which is reused in every iteration throughout the request's entire service time. Due to the causal attention mask in LLMs, each token's activations are influenced only by preceding tokens in the sequence. Thus, if multiple requests share a
common prefix, the key and value embeddings for those prefix tokens are identical across sequences. To optimize the KV cache's memory usage, the system identifies matching prompt prefixes across multiple requests and shares their key and value embeddings in memory at runtime [53, 93]. This sharing occurs when prompts include a common prefix, which frequently happens with few-shot examples [67], chatbot system prompts [45], or prompt templates [29]. For example, it has been noted that Claude's prompt caching feature can reduce costs by up to 90% and decrease latency by up to 85% for long prompts [28].
Leakage from the semantic cache. The semantic cache boosts LLM performance by caching responses based on the semantic content of the requests. For example, for the prompts “give me suggestions for a comedy movie” and “recommend a comedy movie”, the LLM system can detect their semantic similarity and return similar responses without querying the LLM backend. Experiments show that when GPTCache is integrated with OpenAI's service, response speed can be improved by a factor of 2 to 10 upon a cache hit [38].
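To make the lookup concrete, the following is a minimal sketch of the matching logic a semantic cache performs; the `embed` function and the 0.7 threshold are illustrative placeholders rather than GPTCache's actual implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    """Return a cached response when a new prompt is semantically close to an old one."""
    def __init__(self, embed, threshold: float = 0.7):
        self.embed = embed          # any text -> vector model (placeholder)
        self.threshold = threshold  # similarity needed for a cache hit
        self.entries = []           # list of (embedding, response)

    def lookup(self, prompt: str):
        query = self.embed(prompt)
        for emb, response in self.entries:
            if cosine(query, emb) >= self.threshold:
                return response     # fast path: skip the LLM backend
        return None                 # cache miss: caller must query the LLM

    def insert(self, prompt: str, response: str):
        self.entries.append((self.embed(prompt), response))
```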
Challenges and solutions. A straightforward way to exploit these vulnerable caches is to directly search the prompt space for one that triggers a cache hit. However, this method faces multiple hurdles. First, the time difference resulting from hitting a single cache block is often minimal and can blend with GPU system noise and fluctuations in voltage and power, making it difficult to detect and exploit. Second, the KV cache only works when prompts share a common prefix, limiting attack opportunities. Additionally, the vastness of the prompt space makes it infeasible to systematically test every potential prompt to find a cached one. Complicating matters further, the attacker's own requests might be cached during the process, introducing additional noise and potentially causing the victim's cached data to be evicted.
To address these challenges, we developed various attack strategies to exploit LLM side channels. Specifically, we use a classification model to detect token-level KV cache hits based on offline timing measurements. By dynamically adjusting the timing threshold at runtime, we observe that online detection accuracy can be substantially improved with only a few repeated trials. To reduce the search space for the KV cache channel, we propose an incremental search algorithm that capitalizes on the requirement for prompts to share a common prefix, allowing us to recover the victim's prompt token by token. For the semantic cache channel, we design an algorithm to select the most representative prompts as the attacker's requests, given a targeted semantic focus. To minimize interference from the attacker's own requests, we introduce a mechanism to clear cached data via batches of irrelevant requests. For the semantic cache, our method also ensures the attacker's requests remain distinct by computing their semantic similarities.
Experimental studies. In our study, we verified the presence of timing leakages in open-source projects, including SGLang [30], Langchain [20], and GPTCache [16]. Building on these findings, we demonstrate the feasibility of deducing proprietary system prompts (i.e., prompt stealing attack) and inferring sensitive requests from neighboring users (i.e., peeping neighbor attack).
For the prompt stealing attack, our evaluation indicates that the accuracy of detecting per-token cache hits or misses in the KV cache is 0.99, with a false positive rate (FPR) of 0.003. Using the incremental search algorithm, we recovered the system prompt token by token, requiring an average of 111.46 queries per recovered token. This approach achieved an average recovery accuracy of 89.0% and a corresponding FPR of 0.04. For the peeping neighbor attack, our measurements show an 81.4% accuracy in distinguishing hits from misses, with an average FPR of 0.045 in a single trial. This accuracy improved to 95.4% with a 0.056 FPR after 5 trials under GPTCache's default settings. We further observed that it is possible to infer the documents processed by a victim user in a vulnerable LLM application, even when using standard commodity LLM services. Moreover, our black-box study of existing online services shows that popular LLM systems, such as Claude, DeepSeek, and Azure OpenAI, employ KV or semantic cache sharing to cut costs, rendering them susceptible to timing side-channel attacks.
Finally, we propose initial defenses against these side-channel risks. To mitigate KV cache leakage, we recommend sharing prefix caches only in batches of at least k tokens (k = 2, 3, 4, etc.). Although this increases the prompt search space and thus the required number of guesses, the larger timing differences for sharing multiple tokens also make classifiers more robust. Consequently, attacks remain accurate but incur higher query overhead. To address semantic cache leakage, we advise anonymizing privacy-related content in user inputs before performing semantic-similarity searches. Preliminary experiments show this measure adds modest overhead (around 4%).
Contributions. Our paper makes the following contributions:
New discovery. We identified new timing side channels in both open-source and online LLM serving systems, arising from the sharing of KV caches and semantic caches to lower inference costs.
Novel exploit strategies. We introduced new attack strategies to leverage the inherent side channels in LLM inference optimizations, enabling two distinctive attacks: prompt stealing attack and peeping neighbor attack.
Experimental validations, real-world measurements and mitigations. We validated the side-channel leakages locally on prominent LLM systems and conducted a black-box measurement study of popular online LLM services. We also presented preliminary mitigation measures for these risks.
Responsible disclosure. We disclosed our findings to all relevant developers (SGLang, GPTCache, etc.) and LLM service
providers (OpenAI, Claude, Google Gemini, etc.) upon identifying the side channels in September 2024. At the time of this manuscript's preparation, we received positive responses from the SGLang team, which noted that we were among the first two groups to report this issue, both within the same week. Moreover, we were the first to raise the topic during the SGLang development meeting, and we are now working closely with their team on a resolution. We will make all the code and datasets necessary to reproduce our experiments publicly available once the paper is published.
2 Background
2.1 LLM Serving Systems
In this paper, we explore the deployment of a shared LLM to serve multiple users or applications within a computing system. This setup is frequently observed in public services offered by commercial companies (e.g., OpenAI's ChatGPT) and also applies to locally deployed shared enterprise LLMs, which are tailored to handle specific tasks, process large volumes of proprietary data, and meet unique business requirements. Additionally, the rise of LLM-based applications, often referred to as AI agents or co-pilots, has introduced a novel software paradigm that merges the capabilities of LLMs with traditional software functionalities. With the emergence of LLMs as operating systems (e.g., AIOS [58]) and agents functioning as apps, multiple LLM-based applications or agents can operate on the same shared LLM, treating it as a foundational model. These LLM agents are typically developed and deployed by different teams or organizations. This concept also extends to local LLM instances in a browser environment. For example, Lumos [25] is a Chrome extension powered by Ollama, a Retrieval-Augmented Generation (RAG) LLM co-pilot for web browsing, running entirely on local hardware without relying on remote servers.
In these scenarios, LLMs are typically optimized to achieve efficient latency and throughput, focusing on memory usage optimization, effective batching, and scheduling. However, complications arise from memory sharing, cache contention and eviction, and GPU scheduling across different users and applications. Such factors can introduce interference among concurrent requests, potentially leading to observable timing side channels. As these users and applications are not all mutually trusted, sensitive information leakage becomes a concern. This includes the potential exposure of other users' confidential data-such as sensitive queries, proprietary system prompts, and processed documents-through timing side channels.
2.2 Serving Frontend
LLM serving modes. The LLM service offers two operation modes. In non-streaming mode, the response is fully generated and then delivered once the request has been processed. However, for long completions, this approach can result in an extended waiting period, possibly lasting several seconds. To achieve faster responses, the streaming mode is available. In this mode, the LLM emits tokens sequentially, allowing users to view the beginning of the completion while the remaining tokens are still being generated. Streaming is the preferred method for interacting with LLMs, especially in chatbot scenarios where real-time conversation is essential. Popular LLM applications (e.g., Bing Copilot [8], ChatGPT [9]) use a system prompt containing task definitions, examples, and safety rules to guide their behavior. This prompt is typically static and shared among all users.
Metrics. Latency measures how long it takes for an LLM to respond to a user's query, shaping users' perceptions of speed and efficiency in generative AI applications. Low latency is particularly important for real-time interactions, such as chatbots and AI copilots. Time to First Token (TTFT) is the interval from the moment a user submits a prompt until receiving the first token of the response. It reflects the initial processing delay and serves as a crucial indicator of user-perceived responsiveness. Throughput, on the other hand, represents how many requests or tokens an LLM can process within a given time window. Since requests per second is affected by the model's total generation time-which depends on output length-tokens per second is often used as the key metric for measuring throughput. This paper examines the risks arising from optimizing an LLM's serving latency and employs TTFT as the primary metric for side-channel observations.
2.3 Serving Backend
Most LLMs rely on the Transformer architecture, which uses the attention mechanism [75] to pinpoint the most relevant parts of the input. Core to this mechanism are Query (Q), Key (K), and Value (V) embeddings: Q represents what the model is seeking at the current position, K encodes how to match relevant information across the sequence, and V holds the actual data to be retrieved when a match occurs. Leveraging scaled dot-product attention, the model processes Q, K, and V to selectively focus on the most pertinent parts of the input. LLM inference consists of two stages: the prefill phase and the decoding phase. The prefill phase processes the entire request prompt to produce the first output token, while the decoding phase generates subsequent tokens one by one.
Prefill phase. During the prefill phase, the LLM takes the request prompt as input and converts it into a sequence of tokens. Each token is transformed into a numerical representation, called an embedding, which the model can process. In this phase, the LLM computes the K and V embeddings for each token across every attention layer, enabling the generation of the first token of the response in a single step.
Decoding phase. In the decoding phase, the LLM generates
each subsequent token by using the prefilled information and the single token produced in the previous step. For every layer, the engine computes the Q, K, and V embeddings for the new token and performs attention against all existing context tokens. Unlike the prefill phase, the decoding phase processes only one token at a time.
Memory management of KV cache. The attention mechanism in LLMs requires computing pairwise similarities among tokens in an input sequence, which leads to quadratic complexity with respect to sequence length [43]. To address this, KV caching stores the key and value embeddings in GPU memory, eliminating redundant computations and allowing the computation cost to scale linearly with sequence length.
Originally, LLM serving systems would statically allocate a sizable portion of memory for storing the KV cache, due to the unpredictable lengths of model outputs. However, this led to significant internal and external fragmentation. To mitigate these issues, vLLM introduced PagedAttention, which divides the KV cache into blocks and accesses them through a lookup table [53]. This table maps virtual cache blocks to physical locations in GPU memory, enabling efficient memory sharing across different requests. Modern LLM inference frameworks such as Nvidia's TensorRT-LLM [34] and Huggingface's TGI [23] incorporate similar concepts, but the security implications of sharing KV caches have not been thoroughly studied, leaving a critical gap in existing research.
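To illustrate the idea (not vLLM's actual data structures), the toy sketch below maps fixed-size prompt blocks to physical KV-cache blocks and reuses a block whenever an identical prefix has been seen before; `BLOCK_SIZE` and the prefix-keyed lookup are simplifications.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockTable:
    """Toy model of paged KV-cache management with prefix sharing.

    Blocks that hold identical token prefixes map to the same physical block,
    so a request whose prompt shares a prefix with an earlier request reuses
    the cached K/V entries instead of recomputing them.
    """
    def __init__(self):
        self.physical: Dict[Tuple[int, ...], int] = {}  # prefix block -> physical block id
        self.next_block_id = 0

    def map_prompt(self, token_ids: List[int]) -> Tuple[List[int], int]:
        table, reused = [], 0
        for start in range(0, len(token_ids), BLOCK_SIZE):
            # A block is identified by the entire prefix up to its end,
            # mirroring the requirement that sharing needs identical prefixes.
            key = tuple(token_ids[:start + BLOCK_SIZE])
            if key in self.physical:
                reused += 1                      # hit: reuse the physical block
            else:
                self.physical[key] = self.next_block_id
                self.next_block_id += 1          # miss: allocate a new block
            table.append(self.physical[key])
        return table, reused
```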
2.4 Threat Model
In this paper, we examine the security implications of deploying a shared LLM to serve multiple users or applications within a single computing system. Specifically, we consider two main scenarios. First, an LLM service provider offers public APIs that registered users can employ to send requests, all of which are processed by the same underlying serving system. In this context, a victim user may establish a proprietary system prompt to power a widely used LLM application. Meanwhile, an attacker could leverage the same LLM APIs to infer this system prompt, thereby gaining potential financial benefits or circumventing the safety instructions encoded in the prompt. Second, public LLM applications-such as chatbots (e.g., OpenAI's GPT-4) or document analysis services (e.g., AnythingLLM [7], Klu [15], etc.)-handle concurrent requests from multiple users via the same LLM serving system. If the application itself relies on a public LLM API, these requests are generally routed through the same developer's API key. An attacker could register as a user of such an application to discover whether specific requests have been submitted by others. For instance, they might seek to determine whether a user has shown interest in a particular topic or uploaded a specific file. They could also monitor requests over time to detect private attributes or sensitive personally identifiable information (PII) (Table 2). In both scenarios, the attacker's and the victim's requests share the same platform
and thus make use of the same caches. This shared environment can produce interference and create observable timing side channels-the core subject of our investigation. The attacker needs only black-box access to the underlying model, without knowledge of its architecture or weight parameters. However, the attacker must first examine the system's leakage profile in an offline phase, analyzing how different inputs affect timing. This analysis helps them craft inputs that exploit the timing discrepancies introduced by cache sharing.
In this paper, we explore the side-channel risks linked to both local and remote LLM services. For local settings, we assume a stable network connection between client and server. For remote services, previous works, such as NetCAT [52] and NetSpectre [69], have addressed mitigating noise caused by unstable connections and jitters, particularly in CPU cache side-channel attacks. Extending such noise-reduction strategies to remote LLM scenarios remains an avenue for future research. We do not consider hardware side channels tied to GPU micro-architectures (e.g., GPU caches [61], residue-based leakage [97], or power/frequency side channels [73]). Instead, our focus lies on software caches maintained by the LLM serving system, making our attacks applicable across various hardware platforms (CPUs, GPUs, ASICs, etc.).
3 Attacks
3.1 Overview
Creating effective prompts is a challenging task that requires substantial effort, particularly in scenarios like in-context learning where extensive data is needed to optimize LLM performance. Furthermore, prompts can include personal or sensitive information, making them valuable assets that must be safeguarded. For instance, Samsung Electronics has prohibited employees from using generative AI tools like ChatGPT to prevent accidental disclosure of confidential data to OpenAI [1].
In our research, we investigated two types of attacks. The first is the prompt stealing attack (PSA), which targets system prompts. A system prompt defines the model's operational behavior and may incorporate carefully crafted business logic, private data, or safety-related instructions. Consequently, LLM application developers treat it as confidential intellectual property [33]. Moreover, once exposed, the system prompt could facilitate other attacks, such as jailbreaking. The second is the peeping neighbor attack (PNA), which focuses on uncovering the semantics of another user's prompt. Since these prompts may contain personally identifiable information (PII) or other sensitive data, any disclosure poses a substantial risk to user privacy. There are three entities involved in these attacks: the server ($\mathcal{S}$), the victim user ($\mathcal{C}$), and the attacker ($\mathcal{A}$). The attacker's goal is to infer the prompt submitted by the victim user. The attack proceeds in two phases. In the offline phase, the attacker studies how a request
alters the server's state and how these modifications manifest in the latency of subsequent requests. Critically, these timing profiles stem primarily from the system's optimization techniques rather than from a specific model or parameter set.
In the online phase, the attacker leverages insights gained during the offline phase to craft requests that exploit the identified timing properties. Initially, $\mathcal{S}$ is in state $State_0$. When $\mathcal{C}$ issues a request, the state changes to $State_1$, reflecting updates like modifications to the KV or semantic cache. These state transitions can affect the performance of later requests. To track the system state, $\mathcal{A}$ regularly sends a request $r$ at intervals starting from time $t_{start}$, measuring the resulting latency $l = t_{end} - t_{start}$, where $t_{end}$ denotes the time point when the first token in the response arrives. By analyzing these latency readings, $\mathcal{A}$ can infer the prompt submitted by $\mathcal{C}$.
Background. KV caching is a widely adopted optimization in LLM serving systems, retaining the key and value embeddings from earlier inference steps to circumvent redundant computations during autoregressive text generation. Recent innovations, notably PagedAttention [53], improve on this concept by allowing the reuse of cached embeddings when prompts share common text segments, such as system messages, templates, or documents frequently included across multiple prompts. Representative implementations include automatic prefix sharing in vLLM [36], which detects shared prefix segments at runtime for KV cache reuse, and RadixAttention in SGLang [93], which efficiently manages shared KV caches for prompts containing common prefixes.
The side channel. Sharing KV caches can introduce a timing side channel. During the prefill phase of LLM inference, if a request's prefix matches one already stored in the KV cache, it will be processed more quickly. Because most LLMs stream their outputs (i.e., token by token), it is possible to measure the fine-grained Time to First Token (TTFT) and detect timing discrepancies associated with cache hits versus misses. Assuming a stable network latency, we estimate the timing gap under a typical LLM deployment [11] as follows. Consider a model with 7 billion parameters (e.g., Llama-7B) running on an A100 GPU with 312 TFLOPS of computational power and a memory bandwidth of 1.5 TB/s. In the best case, with full GPU utilization, a cache miss for a single token during the prefill phase may take:
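(a back-of-the-envelope figure, assuming the standard approximation of two FLOPs per model parameter per token)

$$
t_{\text{miss}} \approx \frac{2 \times 7 \times 10^{9}\ \text{FLOPs}}{312 \times 10^{12}\ \text{FLOPs/s}} \approx 45\ \mu\text{s}
$$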
By contrast, if the token hits the cache, the time is dominated by loading the precomputed KV cache from HBM (assuming 16-bit precision parameters):
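(estimated here with an assumed per-token K/V footprint for Llama-7B of 2 embeddings × 32 layers × 4096 dimensions × 2 bytes ≈ 0.5 MB)

$$
t_{\text{hit}} \approx \frac{0.5\ \text{MB}}{1.5\ \text{TB/s}} \approx 0.35\ \mu\text{s}
$$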
These gaps become larger for more complex models or when serving multiple requests concurrently. For instance, on a Llama-2-70B-Chat-GPTQ model (70 billion parameters at 4-bit precision), the prefill time for a single token miss is about 0.45 ms, while a hit is roughly 0.22 μs. These timing differences underpin the prompt stealing attack, which leverages KV cache sharing. Specifically, an attacker sends a request to the LLM and observes TTFT to detect whether the victim's prefixes match. Since KV cache sharing occurs only for prompts with the same prefix, we devised an incremental search algorithm to recover prompts on a token-by-token basis. We present this algorithm and assess its real-world performance in Section 4.1 and Section 4.3.
Background. Semantic caching (e.g., GPTCache [16, 38]) stores prior requests and their corresponding responses. Upon receiving a new request, it measures semantic similarity with cached requests. If the similarity surpasses a certain threshold, the system returns the cached response; otherwise, it queries the LLM again. Semantic caching can significantly reduce costs and enhance performance, and is integrated into major LLM frameworks like LangChain [20] and LlamaIndex [24].
The side channel. Unlike KV caching, which reuses data only for identical prompt prefixes, semantic caching allows reuse based on semantic similarity beyond a predefined threshold. However, sharing a semantic cache among multiple users can inadvertently reveal their requests. Cache hits provide responses in mere milliseconds, whereas cache misses can take several seconds-creating a clear timing difference that can be exploited by an attacker. This discrepancy enables the attacker to infer the semantics of concurrent requests issued by nearby users, a scenario we refer to as the peeping neighbor attack. Despite this, when the attacker tries to match a victim's request semantically, the attacker's own requests may also be cached, introducing noise into subsequent attempts. To address this challenge, we propose an efficient search algorithm that both minimizes the caching effects of the attacker's own requests and improves the detection rate for the victim's request. We describe this algorithm and illustrate how it can recover private information from neighboring users' prompts in Section 4.2.
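As a rough illustration of the probe-selection step (the `embed` function and the 0.7 cache threshold are placeholders, not GPTCache's internals), one can greedily filter candidate probes so that no two of them are mutually similar enough to hit each other's cache entries:

```python
import numpy as np

def select_probes(candidate_prompts, embed, cache_threshold=0.7):
    """Greedily keep candidate probes whose pairwise similarity stays below
    the semantic-cache threshold, so the attacker's own probes never hit each
    other's cache entries and each probe covers a distinct semantic region."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    probes, embeddings = [], []
    for text in candidate_prompts:
        vec = embed(text)
        if all(cosine(vec, prev) < cache_threshold for prev in embeddings):
            probes.append(text)
            embeddings.append(vec)
    return probes
```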
4 Side-channel Analysis and Evaluation
In this section, we present our empirical analysis of the identified side channels and describe strategies for their efficient exploitation. All experiments were conducted on a Lenovo server equipped with two Intel Xeon E5-2678 v3 CPUs (12 cores at 2.50 GHz each), 100 GB DDR4 memory, and two NVIDIA A100-PCIE GPUs (40 GB memory each). The system ran Ubuntu 22.04 (kernel 5.15.0-125-generic) with GPU driver 550.127.05, CUDA 12.4, and PyTorch 2.4.0. We used open-source models from the Llama family as the underlying LLM, adhering to their default hardware and software configurations for all evaluations.
4.1 Analysis on PSA
Attack overview. With increasing concerns about client data leakage in public LLM services, enterprise LLMs have become increasingly popular [19, 22]. In our study, we focus on a use case where the LLM API service is constructed from open-source projects within a local network environment. As described in Section 2.4, we consider a scenario in which a victim develops a popular LLM application (e.g., a chatbot) using a proprietary system prompt via the LLM service. The attacker interacts with this LLM application over the local network, measures the TTFT, and attempts to uncover the system prompt based on timing discrepancies. Specifically, the LLM service uses the SGLang backend API server [30], which supports KV cache sharing for common prefixes. Notably, LLM API servers often allow users to define various roles, such as “system” and “user”, within their requests. The victim's LLM chatbot is built on the FastChat framework [92]. As shown in Figure 1, FastChat supports both direct and synthesized operation modes. In direct mode, the user sends requests directly to the SGLang backend, whereas in synthesized mode, the full prompt is created by concatenating messages from each role according to predefined templates.
We consider that the victim's chatbot employs a proprietary system prompt for the “system” role in the synthesized mode, while the user's inputs fall under the “user” role. In the synthesized mode, this system prompt is prepended at every conversational turn to form the complete prompt sent to the SGLang backend. Notably, BloombergGPT [80] serves as a real-world example of a purposely built LLM for financial use, deployed for internal use at Bloomberg. It leverages 3-shot prompting to handle domain-specific tasks more effectively, safeguarding sensitive data and maintaining control over proprietary financial processes. As illustrated in Figure 2, an attacker can masquerade as a chatbot user, submitting either a direct request or a synthesized request that includes the static system prompt. When the LLM processes these synthesized prompts, it retains the separator and system prompt in the KV cache used by the SGLang backend, expediting subsequent requests that share partial prefixes of the system prompt. In this attack, the attacker first submits a synthesized request to cache the system prompt, then employs direct queries to reveal it through timing leakages.
Figure 1: LLM chatbots like FastChat allow user input through both direct requests (top) and synthesized requests via a template (bottom).
Characterizing the leakage. We began by examining the timing difference between a cache hit and a miss for a single token, comparing multiple model sizes. Specifically, we tested SGLang v0.3.0 [30], which organizes KV cache blocks in a radix tree to facilitate efficient prefix matching and reuse. Two models were evaluated: llama3.1-8B-Instruct and llama2-70B-GPTQ. System prompts of varying lengths were derived from the gabrielchua/system-prompt-leakage dataset [32] on Hugging Face. We measured TTFTs under conditions where the shared-prefix token count differed by exactly one-signifying cache hits versus misses-across 4,000 runs. Figure 3 illustrates the resulting time distributions, revealing a pronounced distinction between hit and miss scenarios for both models.
Based on these observations, a straightforward classifier can be built to categorize a token as a hit if its latency is below a certain threshold. As prompts grow longer, the latency also tends to rise due to increased computational demands. Consequently, the threshold must be adjusted according to the token's position in the prompt, as well as the particular hardware and model used.
In real-world evaluations of our classifier, we noted that TTFT often varies due to factors such as GPU system noise and fluctuations in voltage and power, weakening the effectiveness of a fixed classification threshold. To address this challenge, we introduce a dynamic calibration scheme that continuously adjusts the threshold in real time, enhancing the classifier's robustness. The primary mechanism involves simultaneously collecting TTFT data from known requests (i.e., requests without appended predicted tokens) within a brief time window alongside the targeted request's TTFT. These concurrent measurements establish a baseline threshold representing real-time system performance. Based on this baseline, the threshold is dynamically updated at runtime to boost accuracy.
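A minimal sketch of this calibration, assuming `measure_ttft` is the attacker's TTFT routine (as in Figure 5) and `baseline_prompt` is a request whose cache state is known:

```python
def calibrated_threshold(measure_ttft, baseline_prompt,
                         profiled_threshold, profiled_baseline, n=5):
    """Adjust the offline-profiled hit/miss threshold to the current load.

    The drift between the baseline prompt's current TTFT and its offline TTFT
    approximates run-time noise (GPU load, voltage/power fluctuation), which
    is added back onto the offline threshold.
    """
    current_baseline = sum(measure_ttft(baseline_prompt) for _ in range(n)) / n
    drift = current_baseline - profiled_baseline
    return profiled_threshold + drift
```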
To illustrate the classifier's effectiveness, consider an example using the LLaMA-2-70B-GPTQ model under our evaluation settings. We build a classifier to determine whether the 10-th token hits the KV cache. Using the profiled timing distribution, we start with an initial threshold of 0.07971 seconds and then dynamically refine it based on the system's real-time performance. We test the classifier using 4,000 hits and 4,000 misses drawn from randomly selected system prompts, achieving a TPR of 0.88 and an FPR of 0.10. To further mitigate noise, we employ multi-sampling by collecting n TTFT samples per token. The token is classified as a hit only if the number of hit detections exceeds a threshold k. With n = 10 and k = 5, the final classifier attains a TPR of 0.99 and an FPR of 0.003.
Figure 2: Overview of prompt stealing attacks.
Figure 3: Latency distribution of one token hit and miss. We used 2 representative models of different sizes: llama3.1-8B-Instruct and llama2-70B-GPTQ. We use the llama2-70B-GPTQ model for subsequent evaluations.
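The multi-sampling rule itself amounts to a simple vote over repeated measurements; a sketch, again assuming a `measure_ttft` helper:

```python
def is_hit(measure_ttft, prompt, threshold, n=10, k=5):
    """Multi-sampled hit/miss classifier: collect n TTFT samples and report a
    KV-cache hit only if at least k of them fall below the (calibrated)
    threshold, suppressing occasional latency outliers."""
    hits = sum(measure_ttft(prompt) < threshold for _ in range(n))
    return hits >= k
```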
End-to-end attacks. We built a local chatbot using the FastChat framework as the victim application. Its backend API server is configured with SGLang v0.3.0, which supports KV cache sharing for common prefixes. In this evaluation, we used the gabrielchua/system-prompt-leakage dataset [32] from Hugging Face, containing over 350,000 synthetic system prompts. Following the setup in Section 2.4, we assume the attacker has a public dataset of system prompts that mirrors those of the victim. Therefore, we randomly selected 200 prompts from the dataset as the victim's prompts and used the remainder for the attacker.
To streamline the search process, as illustrated in Figure 4, we propose an incremental token-by-token approach to recover the target prompt. This approach relies on multiple components: a series of token classifiers for validation, a next-token predictor to estimate token probabilities, and a sampler that selects candidate tokens based on the predictor's temperature setting. The next-token predictor is fine-tuned on the public system prompt dataset available to the attacker, allowing it to forecast the next token given the already retrieved tokens. For each candidate token at position i, we feed the partially reconstructed prompt into the LLM to obtain TTFT data, which then serves as the input to the corresponding classifier for position i. If the classifier identifies this token as a cache hit, the token is appended to the prompt; otherwise, a repetition penalty is applied by adjusting the probability distribution of the current token (in our implementation, this penalty is applied by halving the sampling probability for an incorrect token in the next round).
Figure 4: Efficient token-by-token request recovery.
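A condensed sketch of this loop follows; `predict_next`, `sample`, and `is_hit_at` stand in for the attacker's fine-tuned predictor, the temperature-based sampler, and the per-position timing classifier, and are named here only for illustration.

```python
def recover_prompt(predict_next, sample, is_hit_at,
                   max_tokens=100, max_guesses_per_token=50):
    """Incremental token-by-token recovery (a sketch of the loop in Figure 4).

    predict_next(prefix)   -> dict mapping candidate tokens to probabilities,
    sample(dist)           -> one token drawn from the distribution,
    is_hit_at(pos, tokens) -> True if the classifier for position `pos`
                              reports a KV-cache hit for the candidate prefix.
    """
    recovered = []
    for pos in range(max_tokens):
        dist = dict(predict_next(recovered))     # next-token distribution
        extended = False
        for _ in range(max_guesses_per_token):
            candidate = sample(dist)
            if is_hit_at(pos, recovered + [candidate]):
                recovered.append(candidate)      # confirmed by the timing channel
                extended = True
                break
            # Repetition penalty: halve the wrong candidate's probability
            # so the sampler favors other tokens in the next guess.
            dist[candidate] = dist.get(candidate, 0.0) / 2.0
        if not extended:
            break                                # give up on this position
    return recovered
```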
In constructing the next-token predictor, we adopt LLaMA-3.1-8B-Instruct [59] as the base model and fine-tune it using the attacker's training dataset. Given the partially recovered prompt and the chosen temperature, the predictor generates a probability distribution for the next token. Temperature scaling introduces variability into the prediction. Our predictor is not heavily optimized; it was fine-tuned on a single NVIDIA A100-PCIE GPU for only a few hours with limited training resources. With larger models, more epochs, and bigger batch sizes, the predictor's performance could be further improved.
To obtain TTFT values in the end-to-end scenario, the attacker first clears both the victim's and the attacker's requests from the cache. Next, the attacker measures the TTFT for their own request after the victim's system prompt has been cached. Below, we describe how the TTFT is gathered and how caches are flushed.
Timing measurement. As shown in Figure 5, the attacker begins by issuing a synthesized request containing the targeted system prompt. Once the end-of-sequence token is received
in the POST response, a short delay is introduced, ensuring the system prompt resides in the KV cache. The attacker then sends a direct request via a POST call using the anticipated prompt, configured to generate only one token of output. The TTFT is computed as the interval between sending the request and detecting the first non-blank token in the streamed response.

import json
import time
import requests

def get_ttft(text):
    start_time = time.perf_counter()
    # In stream mode, max_tokens = 1
    response = requests.post(
        ...,  # endpoint URL and JSON body ({..., "max_tokens": 1})
        stream=True,
    )
    for line in response.iter_lines():
        if line:
            end_time = time.perf_counter()
            break
    ttft = end_time - start_time
    return ttft

def complete(text):
    # Trigger the system prompt first
    response = client.chat.completions.create(...)
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            if data.get("end_of_sequence", False):
                break

# Wait for complete
complete(triggering_prompt)
# Short delay to ensure KV cache is updated
time.sleep(0.2)
ttft = get_ttft(predicted_prompt)

Figure 5: Code for measuring response latency in PSA.
Flushing the caches through eviction. We observed that SGLang provides a flush_cache API [47] that efficiently clears the cache. However, for our end-to-end attack scenario, we chose not to use this API, as it is unlikely to be accessible to attackers in real-world environments. Instead, we employed a more robust method of evicting the KV cache by issuing batches of irrelevant requests. Under default SGLang settings, sending 15 such requests (each containing about 300 tokens) was sufficient to trigger eviction in about 5 seconds. This approach proved successful in 100% of our 1,000 tests.
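A sketch of this eviction step, with `send_request` standing in for the attacker's request-issuing helper and random words standing in for "irrelevant" content:

```python
import random
import string

def flush_kv_cache(send_request, num_requests=15, tokens_per_request=300):
    """Evict cached prefixes by filling the KV cache with unrelated content.

    Under the default SGLang settings in our setup, roughly 15 requests of
    about 300 tokens each were enough to force eviction.
    """
    for _ in range(num_requests):
        words = (''.join(random.choices(string.ascii_lowercase, k=6))
                 for _ in range(tokens_per_request))
        send_request(' '.join(words), max_tokens=1)  # only the prefill matters
```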
Although the FPR for detecting a single token's cache hit or miss is low (0.003), the FPR can accumulate if the next-token predictor repeatedly suggests incorrect tokens, potentially leading to an erroneous token recovery. To address this, we introduce a cross-verification step for each candidate token. Specifically, we send another query that omits the token to estimate the hit-based TTFT. Next, we compare the TTFT
of this query with that of the predicted prompt. If the TTFT difference exceeds a preset threshold corresponding to one token's prefill time, the current prediction is deemed incorrect, and we proceed to the next guess. This token recovery process continues until either a predetermined number of tokens is successfully retrieved or the maximum number of allotted attack queries is reached.
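A sketch of the cross-verification check, assuming a `measure_ttft` helper (as in Figure 5) and a profiled `per_token_prefill` latency:

```python
def cross_verify(measure_ttft, prefix_text, candidate_text, per_token_prefill):
    """Double-check a candidate token that the classifier flagged as a hit.

    Measure the TTFT of the prompt without the candidate (a known hit) and
    with it; if the gap exceeds roughly one token's prefill time, the
    candidate is treated as a false positive.
    """
    baseline = measure_ttft(prefix_text)
    probe = measure_ttft(prefix_text + candidate_text)
    return (probe - baseline) <= per_token_prefill
```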
We evaluated the recovery accuracy and the average number of queries needed to retrieve each token. Our results show a success rate of 89.0%, an FPR of 0.04, and an average of 5.57 guesses with 111.46 attack queries per recovered token. Out of 200 victim prompts, we successfully recovered an average of 11.58 tokens for the top 100 prompts, and 17.9 tokens on average for the top 50 prompts. The maximum number of tokens recovered for a single prompt was 81, achieved with just 513 total guesses. Table 1 presents several examples of target system prompts alongside the prompts recovered via PSA. In real-world attacks, the attacker can recover additional tokens by deploying a more advanced next-token predictor and increasing the maximum number of attack queries. A demo for the end-to-end attack is presented on our website [37].
4.2 Analysis on PNA
Attack overview. In the PNA attack, we note that not all user queries contain sensitive data. An attacker is unlikely to pursue generic requests like “What's the weather today?”. Instead, they are expected to focus on specific requests more likely to reveal private information. For example, a request such as “Draft a travel plan for my husband and me to Rome in September for a 3-day stay at the Hilton hotel” could expose personal and location details. By identifying such high-risk queries, the attacker can exploit timing side channels to recover private information from other users' requests. For this purpose, the attacker could compile a list of privacy-related prompts from online sources (Table 2). The goal is to discover the connections with private attributes, e.g., whether a user is linked to a particular medical condition.
Figure 6 illustrates the PNA steps. When semantic caching is used, the LLM stores victim requests and serves cached responses for similar queries. To exploit this channel, the attacker creates requests containing private attributes and monitors TTFT. A noticeable reduction in TTFT indicates that a cached victim's request has been matched semantically with the attacker's probe, thereby revealing private user data with high accuracy.
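In its simplest single-trial form, the probe reduces to one coarse TTFT comparison; a sketch, with an illustrative 0.5 s threshold and a `measure_ttft` helper:

```python
def probe_attribute(measure_ttft, probe_prompt, hit_threshold_s=0.5):
    """Single-trial peeping-neighbor probe. A semantic-cache hit is served in
    milliseconds while a miss takes seconds, so one coarse TTFT threshold
    indicates whether a semantically similar victim request is already cached."""
    return measure_ttft(probe_prompt) < hit_threshold_s
```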
In practice, users might express the same idea in varied ways and embed diverse private attributes (e.g., personal names, medical conditions) in their queries. Our objective is to determine how these different attributes affect semantic similarity and whether the resulting variations exceed a threshold described in Section 3.3. If so, such differences can be detected, enabling the attacker to determine whether the victim's request carries particular attributes, even when the
Table 1: Examples of recovered system prompts, including the number of attack queries and recovered tokens. We only listed travel planning-related prompts to demonstrate the PSA's ability to recover diverse expressions.
| No. | Recovered system prompts | #queries / #recovered tokens |
| :---: | :--- | :---: |
| 1 | In your role as a dedicated travel itinerary assistant, you will craft well-organized travel plans based on user preferences and interests. Gather information from the user's inputs such as destinations, travel dates, and preferred activities to formulate a comprehensive itinerary. The output should include: dates, activities planned for each day, estimated costs, and important local information such as culture or tips. Emphasize clear, organized, … | 10,260 / 81 |
| 2 | You are programmed to function as a travel itinerary planner focusing exclusively on creating unique travel experiences. Provide tailored itineraries for destinations worldwide. … | 3,620 / 26 |
| 3 | Imagine you are a travel itinerary planner specializing in creating unique and personalized travel experiences. Your role is to craft itineraries that cater to the diverse interests and needs of travelers. … | 1,360 / 32 |
Table 2: Examples of user prompts that contain private attributes.
| Use cases | Prompts |
| :---: | :--- |
| Healthcare | Compose a meeting agenda for an interdisciplinary team discussing the treatment plan for [Name] with [medical condition]. |
| Travel planning | I'm [flying/driving] to [destination] with [Name] for a leisurely trip, and we'll be staying at [hotel] for [number of days]. Can you create a comprehensive packing list organized by category? |
Figure 6: Peeping Neighbor Attacks.
In our experimental study, we chose LangChain as the LLM serving framework, given its widespread adoption and extensive application ecosystem [21]. The local chatbot system was built using LangChain and integrated with GPTCache [17] for semantic caching. We used the gpt-3.5-turbo API as the backend LLM, and MedQuAD [2] as the evaluation dataset.
Characterizing the leakage. Figure 7 outlines our evaluation steps. We illustrate the process with a query template “Compose a meeting agenda … for [Name] with [medical condition]”, denoted as $T_0$. Here, both the name and the medical condition are treated as private attributes.
Step 1. We randomly sampled 10 names from the Python package names-dataset [66] (Names). To obtain 10 random medical conditions (Medconds), we selected 10 semantically unrelated Q&A pairs from the MedQuAD dataset and used GPT-3.5-turbo to extract medical conditions from the questions. We then randomly chose 1 name ($\mathrm{Name}_0$) and 1 medical condition ($\mathrm{Medcond}_0$) as the private attributes to be recovered; the remaining pairs serve as the negative group (described below).
Step 2. Since real-world users may phrase the same content differently, we generated multiple sentences that share the same semantics as $T_0$. Specifically, we asked GPT-3.5-turbo to paraphrase $T_0$ into $n$ variations $\{T_1, \ldots, T_n\}$, each filled with $\mathrm{Name}_0$ and $\mathrm{Medcond}_0$.
Following these steps, we created two sample sets: (1) Positive Group: sentences that are semantically similar to $T_0$, all containing $\mathrm{Name}_0$ and $\mathrm{Medcond}_0$; formally, $\{(T_i, \mathrm{Name}_0, \mathrm{Medcond}_0) \mid i = 1, \ldots, n\}$. (2) Negative Group: the original template $T_0$ populated with other names or medical conditions; formally, $\{(T_0, \mathrm{Name}_i, \mathrm{Medcond}_j) \mid i \neq 0 \text{ or } j \neq 0\}$.
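The sampling procedure can be summarized by the sketch below; the `paraphrase` callable (standing in for the GPT-3.5-turbo rewriting step) and the `{name}`/`{medcond}` placeholder format are our own illustrative conventions.

```python
import random

def build_groups(template, paraphrase, names, medconds, n_paraphrases=10):
    """Build the positive and negative sample sets described above.

    `template`   -- query template T_0 containing {name} and {medcond} placeholders
    `paraphrase` -- callable returning n semantically equivalent rewrites of T_0
                    (a stand-in for the GPT-3.5-turbo paraphrasing step)
    """
    name0, medcond0 = random.choice(names), random.choice(medconds)

    # Positive group: paraphrases T_1..T_n, all filled with (Name_0, Medcond_0).
    positives = [t.format(name=name0, medcond=medcond0)
                 for t in paraphrase(template, n_paraphrases)]

    # Negative group: the original template T_0 filled with any other
    # (name, medcond) pair, i.e. i != 0 or j != 0.
    negatives = [template.format(name=n, medcond=m)
                 for n in names for m in medconds
                 if not (n == name0 and m == medcond0)]
    return (name0, medcond0), positives, negatives
```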
Step 3. We split the positive group into two subsets.
Figure 7: Evaluating semantic leakage of private attributes.
Figure 8: Leakage profile of semantic cache sharing. We plotted the ROC curve to fingerprint the relationship between the similarity vectors of the positive and negative groups.
The first subset (20% of the samples) is designated as the “positive” reference set, while the remaining 80% forms the evaluation set. We also ensure that the evaluation set’s positive and negative samples are equal in size. Finally, we compute semantic similarities between the evaluation samples and both the positive and negative groups, yielding two respective similarity distributions.
We tested similarity thresholds from 0.6 to 0.9 (GPTCache’s default is 0.8). Figure 8a shows the ROC curves for the positive and negative similarity distributions, revealing a clear separation between the two. At the default threshold of 0.8, for instance, a TPR of 0.95 can be achieved with an FPR below 0.1. We also examined cases where only one private attribute (either $\mathrm{Name}_0$ or $\mathrm{Medcond}_0$) matched. Here, the negative group consists of sentences with only one correct private attribute, while the positive group remains the same. Figure 8b shows that the semantic distinctions remain substantial: at the default threshold of 0.8, a TPR of 0.85 corresponds to an FPR under 0.1.
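The threshold sweep can be reproduced with a short script along the following lines; the `embed` function is a placeholder for whichever sentence-embedding model backs the semantic cache, and scikit-learn’s `roc_curve` is used only for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

def similarity_to_reference(embed, samples, reference):
    """Max cosine similarity of each evaluation sample against the reference set.

    `embed` maps a list of sentences to an (n, d) array; any sentence-embedding
    model can be plugged in here as a stand-in for the cache's embedding function.
    """
    s, r = embed(samples), embed(reference)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    return (s @ r.T).max(axis=1)

def leakage_roc(embed, eval_pos, eval_neg, reference):
    """ROC curve separating the positive and negative similarity distributions."""
    scores = np.concatenate([similarity_to_reference(embed, eval_pos, reference),
                             similarity_to_reference(embed, eval_neg, reference)])
    labels = np.concatenate([np.ones(len(eval_pos)), np.zeros(len(eval_neg))])
    fpr, tpr, thresholds = roc_curve(labels, scores)
    return fpr, tpr, thresholds  # e.g., inspect the operating point nearest 0.8
```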
End-to-end attacks. We consider a typical open-world scenario in which a victim user requests healthcare assistance from an LLM. For instance, the user might submit a query with semantics similar to the template “compose a meeting agenda…” shown in Table 2, but with various names and medical conditions. The user may also send queries unrelated to the targeted request. To simulate this, we model the user’s queries as follows:
Type-1 (true samples): Queries with the specific name (e.g., “Alice”) and the specific medical condition (e.g., “heart disease”).
Type-2 (false samples): Queries that use the same name as the true samples (e.g., “Alice”) but feature different medical conditions, such as “diabetes”, “hypertension”, or “asthma”.
Type-3 (false samples): Queries with the same medical condition as the true samples (e.g., “heart disease”) but with different names.
Type-4 (false samples): Queries unrelated to the target scenario.
We assume that the attacker focuses on uncovering the private attribute associations found in Type-1 queries. The victim can freely choose different paraphrases while preserving the same underlying semantics. To simulate this, we configure the victim to send five random requests per round: one Type-1 query (true sample) and four false samples (one Type-2, one Type-3, and two Type-4 requests). To measure the effectiveness of the attack, we use the TPR to assess how successfully the attacker retrieves private attributes from Type-1 requests. We also measure separate FPRs to capture how often the attacker incorrectly categorizes each of the three false sample types as positive.
To perform effective end-to-end attacks on the semantic cache, we must eliminate noise introduced by the attacker’s own requests, which also remain in the cache. To address this challenge, we developed a method that fully clears the semantic cache after each attack round. Specifically, our experiments show that under GPTCache’s default configurations, sending 1,000 semantically unrelated requests is sufficient to remove any leftover cache entries. In this scenario, we assume an attacker aims to determine whether a particular user (e.g., “Alice”) is associated with a specific medical condition (e.g., “heart disease”). Since the victim may use different phrases, the attacker issues multiple requests to enhance coverage and boost the TPR. However, increasing the total number of requests also raises the risk of false positives (i.e., a higher FPR), especially if new requests strongly resemble earlier ones. To mitigate this issue, the attacker prioritizes representative requests in the request space, thus increasing the overall number of queries while minimizing interference among them.
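One attack round can then be organized as in the sketch below, which reuses the hypothetical `send_probe` and `measure_ttft` helpers from above; the flush list of roughly 1,000 semantically unrelated prompts and the TTFT threshold reflect our GPTCache setup and should be re-calibrated for other deployments.

```python
def run_attack_round(send_probe, measure_ttft, attack_requests,
                     flush_prompts, ttft_threshold=0.5):
    """One PNA round: probe the semantic cache, decide, then clear it.

    A round counts as a hit if *any* of the attacker's (orthogonal) requests
    observes a low TTFT, i.e., semantically matches a cached victim request.
    """
    hit = any(measure_ttft(send_probe, req) < ttft_threshold
              for req in attack_requests)

    # Evict attacker-induced entries (and the matched victim entry) so they do
    # not pollute the next round; ~1,000 unrelated prompts sufficed in our setup.
    for p in flush_prompts:
        for _ in send_probe(p):   # consume the first streamed chunk
            break
    return hit
```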
Representative requests are those that most closely approximate the rest of the request space. To identify them, we use the distilbert-base-uncased model to generate embeddings and then compute the L2 distance between these embeddings. We then sort the requests by their L2 distances; those with the smallest distances are deemed the most representative and selected as attack requests to maximize coverage.
Figure 9: The greedy search strategy for PNA.
Table 3: Attack accuracy for the 4 types of victim requests with different numbers of attack trials.
To further expand coverage, we incorporate orthogonal requests, i.e., requests that are semantically distinct from one another. This reduces the chance that semantic overlaps among the attacker’s own requests degrade accuracy in identifying victim requests. We classify a cache access as a hit if at least one of the attacker’s requests triggers a cache hit in the timing channel. Although this strategy boosts coverage, it also raises the false positive rate (FPR), necessitating a careful balance.
Figure 9 depicts our greedy search algorithm for locating the most representative requests within a semantically similar target space, thereby improving PNA accuracy. Specifically, during each iteration, we pick the most representative candidate (e.g., $\alpha$) and add it to the attacker’s requests unless it is overly similar to existing ones (e.g., $\beta$). This process continues until no additional candidates remain or until the FPR exceeds a predefined threshold $\sigma$.
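A sketch of this selection procedure is given below. The mean-pooled distilbert-base-uncased embeddings follow the description above, while `estimate_fpr` (a callback measuring the FPR of a candidate request set on held-out false samples) and the `min_l2_gap` cut-off for rejecting near-duplicates are illustrative assumptions.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

def embed(sentences, model_name="distilbert-base-uncased"):
    """Mean-pooled DistilBERT embeddings (one vector per sentence)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    with torch.no_grad():
        batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state          # (n, L, d)
        mask = batch["attention_mask"].unsqueeze(-1)        # (n, L, 1)
        emb = (hidden * mask).sum(1) / mask.sum(1)          # mean over real tokens
    return emb.numpy()

def greedy_select(candidates, estimate_fpr, sigma=0.06, min_l2_gap=4.0):
    """Greedy search for representative, mutually orthogonal attack requests.

    `estimate_fpr` is an assumed callback returning the FPR of the current
    request set; `min_l2_gap` is an illustrative cut-off rejecting candidates
    too close to already chosen ones.
    """
    vecs = embed(candidates)
    # Representativeness: small average L2 distance to the rest of the space.
    avg_dist = np.array([np.linalg.norm(vecs - v, axis=1).mean() for v in vecs])
    order = np.argsort(avg_dist)

    selected, selected_vecs = [], []
    for idx in order:
        v = vecs[idx]
        if any(np.linalg.norm(v - s) < min_l2_gap for s in selected_vecs):
            continue                       # too similar to an existing request
        trial = selected + [candidates[idx]]
        if estimate_fpr(trial) > sigma:    # stop once the FPR budget is exceeded
            break
        selected, selected_vecs = trial, selected_vecs + [v]
    return selected
```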
In the evaluation, each of the 4 victim request types was tested 500 times. In each iteration, a random pair of private attributes in the Type-1 request was selected. We set $\sigma = 0.06$ and identified 5 orthogonal attack requests using the proposed greedy search strategy. Table 3 summarizes the TPR for true samples, and the separate FPRs for each false sample type, when increasing the number of attack requests from 1 to 5. Specifically, we successfully recovered 407 victim requests out of the 500 true samples with a single attack request, achieving a recovery accuracy of 81.4% with an average FPR of 0.045. With 5 attack requests, 477 victim requests are recovered, demonstrating a recovery accuracy of 95.4% with an average FPR of 0.056. We provide a demo of this attack on our website [37].
Figure 10: Timing distribution for hits and misses of processed documents with 12,000, 18,000, and 24,000 tokens.
4.3 Inferring Documents on Commodity LLM
The KV cache can be shared among the same user, within an organization, or even across organizations [48], creating the potential for cross-user information leaks. In our research, we discovered that such leaks are feasible even in remote attack scenarios, particularly when the target LLM processes documents. To demonstrate this, we utilized a document summarization application powered by a commodity LLM API service, where all user requests are processed using the same developer’s API key. We provide an end-to-end demonstration showing how an adversary could infer processed documents from the application through the KV cache side channels. Importantly, an application is not required for cross-user attacks, as the KV cache can be shared across organizations [48]. However, our use of the application highlights the privacy risks inherent to LLM-powered applications, even when they fully comply with the guidelines provided by the LLM service.
Note that merely observing that a known document has been uploaded to an LLM service can expose an organization’s interests, with substantial privacy and competitive ramifications across various domains. For example, a law firm relying on an LLM-based document processor could unknowingly disclose its involvement in complex litigation or pivotal mergers, tipping off opposing parties about strategic decisions before they become public. Similarly, an investment firm analyzing financial statements might inadvertently signal which companies it views as high-potential opportunities, allowing competitors to anticipate emerging deals or investment moves.
The application. We implemented the document summarization application with direct summarization [31] (also known as the stuff approach [4]), using the public Deepseek-chat model as the backend LLM API server. The application was built in accordance with Deepseek’s guideline and operates by first extracting text from user-uploaded PDF files using the pdfplumber package. This text is then formatted into a request and sent to the LLM for summarization. In this setup, documents uploaded by users are included in messages under the “user” role and sent to the Deepseek model, which returns
summaries of the documents. Notably, all user inputs are processed under a single Deepseek account, which is typical in such applications. However, this poses a privacy concern because Deepseek’s prompt caching [27] can inadvertently allow cached content to be reused across different users. As a result, a malicious user could potentially infer which documents other users have processed by exploiting timing side channels.
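The application’s request path can be condensed into the sketch below. The pdfplumber extraction step follows the description above; the Deepseek base URL and model name are taken from its public OpenAI-compatible API documentation and should be treated as assumptions, as should the prompt wording.

```python
import pdfplumber
from openai import OpenAI

# Deepseek exposes an OpenAI-compatible API; the base URL and model name
# below follow its public documentation and should be treated as assumptions.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

def summarize_pdf(path):
    """Extract text from an uploaded PDF and request a direct ('stuff') summary."""
    with pdfplumber.open(path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)

    # The whole document is placed in a single "user" message, so its prefix
    # becomes eligible for provider-side prompt (KV) caching.
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user",
                   "content": f"Summarize the following document:\n\n{text}"}],
    )
    return resp.choices[0].message.content
```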
Characterizing the leakage. We experimented with 200 documents of various lengths (approximately 12,000, 18,000, and 24,000 tokens), each saved in the PDF format and derived from segments of the zero_scrolls dataset [70]. Our results revealed distinct latencies between cache hits (where documents had been cached) and cache misses (where documents had not been cached), as shown in Figure 10. In particular, responses to cache hits remained consistently fast across different document lengths, while cache misses grew noticeably slower with larger documents.
End-to-end attacks. In this evaluation, we assume an attacker aims to determine whether a specific document is uploaded by the victim for summarization. The attacker prepares a set of 200 documents of varying lengths (the “interested” documents). Meanwhile, the victim submits a total of 200 documents, half of which come from the interested set and half from outside it. This sequence is repeated 5 times, and each time the attacker attempts to distinguish which of the victim’s documents belong to the interested set.
Specifically, the victim first submits 100 documents from the interested set. The attacker then probes each of the 200 interested documents once, recording response latencies. Based on a predefined threshold (2.0 seconds in our experiments), the TPR is computed as the fraction of probed documents correctly identified as cache hits (i.e., with latencies below the threshold). Next, the victim uploads 100 additional documents outside the interested set, and the attacker probes the entire set of 200 documents again. The FPR is then calculated as the fraction of documents incorrectly labeled as hits. Our tests produced a TPR of 0.89 and an FPR of 0.05. A demo of the attack is presented on our website [37].
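The attacker-side decision rule is a simple threshold test, sketched below; `probe_latency` is a hypothetical helper that resubmits a document through the application and returns the observed response latency.

```python
def classify_hits(probe_latency, interested_docs, threshold=2.0):
    """Label each interested document as cached (hit) or not, based on latency."""
    return {doc: probe_latency(doc) < threshold for doc in interested_docs}

def tpr_fpr(predicted_hits, truly_uploaded):
    """TPR over documents the victim actually uploaded, FPR over the rest."""
    uploaded = set(truly_uploaded)
    tp = sum(1 for d, hit in predicted_hits.items() if hit and d in uploaded)
    fp = sum(1 for d, hit in predicted_hits.items() if hit and d not in uploaded)
    pos = sum(1 for d in predicted_hits if d in uploaded)
    neg = len(predicted_hits) - pos
    return tp / max(pos, 1), fp / max(neg, 1)
```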
Notes. For ethical reasons, we did not explore techniques for forcibly evicting cache entries in real-world systems. Such research would require extensive experimentation, potentially violating usage policies or interfering with other users’ experience. Without active cache eviction, the timing-based attack primarily operates at the granularity at which caches naturally expire due to inactivity, which is around 5 minutes for systems like OpenAI and Anthropic, as indicated in their documentation [27, 28].
4.4 Measurement Study on Commodity LLMs
KV cache sharing. To investigate KV cache sharing in commodity LLM services, we conducted experiments by invoking the APIs provided by these vendors.
Table 4: Summary of KV cache sharing in real world LLM serving systems (date: 08/29/2024).
| LLM service | System prompt sharing | User prompt sharing |
| :---: | :---: | :---: |
| GPT-4o-mini† | ✓ | ✓ |
| Deepinfra | ✓ | ✓ |
| Deepseek-chat | ✓ | ✓ |
| Claude-3.5 | ✓ | ✓ |
| Qwen-max | ✗ | ✗ |
| Moonshot | ✓ | ✓ |
| Baidu | ✗ | ✗ |
| Google Gemini | ✗ | ✗ |
| Fireworks.ai | ✓ | ✓ |
| Groq [18] | ✗ | ✗ |
| SiliconFlow | ✓ | ✓ |

† We observed a timing difference on 08/29/2024 and reported it to OpenAI. By late December 2024, the timing difference was no longer stable, despite the API indicating that the prompt cache was effective. This may be due to timing obfuscation measures implemented by OpenAI.
These APIs support different roles, such as system and user. For the measurement study, we designed requests with system and user prompts of varying lengths and configured them to run in streaming mode. For this evaluation, we used the zero_scrolls dataset to generate requests.
Specifically, we first measured the response latencies by sending initial requests that were likely to miss the cache. Then, we sent identical requests multiple times and measured the average latencies for these subsequent requests. To maximize the likelihood of co-locating on the same physical machine and to ensure the requests were cached, we conducted continuous tests within the same time period. If we observed lower latencies in the later requests, this indicated the use of caching mechanisms in the LLM services. With KV cache sharing, the computation of matched prefix tokens during the prefill phase can be skipped. However, the output generated during the decoding phase still requires computation and inference, which are influenced by parameters such as temperature, introducing randomness. To verify that the latency reduction was due to KV cache sharing, we also set a high temperature (0.9) in the request and checked whether the LLM produced different responses for each request while TTFT reductions were consistently observed. If it did, this strongly indicated that KV cache sharing was supported, enabling a reduction in TTFT while still allowing for diverse outputs. To minimize the impact of network latency, we sent requests of varying lengths, ranging from 200 to 2,000 tokens. The time difference between cached and uncached responses typically spanned several hundred milliseconds, making it easy to distinguish between the two.
Table 5: Native support of semantic caching of popular AI service providers (date: 08/29/2024).
| Service providers | Semantic cache support |
| :---: | :---: |
| Azure OpenAI Service models [13] | ✓ |
| Amazon Bedrock [6] | ✓ |
| Google Vertex AI [35] | ✓ |
| Alibaba Elastic Algorithm Service (EAS) of Platform for AI (PAI) [5] | ✓ |
Additionally, we observed that when the cache is hit, the TTFT remains consistent regardless of the request length, whereas when the cache is missed, the TTFT increases almost linearly with the length of the request. As summarized in Table 4, most popular LLM service providers support KV cache sharing in specific scenarios.
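The per-provider measurement loop can be sketched as follows, using an OpenAI-compatible streaming client; the endpoint, API key, model name, and repetition count are placeholders, and the temperature of 0.9 matches the setting described above.

```python
import time
from openai import OpenAI

# Placeholder endpoint and model; each provider in Table 4 was measured
# through its own (mostly OpenAI-compatible) API.
client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.example-llm.com/v1")

def ttft(messages, model="example-model", temperature=0.9):
    """Time from request submission to the first streamed token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model, messages=messages, stream=True, temperature=temperature)
    for _chunk in stream:
        return time.perf_counter() - start
    return float("inf")

def measure_prompt_caching(messages, repeats=5):
    """First request is expected to miss the cache; later identical ones may hit."""
    first = ttft(messages)                             # likely miss
    later = [ttft(messages) for _ in range(repeats)]   # candidates for cache hits
    return first, sum(later) / len(later)
```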
Semantic cache sharing. We manually reviewed the documentation of public cloud AI service providers to verify whether they support semantic cache APIs. As shown in Table 5, semantic caching is supported by major AI platform-as-a-service providers. Notably, even on platforms that do not offer native semantic caching, users can still implement their own solutions or leverage open-source alternatives, such as GPTCache.
5 Mitigations
5.1 Mitigating KV Cache Leakages
Design. A straightforward approach to mitigate KV cache leakages is to eliminate any sharing across requests. However, this would negate the computational and cost savings associated with caching. Noting that the PSA recovers the request token by token by observing per-token timing differences, we explore the effect of a simple mitigation strategy in which the prefix cache can only be shared in units of at least $K$ tokens ($K = 2, 3, 4$, etc.). In SGLang, this could be achieved by modifying the radix tree structure used to manage the KV cache. To prevent the leakage from cache-aware task scheduling, the scheduling policy in SGLang needs to be modified to prioritize requests that have at least $K$ shared prefix tokens. Requests with fewer than $K$ shared prefix tokens would still be placed in the waiting list. This approach reduces the likelihood of KV cache sharing, but it is unlikely to significantly impact performance.
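The effect on prefix matching can be illustrated with the small helper below, which restricts KV-cache reuse to multiples of $K$ tokens; this is a simplified stand-in for the radix-tree and scheduler changes in SGLang, not their actual implementation.

```python
def shared_prefix_len(cached_tokens, request_tokens):
    """Length of the longest common prefix between two token sequences."""
    n = 0
    for a, b in zip(cached_tokens, request_tokens):
        if a != b:
            break
        n += 1
    return n

def reusable_prefix_len(cached_tokens, request_tokens, k=4):
    """KV-cache reuse restricted to units of at least K tokens.

    With k = 1 this degenerates to per-token sharing; a larger k coarsens the
    attacker's per-token timing signal while preserving most of the reuse.
    """
    match = shared_prefix_len(cached_tokens, request_tokens)
    return 0 if match < k else (match // k) * k
```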
Evaluation. To evaluate the effectiveness of the mitigation and reduce the cost of querying the LLM, we conducted a simulation-based experiment. First, we built classifiers that detect the hits and misses of $K$ tokens for each value of $K$ ($K = 1, 2, 3, 4$), following the method outlined in Section 4.1. Since the timing differences become more pronounced as $K$ increases, we reduced the number of samples (i.e., $n$) in multi-sampling and obtained the corresponding TPRs and FPRs for each classifier.
Table 6: Token recovery results under different numbers of minimum shared tokens.
| $K$ | Recovery rate | Accuracy | #queries per recovered token | #queries per token |
| :---: | :---: | :---: | :---: | :---: |
Figure 11: Mitigating semantic cache leakages. The shaded components are customized as part of our mitigation.
Then we used an oracle to simulate the classifiers by randomly sampling a number and determining whether it fell within the classification range. In this evaluation, we used the same repetitive trials method, dataset, and fine-tuned model as described in Section 4.1. The next-token predictor was modified to predict the next $K$ tokens. Table 6 presents the token recovery rate and the average number of queries needed to recover one token for $K = 1, 2, 3$, and 4. The results show that as $K$ increases, the attack still achieves a notable recovery rate. The average number of queries for recovered tokens decreases with a larger $K$, as the predictor performs well on easy prompts. However, the overall recovery rate declines because the predictor has more difficulty with harder prompts for a larger $K$. When considering the tokens not recovered after the maximum of 80 allowed guesses, the average number of queries increases.
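The oracle used in this simulation reduces to a few lines, sketched below; the TPR and FPR arguments come from the per-$K$ classifiers built in the first step.

```python
import random

def oracle_classifier(is_hit, tpr, fpr):
    """Simulate a K-token hit/miss classifier with the measured TPR and FPR.

    `is_hit` is the ground truth for the probed K-token extension; the oracle
    reports a hit with probability TPR on true hits and FPR on true misses.
    """
    return random.random() < (tpr if is_hit else fpr)

# Example: one simulated guess against a classifier with TPR 0.95 and FPR 0.02.
# predicted_hit = oracle_classifier(is_hit=True, tpr=0.95, fpr=0.02)
```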
5.2 Mitigating Semantic Cache Leakages
Design. As investigated in Section 4.2, private attributes have a significant impact on the semantic similarity between requests. As a result, the PNA infers private attributes by probing whether a semantically similar request is cached. To mitigate this leakage, we propose a strategy that involves identifying and anonymizing the private attributes present in the requests. This approach not only prevents the leakage of private attributes but also increases the potential for sharing requests across users.
As shown in Figure 11, we integrate a custom pre-processor
and post-processor into the GPTCache framework. The preprocessor is designed to identify private attributes within the requests, and replace them with anonymized identifiers. In this approach, we selectively de-identify Personally Identifiable Information (PII) attributes, such as names, email addresses, phone numbers, credit card numbers, and IP addresses, while ensuring that no essential information needed for the LLMs is removed.
To facilitate reuse, the cache manager maintains a mapping structure that stores the anonymized identifier alongside its corresponding private attribute in a key-value format. The post-processor then identifies the anonymized identifiers in the response and replaces them with the private attributes by referencing this mapping. This ensures that the user receives an accurate response.
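A sketch of the pre-/post-processor pair is shown below. Presidio’s `AnalyzerEngine` is used to locate PII spans via its documented `analyze` interface, while the placeholder format and the key-value mapping are our own illustrative choices rather than GPTCache or Presidio APIs.

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def preprocess(request_text):
    """Replace detected PII spans with stable placeholders before caching."""
    results = analyzer.analyze(text=request_text, language="en")
    mapping, anonymized = {}, request_text
    # Replace from the end of the string so earlier offsets stay valid.
    for i, r in enumerate(sorted(results, key=lambda r: r.start, reverse=True)):
        placeholder = f"<{r.entity_type}_{i}>"
        mapping[placeholder] = request_text[r.start:r.end]
        anonymized = anonymized[:r.start] + placeholder + anonymized[r.end:]
    return anonymized, mapping

def postprocess(response_text, mapping):
    """Restore the original attribute values in the (possibly cached) response."""
    for placeholder, value in mapping.items():
        response_text = response_text.replace(placeholder, value)
    return response_text
```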
Evaluation. In our prototype implementation, we used the Presidio tool by Microsoft [3] to automatically identify private attributes. For performance evaluation, we used the evaluation dataset released by Presidio, which includes sentences containing private information. Specifically, we randomly sampled 1,000 sentences from the dataset and fed them into both the original GPTCache and the enhanced GPTCache. Then we measured the average delay introduced by the pre-processor and post-processor. The results show that the anonymization process adds an average delay of approximately 6 ms, while GPTCache’s response latency for a semantic cache hit without anonymization is around 0.14 s. Thus, de-identification introduces only about 4% additional overhead, which has a minimal impact on GPTCache’s overall performance.
6 Discussions
Unexplored timing leakages. This paper utilizes SGLang [30] and GPTCache [16] as the most representative KV cache and semantic cache sharing mechanisms. However, it is important to note that more sophisticated optimization techniques may enhance performance, but they could also amplify the significance of timing leakage. For example, modular prefix caching [44] and CacheBlend [84] facilitate KV caching for not only prefix tokens but also intermediate ones. Cascade inference [86] stores the shared KV cache in GPU shared memory (SMEM for short), for fast access in multiple requests. The sharing of KV cache in speculative decoding [54] may also introduce speculative side channels, akin to Spectre [51]. We leave the further exploration of the impact of the discovered side channels to future work.
Co-locations. Co-location is a prerequisite for timing side-channel attacks. There has been significant research on how to achieve co-location in the cloud for micro-architectural side channels [68, 74, 90, 96]. Co-location can be more efficient in LLM systems because these systems often focus on optimizing cache reuse through improved scheduling policies. For
example, baseline scheduling policies, such as first-come-first-serve, typically do not consider the shared prompts within an LLM system. As a result, requests may be mixed across different LLM engines, preventing the reuse of common prompt prefixes and potentially leading to suboptimal cache efficiency. To address this, LLM serving systems like SGLang [30], Parrot [55], Mooncake [64], and BatchLLM [95] have introduced modified schedulers that prioritize requests that align with cached prefixes, a strategy known as cache-aware scheduling.
7 Related Works
Prompt extraction attacks with adversarial prompts. Most existing research focuses on stealing system prompts from LLMs by eliciting the previous prompts, typically through direct output or translation. For example, a Twitter user claimed to have discovered the prompt used by Bing Chat [10]. Earlier studies involved manually constructing attacker prompts [63, 89], while more recent work, such as PLeak, leverages output feedback from a given prompt and introduces an incremental search algorithm to optimize prompt retrieval [50]. Zhang et al. present a framework for systematically evaluating the effectiveness of these attacks. While these approaches exploit the model’s vulnerability to adversarial prompts, our proposed attacks take advantage of timing differences introduced in LLM service deployment. As such, our attacks do not rely on the specific details of any particular LLM.
Side channel attacks on LLM. Debenedetti et al. propose system-level side channels within the deep learning lifecycle, such as training data filtering, input preprocessing, output monitoring, and query filtering. These side channels can potentially be exploited to infer the training data or the requests [40]. LLM keystroking attacks [79] are a type of packet-analysis-based side channel. These attacks exploit the length of response tokens, assuming the attacker has access to encrypted network packets. By comparison, we are the first to study the timing leaks introduced by LLM optimizations, rather than relying on output data or token packet sizes to recover requests. Inspired by our work, a follow-up study by Gu et al. reports a comprehensive measurement analysis of prompt caching in real-world LLMs [48], detecting prompt caching in 8 out of 17 LLM service providers. More recently, Zheng et al. conducted a similar study on LLM timing side channels [94], which, however, does not feature our optimized search strategy for efficient request recovery.
Micro-architectural side channel attacks on deep learning systems. Numerous studies have explored methods for extracting deep learning models or structures, as well as fingerprinting these models, by exploiting various side channels, such as CPU [41, 46, 65, 71, 83], GPU [77], FPGA [88], power and magnetic channels [49, 57], and PCIe traffic [99]. In comparison, our work focuses on leaking private prompts rather
*The two lead authors contribute equally to the work. Corresponding author: Wenhao Wang (wangwenhao@iie.ac.cn).