Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented, leading to harmful generations. In light of this, recent research on safety mechanisms has emerged, revealing that when safety representations or components are suppressed, the safety capability of LLMs is compromised. However, existing research tends to overlook the safety impact of multi-head attention mechanisms, despite their crucial role in various model functionalities. Hence, in this paper, we explore the connection between standard attention mechanisms and safety capability to fill this gap in safety-related mechanistic interpretability. We propose a novel metric tailored for multi-head attention, the Safety Head ImPortant Score (Ships), to assess the individual heads' contributions to model safety. Based on this, we generalize Ships to the dataset level and further introduce the Safety Attention Head AttRibution Algorithm (Sahara) to attribute the critical safety attention heads inside the model. Our findings show that specific attention heads have a significant impact on safety. Ablating a single safety head allows an aligned model (e.g., Llama-2-7b-chat) to respond to $\mathbf{16}\times\uparrow$ more harmful queries, while modifying only $\mathbf{0.006\%}\downarrow$ of the parameters, in contrast to the $\sim 5\%$ modification required in previous studies. More importantly, we demonstrate through comprehensive experiments that attention heads primarily function as feature extractors for safety, and that models fine-tuned from the same base model exhibit overlapping safety heads. Together, our attribution approach and findings provide a novel perspective for unpacking the black box of safety mechanisms within large models. Our code is available at https://github.com/ydyjya/SafetyHeadAttribution.
1 Introduction
The capabilities of large language models (LLMs) (Achiam et al., 2023; Touvron et al., 2023; Dubey et al., 2024; Yang et al., 2024) have improved significantly in recent years as they learn from ever-larger pre-training datasets. Despite this, language models may respond to harmful queries, generating unsafe and toxic content (Ousidhoum et al., 2021; Deshpande et al., 2023) and raising concerns about potential risks (Bengio et al., 2024). In light of this, alignment (Ouyang et al., 2022; Bai et al., 2022a;b) is employed to ensure LLM safety by aligning models with human values, while existing research (Zou et al., 2023b; Wei et al., 2024a; Carlini et al., 2024) suggests that malicious attackers can circumvent safety guardrails. Therefore, understanding the inner workings of LLMs is necessary for responsible and ethical development (Zhao et al., 2024a; Bereska & Gavves, 2024; Fang et al., 2024).
Currently, revealing black-box LLM safety is typically achieved through mechanistic interpretability methods. Specifically, these methods (Geiger et al., 2021; Stolfo et al., 2023; Gurnee et al., 2023) analyze features, neurons, layers, and parameters at a granular level to help humans understand model behavior and capabilities. Recent studies (Zou et al., 2023a; Templeton, 2024; Arditi et al., 2024; Chen et al., 2024) indicate that safety capability can be attributed to representations and neurons. However, multi-head attention, which has been confirmed to be crucial for other abilities (Vig,
Figure 1: Upper: Ablation of the safety attention head through undifferentiated attention causes the attention weights to degenerate to the mean. Bottom: After ablating the attention head as shown in the upper panel, the safety capability is weakened, and the model responds to both harmful and benign queries.
2019; Gould et al., 2024; Wu et al., 2024), has received less attention in safety interpretability. Due to the differing specificities of components and representations, directly transferring existing methods to safety attention attribution is challenging. Additionally, some general approaches (Meng et al., 2022; Wang et al., 2023; Zhang & Nanda, 2024) typically involve specialized tasks in which result changes can be observed in a single forward pass, whereas safety tasks necessitate full generation across multiple forward passes.
In this paper, we aim to interpret safety capability within multi-head attention. To achieve this, we introduce the Safety Head ImPortant Score (Ships) to attribute the safety capability of individual attention heads in an aligned model. Such a model is trained to reject harmful queries with high probability so that it aligns with human values (Ganguli et al., 2022; Dubey et al., 2024). Based on this, Ships quantifies the impact of each attention head on the change in the rejection probability of harmful queries through causal tracing. Concretely, we demonstrate that Ships can be used for attributing safety attention heads. Experimental results show that on three harmful query datasets, identifying safety heads with Ships and ablating them via undifferentiated attention (modifying only $\sim\mathbf{0.006\%}$ of the parameters) can improve the attack success rate (ASR) of Llama-2-7b-chat from $\mathbf{0.04}$ to $\mathbf{0.64}\uparrow$ and of Vicuna-7b-v1.5 from $\mathbf{0.27}$ to $\mathbf{0.55}\uparrow$.
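To make the intervention concrete, the following is a minimal sketch of undifferentiated attention ablation and a Ships-style score on a toy single-layer attention model: the chosen head's softmax weights are replaced by the uniform mean $1/s$, and the score is the resulting drop in the probability of a designated rejection token. The toy model, the rejection-token index, and the helper names are illustrative assumptions rather than the paper's released implementation.

```python
# Minimal sketch of undifferentiated attention ablation and a Ships-style score.
# A tiny single-layer attention model stands in for an aligned LLM, and a single
# token index stands in for rejection phrases such as "I cannot"; both are
# illustrative assumptions only.
import torch

torch.manual_seed(0)
VOCAB, D_MODEL, N_HEADS, SEQ = 32, 64, 4, 6
D_HEAD = D_MODEL // N_HEADS
REJECT_TOKEN = 7  # hypothetical id of a rejection token

embed = torch.nn.Embedding(VOCAB, D_MODEL)
w_q, w_k, w_v, w_o = (torch.nn.Linear(D_MODEL, D_MODEL, bias=False) for _ in range(4))
unembed = torch.nn.Linear(D_MODEL, VOCAB, bias=False)


@torch.no_grad()
def next_token_probs(tokens, ablate_head=None):
    """Next-token distribution; optionally ablate one head by replacing its
    softmax attention weights with the uniform mean 1/s (undifferentiated attention)."""
    x = embed(tokens)                                     # (s, d_model)
    s = x.shape[0]
    q, k, v = (f(x).view(s, N_HEADS, D_HEAD).transpose(0, 1) for f in (w_q, w_k, w_v))
    attn = torch.softmax(q @ k.transpose(1, 2) / D_HEAD ** 0.5, dim=-1)  # (h, s, s)
    if ablate_head is not None:
        attn[ablate_head] = 1.0 / s                       # weights degenerate to the mean
    out = w_o((attn @ v).transpose(0, 1).reshape(s, D_MODEL))
    return torch.softmax(unembed(out[-1]), dim=-1)        # distribution at the last position


def ships_score(tokens, head):
    """Ships-style importance: drop in rejection probability when `head` is ablated."""
    p_base = next_token_probs(tokens)[REJECT_TOKEN]
    p_ablated = next_token_probs(tokens, ablate_head=head)[REJECT_TOKEN]
    return (p_base - p_ablated).item()


harmful_query = torch.randint(0, VOCAB, (SEQ,))           # placeholder token ids
scores = {h: ships_score(harmful_query, h) for h in range(N_HEADS)}
print(max(scores, key=scores.get), scores)                # head most relevant to rejection
```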
Furthermore, to attribute generalized safety attention heads, we generalize Ships to evaluate the change in the rejection representation caused by ablating attention heads over harmful query datasets. Based on the generalized version of Ships, we attribute the most important safety attention head; ablating it improves the ASR to $\mathbf{0.72}\uparrow$. Iteratively selecting important heads yields a group of heads that can significantly change the rejection representation. We name this heuristic method the Safety Attention Head AttRibution Algorithm (Sahara). Experimental results show that ablating the attention head group weakens the safety capability further and collaboratively.
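Building on such a per-head score, the Sahara-style search can be sketched as a greedy loop: at each round, score every remaining head jointly with the heads already selected (here via a hypothetical dataset-level scorer `dataset_ships`), keep the best one, and stop at a fixed budget. This is an illustrative reading of the heuristic, not the exact released algorithm.

```python
# Illustrative greedy sketch of a Sahara-style search: iteratively grow a set of
# heads whose joint ablation maximizes a dataset-level Ships-like score.
from typing import Callable, FrozenSet, List, Tuple

Head = Tuple[int, int]  # (layer, head) index


def sahara_greedy(
    all_heads: List[Head],
    dataset_ships: Callable[[FrozenSet[Head]], float],   # hypothetical set-level scorer
    budget: int = 3,
) -> List[Head]:
    """Greedily pick heads whose joint ablation maximizes the dataset-level score."""
    selected: List[Head] = []
    for _ in range(budget):
        remaining = [h for h in all_heads if h not in selected]
        if not remaining:
            break
        # Score each candidate together with the heads already selected.
        best = max(remaining, key=lambda h: dataset_ships(frozenset(selected + [h])))
        selected.append(best)
    return selected


# Usage with a stand-in scorer (replace with a real dataset-level Ships):
toy_scores = {(0, 1): 0.3, (2, 0): 0.5, (3, 3): 0.1}
scorer = lambda heads: sum(toy_scores.get(h, 0.0) for h in heads)
print(sahara_greedy(list(toy_scores), scorer, budget=2))   # [(2, 0), (0, 1)]
```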
Based on Ships and Sahara, we interpret the safety heads of attention in several popular LLMs, such as Llama-2-7b-chat and Vicuna-7b-v1.5. This interpretation yields several intriguing insights: 1. Certain safety heads within the attention mechanism are crucial for feature integration in safety tasks; specifically, modifying the values of the attention weight matrices changes the model output significantly, while scaling the attention output does not. 2. LLMs fine-tuned from the same base model have overlapping safety heads, indicating that, in addition to alignment, the safety impact of the base model is critical. 3. The attention heads that affect safety can act independently, with little effect on helpfulness. These insights provide a new perspective on LLM safety and a solid basis for enhancing and further optimizing safety alignment. Our contributions are summarized as follows:

$\Rightarrow$ We make a pioneering effort to discover and prove the existence of safety-specific attention heads in LLMs, which complements research on safety interpretability.

$\Rightarrow$ We present Ships to evaluate the safety impact of attention head ablation. Then, we propose a heuristic algorithm, Sahara, to find head groups whose ablation leads to safety degradation.

$\Rightarrow$ We comprehensively analyze the importance of the standard multi-head attention mechanism for LLM safety, providing intriguing insights based on extensive experiments. Our work significantly boosts transparency and alleviates concerns regarding LLM risks.
2 PRELIMINARY
Large Language Models (LLMs). Current state-of-the-art LLMs are predominantly based on a decoder-only architecture, which predicts the next token for a given prompt. For the input sequence $x = x_1, x_2, \ldots, x_s$, LLMs return the probability distribution of the next token:
$$P\left(x_{s+1} \mid x_{1}, \ldots, x_{s}\right)=\operatorname{softmax}\left(W\left(o_{s}\right)\right),$$
where $o_s$ is the last residual stream, and $W$ is the linear function that maps $o_s$ to the logits associated with each token in the vocabulary $V$. Sampling from the probability distribution yields a new token $x_{s+1}$. Iterating this process yields a response $R = x_{s+1}, x_{s+2}, \ldots, x_{s+|R|}$.
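For concreteness, the snippet below sketches this decoding loop with a Hugging Face causal LM: the logits at the last position are mapped to a next-token distribution, a token is sampled, and the process repeats until an end-of-sequence token or a length budget is reached. The model name is a placeholder assumption.

```python
# Sketch of the autoregressive decoding loop described above, using a
# Hugging Face causal LM (the model name is a placeholder assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "How do I stay safe online?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(32):                              # generate up to 32 tokens
        logits = model(input_ids).logits[0, -1]      # logits W(o_s) at the last position
        probs = torch.softmax(logits, dim=-1)        # P(x_{s+1} | x_1..x_s)
        next_token = torch.multinomial(probs, 1)     # sample x_{s+1}
        input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```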
Multi-Head Attention (MHA). The attention mechanism (Vaswani, 2017) in LLMs plays a critical role in capturing the features of the input sequence. Prior works (Htut et al., 2019; Clark et al., 2019b; Campbell et al., 2023; Wu et al., 2024) demonstrate that individual heads in MHA contribute distinctively across various language tasks. MHA, with $n$ heads, is formulated as follows:
$$\operatorname{MHA}(X)=\left(\bigoplus_{i=1}^{n} \operatorname{softmax}\!\left(\frac{X W_{q}^{i}\left(X W_{k}^{i}\right)^{\top}}{\sqrt{d_{k}}}\right) X W_{v}^{i}\right) W_{o},$$
where $\oplus$ represents concatenation and $d_k$ denotes the dimension size of $W_k$.
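A minimal PyTorch reading of the formula above, assuming randomly initialized per-head projection matrices purely for illustration:

```python
# Direct reading of the MHA formula: n heads computed in parallel, each applying
# softmax(Q K^T / sqrt(d_k)) V, then concatenation (⊕) and the output projection W_o.
import torch

s, d_model, n_heads = 5, 64, 8
d_k = d_model // n_heads

X = torch.randn(s, d_model)
W_q, W_k, W_v = (torch.randn(n_heads, d_model, d_k) for _ in range(3))
W_o = torch.randn(d_model, d_model)

Q = torch.einsum("sd,hdk->hsk", X, W_q)                         # per-head queries (n, s, d_k)
K = torch.einsum("sd,hdk->hsk", X, W_k)
V = torch.einsum("sd,hdk->hsk", X, W_v)
A = torch.softmax(Q @ K.transpose(1, 2) / d_k ** 0.5, dim=-1)   # attention weights (n, s, s)
heads = A @ V                                                   # per-head outputs (n, s, d_k)
mha = heads.transpose(0, 1).reshape(s, d_model) @ W_o           # concatenate, then W_o

print(mha.shape)  # torch.Size([5, 64])
```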
LLM Safety and Jailbreak Attack. LLMs may generate content that is unethical or illegal, raising significant safety concerns. To address these risks, safety alignment (Bai et al., 2022a; Dai et al., 2024) is implemented to prevent models from responding to harmful queries $x_{\mathcal{H}}$. Specifically, safety alignment trains the LLM $\theta$ to optimize the following objective:
$$\max _{\theta}\; \mathbb{E}_{x_{\mathcal{H}}}\left[P_{\theta}\left(R_{\perp} \mid x_{\mathcal{H}}\right)\right],$$
where $\perp$ denotes rejection, and $R_{\perp}$ generally includes phrases like "I cannot" or "As a responsible AI assistant". This objective aims to increase the likelihood of rejection tokens in response to harmful inputs. However, jailbreak attacks (Li et al., 2023; Chao et al., 2023; Liu et al., 2024) can circumvent the safety guardrails of LLMs. The objective of a jailbreak attack can be formalized as:
$$\max _{x}\; P_{\theta}(R \mid x) \quad \text { s.t. } \quad D(R)=\text { True },$$
where $D$ is a safety discriminator that flags $R$ as harmful when $D(R)=\text{True}$. Prior studies (Liao & Sun, 2024; Jia et al., 2024) show that shifting the probability distribution towards affirmative tokens can significantly improve the attack success rate. Suppressing rejection tokens (Shen et al., 2023; Wei et al., 2024a) yields similar results. These insights highlight that LLM safety relies on maximizing the probability of generating rejection tokens in response to harmful queries.
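In practice, a lightweight stand-in for such a discriminator is a refusal-phrase matcher: a response containing none of the rejection phrases is counted as a successful attack, and the attack success rate (ASR) is the fraction of harmful queries answered this way. The sketch below illustrates this rule-based variant; it is an illustrative assumption, not necessarily the discriminator used in the evaluation.

```python
# Minimal rule-based stand-in for the safety discriminator D and the ASR metric:
# a response with no rejection phrase is treated as harmful (attack succeeded).
from typing import Iterable

REJECTION_PHRASES = (            # R_⊥-style prefixes; an illustrative list
    "i cannot", "i can't", "i'm sorry", "as a responsible ai",
    "i am not able to", "it is not appropriate",
)


def discriminator(response: str) -> bool:
    """Return True if the response looks harmful, i.e. contains no rejection phrase."""
    lowered = response.lower()
    return not any(phrase in lowered for phrase in REJECTION_PHRASES)


def attack_success_rate(responses: Iterable[str]) -> float:
    """Fraction of responses to harmful queries that the discriminator flags as harmful."""
    responses = list(responses)
    return sum(discriminator(r) for r in responses) / max(len(responses), 1)


print(attack_success_rate([
    "I cannot help with that request.",
    "Sure, here is a detailed plan...",   # counted as a successful attack
]))  # 0.5
```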
Safety Parameters. Mechanistic interpretability (Zhao et al., 2024a; Lindner et al., 2024) attributes model capabilities to specific parameters, improving the transparency of black-box LLMs while addressing concerns about their behavior. Recent work (Wei et al., 2024b; Chen et al., 2024) focuses on safety, identifying the critical parameters responsible for ensuring LLM safety. When these safety-related parameters are modified, the safety guardrails of LLMs are compromised, potentially leading to the generation of unethical content. Consequently, safety parameters are those whose ablation results in a significant increase in the probability of generating an illegal or unethical response to harmful queries $x_{\mathcal{H}}$. Formally, we define the Safety Parameters as: