Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Abstract
We propose semantic entropy probes (SEPs), a cheap and reliable method for uncertainty quantification in Large Language Models (LLMs).
Hallucinations, which are plausible-sounding but factually incorrect and arbitrary model generations, present a major challenge to the practical adoption of LLMs.
Recent work by Farquhar et al. [21] proposes semantic entropy (SE), which can detect hallucinations by estimating uncertainty in the space of semantic meanings for a set of model generations.
However, the 5-to-10-fold increase in computation cost associated with computing SE hinders practical adoption.
To address this, we propose SEPs, which directly approximate SE from the hidden states of a single generation.
SEPs are simple to train and do not require sampling multiple model generations at test time, reducing the overhead of semantic uncertainty quantification to almost zero.
We show that SEPs retain high performance for hallucination detection and generalize better to out-of-distribution data than previous probing methods that directly predict model accuracy.
Our results across models and tasks suggest that model hidden states capture SE, and our ablation studies give further insights into the token positions and model layers for which this is the case.
1 Introduction
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide variety of natural language processing tasks [70, 71, 54, 68, 8].
They are increasingly deployed in real-world settings, including in high-stakes domains such as medicine, journalism, or legal services [65, 76, 53, 63].
It is therefore paramount that we can trust the outputs of LLMs.
Unfortunately, LLMs have a tendency to hallucinate.
Originally defined as “content that is nonsensical or unfaithful to the provided source” [45, 23, 27], the term is now used to refer to nonfactual, arbitrary content generated by LLMs.
For example, when asked to generate biographies, even capable LLMs such as GPT-4 will often fabricate facts entirely [48, 69, 21].
While this may be acceptable in low-stakes use cases, hallucinations can cause significant harm when factuality is critical.
The reliable detection or mitigation of hallucinations is a key challenge to ensure the safe deployment of LLM-based systems.
Various approaches have been proposed to address hallucinations in LLMs (see Section 2).
An effective strategy for detecting hallucinations is to sample multiple responses for a given prompt and check if the different samples convey the same meaning [21, 31, 30, 16, 13, 11, 19, 43, 48].
The core idea is that if the model knows the answer, it will consistently provide the same answer.
If the model is hallucinating, its responses may vary across generations.
For example, given the prompt “What is the capital of France?”, an LLM that “knows” the answer will consistently output (Paris, Paris, Paris), while an LLM that “does not know” the answer may output (Naples, Rome, Berlin), indicating a hallucination.
One explanation for why this works is that LLMs have calibrated uncertainty [30, 54], i.e., “language models (mostly) know what they know” [30]. When an LLM is certain about an answer, it consistently provides the correct response. Conversely, when uncertain, it generates arbitrary answers. This suggests that we can leverage model uncertainty to detect hallucinations.
However, we cannot use token-level probabilities to estimate uncertainty directly because different sequences of tokens may convey the same meaning.
For example, the answers “Paris”, “It’s Paris”, and “The capital of France is Paris” all convey the same meaning.
To address this, Farquhar et al. [21] propose semantic entropy (SE), which clusters generations into sets of equivalent meaning and then estimates uncertainty in semantic space.
A major limitation of SE and other sampling-based approaches is that they require multiple model generations for each input query, typically between 5 and 10. This results in a 5-to-10-fold higher cost compared to naive generation without SE, presenting a major hurdle to the practical adoption of these methods. Computationally cheaper methods for reliable hallucination detection in LLMs are needed.
The hidden states of LLMs are a promising avenue to better understand, predict, and steer a wide range of LLM behaviors [79, 26, 67]. In particular, a recent line of work learns to predict the truthfulness of model responses by training a simple linear probe on the hidden states of LLMs. Linear probes are computationally efficient, both to train and when used at inference. However, existing approaches are usually supervised [61, 35, 4, 44] and therefore require a labeled training dataset assigning accuracy to statements or model generations. And while unsupervised approaches exist [9], their validity has been questioned [20]. In this paper, we argue that supervising probes via SE is preferable to accuracy labels for robust prediction of truthfulness.
We propose Semantic Entropy Probes (SEPs), linear probes that capture semantic uncertainty from the hidden states of LLMs, presenting a cost-effective and reliable hallucination detection method. SEPs combine the advantages of probing and sampling-based hallucination detection. Like other probing approaches, SEPs are easy to train, cheap to deploy, and can be applied to the hidden states of a single model generation. Similar to sampling-based hallucination detection, SEPs capture the semantic uncertainty of the model. Furthermore, they address some of the shortcomings of previous approaches. Contrary to sampling-based hallucination detection, SEPs act directly on a single model hidden state and do not require generating multiple samples at test time. And unlike previous probing methods, SEPs are trained to predict semantic entropy [31] rather than model accuracy, which can be computed without access to ground truth accuracy labels that can be expensive to curate.
We find that SEP predictions are effective proxies for truthfulness. In fact, SEPs generalize better to new tasks than probes trained directly to predict accuracy, setting a new state-of-the-art for cost-efficient hallucination detection, cf. Fig. 1. Our results additionally provide insights into the inner workings of LLMs, strongly suggesting that model hidden states directly capture the model’s uncertainty over semantic meanings. Through ablation studies, we show that this holds across a variety of models, tasks, layers, and token positions.
In summary, our core contributions are:
- We propose Semantic Entropy Probes (SEPs), linear probes trained on the hidden states of LLMs to capture semantic entropy (Section 4).
- We demonstrate that semantic entropy is encoded in the hidden states of a single model generation and can be successfully extracted using probes (Section 6).
- We perform ablation studies of SEP performance across models, tasks, layers, and token positions. Our results strongly suggest that internal model states across layers and tokens implicitly capture semantic uncertainty, even before generating any tokens (Section 6).
2 Related Work
LLM Hallucinations. We refer to Rawte et al. [60], Zhang et al. [78] for extensive surveys on hallucinations in LLMs and here briefly review the most relevant related work to this paper. Early work on hallucinations in language models typically refers to issues in summarization tasks where models “hallucinate” content that is not faithful to the provided source text [45, 14, 17, 10, 75, 42, 51]. Around the same time, research emerged that showed LLMs themselves could store and retrieve factual knowledge [58], leading to the currently popular closed-book setting, where LLMs are queried without any additional context [62]. Since then, a large variety of work has focused on detecting hallucinations in LLMs for general natural language generation tasks. These can typically be classified into one of two directions: sampling-based and retrieval-based approaches.
Sampling-Based Hallucination Detection. For sampling-based approaches, a variety of methods have been proposed that sample multiple model completions for a given query and then quantify the semantic difference between the model generations [31, 30, 16, 13, 11, 19]. For this paper, Farquhar et al. [21] is particularly relevant, as we use their semantic entropy measure to supervise our hidden state probes (we summarize their method in Section 3). A different line of work does not directly re-sample answers for the same query, but instead asks follow-up questions to uncover inconsistencies in the original answer [15, 2]. Recent work has also proposed to detect hallucinations in scenarios where models generate entire paragraphs of text by decomposing the paragraph into individual facts or sentences, and then validating the uncertainty of those individual facts separately [39, 49, 43, 15].
Retrieval-Based Methods. A different strategy to mitigate hallucinations is to rely on external knowledge bases, e.g. web search, to verify the factuality of model responses [22, 77, 57, 18, 24, 36, 74, 66]. An advantage of such approaches is that they do not rely on good model uncertainties and can be used directly to fix errors in model generations. However, retrieval-based approaches can add significant cost and latency. Further, they may be less effective for domains such as reasoning, where LLMs are also prone to produce unfaithful and misleading generations [73, 33]. Thus, retrieval- and uncertainty-based methods are orthogonal and can be combined for maximum effect.
Sampling and Finetuning Strategies. A number of different strategies exist to reduce, rather than detect, the number of hallucinations that LLMs generate. Previous work has proposed simple adaptations to LLM sampling schemes [34, 12, 64], preference optimization targeting factuality [69], or finetuning to align “verbal” uncertainties of LLMs with model accuracy [47, 37, 5].
Understanding Hidden States. Recent work suggests that simple operations on LLM hidden states can qualitatively change model behavior [79, 67, 61], manipulate knowledge [26], or reveal deceitful intent [40]. Probes can be a valuable tool to better understand the internal representations of neural networks like LLMs [3, 6]. Previous work has shown that hidden state probes can predict LLM outputs one or multiple tokens ahead with high accuracy [7, 55]. Relevant to our paper is recent work that suggests there is a “truthfulness” direction in latent space that predicts the correctness of statements and generations [44, 4, 9, 35]. Our work extends this: we are also interested in predicting whether the model is hallucinating nonfactual responses; however, rather than directly supervising probes with accuracy labels, we argue that capturing semantic entropy is key for generalization performance.
3 Semantic Entropy
Measuring uncertainty in free-form natural language generation tasks is challenging. The uncertainties over tokens output by the language model can be misleading because they conflate semantic uncertainty, i.e. uncertainty over the meaning of the generation, with lexical and syntactic uncertainty, i.e. uncertainty over how to phrase the answer (see the example in Section 1). To address this, Farquhar et al. [21] propose semantic entropy, which aggregates token-level uncertainties across clusters of semantic equivalence. (Farquhar et al. [21] is a journal version of the original semantic entropy paper by Kuhn et al. [31].) Semantic entropy is important in the context of this paper because we use it as the supervisory signal to train our hidden state SEP probes.
Semantic entropy is calculated in three steps: (1) for a given query x, sample N model completions from the LLM, (2) aggregate the generations into clusters of equivalent semantic meaning, and (3) calculate semantic entropy, H_SE(x), by aggregating uncertainties within each cluster. Step (1) is trivial, and we detail steps (2) and (3) below.
Semantic Clustering. To determine if two generations convey the same meaning, Farquhar et al. [21] use natural language inference (NLI) models, such as DeBERTa [25], to predict entailment between the generations. Concretely, two generations s and s' are identical in meaning if s entails s' and s' entails s, i.e. they entail each other bi-directionally. Farquhar et al. [21] then propose a greedy algorithm to cluster generations semantically: for each sample s_n, we either add it to an existing cluster C_k if bi-directional entailment holds between s_n and a sample s in C_k, or add it to a new cluster if the semantic meaning of s_n is distinct from all existing clusters. After processing all generations, we obtain a clustering of the generations by semantic meaning.
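The sketch below illustrates this greedy bi-directional entailment clustering, assuming an off-the-shelf MNLI-style classifier (here microsoft/deberta-large-mnli); the helper names and the choice to compare against a single cluster representative are our own illustration, not the authors' implementation.

```python
# Sketch of greedy bi-directional entailment clustering (our illustration, not
# the authors' code). Assumes an MNLI-style classifier whose labels include
# "ENTAILMENT", e.g. microsoft/deberta-large-mnli.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts that `premise` entails `hypothesis`."""
    result = nli({"text": premise, "text_pair": hypothesis})
    if isinstance(result, list):  # the pipeline may wrap single inputs in a list
        result = result[0]
    return result["label"].upper() == "ENTAILMENT"

def semantic_clusters(generations: list[str]) -> list[list[str]]:
    """Greedily cluster generations by bi-directional entailment."""
    clusters: list[list[str]] = []
    for s in generations:
        for cluster in clusters:
            rep = cluster[0]  # compare against one representative member
            if entails(s, rep) and entails(rep, s):
                cluster.append(s)
                break
        else:  # no existing cluster matched, so open a new one
            clusters.append([s])
    return clusters

# Example: paraphrases of the same answer should land in one cluster.
print(semantic_clusters(["Paris", "It's Paris", "The capital of France is Paris", "Rome"]))
```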
Semantic Entropy. Given an input context x, the joint probability of a generation s = (s_1, ..., s_T) consisting of T tokens is given by the product of conditional token probabilities in the sequence,

$$p(s \mid x) = \prod_{t=1}^{T} p(s_t \mid s_{<t}, x). \tag{1}$$
The probability of the semantic cluster C_k is then the aggregate probability of all possible generations s which belong to that cluster,

$$p(C_k \mid x) = \sum_{s \in C_k} p(s \mid x). \tag{2}$$
The uncertainty associated with the distribution over semantic clusters is the semantic entropy,

$$H_{\text{SE}}(x) = -\sum_{k} p(C_k \mid x) \log p(C_k \mid x). \tag{3}$$
Estimating SE in Practice. In practice, we cannot compute the above exactly. The sums over all possible generations s, and hence the cluster probabilities p(C_k | x), are intractable, as the number of possible token sequences grows exponentially with sequence length. Instead, Farquhar et al. [21] sample N generations at non-zero temperature from the LLM (typically between 5 and 10, as in this paper). They then treat the resulting clusters C_1, ..., C_K as Monte Carlo samples from the true distribution over semantic clusters, and approximate semantic entropy as

$$H_{\text{SE}}(x) \approx -\frac{1}{K}\sum_{k=1}^{K} \log p(C_k \mid x). \tag{4}$$
We here use an additional approximation, employing the discrete variant of SE that yields good performance without access to token probabilities, making it compatible with black-box models [21]. For the discrete SE variant, we estimate cluster probabilities as the fraction of generations falling into that cluster, p(C_k | x) ≈ |C_k| / N, and then compute semantic entropy as the entropy of the resulting categorical distribution, H_SE(x) = -Σ_k p(C_k | x) log p(C_k | x). Discrete SE further avoids problems when estimating Eq. 4 for generations of different lengths [41, 50, 31, 21].
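In code, the discrete estimate is simply the entropy of the cluster frequencies. A minimal sketch building on the clustering helper above (our own illustration):

```python
import math

def discrete_semantic_entropy(generations: list[str]) -> float:
    """Discrete SE: entropy of the empirical distribution over semantic clusters."""
    clusters = semantic_clusters(generations)      # greedy clustering from the sketch above
    n = len(generations)
    probs = [len(c) / n for c in clusters]         # p(C_k | x) ~= cluster frequency
    return -sum(p * math.log(p) for p in probs)

# If the NLI model places all three answers in one cluster, SE = 0 (certain);
# if each answer lands in its own cluster, SE = log(3) ~= 1.10 (uncertain).
print(discrete_semantic_entropy(["Paris", "It's Paris", "Paris is the capital."]))
print(discrete_semantic_entropy(["Naples", "Rome", "Berlin"]))
```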
4 Semantic Entropy Probes
Although semantic entropy is effective at detecting hallucinations, its high computational cost may limit its use to only the most critical scenarios. In this section, we propose Semantic Entropy Probes (SEPs), a novel method for cost-efficient and reliable uncertainty quantification in LLMs. SEPs are linear probes trained on the hidden states of LLMs to capture semantic entropy [31]. However, unlike semantic entropy and other sampling-based approaches, SEPs act on the hidden states of a single model generation and do not require sampling multiple responses from the model at test time. Thus, SEPs solve a key practical issue of semantic uncertainty quantification by almost completely eliminating the computational overhead of semantic uncertainty estimation at test time. We further argue that SEPs are advantageous to probes trained to directly predict model accuracy. Our intuition for this is that semantic entropy is an inherent property of the model that should be encoded in the hidden states and thus should be easier to extract than truthfulness, which relies on potentially noisy external information. We discuss this further in Section 8.
Training SEPs. SEPs are constructed as linear logistic regression models, trained on the hidden states of LLMs to predict semantic entropy. We create a dataset of (h_l^p(x), H_SE(x)) pairs, where x is an input query, h_l^p(x) ∈ R^d is the model hidden state at token position p and layer l, d is the hidden state dimension, and H_SE(x) is the semantic entropy. That is, given an input query x, we first generate a high-likelihood model response via greedy sampling and store the hidden state at a particular layer and token position, h_l^p(x). We then sample N responses from the model at high temperature and compute semantic entropy, H_SE(x), as detailed in the previous section. For inputs, we rely on questions from popular QA datasets (see Section 5 for details), although we do not need the ground-truth labels provided by these datasets and could alternatively compute semantic entropy for any unlabeled set of suitable LLM inputs.
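A rough sketch of how one training pair could be assembled per query; generate_answer and get_hidden_state are hypothetical helpers (the latter is sketched after the "Probing Locations" paragraph), and the temperatures are illustrative rather than the authors' exact settings:

```python
def build_sep_example(model, question: str, layer: int, n_samples: int = 10):
    """Return one (hidden_state, semantic_entropy) training pair for a query.

    `generate_answer` and `get_hidden_state` are hypothetical placeholders:
    the first decodes an answer at a given temperature, the second reads out
    the hidden state at the chosen layer and token position.
    """
    # 1) Low-temperature (greedy) answer whose hidden state the probe will see.
    greedy_answer = generate_answer(model, question, temperature=0.0)
    hidden = get_hidden_state(model, question, greedy_answer, layer=layer)

    # 2) High-temperature samples used only to compute the SE supervision label.
    samples = [generate_answer(model, question, temperature=1.0)
               for _ in range(n_samples)]
    return hidden, discrete_semantic_entropy(samples)
```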
Binarization. Semantic entropy scores are real numbers. However, for the purposes of this paper, we convert them into binary labels, indicating whether semantic entropy is high or low, and then train a logistic regression classifier to predict these labels. Our motivation for doing so is two-fold. For one, we ultimately want to use our probes for predicting binary model correctness, so we eventually need to construct a binary classifier regardless. Additionally, we would like to compare the performance of SEP probes and accuracy probes. This is easier if both probes target binary classification problems. We note that the logistic regression classifier returns probabilities, such that we can always recover fine-grained signals even after transforming the problem into binary classification.
More formally, we compute binary labels 1[H_SE(x) > γ*], where γ* is a threshold that optimally partitions the raw SE scores into high and low values according to the following objective:
$$\gamma^* = \operatorname*{arg\,min}_{\gamma}\; \sum_{x:\, H_{\text{SE}}(x) \le \gamma} \big(H_{\text{SE}}(x) - \mu_{\le\gamma}\big)^2 \;+\; \sum_{x:\, H_{\text{SE}}(x) > \gamma} \big(H_{\text{SE}}(x) - \mu_{>\gamma}\big)^2, \tag{5}$$

where μ_{≤γ} and μ_{>γ} denote the mean SE score of the inputs at or below and above the threshold γ, respectively.
This procedure is inspired by splitting objectives used in regression trees [38] and we have found it to perform well in practice compared to alternatives such as soft labelling, cf. Appendix B.
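A sketch of this split search under our reading of Eq. (5), minimizing the within-partition squared deviation of the SE scores (function and variable names are ours):

```python
import numpy as np

def best_split_threshold(se_scores: np.ndarray) -> float:
    """Regression-tree-style split: minimize the within-partition squared error."""
    best_gamma, best_cost = None, np.inf
    for gamma in np.unique(se_scores)[:-1]:        # largest value cannot split the data
        low = se_scores[se_scores <= gamma]
        high = se_scores[se_scores > gamma]
        cost = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        if cost < best_cost:
            best_gamma, best_cost = gamma, cost
    return best_gamma

# Binary SEP targets: 1 if semantic entropy exceeds the learned threshold.
se_scores = np.array([0.1, 0.2, 0.15, 1.4, 1.7, 0.9])
labels = (se_scores > best_split_threshold(se_scores)).astype(int)
print(labels)  # [0 0 0 1 1 1]
```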
In summary, given an input dataset of queries {x_i}, we compute a training set of hidden state–binarized semantic entropy pairs {(h_l^p(x_i), 1[H_SE(x_i) > γ*])}, and use this to train a linear classifier, which is our semantic entropy probe (SEP). At test time, SEPs predict the probability that a model generation for a given input query has high semantic entropy.
Probing Locations. We collect hidden states, h_l^p(x), across all layers, l, of the LLM to investigate which layers best capture semantic entropy. We consider two different token positions, p. Firstly, we consider the hidden state at the last token of the input x, i.e. the token before generating (TBG) the model response. Secondly, we consider the last token of the model response, which is the token before the end-of-sequence token, i.e. the second-last token (SLT). We refer to these scenarios as TBG and SLT. The TBG experiments allow us to study to what extent LLM hidden states capture semantic entropy before generating a response. The TBG setup potentially allows us to quantify the semantic uncertainty given an input in a single forward pass – without generating any novel tokens – further reducing the cost of our approach over sampling-based alternatives. In practice, this may be useful to quickly determine if a model will answer a particular input query with high certainty.
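A minimal sketch of how the two probing locations could be read out with Hugging Face transformers; prompt handling and tokenization in the original implementation may differ, and the model name is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)

@torch.no_grad()
def probing_states(prompt: str, generation: str, layer: int):
    """Return (TBG, SLT) hidden states at the given layer.

    TBG: last token of the prompt, i.e. before any response token is generated.
    SLT: last token of the generated response (the token before EOS).
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + generation, return_tensors="pt").input_ids
    hidden = model(full_ids).hidden_states[layer][0]  # (seq_len, d_model)
    tbg = hidden[prompt_len - 1]                      # last prompt token
    slt = hidden[-1]                                  # last response token
    return tbg, slt
```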
5 Experiment Setup
We investigate and evaluate Semantic Entropy Probes (SEPs) across a range of models and datasets. First, we show that we can accurately predict semantic entropy from the hidden states of LLMs (Section 6). We then explore how SEP predictions vary across different tasks, models, token indices, and layers. Second, we demonstrate that SEPs are a cheap and reliable method for hallucination detection (Section 7), which generalizes better to novel tasks than accuracy probes, although they cannot match the performance of much more expensive sampling-based methods in our experiments.
Tasks. We evaluate SEPs on four datasets: TriviaQA [29], SQuAD [59], BioASQ [72], and NQ Open [32]. We use the input queries of these tasks to derive training sets for SEPs and evaluate the performance of each method on the validation/test sets, creating splits if needed. We consider a short-form and a long-form setting: short-form answers are generated by few-shot prompting the LLM to answer “as briefly as possible”, and long-form answers are generated by prompting for a “single brief but complete sentence”, leading to an approximately six-fold increase in the number of generated tokens [21]. Following Farquhar et al. [21], we assess model accuracy via the SQuAD F1 score for short-form generations, and we use GPT-4 [54] to compare model answers to ground truth labels for long-form answers. We provide prompt templates in Section B.1.
Models. We evaluate SEPs on five different models. For short generations, we generate hidden states and answers with Llama-2 7B and 70B [71], Mistral 7B [28], and Phi-3 Mini [1], and use DeBERTa-Large [25] as the entailment model for calculating semantic entropy [31]. For long generations, we use Llama-2-70B [71] or Llama-3-70B [46] and use GPT-3.5 [8] to predict entailment.
Baselines. We compare SEPs against the ground truth semantic entropy, accuracy probes supervised with model correctness labels, naive entropy, log likelihood, and the p(True) method of Kadavath et al. [30]. For naive entropy, following Farquhar et al. [21], we compute the length-normalized average log token probabilities across the same number of generations as for SE. For log likelihood, we use the length-normalized log likelihood of a single model generation. The p(True) method works by constructing a custom few-shot prompt that contains a number of examples – each consisting of a training set input, a corresponding low-temperature model answer, high-temperature model samples, and a model correctness score. Essentially, p(True) treats sampling-based truthfulness detection as an in-context learning task, where the few-shot prompt teaches the model that model answers with high semantic variety are likely incorrect. We refer to Kadavath et al. [30] for more details.
Linear Probe. For both SEPs and our accuracy probe baseline, we use the logistic regression model from scikit-learn [56] with default hyperparameters for regularization and the LBFGS optimizer.
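Training and evaluating such a probe amounts to a few lines of scikit-learn; the .npy file names below are placeholders for the hidden states and binarized SE labels described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# X_*: (num_queries, d_model) hidden states; y_*: binarized SE labels (Eq. 5).
# The .npy file names are placeholders for data produced as described above.
X_train, y_train = np.load("hidden_train.npy"), np.load("labels_train.npy")
X_test, y_test = np.load("hidden_test.npy"), np.load("labels_test.npy")

sep = LogisticRegression(solver="lbfgs", max_iter=1000)  # default L2 regularization
sep.fit(X_train, y_train)

# Predicted probability of high semantic entropy; AUROC can be computed against
# binarized SE (probe quality) or model correctness (hallucination detection).
p_high_se = sep.predict_proba(X_test)[:, 1]
print("AUROC vs. binarized SE:", roc_auc_score(y_test, p_high_se))
```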
Evaluation. We evaluate SEPs both in terms of their ability to capture semantic entropy as well as their ability to predict model hallucinations. In both cases, we compute the area under the receiver operating characteristic curve (AUROC), with gold labels given by binarized SE or model accuracy.
6 LLM Hidden States Implicitly Capture Semantic Entropy
This section investigates whether LLM hidden states encode semantic entropy. We study SEPs across different tasks, models, and layers, and compare them to accuracy probes in- and out-of-distribution.
Hidden States Capture Semantic Entropy. Figure 2 shows that SEPs are consistently able to capture semantic entropy across different models and tasks. Here, probes are trained on hidden states of the second-last token for the short-form generation setting. In general, we observe that AUROC values increase for later layers in the model, with the exact values depending on the scenario.
Semantic Entropy Can Be Predicted Before Generating. Next, we investigate if semantic entropy can be predicted before even generating the output. Similar to before, Fig. 3 shows AUROC values for predicting binarized semantic entropy from the SEP probes. Perhaps surprisingly (although in line with related work, cf. Section 2), we find that SEPs can capture semantic entropy even before generation. SEPs consistently achieve good AUROC values, with performance slightly below the SLT experiments in Fig. 2. The TBG variant provides even larger cost savings than SEPs already do, as it allows us to quantify uncertainty before generating any novel tokens, i.e. with a single forward pass through the model. This could be useful in practice, for example, to refrain from answering queries for which semantic uncertainty is high.
AUROC values for Llama-2-7B on BioASQ, in both Fig. 2 and Fig. 3, reach very high values, even for early layers. We investigated this and believe it is likely related to the particularities of BioASQ. Concretely, it is the only one of our tasks to contain a significant number of yes-no questions, which are generally associated with lower semantic entropy, as the number of possible semantic meanings in outcome space is limited. For a model with relatively low accuracy such as Llama-2-7B, simply identifying whether or not the given input is a yes-no question will lead to high AUROC values.
SEPs Capture Semantic Uncertainty for Long Generations. While experiments with short generations are popular even in the recent literature [31, 30, 16, 13, 11], this scenario is increasingly disconnected from popular use cases of LLMs as conversational chatbots. In recognition of this, we also study our probes in a long-form generation setting, which increases the average length of model responses from ~15 characters in the short-form scenario to ~100 characters.
Figure 4 shows that, even in the long-form setting, SEPs are able to capture semantic entropy well in both the second-last-token and token-before-generation scenarios for Llama-2-70B and Llama-3-70B. Compared to the short-form generation scenario, we now observe more often that AUROC values peak for intermediate layers. This makes sense as hidden states closer to the final layer will likely be preoccupied with predicting the next token. In the long-form setting, the next token is more often unrelated to the semantic uncertainty of the overall answer, and instead concerned with syntax or lexis.
Counterfactual Context Addition Experiment. To confirm that SEPs capture SE rather than relying on spurious correlations, we perform a counterfactual intervention experiment for Llama-2-7B on TriviaQA. For each input question of TriviaQA, the dataset contains a “context”, from which the ground truth answer can easily be predicted. We usually exclude this context, because including it makes the task too easy. However, for the purpose of this experiment, we add the context and study how this affects SEP predictions.
Figure 5 shows a kernel density estimate of the distribution over the predicted probability of high semantic entropy, p(high SE), for Llama-2-7B on the TriviaQA dataset with context (blue) and without context (orange) in the short generation setting using the SLT. Without context, the SEP's distribution of p(high SE) is concentrated at high values. However, as soon as we provide the context, p(high SE) decreases, as shown by the shift in distribution. As the task becomes much easier – accuracy increases from 26% to 78% – the model becomes more certain – ground truth SE decreases from 1.84 to 0.50. This indicates SEPs accurately capture model behavior in the context addition experiment, with predictions for p(high SE) following ground truth SE behavior, despite the probes never being trained on inputs with context.
7 SEPs Are Cheap and Reliable Hallucination Detectors
In this section, we explore the use of SEPs to predict hallucinations, comparing them to accuracy probes and other baselines. Crucially, we also evaluate probes in a challenging generalization setting, testing them on tasks that they were not trained for. This setup is much more realistic than evaluating probes in-distribution, as, for most deployment scenarios, inputs will rarely match the training distribution exactly. The generalization setting does not affect semantic entropy, naive entropy, or log likelihood, which do not rely on training data. While p(True) does rely on a few samples for prompt construction, we find its performance is usually unaffected by the task origin of the prompt data.
[Table 1: Average difference in AUROC between SEPs and accuracy probes for in-distribution hallucination detection. Models: Mistral-7B (S), Phi-3-3.8B (S), Llama-2-7B (S), Llama-2-70B (S), Llama-2-70B (L), Llama-3-70B (L); (S) denotes short-form and (L) long-form generation.]
Figure 6 shows both in-distribution and generalization performance of SEPs and accuracy probes across different layers for Llama-2-7B in a short-form generation setting trained on the SLT. In-distribution, accuracy probes outperform SEPs across most layers and tasks, with the exception of NQ Open. In Table 1, we compare the average difference in AUROC between SEPs and accuracy probes for predicting model hallucinations, taking a representative set of high-performing layers for both probe types (see Appendix B). We find that SEPs and accuracy probes perform similarly on in-distribution data across models. We report unaggregated results in Fig. A.8. The performance of SEPs here is commendable: SEPs are trained without any ground truth answers or accuracy labels, and yet can capture truthfulness. To the best of our knowledge, SEPs may be the best unsupervised method for hallucination detection even in-distribution, given the problems of other unsupervised methods for truthfulness prediction [20].
However, when evaluating probe generalization to new tasks, SEPs show their true strength. We evaluate probes in a leave-one-out fashion – evaluating on all datasets except the one we train on. As shown in Fig. 6 (right), SEPs consistently outperform accuracy probes across various layers and tasks for short-form generations in the generalization setting. For BioASQ, the difference is particularly large. SEPs clearly generalize better to unseen tasks than accuracy probes. In Table 2 and Fig. 7, we report results for more models, taking a representative set of high-performing layers for both probe types, and Fig. A.7 shows results for Mistral-7B across layers. We again find that SEPs generalize better than accuracy probes to novel tasks. We additionally compare to the sampling-based semantic entropy, naive entropy, and p(True) methods. While SEPs cannot match the performance of
[Table 2: Average difference in AUROC between SEPs and accuracy probes for hallucination detection in the task-generalization setting. Models: Mistral-7B (S), Phi-3-3.8B (S), Llama-2-7B (S), Llama-2-70B (S), Llama-2-70B (L), Llama-3-70B (L).]
these methods, it is important to note the significantly higher cost these baselines incur, requiring additional model generations, whereas SEPs and accuracy probes operate on single generations.
We further evaluate SEPs for long-form generations. As shown in Fig. 8, SEPs outperform accuracy probes for Llama-2-70B and Llama-3-70B in the generalization setting. We also provide in-distribution results for long generations with both models in Figs. A.9 and A.10. Both results confirm the trend discussed above. Overall, our results clearly suggest that SEPs are the best choice for cost-effective uncertainty quantification in LLMs, especially if the distribution of the query data is unknown.
8 Discussion, Future Work, and Conclusions
Discussion. Our experiments show that SEPs generalize better than accuracy probes – in terms of detecting hallucinations – to inputs from unseen tasks. One potential explanation for this is that semantic uncertainty is a better probing target than truthfulness, because semantic uncertainty is a more model-internal characteristic that can be better predicted from model hidden states. Model correctness labels required for accuracy probing, on the other hand, are external and can be noisy, which may make them more difficult to predict from hidden states. We find evidence for this by comparing the in-distribution AUROC values for SEPs (for predicting binarized SE) with the AUROC of the accuracy probes for predicting accuracy in Figs. A.6 and A.5. AUROC values – which are insensitive to class frequencies and a good proxy for task difficulty – are significantly higher for predicting SE, indicating that semantic entropy is indeed better captured by model hidden states than accuracy.
Another possible explanation for the gap in OOD generalization could be that accuracy probes capture model correctness in a way that is specific to the training dataset. For example, the probe may latch on to discriminative features for model correctness that relate to the task at hand but do not generalize, such as identifying a knowledge domain where accuracy is high or low, but which rarely occurs outside the training data. Conversely, semantic probes may capture more inherent model states – e.g., uncertainty from failure to gather relevant facts or attributes for the query. The literature on mechanistic interpretability [52] supports the idea that such information is likely contained in model hidden states. We believe that concretizing these links is a fruitful area for future research.
Future Work. We believe it should be possible to further close the performance gap between sampling-based approaches, such as semantic entropy, and SEPs. One avenue to achieve this could be to increase the scale of the training datasets used to train SEPs. In this work, we relied on established QA tasks to train SEPs to allow for easy comparison to accuracy probes. However, future work could explore training SEPs on unlabelled data, such as inputs generated by another LLM or natural language texts used for general model training or finetuning. This could massively increase the amount of training data for SEPs, which should improve probe accuracy, and would also allow us to explore more complex probing techniques that require more training data.
Conclusions. We have introduced semantic entropy probes (SEPs): linear probes trained on the hidden states of LLMs to predict semantic entropy, an effective measure of uncertainty for free-form LLM generations [21]. We find that the hidden states of LLMs implicitly capture semantic entropy across a wide range of scenarios. SEPs are able to predict semantic entropy consistently, and, importantly, they detect model hallucinations more effectively than probes trained directly for accuracy prediction when testing on novel inputs from a different distribution than the training set – despite not requiring any ground truth model correctness labels. Semantic uncertainty probing, both in terms of model interpretability and practical applications, is an exciting avenue for further research.
Author Contributions and Acknowledgements. JK conceived the project idea, wrote the paper, and, together with MR, provided close mentoring for JH throughout the project. JH wrote the code for SEPs, carried out all of the experiments in the paper, and created some of the figures. JK, MR, LS, and SM explored SEPs in a hackathon, refining the idea and collecting positive preliminary results. LS provided expertise on model interpretability and suggested extensive improvements to the writing. SM created all plots in the initial version of the main paper and appendix. YG provided high-level guidance. All authors provided critical feedback on the writing.
The authors further thank Kunal Handa, Gunshi Gupta, and all members of the OATML lab for insightful discussions, in particular for the feedback given during the hackathon. SM and LS acknowledge funding from the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems (Grant No: EP/S024050/1).
References
- Abdin et al. [2024] Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H., Benhaim, A., Bilenko, M., Bjorck, J., Bubeck, S., Cai, M., Mendes, C. C. T., Chen, W., Chaudhary, V., Chopra, P., Giorno, A. D., de Rosa, G., Dixon, M., Eldan, R., Iter, D., Garg, A., Goswami, A., Gunasekar, S., Haider, E., Hao, J., Hewett, R. J., Huynh, J., Javaheripi, M., Jin, X., Kauffmann, P., Karampatziakis, N., Kim, D., Khademi, M., Kurilenko, L., Lee, J. R., Lee, Y. T., Li, Y., Liang, C., Liu, W., Lin, E., Lin, Z., Madan, P., Mitra, A., Modi, H., Nguyen, A., Norick, B., Patra, B., Perez-Becker, D., Portet, T., Pryzant, R., Qin, H., Radmilac, M., Rosset, C., Roy, S., Ruwase, O., Saarikivi, O., Saied, A., Salim, A., Santacroce, M., Shah, S., Shang, N., Sharma, H., Song, X., Tanaka, M., Wang, X., Ward, R., Wang, G., Witte, P., Wyatt, M., Xu, C., Xu, J., Yadav, S., Yang, F., Yang, Z., Yu, D., Zhang, C., Zhang, C., Zhang, J., Zhang, L. L., Zhang, Y., Zhang, Y., Zhang, Y., and Zhou, X. Phi-3 technical report: A highly capable language model locally on your phone. arXiv 2404.14219, 2024.
- Agrawal et al. [2024] Agrawal, A., Mackey, L., and Kalai, A. T. Do language models know when they’re hallucinating references? In EACL, 2024.
- Alain & Bengio [2017] Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. In ICLR, 2017.
- Azaria & Mitchell [2023] Azaria, A. and Mitchell, T. The internal state of an llm knows when it’s lying. In EMNLP, 2023.
- Band et al. [2024] Band, N., Li, X., Ma, T., and Hashimoto, T. Linguistic calibration of language models. arXiv:2404.00474, 2024.
- Belinkov [2021] Belinkov, Y. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 2021.
- Belrose et al. [2023] Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., and Steinhardt, J. Eliciting latent predictions from transformers with the tuned lens. arXiv 2303.08112, 2023.
- Brown et al. [2020] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 2020.
- Burns et al. [2023] Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. In ICLR, 2023.
- Cao et al. [2022] Cao, M., Dong, Y., and Cheung, J. C. K. Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization. In ACL, 2022.
- Chen & Mueller [2023] Chen, J. and Mueller, J. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness. arXiv 2308.16175, 2023.
- Chuang et al. [2024] Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., and He, P. Dola: Decoding by contrasting layers improves factuality in large language models. In ICLR, 2024.
- Cole et al. [2023] Cole, J. R., Zhang, M. J., Gillick, D., Eisenschlos, J. M., Dhingra, B., and Eisenstein, J. Selectively answering ambiguous questions. EMNLP, 2023.
- Deutsch et al. [2021] Deutsch, D., Bedrax-Weiss, T., and Roth, D. Towards question-answering as an automatic metric for evaluating the content quality of a summary. TACL, 2021.
- Dhuliawala et al. [2023] Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., and Weston, J. Chain-of-verification reduces hallucination in large language models. arXiv:2309.11495, 2023.
- Duan et al. [2023] Duan, J., Cheng, H., Wang, S., Wang, C., Zavalny, A., Xu, R., Kailkhura, B., and Xu, K. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv:2307.01379, 2023.
- Durmus et al. [2020] Durmus, E., He, H., and Diab, M. Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization. ACL, 2020.
- Dziri et al. [2021] Dziri, N., Madotto, A., Zaïane, O., and Bose, A. J. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In EMNLP, 2021.
- Elaraby et al. [2023] Elaraby, M., Lu, M., Dunn, J., Zhang, X., Wang, Y., and Liu, S. Halo: Estimation and reduction of hallucinations in open-source weak large language models. arXiv:2308.11764, 2023.
- Farquhar et al. [2023] Farquhar, S., Varma, V., Kenton, Z., Gasteiger, J., Mikulik, V., and Shah, R. Challenges with unsupervised llm knowledge discovery. arXiv:2312.06681, 2023.
- Farquhar et al. [2024] Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y. Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 2024.
- Feldman et al. [2023] Feldman, P., Foulds, J. R., and Pan, S. Trapping llm hallucinations using tagged context prompts. arXiv:2306.06085, 2023.
- Filippova [2020] Filippova, K.