License: arXiv.org perpetual non-exclusive license
arXiv:2406.15927v1 [cs.CL] 22 Jun 2024

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

Jannik Kossen1 Jiatong Han1∗† Muhammed Razzak1∗
Lisa Schut1 Shreshth Malik1 Yarin Gal1
1 OATML, Department of Computer Science, University of Oxford
Equal contribution. Correspondence to jannik.kossen@cs.ox.ac.uk. Work done while at OATML.
Abstract

We propose semantic entropy probes (SEPs), a cheap and reliable method for uncertainty quantification in Large Language Models (LLMs). Hallucinations, which are plausible-sounding but factually incorrect and arbitrary model generations, present a major challenge to the practical adoption of LLMs. Recent work by Farquhar et al. [21] proposes semantic entropy (SE), which can detect hallucinations by estimating uncertainty in the space of semantic meaning for a set of model generations. However, the 5-to-10-fold increase in computation cost associated with SE computation hinders practical adoption. To address this, we propose SEPs, which directly approximate SE from the hidden states of a single generation. SEPs are simple to train and do not require sampling multiple model generations at test time, reducing the overhead of semantic uncertainty quantification to almost zero. We show that SEPs retain high performance for hallucination detection and generalize better to out-of-distribution data than previous probing methods that directly predict model accuracy. Our results across models and tasks suggest that model hidden states capture SE, and our ablation studies give further insights into the token positions and model layers for which this is the case.

1 Introduction

Large Language Models (LLMs) have demonstrated impressive capabilities across a wide variety of natural language processing tasks [70, 71, 54, 68, 8]. They are increasingly deployed in real-world settings, including in high-stakes domains such as medicine, journalism, or legal services [65, 76, 53, 63]. It is therefore paramount that we can trust the outputs of LLMs. Unfortunately, LLMs have a tendency to hallucinate. Originally defined as “content that is nonsensical or unfaithful to the provided source” [45, 23, 27], the term is now used to refer to nonfactual, arbitrary content generated by LLMs. For example, when asked to generate biographies, even capable LLMs such as GPT-4 will often fabricate facts entirely [48, 69, 21]. While this may be acceptable in low-stakes use cases, hallucinations can cause significant harm when factuality is critical. The reliable detection or mitigation of hallucinations is a key challenge to ensure the safe deployment of LLM-based systems.

Various approaches have been proposed to address hallucinations in LLMs (see Section 2). An effective strategy for detecting hallucinations is to sample multiple responses for a given prompt and check if the different samples convey the same meaning [21, 31, 30, 16, 13, 11, 19, 43, 48]. The core idea is that if the model knows the answer, it will consistently provide the same answer. If the model is hallucinating, its responses may vary across generations. For example, given the prompt “What is the capital of France?”, an LLM that “knows” the answer will consistently output (Paris, Paris, Paris), while an LLM that “does not know” the answer may output (Naples, Rome, Berlin), indicating a hallucination.

One explanation for why this works is that LLMs have calibrated uncertainty [30, 54], i.e., “language models (mostly) know what they know” [30]. When an LLM is certain about an answer, it consistently provides the correct response. Conversely, when uncertain, it generates arbitrary answers. This suggests that we can leverage model uncertainty to detect hallucinations. However, we cannot use token-level probabilities to estimate uncertainty directly because different sequences of tokens may convey the same meaning. In the example above, the answers “Paris”, “It’s Paris”, and “The capital of France is Paris” all convey the same meaning. To address this, Farquhar et al. [21] propose semantic entropy (SE), which clusters generations into sets of equivalent meaning and then estimates uncertainty in semantic space.

Figure 1: Semantic entropy probes (SEPs) outperform accuracy probes for hallucination detection with Llama-2-7B, although there is a gap to 10x costlier baselines. (See Sec. 5.)

A major limitation of SE and other sampling-based approaches is that they require multiple model generations for each input query, typically between 5 and 10. This results in a 5-to-10-fold higher cost compared to naive generation without SE, presenting a major hurdle to the practical adoption of these methods. Computationally cheaper methods for reliable hallucination detection in LLMs are needed.

The hidden states of LLMs are a promising avenue to better understand, predict, and steer a wide range of LLM behaviors [79, 26, 67]. In particular, a recent line of work learns to predict the truthfulness of model responses by training a simple linear probe on the hidden states of LLMs. Linear probes are computationally efficient, both to train and when used at inference. However, existing approaches are usually supervised [61, 35, 4, 44] and therefore require a labeled training dataset assigning accuracy to statements or model generations. And while unsupervised approaches exist [9], their validity has been questioned [20]. In this paper, we argue that supervising probes via SE is preferable to accuracy labels for robust prediction of truthfulness.

We propose Semantic Entropy Probes (SEPs), linear probes that capture semantic uncertainty from the hidden states of LLMs, presenting a cost-effective and reliable hallucination detection method. SEPs combine the advantages of probing and sampling-based hallucination detection. Like other probing approaches, SEPs are easy to train, cheap to deploy, and can be applied to the hidden states of a single model generation. Similar to sampling-based hallucination detection, SEPs capture the semantic uncertainty of the model. Furthermore, they address some of the shortcomings of previous approaches. Contrary to sampling-based hallucination detection, SEPs act directly on a single model hidden state and do not require generating multiple samples at test time. And unlike previous probing methods, SEPs are trained to predict semantic entropy [31] rather than model accuracy; semantic entropy can be computed without access to ground truth accuracy labels, which can be expensive to curate.

We find that SEP predictions are effective proxies for truthfulness. In fact, SEPs generalize better to new tasks than probes trained directly to predict accuracy, setting a new state-of-the-art for cost-efficient hallucination detection, cf. Fig. 1. Our results additionally provide insights into the inner workings of LLMs, strongly suggesting that model hidden states directly capture the model’s uncertainty over semantic meanings. Through ablation studies, we show that this holds across a variety of models, tasks, layers, and token positions.

In summary, our core contributions are:

  • We propose Semantic Entropy Probes (SEPs), linear probes trained on the hidden states of LLMs to capture semantic entropy (Section 4).

  • We demonstrate that semantic entropy is encoded in the hidden states of a single model generation and can be successfully extracted using probes (Section 6).

  • We perform ablation studies of SEP performance across models, tasks, layers, and token positions. Our results strongly suggest that internal model states across layers and tokens implicitly capture semantic uncertainty, even before generating any tokens (Section 6).

  • We show that SEPs can be used to predict hallucinations and that they generalize better than probes directly trained for accuracy as suggested by previous work, establishing a new state-of-the-art for cost-efficient hallucination detection (Section 7, Fig. 1).

2 Related Work

LLM Hallucinations. We refer to Rawte et al. [60], Zhang et al. [78] for extensive surveys on hallucinations in LLMs and here briefly review the most relevant related work to this paper. Early work on hallucinations in language models typically refers to issues in summarization tasks where models “hallucinate” content that is not faithful to the provided source text [45, 14, 17, 10, 75, 42, 51]. Around the same time, research emerged that showed LLMs themselves could store and retrieve factual knowledge [58], leading to the currently popular closed-book setting, where LLMs are queried without any additional context [62]. Since then, a large variety of work has focused on detecting hallucinations in LLMs for general natural language generation tasks. These can typically be classified into one of two directions: sampling-based and retrieval-based approaches.

Sampling-Based Hallucination Detection. For sampling-based approaches, a variety of methods have been proposed that sample multiple model completions for a given query and then quantify the semantic difference between the model generations [31, 30, 16, 13, 11, 19]. For this paper, Farquhar et al. [21] is particularly relevant, as we use their semantic entropy measure to supervise our hidden state probes (we summarize their method in Section 3). A different line of work does not directly re-sample answers for the same query, but instead asks follow-up questions to uncover inconsistencies in the original answer [15, 2]. Recent work has also proposed to detect hallucinations in scenarios where models generate entire paragraphs of text by decomposing the paragraph into individual facts or sentences, and then validating the uncertainty of those individual facts separately [39, 49, 43, 15].

Retrieval-Based Methods. A different strategy to mitigate hallucinations is to rely on external knowledge bases, e.g. web search, to verify the factuality of model responses [22, 77, 57, 18, 24, 36, 74, 66]. An advantage of such approaches is that they do not rely on good model uncertainties and can be used directly to fix errors in model generations. However, retrieval-based approaches can add significant cost and latency. Further, they may be less effective for domains such as reasoning, where LLMs are also prone to produce unfaithful and misleading generations [73, 33]. Thus, retrieval- and uncertainty-based methods are orthogonal and can be combined for maximum effect.

Sampling and Finetuning Strategies. A number of different strategies exist to reduce, rather than detect, the number of hallucinations that LLMs generate. Previous work has proposed simple adaptations to LLM sampling schemes [34, 12, 64], preference optimization targeting factuality [69], or finetuning to align “verbal” uncertainties of LLMs with model accuracy [47, 37, 5].

Understanding Hidden States. Recent work suggests that simple operations on LLM hidden states can qualitatively change model behavior [79, 67, 61], manipulate knowledge [26], or reveal deceitful intent [40]. Probes can be a valuable tool to better understand the internal representations of neural networks like LLMs [3, 6]. Previous work has shown that hidden state probes can predict LLM outputs one or multiple tokens ahead with high accuracy [7, 55]. Relevant to our paper is recent work that suggests there is a “truthfulness” direction in latent space that predicts correctness of statements and generations [44, 4, 9, 35]. Our work extends this: we are also interested in predicting whether the model is hallucinating nonfactual responses. However, rather than directly supervising probes with accuracy labels, we argue that capturing semantic entropy is key for generalization performance.

3 Semantic Entropy

Measuring uncertainty in free-form natural language generation tasks is challenging. The uncertainties over tokens output by the language model can be misleading because they conflate semantic uncertainty (uncertainty over the meaning of the generation) with lexical and syntactic uncertainty (uncertainty over how to phrase the answer); see the example in Section 1. To address this, Farquhar et al. [21] propose semantic entropy, which aggregates token-level uncertainties across clusters of semantic equivalence. (Farquhar et al. [21] is a journal version of the original semantic entropy paper by Kuhn et al. [31].) Semantic entropy is important in the context of this paper because we use it as the supervisory signal to train our hidden state probes, SEPs.

Semantic entropy is calculated in three steps: (1) for a given query $x$, sample model completions from the LLM, (2) aggregate the generations into clusters $(C_1, \dots, C_K)$ of equivalent semantic meaning, (3) calculate semantic entropy, $H_{\textup{SE}}$, by aggregating uncertainties within each cluster. Step (1) is trivial, and we detail steps (2) and (3) below.

Semantic Clustering. To determine if two generations convey the same meaning, Farquhar et al. [21] use natural language inference (NLI) models, such as DeBERTa [25], to predict entailment between the generations. Concretely, two generations $s_a$ and $s_b$ are identical in meaning if $s_a$ entails $s_b$ and $s_b$ entails $s_a$, i.e. they entail each other bi-directionally. Farquhar et al. [21] then propose a greedy algorithm to cluster generations semantically: for each sample $s_a$, we either add it to an existing cluster $C_k$ if bi-directional entailment holds between $s_a$ and a sample $s_b \in C_k$, or add it to a new cluster if the semantic meaning of $s_a$ is distinct from all existing clusters. After processing all generations, we obtain a clustering of the generations by semantic meaning.
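For concreteness, the greedy bi-directional entailment clustering can be sketched as follows. This is a minimal Python sketch rather than the authors' released implementation; the NLI checkpoint name and the assumption that label index 2 corresponds to entailment for that checkpoint are illustrative.

```python
# Minimal sketch of greedy bi-directional entailment clustering (illustrative only).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_NAME = "microsoft/deberta-large-mnli"  # assumed checkpoint
nli_tokenizer = AutoTokenizer.from_pretrained(NLI_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_NAME).eval()

def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts that `premise` entails `hypothesis`."""
    inputs = nli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    return int(logits.argmax(dim=-1)) == 2  # label index 2 = entailment (assumption)

def semantic_clusters(generations: list[str]) -> list[list[str]]:
    """Greedily cluster generations; two answers share a cluster iff they entail each other."""
    clusters: list[list[str]] = []
    for s_a in generations:
        for cluster in clusters:
            s_b = cluster[0]  # comparing against one member suffices in the greedy scheme
            if entails(s_a, s_b) and entails(s_b, s_a):
                cluster.append(s_a)
                break
        else:  # no existing cluster matched: s_a starts a new semantic cluster
            clusters.append([s_a])
    return clusters
```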

Semantic Entropy. Given an input context $x$, the joint probability of a generation $s$ consisting of tokens $(t_1, \dots, t_n)$ is given by the product of conditional token probabilities in the sequence,

$p(s \mid x) = \prod\nolimits_{i=1}^{n} p(t_i \mid t_{1:i-1}, x).$ (1)

The probability of the semantic cluster $C$ is then the aggregate probability of all possible generations $s$ which belong to that cluster,

$p(C \mid x) = \sum\nolimits_{s \in C} p(s \mid x).$ (2)

The uncertainty associated with the distribution over semantic clusters is the semantic entropy,

$H[C \mid x] = \mathbb{E}_{p(C \mid x)}[-\log p(C \mid x)].$ (3)

Estimating SE in Practice. In practice, we cannot compute the above exactly. The expectations with respect to $p(s \mid x)$ and $p(C \mid x)$ are intractable, as the number of possible token sequences grows exponentially with sequence length. Instead, Farquhar et al. [21] sample $N$ generations $(s_1, \dots, s_N)$ at non-zero temperature from the LLM (typically, and also in this paper, $N = 10$). They then treat $(C_1, \dots, C_K)$ as Monte Carlo samples from the true distribution over semantic clusters $p(C \mid x)$, and approximate semantic entropy as

$H[C \mid x] \approx -\frac{1}{K} \sum\nolimits_{k=1}^{K} \log p(C_k \mid x).$ (4)

We here use an additional approximation, employing the discrete variant of SE that yields good performance without access to token probabilities, making it compatible with black-box models [21]. For the discrete SE variant, we estimate the cluster probabilities $p(C \mid x)$ as the fraction of generations in each cluster, $p(C_k \mid x) = \sum\nolimits_{j=1}^{N} \mathbb{1}[s_j \in C_k] / N$, and then compute semantic entropy as the entropy of the resulting categorical distribution, $H_{\textup{SE}}(x) \coloneqq -\sum\nolimits_{k=1}^{K} p(C_k \mid x) \log p(C_k \mid x)$. Discrete SE further avoids problems when estimating Eq. 4 for generations of different lengths [41, 50, 31, 21].
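The discrete SE estimator thus reduces to the entropy of the empirical cluster frequencies. A minimal sketch, assuming cluster indices produced by a clustering step such as the one above:

```python
import numpy as np

def discrete_semantic_entropy(cluster_assignments: list[int]) -> float:
    """Discrete SE: entropy of the empirical distribution over semantic clusters.

    cluster_assignments[j] is the cluster index of generation j, e.g. [0, 0, 0, 1, 2]
    for N = 5 sampled generations spread over K = 3 distinct meanings.
    """
    counts = np.bincount(np.asarray(cluster_assignments))
    probs = counts / counts.sum()   # p(C_k | x): fraction of generations in cluster k
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

# (Paris, Paris, Paris) -> a single cluster, SE = 0; (Naples, Rome, Berlin) -> SE = log 3.
print(discrete_semantic_entropy([0, 0, 0]), discrete_semantic_entropy([0, 1, 2]))
```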

4 Semantic Entropy Probes

Although semantic entropy is effective at detecting hallucinations, its high computational cost may limit its use to only the most critical scenarios. In this section, we propose Semantic Entropy Probes (SEPs), a novel method for cost-efficient and reliable uncertainty quantification in LLMs. SEPs are linear probes trained on the hidden states of LLMs to capture semantic entropy [31]. However, unlike semantic entropy and other sampling-based approaches, SEPs act on the hidden states of a single model generation and do not require sampling multiple responses from the model at test time. Thus, SEPs solve a key practical issue of semantic uncertainty quantification by almost completely eliminating its computational overhead at test time. We further argue that SEPs are advantageous over probes trained to directly predict model accuracy. Our intuition for this is that semantic entropy is an inherent property of the model that should be encoded in the hidden states and thus should be easier to extract than truthfulness, which relies on potentially noisy external information. We discuss this further in Section 8.

Training SEPs. SEPs are constructed as linear logistic regression models, trained on the hidden states of LLMs to predict semantic entropy. We create a dataset of $(h_p^l(x), H_{\textup{SE}}(x))$ pairs, where $x$ is an input query, $h_p^l(x) \in \mathbb{R}^d$ is the model hidden state at token position $p$ and layer $l$, $d$ is the hidden state dimension, and $H_{\textup{SE}}(x) \in \mathbb{R}$ is the semantic entropy. That is, given an input query $x$, we first generate a high-likelihood model response via greedy sampling and store the hidden state at a particular layer and token position, $h_p^l(x)$. We then sample $N = 10$ responses from the model at high temperature ($T = 1$) and compute semantic entropy, $H_{\textup{SE}}(x)$, as detailed in the previous section. For inputs, we rely on questions from popular QA datasets (see Section 5 for details), although we do not need the ground-truth labels provided by these datasets and could alternatively compute semantic entropy for any unlabeled set of suitable LLM inputs.

Binarization. Semantic entropy scores are real numbers. However, for the purposes of this paper, we convert them into binary labels, indicating whether semantic entropy is high or low, and then train a logistic regression classifier to predict these labels. Our motivation for doing so is two-fold. For one, we ultimately want to use our probes for predicting binary model correctness, so we eventually need to construct a binary classifier regardless. Additionally, we would like to compare the performance of SEP probes and accuracy probes. This is easier if both probes target binary classification problems. We note that the logistic regression classifier returns probabilities, such that we can always recover fine-grained signals even after transforming the problem into binary classification.

More formally, we compute $\tilde{H}_{\textup{SE}}(x) = \mathbb{1}[H_{\textup{SE}}(x) > \gamma^{\star}]$, where $\gamma^{\star}$ is a threshold that optimally partitions the raw SE scores into high and low values according to the following objective:

$\gamma^{\star} = \arg\min_{\gamma} \sum\nolimits_{j \in \text{SE}_{\textup{low}}} (H_{\textup{SE}}(x_j) - \hat{H}_{\textup{low}})^2 + \sum\nolimits_{j \in \text{SE}_{\textup{high}}} (H_{\textup{SE}}(x_j) - \hat{H}_{\textup{high}})^2,$ (5)

where

$\text{SE}_{\textup{low}} = \{j : H_{\textup{SE}}(x_j) < \gamma\}, \qquad \text{SE}_{\textup{high}} = \{j : H_{\textup{SE}}(x_j) \geq \gamma\},$
$\hat{H}_{\textup{low}} = \frac{1}{|\text{SE}_{\textup{low}}|} \sum\nolimits_{j \in \text{SE}_{\textup{low}}} H_{\textup{SE}}(x_j), \qquad \hat{H}_{\textup{high}} = \frac{1}{|\text{SE}_{\textup{high}}|} \sum\nolimits_{j \in \text{SE}_{\textup{high}}} H_{\textup{SE}}(x_j).$

This procedure is inspired by splitting objectives used in regression trees [38] and we have found it to perform well in practice compared to alternatives such as soft labelling, cf. Appendix B.
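A minimal sketch of this split-based binarization; restricting candidate thresholds to the midpoints between sorted SE scores is an implementation choice of this sketch, not a detail from the paper:

```python
import numpy as np

def binarize_semantic_entropy(se_scores: np.ndarray) -> tuple[np.ndarray, float]:
    """Binarize raw SE scores at the threshold gamma* that minimizes Eq. (5).

    The objective is the within-group sum of squared deviations from the group means,
    as in regression-tree splitting.
    """
    scores = np.sort(np.asarray(se_scores, dtype=float))
    best_gamma, best_cost = float(np.median(scores)), np.inf
    for gamma in (scores[:-1] + scores[1:]) / 2.0:  # candidate thresholds (midpoints)
        low, high = scores[scores < gamma], scores[scores >= gamma]
        if len(low) == 0 or len(high) == 0:
            continue
        cost = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        if cost < best_cost:
            best_gamma, best_cost = float(gamma), cost
    labels = (np.asarray(se_scores) > best_gamma).astype(int)  # 1 = high SE
    return labels, best_gamma
```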

In summary, given an input dataset of queries, $\{x_j\}_{j=1}^{Q}$, we compute a training set of hidden state–binarized semantic entropy pairs, $\{(h_p^l(x_j), \tilde{H}_{\textup{SE}}(x_j))\}_{j=1}^{Q}$, and use this to train a linear classifier, which is our semantic entropy probe (SEP). At test time, SEPs predict the probability that a model generation for a given input query $x$ has high semantic entropy.

Probing Locations. We collect hidden states, $h_p^l(x)$, across all layers, $l$, of the LLM to investigate which layers best capture semantic entropy. We consider two different token positions, $p$. Firstly, we consider the hidden state at the last token of the input $x$, i.e. the token before generating (TBG) the model response. Secondly, we consider the last token of the model response, which is the token before the end-of-sequence token, i.e. the second-last token (SLT). We refer to these scenarios as TBG and SLT. The TBG experiments allow us to study to what extent LLM hidden states capture semantic entropy before generating a response. The TBG setup potentially allows us to quantify the semantic uncertainty given an input in a single forward pass – without generating any novel tokens – further reducing the cost of our approach over sampling-based alternatives. In practice, this may be useful to quickly determine if a model will answer a particular input query with high certainty.
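To make the two probing locations concrete, the following sketch extracts the TBG and SLT hidden states with the Hugging Face transformers API; the checkpoint name, layer index, and generation length are illustrative assumptions rather than the exact experimental configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def tbg_and_slt_hidden_states(prompt: str, layer: int = 16):
    """Return h_p^l(x) at the token-before-generation (TBG) and second-last token (SLT)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # TBG: hidden state at the last prompt token, before any answer token exists.
        tbg = model(prompt_ids).hidden_states[layer][0, -1]
        # Greedy (low-temperature) answer, then re-run the full sequence for its states.
        full_ids = model.generate(prompt_ids, do_sample=False, max_new_tokens=32)
        states = model(full_ids).hidden_states[layer][0]
    # SLT: last token of the response, i.e. the token before the end-of-sequence token
    # (assuming generation stopped at an EOS token).
    slt = states[-2]
    return tbg, slt
```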

5 Experiment Setup

We investigate and evaluate Semantic Entropy Probes (SEPs) across a range of models and datasets. First, we show that we can accurately predict semantic entropy from the hidden states of LLMs (Section 6). We then explore how SEP predictions vary across different tasks, models, token indices, and layers. Second, we demonstrate that SEPs are a cheap and reliable method for hallucination detection (Section 7), which generalizes better to novel tasks than accuracy probes, although they cannot match the performance of much more expensive sampling-based methods in our experiments.

Tasks. We evaluate SEPs on four datasets: TriviaQA [29], SQuAD [59], BioASQ [72], and NQ Open [32]. We use the input queries of these tasks to derive training sets for SEPs and evaluate the performance of each method on the validation/test sets, creating splits if needed. We consider a short- and a long-form setting: short-form answers are generated by few-shot prompting the LLM to answer “as briefly as possible” and long-form answers are generated by prompting for a “single brief but complete sentence”, leading to an approximately six-fold increase in the number of generated tokens [21]. Following Farquhar et al. [21], we assess model accuracy via the SQuAD F1 score for short-form generations, and we use GPT-4 [54] to compare model answers to ground truth labels for long-form answers. We provide prompt templates in Section B.1.

Models. We evaluate SEPs on five different models. For short generations, we generate hidden states and answers with Llama-2 7B and 70B [71], Mistral 7B [28], and Phi-3 Mini [1], and use DeBERTa-Large [25] as the entailment model for calculating semantic entropy [31]. For long generations, we use Llama-2-70B [71] or Llama-3-70B [46] and use GPT-3.5 [8] to predict entailment.

Baselines. We compare SEPs against the ground truth semantic entropy, accuracy probes supervised with model correctness labels, naive entropy, log likelihood, and the $p(\text{True})$ method of Kadavath et al. [30]. For naive entropy, following Farquhar et al. [21], we compute the length-normalized average log token probabilities across the same number of generations as for SE. For log likelihood, we use the length-normalized log likelihood of a single model generation. The $p(\text{True})$ method works by constructing a custom few-shot prompt that contains a number of examples – each consisting of a training set input, a corresponding low-temperature model answer, high-temperature model samples, and a model correctness score. Essentially, $p(\text{True})$ treats sampling-based truthfulness detection as an in-context learning task, where the few-shot prompt teaches the model that model answers with high semantic variety are likely incorrect. We refer to Kadavath et al. [30] for more details.
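As an illustration of the single-generation log-likelihood baseline, the sketch below scores an answer by its length-normalized log likelihood under any Hugging Face causal LM and tokenizer (such as those loaded in the earlier hidden-state sketch); it is not the exact evaluation code and assumes that tokenizing prompt + answer reproduces the prompt tokens as a prefix, which holds only approximately in general:

```python
import torch

def length_normalized_log_likelihood(model, tokenizer, prompt: str, answer: str) -> float:
    """Average per-token log probability of `answer` given `prompt` (single generation)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                     # [1, seq_len, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)    # position i predicts token i + 1
    answer_logprobs = [
        logprobs[0, pos, full_ids[0, pos + 1]]
        for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
    ]
    return float(torch.stack(answer_logprobs).mean())
```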

Linear Probe. For both SEPs and our accuracy probe baseline, we use the logistic regression model from scikit-learn [56] with default hyperparameters for $\text{L}_2$ regularization and the LBFGS optimizer.
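A minimal sketch of the probe itself; the random arrays stand in for the real hidden-state/label pairs, and the hidden dimension is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: hidden states h_p^l(x_j) with shape (Q, d); y: binarized SE labels with shape (Q,).
# Placeholder data below; d = 4096 is illustrative (Llama-2-7B hidden size).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4096))
y_train = rng.integers(0, 2, size=1000)

# scikit-learn defaults: L2 penalty, C = 1.0, LBFGS solver.
sep_probe = LogisticRegression().fit(X_train, y_train)

# At test time, the probe scores a single hidden state with p(high SE | h).
p_high_se = sep_probe.predict_proba(X_train[:1])[:, 1]
```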

Evaluation. We evaluate SEPs both in terms of their ability to capture semantic entropy as well as their ability to predict model hallucinations. In both cases, we compute the area under the receiver operating characteristic curve (AUROC), with gold labels given by binarized SE or model accuracy.
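The evaluation then reduces to a single scikit-learn call; the labels and scores below are purely illustrative:

```python
from sklearn.metrics import roc_auc_score

# Gold labels: binarized SE (Section 4) or binary model correctness; scores: probe outputs.
gold_labels = [1, 0, 1, 1, 0]
probe_scores = [0.9, 0.2, 0.7, 0.6, 0.4]
print(roc_auc_score(gold_labels, probe_scores))  # 1.0 for this toy example
```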

6 LLM Hidden States Implicitly Capture Semantic Entropy

This section investigates whether LLM hidden states encode semantic entropy. We study SEPs across different tasks, models, and layers, and compare them to accuracy probes in- and out-of-distribution.

Hidden States Capture Semantic Entropy. Figure 2 shows that SEPs are consistently able to capture semantic entropy across different models and tasks. Here, probes are trained on hidden states of the second-last token for the short-form generation setting. In general, we observe that AUROC values increase for later layers in the model, reaching values between 0.7 and 0.95 depending on the scenario.

Figure 2: Semantic Entropy Probes (SEPs) achieve high fidelity for predicting semantic entropy. Across datasets and models, SEPs are consistently able to capture semantic entropy from hidden states of mid-to-late layers. Short generation scenario with probes trained on second-last token (SLT).

Semantic Entropy Can Be Predicted Before Generating. Next, we investigate if semantic entropy can be predicted before even generating the output. Similar to before, Fig. 3 shows AUROC values for predicting binarized semantic entropy from the SEP probes. Perhaps surprisingly (although in line with related work, cf. Section 2), we find that SEPs can capture semantic entropy even before generation. SEPs consistently achieve good AUROC values, with performance slightly below the SLT experiments in Fig. 2. The TBG variant provides even larger cost savings than SEPs already do, as it allows us to quantify uncertainty before generating any novel tokens, i.e. with a single forward pass through the model. This could be useful in practice, for example, to refrain from answering queries for which semantic uncertainty is high.

AUROC values for Llama-2-7B on BioASQ, in both Fig. 2 and Fig. 3, reach very high values, even for early layers. We investigated this and believe it is likely related to the particularities of BioASQ. Concretely, it is the only one of our tasks to contain a significant number of yes-no questions, which are generally associated with lower semantic entropy as the possible number of semantic meanings in outcome space is limited. For a model with relatively low accuracy such as Llama-2-7B, simply identifying whether or not the given input is a yes-no question will lead to high AUROC values.

Figure 3: Semantic entropy can be predicted from the hidden states of the last input token, without generating any novel tokens. Short generations with SEPs trained on the token before generating (TBG).
Figure 4: SEPs successfully capture semantic entropy in Llama-2-70B and Llama-3-70B for long generations across layers and for both SLT and TBG token positions.

SEPs Capture Semantic Uncertainty for Long Generations. While experiments with short generations are popular even in the recent literature [31, 30, 16, 13, 11], this scenario is increasingly disconnected from popular use cases of LLMs as conversational chatbots. In recognition of this, we also study our probes in a long-form generation setting, which increases the average length of model responses from ~15 characters in the short-form scenario to ~100 characters.

Figure 4 shows that, even in the long-form setting, SEPs are able to capture semantic entropy well in both the second-last-token and token-before-generation scenarios for Llama-2-70B and Llama-3-70B. Compared to the short-form generation scenario, we now observe more often that AUROC values peak for intermediate layers. This makes sense as hidden states closer to the final layer will likely be preoccupied with predicting the next token. In the long-form setting, the next token is more often unrelated to the semantic uncertainty of the overall answer, and instead concerned with syntax or lexis.

Figure 5: SEPs capture the drop in SE due to added context.

Counterfactual Context Addition Experiment. To confirm that SEPs capture SE rather than relying on spurious correlations, we perform a counterfactual intervention experiment for Llama-2-7B on TriviaQA. For each input question of TriviaQA, the dataset contains a “context”, from which the ground truth answer can easily be predicted. We usually exclude this context, because including it makes the task too easy. However, for the purpose of this experiment, we add the context and study how this affects SEP predictions.

Figure 5 shows a kernel density estimate of the distribution over the predicted probability of high semantic entropy, $p(\text{high SE})$, for Llama-2-7B on the TriviaQA dataset with context (blue) and without context (orange) in the short generation setting using the SLT. Without context, the SEP's distribution for $p(\text{high SE})$ is concentrated around 0.9. However, as soon as we provide the context, $p(\text{high SE})$ decreases, as shown by the shift in distribution. As the task becomes much easier – accuracy increases from 26% to 78% – the model becomes more certain – ground truth SE decreases from 1.84 to 0.50. This indicates that SEPs accurately capture model behavior for the context addition experiment, with predictions for $p(\text{high SE})$ following ground truth SE behavior, despite never being trained on inputs with context.

7 SEPs Are Cheap and Reliable Hallucination Detectors

In this section, we explore the use of SEPs to predict hallucinations, comparing them to accuracy probes and other baselines. Crucially, we also evaluate probes in a challenging generalization setting, testing them on tasks that they were not trained for. This setup is much more realistic than evaluating probes in-distribution, as, for most deployment scenarios, inputs will rarely match the training distribution exactly. The generalization setting does not affect semantic entropy, naive entropy, or log likelihood, which do not rely on training data. While $p(\text{True})$ does rely on a few samples for prompt construction, we find its performance is usually unaffected by the task origin of the prompt data.

Figure 6: SEPs predict model hallucinations better than accuracy probes when generalizing to unseen tasks. In-distribution, accuracy probes perform better. Short generation setting with Llama-2-7B, SEPs trained on the second-last-token (SLT). For the generalization setting, probes are trained on all tasks except the one that we evaluate on.
Table 1: ΔAUROC (×100) between SEPs and accuracy probes over tasks, in-distribution. Average ± standard error; (S)hort- and (L)ong-form generations.
Model            SEP − Acc. Probe
Mistral-7B (S)   2.8 ± 1.4
Phi-3-3.8B (S)   2.1 ± 0.8
Llama-2-7B (S)   −0.5 ± 2.6
Llama-2-70B (S)  1.3 ± 0.7
Llama-2-70B (L)  −1.9 ± 7.5
Llama-3-70B (L)  −2.0 ± 2.1

Figure 6 shows both in-distribution and generalization performance of SEPs and accuracy probes across different layers for Llama-2-7B in a short-form generation setting trained on the SLT. In-distribution, accuracy probes outperform SEPs across most layers and tasks, with the exception of NQ Open. In Table 1, we compare the average difference in AUROC between SEPs and accuracy probes for predicting model hallucinations, taking a representative set of high-performing layers for both probe types (see Appendix B). We find that SEPs and accuracy probes perform similarly on in-distribution data across models. We report unaggregated results in Fig. A.8. The performance of SEPs here is commendable: SEPs are trained without any ground truth answers or accuracy labels, and yet they capture truthfulness. To the best of our knowledge, SEPs may be the best unsupervised method for hallucination detection even in-distribution, given the problems of other unsupervised methods for truthfulness prediction [20].

However, when evaluating probe generalization to new tasks, SEPs show their true strength. We evaluate probes in a leave-one-out fashion – training on all datasets except one and evaluating on the held-out task. As shown in Fig. 6 (right), SEPs consistently outperform accuracy probes across various layers and tasks for short-form generations in the generalization setting. For BioASQ, the difference is particularly large. SEPs clearly generalize better to unseen tasks than accuracy probes. In Table 2 and Fig. 7, we report results for more models, taking a representative set of high-performing layers for both probe types, and Fig. A.7 shows results for Mistral-7B across layers. We again find that SEPs generalize better than accuracy probes to novel tasks. We additionally compare to the sampling-based semantic entropy, naive entropy, and $p(\text{True})$ methods.

Table 2: ΔAUROC (×100) of SEPs over accuracy probes for task generalization. Average ± standard error; (S)hort- and (L)ong-form generations.
Model            SEP − Acc. Probe
Mistral-7B (S)   10.5 ± 3.5
Phi-3-3.8B (S)   9.9 ± 2.9
Llama-2-7B (S)   7.7 ± 1.3
Llama-2-70B (S)  7.9 ± 3.0
Llama-2-70B (L)  2.2 ± 0.4
Llama-3-70B (L)  6.2 ± 1.9

While SEPs cannot match the performance of these methods, it is important to note the significantly higher cost these baselines incur, requiring 10 additional model generations per query, whereas SEPs and accuracy probes operate on a single generation.

We further evaluate SEPs for long-form generations. As shown in Fig. 8, SEPs outperform accuracy probes for Llama-2-70B and Llama-3-70B in the generalization setting. We also provide in-distribution results for long generations with both models in Figs. A.9 and A.10. Both results confirm the trend discussed above. Overall, our results clearly suggest that SEPs are the best choice for cost-effective uncertainty quantification in LLMs, especially if the distribution of the query data is unknown.

Figure 7: SEPs generalize better to new tasks than accuracy probes across models and tasks. They approach, but do not match, the performance of other, 10x costlier baselines (hatched bars). Short generation setting, SLT, performance for a selection of representative layers.
Figure 8: Semantic entropy probes outperform accuracy probes for hallucination detection in the long-form generation generalization setting with both Llama-2-70B and Llama-3-70B.

8 Discussion, Future Work, and Conclusions

Discussion. Our experiments show that SEPs generalize better than accuracy probes – in terms of detecting hallucinations – to inputs from unseen tasks. One potential explanation for this is that semantic uncertainty is a better probing target than truthfulness, because semantic uncertainty is a more model-internal characteristic that can be better predicted from model hidden states. Model correctness labels required for accuracy probing, on the other hand, are external and can be noisy, which may make them more difficult to predict from hidden states. We can find evidence for this by comparing the in-distribution AUROC values for SEPs (for predicting binarized SE) with the AUROC of the accuracy probes for predicting accuracy in Figs. A.6 and A.5. AUROC values – which are insensitive to class frequencies and a good proxy for task difficulty – for predicting SE are significantly higher, indicating that semantic entropy is indeed better captured by model hidden states than accuracy.

Another possible explanation for the gap in OOD generalization could be that accuracy probes capture model correctness in a way that is specific to the training dataset. For example, the probe may latch on to discriminative features for model correctness that relate to the task at hand but do not generalize, such as identifying a knowledge domain where accuracy is high or low, but which rarely occurs outside the training data. Conversely, semantic probes may capture more inherent model states – e.g., uncertainty from failure to gather relevant facts or attributes for the query. The literature on mechanistic interpretability [52] supports the idea that such information is likely contained in model hidden states. We believe that concretizing these links is a fruitful area for future research.

Future Work. We believe it should be possible to further close the performance gap between sampling-based approaches, such as semantic entropy, and SEPs. One avenue to achieve this could be to increase the scale of the training datasets used to train SEPs. In this work, we relied on established QA tasks to train SEPs to allow for easy comparison to accuracy probes. However, future work could explore training SEPs on unlabelled data, such as inputs generated from another LLM or natural language texts used for general model training or finetuning. This could massively increase the amount of training data for SEPs, which should improve probe accuracy, and also allow us to explore other more complex probing techniques that require more training data.

Conclusions. We have introduced semantic entropy probes (SEPs): linear probes trained on the hidden states of LLMs to predict semantic entropy, an effective measure of uncertainty for free-form LLM generations [21]. We find that the hidden states of LLMs implicitly capture semantic entropy across a wide range of scenarios. SEPs are able to predict semantic entropy consistently, and, importantly, they detect model hallucinations more effectively than probes trained directly for accuracy prediction when testing on novel inputs from a different distribution than the training set – despite not requiring any ground truth model correctness labels. Semantic uncertainty probing, both in terms of model interpretability and practical applications, is an exciting avenue for further research.

Author Contributions and Acknowledgements. JK conceived the project idea, wrote the paper, and, together with MR, provided close mentoring for JH throughout the project. JH wrote the code for SEPs, carried out all of the experiments in the paper, and created some of the figures. JK, MR, LS, and SM explored SEPs in a hackathon, refining the idea and collecting positive preliminary results. LS provided expertise on model interpretability and suggested extensive improvements to the writing. SM created all plots in the initial version of main paper and appendix. YG provided high level guidance. All authors provided critical feedback on writing.

The authors further thank Kunal Handa, Gunshi Gupta, and all members of the OATML lab for insightful discussions, in particular for the feedback given during the hackathon. SM and LS acknowledge funding from the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems (Grant No: EP/S024050/1).

References

  • Abdin et al. [2024] Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H., Benhaim, A., Bilenko, M., Bjorck, J., Bubeck, S., Cai, M., Mendes, C. C. T., Chen, W., Chaudhary, V., Chopra, P., Giorno, A. D., de Rosa, G., Dixon, M., Eldan, R., Iter, D., Garg, A., Goswami, A., Gunasekar, S., Haider, E., Hao, J., Hewett, R. J., Huynh, J., Javaheripi, M., Jin, X., Kauffmann, P., Karampatziakis, N., Kim, D., Khademi, M., Kurilenko, L., Lee, J. R., Lee, Y. T., Li, Y., Liang, C., Liu, W., Lin, E., Lin, Z., Madan, P., Mitra, A., Modi, H., Nguyen, A., Norick, B., Patra, B., Perez-Becker, D., Portet, T., Pryzant, R., Qin, H., Radmilac, M., Rosset, C., Roy, S., Ruwase, O., Saarikivi, O., Saied, A., Salim, A., Santacroce, M., Shah, S., Shang, N., Sharma, H., Song, X., Tanaka, M., Wang, X., Ward, R., Wang, G., Witte, P., Wyatt, M., Xu, C., Xu, J., Yadav, S., Yang, F., Yang, Z., Yu, D., Zhang, C., Zhang, C., Zhang, J., Zhang, L. L., Zhang, Y., Zhang, Y., Zhang, Y., and Zhou, X. Phi-3 technical report: A highly capable language model locally on your phone. arXiv 2404.14219, 2024.
  • Agrawal et al. [2024] Agrawal, A., Mackey, L., and Kalai, A. T. Do language models know when they’re hallucinating references? In EACL, 2024.
  • Alain & Bengio [2017] Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. In ICLR, 2017.
  • Azaria & Mitchell [2023] Azaria, A. and Mitchell, T. The internal state of an llm knows when it’s lying. In EMNLP, 2023.
  • Band et al. [2024] Band, N., Li, X., Ma, T., and Hashimoto, T. Linguistic calibration of language models. arXiv:2404.00474, 2024.
  • Belinkov [2021] Belinkov, Y. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 2021.
  • Belrose et al. [2023] Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., and Steinhardt, J. Eliciting latent predictions from transformers with the tuned lens. arXiv 2303.08112, 2023.
  • Brown et al. [2020] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 2020.
  • Burns et al. [2023] Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. In ICLR, 2023.
  • Cao et al. [2022] Cao, M., Dong, Y., and Cheung, J. C. K. Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization. In ACL, 2022.
  • Chen & Mueller [2023] Chen, J. and Mueller, J. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness. arXiv 2308.16175, 2023.
  • Chuang et al. [2024] Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., and He, P. Dola: Decoding by contrasting layers improves factuality in large language models. In ICLR, 2024.
  • Cole et al. [2023] Cole, J. R., Zhang, M. J., Gillick, D., Eisenschlos, J. M., Dhingra, B., and Eisenstein, J. Selectively answering ambiguous questions. EMNLP, 2023.
  • Deutsch et al. [2021] Deutsch, D., Bedrax-Weiss, T., and Roth, D. Towards question-answering as an automatic metric for evaluating the content quality of a summary. TACL, 2021.
  • Dhuliawala et al. [2023] Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., and Weston, J. Chain-of-verification reduces hallucination in large language models. arXiv:2309.11495, 2023.
  • Duan et al. [2023] Duan, J., Cheng, H., Wang, S., Wang, C., Zavalny, A., Xu, R., Kailkhura, B., and Xu, K. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv:2307.01379, 2023.
  • Durmus et al. [2020] Durmus, E., He, H., and Diab, M. Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization. ACL, 2020.
  • Dziri et al. [2021] Dziri, N., Madotto, A., Zaïane, O., and Bose, A. J. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In EMNLP, 2021.
  • Elaraby et al. [2023] Elaraby, M., Lu, M., Dunn, J., Zhang, X., Wang, Y., and Liu, S. Halo: Estimation and reduction of hallucinations in open-source weak large language models. arXiv:2308.11764, 2023.
  • Farquhar et al. [2023] Farquhar, S., Varma, V., Kenton, Z., Gasteiger, J., Mikulik, V., and Shah, R. Challenges with unsupervised llm knowledge discovery. arXiv:2312.06681, 2023.
  • Farquhar et al. [2024] Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y. Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 2024.
  • Feldman et al. [2023] Feldman, P., Foulds, J. R., and Pan, S. Trapping llm hallucinations using tagged context prompts. arXiv:2306.06085, 2023.
  • Filippova [2020] Filippova, K. Controlled hallucinations: Learning to generate faithfully from noisy data. In EMNLP, 2020.
  • Gao et al. [2022] Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T., Fan, Y., Zhao, V. Y., Lao, N., Lee, H., Juan, D.-C., et al. Rarr: Researching and revising what language models say, using language models. In ACL, 2022.
  • He et al. [2021] He, P., Liu, X., Gao, J., and Chen, W. Deberta: Decoding-enhanced bert with disentangled attention. In ICLR, 2021.
  • Hernandez et al. [2023] Hernandez, E., Li, B. Z., and Andreas, J. Measuring and manipulating knowledge representations in language models. arXiv:2304.00740, 2023.
  • Ji et al. [2023] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
  • Jiang et al. [2023] Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b. arXiv, 2023.
  • Joshi et al. [2017] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL, 2017.
  • Kadavath et al. [2022] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Dodds, Z. H., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv:2207.05221, 2022.
  • Kuhn et al. [2023] Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In ICLR, 2023.
  • Kwiatkowski et al. [2019] Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. Natural questions: a benchmark for question answering research. TACL, 2019.
  • Lanham et al. [2023] Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., et al. Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702, 2023.
  • Lee et al. [2022] Lee, N., Ping, W., Xu, P., Patwary, M., Fung, P. N., Shoeybi, M., and Catanzaro, B. Factuality enhanced language models for open-ended text generation. NeurIPS, 2022.
  • Li et al. [2024] Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. NeurIPS, 36, 2024.
  • Li et al. [2023] Li, X., Zhao, R., Chia, Y. K., Ding, B., Joty, S., Poria, S., and Bing, L. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. In ICLR, 2023.
  • Lin et al. [2023] Lin, S., Hilton, J., and Evans, O. Teaching models to express their uncertainty in words. TMLR, 2023.
  • Loh [2011] Loh, W.-Y. Classification and regression trees. Wiley interdisciplinary reviews: data mining and knowledge discovery, 2011.
  • Luo et al. [2023] Luo, J., Xiao, C., and Ma, F. Zero-resource hallucination prevention for large language models. arXiv:2309.02654, 2023.
  • MacDiarmid et al. [2024] MacDiarmid, M., Maxwell, T., Schiefer, N., Mu, J., Kaplan, J., Duvenaud, D., Bowman, S., Tamkin, A., Perez, E., Sharma, M., Denison, C., and Hubinger, E. Simple probes can catch sleeper agents, 2024. URL https://www.anthropic.com/news/probes-catch-sleeper-agents.
  • Malinin & Gales [2021] Malinin, A. and Gales, M. Uncertainty estimation in autoregressive structured prediction. ICLR, 2021.
  • Manakul et al. [2023a] Manakul, P., Liusie, A., and Gales, M. J. Mqag: Multiple-choice question answering and generation for assessing information consistency in summarization. IJCNLP-AACL, 2023a.
  • Manakul et al. [2023b] Manakul, P., Liusie, A., and Gales, M. J. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Conference on Empirical Methods in Natural Language Processing, 2023b.
  • Marks & Tegmark [2023] Marks, S. and Tegmark, M. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv:2310.06824, 2023.
  • Maynez et al. [2020] Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. On faithfulness and factuality in abstractive summarization. In ACL, 2020.
  • Meta [2024] Meta. Introducing meta llama 3: The most capable openly available llm to date, 2024. URL https://ai.meta.com/blog/meta-llama-3/. [Online; accessed June 16 2024].
  • Mielke et al. [2022] Mielke, S. J., Szlam, A., Boureau, Y.-L., and Dinan, E. Reducing conversational agents’ overconfidence through linguistic calibration. TACL, 2022.
  • Min et al. [2023] Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-t., Koh, P. W., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. EMNLP, 2023.
  • Mündler et al. [2023] Mündler, N., He, J., Jenko, S., and Vechev, M. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv:2305.15852, 2023.
  • Murray & Chiang [2018] Murray, K. and Chiang, D. Correcting length bias in neural machine translation. WMT, 2018.
  • Nan et al. [2021] Nan, F., Santos, C. N. d., Zhu, H., Ng, P., McKeown, K., Nallapati, R., Zhang, D., Wang, Z., Arnold, A. O., and Xiang, B. Improving factual consistency of abstractive summarization via question answering. ACL-IJCNLP, 2021.
  • Nanda et al. [2023] Nanda, N., Rajamanoharan, S., Kramar, J., and Shah, R. Fact finding: Attempting to reverse-engineer factual recall on the neuron level, Dec 2023. URL https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall.
  • Opdahl et al. [2023] Opdahl, A. L., Tessem, B., Dang-Nguyen, D.-T., Motta, E., Setty, V., Throndsen, E., Tverberg, A., and Trattner, C. Trustworthy journalism through AI. Data Knowl. Eng., 2023.
  • OpenAI [2023] OpenAI. GPT-4 technical report, 2023.
  • Pal et al. [2023] Pal, K., Sun, J., Yuan, A., Wallace, B., and Bau, D. Future lens: Anticipating subsequent tokens from a single hidden state. In CoNLL, 2023.
  • Pedregosa et al. [2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. JMLR, 12, 2011.
  • Peng et al. [2023] Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., Huang, Q., Liden, L., Yu, Z., Chen, W., et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv:2302.12813, 2023.
  • Petroni et al. [2019] Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. Language models as knowledge bases? EMNLP, 2019.
  • Rajpurkar et al. [2018] Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad. ACL, 2018.
  • Rawte et al. [2023] Rawte, V., Sheth, A., and Das, A. A survey of hallucination in large foundation models. arXiv:2309.05922, 2023.
  • Rimsky et al. [2023] Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering llama 2 via contrastive activation addition. arXiv:2312.06681, 2023.
  • Roberts et al. [2020] Roberts, A., Raffel, C., and Shazeer, N. How much knowledge can you pack into the parameters of a language model? EMNLP, 2020.
  • Shen et al. [2023] Shen, Y., Heacock, L., Elias, J., Hentel, K. D., Reig, B., Shih, G., and Moy, L. ChatGPT and other large language models are double-edged swords. Radiology, 2023.
  • Shi et al. [2023] Shi, W., Han, X., Lewis, M., Tsvetkov, Y., Zettlemoyer, L., and Yih, S. W.-t. Trusting your evidence: Hallucinate less with context-aware decoding. NAACL, 2023.
  • Singhal et al. [2023] Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Scharli, N., Chowdhery, A., Mansfield, P., y Arcas, B. A., Webster, D., Corrado, G. S., Matias, Y., Chou, K., Gottweis, J., Tomasev, N., Liu, Y., Rajkomar, A., Barral, J., Semturs, C., Karthikesalingam, A., and Natarajan, V. Large language models encode clinical knowledge. Nature, 2023.
  • Su et al. [2022] Su, D., Li, X., Zhang, J., Shang, L., Jiang, X., Liu, Q., and Fung, P. Read before generate! faithful long form question answering with machine reading. ACL, 2022.
  • Subramani et al. [2022] Subramani, N., Suresh, N., and Peters, M. E. Extracting latent steering vectors from pretrained language models. ACL, 2022.
  • Gemini Team [2023] Gemini Team. Gemini: a family of highly capable multimodal models. arXiv, 2023.
  • Tian et al. [2024] Tian, K., Mitchell, E., Yao, H., Manning, C. D., and Finn, C. Fine-tuning language models for factuality. ICLR, 2024.
  • Touvron et al. [2023a] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv:2302.13971, 2023a. URL https://arxiv.org/abs/2302.13971.
  • Touvron et al. [2023b] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023b.
  • Tsatsaronis et al. [2015] Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artiéres, T., Ngomo, A.-C. N., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., and Paliouras, G. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 2015.
  • Turpin et al. [2023] Turpin, M., Michael, J., Perez, E., and Bowman, S. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. NeurIPS, 2023.
  • Varshney et al. [2023] Varshney, N., Yao, W., Zhang, H., Chen, J., and Yu, D. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by actively validating low-confidence generation. arXiv:2307.03987, 2023.
  • Wang et al. [2020] Wang, A., Cho, K., and Lewis, M. Asking and answering questions to evaluate the factual consistency of summaries. ACL, 2020.
  • Weiser [2023] Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times, June 2023.
  • Zhang et al. [2023a] Zhang, S., Pan, L., Zhao, J., and Wang, W. Y. Mitigating language model hallucination with interactive question-knowledge alignment. arXiv:2305.13669, 2023a.
  • Zhang et al. [2023b] Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv:2309.01219, 2023b.
  • Zou et al. [2023] Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to ai transparency. arXiv:2310.01405, 2023.

Appendix A Additional Results

Model Task Accuracies.

We report the accuracies achieved by the models on the various datasets used in this work in Table 3.

Table 3: Task accuracy of models across datasets, in (L)ong- and (S)hort-form generation settings.
Model            BioASQ (%)  TriviaQA (%)  NQ Open (%)  SQuAD (%)
Llama-3-70B (L)  67.2        88.5          61.2         46.0
Llama-2-70B (L)  60.3        85.0          58.3         43.9
Llama-2-70B (S)  48.4        75.7          49.5         31.4
Llama-2-7B (S)   43.3        64.8          38.3         23.5
Mistral-7B (S)   39.3        52.3          28.3         20.7
Phi-3-3.8B (S)   45.5        48.3          26.1         24.3

Predicting Model Correctness from Hidden States.

Figures A.1 and A.2 give additional results showing that we can predict model correctness from hidden states using SEPs trained on the second-last token (SLT) or the token-before-generation (TBG) in the short-form in-distribution scenario across models and tasks. Figures A.3 and A.4 further show that accuracy probes perform similarly when trained on the SLT or the TBG in the same setting.

Figure A.1: Semantic Entropy Probes (SEPs) capture model hallucinations. Short generations with SEPs trained on the hidden states of the model at the second-last-token (SLT).
Figure A.2: Semantic Entropy Probes (SEPs) capture model hallucinations. Short generations with SEPs trained on the hidden states of the model at the token-before-generation (TBG).
Figure A.3: Accuracy probes for in-distribution short-form generation trained on the second-last-token (SLT).

Predicting Correctness vs. Semantic Entropy.

Figures A.5 and A.6 show that predicting semantic entropy from hidden states is generally easier than directly predicting model correctness, suggesting that semantic entropy is implicitly encoded in the hidden states.

Figure A.4: Accuracy probes for in-distribution short-form generation trained on the token-before-generation (TBG).
Figure A.5: Predicting semantic entropy from hidden states with SEPs works better than predicting accuracy from the hidden states with accuracy probes. Llama-2-7B and 70B in the short generation setting with probes trained on hidden states of the SLT, evaluated in-distribution.
Figure A.6: Predicting semantic entropy from hidden states with SEPs works better than predicting accuracy from the hidden states with accuracy probes. Mistral-7B and Phi-3 Mini in the short generation setting with probes trained on hidden states of the SLT, evaluated in-distribution.
Figure A.7: SEPs predict model hallucinations better than accuracy probes when generalizing to unseen tasks (right). In-distribution, accuracy probes have comparable performance (left). Mistral-7B in the short-generation setting with probes trained on hidden states from the SLT. For the generalization setting, probes are trained on all tasks except the one we evaluate on.

Additional Comparisons to Baselines.

In Fig. A.7, we additionally report results comparing SEPs to accuracy probes across layers for Mistral-7B in the in-distribution and generalization settings. In Fig. A.8, we compare the performance of SEPs to baselines in the in-distribution setting across models and datasets, finding that SEPs and accuracy probes perform similarly, with SEPs performing slightly better for 3 out of 5 models. In Figs. A.9 and A.10, we report in- and out-of-distribution results for Llama-2-70B and Llama-3-70B in the long-form generation setting.

Figure A.8: Short generation performance for the in-distribution setting across models compared to baseline methods. Hatched bars indicate more computationally expensive methods.
Figure A.9: Semantic entropy probes outperform accuracy probes for hallucination detection in the long-form generation generalization setting with Llama-2-70B. In-distribution, accuracy probes sometimes outperform and sometimes underperform. Probes cannot match the performance of the significantly more expensive baselines.
Figure A.10: Semantic entropy probes outperform accuracy probes for hallucination detection in the long-form generation generalization setting with Llama-3-70B. In-distribution, accuracy probes often outperform SEPs. Probes cannot match the performance of much more expensive baselines.

Hidden State Alternatives. In addition to investigating the performance of probes on the hidden states, we study whether the residual stream or the MLP outputs can also be used to predict semantic entropy. Figure A.11 shows that probing the hidden states results in consistently higher performance.
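For reference, the sketch below shows one way such components could be read out for a Llama-style Hugging Face model: layer hidden states via output_hidden_states and MLP outputs via a forward hook. The module path model.model.layers[layer].mlp and the exact component definitions are assumptions and may differ from our implementation or across architectures.

# Sketch of reading out candidate probe inputs for a Llama-style model.
# The module path for the MLP block is an assumption and may differ by architecture.
import torch

@torch.no_grad()
def collect_probe_inputs(model, tokenizer, prompt, layer):
    mlp_out = {}

    def hook(_module, _inputs, output):
        mlp_out["value"] = output  # shape (batch, seq_len, hidden_dim)

    handle = model.model.layers[layer].mlp.register_forward_hook(hook)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    handle.remove()

    hidden_state = out.hidden_states[layer + 1][0, -1]  # output of the layer ("hidden state")
    residual_in = out.hidden_states[layer][0, -1]       # residual stream entering the layer
    mlp_output = mlp_out["value"][0, -1]                # MLP contribution at the last token
    return hidden_state, residual_in, mlp_output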

Figure A.11: Probing different model components for SEPs. The hidden states are more predictive than residual streams and MLP outputs. TriviaQA, Llama-2-7B, in-distribution, short-form generations, SLT.

Different Binarization Procedures. In addition to the “best split” procedure discussed in Section 4 and used in all of our experiments, here we explore the performance of a simple “even split” alternative, which splits semantic entropy into high and low classes such that both classes contain an equal number of samples. Figure A.12 shows that performance is similar, with the best-split procedure slightly outperforming the even-split ablation. For illustration purposes, Fig. A.13 shows the behavior of the best-split objective, Eq. 5, across different thresholds. We have also explored a “soft labelling” strategy as an alternative to hard binarization: we obtain soft labels by transforming raw semantic entropies into probabilities with a sigmoid function centered around the best-split threshold and then train SEPs on the resulting soft labels. Early results did not show improved performance.
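As a reference for the two strategies, the sketch below binarizes semantic entropy either at the median (even split) or at the threshold minimizing a within-class squared-error objective in the spirit of Eq. 5; the candidate grid and the exact form of the objective are simplifying assumptions rather than our exact implementation.

# Sketch of the two binarization strategies for semantic entropy (SE) labels.
# The within-class squared-error objective is a simplified stand-in for Eq. 5.
import numpy as np

def even_split_labels(se):
    """Label the upper half of SE values as high (1) and the rest as low (0)."""
    return (se > np.median(se)).astype(int)

def best_split_labels(se, n_candidates=100):
    """Pick the threshold gamma minimizing the summed within-class squared error."""
    candidates = np.quantile(se, np.linspace(0.05, 0.95, n_candidates))
    best_gamma, best_obj = candidates[0], np.inf
    for gamma in candidates:
        low, high = se[se <= gamma], se[se > gamma]
        if len(low) == 0 or len(high) == 0:
            continue
        obj = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        if obj < best_obj:
            best_gamma, best_obj = gamma, obj
    return (se > best_gamma).astype(int)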

Figure A.12: Comparing binarization methods for semantic entropy. Our “best split” procedure slightly outperforms the “even split” strategy, although SEPs do not appear overly sensitive to the binarization procedure. Long-form generations for Llama-2-70B, SLT, in-distribution.
Figure A.13: MSE of the best-split objective (Eq. 5) for different binarization thresholds γ, for models in either the short-form or (L)ong-form generation setting (SLT).

Appendix B Experiment Details

Here we provide additional details to reproduce the experiments of the main paper.

B.1 Prompt Templates

We use the following prompt templates across experiments.

For long-form generations, we use the following prompt template:

Answer the following question in a single brief but complete sentence.
Question: [query question]
Answer:

For short-form generations, we adjust the instruction and additionally provide 5 demonstration examples with short ground truth answers, to elicit a short answer from the model:

Answer the following question as briefly as possible.
Question: [example question 1]
Answer:   [example answer 1]
...
Question: [example question 5]
Answer:   [example answer 5]
Question: [query question]
Answer:
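The sketch below assembles the short-form prompt from this template; the demonstration pairs and the helper name are placeholders for illustration.

# Sketch of assembling the short-form few-shot prompt; demonstrations are placeholders.
def build_short_form_prompt(query, demos):
    lines = ["Answer the following question as briefly as possible."]
    for question, answer in demos:
        lines.append(f"Question: {question}")
        lines.append(f"Answer:   {answer}")
    lines.append(f"Question: {query}")
    lines.append("Answer:")
    return "\n".join(lines)

demos = [("What is the capital of France?", "Paris")] * 5  # placeholder demonstrations
prompt = build_short_form_prompt("Who wrote Hamlet?", demos)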

Finally, for the counterfactual context addition experiment, we prepend the context to the question:

Context: [query context]
Question: [query question]
Answer:

B.2 Semantic Entropy Calculation

We compute semantic entropy with N=10 generations sampled at temperature T=1.0, using default values of top-p (p=0.9) and top-K (K=50).
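A minimal sketch of this sampling step for a Hugging Face causal LM is given below; the model name and the maximum generation length are placeholders.

# Sketch of sampling N=10 generations with T=1.0, top-p=0.9, top-K=50.
# The model name and max_new_tokens are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def sample_generations(prompt, n=10):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        top_k=50,
        num_return_sequences=n,
        max_new_tokens=64,
    )
    completions = outputs[:, inputs["input_ids"].shape[1]:]
    return [tokenizer.decode(c, skip_special_tokens=True) for c in completions]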

For short-form generations, we predict entailment using DeBERTa-Large [25] and assess model accuracy via the SQuAD F1 score.
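The sketch below illustrates the short-form pipeline: pairwise bidirectional entailment with a DeBERTa NLI model to cluster generations by meaning, followed by a discrete entropy over cluster frequencies. The checkpoint name and the discrete entropy estimate are simplifying assumptions; the full semantic entropy computation follows the main text.

# Sketch of entailment-based clustering and a discrete semantic entropy estimate.
# The NLI checkpoint and the discrete estimate are simplifying assumptions.
import math
from collections import Counter
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "microsoft/deberta-large-mnli"  # assumed DeBERTa-Large NLI checkpoint
nli_tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)

@torch.no_grad()
def entails(premise, hypothesis):
    inputs = nli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    pred = nli_model(**inputs).logits.argmax(dim=-1).item()
    return nli_model.config.id2label[pred].lower() == "entailment"

def cluster_by_meaning(answers):
    """Assign each answer to a cluster via bidirectional entailment with a representative."""
    representatives, labels = [], []
    for answer in answers:
        for idx, rep in enumerate(representatives):
            if entails(answer, rep) and entails(rep, answer):
                labels.append(idx)
                break
        else:
            representatives.append(answer)
            labels.append(len(representatives) - 1)
    return labels

def discrete_semantic_entropy(answers):
    counts = Counter(cluster_by_meaning(answers))
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())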

For long-form generations, we predict entailment with GPT-3.5 [8] and the following prompt:

Here are two possible answers:
Possible Answer 1: [model generation a]
Possible Answer 2: [model generation b]
Does Possible Answer 1 semantically entail Possible Answer 2?
Respond with entailment, contradiction, or neutral.
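A sketch of issuing this prompt through the OpenAI Python client is shown below; the model identifier, temperature, and response parsing are assumptions.

# Sketch of querying GPT-3.5 with the entailment prompt above.
# The model id and response parsing are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt_entailment(answer_a, answer_b):
    prompt = (
        "Here are two possible answers:\n"
        f"Possible Answer 1: {answer_a}\n"
        f"Possible Answer 2: {answer_b}\n"
        "Does Possible Answer 1 semantically entail Possible Answer 2?\n"
        "Respond with entailment, contradiction, or neutral."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().lower()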

To assess the correctness of long-form generations, we prompt GPT-4 [54] or GPT-4o (we use GPT-4 to evaluate Llama-2-70B but switched to GPT-4o for our more recent experiments on Llama-3-70B, given the difference in cost between the two GPT models) as follows:

We are assessing the quality of answers to the following question: [query question]
The expected answer is: [ground truth label].
The proposed answer is: [model generation].
Within the context of the question,
does the proposed answer mean the same as the expected answer?
Respond only with yes or no.
Response:

B.3 Semantic Entropy Probes

SEPs are trained on the hidden states, whose dimensionality varies between models. We detail the hidden-state dimensionality and the number of layers for each model in Table 4.

Table 4: Model properties and the layers concatenated for SEPs and (Acc)uracy (P)robes, in (L)ong-form and (S)hort-form generation settings.
Model            No. of Layers  Hidden Dim.  Layers for SEPs       Layers for Acc. P.
Llama-3-70B (L)  80             8192         [76, 77, 78, 79, 80]  [31, 32, 33, 34, 35]
Llama-2-70B (L)  80             8192         [74, 75, 76, 77, 78]  [76, 77, 78, 79, 80]
Llama-2-70B (S)  80             8192         [76, 77, 78, 79, 80]  [75, 76, 77, 78, 79]
Llama-2-7B (S)   32             4096         [28, 29, 30, 31, 32]  [18, 19, 20, 21, 22]
Mistral-7B (S)   32             4096         [28, 29, 30, 31, 32]  [12, 13, 14, 15, 16]
Phi-3-3.8B (S)   32             3072         [21, 22, 23, 24, 25]  [25, 26, 27, 28, 29]

Layer Concatenation. For any aggregate results presented in the main paper or appendix, i.e., any bar plots or tables, we report SEP and accuracy probe performance on a representative set of high-performing layers. Concretely, we select a set of adjacent layers and concatenate their hidden states to train both types of probes, choosing the interval with the highest mean AUROC (on un-concatenated hidden states) in the in-distribution setting. Table 4 lists the layers across which we concatenate.
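The sketch below illustrates this step: concatenate the hidden states of the selected layer window and fit a logistic regression probe on binarized semantic entropy labels. Variable names and shapes are illustrative.

# Sketch of SEP training on concatenated hidden states (layer window from Table 4).
# Variable names and shapes are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_sep(hidden_by_layer, labels, layers):
    """hidden_by_layer[l]: array of shape (n_samples, hidden_dim); labels: binarized SE (0/1)."""
    features = np.concatenate([hidden_by_layer[l] for l in layers], axis=1)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(features, labels)
    return probe

# Hallucination-detection AUROC on held-out data (scores = probability of the high-SE class):
# scores = probe.predict_proba(features_test)[:, 1]
# auroc = roc_auc_score(labels_test, scores)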

Filtering for Long-form Generations. To provide a clearer signal to the SEP on what constitutes high- and low-semantic-entropy inputs, we filter out training samples with semantic entropy between the 55% and 80% quantiles for long generations, as we found this to give a mild increase in performance. Note that this filtering did not improve performance for the accuracy probes, and we report accuracy-probe results without filtering. We found this filtering to be unnecessary for experiments with Llama-3-70B.

Training Set Size. For long-generation experiments, we collect 1000 samples across tasks. For short-generation experiments, we collect 2000 samples of hidden state–semantic entropy pairs across tasks. We match the training set sizes between accuracy probes and SEPs.

B.4 Baselines

For the p(True) baseline, we construct a few-shot prompt with 10 examples, where each example is formatted as below:

Question: [example question 1]
Brainstormed Answers: [model generation a]
[model generation b]
[model generation c]
..
[model generation j]
Possible answer: [greedy model generation]
Is the possible answer:
A) True
B) False
The possible answer is: [A / B depending on correctness of possible answer]

We give an illustrative example below for what this could look like in practice:

Question: What is the capital of France?
Brainstormed Answers: The capital of France is Paris.
Paris is the capital of France.
It’s Paris.
Possible answer: The capital of France is Paris.
Is the possible answer:
A) True
B) False
The possible answer is: A

For p(True), we obtain the probability of model truthfulness by measuring the probability of the token "A" at the end of the prompt.
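A sketch of this readout for a Hugging Face causal LM is shown below; how the token id for "A" is obtained from the tokenizer (e.g. leading whitespace) is an assumption and may need model-specific handling.

# Sketch of the p(True) readout: next-token probability of "A" after the prompt.
# Obtaining the token id for "A" (leading whitespace, etc.) may need model-specific care.
import torch

@torch.no_grad()
def p_true(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    a_token_id = tokenizer.encode(" A", add_special_tokens=False)[-1]  # assumed tokenization
    return probs[a_token_id].item()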

B.5 Evaluation

To evaluate the performance of the probes in the generalization setting, we employ the following leave-one-out procedure for the aggregate results reported in the barplots and tables.

First, each probe is trained on a single dataset. Then, each trained probe is evaluated, in terms of AUROC for detecting hallucinations, on all datasets other than the one it was trained on. We report the mean across all probes evaluated on a given dataset. This measures the generalization of the probes to datasets not seen during training, a scenario that matters in practice because the distribution of the query data is rarely known in advance.
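A sketch of this leave-one-out aggregation is given below, assuming per-dataset probe features and binary hallucination labels; the probe class mirrors the logistic regression setup above.

# Sketch of the leave-one-out generalization evaluation: train on one dataset,
# evaluate AUROC on every other dataset, then average per evaluation dataset.
import numpy as np
from collections import defaultdict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def leave_one_out_auroc(datasets):
    """datasets maps name -> (features, binary hallucination labels)."""
    per_eval = defaultdict(list)
    for train_name, (features_train, labels_train) in datasets.items():
        probe = LogisticRegression(max_iter=1000).fit(features_train, labels_train)
        for eval_name, (features_eval, labels_eval) in datasets.items():
            if eval_name == train_name:
                continue
            scores = probe.predict_proba(features_eval)[:, 1]
            per_eval[eval_name].append(roc_auc_score(labels_eval, scores))
    return {name: float(np.mean(vals)) for name, vals in per_eval.items()}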

Appendix C Compute Resources

We make use of an internal cluster with 24 Nvidia A100 80GB GPUs. We further use GPT-3.5, GPT-4, and GPT-4o via the OpenAI API.

For experiments with the Llama 70B models, we require two A100s for inference and for computing the hidden states. The smaller models require only a slice of a single A100 80GB. However, once the training data for the semantic entropy probes has been created, a CPU-only machine is sufficient to fit the logistic regression models.

Based on tracked finished runs, we estimate ~300 GPU-hours plus ~310 CPU-hours to obtain the results in the paper.