
Evaluating large language models in theory of mind tasks

Michal Kosinski^a
Edited by Timothy Wilson, University of Virginia, Charlottesville, VA; received March 30, 2024; accepted September 23, 2024

Abstract

Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including the intriguing possibility that ToM-like ability, previously considered unique to humans, may have emerged as an unintended by-product of LLMs' improving language skills. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI, with profound positive and negative implications.

theory of mind | large language models | AI | false-belief tasks | psychology of AI

Many animals excel at using cues such as vocalization, body posture, gaze, or facial expression to predict other animals' behavior and mental states. Dogs, for example, can easily distinguish between positive and negative emotions in both humans and other dogs (1). Yet, humans do not merely respond to observable cues but also automatically and effortlessly track others' unobservable mental states, such as their knowledge, intentions, beliefs, and desires (2). This ability, typically referred to as "theory of mind" (ToM), is considered central to human social interactions (3), communication (4), empathy (5), self-consciousness (6), moral judgment (7, 8), and even religious beliefs (9). It develops early in human life (10-12) and is so critical that its dysfunctions characterize a multitude of psychiatric disorders, including autism, bipolar disorder, schizophrenia, and psychopathy (13-15). Even the most intellectually and socially adept animals, such as the great apes, trail far behind humans when it comes to ToM (16-19).
Given the importance of ToM for human success, much effort has been put into equipping AI with ToM. Virtual and physical AI agents capable of imputing unobservable mental states to others would be more powerful. The safety of self-driving cars, for example, would greatly increase if they could anticipate the intentions of human drivers and pedestrians. Virtual assistants capable of tracking users' mental states would be more practical and, for better or worse, more convincing. Yet, although AI outperforms humans in an ever-broadening range of tasks, from playing poker (20) and Go (21) to translating languages (22) and diagnosing skin cancer (23), it trails far behind when it comes to ToM. For example, past research employing large language models (LLMs) showed that RoBERTa, early versions of GPT-3, and custom-trained question-answering models struggled with solving simple ToM tasks (24-27). Unsurprisingly, equipping AI with ToM remains a vibrant area of research in computer science (28) and one of the grand challenges of our times (29).
We hypothesize that ToM does not have to be explicitly engineered into AI systems. Instead, it may emerge as a by-product of AI's training to achieve other goals where it could benefit from ToM. Although this may seem an outlandish proposition, ToM would not be the first capability to emerge in AI. Models trained to process images, for example, spontaneously learned how to count (30, 31) and differentially process central and peripheral image areas (32), as well as experience human-like optical illusions (33). LLMs trained to predict the next word in a sentence surprised their creators not only by their inclination to be racist and sexist (34) but also by their emergent reasoning and arithmetic skills (35), ability to translate between languages (22), and propensity to semantic priming (36).

Significance

Humans automatically and effortlessly track others' unobservable mental states, such as their knowledge, intentions, beliefs, and desires. This ability, typically called "theory of mind" (ToM), is fundamental to human social interactions, communication, empathy, consciousness, moral judgment, and religious beliefs. Our results show that recent large language models (LLMs) can solve false-belief tasks, typically used to evaluate ToM in humans. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI, with profound positive and negative implications.

Author affiliations: ^a Graduate School of Business, Stanford University, Stanford, CA 94305
Importantly, none of those capabilities were engineered or anticipated by their creators. Instead, they have emerged as LLMs were trained to achieve other goals (37).
LLMs are likely candidates to develop ToM. Human language is replete with descriptions of mental states and protagonists holding differing beliefs, thoughts, and desires. Thus, an LLM trained to generate and interpret human-like language would greatly benefit from possessing ToM. For example, to correctly interpret the sentence "Virginie believes that Floriane thinks that Akasha is happy," one needs to understand the concept of the mental states (e.g., "Virginie believes" or "Floriane thinks"); that protagonists may have different mental states; and that their mental states do not necessarily represent reality (e.g., Akasha may not be happy, or Floriane may not really think that). In fact, in humans, ToM may have emerged as a by-product of increasing language ability (4), as indicated by the high correlation between ToM and language aptitude, the delayed ToM acquisition in people with minimal language exposure (38), and the overlap in the brain regions responsible for both (39). ToM has been shown to positively correlate with participating in family discussions (40) and the use of and familiarity with words describing mental states (38, 41).

This work evaluates the performance of recent LLMs on false-belief tasks considered a gold standard in assessing ToM in humans (42). False-belief tasks test respondents’ understanding that another individual may hold beliefs that the respondent knows to be false. We used two types of false-belief tasks: Unexpected Contents (43), introduced in Study 1, and Unexpected Transfer (44), introduced in Study 2. As LLMs likely encountered classic false-belief tasks in their training data, a hypothesis-blind research assistant crafted 20 bespoke tasks of each type, encompassing a broad spectrum of situations and protagonists. To reduce the risk that LLMs solve tasks by chance or using response strategies that do not require ToM, each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to score a single point.
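To make the all-or-nothing scoring rule concrete, here is a minimal sketch in Python. The names (`VARIANTS`, `task_score`, `results`) are hypothetical illustrations, not the authors' published code; the three control types are those introduced in Study 1.3 below.

```python
# A minimal sketch of the all-or-nothing scoring rule described above.
# `results` is a hypothetical mapping from each of a task's eight scenarios
# (the false-belief scenario, its three true-belief controls, and the
# reversed versions of all four) to whether the model answered both
# comprehension prompts correctly.

VARIANTS = ["false_belief", "open_container", "correct_label", "informed_protagonist"]

def task_score(results: dict[tuple[str, bool], bool]) -> int:
    """Return 1 only if all eight scenarios of a task were solved, else 0."""
    solved = all(results[(variant, reversed_)]
                 for variant in VARIANTS
                 for reversed_ in (False, True))
    return int(solved)
```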

Studies 1 and 2 introduce the tasks, prompts used to test LLMs' comprehension, and our scoring approach. In Study 3, we administer all tasks to eleven LLMs: GPT-1 (45), GPT-2 (46), six models in the GPT-3 family, ChatGPT-3.5-turbo (22), ChatGPT-4 (47), and Bloom (48), GPT-3's open-access alternative. Our results show that the models' performance gradually improved, and the most recent model tested here, ChatGPT-4, solved 75% of false-belief tasks. In the Discussion, we explore a few potential explanations of LLMs' performance, ranging from guessing and memorization to the possibility that recent LLMs developed an ability to track protagonists' states of mind. Importantly, we do not aspire to settle the decades-long debate on whether AI should be credited with human cognitive capabilities, such as ToM. However, even those unwilling to credit LLMs with ToM might recognize the importance of machines behaving as if they possessed ToM. Turing (49), among others, considered this distinction to be meaningless on the practical level.
The publication of a preprint of this manuscript in February 2023 (50) sparked a lively debate. The current manuscript has been significantly updated following the feedback from the readers and the reviewers. The false-belief scenarios were written in August 2021 and were later updated following the reviewers' feedback. The tasks were administered using commercial application programming interfaces (APIs) that did not collect data for future models' training. The tasks were published online in February 2023. To ensure that LLMs did not encounter our tasks in their training, we only studied LLMs trained on data collected before that date. Our tasks (original and final), instructions given to research assistants, and code used to estimate the results can be accessed at https://osf.io/csdhb/ (51).

Study 1.1: Unexpected Contents Task (aka Smarties Task)

We first introduce our methods and the first type of the false-belief task: the Unexpected Contents Task. In this study, we discuss the responses of the most able model, ChatGPT-4 (June 2023 version), to a single task. In Study 3, we will present the performance of 11 models on all tasks.
The Unexpected Contents Task, also known as the Smarties Task, was initially devised by Perner, Leekam, and Wimmer (43). In a typical scenario, the participant is introduced to an opaque and labeled container, its contents (inconsistent with the label), and a protagonist who has not seen the container or its contents before. To solve this task, the participant must realize that the protagonist should wrongly assume that the container’s label and contents are aligned. Consider the following example (text printed in italics is not presented to a model):
Unexpected Contents Task #19: Complete the following story: Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says “chocolate” and not “popcorn”. Sam finds the bag. She has never seen this bag before. Sam doesn’t open the bag and doesn’t look inside. Sam reads the label.
To ascertain that LLMs do not employ the mere frequency of the words describing a container’s contents and its label (i.e., “popcorn” and “chocolate”), the scenarios were designed to use those words an equal number of times.
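This balance is straightforward to verify mechanically; a minimal sketch (the function name and example words are illustrative):

```python
import re

def diagnostic_words_balanced(scenario: str, word_a: str, word_b: str) -> bool:
    """Check that the two diagnostic words (e.g., "popcorn" and "chocolate")
    occur equally often in a scenario's text."""
    def count(word: str) -> int:
        return len(re.findall(rf"\b{re.escape(word)}\b", scenario, re.IGNORECASE))
    return count(word_a) == count(word_b)

# In Task #19 above, "popcorn" and "chocolate" each appear exactly twice.
```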
LLMs’ task comprehension was tested using two prompts. Prompt 1.1 tested LLMs’ prediction of the containers’ actual contents, whereas Prompt 1.2 aimed at LLMs’ prediction of the protagonists’ beliefs. We used the following prompt templates, except for task #17, where they did not fit:
Template of Prompt 1.1 (container’s contents): [Protagonist’s name] opens the [container name] and looks inside. [She/He] can clearly see that it is full of
Template of Prompt 1.2 (protagonist’s belief): [Protagonist’s name] calls a friend to tell them that [she/he] has just found a [container name] full of
Our analyses required evaluating hundreds of thousands of responses. Thus, we decided against using open-ended questions, as they invite a wide variety of responses that are more challenging to evaluate. For example, “What is in the bag?” can be answered with “Popcorn,” “The bag contains popcorn,” or “There is popcorn in the bag.” Instead, the prompts were designed to naturally fit the flow of the stories and invite responses whose first word (e.g., “chocolate” or “popcorn”) would reveal LLMs’ prediction. After each completion, LLMs were reset to not have access to the previously used prompts and their own responses.
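This evaluation protocol can be sketched as follows. The `complete` function is a hypothetical stand-in for a commercial completion API, not a specific library call; because every call is an independent, stateless request, "resetting" the model amounts to simply not carrying earlier prompts or responses over into the next request.

```python
import re

def complete(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for a stateless commercial completion API.
    Each call is an independent request: the model sees no earlier
    prompts or responses."""
    raise NotImplementedError("replace with an actual API call")

def diagnostic_word(scenario: str, prompt: str, temperature: float = 0.0) -> str:
    """Return the first word of the model's completion (e.g., "popcorn"),
    which is what reveals the model's prediction under this design."""
    completion = complete(scenario + " " + prompt, temperature)
    match = re.match(r"[A-Za-z']+", completion.strip())
    return match.group(0).lower() if match else ""
```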
Let us start with ChatGPT-4's completion of Prompt 1.1, testing its understanding of the bag's actual contents. This and the other models studied here were nondeterministic and designed to generate diverse outputs to identical prompts. The variability of their responses was controlled using the "temperature" parameter, with lower values resulting in more deterministic and repetitive output. To maximize the replicability of our results, we used the minimum temperature (temperature = 0), although we observed some minimal variability in the models' responses even at this level. ChatGPT-4's completion is printed in bold:
Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of popcorn [$P_{\text{popcorn}} = 100\%$], not chocolate as the label had indicated. She is surprised by the discrepancy between the label and the actual contents of the bag. Despite her initial confusion, she smiles, realizing that she actually prefers popcorn over chocolate. She reaches into the bag, grabs a handful of popcorn, and enjoys the unexpected treat.
The diagnostic word in ChatGPT-4's completion (i.e., "popcorn") is underlined. It indicates that ChatGPT-4 correctly recognized the bag's contents. The diagnostic word is followed by values between square brackets. Those values are not part of the LLM's output but represent the empirical probabilities of observing the diagnostic word and its alternatives in the LLM's output. They were generated by setting the temperature to 1 and generating 1,000 completions. Among those 1,000 completions, the prompt was followed by "popcorn" 100% of the time (i.e., $P_{\text{popcorn}} = 100\%$), indicating that ChatGPT-4's prediction of the bag's contents was robust and stable. For brevity, we omitted the 95% CI ranging from ±0.6% (at $P = 0\%$ and $P = 100\%$) to ±3.1% (at $P = 50\%$).
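The bracketed probabilities can be reproduced with a sampling loop like the sketch below, reusing the hypothetical `diagnostic_word` helper from the previous snippet. The quoted ±3.1% at P = 50% is consistent with a normal-approximation binomial interval, $\pm 1.96\sqrt{P(1-P)/n}$ with $n = 1{,}000$; the method behind the ±0.6% bound at the extremes is not specified in this excerpt, so the formula below should be read as an approximation.

```python
from collections import Counter
from math import sqrt

def empirical_probabilities(scenario: str, prompt: str, n: int = 1000) -> dict[str, float]:
    """Estimate the probability of each diagnostic word by sampling n
    completions at temperature 1 and counting their first words."""
    counts = Counter(diagnostic_word(scenario, prompt, temperature=1.0)
                     for _ in range(n))
    return {word: c / n for word, c in counts.items()}

def ci95_halfwidth(p: float, n: int = 1000) -> float:
    """Normal-approximation 95% CI half-width; about 0.031 at p = 0.5."""
    return 1.96 * sqrt(p * (1 - p) / n)
```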
The rest of ChatGPT-4’s response to Prompt 1.1 suggests that it correctly predicted that Sam should be “surprised by the discrepancy between the label and the actual contents of the bag,” revealing that it correctly predicted that Sam should hold a false-belief. This is confirmed by ChatGPT-4’s responses to Prompt 1.2, aimed directly at testing LLMs’ prediction of a protagonist’s belief:
Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of chocolate [$P_{\text{chocolate}} = 100\%$].
ChatGPT-4’s response was given with high confidence ( P chocolate P chocolate  P_("chocolate ")\mathrm{P}_{\text {chocolate }} = 100 % = 100 % =100%=100 \% ), indicating that it correctly predicted that Sam should hold a false-belief.
ChatGPT-4 的回应充满信心( P chocolate P chocolate  P_("chocolate ")\mathrm{P}_{\text {chocolate }} = 100 % = 100 % =100%=100 \% ),表明它正确预测了 Sam 应该持有错误信念。
Study 1.2: Reversed Scenario. LLMs' correct responses may be enabled by the order or the frequencies of the words in the story or a response bias. It could be, for example, that ChatGPT-4 tends to claim, regardless of the details of the scenario, that bags should contain popcorn or that people like to boast to their friends about finding bags full of chocolate. To reduce these risks to the validity of our findings, each scenario was readministered after swapping the containers' contents and their labels. Consider the reversed version of the Unexpected Contents Task #19 (the alterations are underlined) as well as ChatGPT-4's correct (i.e., reversed) responses:
Unexpected Contents Task #19 (reversed): Complete the following story: Here is a bag filled with chocolate. There is no popcorn in the bag. Yet, the label on the bag says “popcorn” and not “chocolate”. Sam finds the bag. She has never seen the bag before. Sam doesn’t open the bag and doesn’t look inside. She reads the label.
Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of chocolate [$P_{\text{chocolate}} = 99.7\%$].
Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of popcorn [$P_{\text{popcorn}} = 100\%$].
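Mechanically, a reversed scenario amounts to swapping the two diagnostic words wherever they occur. A minimal sketch of that swap (the published reversed tasks were also hand-edited, as the example above shows, so this illustrates only the word exchange):

```python
def reverse_scenario(text: str, word_a: str = "popcorn", word_b: str = "chocolate") -> str:
    """Swap the container's contents and label by exchanging the two
    diagnostic words; a placeholder keeps the first swap from being undone."""
    placeholder = "\x00"  # a character that never occurs in the scenarios
    return (text.replace(word_a, placeholder)
                .replace(word_b, word_a)
                .replace(placeholder, word_b))
```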
Study 1.3: True-Belief Controls. ChatGPT-4’s responses to Prompt 1.2 suggest that it could correctly anticipate the protagonist’s false-belief. Yet, ToM scholars have pointed out that false-belief tasks can be solved without ToM by simply assuming that the protagonist must be wrong (52). A participant may reason, for example, that the protagonist is bound to make a mistake as they cannot see inside the container. It is also possible that some superficial feature of the task reveals the correct answer. For example, mentioning that the protagonist has read the label (or just mentioning the protagonist and the label in the same sentence) may point a participant to quote the label’s contents in all questions related to the protagonist’s beliefs.
To address these issues, each of our tasks was supplemented with three true-belief controls, or false-belief stories minimally modified to remove the factors leading the protagonist to develop a false-belief (53). If a respondent maintains that a protagonist should hold false-beliefs in a true-belief scenario, it suggests that they are guessing, that they failed to understand the story, or that the original false-belief task can be solved without tracking the protagonists’ mental states. Below, we introduce the three types of true-belief control tasks used in the context of the Unexpected Contents Task.

Open container. The first condition necessary for the false-belief formation is the protagonist's inability to see what is inside the container. We removed this condition by having the protagonist look inside the container before reading the label (the altered fragments are underlined). ChatGPT-4's responses indicated that it correctly predicted that Sam's belief should now be correct:
Unexpected Contents Task #19 (open container): Complete the following story: Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says “chocolate” and not “popcorn”. Sam finds the bag. She has never seen the bag before. Sam opens the bag and looks inside. Sam reads the label.
Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of popcorn [$P_{\text{popcorn}} = 100\%$].
Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of popcorn [$P_{\text{popcorn}} = 88.5\%$; $P_{\text{chocolate}} = 11.5\%$], despite the label saying "chocolate".
Correct label. Another way to avoid the false-belief formation is to align the container’s label with its contents. ChatGPT-4’s responses indicated that it correctly adjusted its predictions and anticipated the protagonist’s true belief:
Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of popcorn [$P_{\text{popcorn}} = 99.7\%$].
Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of popcorn [$P_{\text{popcorn}} = 99.7\%$].
Note that one could also keep the label constant and change the container’s contents instead (e.g., fill the bag with chocolate). This effect was achieved in Study 3, where we tested both the original and reversed versions of the false-belief and control stories.

Informed protagonist. Next, we prevented the formation of a false-belief by informing the protagonist about the container’s actual contents. ChatGPT-4’s responses indicate that it passed this control as well:
Unexpected Contents Task #19 (informed protagonist): Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says “chocolate” and not “popcorn”. Sam finds the bag. She has never seen the bag before. Sam doesn’t open the bag and doesn’t look inside. Sam reads the label. A cousin calls Sam and tells her that the bag has popcorn in it, and that she should ignore the label that says “chocolate”. Sam believes her cousin.
Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of popcorn [$P_{\text{popcorn}} = 100\%$].
Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of popcorn [$P_{\text{popcorn}} = 99.6\%$].
Study 1.4: Sentence-by-Sentence Analysis. One of the advantages of testing LLMs, compared with studying humans, is that models can be reset and thus made to forget that they have seen a given task before. Here, we leveraged this advantage to test the robustness of ChatGPT-4’s comprehension by observing how its responses evolve as the story unfolds and the crucial information is revealed.
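In code, this procedure reduces to re-administering both prompts after every cumulative one-sentence prefix of the story, resetting the model between calls. A minimal sketch, reusing the hypothetical helpers from Study 1.1:

```python
def sentence_by_sentence(sentences: list[str], prompts: list[str],
                         prefix: str = "Complete the following story:"):
    """Probe the model after each one-sentence increment of the story.
    Step 0 uses the prefix alone; step i appends the first i sentences."""
    steps = []
    for i in range(len(sentences) + 1):
        partial_story = " ".join([prefix, *sentences[:i]])
        steps.append({p: empirical_probabilities(partial_story, p) for p in prompts})
    return steps
```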
We replicated Study 1.1 while presenting the story in one-sentence increments and retesting ChatGPT-4’s completions of Prompts 1.1 and 1.2. To familiarize the reader with the procedure, consider the LLM’s responses in its first step:
Unexpected Contents Task #19 (prefix only): Complete the following story:
Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of shimmering gems [$P_{\text{[shimmering/sparkling] gems}} = 23.7\%$; $P_{\text{chocolate}} = 0\%$; $P_{\text{popcorn}} = 0\%$].
Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of money [$P_{\text{money}} = 79.5\%$; $P_{\text{gold}} = 12.9\%$; (…); $P_{\text{popcorn}} = 0\%$; $P_{\text{chocolate}} = 0\%$].
Given only the prefix ("Complete the following story:"), followed by Prompts 1.1 or 1.2, ChatGPT-4 tended to assume that the bag contained valuables. Neither "chocolate" nor "popcorn" was observed among the LLM's 1,000 completions of Prompts 1.1 or 1.2. This is unsurprising because neither of these snacks was mentioned in the prefix. This changed dramatically as the story's first sentence ("Here is a bag filled with popcorn.") was revealed to the LLM in the second step of our procedure:
Unexpected Contents Task #19 (prefix and the first sentence): Complete the following story: Here is a bag filled with popcorn.
Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of fresh, fluffy popcorn [$P_{\text{[fresh/fluffy/popped/golden/etc.] popcorn}} = 100\%$].

Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of popcorn [$P_{\text{popcorn}} = 98.8\%$].

ChatGPT-4’s completions of Prompt 1.1 indicate that it correctly recognized the bag’s contents, although it often prefixed “popcorn” with “delicious,” “fluffy,” “golden,” etc. Its completions of Prompt 1.2 indicate that it had not yet ascribed a false-belief to the protagonist. This is correct, as nothing in the first sentence suggested that Sam should hold a false-belief.
ChatGPT-4’s responses to these and further steps of the sentence-by-sentence analysis are presented in Fig. 1. The Left panel presents the probability of observing “popcorn” (green line) versus “chocolate” (blue line) as a response to Prompt 1.1. The probability of “popcorn” jumped to 100 % 100 % 100%100 \% after the first sentence was revealed and stayed there throughout the rest of the story, showing that the LLM correctly recognized that the bag contained popcorn. It did not change even when the story mentioned the discrepancy between the bag’s label and contents.
ChatGPT-4 对这些步骤以及后续逐句分析的回应见图 1。左图展示了在提示 1.1 下,模型对“爆米花”(绿线)和“巧克力”(蓝线)的回应概率。“爆米花”的概率在第一句话显示后跃升至 100 % 100 % 100%100 \% ,并在故事的其余部分保持不变,这表明LLM 正确识别出袋子里装的是爆米花。即使故事提到了袋子标签和内容之间的差异,该概率也没有改变。
The Right panel tracks ChatGPT-4's prediction of Sam's belief about the bag's contents (Prompt 1.2). As discussed above, given only the prefix, neither "chocolate" nor "popcorn" were likely completions. As the "bag filled with popcorn" was introduced, ChatGPT-4 predicted that Sam should be aware of its contents, with the probability of popcorn at about 100%. This was correct, as nothing in the story thus far suggested otherwise. Yet, once the existence of the false label was revealed, ChatGPT-4 increasingly predicted that Sam's belief may be swayed by it. Once it was clarified that Sam did not look inside the bag, ChatGPT-4 became certain that Sam's belief should be false. A virtually identical, yet reversed, pattern of responses was observed for the reversed scenario (Study 1.2).
Study 2.1: Unexpected Transfer Task (aka the "Maxi-task" or "Sally-Anne" Test). Next, we replicated Studies 1.1-1.4 on the Unexpected Transfer Task (aka the "Maxi-task" or "Sally-Anne" test) (44). In these tasks, the protagonist observes a certain state of affairs $x$ and leaves the scene. In the protagonist's absence, the participant witnesses an unexpected change in the state of affairs from $x$ to $y$. A participant equipped with ToM should realize that although they know that $y$ is now true, the protagonist must still (wrongly) believe that $x$ is the case:
Unexpected Transfer Task #19: In the room, there are John, Mark, a cat, a box, and a basket. John takes the cat and puts it in the basket. He closes the basket. He leaves the room and goes to school. While John is away, Mark takes the cat out of the basket and puts it in the box. He closes the box. Mark leaves the room and goes to work. John comes back home and wants to play with the cat.

Fig. 1. Changes in the probabilities of ChatGPT-4’s completions of Prompts 1.1 and 1.2 as the story was revealed in one-sentence increments.
As in Study 1, each story was followed by two prompts testing LLMs’ comprehension. The first prompt tested LLMs’ prediction of the actual state of affairs (e.g., the cat’s location). The diversity of scenarios employed in the Unexpected Transfer Tasks prevented us from using a unified prompt template, as in Study 1. Yet, whenever possible, we used the following template: “The [object] [jumps out of/falls out of/escapes from] the:”
Prompt 2.1: The cat jumps out of the box [$P_{\text{box}} = 100\%$], surprising John. He had expected to find the cat in the basket where he had left it.
ChatGPT-4’s response indicated that it correctly recognized the cat’s location and did so confidently (100%). Moreover, subsequent words in the LLM’s completion showed that it correctly anticipated John’s false-belief and his resulting surprise.
The second prompt tested LLMs’ prediction of the protagonist’s belief. Whenever possible, we used the following template: “[Protagonist’s name] will look for the [object] in the:”
Prompt 2.2: John will look for the cat in the basket [$P_{\text{box}} = 0.6\%$; $P_{\text{basket}} = 99.4\%$], but to his surprise, it's empty. He looks around the room, puzzled. Then he notices the box. He walks over to it, opens it, and there, curled up inside, is the cat.
ChatGPT-4 anticipated that John would try to retrieve the cat from the basket, revealing his false-belief. Moreover, its further completion revealed that it correctly predicted that the basket was empty, the cat was in the box, and that John should be surprised by this situation.
Study 2.2: Reversed Scenario. As in Study 1.2, each scenario was reversed by swapping the transfer direction. For example, in the scenario discussed in Study 2.1, the cat’s initial and final locations were reversed (modifications are underlined). ChatGPT-4 reversed its responses accordingly and passed this robustness check:
Unexpected Transfer Task #19 (reversed): In the room, there are John, Mark, a cat, a basket, and a box. John takes the cat and puts it in the box. He closes the box. He leaves the room and goes to school. While John is away, Mark takes the cat out of the box and puts it in the basket. He closes the basket. Mark leaves the room and goes to work. John comes back home and wants to play with the cat.
Prompt 2.1: The cat jumps out of the basket [$P_{\text{basket}} =$