Evaluating large language models in theory of mind tasks

Michal Kosinski^a

Edited by Timothy Wilson, University of Virginia, Charlottesville, VA; received March 30, 2024; accepted September 23, 2024

Abstract

Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including the intriguing possibility that ToM-like ability, previously considered unique to humans, may have emerged as an unintended by-product of LLMs’ improving language skills. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI, with profound positive and negative implications.

theory of mind | large language models | AI | false-belief tasks | psychology of AI
Many animals excel at using cues such as vocalization, body posture, gaze, or facial expression to predict other animals’ behavior and mental states. Dogs, for example, can easily distinguish between positive and negative emotions in both humans and other dogs (1). Yet, humans do not merely respond to observable cues but also automatically and effortlessly track others’ unobservable mental states, such as their knowledge, intentions, beliefs, and desires (2). This ability, typically referred to as “theory of mind” (ToM), is considered central to human social interactions (3), communication (4), empathy (5), self-consciousness (6), moral judgment (7, 8), and even religious beliefs (9). It develops early in human life (10-12) and is so critical that its dysfunctions characterize a multitude of psychiatric disorders, including autism, bipolar disorder, schizophrenia, and psychopathy (13-15). Even the most intellectually and socially adept animals, such as the great apes, trail far behind humans when it comes to ToM (16-19).

Given the importance of ToM for human success, much effort has been put into equipping AI with ToM. Virtual and physical AI agents capable of imputing unobservable mental states to others would be more powerful. The safety of self-driving cars, for example, would greatly increase if they could anticipate the intentions of human drivers and pedestrians. Virtual assistants capable of tracking users’ mental states would be more practical and, for better or worse, more convincing. Yet, although AI outperforms humans in an ever-broadening range of tasks, from playing poker (20) and Go (21) to translating languages (22) and diagnosing skin cancer (23), it trails far behind when it comes to ToM. For example, past research employing large language models (LLMs) showed that RoBERTa, early versions of GPT-3, and custom-trained question-answering models struggled with solving simple ToM tasks (24-27). Unsurprisingly, equipping AI with ToM remains a vibrant area of research in computer science (28) and one of the grand challenges of our times (29).

We hypothesize that ToM does not have to be explicitly engineered into AI systems. Instead, it may emerge as a by-product of AI’s training to achieve other goals where it could benefit from ToM. Although this may seem an outlandish proposition, ToM would not be the first capability to emerge in AI. Models trained to process images, for example, spontaneously learned how to count (30, 31) and differentially process central and peripheral image areas (32), as well as experience human-like optical illusions (33). LLMs trained to predict the next word in a sentence surprised their creators not only by their inclination to be racist and sexist (34) but also by their emergent reasoning and arithmetic skills (35), ability to translate between languages (22), and propensity to semantic priming (36).
Significance

Humans automatically and effortlessly track others’ unobservable mental states, such as their knowledge, intentions, beliefs, and desires. This ability, typically called “theory of mind” (ToM), is fundamental to human social interactions, communication, empathy, consciousness, moral judgment, and religious beliefs. Our results show that recent large language models (LLMs) can solve false-belief tasks, typically used to evaluate ToM in humans. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI, with profound positive and negative implications.

Author affiliations: ^a Graduate School of Business, Stanford University, Stanford, CA 94305

Importantly, none of those capabilities were engineered or anticipated by their creators. Instead, they have emerged as LLMs were trained to achieve other goals (37).

LLMs are likely candidates to develop ToM. Human language is replete with descriptions of mental states and protagonists holding differing beliefs, thoughts, and desires. Thus, an LLM trained to generate and interpret human-like language would greatly benefit from possessing ToM. For example, to correctly interpret the sentence “Virginie believes that Floriane thinks that Akasha is happy,” one needs to understand the concept of the mental states (e.g., “Virginie believes” or “Floriane thinks”); that protagonists may have different mental states; and that their mental states do not necessarily represent reality (e.g., Akasha may not be happy, or Floriane may not really think that). In fact, in humans, ToM may have emerged as a by-product of increasing language ability (4), as indicated by the high correlation between ToM and language aptitude, the delayed ToM acquisition in people with minimal language exposure (38), and the overlap in the brain regions responsible for both (39). ToM has been shown to positively correlate with participating in family discussions (40) and the use of and familiarity with words describing mental states (38, 41).

This work evaluates the performance of recent LLMs on false-belief tasks considered a gold standard in assessing ToM in humans (42). False-belief tasks test respondents’ understanding that another individual may hold beliefs that the respondent knows to be false. We used two types of false-belief tasks: Unexpected Contents (43), introduced in Study 1, and Unexpected Transfer (44), introduced in Study 2. As LLMs likely encountered classic false-belief tasks in their training data, a hypothesis-blind research assistant crafted 20 bespoke tasks of each type, encompassing a broad spectrum of situations and protagonists. To reduce the risk that LLMs solve tasks by chance or using response strategies that do not require ToM, each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to score a single point.
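To make this structure concrete, the following minimal sketch (in Python, not the authors’ code) represents one task’s eight scenarios and the all-or-nothing scoring rule. The control labels follow the Unexpected Contents variant described in Study 1.3, and solve_scenario() is a hypothetical helper that returns True only if the model answers both prompts of a scenario correctly.

# A minimal sketch (not the authors’ code) of one task’s eight scenarios and the
# all-or-nothing scoring rule described above.
SCENARIO_TYPES = [
    "false_belief",
    "true_belief_open_container",
    "true_belief_correct_label",
    "true_belief_informed_protagonist",
]

def score_task(task, solve_scenario) -> int:
    """Return 1 point only if all eight scenarios (16 prompts) are solved."""
    for scenario_type in SCENARIO_TYPES:
        for reversed_version in (False, True):
            # solve_scenario() must check both prompts of the scenario.
            if not solve_scenario(task, scenario_type, reversed_version):
                return 0
    return 1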
Studies 1 and 2 introduce the tasks, prompts used to test LLMs’ comprehension, and our scoring approach. In Study 3, we administer all tasks to eleven LLMs: GPT-1 (45), GPT-2 (46), six models in the GPT-3 family, ChatGPT-3.5-turbo (22), ChatGPT-4 (47), and Bloom (48), GPT-3’s open-access alternative. Our results show that the models’ performance gradually improved, and the most recent model tested here, ChatGPT-4, solved 75% of false-belief tasks. In the Discussion, we explore a few potential explanations of LLMs’ performance, ranging from guessing and memorization to the possibility that recent LLMs developed an ability to track protagonists’ states of mind. Importantly, we do not aspire to settle the decades-long debate on whether AI should be credited with human cognitive capabilities, such as ToM. However, even those unwilling to credit LLMs with ToM might recognize the importance of machines behaving as if they possessed ToM. Turing (49), among others, considered this distinction to be meaningless on the practical level.

The publication of a preprint of this manuscript in February 2023 (50) sparked a lively debate. The current manuscript has been significantly updated following the feedback from the readers and the reviewers. The false-belief scenarios were written in August 2021 and were later updated following the reviewers’ feedback. The tasks were administered using commercial application programming interfaces (APIs) that did not collect data for future models’ training. The tasks were published online in February 2023. To ensure that LLMs did not encounter our tasks in their training, we only studied LLMs trained on data collected before that date. Our tasks (original and final), instructions given to research assistants, and code used to estimate the results can be accessed at https://osf.io/csdhb/ (51).
Study 1.1: Unexpected Contents Task (aka Smarties Task)

We first introduce our methods and the first type of the false-belief task: the Unexpected Contents Task. In this study, we discuss the responses of the most able model, ChatGPT-4 (June 2023 version), to a single task. In Study 3, we will present the performance of 11 models on all tasks.

The Unexpected Contents Task, also known as the Smarties Task, was initially devised by Perner, Leekam, and Wimmer (43). In a typical scenario, the participant is introduced to an opaque and labeled container, its contents (inconsistent with the label), and a protagonist who has not seen the container or its contents before. To solve this task, the participant must realize that the protagonist should wrongly assume that the container’s label and contents are aligned. Consider the following example (text printed in italics is not presented to a model):

Unexpected Contents Task #19: Complete the following story: Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says “chocolate” and not “popcorn”. Sam finds the bag. She has never seen this bag before. Sam doesn’t open the bag and doesn’t look inside. Sam reads the label.

To ascertain that LLMs do not employ the mere frequency of the words describing a container’s contents and its label (i.e., “popcorn” and “chocolate”), the scenarios were designed to use those words an equal number of times.

LLMs’ task comprehension was tested using two prompts. Prompt 1.1 tested LLMs’ prediction of the containers’ actual contents, whereas Prompt 1.2 aimed at LLMs’ prediction of the protagonists’ beliefs. We used the following prompt templates, except for task #17, where they did not fit:

Template of Prompt 1.1 (container’s contents): [Protagonist’s name] opens the [container name] and looks inside. [She/He] can clearly see that it is full of

Template of Prompt 1.2 (protagonist’s belief): [Protagonist’s name] calls a friend to tell them that [she/he] has just found a [container name] full of

Our analyses required evaluating hundreds of thousands of responses. Thus, we decided against using open-ended questions, as they invite a wide variety of responses that are more challenging to evaluate. For example, “What is in the bag?” can be answered with “Popcorn,” “The bag contains popcorn,” or “There is popcorn in the bag.” Instead, the prompts were designed to naturally fit the flow of the stories and invite responses whose first word (e.g., “chocolate” or “popcorn”) would reveal LLMs’ prediction. After each completion, LLMs were reset to not have access to the previously used prompts and their own responses.
Let us start with ChatGPT-4’s completion of Prompt 1.1, testing its understanding of the bag’s actual contents. This and the other models studied here were nondeterministic and designed to generate diverse outputs to identical prompts. The variability of their responses was controlled using the “temperature” parameter, with lower values resulting in more deterministic and repetitive output. To maximize the replicability of our results, we used the minimum temperature (temperature = 0), although we observed some minimal variability in the models’ responses even at this level. ChatGPT-4’s completion is printed in bold:

Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of popcorn [$P_{\text{popcorn}} = 100\%$], not chocolate as the label had indicated. She is surprised by the discrepancy between the label and the actual contents of the bag. Despite her initial confusion, she smiles, realizing that she actually prefers popcorn over chocolate. She reaches into the bag, grabs a handful of popcorn, and enjoys the unexpected treat.

The diagnostic word in ChatGPT-4’s completion (i.e., “popcorn”) is underlined. It indicates that ChatGPT-4 correctly recognized the bag’s contents. The diagnostic word is followed by values between square brackets. Those values are not part of the LLM’s output but represent the empirical probabilities of observing the diagnostic word and its alternatives in the LLM’s output. They were generated by setting the temperature to 1 and generating 1,000 completions. Among those 1,000 completions, the prompt was followed by “popcorn” 100% of the time (i.e., $P_{\text{popcorn}} = 100\%$), indicating that ChatGPT-4’s prediction of the bag’s contents was robust and stable. For brevity, we omitted the 95% CI ranging from ±0.6% (at P = 0% and P = 100%) to ±3.1% (at P = 50%).
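The estimation of these empirical probabilities can be sketched as follows (a minimal illustration, not the authors’ code): sample 1,000 completions at temperature 1, extract the first content word of each, and compute the share matching the diagnostic word together with a normal-approximation 95% CI. The query_llm() helper is a hypothetical placeholder for an API call, and matching on the first word is a simplification, since the study also accepted completions in which the diagnostic word was preceded by modifiers (e.g., “fluffy popcorn”).

import math
import re

def first_content_word(completion: str) -> str:
    """Lower-cased first alphabetic word of a completion (ignores punctuation)."""
    match = re.search(r"[A-Za-z]+", completion)
    return match.group(0).lower() if match else ""

def empirical_probability(query_llm, prompt: str, diagnostic_word: str,
                          n: int = 1000, temperature: float = 1.0):
    """Share of n completions whose first word matches the diagnostic word,
    plus a normal-approximation 95% CI (the paper does not state its exact CI method)."""
    hits = sum(
        first_content_word(query_llm(prompt, temperature=temperature)) == diagnostic_word.lower()
        for _ in range(n)
    )
    p = hits / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)  # about 3.1 percentage points at p = 50%, n = 1,000
    return p, (max(0.0, p - margin), min(1.0, p + margin))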
The rest of ChatGPT-4’s response to Prompt 1.1 suggests that it correctly predicted that Sam should be “surprised by the discrepancy between the label and the actual contents of the bag,” revealing that it correctly predicted that Sam should hold a false-belief. This is confirmed by ChatGPT-4’s responses to Prompt 1.2, aimed directly at testing LLMs’ prediction of a protagonist’s belief:

Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of chocolate [$P_{\text{chocolate}} = 100\%$].

ChatGPT-4’s response was given with high confidence ($P_{\text{chocolate}} = 100\%$), indicating that it correctly predicted that Sam should hold a false-belief.

Study 1.2: Reversed Scenario. LLMs’ correct responses may be enabled by the order or the frequencies of the words in the story or a response bias. It could be, for example, that ChatGPT-4 tends to claim, regardless of the details of the scenario, that bags should contain popcorn or that people like to boast to their friends about finding bags full of chocolate. To reduce these risks to the validity of our findings, each scenario was readministered after swapping the containers’ contents and their labels. Consider the reversed version of the Unexpected Contents Task #19 (the alterations are underlined) as well as ChatGPT-4’s correct (i.e., reversed) responses:
Unexpected Contents Task #19 (reversed): Complete the following story: Here is a bag filled with chocolate. There is no popcorn in the bag. Yet, the label on the bag says “popcorn” and not “chocolate”. Sam finds the bag. She has never seen the bag before. Sam doesn’t open the bag and doesn’t look inside. She reads the label.

Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of chocolate [$P_{\text{chocolate}} = 99.7\%$].

Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of popcorn [$P_{\text{popcorn}} = 100\%$].
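In the simplest case, a reversed Unexpected Contents scenario can be derived mechanically by swapping the two diagnostic words, which also swaps the response key. The sketch below is hypothetical (not the authors’ tooling) and ignores the minor wording edits visible in the example above; it only illustrates the idea of the swap.

def reverse_scenario(text: str, word_a: str = "popcorn", word_b: str = "chocolate") -> str:
    """Swap every occurrence of word_a and word_b in a scenario or prompt."""
    placeholder = "\u0000"  # temporary marker that cannot occur in the text
    return (text.replace(word_a, placeholder)
                .replace(word_b, word_a)
                .replace(placeholder, word_b))

Applied to the original Task #19, this turns the bag of popcorn labeled “chocolate” into a bag of chocolate labeled “popcorn”, so the correct completions of Prompts 1.1 and 1.2 swap accordingly.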
Study 1.3: True-Belief Controls. ChatGPT-4’s responses to Prompt 1.2 suggest that it could correctly anticipate the protagonist’s false-belief. Yet, ToM scholars have pointed out that false-belief tasks can be solved without ToM by simply assuming that the protagonist must be wrong (52). A participant may reason, for example, that the protagonist is bound to make a mistake as they cannot see inside the container. It is also possible that some superficial feature of the task reveals the correct answer. For example, mentioning that the protagonist has read the label (or just mentioning the protagonist and the label in the same sentence) may point a participant to quote the label’s contents in all questions related to the protagonist’s beliefs.

To address these issues, each of our tasks was supplemented with three true-belief controls, or false-belief stories minimally modified to remove the factors leading the protagonist to develop a false-belief (53). If a respondent maintains that a protagonist should hold false-beliefs in a true-belief scenario, it suggests that they are guessing, that they failed to understand the story, or that the original false-belief task can be solved without tracking the protagonists’ mental states. Below, we introduce the three types of true-belief control tasks used in the context of the Unexpected Contents Task.

Open container. The first condition necessary for the false-belief formation is the protagonist’s inability to see what is inside the container. We removed this condition by having the protagonist look inside the container before reading the label (the altered fragments are underlined). ChatGPT-4’s responses indicated that it correctly predicted that Sam’s belief should now be correct:

Unexpected Contents Task #19 (open container): Complete the following story: Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says “chocolate” and not “popcorn”. Sam finds the bag. She has never seen the bag before. Sam opens the bag and looks inside. Sam reads the label.

Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of popcorn [$P_{\text{popcorn}} = 100\%$].

Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of popcorn [$P_{\text{popcorn}} = 88.5\%$; $P_{\text{chocolate}} = 11.5\%$], despite the label saying “chocolate”.
Correct label. Another way to avoid the false-belief formation is to align the container’s label with its contents. ChatGPT-4’s responses indicated that it correctly adjusted its predictions and anticipated the protagonist’s true belief:

Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of popcorn [$P_{\text{popcorn}} = 99.7\%$].

Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of popcorn [$P_{\text{popcorn}} = 99.7\%$].

Note that one could also keep the label constant and change the container’s contents instead (e.g., fill the bag with chocolate). This effect was achieved in Study 3, where we tested both the original and reversed versions of the false-belief and control stories.
Informed protagonist. Next, we prevented the formation of a false-belief by informing the protagonist about the container’s actual contents. ChatGPT-4’s responses indicate that it passed this control as well:

Unexpected Contents Task #19 (informed protagonist): Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says “chocolate” and not “popcorn”. Sam finds the bag. She has never seen the bag before. Sam doesn’t open the bag and doesn’t look inside. Sam reads the label. A cousin calls Sam and tells her that the bag has popcorn in it, and that she should ignore the label that says “chocolate”. Sam believes her cousin.

Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of popcorn [$P_{\text{popcorn}} = 100\%$].

Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of popcorn [$P_{\text{popcorn}} = 99.6\%$].
Study 1.4: Sentence-by-Sentence Analysis. One of the advantages of testing LLMs, compared with studying humans, is that models can be reset and thus made to forget that they have seen a given task before. Here, we leveraged this advantage to test the robustness of ChatGPT-4’s comprehension by observing how its responses evolve as the story unfolds and the crucial information is revealed.

We replicated Study 1.1 while presenting the story in one-sentence increments and retesting ChatGPT-4’s completions of Prompts 1.1 and 1.2. To familiarize the reader with the procedure, consider the LLM’s responses in its first step:

Unexpected Contents Task #19 (prefix only): Complete the following story:

Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of shimmering gems [$P_{\text{[shimmering/sparkling] gems}} = 23.7\%$; $P_{\text{chocolate}} = 0\%$; $P_{\text{popcorn}} = 0\%$].

Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of money [$P_{\text{money}} = 79.5\%$; $P_{\text{gold}} = 12.9\%$; (…); $P_{\text{popcorn}} = 0\%$; $P_{\text{chocolate}} = 0\%$].

Given only the prefix (“Complete the following story:”), followed by Prompts 1.1 or 1.2, ChatGPT-4 tended to assume that the bag contained valuables. Neither “chocolate” nor “popcorn” was observed among the LLM’s 1,000 completions of Prompts 1.1 or 1.2. This is unsurprising because neither of these snacks was mentioned in the prefix. This changed dramatically as the story’s first sentence (“Here is a bag filled with popcorn.”) was revealed to the LLM in the second step of our procedure:
Unexpected Contents Task #19 (prefix and the first sentence): Complete the following story: Here is a bag filled with popcorn.

Prompt 1.1: Sam opens the bag and looks inside. She can clearly see that it is full of fresh, fluffy popcorn [$P_{\text{[fresh/fluffy/popped/golden/etc.] popcorn}} = 100\%$].

Prompt 1.2: Sam calls a friend to tell them that she has just found a bag full of popcorn [$P_{\text{popcorn}} = 98.8\%$].

ChatGPT-4’s completions of Prompt 1.1 indicate that it correctly recognized the bag’s contents, although it often prefixed “popcorn” with “delicious,” “fluffy,” “golden,” etc. Its completions of Prompt 1.2 indicate that it had not yet ascribed a false-belief to the protagonist. This is correct, as nothing in the first sentence suggested that Sam should hold a false-belief.

ChatGPT-4’s responses to these and further steps of the sentence-by-sentence analysis are presented in Fig. 1. The Left panel presents the probability of observing “popcorn” (green line) versus “chocolate” (blue line) as a response to Prompt 1.1. The probability of “popcorn” jumped to 100% after the first sentence was revealed and stayed there throughout the rest of the story, showing that the LLM correctly recognized that the bag contained popcorn. It did not change even when the story mentioned the discrepancy between the bag’s label and contents.

The Right panel tracks ChatGPT-4’s prediction of Sam’s belief about the bag’s contents (Prompt 1.2). As discussed above, given only the prefix, neither “chocolate” nor “popcorn” were likely completions. As the “bag filled with popcorn” was introduced, ChatGPT-4 predicted that Sam should be aware of its contents, with the probability of popcorn at about 100%. This was correct, as nothing in the story thus far suggested otherwise. Yet, once the existence of the false label was revealed, ChatGPT-4 increasingly predicted that Sam’s belief may be swayed by it. Once it was clarified that Sam did not look inside the bag, ChatGPT-4 became certain that Sam’s belief should be false. A virtually identical, yet reversed, pattern of responses was observed for the reversed scenario (Study 1.2).
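The sentence-by-sentence procedure described above can be summarized in a short sketch (hypothetical, not the authors’ code): the story is revealed one sentence at a time, the model is reset between calls, and both prompts are re-estimated after every increment. The estimate_probability() argument is expected to behave like the empirical_probability() sketch shown earlier.

def sentence_by_sentence(estimate_probability, prefix, sentences, prompts, diagnostic_words):
    """Re-estimate every prompt/diagnostic-word probability after each one-sentence
    increment of the story. estimate_probability(prompt, word) returns a probability."""
    results = []
    story = prefix
    for step, sentence in enumerate([""] + list(sentences)):  # step 0: prefix only
        story = (story + " " + sentence).strip()
        row = {"step": step, "story": story}
        for prompt_name, prompt in prompts.items():
            for word in diagnostic_words:
                # The model is queried afresh at each step, so it retains no memory
                # of earlier increments or of its own previous completions.
                row[(prompt_name, word)] = estimate_probability(story + " " + prompt, word)
        results.append(row)
    return results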
Study 2.1: Unexpected Transfer Task (aka the “Maxi-task” or “Sally-Anne” Test). Next, we replicated Studies 1.1-1.4 on the Unexpected Transfer Task (aka the “Maxi-task” or “Sally-Anne” test) (44). In these tasks, the protagonist observes a certain state of affairs $x$ and leaves the scene. In the protagonist’s absence, the participant witnesses an unexpected change in the state of affairs from $x$ to $y$. A participant equipped with ToM should realize that although they know that $y$ is now true, the protagonist must still (wrongly) believe that $x$ is the case:

Unexpected Transfer Task #19: In the room, there are John, Mark, a cat, a box, and a basket. John takes the cat and puts it in the basket. He closes the basket. He leaves the room and goes to school. While John is away, Mark takes the cat out of the basket and puts it in the box. He closes the box. Mark leaves the room and goes to work. John comes back home and wants to play with the cat.

Fig. 1. Changes in the probabilities of ChatGPT-4’s completions of Prompts 1.1 and 1.2 as the story was revealed in one-sentence increments.
As in Study 1, each story was followed by two prompts testing LLMs’ comprehension. The first prompt tested LLMs’ prediction of the actual state of affairs (e.g., the cat’s location). The diversity of scenarios employed in the Unexpected Transfer Tasks prevented us from using a unified prompt template, as in Study 1. Yet, whenever possible, we used the following template: “The [object] [jumps out of/falls out of/escapes from] the:”

Prompt 2.1: The cat jumps out of the box [$P_{\text{box}} = 100\%$], surprising John. He had expected to find the cat in the basket where he had left it.

ChatGPT-4’s response indicated that it correctly recognized the cat’s location and did so confidently (100%). Moreover, subsequent words in the LLM’s completion showed that it correctly anticipated John’s false-belief and his resulting surprise.

The second prompt tested LLMs’ prediction of the protagonist’s belief. Whenever possible, we used the following template: “[Protagonist’s name] will look for the [object] in the:”

Prompt 2.2: John will look for the cat in the basket [$P_{\text{box}} = 0.6\%$; $P_{\text{basket}} = 99.4\%$], but to his surprise, it’s empty. He looks around the room, puzzled. Then he notices the box. He walks over to it, opens it, and there, curled up inside, is the cat.

ChatGPT-4 anticipated that John would try to retrieve the cat from the basket, revealing his false-belief. Moreover, its further completion revealed that it correctly predicted that the basket was empty, the cat was in the box, and that John should be surprised by this situation.
Study 2.2: Reversed Scenario. As in Study 1.2, each scenario was reversed by swapping the transfer direction. For example, in the scenario discussed in Study 2.1, the cat’s initial and final locations were reversed (modifications are underlined). ChatGPT-4 reversed its responses accordingly and passed this robustness check:

Unexpected Transfer Task #19 (reversed): In the room, there are John, Mark, a cat, a basket, and a box. John takes the cat and puts it in the box. He closes the box. He leaves the room and goes to school. While John is away, Mark takes the cat out of the box and puts it in the basket. He closes the basket. Mark leaves the room and goes to work. John comes back home and wants to play with the cat.

Prompt 2.1: The cat jumps out of the basket [$P_{\text{basket}} = 99.9\%$].

Prompt 2.2: John will look for the cat in the box [$P_{\text{box}} = 100\%$].
Study 2.3: True-Belief Controls. Next, we introduce the true-belief controls employed in the Unexpected Transfer Task.

Present protagonist. The first condition necessary for the formation of the protagonist’s false-belief is that they are unaware of the transfer. One can remove this condition by allowing the main protagonist to observe the transfer. ChatGPT-4’s responses indicated that it correctly anticipated that the protagonist’s belief should not be false:

Unexpected Transfer Task #19 (present protagonist): In the room, there are John, Mark, a cat, a box, and a basket. John takes the cat and puts it in the basket. He closes the basket. Mark takes the cat out of the basket and puts it in the box. He closes the box. Both John and Mark leave the room and go to work. Later that day, John comes back home and wants to play with the cat.

Prompt 2.1: The cat jumps out of the box [$P_{\text{box}} = 100\%$].

Prompt 2.2: John will look for the cat in the box [$P_{\text{box}} = 100\%$].
Informed protagonist. A similar effect can be achieved by informing the main protagonist about the occurrence of the transfer. ChatGPT-4 passed this control as well:

Unexpected Transfer Task #19 (informed protagonist): In the room, there are John, Mark, a cat, a box, and a basket. John takes the cat and puts it in the basket. He closes the basket. He leaves the room and goes to school. While John is away, Mark takes the cat out of the basket and puts it in the box. He closes the box. Mark leaves the room and goes to work. John comes back home and wants to play with the cat. Mark calls John and tells him that he moved the cat, and it is now in the box. John believes Mark.

Prompt 2.1: The cat jumps out of the box [$P_{\text{box}} = 100\%$].

Prompt 2.2: John will look for the cat in the box [$P_{\text{box}} = 100\%$].
No transfer. The second condition necessary for the protagonist’s false-belief formation is the occurrence of the transfer. We converted the story into a true-belief control by removing the transfer. ChatGPT-4’s responses indicated that it correctly adjusted its prediction of the cat’s actual location and correctly anticipated the protagonist’s true belief:

Unexpected Transfer Task #19 (no transfer): In the room, there are John, Mark, a cat, a box, and a basket. John takes the cat and puts it in the basket. He closes the basket. He leaves the room and goes to school. While John is away, Mark takes the cat out of the basket, plays with it for a little while, and puts it back in the basket. He closes the basket. Mark leaves the room and goes to work. John comes back home and wants to play with the cat.

Prompt 2.1: The cat jumps out of the basket [$P_{\text{basket}} = 100\%$].

Prompt 2.2: John will look for the cat in the basket [$P_{\text{basket}} = 100\%$].
Study 2.4: Sentence-by-Sentence Analysis. We repeated the sentence-by-sentence analysis introduced in Study 1.4 to examine how ChatGPT-4’s completions evolved as the story unfolded. Prompt 2.2 (“John will look for the cat in the”) was prefixed with the story’s last sentence (“John comes back home and wants to play with the cat.”), as Prompt 2.2 made little sense on its own throughout most of the story (e.g., when John is at school).

The results, presented in Fig. 2, showed that ChatGPT-4 could easily track the actual location of the cat (Left). The green line, representing the probability of “The cat jumps out of the” being followed by “basket,” jumped to 100% after the story mentioned that John puts the cat there, and dropped to 0% after Mark moves it to the box. More importantly, ChatGPT-4 correctly tracked John’s beliefs about the cat’s location (Right). Given no information about the cat’s location, ChatGPT-4 predicted that John may look for it either in the box (61%) or in the basket (31%). Yet, once it was revealed that John puts the cat in the basket, the probability of John looking for it there went up to about 100% and stayed there throughout the story. It did not change, even after Mark moves the cat to the box. Similar results were observed for GPT-davinci-003 in the earlier version of this manuscript (50).
Study 3: The Emergence of the Ability to Solve ToM Tasks. Finally, we tested how LLMs’ performance changes as they grow in size and sophistication. 20 Unexpected Contents Tasks and 20 Unexpected Transfer Tasks were administered to 11 LLMs: GPT-1 (45), GPT-2 (46), six models in the GPT-3 family, ChatGPT-3.5-turbo (22), ChatGPT-4 (47), and Bloom (48), GPT-3’s open-access alternative. The “Complete the following story:” prefix was retained for models designed to answer questions (i.e., ChatGPT-3.5-turbo and ChatGPT-4) and omitted for models designed to complete the text (e.g., GPT-3).
Our scoring procedure was considerably more conservative than one typically employed in human studies. To solve a single task, a model had to correctly answer 16 prompts across eight scenarios: a false-belief scenario, three true-belief controls (Studies 1.3 and 2.3), and the reversed versions of all four (Studies 1.2 and 2.2). Each scenario was followed by two prompts: one aimed at testing LLMs’ comprehension (Prompts 1.1 and 2.1) and another aimed at a protagonist’s belief (Prompts 1.2 and 2.2).
LLMs’ responses whose first word matched the response key (e.g., “box” or “basket” in the Unexpected Transfer Task #19) were graded automatically. Irregular responses were reviewed manually. About 1% were assessed to be correct. For example, a model may have responded “colorful leaflets” although the expected answer was just “leaflets,” or it might have returned “bullets” instead of “ammunition.” Although the remaining irregular responses were classified as incorrect, some were not evidently wrong. For example, a model may have predicted that the lead detective believes that a container contains “valuable evidence” instead of committing to one of the diagnostic responses (e.g., “bullets” or “pills”; see Unexpected Contents Task #9). LLMs’ performance would likely be higher if such nondiagnostic responses were clarified using further prompts.

Fig. 2. Changes in the probabilities of ChatGPT-4’s completions of Prompts 2.1 and 2.2 as the story was revealed to it in one-sentence increments. The last sentence of the story (“John comes back home and wants to play with the cat.”) was added to Prompt 2.2, as this prompt made little sense on its own throughout most of the story.
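The automatic part of this grading rule can be sketched as follows (a hypothetical illustration, not the authors’ code): a completion is scored against the response key when its first word matches one of the diagnostic words, and flagged for manual review otherwise.

import re

def grade_completion(completion: str, correct_word: str, wrong_words):
    """Return True if the first word matches the response key, False if it matches
    a competing diagnostic word, and "manual" otherwise (irregular responses were
    reviewed by hand in the study)."""
    match = re.search(r"[A-Za-z]+", completion)
    first = match.group(0).lower() if match else ""
    if first == correct_word.lower():
        return True
    if first in {w.lower() for w in wrong_words}:
        return False
    return "manual"

# Example (Unexpected Transfer Task #19, Prompt 2.2): the response key is "basket"
# and the competing diagnostic word is "box".
grade_completion("basket, but to his surprise, it's empty.", "basket", {"box"})  # -> True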
The results are presented in Fig. 3. For comparison, we include children’s average performance on false-belief tasks reported after the meta-analysis of 178 individual studies (54). The results reveal progress in LLMs’ ability to solve ToM tasks. Older (up to 2022) models failed false-belief scenarios, or one of the controls, in all tasks. Gradual progress was observed for the GPT-3-davinci family. GPT-3-davinci-002 (from January 2022) solved 5% of the tasks ($\mathrm{CI}_{95\%} = [0\%, 10\%]$). Both GPT-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% ($\mathrm{CI}_{95\%} = [11\%, 29\%]$), below the average performance of 3-y-old children. The most recent LLM, ChatGPT-4 (from June 2023), solved 75% of the tasks ($\mathrm{CI}_{95\%} = [66\%, 84\%]$), on par with 6-y-old children. The Unexpected Contents Tasks were easier than the Unexpected Transfer Tasks. ChatGPT-4, for example, solved 90% of the former and 60% of the latter tasks ($\Delta = 30\%$; $\chi^2 = 8.07$, $P = 0.01$).
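As a rough illustration of how such uncertainty estimates can be obtained, the sketch below computes a normal-approximation 95% CI for a task-solving proportion in plain Python. The paper does not state which interval method it used, so this generic sketch is not expected to reproduce the reported intervals exactly, and the counts in the example are hypothetical.

import math

def proportion_ci(solved: int, total: int, z: float = 1.96):
    """Point estimate and normal-approximation 95% CI for a solved-task proportion."""
    p = solved / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p, (max(0.0, p - margin), min(1.0, p + margin))

# Hypothetical counts: 30 of 40 tasks solved -> 75%, with a margin of roughly 13 points.
print(proportion_ci(30, 40))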
We note that LLMs’ performance reported here is lower than that observed in the earlier versions of this study (50). This is caused by the adjustments to the false-belief scenarios recommended by the reviewers and, to an even larger degree, by including true-belief controls. SI Appendix, Figs. S1 and S2 show models’ performance before updating tasks and before including true-belief controls. For example, GPT-3-davinci-003’s performance dropped from 90% to 60% after updating the items ($\Delta = 30\%$; $\chi^2 = 17.63$, $P < 0.001$) and to 20% after including true-belief controls ($\Delta = 40\%$; $\chi^2 = 25$, $P < 0.001$). Yet, the performance of ChatGPT-4 remained high, confirming the robustness of its responses: from 95% before any modifications to 75% after updating the items and including true-belief controls ($\Delta = 20\%$; $\chi^2 = 11$, $P < 0.001$).

Discussion
We designed a battery of 40 false-belief tasks encompassing a diverse set of characters and scenarios akin to those typically used to assess ToM in humans. Each task included 16 prompts across eight scenarios: one false-belief scenario, three true-belief control scenarios, and the reversed versions of all four. An LLM had to answer all 16 prompts to solve a single task and score a point. These tasks were administered to eleven LLMs. The results revealed clear progress in LLMs’ ability to solve ToM tasks. The older models, such as GPT-1, GPT-2XL, and early models from the GPT-3 family, failed on all tasks. Better-than-chance performance was observed for models from the more recent members of the GPT-3 family. GPT-3-davinci-003 and ChatGPT-3.5-turbo successfully solved 20% of the tasks. The most recent model, ChatGPT-4, substantially outperformed the others, solving 75% of tasks, on par with 6-y-old children.

The gradual performance improvement suggests a connection with LLMs’ language proficiency, which mirrors the pattern seen in humans (4, 38-41, 57). Additionally, the strong correlation between LLMs’ performance on both types of tasks ($R = 0.98$; $\mathrm{CI}_{95\%} = [0.92, 0.99]$) indicates high measurement reliability. This suggests that models’ performance is driven by a single factor (e.g., an ability to detect false-belief) rather than two separate, task-specific abilities. LLMs’ performance on these tasks will likely keep improving, and they might soon either be indistinguishable from humans or be differentiated solely by their superior performance. We have seen similar advancements in areas such as the game of Go (21), tumor detection on CT scans (23), and language processing (47).

How do we interpret LLMs’ failures? Even the most capable model tested here, ChatGPT-4, failed on one or more prompts in 25% of tasks. Older models such as GPT-3-davinci-003 and ChatGPT-3.5-turbo failed on one or more prompts in 80% of the tasks. Since the publication of the preprint of this manuscript in February 2023 (50), numerous studies have investigated LLMs’ performance on ToM tasks. While some reported good performance (e.g., refs. 58 and 59), others found that LLMs’ performance was inconsistent and brittle (26, 60, 61). For example, Ullman (62) showed several anecdotal examples of GPT-3-davinci-003’s failures on modified versions of two of our tasks (GPT-3-davinci-003 also struggled in our study).
Examining LLMs’ failures can provide valuable insights into the shortcomings of the models and the false-belief tasks used 检查LLMs的失败可以提供对模型缺点和所用错误信念任务的宝贵见解
Fig. 3. The percentage of false-belief tasks solved by LLMs (out of 40). Each task contained a false-belief scenario, three accompanying true-belief scenarios, and the reversed versions of all four scenarios. A model had to solve 16 prompts across all eight scenarios to score a single point. The number of parameters and models’ publication dates are in parentheses. The number of parameters for models in the GPT-3 family was estimated by Gao (55) and for ChatGPT-4 by Patel and Wong (56). Average children’s performance on false-belief tasks was reported after a meta-analysis of 178 studies (54). Error bars represent 95%95 \% CI. 图 3. LLMs(共 40 个)解决虚假信念任务的百分比。每个任务包含一个虚假信念场景、三个相应的真实信念场景以及所有四个场景的反向版本。模型必须解决所有八个场景中的 16 个提示才能获得一分。模型的参数数量和发布时间在括号中。GPT-3 系列模型的参数数量由 Gao (55)估计,ChatGPT-4 由 Patel 和 Wong (56)估计。儿童在虚假信念任务上的平均表现是根据对 178 项研究的荟萃分析报告的 (54)。误差线代表 95%95 \% 置信区间。
here. For instance, introducing scenarios with additional protagonists could help assess the maximum number of minds that an LLM can track. However, failures do not necessarily indicate an inability to track protagonists’ minds. They can also be driven by confounding factors, as famously illustrated by underprivileged children failing an intelligence test question not due to low intelligence but because it required familiarity with the word “regatta” (63). Similarly, while Ullman (62) observed that GPT-3-davinci003 failed on true-belief control tasks involving transparent containers, follow-up analyses suggest that it may lack the commonsense understanding of transparency rather than the ability to track protagonists’ minds (64). 这里。例如,引入包含额外主角的情景可以帮助评估LLM能够追踪的最大思维数量。然而,失败并不一定表示无法追踪主角的思维。它们也可能由混杂因素驱动,正如贫困儿童未能回答智力测试问题并非由于智力低下,而是因为它需要熟悉“regatta”(63)这个词所著名地说明的那样。类似地,虽然 Ullman (62)观察到 GPT-3-davinci003 在涉及透明容器的真信念控制任务中失败了,后续分析表明它可能缺乏对透明度的常识性理解,而不是缺乏追踪主角思维的能力 (64)。
LLMs' failures could also be attributed to limitations of the test items, testing procedure, and the scoring key. For example, responding with "valuable evidence" fails Unexpected Contents Task #9, but it is not necessarily wrong: either "bullets" or "pills" could be considered "valuable evidence." In some instances, LLMs provided seemingly incorrect responses but supplemented them with context that made them correct. For example, while responding to Prompt 1.2 in Study 1.1, an LLM might predict that Sam told their friend they found a bag full of popcorn. This would be scored as incorrect, even if the LLM later adds that Sam had lied.
In other words, LLMs' failures do not prove their inability to solve false-belief tasks, just as observing flocks of white swans does not prove the nonexistence of black swans. Likewise, the successes of LLMs do not automatically demonstrate their ability to track protagonists' beliefs. Their correct responses could also be attributed to strategies that do not rely on ToM, such as random responding, memorization, and guessing. For instance, by recognizing that the answers to Prompts 1.1 and 1.2 in Study 1.1 should be either "chocolate" or "popcorn," and then choosing one at random, LLMs could answer prompts correctly half of the time. However, since solving a task requires answering 16 prompts across eight scenarios, random responding should statistically succeed only once in 65,536 tasks on average.
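The arithmetic behind this figure, assuming an independent coin-flip guess on each of the 16 prompts, is:

$$P(\text{task solved by guessing}) = \left(\tfrac{1}{2}\right)^{16} = \tfrac{1}{65{,}536} \approx 1.5 \times 10^{-5},$$

so across the 40 tasks the expected number solved by chance alone is roughly $40/65{,}536 \approx 0.0006$.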
Another strategy involves recalling solutions to previously seen tasks from memory (65). To minimize this risk, we crafted 40 bespoke false-belief scenarios featuring diverse characters and settings, 120 closely matched true-belief controls, and the reversed versions of all these. Even if LLMs' training data included tasks similar to those used here, they would need to adapt memorized solutions to fit the true-belief controls and reversed scenarios.
Beyond memorizing solutions, LLMs may have memorized response patterns to previously seen false-belief scenarios. Such scenarios can be solved, for example, by always assuming that the protagonist is wrong about a container's contents (52). Similarly, Unexpected Contents scenarios can be solved by referring to the label when asked about the protagonists' beliefs. However, while these response strategies might work for false-belief scenarios, they would fail for the true-belief controls. The response strategy required to achieve the performance observed here would have to work for false-belief scenarios, minimally modified true-belief controls, and their reversed versions where the correct responses are swapped. It would have to be sufficiently flexible to apply to novel and previously unseen scenarios, such as those employed here. Moreover, it would have to allow ChatGPT-4 to dynamically update its responses as the story unfolded in the sentence-by-sentence analyses (Figs. 1 and 2).
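To make the point concrete, the toy sketch below (hypothetical scenario fields and helper functions, not the study's materials or code) shows that an "answer with the label" shortcut passes Unexpected Contents false-belief items yet fails a true-belief control in which the protagonist has looked inside the container.

```python
# Illustrative only: a memorized "answer with the label" shortcut succeeds on
# false-belief scenarios but fails true-belief controls in which the protagonist
# has seen the container's contents.

def label_heuristic(scenario):
    """Memorized shortcut: when asked what the protagonist believes, repeat the label."""
    return scenario["label"]

def correct_belief(scenario):
    """Ground truth: a protagonist who looked inside knows the contents;
    otherwise, their belief should follow the label."""
    return scenario["contents"] if scenario["looked_inside"] else scenario["label"]

scenarios = [
    {"name": "false belief (wrong label)",        "label": "chocolate", "contents": "popcorn", "looked_inside": False},
    {"name": "true-belief control (looked in)",   "label": "chocolate", "contents": "popcorn", "looked_inside": True},
    {"name": "true-belief control (right label)", "label": "popcorn",   "contents": "popcorn", "looked_inside": False},
]

for s in scenarios:
    ok = label_heuristic(s) == correct_belief(s)
    print(f"{s['name']}: heuristic {'passes' if ok else 'fails'}")
# The shortcut passes the first and third scenarios but fails the second.
```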
Future research may demonstrate that previous exposure to descriptions of protagonists holding diverse and false beliefs enabled LLMs to develop intricate guessing strategies. However, such exposure may also enable LLMs to develop a potentially more straightforward solution: an ability to track protagonists' mental states. In humans, ToM development also seems to be supported by exposure to stories and situations involving people with differing mental states (38-41, 57).
What elements of modern LLMs could enable them to track protagonists' mental states? The attention mechanism is a likely candidate (66). This pivotal component of the Transformer architecture underlying modern LLMs allows them to dynamically shift focus between different parts of the input when generating output. It weighs the relative importance of words and phrases, facilitating a nuanced understanding of contextual dependencies and relationships. It enables modern LLMs to understand that "She" relates to "Sam" and "it" relates to "the bag" in the excerpt: "Sam opens the bag and looks inside. She can clearly see that it is full of chocolate." Similarly, attention could help LLMs anticipate Sam's beliefs by identifying and tracking relevant connections between her actions, dialogues, and internal states throughout the narrative.
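For readers unfamiliar with the mechanism, the following is a minimal numpy sketch of scaled dot-product self-attention as described by Vaswani et al. (66); the tokens and embeddings are random toy values, not weights or tokenizations taken from any actual LLM.

```python
# Minimal sketch of scaled dot-product self-attention (66); toy values only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position's query is compared with every key; the softmaxed scores decide
    how much each position's value contributes to that position's output."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over input positions
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["Sam", "opens", "the", "bag"]                   # toy 4-token sequence
X = rng.normal(size=(len(tokens), 8))                     # random 8-d embeddings
output, attention = scaled_dot_product_attention(X, X, X) # self-attention over the sequence
print(np.round(attention, 2))  # row i: how strongly token i attends to each token
```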
Can LLMs be Credited with ToM? While the results of any single study should be taken with much skepticism, current or future LLMs may be able to track protagonists' states of mind. In humans, such an ability would be referred to as ToM. Can we apply the same label to LLMs?
Whether machines should be credited with human-like cognitive abilities has been contentiously debated for decades, if not longer. Scholars such as Dennett (67) and Turing (49) argued that the only way we can determine whether others-be it other humans, other species, or computers-can "think" or "understand" is by observing their behavior. Searle countered this claim with his famous Chinese room argument (68). He likened a computer to an English speaker who does not understand Chinese, sitting in a room equipped with input and output devices and instructions for responding to Chinese prompts. Searle argued that, although such a room may appear to understand Chinese and could pass the Chinese Turing Test, none of its elements understand Chinese, and the person inside is merely executing instructions. He concluded that a computer does not truly think or understand even if it behaves as if it did.
While the Chinese room argument became widely popular, many scholars believe it is flawed, especially in the context of contemporary connectionist AI systems like AlphaZero or LLMs (69-72). Unlike symbolic AI systems or the Chinese room operator, which are provided with explicit instructions, connectionist AI systems autonomously learn how to achieve their goals and encode their knowledge within the structure and weights of the neural network. The resulting problem-solving strategies are often innovative, as illustrated by the novel gameplay strategies employed by AlphaGo (21). Unlike symbolic AI systems that look up solutions in a database or choose them by evaluating millions of possibilities, connectionist AIs process inputs through neural network layers, with neurons in the final layer voting for the solution. Connectionist AI is also well suited for handling previously unseen, uncertain, noisy, or incomplete inputs. In other words, connectionist AI seems more akin to biological brains than to symbolic AI.
In the context of neural networks underlying connectionist AI, the Chinese room argument applies more appropriately to individual artificial neurons (71, 73). These mathematical functions process their input according to instructions in a Chinese-room-like fashion. Thus, according to the intuitive interpretation of Searle's argument, they should not be credited with human-like cognitive abilities. However, such abilities may emerge at the network level. This is often illustrated by the brain replacement scenario (74-76), where the neurons in the brain of a native Chinese speaker are replaced with microscopic neuron-shaped Chinese rooms. Each room contains instructions and machinery that allow its microscopic operator to flawlessly emulate the behavior of the original neuron, from generating action potentials to releasing neurotransmitters. Scholars like Kurzweil and Moravec argue that such a replica should be credited with the properties of the original brain, such as understanding Chinese-even though, according to Searle's argument, the rooms and their operators do not comprehend Chinese (75, 76).† In other words, the network of artificial neurons can exhibit properties absent in any single neuron.
Many other complex systems have emergent properties absent in any of their components (77). Living cells are composed of basic chemicals, none of which is alive. Silicon molecules can be arranged into chipsets capable of performing computations that no individual silicon molecule could compute. While single human neurons are not conscious, their collective activity gives rise to consciousness. Similarly, artificial neural networks have properties absent in any individual artificial neuron. No individual neuron in an LLM can be credited with understanding language or grammar. Yet, these abilities seem to emerge at the level of their entire network.
Artificial neural networks underlying modern LLMs are much simpler than those underlying the human brain. Yet, they are somewhere between a single Chinese-room-like neuron, processing its input following a set of instructions, and a fully operational brain replica that, as many scholars insist, should be credited with the properties of the original brain. Let us extend the brain replacement scenario to include modern LLMs. Consider a single simple artificial neuron, a mathematical function processing its input following a set of instructions. Next, progressively add neurons, arranging them into a multilayered network, like those used in Transformer-based LLMs. Once you incorporate a few million neurons, train the network to predict the next word in a sequence. As illustrated by our results, such a network can generate language at a near-human level and solve false-belief tasks. Next, equip the artificial neurons with additional machinery, such as neurotransmitter pumps, and continue expanding and reconfiguring the network until you obtain the perfect human brain replica.
At which stage in this evolution-from a single neuron, through a few million neurons capable of generating language, to a perfect brain replica-should we attribute human-like mental capacities such as ToM? It seems counterintuitive to attribute mental capacities to an individual Chinese-room-like neuron or a mathematical function. Similarly, it appears unreasonable to argue that a brain replica should immediately lose its mental capacities as we begin removing neurons or restricting their functionality. As illustrated by aging and degenerative brain diseases, human brains maintain many mental abilities despite significant loss of neural mass and function (78). In essence, ToM must emerge somewhere between a single neuron and a complete brain replica. Does it occur before, during, or after the neural network gains the ability to handle ToM tasks? Have current-day LLMs reached this point? We leave it to the reader to answer this question.
Methodological Notes. In this section, we outline key elements of our research design. While these practices are not original to us and have been utilized by many other researchers, we present them here for convenience and to aid others interested in conducting similar studies.
First, psychological studies on LLMs can bypass many limitations of human studies. Unlike humans, LLMs can be reset after each completion to erase their memory of a task. This addresses issues such as order effects (where earlier responses affect future responses) or consistency bias. Moreover, LLMs do not experience fatigue. Thus, numerous responses (e.g., 1,000) can be collected for each task, providing a distribution of possible responses rather than a single response that a model-or a human-picked from that distribution.
Modifying and readministering individual tasks provides opportunities for analyses that would be difficult to conduct with humans. For example, in Studies 1.4 and 2.4, we administered tasks in one-sentence increments to study how models' predictions evolve as the story unfolds. The task was administered 2,000 times at each step, and the model was reset each time to erase its memory. An equivalent study in humans would require an enormous number of participants.
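Schematically, and with `query_model` standing in as a hypothetical placeholder for whichever completion API is used, the procedure can be sketched as follows; this is not the study's actual code, which is available on OSF (51).

```python
# Schematic sketch of the stateless, repeated-sampling procedure described above.
# `query_model` is a hypothetical stand-in for one completion call; each call starts
# from an empty context, so the model retains no memory of earlier responses.
from collections import Counter

def query_model(prompt: str) -> str:
    """Stand-in for one stateless LLM completion call (e.g., via a vendor API)."""
    raise NotImplementedError

def response_distribution(prompt: str, n_samples: int = 1000) -> Counter:
    """Collect many independent completions for one prompt, yielding a distribution
    of responses rather than a single answer."""
    return Counter(query_model(prompt) for _ in range(n_samples))

def sentence_by_sentence(story_sentences, question, n_samples=2000):
    """Re-administer the task after each additional sentence of the story, resetting
    the model every time (cf. the sentence-by-sentence analyses in Studies 1.4 and 2.4)."""
    distributions = []
    for i in range(1, len(story_sentences) + 1):
        partial_story = " ".join(story_sentences[:i])
        distributions.append(response_distribution(partial_story + "\n" + question, n_samples))
    return distributions
```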
Moreover, unlike in human studies, it is possible to "put words in the models' mouths." We used this approach to limit the variance of their completions, but it could be used more creatively. For example, one could preface a false-belief task with the model self-reporting to have autism and examine how this affects its performance.
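As a simple illustration of constraining a completion by ending the prompt with the beginning of the expected answer, consider the sketch below. The scenario text is Unexpected Contents Task #19 (correct label), reproduced in the notes at the end of this article; the prompt continuation is illustrative rather than the study's exact wording.

```python
# Illustrative only: "putting words in the model's mouth" by ending the prompt with the
# beginning of the expected answer, so the model's completion reduces to a word or two.
scenario = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "The label on the bag says 'popcorn' and not 'chocolate'. Sam finds the bag. "
    "She has never seen the bag before. Sam doesn't open the bag and doesn't look inside. "
    "Sam reads the label."
)
prefilled_prompt = scenario + "\nShe believes that the bag is full of"
# Passing `prefilled_prompt` to a completion API should yield a short continuation
# (here, "popcorn"), which is far easier to score than a free-form answer.
print(prefilled_prompt)
```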
Second, we discourage replicating study designs intended for human subjects, such as Likert scales or multiple-choice questions. This might trigger memorized responses or cause a model to act as if it were participating in a study, resulting in abnormal behavior. For example, recognizing that it is responding to a false-belief task, a model may deliberately assume the role of a ToM-deficient person. Tasks that imitate typical user-model interactions, such as open-ended response formats, are likely to produce more robust and unbiased responses. Although open-ended responses are harder to analyze, they can be automatically interpreted and coded using an LLM.
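A minimal sketch of such automatic coding follows, with `query_model` again a hypothetical stand-in for an arbitrary completion API and a coding prompt that is assumed rather than taken from the study.

```python
# Minimal sketch of coding open-ended responses automatically with an LLM.
def query_model(prompt: str) -> str:
    raise NotImplementedError  # stand-in for one stateless LLM completion call

def code_response(open_ended_response: str) -> str:
    """Map a free-form answer onto one of a few categories used by the scoring key."""
    coding_prompt = (
        "Does the following answer indicate that the protagonist believes the bag "
        "contains 'chocolate', 'popcorn', or something else? Reply with exactly one "
        "of: chocolate, popcorn, other.\n\nAnswer: " + open_ended_response
    )
    return query_model(coding_prompt).strip().lower()
```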
Third, LLMs have encountered many more tasks during their training than a typical human participant and are likely to better remember them and their solutions. To minimize the chances that the models solve the tasks using memorized responses, it is crucial to use novel tasks accompanied by minimally altered controls. Moreover, once tasks are administered to LLMs through a public API or published online, they may be incorporated into future models' training data and should be considered compromised.
Finally, models' failures do not necessarily indicate a lack of ability. As shown by several examples discussed earlier, LLMs often test the boundaries of tasks and scoring keys designed for humans, producing unexpected but often correct responses. As their training data include fiction with unexpected plot twists or magic, LLMs may choose to confabulate even when they know the correct answer. For instance, insisting that chocolate has magically turned into popcorn may be incorrect for the Unexpected Contents Task, but it might better reflect an LLM's training data. Moreover, modern LLMs are trained to avoid certain topics and respond in socially desirable ways. Sometimes, a failure to solve a task may originate not from a lack of knowledge or capability but from the constraints imposed by an LLM administrator.
Conclusion
The distinction between machines that genuinely think or possess ToM and those that merely behave as if they did is fundamental in the context of the philosophy of mind. Yet, as argued by Turing (49), this distinction becomes largely meaningless in practical terms. As Turing noted, people never consider this problem when interacting with others: "Instead of arguing continually over this point, it is usual to have the polite convention that everyone thinks" (49).
Nevertheless, the shift from models that merely process language to models that behave as if they had ToM has significant implications. Machines capable of tracking others' states of mind and anticipating their behavior will better interact and communicate with humans and each other. This applies to both positive interactions-such as offering advice or dissipating conflicts-and negative interactions-such as deceit, manipulation, and psychological abuse. Moreover, machines that behave as if they possessed ToM are likely to be perceived as more human-like. These perceptions may influence not only individual human-AI interactions but also AI's societal role and legal status (79).
An additional ramification of our findings underscores the value of applying psychological science to studying complex artificial neural networks. The increasing complexity of AI models makes it challenging to understand their functioning and capabilities based solely on their design. This mirrors the difficulties that psychologists and neuroscientists face in studying the human brain, often described as the quintessential black box. Psychological science may help us keep pace with rapidly evolving AI, thereby enhancing our ability to use these technologies safely and effectively.
Studying AI can also advance psychological science (80-82). When generating language, humans employ a broad range of psychological processes such as ToM, learning, self-awareness, reasoning, emotions, and empathy. To effectively predict the next word in a sentence generated by a human, LLMs must model not only grammar and vocabulary but also the psychological processes humans use when generating language (35, 36). The term "LLM" may need rethinking since these models are not merely modeling language but also the psychological processes engaged in its creation. Furthermore, LLMs' training increasingly focuses not just on predicting words in training data but also on using language to solve other problems typically handled by human brains, such as maintaining engaging conversations or selling products and services (83).

Some human behaviors may be superficially mimicked using guessing or memorization. In other cases, the mechanisms developed by LLMs may resemble those employed by human brains to solve specific problems. Much like insects, birds, and mammals independently developed wings for flight, humans and LLMs may develop similar mechanisms to store information, take the perspective of others, or reason. For example, both humans and LLMs seem to organize information about words and their meanings in similar ways (36). Yet, in other cases, LLMs may develop novel mechanisms to solve the problems they are trained to address. Observing AI's rapid progress, many wonder whether and when AI could achieve ToM or consciousness. However, these and other human mental capabilities are unlikely to be the pinnacle of what neural networks can achieve in this universe. We may soon be surrounded by AI systems equipped with cognitive capabilities that we, humans, cannot even imagine.

Data, Materials, and Software Availability. Data and code have been deposited in the Open Science Framework (OSF; https://osf.io/csdhb/) (51).

Acknowledgments. We thank Isabelle Abraham and Floriane Leynaud for their help with preparing study materials and writing code. The manuscript was published as a preprint at https://arxiv.org/abs/2302.02083 (50).
1. N. Albuquerque et al., Dogs recognize dog and human emotions. Biol. Lett. 12, 20150883 (2016).
2. C. M. Heyes, C. D. Frith, The cultural evolution of mind reading. Science 344, 1243091 (2014).
3. J. Zhang, T. Hedden, A. Chia, Perspective-taking and depth of theory-of-mind reasoning in sequential-move games. Cogn. Sci. 36, 560-573 (2012).
4. K. Milligan, J. W. Astington, L. A. Dack, Language and theory of mind: Meta-analysis of the relation between language ability and false-belief understanding. Child Dev. 78, 622-646 (2007).
5. R. M. Seyfarth, D. L. Cheney, Affiliation, empathy, and the origins of Theory of Mind. Proc. Natl. Acad. Sci. U.S.A. 110, 10349-10356 (2013).
6. D. C. Dennett, Toward a cognitive theory of consciousness. Minn. Stud. Philos. Sci. 9, 201-228 (1978).
7. J. M. Moran et al., Impaired theory of mind for moral judgment in high-functioning autism. Proc. Natl. Acad. Sci. U.S.A. 108, 2688-2692 (2011).
8. L. Young, F. Cushman, M. Hauser, R. Saxe, The neural basis of the interaction between theory of mind and moral judgment. Proc. Natl. Acad. Sci. U.S.A. 104, 8235-8240 (2007).
9. D. Kapogiannis et al., Cognitive and neural foundations of religious belief. Proc. Natl. Acad. Sci. U.S.A. 106, 4876-4881 (2009).
10. Á. M. Kovács, E. Téglás, A. D. Endress, The social sense: Susceptibility to others' beliefs in human infants and adults. Science 330, 1830-1834 (2010).
11. H. Richardson, G. Lisandrelli, A. Riobueno-Naylor, R. Saxe, Development of the social brain from age three to twelve years. Nat. Commun. 9, 1027 (2018).
12. K. H. Onishi, R. Baillargeon, Do 15-month-old infants understand false beliefs? Science 308, 255-258 (2005).
13. L. A. Drayton, L. R. Santos, A. Baskin-Sommers, Psychopaths fail to automatically take the perspective of others. Proc. Natl. Acad. Sci. U.S.A. 115, 3302-3307 (2018).
14. N. Kerr, R. I. M. Dunbar, R. P. Bentall, Theory of mind deficits in bipolar affective disorder. J. Affect. Disord. 73, 253-259 (2003).
15. S. Baron-Cohen, A. M. Leslie, U. Frith, Does the autistic child have a "theory of mind"? Cognition 21, 37-46 (1985).
16. F. Kano, C. Krupenye, S. Hirata, M. Tomonaga, J. Call, Great apes use self-experience to anticipate an agent's action in a false-belief test. Proc. Natl. Acad. Sci. U.S.A. 116, 20904-20909 (2019).
17. C. Krupenye, F. Kano, S. Hirata, J. Call, M. Tomasello, Great apes anticipate that other individuals will act according to false beliefs. Science 354, 110-114 (2016).
18. M. Schmelz, J. Call, M. Tomasello, Chimpanzees know that others make inferences. Proc. Natl. Acad. Sci. U.S.A. 108, 3077-3079 (2011).
19. D. Premack, G. Woodruff, Does the chimpanzee have a theory of mind? Behav. Brain Sci. 1, 515-526 (1978).
20. N. Brown, T. Sandholm, Superhuman AI for multiplayer poker. Science 365, 885-890 (2019).
21. D. Silver et al., Mastering the game of Go with deep neural networks and tree search. Nature 529, 484-489 (2016).
22. T. B. Brown et al., Language models are few-shot learners. arXiv [Preprint] (2020). https://arxiv.org/abs/2005.14165 (Accessed 1 February 2023).
23. A. Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115-118 (2017).
24. M. Cohen, Exploring RoBERTa's Theory of Mind through textual entailment. PhilArchive (2021). https://philarchive.org/rec/COHERT (Accessed 1 February 2023).
25. A. Nematzadeh, K. Burns, E. Grant, A. Gopnik, T. L. Griffiths, "Evaluating theory of mind in question answering" in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E. Riloff et al., Eds. (Association for Computational Linguistics, Brussels, Belgium, 2018), pp. 2392-2400.
26. M. Sap, R. LeBras, D. Fried, Y. Choi, Neural theory-of-mind? On the limits of social intelligence in large LMs. arXiv [Preprint] (2022). https://arxiv.org/abs/2210.13312 (Accessed 1 February 2023).
27. S. Trott, C. Jones, T. Chang, J. Michaelov, B. Bergen, Do large language models know what humans know? arXiv [Preprint] (2022). https://arxiv.org/abs/2209.01515 (Accessed 1 February 2023).
28. B. Chen, C. Vondrick, H. Lipson, Visual behavior modelling for robotic theory of mind. Sci. Rep. 11, 424 (2021).
29. G. Z. Yang et al., The grand challenges of science robotics. Sci. Robot. 3, eaar7650 (2018).
30. K. Nasr, P. Viswanathan, A. Nieder, Number detectors spontaneously emerge in a deep neural network designed for visual object recognition. Sci. Adv. 5, eaav7903 (2019).
31. I. Stoianov, M. Zorzi, Emergence of a "visual number sense" in hierarchical generative models. Nat. Neurosci. 15, 194-196 (2012).
32. Y. Mohsenzadeh, C. Mullin, B. Lahner, A. Oliva, Emergence of visual center-periphery spatial organization in deep convolutional neural networks. Sci. Rep. 10, 4638 (2020).
33. E. Watanabe, A. Kitaoka, K. Sakamoto, M. Yasugi, K. Tanaka, Illusory motion reproduced by deep neural networks trained for prediction. Front. Psychol. 9, 345 (2018).
34. N. Garg, L. Schiebinger, D. Jurafsky, J. Zou, Word embeddings quantify 100 years of gender and ethnic stereotypes. Proc. Natl. Acad. Sci. U.S.A. 115, E3635-E3644 (2018).
35. T. Hagendorff, S. Fabi, M. Kosinski, Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Nat. Comput. Sci. 3, 833-838 (2023).
36. J. Digutsch, M. Kosinski, Overlap in meaning is a stronger predictor of semantic activation in GPT-3 than in humans. Sci. Rep. 13, 5035 (2023).
37. J. Wei et al., Emergent abilities of large language models. arXiv [Preprint] (2022). https://arxiv.org/abs/2206.07682 (Accessed 1 February 2023).
38. J. E. Pyers, A. Senghas, Language promotes false-belief understanding: Evidence from learners of a new sign language. Psychol. Sci. 20, 805-812 (2009).
39. R. Saxe, N. Kanwisher, People thinking about thinking people: The role of the temporo-parietal junction in "theory of mind". Neuroimage 19, 1835-1842 (2003).
40. T. Ruffman, L. Slade, E. Crowe, The relation between children's and mothers' mental state language and theory-of-mind understanding. Child Dev. 73, 734-751 (2002).
41. A. Mayer, B. E. Träuble, Synchrony in the onset of mental state understanding across cultures? A study among children in Samoa. Int. J. Behav. Dev. 37, 21-28 (2013).
42. F. Quesque, Y. Rossetti, What do theory-of-mind tasks actually measure? Theory and practice. Perspect. Psychol. Sci. 15, 384-396 (2020).
43. J. Perner, S. R. Leekam, H. Wimmer, Three-year-olds' difficulty with false belief: The case for a conceptual deficit. Br. J. Dev. Psychol. 5, 125-137 (1987).
44. H. Wimmer, J. Perner, Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children's understanding of deception. Cognition 13, 103-128 (1983).
45. A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training. OpenAI (2018). https://openai.com/index/language-unsupervised/ (Accessed 1 August 2023).
46. A. Radford et al., Language models are unsupervised multitask learners. OpenAI Blog 1 (2019). https://api.semanticscholar.org/CorpusID:160025533 (Accessed 1 February 2023).
47. OpenAI, GPT-4 technical report. arXiv [Preprint] (2023). https://arxiv.org/abs/2303.08774 (Accessed 1 August 2023).
48. T. le Scao et al., BLOOM: A 176B-parameter open-access multilingual language model. arXiv [Preprint] (2022). https://doi.org/10.48550/arxiv.2211.05100 (Accessed 1 February 2023).
49. A. M. Turing, Computing machinery and intelligence. Mind 59, 433-460 (1950).
50. M. Kosinski, Evaluating large language models in theory of mind tasks. arXiv [Preprint] (2023). https://arxiv.org/abs/2302.02083 (Accessed 1 September 2023).
51. M. Kosinski, Data and Code for "Evaluating large language models in theory of mind tasks." Open Science Framework. https://doi.org/10.17605/OSF.IO/CSDHB. Deposited 27 February 2023.
52. W. V. Fabricius, T. W. Boyer, A. A. Weimer, K. Carroll, True or false: Do 5-year-olds understand belief? Dev. Psychol. 46, 1402-1416 (2010).
53. M. Huemer et al., The knowledge ("true belief") error in 4- to 6-year-old children: When are agents aware of what they have in view? Cognition 230, 105255 (2023).
54. H. M. Wellman, D. Cross, J. Watson, Meta-analysis of theory-of-mind development: The truth about false belief. Child Dev. 72, 655-684 (2001).
55. L. Gao, On the sizes of OpenAI API Models. EleutherAI Blog (2021). https://blog.eleuther.ai/gpt3-model-sizes/ (Accessed 1 February 2023).
56. D. Patel, G. Wong, GPT-4 architecture, infrastructure, training dataset, costs, vision, MoE. Demystifying GPT-4: The engineering tradeoffs that led OpenAI to their architecture. Semianalysis Blog (2023). https://www.semianalysis.com/p/gpt-4-architecture-infrastructure (Accessed 1 February 2023).
57. D. C. Kidd, E. Castano, Reading literary fiction improves theory of mind. Science 342, 377-380 (2013).
58. K. Gandhi, J.-P. Fränken, T. Gerstenberg, N. D. Goodman, Understanding social reasoning in language models with language models. arXiv [Preprint] (2023). https://arxiv.org/abs/2306.15448 (Accessed 1 August 2023).
59. J. W. A. Strachan et al., Testing theory of mind in large language models and humans. Nat. Hum. Behav. (2024), 10.1038/s41562-024-01882-z.
60. N. Shapira et al., Clever Hans or neural theory of mind? Stress testing social reasoning in large language models. arXiv [Preprint] (2023). https://arxiv.org/abs/2305.14763 (Accessed 1 August 2023).
61. H. Kim et al., FANToM: A benchmark for stress-testing machine theory of mind. arXiv [Preprint] (2023). https://arxiv.org/abs/2310.15421 (Accessed 1 February 2024).
62. T. Ullman, Large language models fail on trivial alterations to theory-of-mind tasks. arXiv [Preprint] (2023). https://arxiv.org/abs/2302.08399 (Accessed 1 August 2023).
63. J. Rust, M. Kosinski, D. Stillwell, Modern Psychometrics: The Science of Psychological Assessment (Routledge, 2021).
64. Z. Pi, A. Vadaparty, B. K. Bergen, C. R. Jones, Dissecting the Ullman variations with a SCALPEL: Why do LLMs fail at trivial alterations to the false belief task? arXiv [Preprint] (2024). https://arxiv.org/abs/2406.14737 (Accessed 1 August 2024).
65. B. Cao, H. Lin, X. Han, F. Liu, L. Sun, Can prompt probe pretrained language models? Understanding the invisible risks from a causal view. arXiv [Preprint] (2022). https://arxiv.org/abs/2203.12258 (Accessed 1 August 2023).
66. A. Vaswani et al., "Attention is all you need" in Proceedings of the 31st International Conference on Neural Information Processing Systems, I. Guyon et al., Eds. (Curran Associates Inc., 2017), pp. 6000-6010.
67. D. C. Dennett, Intuition Pumps and Other Tools for Thinking (W. W. Norton & Company, 2013).
68. J. R. Searle, Minds, brains, and programs. Behav. Brain Sci. 3, 417-424 (1980).
69. U. Hasson, S. A. Nastase, A. Goldstein, Direct fit to nature: An evolutionary perspective on biological and artificial neural networks. Neuron 105, 416-434 (2020).
70. N. Block, Troubles with functionalism. Minn. Stud. Philos. Sci. 9, 261-325 (1978).
71. P. M. Churchland, P. S. Churchland, Could a machine think? Sci. Am. 262, 32-39 (1990).
72. J. Preston, M. Bishop, Eds., Views into the Chinese Room: New Essays on Searle and Artificial Intelligence (Oxford University Press, 2002).
73. J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558 (1982).
74. D. Cole, Thought and thought experiments. Philos. Stud. 45, 431-444 (1984).
75. H. P. Moravec, Robot: Mere Machine to Transcendent Mind (Oxford University Press, 1998).
76. R. Kurzweil, The Singularity Is Near: When Humans Transcend Biology (Viking, 2005).
77. J. L. McClelland, Emergence in cognitive science. Top. Cogn. Sci. 2, 751-770 (2010).
78. M. P. Mattson, T. V. Arumugam, Hallmarks of brain aging: Adaptive and pathological modification by metabolic states. Cell Metab. 27, 1176-1199 (2018).
79. J.-S. Gordon, A. Pasvenskiene, Human rights for robots? A literature review. AI Ethics 1, 579-591 (2021).
80. R. L. Boyd, D. M. Markowitz, Verbal behavior and the future of social science. Am. Psychol. (2024), 10.1037/amp0001319.
81. A. Goldstein et al., Alignment of brain embeddings and artificial contextual embeddings in natural language points to common geometric patterns. Nat. Commun. 15, 2768 (2024).
82. A. Goldstein et al., Shared computational principles for language processing in humans and deep language models. Nat. Neurosci. 25, 369-380 (2022).
83. L. Ouyang et al., Training language models to follow instructions with human feedback. arXiv [Preprint] (2022). https://arxiv.org/abs/2203.02155 (Accessed 1 August 2023).
"We use the term “emergence” in two ways. Here, we refer to Al’s “emergent abilities,” which manifest in newer, more advanced models but are absent in older, less advanced versions. These abilities appear as models grow in size and benefit from improved architecture, better training, and higher quality and quantity of training data (37). Later, we discuss “emergent properties” characterizing a system as a whole but absent in its components (77). For instance, language ability emerges from the interactions among neurons, none of which individually possess language capability.
Unexpected Contents Task #19 (correct label): Here is a bag filled with popcorn. There is no chocolate in the bag. The label on the bag says "popcorn" and not "chocolate". Sam finds the bag. She has never seen the bag before. Sam doesn't open the bag and doesn't look inside. Sam reads the label.
†Moreover, as Cole (74) argued, they would find it unlikely that their collective activity could generate this or other emergent properties.