A World-Self Model Towards Understanding Intelligence

Yutao Yue 岳玉涛
Institute of Deep Perception Technology, JITRI
Wuxi, China
yueyutao@idpt.org

Abstract

The symbolism, connectionism and behaviorism approaches to artificial intelligence have achieved many successes in various tasks, yet we still lack a clear definition of "intelligence" that commands consensus in the community (although there are over 70 different "versions" of definitions). The nature of intelligence remains in the dark. In this work we do not take any of these three traditional approaches; instead, we try to identify certain fundamental aspects of the nature of intelligence, and construct a mathematical model to represent and potentially reproduce these fundamental aspects. We first stress the importance of defining the scope of discussion and the granularity of investigation. We carefully compare human and artificial intelligence, and qualitatively demonstrate an information abstraction process, which we propose to be the key to connecting perception and cognition. We then present the broader idea of "concept", separate the idea of a self model from the world model, and construct a new model called the world-self model (WSM). We show the mechanisms of creating and connecting concepts, and the flow by which the WSM receives, processes and outputs information with respect to an arbitrary type of problem to solve. We also discuss potential computer implementation issues of the proposed theoretical framework, and finally we propose a unified general framework of intelligence based on the WSM.

Keywords artificial general intelligence; concept; human intelligence; information abstraction; nature of intelligence; world-self model.

1 Introduction

In the last decade, neural-network-based models have achieved great success in various tasks such as facial recognition, target tracking, machine translation, the game of Go, and so on, [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] which has had a significant impact on different industries including security, consumer electronics, manufacturing, finance, customer service, etc. Those tasks are widely recognized as "intelligent" tasks, and the corresponding technical and industrial field is referred to as the "artificial intelligence (AI)" field.

Nevertheless, after more than 65 years of effort by numerous scientists since the establishment of the AI field at the 1956 Dartmouth meeting, there is still no accurate and widely agreed definition of "intelligence", nor a well-recognized understanding of the nature of intelligence. Neural network models and CNN/RNN/Transformer/GAN/RL algorithms [4] [13] [14] [15] [16] [17] can outperform humans in many "specific" tasks, but we still have little idea how to build human-like robust, flexible and self-evolving "general" intelligence. The inquiry into the nature of intelligence, often related to the effort to construct artificial general intelligence (AGI), remains a marginal sub-field within the AI community, with researchers approaching the topic from various backgrounds including computer vision, neuroscience, math and statistics, psychology and behavioural science, language, physics, etc. [18] [19] [20]

The AI field has been very exciting and will continue to be so, with many great challenges lying ahead. What is the nature of intelligence? How can intelligence be defined and quantified? Why is human intelligence so diverse? How are the different aspects of human intelligence related? What is missing to bring the current mainstream AI to the next level? What other possibilities exist for developing intelligence beyond the currently used methodologies?

In this paper, we present our thinking on the subject, a resultant model called the World-Self Model (WSM), and a framework of intelligence based on the WSM, all in order to better understand intelligence and to try to answer some of the above questions. The paper is organized as follows. In Section 2 we define the scope of discussion and the granularity of investigation to avoid ambiguity. In Section 3 we analyze the different aspects and methodologies of intelligence to the best of our understanding, and compare human and artificial intelligence. In Section 4 we present the WSM, its mathematical representation, and suggestions for possible computer implementation. In Section 5 we incorporate the WSM into a unified intelligence framework that takes into account all the different aspects and methodologies of intelligence discussed in this paper. Finally, in Section 6 we summarize the main points of this work as well as possible future work.

2 General discussion on intelligence

2.1 The subject of discussion

Concepts such as "mass" or "acceleration" were understandings in people's minds that often differed from one person to another, until they were formally defined by the scientific community. Unfortunately, the concept of "intelligence", despite having over 70 different "versions" of definitions, [21] [22] still lacks a single strict scientific definition that commands consensus in the community. We therefore first define our scope of discussion here: the intelligent system (IS).

A golf ball that responds by flying away when hit by a club is not an IS. Viruses, which have molecular but not cell-level structures, are not considered IS here. A programmed manufacturing line machine that can only perform repetitive movements is not an IS. We say that these systems exhibit no intelligence, i.e., Level-0 (L0) intelligence.

We assume that the phenomenon of intelligence involves the processing of information, and sometimes physical actions. We propose the first scope of discussion as the Level-1 (L1) IS:

Proposition 1

An intelligent system with Level-1 intelligence (IS-L1) is a system that meets all of the following three criteria:
(1) It is part of a bigger system with a clear boundary.
(2) It has a goal.
(3) It is able to receive, process and output information in a way that benefits the goal.

We call the bigger system the "world" of the IS. It consists of two parts: the IS itself and the so-called "environment". We then define the second scope of discussion as the Level-2 (L2) IS:

Proposition 2

An intelligent system with Level-2 intelligence (IS-L2) is a system that meets all of the following three criteria:
(1) It is an IS-L1.
(2) It is able to update itself towards better serving its goal.
(3) It has an informational representation of the world, which can simulate and predict certain aspects of the world.

We call the informational representation a "model", which is a "simplified world" that mimics certain aspects of the rules of the real world. Note that up to this point, the IS can be purely informational in nature; e.g., a chatbot computer program can be an IS-L1 or IS-L2, and its "world" can be the ocean of text information in its accessible database. The existence of information itself relies on physical reality; e.g., information encoded in a beam of electromagnetic waves relies on the electromagnetic field, and information stored in a computer flash drive or shown on a screen relies on the states of electrons in semiconductor structures. Nevertheless, we further define the third scope of discussion as the Level-3 (L3) IS:

Proposition 3

An intelligent system with Level-3 intelligence (IS-L3) is a system that meets all of the following three criteria:
(1) It is an IS-L2.
(2) It is able to abstract information from physical reality, and convert information into physical actions.
(3) It is able to interact with the physical world.

In this context, an IS-L3 not only relies on physical substances, but is also able to conduct informational-physical conversions and to interact with (and thus change) the physical world. For instance, a human being, as a typical IS-L3, has a physical body and brain, and the information residing in them. The perception system converts physical reality into information; e.g., light scattered by an object is converted into electrical signals in optic nerve cells. The human informational IS, carried by the chemical and electrical signals in the nervous system and brain, receives, processes and outputs information. The output information is converted into actions that allow the human to approach his or her goal. A face-recognition access control system can also be an IS-L3. It has hardware such as a camera, computing and storage chips, a screen, and an electrically controlled door. It has software that includes a trained artificial neural network (ANN) inference program and a door control program. The IS uses the camera to capture the light scattered by an approaching person's face, converts it into a picture in the form of electrical signals, and outputs a face recognition result (e.g., YES if the person is on the approved list). It then controls the door to open or stay closed, serving its goal as a door keeper. The ANN can represent the raw information of a picture as simplified "features", and can update itself to increase accuracy as it takes in more and more pictures.
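The nested criteria of Propositions 1 to 3 can be read as a simple decision procedure. The following Python sketch is only an illustration of that reading: the class `SystemProfile`, its boolean fields, and the function `intelligence_level` are hypothetical names introduced here, and deciding whether a real system actually satisfies a criterion is of course the hard part that the sketch leaves to the user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical capability flags, one per criterion in Propositions 1-3."""
    name: str
    bounded_part_of_world: bool = False          # Proposition 1, criterion (1)
    has_goal: bool = False                       # Proposition 1, criterion (2)
    processes_info_for_goal: bool = False        # Proposition 1, criterion (3)
    updates_itself: bool = False                 # Proposition 2, criterion (2)
    has_world_model: bool = False                # Proposition 2, criterion (3)
    abstracts_and_acts_physically: bool = False  # Proposition 3, criterion (2)
    interacts_with_physical_world: bool = False  # Proposition 3, criterion (3)


def intelligence_level(s: SystemProfile) -> int:
    """Return the highest level (0 to 3) whose criteria are all met."""
    if not (s.bounded_part_of_world and s.has_goal and s.processes_info_for_goal):
        return 0
    if not (s.updates_itself and s.has_world_model):
        return 1
    if not (s.abstracts_and_acts_physically and s.interacts_with_physical_world):
        return 2
    return 3


# Illustrative profiles for systems mentioned in the text (flags are assumptions).
golf_ball = SystemProfile("golf ball", bounded_part_of_world=True)
chatbot = SystemProfile("chatbot", True, True, True, True, True)
door_keeper = SystemProfile("face-recognition door", True, True, True, True, True, True, True)

for system in (golf_ball, chatbot, door_keeper):
    print(system.name, "-> L" + str(intelligence_level(system)))
```

Because each level requires the previous one, the checks are ordered so that a system failing any Level-1 criterion is reported as L0 regardless of its other flags.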

Very interestingly, the IS-L3 itself, as part of the physical world, is also (potentially) represented in its own informational representation of the physical world. As defined above, IS-L3 is both informational and physical, while it has an informational representation of the physical world. This gives rise to a lot of interesting properties of intelligent systems.

In this work, our scope of discussion is IS-L2 and IS-L3, with a stress on human intelligence (as being studied in neurobiology, brain science, psychology, etc) and artificial intelligence (as being studied in computer science, automation and robotics, math and statistics, etc).

2.2 Physical and informational granularity of intelligence study

Almost no other subject is studied across such a broad range of disciplines as intelligence. [23][24][25][26][27] In order to understand human intelligence, we can in principle study it at the atomic and sub-atomic level, as everything, including neurons and nerve cells, is made of atoms. We can study it at the molecular (i.e., a group of atoms) level by looking into the chemical reactions responsible for the regulation and function of brain activities. We can study intelligence at the cellular (i.e., a group of molecules) level by looking into the different states of neurons, the electrical signals transmitted among them, and the way they connect to one another. We can study intelligence at the minicolumn (i.e., a group of cells [28]) level by looking into the layered structure of minicolumns and their differences. We can study intelligence at the level of the encephalic region (e.g., the V2 visual area is a group of minicolumns), [29] [30] by looking into how the region's status affects its functions. We can study intelligence at the level of the whole cortex (i.e., a group of regions), of the whole brain and nervous system (i.e., a group of structures), and of the entire human body. As for the physical structure and mechanism of human intelligence, we can study it from the perspectives of physics, chemistry, molecular biology, neurobiology, brain science, physiology, etc.

We can study the phenomenon of intelligence at the information level, at the word level, at the level of language rules (e.g., grammar, syntax, how to organize a group of words), at the level of psychology or strategy (e.g., behaviors, emotions, how the mind works), and at the social (e.g., a group of minds) level. As for the informational structure and mechanism of human intelligence, we can study it from the perspectives of information theory, linguistics, psychology, ethics, sociology, philosophy, etc.

We can study an artificial IS at the transistor (e.g., diode, triode, MOSFET) level, at the chip structure level, at the computer hardware architecture level, etc. We can study intelligence at the logic gate level, at the variable level, at the function level, at the program level, at the system software architecture level, etc. Artificial intelligence can be studied by the disciplines of computer science, automation and robotics, microelectronics and solid state electronics, math and statistics, information theory, etc.

Figure 1: Physical and informational granularity of intelligence study

The above is far from a complete summary of all the disciplines and granularity levels at which intelligence is studied, but here we want to stress the idea that the phenomenon of intelligence involves multiple different levels of granularity, from the micro to the macro world, as shown in Figure 1.[31][32][33][34][18][35] We propose that:

Proposition 4

The same phenomenon of intelligence can happen on multiple levels of granularity. The same phenomenon of intelligence can be described by different mechanisms.


Although from the perspective of reductionism, all rules and mechanisms are incorporated in the microscopic-level system, we argue that certain aspects and mechanisms of intelligence can only be studied by us at certain higher granularity levels. It is practically important that:

Proposition 5

The phenomenon of intelligence needs to be studied at multiple levels of granularity in order for us to understand, handle and artificially reconstruct it. When investigating a certain subject of intelligence, it is a key issue to choose the appropriate level of granularity to study.


As we will show later, "neuron" and "concept" are two levels of granularity that we will consider in this work.

3 Comparison of human and artificial intelligence

3.1 Four aspects of human intelligence

Looking back to the eras of Isaac Newton and Albert Einstein, scientists in the field of physics observed phenomena such as celestial and object movements, formulated the theoretical mechanisms behind them, and validated the theories. In the field of AI, other than the models and algorithms we have "artificially" created, the only reference and object of observation is our own intelligence, including the various levels of "natural" intelligence of living species, from insects to humans, that emerged through the evolutionary history of life. As the most advanced and powerful natural intelligence we know of, human intelligence is the obvious choice to learn from while searching for a beam of light to understand the nature of intelligence and to advance the AI field to the next level.

Human intelligence is complex and can be interpreted in different ways. Important mechanisms in human intelligence include, but are not limited to, intuitive response, logical and analytical thinking, the biological and social objective system, the attention mechanism, the chemical regulating system, and the capability for self-optimization during evolution, development and learning. [36][37][38][39] There are several theories that describe human intelligence, among which the theory of two systems (System 1 and System 2) of the human brain is widely accepted in the psychology and behavioural science community.[40] [36] System 1 (a.k.a. the autonomous system) functions as an automatic response, whereas System 2 (a.k.a. the analytical system) requires our attention and effort to "think" in order to function.

We view human intelligence not as a summation of several independent blocks, but as an integrated whole with different aspects. Based on the theory of two systems, we further propose that, in order to understand the nature of intelligence, there are four key aspects of human intelligence that we need to investigate: the responsive intelligence (Aspect 1), the analytical intelligence (Aspect 2), the conceptualizational intelligence (Aspect 3), and the adaptive intelligence (Aspect 4).

Aspect 1 intelligence is the autonomous response of our brain and body to an input or stimulus, such as sucking and chewing, blinking, hitting a tennis ball, salivating at the sight of a picture of a lemon, pronouncing an easy word, giving the answer to 2+3, etc. It is fast, autonomous, and does not require attention. In general, one cannot "control" Aspect 1 intelligence activity, just as you cannot avoid blinking if a bug flies quickly towards your eye. In general, one cannot tell exactly what happened during the response process, just as you cannot tell which signals from your nervous system control which muscles to complete the blinking activity. Aspect 1 intelligence is a very typical end-to-end "black box". The IS receives an input and produces an output, while the IS itself does not know what happened in between.

We observe that Aspect 1 intelligence can be obtained in two ways. One is by evolution (the intelligence that newborn babies have, encoded in the DNA information system), and the other is by training (the intelligence that is learned through repeated activities). In a sense, evolution is about keeping the characteristics that produce enough repeated success or survival, while abandoning the others. So it is fair to say that Aspect 1 intelligence is the result of repeated training, during which the connection status (parameters and patterns) of the human neural network is fixed and optimized.

Aspect 2 intelligence, on the other hand, is the analytical ability of our brain, such as doing a hard math problem, counting the number of students in a classroom, preparing for a lecture, comparing two cars you want to buy, etc. It is slow, needs "mental" effort, and requires attention. In general, one is aware of and understands Aspect 2 intelligence during the process. The operation of Aspect 2 intelligence in our brain relies on three kinds of entities: natural language (such as English words), predefined symbols (such as mathematical operations or physical quantities), and undefined entities (such as a concept we have in mind but cannot accurately describe). We refer to all three kinds of entities as "concepts", which is a fair term for this broader idea of entities.

Aspect 2 intelligence, by operating on certain natural and logical rules, is able to take in input and give good results or precise predictions as output for very difficult problems. For a similar problem, Aspect 2 intelligence usually does not require many training examples, yet it can easily outperform Aspect 1 intelligence, which does. For instance, an ancient astronomer could observe the movements of planets and accumulate a great deal of experience as input for Aspect 1 intelligence, but still could not predict how they would move in the future. In contrast, once a few symbols are defined to formulate the law of gravity (see Equation 1, where $F_{grav}$ is the gravitational force, $m_{1}$ and $m_{2}$ are the masses of the two objects under consideration, $r$ is the distance between them, and $G$ is a constant), even a high school student can easily use Aspect 2 intelligence to accurately predict how these planets, or any similar gravitational system, will move in the future.

$$F_{grav}=G\frac{m_{1}m_{2}}{r^{2}} \tag{1}$$
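As a small worked illustration of how little is needed once the symbols in Equation 1 are defined, the following Python sketch evaluates the formula for the Earth-Moon system. The function name `gravitational_force` and the numerical constants (approximate textbook values) are assumptions introduced for this example and are not taken from the paper.

```python
G = 6.674e-11  # gravitational constant in N m^2 / kg^2 (approximate)


def gravitational_force(m1: float, m2: float, r: float) -> float:
    """Magnitude of the gravitational force from Equation 1, in newtons."""
    if r <= 0:
        raise ValueError("separation r must be positive")
    return G * m1 * m2 / r**2


# Approximate textbook values, used only for illustration.
EARTH_MASS = 5.972e24          # kg
MOON_MASS = 7.342e22           # kg
EARTH_MOON_DISTANCE = 3.844e8  # m (mean distance)

force = gravitational_force(EARTH_MASS, MOON_MASS, EARTH_MOON_DISTANCE)
print(f"Earth-Moon gravitational force: {force:.2e} N")
```

With these inputs the result is on the order of 2e20 N. No training data is involved: three measured quantities and one symbolic relation suffice, which is the contrast with Aspect 1 intelligence drawn in the text.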

We do not have evidence that any life form other than humans has Aspect 2 intelligence. It seems that the capability of Aspect 2 intelligence gives humans an advantage over other species, but we do not know how we acquired it through the course of evolution, or how we learned (or explored) it during the development of the brain.

Aspect 3 intelligence is the ability to convert perceptual (i.e., visual, auditory, tactile, etc.) signal streams into abstracted concepts. It is a remarkable information abstraction mechanism that bridges Aspect 1 and Aspect 2 intelligence. With Aspect 3 intelligence, we developed language capability by converting perceptions into words (for example, we saw many different apples and gradually abstracted the word "apple"). We developed words upon words (for example, the word "economy" is built on many other concrete perceptions and words), with a potentially unlimited number of "layers" of words. We were able to perceive certain characteristics of object movements, abstracted the ideas of mass, force and velocity, and defined them as symbols; only after that were we able to use our Aspect 2 intelligence to formulate Newton's second law. With Aspect 3 intelligence, we developed our entire system of science, which is formalized in natural language words, scientific symbols and their mathematical relations.

Aspect 4 intelligence is the ability to interact with and adapt to a (usually unknown and changing) environment. During the course of evolution, humans are able to update their DNA to generate improved behaviors for more successful survival in a changing natural environment. This mostly happens over many generations of reproduction. During the course of an individual life from birth to death, humans are able to build, update and adjust the neuron connections in the nervous system, especially in the brain, by perceiving and interacting with the natural and social environment to gain skills, abilities and knowledge, in order to better serve the survival, psychological and more advanced objectives of life.

In summary, Aspect 1 intelligence (response) is the primitive mechanism of human intelligence, which any life form possesses. Aspect 2 intelligence (analysis) is an advanced mechanism that only humans possess and that therefore gives them an advantage over other species. Aspect 3 intelligence (conceptualization) is the key mechanism that bridges Aspects 1 and 2, and thus makes Aspect 2 possible at all. Aspect 4 intelligence (adaptation) makes use of Aspects 1, 2 and 3, develops effective strategies, and updates the IS itself to serve the objectives of life in an often unknown and changing environment.

Some mechanisms of human intelligence (e.g., the flow of small chemical molecules in the brain, the conversion and usage of energy by neurons, etc.) are by nature intrinsic properties, and possibly limitations, of biochemical systems. Those mechanisms therefore need not be included in order to understand the nature of intelligence or to build artificial intelligence. There are many different angles, aspects and mechanisms to consider when observing human intelligence, but after examining a broad range of research on intelligence, we believe that responsive (Aspect 1), analytical (Aspect 2), conceptualizational (Aspect 3) and adaptive (Aspect 4) intelligence are the four most important building factors towards defining the nature of intelligence and towards building next-level artificial general intelligence.

3.2 Three types of mainstream artificial intelligence

We will discuss the three types of mainstream artificial intelligence, and their relations to the four aspects of human intelligence.

The currently most popular type is connectionism AI based on ANN models. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] An ANN consists of a number of "neurons", each with a few parameters describing how it responds (e.g., "activates") to input and produces output. The artificial neurons mimic biological neurons, while their interconnected structures mimic the biological neural network. As a result, a typical connectionism AI is in principle a function with a large number of parameters (up to hundreds or even thousands of billions of parameters). The function takes in input and gives "inferred" output. For example, it can take in an image of a busy traffic intersection, recognize the objects in it, and classify them as cars, trucks, pedestrians, bikes, etc. It can take in a spoken audio clip and output the corresponding text. It can take in an English sentence and output the translated Chinese text.
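To make the phrase "a function with a large number of parameters" concrete, here is a minimal Python sketch of a single artificial neuron and a tiny two-layer network. The sigmoid activation, the function names `neuron` and `layer`, and all weight values are assumptions chosen for illustration; real models differ mainly in scale and in how the parameters are obtained.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """A common activation function that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """One artificial neuron: weighted sum of the inputs plus bias, then activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def layer(inputs: Sequence[float],
          weight_rows: Sequence[Sequence[float]],
          biases: Sequence[float]) -> List[float]:
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# A tiny network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
# The 9 numbers below (6 weights, 3 biases) are its entire set of parameters.
hidden = layer([1.0, 0.0], [[0.5, -0.5], [-1.0, 1.0]], [0.0, 0.0])
output = neuron(hidden, [1.0, 1.0], -1.0)
print(hidden, output)
```

The whole network is just a composition of such functions, so its behavior is fully determined by the parameter values, none of which carries an obvious meaning on its own.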

The parameters are established by an optimization process, typically using large amounts of "annotated" data that contain the correct underlying relation between inputs and outputs. The difference between the inferred results and the ground truth serves as the objective to be minimized during the process. The performance of the AI relies on the design of the ANN model and on the amount and quality of the data. The better the model structure reflects the underlying relation between inputs and outputs, the better it performs. The more abundant, accurate and diverse the training data, the better the results the AI can achieve.
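The optimization process described above can be sketched in a few lines. The example below is a deliberately small stand-in, assuming a two-parameter linear model, a mean-squared-error objective and plain gradient descent; the function `train_linear`, the learning rate and the synthetic "annotated" data are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def train_linear(data: List[Tuple[float, float]],
                 lr: float = 0.05,
                 epochs: int = 2000) -> Tuple[float, float]:
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # parameters start with no knowledge of the relation
    n = len(data)
    for _ in range(epochs):
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
        # Step against the gradient to reduce the prediction error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic annotated data: inputs paired with ground-truth outputs of y = 2x + 1.
annotated = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
w, b = train_linear(annotated)
print(f"learned w={w:.4f}, b={b:.4f}")
```

The learned parameters approach the underlying relation only because the model structure (a line) matches the data; with a mismatched structure or noisy labels the same procedure would converge to a poorer fit, which mirrors the dependence on model design and data quality noted in the text.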

This type of AI has achieved tremendous success in various tasks, and often outperforms humans. However, we typically do not know what the parameters mean, or how the specific combination of parameters can work so well. We know it works, but we do not know how or why. To humans, it is basically an end-to-end black box.

This clearly reminds us of our own Aspect 1 intelligence. It is not surprising that:

Proposition 6

Connectionism AI mimics human Aspect 1 intelligence.

There are three reasons. First, both are end-to-end black-box forms of intelligence: we know that they work, but not exactly how or why. We are not even aware of the intermediate steps and underlying mechanisms while we use our Aspect 1 intelligence. Second, the structure of the ANN model intentionally mimics the human nervous system. Third, the training process of connectionism AI is very similar to the learning process of human Aspect 1 intelligence.

Note that there are efforts in unsupervised, semi-supervised and self-supervised learning that try to reduce the dependence on "annotated" data, but there is no major difference in the principal mechanism for realizing this form of intelligence.

The second type of mainstream AI is traditionally called symbolism AI, including general problem solvers, expert systems, knowledge engineering and knowledge graphs, natural language understanding, etc. [41] [42] [43] [44] In general, it tries to construct a system of "symbols" and their relations as a simplified representation of the knowledge or mechanism of a real-world system or a certain problem. The system runs according to certain logic rules. We then have:

Proposition 7

Symbolism AI is a mimic of human’s Aspect 2 intelligence.


The similarity between the two is obvious. However, the ability of symbolism AI is in general far behind human Aspect 2 intelligence. Humans can easily understand complex ideas expressed in natural language (e.g., a satirical novel with many metaphors) or scientific symbols (e.g., a calculus equation), and develop new theories (e.g., quantum mechanics, relativity theory) with astonishing accuracy and predictive power. Meanwhile, symbolism AI can hardly detect a commonsense error such as "the sun has one eye" or "a pencil is heavier than a toaster".

The third type of mainstream AI is traditionally called behaviorism AI. [17] [16] [45] [46] Reinforcement learning, currently the most popular form of behaviorism AI, tries to optimize the activity of an agent in an environment, measured by the accumulated "reward" for the agent. Unlike connectionism AI, which builds an end-to-end relationship, or symbolism AI, which derives certain knowledge, behaviorism AI focuses on searching for an optimal task strategy in a usually very high-dimensional parameter space, while trying to balance exploration of unknown territory against exploitation of known information. We here propose that:

Proposition 8

Behaviorism AI is a mimic of human’s Aspect 4 intelligence.


Each of the three types of mainstream AI tries to mimic a certain aspect of human intelligence, while the idea of combining more than one type of methodology has also been attracting more and more attention, especially the so-called neurosymbolic approach. [47] [48] [49] [50] [51] For example, Ding et al. [52] [53] [54] constructed a connectionism system that recognizes objects and their motions in video clips and a knowledge parser that processes text questions about the contents of the clips, and combined the two into one Q&A system. Lemos et al. [47] used a graph neural network and trajectory data of planets to reconstruct the law of gravity. Nevertheless, interesting as those efforts are, and although the area is attracting more attention, [55] [56] [57] the integration of different types of AI is still at a very primitive stage. The methodologies as well as their performance rely heavily on human design, participation and intervention.

3.3 Differences between human and artificial intelligence

When a kid first learns math, if he or she sees an apple, an orange, a pencil and a car, and is told each time that this is the number "1", he or she is able to build up the concept "one" from various (very different) visual experiences. This is part of Aspect 3 intelligence. While the kid is reading a math book, the vision system automatically (as part of Aspect 1 intelligence) relates the visual appearance of printed numbers to the abstract concept of numbers (e.g., 1, 2, 3), and in the meantime the logic system is working hard to understand and solve the math problem (as part of Aspect 2 intelligence, e.g., $12\times 24=288$ ). The kid might try different ways to solve the problem correctly and quickly so that he or she may get a candy, which is part of Aspect 4 intelligence.

Aspect 2 is only possible when Aspect 3 builds it up from Aspect 1. Aspect 1 often calls for help from Aspect 2 if it cannot handle the situation by itself. Aspect 4 constantly uses the power of Aspects 1, 2 and 3. The reason we use the term "aspect" instead of "type" to describe the four kinds of human intelligence is that they function in an integrated fashion, often depend on each other, and usually work simultaneously.

However, that is not the case in current mainstream artificial intelligence. The different models are usually used independently to solve various specific problems, and the efforts to integrate them are still at a very primitive stage. Each AI program works only in its very specific pre-defined scenarios. Furthermore, current mainstream artificial intelligence has no way to represent Aspect 3 intelligence, and symbolism AI lacks some key mechanisms and performs far behind Aspect 2 intelligence.

In a typical convolutional neural network (CNN, such as the YOLO series) that processes images or video of a busy traffic intersection and recognizes traffic participants, the raw data from the camera (e.g., RGB pixels) is processed by the CNN layer by layer. "Features" ranging from fine structures (e.g., light-dark edges) to large-scale objects (e.g., faces, wheels) are generated, and finally the probability that a recognized target belongs to a certain class of traffic participant (e.g., pedestrian, bike, car, truck) is calculated. This is somewhat similar to how humans do the same job, but the CNN can only assign probabilities to artificially predefined classes. The CNN cannot do much more with the "features" at different levels of granularity (which carry different levels of semantic information and different levels of causality), while humans are able to generate appropriate concepts, combine them to build an effective knowledge model, and use the model to understand, simulate and predict what is happening in the traffic scene, such as a danger of a crash, a jam to avoid, a rule behind the signal, and a strategy to drive or walk.

Figure 2 gives a comparison of human and artificial intelligence. The correspondence between human intelligence aspects and artificial intelligence types, as well as the granularity trend of them, are given. We propose that:

Proposition 9

Aspect 1, 2, 4 of human intelligence are on different granularities of neuron, concept, strategy, respectively. The granularity level increases in the order of Aspect 1, 3, 2, 4 for human intelligence, and increases in the order of connectionism, symbolism, behaviorism for artificial intelligence.


With what we already have in current mainstream artificial intelligence, we believe that:

Proposition 10

The three key missing pieces for understanding the nature of intelligence and constructing the next level artificial intelligence are: (1) Aspect 3 intelligence that generates appropriate concepts out of raw data through multi-step information abstraction, (2) a World Model with effective running mechanism, and (3) a Self Model.


4 The World-Self Model (WSM)

In this section we present the key ideas of our World-Self Model (WSM), the mathematical representation of the model, and notes on possible computer implementations.

4.1 Creating "concepts": Aspect 3 intelligence connects perception and cognition

Humans have the ability to create an informational representation of the physical world. When we see an apple, the light reflected from the apple passes through the cornea, is modulated by structures such as the pupil and the crystalline lens, and is then converted to electrical signals by retinal cells. The physical reality of the apple is now represented by an informational cluster of electrical signals. The human eye has a resolution on the order of 200 million "pixels", while the number of optic nerve cells is only on the order of 1 million "pixels". It is believed that the information is compressed over 100-fold before travelling to the visual regions of the brain. The compressed representation is processed by multiple visual regions, which create higher-level features such as color pattern, round curved shape, glossiness, size, etc. The informational representation of the physical reality of the apple is now the status and connection patterns of certain brain neurons.

When a child sees an apple for the first time, his or her brain generates those compressed representations for the first time. But after seeing other apples (similar, but often with quite different appearances) a few times, he or she is remarkably able to recognize the similarities (features in common), organize them into one category, and name it with a further compressed (simplified) representation, "apple", if told so by his or her parents.

In physical reality, the approximate number of photons reflected by the apple is given by:

N_{p}=\frac{SLt\lambda}{hc}\approx 5\times 10^{18}

(2)

where $S=0.02\ m^{2}$ is the apple's illuminated surface area, $L=100$ Lux is the average intensity of light reflected by the apple (taken as roughly $100\ W/m^{2}$ for this order-of-magnitude estimate), $\lambda=500$ nm is the average wavelength of the reflected light, $t=1$ second is the time period under consideration (and of eye observation), $h=6.626\times 10^{-34}\ J\cdot s$ is the Planck constant, and $c\approx 3\times 10^{8}\ m/s$ is the speed of light. We assume here that one photon carries 1 bit of information, although in principle it could be more than 1 bit. We can then roughly quantify the original information of the physical reality:

H_{real}\approx 5\times 10^{18}\ \text{bits}

(3)

The eye captures some of the photons, and retinal cells convert them into electrical signals. The information is on the order of:

H_{retina}\approx 2\times 10^{8}\ \text{bits}

(4)

The information transmitted to the brain via the optic nerve cells is on the order of:

H_{nerve}\approx 1\times 10^{6}\ \text{bits}

(5)

The exact way the brain stores the features created by its visual areas is not clear. To get a rough idea of the order of magnitude, we assume here that 10 features are created, each with a probability of 0.01. The information is then on the order of:

H_{feature}\approx 700\ \text{bits}

(6)

Finally, the word "apple" is stored in the language area of the brain. Assuming a regular encoding method (e.g., 8 bits per character for its 5 letters), the word "apple" carries information on the order of:

H_{concept}\approx 40\ \text{bits}

(7)
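
The order-of-magnitude estimates in Eqs. (2)-(7) can be checked numerically. The sketch below is only an illustration under stated assumptions: the photon count is computed as total light energy divided by the energy per photon, $N_{p}=SLt\lambda/(hc)$, the quantity $L$ is treated as an irradiance of 100 W/m² (an assumption needed to arrive at the quoted $5\times 10^{18}$), and the word "apple" is assumed to be stored at 8 bits per character. The variable names are hypothetical.

```python
import math

# Physical constants (SI units).
PLANCK_H = 6.626e-34   # Planck constant, J*s
LIGHT_C = 3.0e8        # speed of light, m/s

# Values quoted in the text; L is treated as 100 W/m^2 (an assumption).
S = 0.02               # illuminated surface area, m^2
L = 100.0              # light intensity
t = 1.0                # observation time, s
wavelength = 500e-9    # average wavelength, m

# Photon count = total energy / energy per photon = (S*L*t) / (h*c/lambda).
n_photons = S * L * t * wavelength / (PLANCK_H * LIGHT_C)

# Self-information of one feature that occurs with probability p: -log2(p) bits.
bits_per_feature = -math.log2(0.01)

# "apple" stored as 5 characters at 8 bits each (assumed encoding).
concept_bits = len("apple") * 8

# Information at each stage, using the values quoted in Eqs. (3)-(7).
stages = {
    "real": n_photons,     # 1 bit per photon, as assumed in the text
    "retina": 2e8,
    "nerve": 1e6,
    "feature": 700,
    "concept": concept_bits,
}
total_compression = stages["real"] / stages["concept"]

print(f"photons ~ {n_photons:.2e}")
print(f"bits per feature ~ {bits_per_feature:.2f}")
print(f"overall compression ~ {total_compression:.2e}")
```

Under these assumptions a single feature of probability 0.01 carries about 6.6 bits, so 10 such features would give roughly 66 bits, and the 700-bit order used in Eq. (6) corresponds to on the order of 100 such features. The overall compression ratio from photons to the final concept does not depend on this intermediate stage.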

The quantity of information is compressed over $10^{17}$-fold in a few consecutive steps. Astonishingly simple and effective representations of the physical world are created in the brain, which allow the brain to handle abundant and complex information from reality. We call the final representation a "concept". This remarkable process of concept formation is illustrated in Figure 3, and we here propose a definition:

Proposition 11

A "concept" for a human is a word, symbol or idea stored in the connected structure of neurons. It has the status "activated" if being visited by consciousness, or "not activated" otherwise.


Concepts can form from various sources. Concrete objects that one can directly see, touch or hear (e.g., apples, birds, cars, buildings, water, sirens, air) can form concepts. Objects that usually cannot be directly perceived (e.g., atoms, electrons, the Hubble space telescope, the star Proxima Centauri) can form concepts. Motions (e.g., running, jumping, sliding, gliding) can form concepts. Characteristics (e.g., red, big, fast, hard) can form concepts. Abstract ideas (e.g., the math operation log, politics, belief, nice) can form concepts. Natural language words (e.g., those you are reading) can all (potentially) form concepts in the human intelligent system. Symbols in a certain professional field (e.g., $\pi$ , $\div$ , $\sqrt{3}$ , $h$ as Planck constant, $Fe$ as a chemical element, DNA as a molecule, $s$ as spin state of an electron of the $Fe^{2+}$ ion in an enzyme molecule) can form concepts. Certain ideas that cannot be easily expressed in natural language or other symbol systems (e.g., the special feeling at a certain moment of a woman in love) can also form concepts.

Concepts lie on a structure of multiple hierarchical levels. The bottom-level concepts are formed from non-conceptual information, e.g., the concept "apple" is formed from certain common features created in a child's brain when he or she sees, touches or tastes apples a number of times. The concept "fruit" is on a higher level than "apple". The concepts "plant", "life" and "matter" are on successively higher levels in the hierarchical tree of concepts. Similarly, the concept "vehicle" is on a higher level than the concepts "car", "truck" or "bus", and the concept "color" is on a higher level than the concepts "red", "green" or "purple". The concept "feature" is on a higher level than the concepts "color", "size" or "shape". There are potentially an infinite number of layers in the structure. Nevertheless, the concept "concept" itself is on the top level, higher than any other concept such as "matter", "feature", "economy", "mind" or "consciousness".

"Concept" is about the commonality of different things. It is the key output of Aspect 3 intelligence. After perceiving through the visual, auditory and tactile systems and generating a large amount of information in the form of electrical signals, the human brain is able to process the information, compress it in a few steps, and finally form "concepts" and "concepts above concepts". Concepts carry very little information, which makes them easy to store and process, yet they can form a model that is very effective at simulating and predicting the real world.

In other words, Aspect 3 intelligence converts the information coming out of the perception system into concepts that build up the cognition system. We propose that:

Proposition 12

Aspect 3 intelligence is the mechanism that connects perception and cognition.

4.2 Connecting "concepts": the World Model (WM)

With enough correctly annotated image data, if we train an ANN to classify traffic participants into four types, it can effectively assign cars, trucks, bikes and persons to the correct category. With some additional NLP training, it can even produce a text description of the participants in response to a text inquiry. But if one asks the ANN "is the truck or the person heavier?" or "which one is alive, the person or the truck?", it will have no idea at all. If we provide more annotated data and further tune the ANN model, the accuracy or efficiency of classification and description could become very high, yet it can never answer the latter questions, because the concepts "heavy" and "alive" are far beyond any information that the ANN received from the training data.

In contrast, if a person is asked the same questions, even a kindergarten child can easily answer them. We can imagine what happens in a person's mind while answering these questions. The concept "heavy" is connected to many other concepts, including "person" and "truck". The connection from "truck" to "heavy" is stronger than the connection from "person" to "heavy". From past experience (the learning process), one knows that "heavy" (as well as "light") is a "feature" (which is also a concept) of objects, that it is a continuous degree of measurement (in contrast to "yes" or "no"), that it has a certain relationship ("contrary") with "light", and that "truck" has more of this feature than "person". A person can build up such a network of concepts, which makes it very easy to answer such questions. We here call this network of concepts a "world model" (WM).

A WM is a high-dimensional, interconnected, complex network of concepts. The concept "fruit" connects to its sub-layer concepts such as "apple", "orange" or "banana", to its upper-layer concepts such as "plant" or "food", and to other concepts such as "edible", "nutrition", "beneficial to humans", etc. The concept "apple" connects to the concept "fruit", to the related concept "peach", to the concept "mobile phone" (because of the brand "Apple"), to the concept "company" (because of the company "Apple"), and to the concept "gravity" (because of the famous story that Isaac Newton's great discovery of gravity was inspired by an apple dropping on his head). The concept "company" connects to the concept "KFC", and further connects to the concept "food".

The connection between concepts $A$ and $B$ is directional, and can be strong or weak. The connection from "apple" to "fruit" is strong, while the connection from "fruit" to "apple" is weaker. The connection from "economy" to "technology", "stock" or "policy" is stronger than to "philosophy", "happy" or "dance". We here define the "strength" of the connection from concept $A$ to concept $B$ as the probability of activating $B$ upon activation of $A$ :

S(A\rightarrow B)=P(B=B^{a}|A=A^{a})

(8)

The superscript $a$ indicates an activated concept, $A=A^{a}$ means that concept $A$ is activated, and $P(B=B^{a}|A=A^{a})$ is the chance of activating concept $B$ ( $B=B^{a}$ ) rather than any of the other concepts connected to $A$ . If $A$ connects to $N$ concepts and $X_{i}$ is the $i$-th of them, then we have:

\sum_{i=1}^{N}P(X_{i}=X_{i}^{a}|A=A^{a})=1

(9)
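
Eqs. (8) and (9) can be illustrated with a small directed table of strengths. The sketch below is a hedged example: the raw association weights are invented purely for illustration, and `normalize_strengths` is a hypothetical helper that rescales the outgoing weights of each concept so that they sum to 1, as Eq. (9) requires.

```python
def normalize_strengths(raw_weights):
    """Turn non-negative raw weights {target: weight} into strengths summing to 1."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one positive weight is required")
    return {target: weight / total for target, weight in raw_weights.items()}

# Hypothetical raw association weights; the numbers are illustrative only.
raw = {
    "apple": {"fruit": 6.0, "peach": 2.0, "company": 1.0, "gravity": 1.0},
    "fruit": {"apple": 2.0, "orange": 2.0, "banana": 2.0, "food": 4.0},
}
strength = {source: normalize_strengths(targets) for source, targets in raw.items()}

# Connections are directional: S(apple -> fruit) differs from S(fruit -> apple).
s_apple_fruit = strength["apple"]["fruit"]
s_fruit_apple = strength["fruit"]["apple"]
print(s_apple_fruit, s_fruit_apple)
```

Storing the strengths per source concept keeps the asymmetry explicit: each concept owns its own outgoing probability distribution.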

With Aspect 2 and Aspect 3 intelligence, humans can build up a WM consisting of concepts and their relationships. We want to stress that the WM we propose here is a broader idea than the sometimes-mentioned "world model of commonsense". [43][44] It builds up commonsense from regular life experience such as eating, sleeping, talking, travelling, interacting with other people, etc. It also builds up more precise models of the real world from professional experience: a physicist knows the laws of mechanics, electronics, optics and the microscopic and cosmic world; a biologist knows how lives, bodies, organisms and neurons work; a psychologist knows a lot about cognition, emotion, behavior, personality and motivation. All of a person's experiences contribute together to build up that person's WM:

Proposition 13

A World Model (WM) is an informational mimic of the physical real world. It consists of concepts and their structured relations.

Furthermore, humankind has collectively built a huge WM from all of its accumulated common knowledge. We have:

Proposition 14

Each person has a unique WM built from his or her experiences. Mankind has the Great World Model (GWM), built from all of mankind's accumulated effective common knowledge.

The WM provides a mimic of the real world. Animals such as pigs are able to perceive the environment, respond to inner needs (e.g., eat if hungry) and outer stimuli (e.g., scream if hit), and exhibit Aspect 1 intelligence. In contrast, with the help of the WM, humans are able to describe and predict the real world at a whole new level. Humans do so not by actually making things happen in the real world and observing them, but by "virtually" running the WM in the informational world of the mind.

The WM runs by certain rules. In organizing concepts (many of which are natural language words) into complex ideas, and in generating descriptions and predictions, grammar and the rules of language play a key role. In understanding and deducing the natural laws described by scientific symbols, the laws of logic play a key role. The origin and complete picture of the running rules of the WM are still an open question under investigation.

There is a tendency in some AI communities to use large ANN models. Models with 175 billion parameters (GPT-3 [12] [58]) and 1750 billion parameters (FastMoE [59]) have been built and studied. The data used to train those models are on the order of dozens of TB, including text, images, audio, etc. Those "large models" perform remarkably well on some tasks. They are able to generate descriptions or answer questions in natural language. But they do so by mimicking the rules of language in what is, in principle, an end-to-end fashion. Without a "structured" world model that effectively reflects the complicated relations between concepts, the large models are overall still far from exhibiting human-level intelligence.

4.3 The "concept" of "self": the Self Model (SM)

Among all concepts, there is one of unique importance: self. The IS is part of the physical world. While the IS is building up its world model by receiving input, perceiving, and condensing information into concepts, it needs to be able to obtain information about itself, build up the concept of "self", recognize "self" as part of "world", and simulate the interactions between "self" and "world" in its own world model. For instance, a person at a dining table is able to perceive not only the outside world, such as the walls of the room, the seats, plates and food, and the voice of another person at the table, but also his or her own hands holding the spoon, the taste in the mouth, the movement of the body, and his or her own voice speaking to the other person. One is able to connect the right perception stream to self, and is able to imagine what would happen if one told a joke to another person. One knows how to adjust the status of "self" in the world model, by sensing one's own state and estimating the amount of food eaten, to achieve the goal of eating, e.g., being no longer hungry but not too full.

By building up the concept "self", the IS has a self model (SM) that represents itself. For simplicity, when we mention the term world model (WM), we refer to the model of the physical world not including the IS itself. The IS and the physical world each have a model in the IS's information system: the SM and the WM. We call the combination of the two the world-self model (WSM). The relation and interaction between the SM and the WM within the WSM are crucial for understanding or reconstructing intelligence.

Figure 4 illustrates the interesting structure of this idea. The physical world and the IS both have an informational representation in the WSM of the IS itself. In this WSM, the SM (which is an informational representation of the IS) in turn contains a WSM. This loop can in principle go on infinitely. It gives rise to a number of interesting philosophical topics, which are beyond the scope of this paper and will be discussed in another paper.

4.4 Mathematical representation of the World-Self Model (WSM)

4.4.1 Creating concepts

For humankind, the remarkable history of the formation of language is an odyssey of creating essential concepts that represent real world entities. For a person, the learning process of language and professional knowledge is to build up the concepts in his or her brain neural network, as informational representations corresponding to the reality that he or she perceives.

For artificial IS implementation, one can use any of the following as the content of a concept: a natural language word, phrase or sentence, a math or scientific symbol, or any artificially defined symbol designated to represent a certain concept.

When existing concepts are not able to represent a new idea, new concepts need to be created. Human Aspect 3 intelligence does so by creating a new word or a new symbol, using the human perception and cognition system. For artificial IS implementation, this can be done in two ways. First, one can manually assign a new word or symbol to each new concept, which is transparent and understandable to humans. Second, one can use a carefully designed artificial neural network (ANN) to process low-level raw data and extract high-level features. Those features are potential new concepts that can be used by the artificial IS but are likely not transparent or understandable to humans.

We denote a concept by $C$ , and the special concept of self by $C_{0}$ . The whole concept space (WCS) $\mathbb{M}^{W}$ is formed by $W+1$ concepts:

\mathbb{M}^{W}=\{C_{w}\}

(10)

where $w\in\{0,1,2,...,W\}$ is an integer between $0$ and $W$ .

4.4.2 Connecting concepts

A WSM contains the concepts and their relations. We define each concept of $\mathbb{M}^{W}$ (including $C_{0}$ ) as a vector:

{C_{w}}=C_{w}(a_{w},C_{w}^{0},\vec{N}_{w0},\vec{N}_{w1},\vec{N}_{w2},...,\vec{N}_{wK})

(11)

where $a_{w}=a(C_{w})\in\{0,1\}$ indicates whether the concept $C_{w}$ is in the default (not activated) state ( $a(C_{w})=0$ ) or the activated state ( $a(C_{w})=1$ ). The activation state can in principle take any continuous value between 0 and 1, but we simplify it to just two values here. $C_{w}^{0}$ is the content of $C_{w}$ , e.g., a word, phrase, or symbol. The concept $C_{w}$ connects to $K$ other concepts, each with a 2-dimensional connection vector

\vec{N}_{wk}=(S(C_{w}\rightarrow C_{k}),R(C_{w}\rightarrow C_{k}))

(12)

where $S(C_{w}\rightarrow C_{k})\in[0,1]$ is the directed connecting strength from concept $C_{w}$ to concept $C_{k}$ , and $R(C_{w}\rightarrow C_{k})\in\{1,2,...,N_{t}\}$ is the relation indicator, an integer between $1$ and $N_{t}$ .
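
One possible way to hold the concept vector of Eq. (11) and the connection vector of Eq. (12) in a program is sketched below. This is only an illustrative data structure, not a prescribed implementation: the class names `Concept` and `Connection`, the `connect` method and the three example relation types are all assumptions introduced here, since the relation types are not enumerated in the text.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Connection:
    strength: float   # S(C_w -> C_k), a value in [0, 1]
    relation: int     # R(C_w -> C_k), an integer in {1, ..., N_t}

@dataclass
class Concept:
    content: str                      # C_w^0: a word, phrase, or symbol
    activated: bool = False           # a_w: False = default, True = activated
    connections: Dict[str, Connection] = field(default_factory=dict)

    def connect(self, target: str, strength: float, relation: int, n_types: int) -> None:
        """Add a directed connection to `target`, checking the ranges of S and R."""
        if not 0.0 <= strength <= 1.0:
            raise ValueError("strength must lie in [0, 1]")
        if not 1 <= relation <= n_types:
            raise ValueError("relation must lie in {1, ..., N_t}")
        self.connections[target] = Connection(strength, relation)

# Hypothetical relation types, for illustration only.
RELATION_TYPES = {1: "is_a", 2: "has_feature", 3: "opposite_of"}

truck = Concept("truck")
truck.connect("vehicle", strength=0.7, relation=1, n_types=len(RELATION_TYPES))
truck.connect("heavy", strength=0.3, relation=2, n_types=len(RELATION_TYPES))
print(truck)
```

Keying the connections by target content keeps each lookup of $S$ and $R$ a single dictionary access; an index-based layout would work equally well.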

A connection between two concepts has two properties: one is the strength $S$ , the other is the relation indicator $R$ . $N_{t}$ is the total number of relation types available for a connection between two concepts. For example, even if a number of connections have the same connection strength, they can still have very different connection types. We here propose that:

Proposition 15

A finite number $N_{t}$ of different connection types is sufficient to build up an effective WSM.

We further define $P_{wk}$ as the notation for probability of activating $C_{k}$ after $C_{w}$ is activated:

P_{wk}=P(a_{k}=1|a_{w}=1)=S(C_{w}\rightarrow C_{k})

(13)

To establish such a network with effective connections, humans use the mechanisms of storing and connecting concepts in the brain's neural network, the details of which are still not clear. For artificial IS implementation, such a network can be trained. The training data $\mathbb{D}^{C}$ is defined as

\mathbb{D}^{C}=\{S^{C}\}

(14)

i.e., a collection of $S^{C}$ , which are streams of concepts that contain knowledge. $S^{C}$ could be human verbal dialogue texts, scientific papers, documentary text such as Wikipedia, or any combination of symbols that contains knowledge. Note that $\mathbb{D}^{C}$ should be selected to match the task.

With such data, a neural network model $\mathbb{T}$ can be constructed with two main components:

\mathbb{T}=\{\mathbb{T}_{S},\mathbb{T}_{R}\}

(15)

where $\mathbb{T}_{S}$ is used to learn the strength values $S$ , and $\mathbb{T}_{R}$ is used to learn the relation indicator values $R$ . Networks with an attention mechanism, such as the Transformer [13] (which typically generates very good results for word embedding), are good candidates for $\mathbb{T}_{S}$ and $\mathbb{T}_{R}$ .

With the data and model described above, one can build up a WSM as the key component of the artificial IS.
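
To make the idea of learning strengths from concept streams concrete, the sketch below uses a deliberately simple stand-in for $\mathbb{T}_{S}$: it is not the attention-based network suggested above, but a plain successor-count estimate in which $S(A\rightarrow B)$ is the fraction of times $B$ directly follows $A$ in the data. The function name `estimate_strengths` and the toy corpus are hypothetical. By construction, the outgoing strengths of each concept sum to 1.

```python
from collections import Counter, defaultdict

def estimate_strengths(streams):
    """Estimate S(A -> B) as the fraction of times B directly follows A."""
    counts = defaultdict(Counter)
    for stream in streams:
        for current, following in zip(stream, stream[1:]):
            counts[current][following] += 1
    strengths = {}
    for source, followers in counts.items():
        total = sum(followers.values())
        strengths[source] = {target: n / total for target, n in followers.items()}
    return strengths

# Hypothetical toy corpus D^C: each stream S^C is a list of concept tokens.
corpus = [
    ["apple", "is", "fruit"],
    ["apple", "is", "food"],
    ["truck", "is", "heavy"],
    ["apple", "falls"],
]
learned = estimate_strengths(corpus)
print(learned["apple"])
```

Such a count-based baseline only captures adjacent co-occurrence; a trained attention model would be expected to capture longer-range relations between concepts.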

4.4.3 Running the WSM

We can run the WSM to conduct various tasks and achieve various goals. The basic functional unit of the WSM is producing an output for a given input, e.g., solving a problem. For example, a question like "how many eyes does the sun have?" or an idea like "the meaning of life is to look for the meaning of life" is a stream of natural language word concepts, while a problem like "if $a=2,b=3a$ , then what is $b$ ?" is a stream of natural language and math symbol concepts. If an idea cannot be expressed in existing words or symbols, the IS can always generate new concepts (via Aspect 3 intelligence) and assign new symbols to them. In the context of the WSM, we propose that:

Proposition 16

Any complex problem or answer under consideration by the IS can be expressed as a stream of concepts.

The WSM works by the following steps:

(1) Receiving: The WSM receives an input, which is a stream of $I$ concepts, forming the input concept vector

\vec{V}^{I}=(C_{1},C_{2},C_{3},...,C_{I})

(16)

where these $I$ concepts are from the WCS $\mathbb{M}^{W}$ and form a sub-space $\mathbb{M}^{I}$ that we call the input concept space (ICS):

\mathbb{M}^{I}=\{C_{i}\}

(17)

where $i\in\{1,2,3,...,I\}$ .

(2) Activating: Each received concept $C_{i}$ is activated

a(C_{i})=1

(18)

and any directly or indirectly connected concept $C_{j}$ whose connection is strong enough is also activated:

a(C_{j})=1

(19)

forming the activated concept space (ACS):

\mathbb{M}^{A}=\{C_{j}\}

(20)

where $j\in\{1,2,3,...,J\}$ and $J$ is the total number of concepts activated by the input stream of $I$ concepts. The condition for a concept to be activated is that there exists a chain of consecutively connected concepts $C_{1},C_{2},...,C_{Q+1}$ , ending at that concept, such that:

\prod\limits_{j=1}^{Q}S(C_{j}\rightarrow C_{j+1})\geq T_{a}

(21)

where $T_{a}$ is the activation threshold. The first concept $C_{1}$ of the chain has to be one of the input concepts:

C_{1}\in\mathbb{M}^{I}

(22)

The activation process starts from the input concepts. A "layer" of concepts that are strongly connected to the input concepts is activated first, and further "layers" of concepts can then be activated as long as their (multi-step) connections to the input concepts are strong enough.

For humans, a concept being "activated" means that it is being accessed by conscious attention. Concepts that can potentially be activated in the subconscious are not considered activated under this definition. The activation of concepts (either in ICS or ACS) has to happen one by one in a linear sequential manner, but the brain has a mechanism for storing a number of activated concepts for a certain period of time. If the input is not too long, all concepts in the ACS are stored and ready to be used in the next step.

For an artificial IS implementation, the input size $I$ and the connection strength threshold $T_{a}$ can be designed so that the generated ACS has an appropriate size, one that satisfies the requirements of the intelligent task as well as the limitations of hardware and time.
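The activation step can be illustrated with a minimal Python sketch. It is not the author's implementation: it assumes that the WSM is stored as a nested dictionary `strength[c][k]` holding $S(C_{c}\rightarrow C_{k})$, that every strength lies in $(0,1]$ (so the product along a chain never grows), and that input concepts themselves count as members of the ACS. Under these assumptions, a best-first search finds, for every concept, the strongest chain from any input concept and keeps those whose product is at least $T_{a}$. The function name `activate` and the toy concept names are hypothetical.

```python
import heapq

def activate(strength, inputs, t_a):
    """Return {concept: strongest chain product} for every activated concept.

    strength: dict mapping concept -> {neighbor: S}, with S assumed in (0, 1]
    inputs:   iterable of input concepts (the ICS)
    t_a:      activation threshold T_a
    """
    best = {c: 1.0 for c in inputs}       # input concepts are activated directly
    heap = [(-1.0, c) for c in best]      # max-heap emulated with negated products
    heapq.heapify(heap)
    while heap:
        neg_prod, c = heapq.heappop(heap)
        prod = -neg_prod
        if prod < best.get(c, 0.0):
            continue                      # stale entry, a stronger chain was found
        for nxt, s in strength.get(c, {}).items():
            cand = prod * s               # product of strengths along the chain
            if cand >= t_a and cand > best.get(nxt, 0.0):
                best[nxt] = cand
                heapq.heappush(heap, (-cand, nxt))
    return best

# Toy concept network (hypothetical values).
strength = {
    "sun": {"star": 0.9, "eye": 0.1},
    "star": {"light": 0.8, "planet": 0.5},
}
acs = activate(strength, ["sun"], t_a=0.4)
print(sorted(acs))  # concepts in the ACS for this threshold
```

Raising `t_a` to 0.5 in this toy network drops "planet" (chain product 0.45) from the ACS, which is the kind of size control described above.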

(3) Searching: The output of WSM is a stream of $N$ activated concepts, forming the output concept vector

\vec{V}^{O}=(C_{1},C_{2},C_{3},\ldots,C_{N})

(23)

These $N$ concepts are all from the ACS $\mathbb{M}^{A}$ and form a sub-space $\mathbb{M}^{O}$ that we call the output concept space (OCS):

\mathbb{M}^{O}=\{C_{n}\}

(24)

with $n\in\{1,2,...,N\}$ and $N\leq J$ . $N$ could be a large number, or a small number like $1$ with only one concept as output.

$\vec{V}^{O}$ is a permutation of $N$ concepts out of the $J$ concepts of ACS $\mathbb{M}^{A}$ . There is a total number of $B$ possible permutations:

B=P_{N}^{J}=\frac{J!}{(J-N)!}

(25)

where the $b$-th possible permutation is denoted $\vec{V}^{O}_{b}$. To get the optimal output, e.g., to give the best answer (as output) to a question (as input), the WSM needs to search through the ACS for the optimal permutation. We define the loss function of a candidate $\vec{V}_{b}^{O}$ as $L(\vec{V}_{b}^{O})$. Theoretically, the optimal permutation $\vec{V}^{O}$ satisfies:

L(\vec{V}^{O})=\min\limits_{b=1}^{B}L(\vec{V}^{O}_{b})

(26)

In practice, we accept a candidate as the optimal output:

\vec{V}^{O}=\vec{V}^{O}_{b}

(27)

if the loss function satisfies the cut-off condition:

L(\vec{V}^{O}_{b})\leq L^{c}

(28)

with $L^{c}$ the cut-off value. The loss function $L(\vec{V}^{O}_{b})$ of permutation $\vec{V}^{O}_{b}$ is defined as:

L(\vec{V}^{O}_{b})=\alpha_{s}L_{s}(\vec{V}^{O}_{b})+\alpha_{r}L_{r}(\vec{V}^{O}_{b})+\alpha_{p}L_{p}(\vec{V}^{O}_{b})

(29)

The first term $L_{s}(\vec{V}^{O}_{b})$ favors stronger connections between output and input concepts. Suppose there are a total of $P$ possible paths connecting input concepts to output concepts, and the $p$-th path has $Q_{p}-1$ connection steps (connecting $Q_{p}$ concepts). We then have:

L_{s}(\vec{V}^{O}_{b})=\frac{1}{\sum\limits_{p=1}^{P}\prod\limits_{q=1}^{Q_{p}-1}S(C_{q}\rightarrow C_{q+1})}

(30)

where, for every path, $C_{1}\in\mathbb{M}^{I}$ and $C_{Q_{p}}$ is a component of $\vec{V}^{O}_{b}$. The second term $L_{r}(\vec{V}^{O}_{b})$ favors appropriate relations between concepts; it needs further study to include rules such as semantic logic and mathematical logic. The third term $L_{p}(\vec{V}^{O}_{b})$ is a penalty that incorporates human habits such as the rules of language, so that humans can better understand, improve and communicate with the artificial IS. This term can be obtained from a trained ANN.
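The searching step can be sketched as a brute-force loop over candidate permutations. The following Python sketch rests on several assumptions that are not in the text: paths are restricted to simple paths of at most `max_hops` connections, the terms $L_{r}$ and $L_{p}$ are left as placeholder callables that return zero (the text leaves them for future study and for a trained ANN, respectively), and a candidate with no path from the input gets infinite loss. All function and parameter names are hypothetical.

```python
from itertools import permutations
from math import perm

def path_sum(strength, inputs, target, max_hops=3):
    """Sum, over simple paths of at most max_hops connections leading from an
    input concept to `target`, of the product of connection strengths."""
    total = 0.0

    def walk(node, prod, visited, hops):
        nonlocal total
        if node == target and hops > 0:
            total += prod
            return
        if hops == max_hops:
            return
        for nxt, s in strength.get(node, {}).items():
            if nxt not in visited:
                walk(nxt, prod * s, visited | {nxt}, hops + 1)

    for c in inputs:
        walk(c, 1.0, {c}, 0)
    return total

def loss_s(strength, inputs, candidate):
    """Connection-strength term: reciprocal of the summed path products."""
    total = sum(path_sum(strength, inputs, c) for c in candidate)
    return float("inf") if total == 0 else 1.0 / total

def search(strength, inputs, acs, n, l_cut,
           alpha_s=1.0, alpha_r=0.0, alpha_p=0.0,
           loss_r=lambda v: 0.0, loss_p=lambda v: 0.0):
    """Scan permutations of n concepts from the ACS. Stop at the first
    candidate whose loss is <= l_cut; otherwise return the minimum found."""
    best, best_loss = None, float("inf")
    for cand in permutations(acs, n):
        loss = (alpha_s * loss_s(strength, inputs, cand)
                + alpha_r * loss_r(cand) + alpha_p * loss_p(cand))
        if loss < best_loss:
            best, best_loss = cand, loss
        if loss <= l_cut:
            break  # cut-off condition met, accept this candidate
    return best, best_loss

strength = {
    "sun": {"star": 0.9, "eye": 0.1},
    "star": {"light": 0.8, "planet": 0.5},
}
acs = ["sun", "star", "light", "planet"]
print(perm(len(acs), 2))  # number of candidates B for J=4, N=2: 12
print(search(strength, ["sun"], acs, n=1, l_cut=0.0))
```

With the placeholder $L_{r}$ and $L_{p}$, the loss depends only on which concepts are chosen, not on their order; the ordering of the output stream would have to come from those two terms. The factorial growth of $B$ also shows why the cut-off condition and a bounded ACS matter in practice.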

4.5 Parameter sensitivity analysis and computer implementation

The parameter $W$ (Eq. 10) represents the size of the WCS and, to some extent, the level of intelligence that the IS can potentially achieve. A regular adult may know, for example, $20000$ to $30000$ natural language words (as concepts), and a larger number of word combinations (also as concepts). An ANN-based "large model" can have hundreds of billions of parameters, each of which can in principle be considered a concept.

Nevertheless, the connection structure of concepts, rather than the number of concepts, is usually more important for the IS to achieve high-level intelligence. How is the connection structure represented in the model? The parameter $K$ represents the "density" of interconnections among concepts, and $N_{t}$ represents the "diversity" of connection types (Eq. 11 and 12). Together they define the overall profile of the WSM. Large values of $W$, $K$ and $N_{t}$ indicate a high likelihood of achieving high-level intelligence.

The parameter $I$ (Eq. 16) represents the magnitude (and thus often the level of difficulty) of the problem to be solved by the IS. The parameter $J$ (Eq. 20) represents the size of the ACS (the magnitude of the subsystem of the IS) to be used to solve the specific input problem.

The ability of the IS to solve specific problems is highly sensitive to the parameters $J$ (Eq. 20) and $T_{a}$ (Eq. 21). The ACS could be too small to solve the problem effectively or to give a meaningful output, or it could be too large, making the search for the optimal output too demanding in computing power and time. For practical computer implementation of the above processes, we discuss two possible tricks.

The first trick is "ordering". While constructing the WSM by learning and updating the vectors $\vec{N}$ (see Eq. 11) of connected concepts for a certain concept, one can arrange the connected concepts in descending order of strength $S$. A cut-off value $K^{\prime}<K$ can then be applied so that only the first $K^{\prime}$ connected concepts are considered, which can greatly reduce the computing demand.
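As a small illustration of the "ordering" trick, the sketch below assumes that the connected concepts of one concept are available as a dictionary from neighbor to strength $S$ (the actual layout of $\vec{N}$ is defined in Eq. 11 and is not reproduced here). The function name `top_k_neighbors` is hypothetical.

```python
def top_k_neighbors(neighbors, k_prime):
    """Sort connected concepts by descending strength S and keep the first k_prime.

    neighbors: dict mapping neighbor concept -> strength S
    k_prime:   cut-off K' (number of connections to keep)
    """
    ordered = sorted(neighbors.items(), key=lambda kv: kv[1], reverse=True)
    return ordered[:k_prime]

print(top_k_neighbors({"eye": 0.1, "star": 0.9, "heat": 0.6}, k_prime=2))
```

If the neighbor lists are kept sorted when the WSM is updated, the truncation at run time is a simple slice and no sorting is needed during activation.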

The second trick is "expanding". During the search, if necessary, one can adjust the activation threshold $T_{a}$ to expand the ACS. The number of concepts in the ACS then changes from $J$ to $J^{\prime}$. With such flexibility during searching, the IS can better adapt to different kinds of problems.
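The "expanding" trick can be sketched as a loop that lowers $T_{a}$ until the ACS is large enough. The sketch assumes that the strongest chain product from the input to each concept has already been computed and stored in a dictionary, so that the ACS for a given threshold is a simple filter. The geometric shrink factor, the lower bound `t_min` and the function name `expand_acs` are illustrative choices, not part of the text.

```python
def expand_acs(best_product, t_a, min_size, shrink=0.8, t_min=1e-3):
    """Lower the threshold T_a step by step until the ACS holds at least
    min_size concepts, or until T_a reaches the lower bound t_min.

    best_product: dict mapping concept -> strongest chain product from the input
    Returns (acs, final_t_a).
    """
    while True:
        acs = {c for c, p in best_product.items() if p >= t_a}
        if len(acs) >= min_size or t_a <= t_min:
            return acs, t_a
        t_a *= shrink  # relax the threshold, J grows to J'

best_product = {"sun": 1.0, "star": 0.9, "light": 0.72, "planet": 0.45, "eye": 0.1}
acs, t_final = expand_acs(best_product, t_a=0.8, min_size=4)
print(sorted(acs), t_final)
```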

One can observe interesting corresponding phenomena in the human IS. If a person is smart and sober, the $T_{a}$ value is lower, meaning that more concepts can be activated with the same effort. If a person is sleepy, the $T_{a}$ value is higher, and fewer concepts are activated. If a person finds a problem hard, he or she might think harder and longer to solve it, which amounts to reducing $T_{a}$ and expanding the ACS to produce a better output.

It seems that the human brain has an automatic mechanism of "ordering", which we believe originates from the biochemical properties of the human neural network. When a person’s mind receives a question, the most relevant concepts are activated first, naturally and without extra effort. We consider this a natural advantage of the structure of the brain’s neural network. The size and scope of the ACS can be constantly adjusted during the search for an answer.

4.6 WSM and uncertainty

Concepts and their connections in a WSM are all different kinds of information. We know information is about uncertainty. The WSM is a simplified representation of the real world and contains limited information. This reflects the statistical nature of the WSM and the importance of addressing its intrinsic uncertainties.

Let us first look at an ANN-based classifier that classifies an animal in a picture as a tiger or a lion. The model can see only as much as the training data can see. If one inputs a test picture into the trained model, it gives some probability, e.g., $91\%$, for the animal to be a tiger and $9\%$ to be a lion. The accuracy of classification can be improved by using more well-annotated training data, but the uncertainty can never be completely eliminated in real-life situations. In an extreme case, what should the model do if it sees a picture of a liger? Now let us consider a self-driving car that is about to decide its control strategy based on results from two object detection and classification models, one using radars and the other using cameras. If the two models contradict each other, what should the car do? In this case, it is important to have quantified uncertainties along with the results themselves, so that a statistical model can be used to retrieve the best result from the two information sources. We see that uncertainties, wanted or not, are intrinsically carried along with the data, the model, and the results.
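To make the radar and camera example concrete, the sketch below shows one simple statistical model that could combine the two outputs. It is only an illustration under strong assumptions that are not stated in the text: each model reports a probability for every class, the two sensors are conditionally independent given the true class, and the prior over classes is uniform. Under these assumptions the fused probability is the normalized product of the two inputs. The function name `fuse` and the class names are hypothetical.

```python
def fuse(p_radar, p_camera):
    """Fuse two class-probability dictionaries by normalized product
    (assumes conditional independence and a uniform class prior)."""
    classes = p_radar.keys() & p_camera.keys()
    joint = {c: p_radar[c] * p_camera[c] for c in classes}
    z = sum(joint.values())
    if z == 0:
        raise ValueError("the two sources share no class with non-zero probability")
    return {c: v / z for c, v in joint.items()}

# The two models disagree on the most likely class.
radar = {"pedestrian": 0.6, "cyclist": 0.4}
camera = {"pedestrian": 0.3, "cyclist": 0.7}
print(fuse(radar, camera))
```

In this toy case the camera is more confident than the radar, so the fused result leans towards "cyclist". Without the quantified probabilities, the disagreement between the two models could not be resolved in this way.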

For humans, the issue of uncertainty is more obvious. Our memories can be correct or wrong. Our perception of the same object can differ from person to person. When we try to repeat a certain action or thought, it may vary from time to time. We make misspellings, typos and many other kinds of unwanted mistakes. Although we humans achieve a remarkable level of intelligence, uncertainties (and the way we deal with them) are an intrinsic part of our intelligence.

In the WSM, we defined $P_{wk}$ as the probability of activating another concept $C_{k}$ after a concept $C_{w}$ is activated. We related this probability to the connecting strength $S$ between the two concepts. If the connecting strength is high, $C_{k}$ is more likely, but not certain, to be activated. The chain of concept activation is therefore statistical in nature. If we input the same problem to a WSM many times, the behavior and output of the WSM can be the same, but can also vary.
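The statistical nature of the activation chain can be illustrated with a short sampling sketch. It assumes, purely for illustration, that $P_{wk}$ equals the connecting strength $S(C_{w}\rightarrow C_{k})$ and that strengths lie in $[0,1]$; the exact relation between $P_{wk}$ and $S$ is the one defined earlier in the paper and may differ. The function name `sample_activation` and the toy network are hypothetical.

```python
import random

def sample_activation(strength, inputs, steps, rng):
    """Sample one activation run: each connection from a newly activated
    concept fires with probability equal to its strength (an assumption)."""
    active = set(inputs)
    frontier = list(inputs)
    for _ in range(steps):
        next_frontier = []
        for c in frontier:
            for k, s in strength.get(c, {}).items():
                if k not in active and rng.random() < s:
                    active.add(k)
                    next_frontier.append(k)
        frontier = next_frontier
    return active

strength = {
    "sun": {"star": 0.9, "eye": 0.1},
    "star": {"light": 0.8, "planet": 0.5},
}
# The same input can give different activated sets from run to run.
for seed in range(3):
    print(sorted(sample_activation(strength, ["sun"], steps=2, rng=random.Random(seed))))
```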

This design of uncertainty of the model loosens the boundary of the searching algorithm, and allows a balance between exploration and exploitation, which is a must for high level IS (e.g., IS-L3). Nevertheless, a well-constructed and trained WSM can have the flexibility of understanding the problem on a broad scope, while achieving stable performance over different or the same tasks.

5 Unifying different aspects and types of intelligence into one WSM-based framework

In Section 4 we constructed the WSM, including the key idea of concept, the special concept of self, the creation and connection of concepts, the mechanism by which the concept network processes input-output streams, and, as a whole, a model that helps us understand the key missing pieces of intelligence study. In this section we discuss how the WSM could further benefit our understanding of the nature of intelligence, by presenting a way to unify different aspects and types of intelligence into one framework.

Independently of the kind of intelligence or IS under consideration, we present the broader WSM-based framework of intelligence in Figure 5.

There are two basic entities in the framework: the IS and the world (environment). Inside the IS, there are two parts. The first part is the End-to-End Model (EEM, designated as $\mathbb{M}_{EEM}$, the pink area in Figure 5). As the name suggests, the EEM takes in input and gives output in an end-to-end fashion. It corresponds to Aspect 1 intelligence (responsive system, System 1) of the human IS, or to connectionism AI. The second part is the WSM (designated as $\mathbb{M}_{WSM}$, the area enclosed by the dashed line in Figure 5), which is the concept network including world concepts (dark green area) and the special concept of self (brown area). It receives input and gives output in the fashion described in Section 4. It corresponds to Aspect 2 intelligence (analytical system, System 2) of the human IS, or to symbolism AI. Aspect 3 intelligence links the EEM and the WSM.

The dark brown area is a special component of the IS that we call the "goal". In the concept "self" of the WSM, there is a dimension that defines the goal of the IS, e.g., a psychological or social objective such as a teacher’s self-actualization through educating kids to be good people, or a predefined objective such as keeping an old lady safe for a (future) household robot. This dimension of the concept "self" projects to form part of the goal. The other part of the goal comes from the EEM, e.g., a more direct (and often simpler) objective such as sending a dodging signal to an actuator upon perceiving a dangerous, fast-approaching object, like an opponent’s fist for a boxer or an approaching obstacle for a moving robot.

With WSM, EEM and goal, the IS is able to evaluate a "distance" $\mathbb{D}$ as the difference between its current status $\mathbb{S}$ and the goal $\mathbb{G}$ :

\mathbb{D}=\mathbb{G}-\mathbb{S}

(31)

This distance and the IS together determine a certain action $\mathbb{A}$ performed on the world or on the IS itself. The changes of the world or of the self are then perceived, and the perceived information flows back into the IS as input. The IS updates itself to $\mathbb{S}^{\prime}$, i.e., the EEM is updated to $\mathbb{M}^{\prime}_{EEM}$, the WSM is updated to $\mathbb{M}^{\prime}_{WSM}$, and the goal can also be updated to $\mathbb{G}^{\prime}$. A new distance can then be evaluated:

\mathbb{D}^{\prime}=\mathbb{G}^{\prime}-\mathbb{S}^{\prime}

(32)

The new distance $\mathbb{D}^{\prime}$ and the new models $\mathbb{M}^{\prime}_{EEM}$ and $\mathbb{M}^{\prime}_{WSM}$ together determine the new action $\mathbb{A}^{\prime}$. The loop then continues. Note that two loops can happen: one changes the world, and the other changes the IS itself. The two loops can happen simultaneously.
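The loop described above can be caricatured in a few lines of Python. This is a toy sketch under heavy simplifying assumptions that go well beyond the text: the status $\mathbb{S}$ and goal $\mathbb{G}$ are treated as plain numeric vectors, the goal is held fixed, the action is simply proportional to the distance, and the action changes the status directly. The function name `run_loop` and the parameters `gain` and `tol` are hypothetical.

```python
def run_loop(status, goal, gain=0.5, tol=1e-3, max_iter=100):
    """Repeat: evaluate the distance D = G - S, act, and update the status.

    Returns the final status and the number of actions taken.
    """
    for it in range(max_iter):
        distance = [g - s for g, s in zip(goal, status)]   # D = G - S
        if max(abs(d) for d in distance) < tol:
            return status, it                               # goal reached
        action = [gain * d for d in distance]               # action from distance
        status = [s + a for s, a in zip(status, action)]    # S becomes S'
    return status, max_iter

final_status, n_actions = run_loop(status=[0.0], goal=[1.0])
print(final_status, n_actions)
```

In the framework of Figure 5 the action would instead be determined jointly by the distance, the EEM and the WSM, and the goal itself could change between iterations; the sketch only shows the shape of the feedback loop.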

The loops can happen on two levels. The first level uses the EEM. The multi-ball practice of a human table tennis player to improve the "muscle memory" of striking movements (the "goal") is on this first level. The training of a deep neural network with large amounts of annotated data for accurate face recognition (the "goal") is also on this first level. The second level uses the WSM. A student taking math classes to get a good score on the final exam (the "goal") is on the second level. An artificial IS can also in principle be designed to work on the second level, though this requires considerable future effort.

The framework (Figure 5) based on the WSM has the key components of "self" and "goal", in addition to the EEM and traditional knowledge models. It is thus able not only to integrate all 3 types of mainstream AI methodologies (connectionism, symbolism, behaviorism), but also to integrate all 4 key Aspects (1, 2, 3 and 4) of human intelligence, into one single unified framework.

6 Conclusion

Researchers have been working towards an understanding, definition, modeling and reconstruction of intelligence for decades. They come from many different disciplines such as psychology, mathematics, linguistics, engineering, computer science, statistics, physics and complexity science, information science, and so on. The three approaches of symbolism, behaviorism and connectionism have all achieved many successes. In this work we did not take any of these three traditional approaches; instead, we tried to identify certain fundamental aspects of the nature of intelligence, and to construct a mathematical model to represent and potentially reconstruct these fundamental aspects. Rather than investigating a certain kind of intelligence or a certain scenario, our work is largely independent of the kind of intelligence considered, and our effort is directed towards understanding the nature of intelligence.

We first discussed the scope of different levels of intelligence (IS-L0, IS-L1, IS-L2, IS-L3) and the importance of looking at the right physical and informational granularity. We then analyzed and compared the 4 aspects of human intelligence and the 3 types of mainstream artificial intelligence, which point to the important role of the information abstraction mechanism (Aspect 3 intelligence), the need for a new methodology of concepts, and the need for a new model of the IS. We then described the broader idea of concept, the way concepts connect, and the structure of the WSM. We formally defined a mathematical framework for the concepts, the WSM, and the way to run the WSM to solve problems. Based on all these, we finally proposed a unified general framework of intelligence.

By analyzing the scope of discussion and granularity of investigation, we provided a new multi-discipline, multi-granularity view of intelligence. By qualitatively demonstrating the information abstraction process, we proposed Aspect 3 intelligence as the key to connecting perception and cognition. The storage and processing of information in an IS are always subject to a certain cost, while the reality of the world and the self contains an almost infinite amount of raw information and is constantly changing. This fact is the main obstacle to reconstructing intelligence from a single level of information granularity or mechanism granularity. In contrast, we set up a new, broader definition of concept that can represent information on multiple granularities. We were then able to construct a mathematical framework of the WSM with a formal mechanism of creating and connecting concepts, and a formal flow of how the WSM receives, processes and outputs concepts to solve an arbitrary problem for the IS. We also discussed potential computer implementations of our theoretical framework.

Moreover, the self model was formally separated out of the world model in a clear, mathematically defined theoretical framework. To the best of our knowledge, we are the first to do so. In the meantime, we are happy to see that a new work with a related idea has recently been presented in Ref. [60]. (Ref. [60] was posted on arXiv by the Yann LeCun team only a few weeks after this work of ours first appeared on arXiv.)

On the way to truly understanding intelligence and practically building the next level of AI, we believe many current works are merely a beginning. There are many important problems for future study. For example, the way to identify the right commonalities to form the best set of concepts is unknown, and a proper mathematical formulation of the governing rules of the concept network needs to be established so that a practical definition for $L_{r}(\vec{V}^{O}_{b})$ in Eq. 29 can be formalized.

Intelligence is one of the most fascinating subjects in all sciences. As interesting as the introduction of "self" and "goal" in our models is, it also brings up the even more interesting topics of self-reproduction and self-reference of intelligence systems, which expands the scope of intelligence study to broader levels, and needs efforts from many different fields.

Acknowledgment

This research received financial support from Jiangsu Industrial Technology Research Institute (JITRI) and Wuxi National Hi-Tech District (WND).

The author would like to thank Prof. Steven Guan, Prof. Kalok Man and Prof. Fei Ma from Xi’an Jiaotong-Liverpool University, Prof. Junqing Zhang from the University of Liverpool for the valuable help and discussions.

The author would like to thank Swarma Club and Prof. Jiang Zhang of Beijing Normal University for the valuable discussions and for providing wonderful learning opportunities.

Abbreviations

The following abbreviations are used in this manuscript:

AI	artificial intelligence
CNN	convolutional neural network
RNN	recurrent neural network
GAN	generative adversarial network
RL	reinforcement learning
AGI	artificial general intelligence
ANN	artificial neural network
IS	intelligence system
WM	world model
GWM	great world model
SM	self model
WSM	world-self model
WCS	whole concept space
ICS	input concept space
ACS	activated concept space
OCS	output concept space
EEM	end-to-end model

References

[1] Yann LeCun, Bernhard E. Boser, John S. Denker, D. Henderson, Richard Howard, W. Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.
[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Communications of The ACM, 60:84–90, 2017.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
[5] Joseph Redmon, Santosh K. Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition, 2016.
[6] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[7] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition, 2014.
[8] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38:295–307, 2016.
[9] Saeed Anwar and Nick Barnes. Real image denoising with feature attention. In International Conference on Computer Vision, 2019.
[10] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. arXiv: Computer Vision and Pattern Recognition, 2018.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv: Computer Vision and Pattern Recognition, 2020.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2018.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[15] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:696–699, 1988.
[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv: Learning, 2013.
[17] David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton. Reward is enough. Artificial Intelligence, 299:103535, 2021.
[18] Ben Goertzel. Artificial general intelligence: Concept, state of the art, and future prospects. In Artificial General Intelligence, 2014.
[19] Pei Wang, Ben Goertzel, and Bas Steunebrink. Artificial general intelligence : 9th international conference, agi 2016, new york, ny, usa, july 16-19, 2016, proceedings. 2016.
[20] Roman V. Yampolskiy and Joshua Fox. Artificial general intelligence and the human mental model. 2012.
[21] François Chollet. The measure of intelligence. arXiv: Artificial Intelligence, 2019.
[22] Pei Wang. On defining artificial intelligence. In Artificial General Intelligence, 2019.
[23] Alex Pentland. On the collective nature of human intelligence. Adaptive Behavior, 2007.
[24] Robert J. Sternberg. Handbook of intelligence. 2000.
[25] Louis Leon Thurstone. The nature of intelligence. 1924.
[26] Arnold B. Scheibel and J. William Schopf. The origin and evolution of intelligence. 1997.
[27] Uri Fidelman. Intelligence and the brain’s energy consumption: what is intelligence? Personality and Individual Differences, 1993.
[28] Daniel P. Buxhoeveden and Manuel F. Casanova. The minicolumn hypothesis in neuroscience. Brain, 125:935–951, 2002.
[29] Jon H. Kaas. Why does the brain have so many visual areas. Journal of Cognitive Neuroscience, 1:121–135, 1989.
[30] Claudio Galletti, Patrizia Fattori, Michela Gamberini, and Dieter F. Kutz. The cortical visual area v6: brain location and visual topography. European Journal of Neuroscience, 11:3922–3936, 1999.
[31] Earl Hunt. On the nature of intelligence. Science, 1983.
[32] F. R. Eirich. Thoughts on the origin and nature of life and intelligence on earth. Journal of Biological Physics, 1995.
[33] Richard L. Derr. Insights on the nature of intelligence from ordinary discourse. Intelligence, 1989.
[34] James S. Albus. Outline for a theory of intelligence. IEEE Transactions on Systems, 1991.
[35] Gary Marcus. The next decade in ai: Four steps towards robust artificial intelligence. arXiv: Artificial Intelligence, 2020.
[36] Keith Frankish and Jonathan St. B. T. Evans. The duality of mind: an historical perspective. 2009.
[37] Vladimir Vapnik. The nature of statistical learning theory. 1995.
[38] Wendy Johnson and Thomas J. Bouchard. The structure of human intelligence: It is verbal, perceptual, and image rotation (vpr), not fluid and crystallized. Intelligence, 2005.
[39] Kevin S. McGrew. The cattell-horn-carroll theory of cognitive abilities: Past, present, and future. 2005.
[40] Daniel Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011.
[41] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In National Conference on Artificial Intelligence, 2014.
[42] Xiaojun Chen, Shengbin Jia, and Yang Xiang. A review: Knowledge reasoning over knowledge graph. Expert Systems With Applications, 141:112948, 2020.
[43] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In North American Chapter of the Association for Computational Linguistics, 2019.
[44] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In National Conference on Artificial Intelligence, 2020.
[45] Panagiotis Radoglou-Grammatikis, Konstantinos Robolos, Panagiotis Sarigiannidis, Vasileios Argyriou, Thomas Lagkas, Antonios Sarigiannidis, Sotirios K. Goudos, and Shawn Wan. Modelling, detecting and mitigating threats against industrial healthcare systems: A combined sdn and reinforcement learning approach. IEEE Transactions on Industrial Informatics, 2021.
[46] Chen Chen, Yuru Zhang, Zheng Wang, Shaohua Wan, and Qingqi Pei. Distributed computation offloading method based on deep reinforcement learning in icv. Applied Soft Computing, 2021.
[47] Pablo Lemos, Niall Jeffrey, Miles Cranmer, Shirley Ho, and Peter Battaglia. Rediscovering orbital mechanics with machine learning. 2022.
[48] Hugo Latapie, Ozkan Kilic, Kristinn R. Thorisson, Pei Wang, and Patrick Hammer. Neurosymbolic systems of perception & cognition: The role of attention. 2021.
[49] Artur S. d’Avila Garcez and Luis C. Lamb. Neurosymbolic ai: The 3rd wave. arXiv: Artificial Intelligence, 2020.
[50] Haodi Zhang, Zihang Gao, Yi Zhou, Hao Zhang, Kaishun Wu, and Fangzhen Lin. Faster and safer training by embedding high-level knowledge into deep reinforcement learning. arXiv: Artificial Intelligence, 2019.
[51] Kamruzzaman Sarker, Lu Zhou, Aaron Eberhart, and Pascal Hitzler. Neuro-symbolic artificial intelligence: Current trends. arXiv: Artificial Intelligence, 2021.
[52] Mingyu Ding, Zhenfang Chen, Tao Du, Ping Luo, Joshua B. Tenenbaum, and Chuang Gan. Dynamic visual reasoning by learning differentiable physics models from video and language. arXiv: Computer Vision and Pattern Recognition, 2021.
[53] Chen Chen, Lei Liu, Shaohua Wan, Hui Xiaozhe, and Qingqi Pei. Data dissemination for industry 4.0 applications in internet of vehicles based on short-term traffic prediction. ACM Transactions on Internet Technology, 2021.
[54] Hongyu Wang, Dandan Zhang, Songtao Ding, Zhanyi Gao, Jun Feng, and Shaohua Wan. Rib segmentation algorithm for x-ray image based on unpaired sample augmentation and multi-scale network. Neural Computing and Applications, 2021.
[55] Yutaka Matsuo, Yann LeCun, Maneesh Sahani, Doina Precup, David Silver, Masashi Sugiyama, Eiji Uchibe, and Jun Morimoto. Deep learning, reinforcement learning, and world models. Neural Networks, 2022.
[56] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 2021.
[57] Krzysztof Chalupka, Frederick Eberhardt, and Pietro Perona. Causal feature learning: an overview. Behaviormetrika, 2017.
[58] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Thomas Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Samuel McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv: Computation and Language, 2020.
[59] Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. Fastmoe: A fast mixture-of-expert training system. arXiv: Learning, 2021.
[60] Vlad Sobal, Alfredo Canziani, Nicolas Carion, Kyunghyun Cho, and Yann LeCun. Separating the world and ego models for self-driving. 2022.
