Artificial Intelligence

Volume 337, December 2024, 104228

Integration of memory systems supporting non-symbolic representations in an architecture for lifelong development of artificial agents

https://doi.org/10.1016/j.artint.2024.104228
Under a Creative Commons license, open access

Abstract

Compared to autonomous agent learning, lifelong agent learning tackles the additional challenge of accumulating skills in a way favourable to long term development. What an agent learns at a given moment can be an element for the future creation of behaviours of greater complexity, whose purpose cannot be anticipated.
Beyond its initial low-level sensorimotor development phase, the agent is expected to acquire, in the same manner as skills, values and goals which support the development of complex behaviours beyond the reactive level. To do so, it must have a way to represent and memorize such information.
In this article, we identify the properties suitable for a representation system supporting the lifelong development of agents through a review of a wide range of memory systems and related literature. Following this analysis, our second contribution is the proposition and implementation of such a representation system in MIND, a modular architecture for lifelong development. The new variable module acts as a simple memory system which is strongly integrated to the hierarchies of skill modules of MIND, and allows for the progressive structuration of behaviour around persistent non-symbolic representations. Variable modules have many applications for the development and structuration of complex behaviours, but also offer designers and operators explicit models of values and goals facilitating human interaction, control and explainability.
We show through experiments two possible uses of variable modules. In the first experiment, skills exchange information by using a variable representing the concept of “target”, which allows the generalization of navigation behaviours. In the second experiment, we show how a non-symbolic representation can be learned and memorized to develop beyond simple reactive behaviour, and keep track of the steps of a process whose state cannot be inferred by observing the environment.

Keywords

Lifelong learning
Learning agents
Artificial behaviour
Modular architecture
Memory system
Internal representation


1. Introduction

The field of Machine Learning has provided many solutions applicable to the training of intelligent agents, suited to different learning contexts, levels of supervision and availability of data. Some works have focused on multitask learning, and how previous learning experience can be leveraged when the goals and problems faced by the agents change. Transfer Learning [1], [2], for instance, provides solutions to retrain an agent for a new target behaviour, and Protean Learning [3] for a change in its domain and signature. These solutions aim at solving the problem of change and help the agent adapt to new constraints. However, they are not designed to integrate change as part of the learning process, with long term and open-ended development in mind.
In contrast to other fields focusing on learning, Developmental Robotics seeks novel methodologies from studies in the biological and psychological development of natural systems to create flexible artificial systems able to develop skills suited to real world problems and go beyond single-task learning. “The search for flexible autonomous and open-ended multitask learning system is in essence, a particular re-instantiation of the long-standing research for general-purpose AI” [4]. Developmental Robotics supports the claim that embodiment is required for the emergence of intelligence as we know it: biological systems are not passively exposed to sensory input, but instead interact actively with their surrounding environment [4] leading to self-organization of dynamical interactions among brain, body and environment [5].
The benefits of embodiment can be seen directly in the simplification of a number of traditional AI problems, such as image processing, which can take advantage of depth by actively changing the point of view. Embodiment supports development in the formation of the notion of object, which emerges from the simultaneous experiences of seeing (spatial bounds) and grasping (impenetrability). From this notion can begin categorization for practical use (food and non-food), which in turn serves as a basis for the emergence of symbols [4].
Ongoing emergence [6] proposes criteria for a complete developmental approach: the bootstrapping, the continuous acquisition and incorporation of new skills, their stability and reproducibility, and the autonomous development of values and goals. In these criteria, developmental robotics meets the field of lifelong (machine) learning, specifically Lifelong Robot Learning [7] which aims at providing artificial agents with the ability to accumulate skills without compromising or forgetting previous ones (stability), in an incremental manner that will favour the acquisition of new skills of greater complexity.
Along with the ability to progressively learn and structure its behaviour, an agent aiming at lifelong development will require the ability to form internal representations in order to evolve beyond purely reactive behaviour. Internal representations play a role at many levels of the cognitive process, from fragile and working memory, which keep the short-term information required for immediate spatio-temporal problems, to the long-term memory of experiences and acquired symbols needed for abstract reasoning as we understand it. Memory is required to commit to a task and exhibit a behaviour that is not entirely conditioned by immediate sensory information. To develop its own values and goals, an agent must be able to maintain an internal state, existing apart from the environment and other agents. Such internal states grant agents individuality and enable independent behaviour even in a completely homogeneous population. This ability plays an important part in social behaviour, allowing the representation of roles able to change dynamically [8], [9].
In this article we address memory systems and representations in the context of Lifelong Development for agents. We intend to approach memory systems along the same principles as the lifelong development of procedural knowledge: representations complex in both their form and meaning are reached by the progressive increase in complexity of previously acquired representations, beginning with the non-symbolic. Closely coupled with procedural knowledge and its development mechanism, non-symbolic internal representations are flexible enough to fit the particular needs of agents, follow their development path and increase their complexity along with it. The use of this memory system, the representations stored in it and the interpretation of these representations will have to be learned according to behaviours. The validity of the representations acquired by the agent depends solely on their ability to support the agent's actions in the world. This emergent process is the first step towards developing “grounded” symbols [10], [5], which are later introduced and negotiated in a social context [10].
We propose an implementation of this approach as the Variable module in MIND, an architecture designed for lifelong agent development which serves as our testbed. The MIND architecture [11] is based on a flexible modular system able to integrate new skills and progressively structure complex behaviours through a curriculum of training tasks. The new Variable module integrates closely with skills and their learning process, making it suitable for the acquisition of emergent representations. We show how it can be used to increase the complexity of behaviour and to develop beyond the reactive level. We explore the new options for the structuration of behaviour offered by the ability to store, retrieve and share information, such as in-line or branching structures organized around a variable module representing a particular concept. Furthermore, we discuss how internal representations impact learning strategies and agent development, and offer opportunities to introduce human guidance and control, improve the explainability of behaviour and serve as a basis for motivational systems.
After reviewing the domain and motivating and defining our approach to memory systems in Section 3, we present MIND and its new Variable module in Section 4, along with its implementation and the Drive module, a mechanism enabling human guidance and control. Section 5 presents the experimental context and training process, and two experiments evaluating our new system: we first explore an alternative structuration of a previously established skill hierarchy, using a variable to exchange information between learning skills (Section 5.2), and then we provide a variable to a learning skill to memorize a step in a sequence of actions (Section 5.3).

2. Related works

Our work is related to the field of developmental robotics [4], [5] where representation of information is studied under different aspects, such as the acquisition of symbols [10], [12], their grounding [13], and their obvious use in the decision process, or the part such information can play in motivational systems [14], [15]. We do not deal with any of these particular mechanisms beyond what is needed for our experiments, relying for now on a fixed curriculum to train behaviour, leaving the implementation of autonomous development [16], [17], [18] for future works. This article focuses on the structure and elements able to support representation of information in a developmental context.
Our general approach to structuration using MIND is close to the works on evolving virtual creatures [19] in its hierarchical aspect and combination mechanism, differing mostly in our encapsulation of heterogeneous sub-structures. The specific focus of this article on the place of representation of information in such structures bears resemblance to several connectionist approaches, from autoencoders [20] to layered learning [21] and modular neurocontrollers [22], which form what can be considered emergent sub-symbolic representations in specific neuron layers. The information processed into these representations can be provided as input to different parameterized skills [23]. Unlike these works, however, the integration of elements for the representation of information is done following a “behaviour shaping” approach [24] at the level of the architecture, with a higher-level purpose: establishing building blocks for the lifelong development of agents [7], and for their possible use in future works on meta-behaviours such as intrinsic motivation [14], [25].
The experimental process we use is similar to experiments on embodied developmental models associating abstract representations and motor skills [26], [27]. However, our goal is to acquire practical representations of information in a lifelong development context, rather than to simulate specific psychological models. Our results show the formation of sub-symbolic prototypes [10] grounded in the agent's experience. This process is comparable to the concept quantization [28] used in neuro-symbolic concept learning [12], although in our embodied approach, physical interaction is substituted for language feedback, and the program execution is validated by its actions in the environment and its progress towards a goal. Unlike the ambitious works of Konidaris on symbol generation [13], our approach covers the acquisition of skills along with the formation of representations.

3. Memory and representation for lifelong development of artificial agents

The development of an agent as an individual distinct from others involves the persistence of knowledge beyond the restrictions of the environment (boundedness, ephemerality). Skills are a form of persistent knowledge, and many works have studied the acquisition and structuration of such procedural knowledge (‘know-how’ or ‘savoir-faire’ knowledge). Skills are usually long-term knowledge and do not change on a short time scale.
Skills and procedural knowledge are often distinguished from declarative memory and propositional knowledge, which represent information on knowing ‘that’: storing facts and events. From the perspective of cognitive psychology, declarative knowledge refers to long-term symbolic memory, which can be retrieved and spoken about explicitly. However, this definition has a few exceptions, such as the memory of faces, which is neither symbolic nor explicit. The evolutionary perspective views declarative memory, which occurs only in higher animals, as a more recent development than procedural memory. As such, it might be considered an outgrowth of procedural memory, which argues for a less clear-cut and more gradual distinction between them [29].
In this section we look at declarative knowledge in artificial systems in a broad sense, that is, information about the current context. This ranges from the short-term memorization of states to processed information representing concepts, which can be structured or exchanged between agents. Section 3.1 introduces low-level connectionist approaches to this issue, which are used in many fields because of their general-purpose and non-symbolic nature. Section 3.2 presents the top-down designs found in cognitive architectures created specifically to serve the needs of embodied artificial agents. Section 3.3 discusses the acquisition of representations and their relation to meaning. Section 3.4 gives a synthesis on the use of internal states and representations with regard to the lifelong development of agents.

3.1. Neuro-inspired low-level memory

Bio-inspired approaches to Machine Learning and connectionist AI techniques have investigated memory systems and the bottom-up formation of low to intermediate levels of non-symbolic representations for a wide range of applications [30], [31]. Early developmental agent architectures such as robot shaping [24] include a memory of the past state of the agent's sensors to deal with time series (following remarks from Whitehead and Lin [32]). The authors note that this kind of memory need not be regarded as a “representation” of anything. This improvement upon purely reactive agent behaviour involves memory as understood in common language: recording past states of the environment.
This aspect of short-term memory is akin to iconic memory or fragile memory in the human brain. Both are short-lived, high-capacity memories used to store far more raw perceptual information than can be immediately processed. Applications are obvious in spatio-temporal problems, such as extrapolating the future position of a projectile from previously recorded positions. Recent studies [33] indicate that fragile memory involves some level of processing, although not on the same level as working memory. The low-level processing of fragile memory can be compared to context units [34] in Recurrent Neural Networks (RNNs), or to reservoir computing [35] methods such as echo state networks [36] and liquid state machines [37]. In reservoir computing [38], the instantaneous input is fed to the reservoir network, which accumulates and enriches the state space. Readouts are done by another network fed from the reservoir and trained by conventional methods (Fig. 1).

Fig. 1. Reservoir computing: the instantaneous input on the left is fed to the reservoir network (in grey). On the right, the readout is done by another network (from Lukoševičius and Jaeger [38]).
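The readout mechanism described above can be sketched in a few lines of NumPy. This is a minimal echo state network: the reservoir sizes, scaling factors and the toy sine-prediction task are our own illustrative choices, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed random recurrent reservoir (sizes and scalings are illustrative).
n_in, n_res = 1, 100
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # keep spectral radius below 1

def run_reservoir(inputs):
    """Feed the instantaneous input stream; accumulate an enriched state space."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ np.atleast_1d(u) + W @ x)
        states.append(x.copy())
    return np.array(states)

# Readout: a separate linear map trained by a conventional method (here ridge
# regression) to predict the next value of a toy sine signal.
t = np.arange(400)
signal = np.sin(0.1 * t)
X = run_reservoir(signal[:-1])          # reservoir states
y = signal[1:]                          # one-step-ahead targets
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res), X.T @ y)

pred = X @ W_out
mse = float(np.mean((pred[100:] - y[100:]) ** 2))  # ignore the initial transient
print(mse)
```

Note that only `W_out` is trained; the reservoir weights `W_in` and `W` stay fixed, which is the defining trait of reservoir computing.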

In contrast to iconic memory and fragile memory, working memory is a (comparatively) long-lived low capacity memory used to retain processed perceptions that can be consciously addressed. Working memory can be understood as storing high-level elements for short term planning, for instance the spatial coordinates of an entity in the context of a navigation problem. An overview of these distinct short term memories in the human brain and an examination of their respective properties is given in Vandenbroucke et al. [33].
The auto-encoder (or auto-associator) [20] is a solution to processing data into compact intermediate level representations. Auto-encoders are neural network structures in which an encoding layer, processing a large number of input neurons, is linked to a decoding layer, mirroring the input neurons, through a smaller hidden layer. The auto-encoder is trained to map the input onto itself, using the error measured between the input and the encoded-decoded output. Since the hidden layer is much smaller than the input, this process creates a compact representation in the hidden layer which only retains the information useful to recreate an acceptable approximation of the input.
The idea of using an intermediate information element containing a processed representation as a junction between two processes (here the encoder and decoder) has inspired several approaches to the structuration of procedural knowledge, such as layered learning [21] and modular neural network policies [22]. In these works, separate levels or layers of control are linked by a memory element used to share an intermediate representation. In layered learning, the output neurons of a subnetwork are connected to the input neurons of the following network. In modular neural network policies, different sensor-dependent and actuator-dependent networks can be interchanged by training a common middle layer.
Low to intermediate levels of representations stored in persistent memory modules, as we have seen, have many uses for autonomous agents in general, from learning time-dependent behaviours to improving structuration, but they are also of particular interest for developmental agents. Such memory elements can represent simulated physiological or psychological internal states used to trigger and regulate autonomous behaviour. Motivational systems inspired by drive reduction theory [39] use sub-symbolic internal representations to that effect for autonomous agent development [14].

3.2. Memory in cognitive architectures

Long term memory systems are found in cognitive architectures for both procedural and declarative knowledge, association and past experiences. Current elements of working memory or perception are mapped to the appropriate long term memory structure to gain higher level knowledge or prediction.
The belief hierarchy of the ICARUS architecture [40] is a mechanism for long-term memory of percept associations: percepts are combined to form low-level beliefs, which in turn can be combined to form higher-level beliefs. In the example of a self-driving car given in Fig. 2, the agent uses primitive beliefs (right-of) to determine the relative position of 3 lines on the road. From these beliefs it can reach the higher-level knowledge of the existence of a lane between lines 1 and 2, and of its current position within that lane. The semantic structure associating beliefs is the long-term memory element, and reflects knowledge about the world that should hold true. Although explicit, this belief hierarchy shares some similarities with reservoir structures, where nodes of any level can be accessed by the control hierarchy. In ICARUS, the control hierarchy is a teleo-reactive hierarchy of subgoals (see [41], [42]), where each subgoal uses the appropriate beliefs, and level of beliefs, for its decision process. For instance, a high-level goal of overtaking a vehicle might use lane positions and occupation beliefs, whereas a simple subgoal of keeping inside a lane will use low-level beliefs about line positions.

Fig. 2. An example of the ICARUS belief system. The agent determines it is in the lane 1-2 from the perception of the relative position of 3 lines and its own position (from Choi and Langley [40]).
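The flavour of such a belief hierarchy can be sketched as predicates built on top of one another. The predicates and numeric values below are our own illustration of the lane example, not the actual ICARUS implementation.

```python
# Percepts: observed x-coordinates of three road lines and of the agent itself.
percepts = {"line1": 2.0, "line2": 6.0, "line3": 10.0, "self": 4.0}

# Primitive beliefs combine percepts (here, their relative ordering).
def right_of(a, b):
    return percepts[a] > percepts[b]

# Higher-level beliefs combine lower-level ones: two lines bound a lane
# when b is right of a and no third line lies between them.
def lane_between(a, b):
    return right_of(b, a) and not any(
        right_of(c, a) and right_of(b, c)
        for c in ("line1", "line2", "line3") if c not in (a, b))

# Top-level belief usable by a subgoal such as "stay in the current lane".
def in_lane(a, b):
    return lane_between(a, b) and right_of("self", a) and right_of(b, "self")

print(in_lane("line1", "line2"))  # True: the agent is between lines 1 and 2
print(in_lane("line2", "line3"))  # False
```

A low-level subgoal (lane keeping) can query `right_of` directly, while a higher-level one (overtaking) queries `in_lane`, mirroring how each subgoal picks the appropriate level of belief.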

In addition to long term semantic memory, the extended SOAR architecture [43] includes other kinds of long term memory systems, such as episodic memory. Episodic memory is designed to store sequences of states, perceptions and working memory elements, that were experienced by the agent. When the agent experiences a succession of states matching a partial sequence in the Episodic memory, the rest of the sequence can be recalled and considered to make a prediction.
Experiments were conducted [44] in which an agent, among other tasks, must collect energy charges as its battery discharges. While performing other tasks, the agent might discover an energy charge which it does not need at the present moment; it will, however, record the successive steps/states preceding its discovery. When the agent is in need of energy and recognizes several steps of a memorized sequence in its current context, it will enact the following steps of the episode to reach its goal. The results show that this approach is ten times more efficient than a random search. Although the example given is a navigation problem, the encoding of the episodic memory is claimed to be task-independent (as long as there remains a temporal/sequential relationship).
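The recall mechanism in this experiment can be illustrated with a small sketch: episodes are stored as state sequences, and a match between recent states and the beginning of a stored episode recalls its continuation. The states and matching rule below are hypothetical simplifications, not the SOAR encoding.

```python
from collections import deque

class EpisodicMemory:
    """Stores experienced state sequences; recalls the continuation of an
    episode whose beginning matches the agent's recent states."""

    def __init__(self, match_length=2):
        self.episodes = []
        self.match_length = match_length

    def record(self, episode):
        self.episodes.append(list(episode))

    def recall(self, recent_states):
        cue = list(recent_states)[-self.match_length:]
        for ep in self.episodes:
            for i in range(len(ep) - self.match_length):
                if ep[i:i + self.match_length] == cue:
                    return ep[i + self.match_length:]  # predicted continuation
        return None

# The agent once stumbled on a charger while doing another task:
memory = EpisodicMemory()
memory.record(["hall", "door", "corridor", "lab", "charger"])

# Later, low on energy, it recognizes part of the memorized sequence:
recent = deque(["hall", "door", "corridor"], maxlen=3)
print(memory.recall(recent))  # ['lab', 'charger']
```

Nothing in the memory itself is navigation-specific: any sequentially ordered states could be recorded and matched the same way.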
Having a representation of knowledge strongly dependent on the task and type of knowledge is a common practice in many projects with commercial or industrial applications. For instance, self-driving cars may use dedicated sensors such as LiDAR to accurately map their environment. The clouds of 3D points are interpreted as simple volumes, bounding boxes and planes which can be stored in memory. This model of the world is not only convenient for humans to understand, but the agent can also be given algorithms specific to this representation, such as path planning algorithms. It also allows for simulations and projections that can guide the decision process.
The top-down approaches to memory and representation used in cognitive architectures and other autonomous agent control systems are very efficient at solving the problems they were designed for, and offer many advantages. Conceptual predicates can be used to generate higher-level information or predictions, and specific algorithms compatible with the representation of information can be used to generate optimal solutions. However, they are generally task-dependent on some level, or make use of pre-specified representations and rules, or innate conceptual predicates [45]. Because such elements must be specified a priori, their application tends to be limited to problems for which a solution is already known and the context well-defined; they thus cannot be relied upon alone for general-purpose use, especially in the context of lifelong development.

3.3. Perception, categorisation, meaning, symbols

Previous sections show that the acquisition of representations involves filtering relevant information from perceptions or states. This process is comparable to feature extraction, categorization, encoding or even compression, in that what is perceived is reduced to what is needed to understand the situation [46]. Increasing the level of representation generalizes what is observed: details are removed and the representation becomes more compact, as with the different levels of short-term memory in the human brain [33], or the different levels of the belief hierarchy in the ICARUS architecture [40]. Such representations are said to be perceptually grounded: their meaning comes from the interaction between the agent and its environment.
Through this process, an agent generates abstract representations which can be used in high-level planning, for instance by matching them with conditions for execution, expected outcomes and measures of success of a plan. Konidaris et al. [13] show how these representations can be autonomously acquired and linked to the conditions and effects of motor skills, and in turn how they can be used in high-level planning.
With increased processing of the information, the selection of a particular domain to consider and the use of classification methods, abstract representations move from non-symbolic to symbolic. While the meaning remains grounded in individual experience, its association to a symbol is negotiated during social interactions. The cost and constraints of communication justify the use of highly abstracted, highly compressed symbolic representations (indeed, attempts at conveying a complex feeling or experience will find one “at a loss for words”).
Recent works on neuro-symbolic systems [47], [12] attempt to bridge the gap between low-level neural architectures and natural language. Questions given in natural language are converted into a tree structure of combination operators from a given domain-specific language (e.g. filter, relate, exist). A neural-based vision system is trained to recognize concepts fitting the operators, such as entities and attributes (e.g. red, cube, dog) and spatial relations (e.g. right of, behind). In this work, the formation of concepts is guided by the language and the set of questions provided by the instructor.
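The operator-tree idea can be sketched as follows. The scene, attribute names and question are hypothetical stand-ins for what a trained vision module and language parser would produce; this is not the cited systems' actual DSL.

```python
# A symbolic scene such as a vision module might produce (values are hypothetical).
scene = [
    {"id": 0, "shape": "cube",   "colour": "red",  "pos": (0, 0)},
    {"id": 1, "shape": "sphere", "colour": "blue", "pos": (2, 0)},
    {"id": 2, "shape": "cube",   "colour": "blue", "pos": (4, 0)},
]

# Domain-specific-language operators; a parsed question becomes a tree of such calls.
def filter_(objs, attr, value):
    return [o for o in objs if o[attr] == value]

def relate(objs, reference, relation):
    if relation == "right_of":
        return [o for o in objs if o["pos"][0] > reference["pos"][0]]
    raise ValueError(relation)

def exist(objs):
    return len(objs) > 0

# "Is there a blue cube right of the sphere?"
sphere = filter_(scene, "shape", "sphere")[0]
answer = exist(
    filter_(filter_(relate(scene, sphere, "right_of"), "colour", "blue"),
            "shape", "cube"))
print(answer)  # True
```

The neural part of such systems learns to ground the attribute values (red, cube, right of) in perception; the operator tree itself stays symbolic and composable.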
The question of grounded symbols and emergent vocabularies has been largely covered in the works of Steels [10]. Through language games, agents develop a common language (semiotic landscape) enabling them to refer to elements of the environment such as colours [48] or relative spatial coordinates [49]. Fig. 4 shows a language game where agents take turns naming a colour, while the other agent must guess and point at the colour referred to. Each agent has developed its own representations (semiotic network) from sensations to prototypes, and generated its own linguistic inventory. Through the game, they negotiate the association between symbols and meaning; success in shared communication causes words to spread in the population [10]. In this work, the social context for learning is a peer relation between agents sharing the same environmental constraints. Unlike the trainer-trainee interactions of other works (e.g. [12]), this approach is less susceptible to introducing the trainer's bias and favours the emergence of concepts with a level of precision suited to the agents' common goals. The extensive sea ice terminology found in dialects of the Inupiaq language [50] (Alaska) illustrates the importance of this socio-environmental relation in the emergence of a culture and the formation of its relevant concepts.
In the example shown in Fig. 4, it is interesting to see the increased “compression” and the loss of detail that occurs when processing sensations into symbols. Categorization already loses the specific hue of the colour when converting sensations into prototypes (it is “blue”), but this also occurs when associating representations to symbols: here the word “xiuxiu” designates both the green and yellow prototypes. In this instance, one symbol was deemed sufficient to express the meaning (which is most likely “the other colours”) needed to succeed in the guessing game.
Perceptually grounded approaches to the acquisition of representations are particularly suited to the development of memory systems and internal states for agents: each successive level of complexity serves a corresponding level of procedural knowledge, meaning that both can develop simultaneously and support each other. With such a mechanism supporting development, social behaviour and communication (at least with simple symbols) can be understood as an increment in complexity rather than a radical change. Abstract ideas and symbols which cannot be grounded are left for further steps in this development process.

3.4. Internal states for lifelong development of agents

It is well known that intelligent behaviour can emerge without internal representations [51]. However, agents aiming at lifelong development require a way to form internal representations if they are to reach levels of complexity approaching those of higher animals. In the previous sections, we reviewed a number of approaches to the representation of declarative knowledge, from low-level to complex, non-symbolic to symbolic, from the formation of concepts to the emergence of language. We now provide an analysis of these approaches from the perspective of lifelong agent development. The requirements for a general purpose representation system for the lifelong development of artificial agents that we derive from this analysis are presented in Section 4.2.
On the integration of the representation system  Should the skills of the agent use structures such as Recurrent Neural Networks to represent procedural knowledge, some form of low-level memory system will already be included in the form of the memory layer. However, these memory elements are tied to a single skill and cannot be shared with others, which precludes the representations acquired during training from supporting the development of subsequent skills. To address these limitations, an intermediate-level memory system should also be included at the level of the control structure. This would allow commitment to a task, in the same way a memory layer does, and also offer alternatives in skill structuration, for instance by generalizing some behaviour around representations of concepts. Representations existing independently of the procedural knowledge manipulating them enable their use in the developmental process, where they can serve as base elements in the subsequent creation of skills whose purpose was not anticipated.
On purpose-built representation systems  A developmental agent could benefit from using purpose-built high-level representation mechanisms. For instance, using a dedicated spatial memory for mapping and path planning makes practical sense, as navigation in a spatial environment is a problem often related to embodiment. However, the main drawback of these representation systems is their dependency on the type of knowledge they represent. Building agents relying on a given set of specialized elements, whether memory systems (e.g. Fig. 3) or motor primitives for behaviours, already assumes that we have anticipated and provided all that will be needed for their development. Even if such systems can be very efficient for autonomous agents, they cannot completely fill the role of a more general purpose mechanism for lifelong development.
在专门构建的表征系统上,发展型代理可以从使用专门构建的高级表征机制中受益。例如,使用专用的空间记忆进行映射和路径规划是有实际意义的,因为在空间环境中的导航通常与具身性相关。然而,这些表征系统的主要缺点是它们依赖于所表示知识的类型。构建依赖于特定一组专门元素的代理,无论是记忆系统(例如图 3)还是行为的运动原语,已经假设我们预见并提供了其发展所需的所有内容。即使这样的系统对于自主代理非常高效,它们也无法完全替代更通用的终身发展的机制。

Fig. 3. On the left: the extended SOAR architecture, showing a flat mapping between long term and short term memory (from Laird [43]). On the right: the ICARUS architecture, mapping long term to short term memory of different components at different stages of the deliberative process (from Choi and Langley [40]).


Fig. 4. On the left: Two robots playing a language game of naming and pointing at colours. On the right: portion of a semiotic network for a single agent, linking sensations to sensory experiences, prototypes, and symbols. The generated symbols are negotiated between agents during the language game (from Steels [10]).

On representation systems and development  We have seen mechanisms for classifying and encoding high dimensional states or intermediate elements of the working memory into a highly compressed representation, and how these representations play a role in emergent symbol grounding. In the long run, the emergence of high-level symbols will open the way to general purpose artificial intelligence and high-level reasoning as understood by humans. We regard this bottom-up approach as the most suited to developmental agents, each level of complexity of behaviour forming its useful and grounded representation as a basis for the next until abstract level operations can be reached to process these representations.
The representation system for lifelong development we propose borrows properties from the different approaches previously discussed, and is designed to be integrated into a developmental agent architecture. The following sections present our contribution, the Variable module, and its implementation as an extension of MIND, a reactive architecture for developmental agents.

4. An architecture integrating memory systems for lifelong development

Our approach to memory systems follows the same principles as the lifelong development of procedural knowledge: complex representations are reached by the progressive increase in complexity of previously acquired representations, and are closely coupled with procedural knowledge, providing mutual support in accomplishing new tasks. Therefore, this memory system has to be strongly integrated with the structure supporting procedural knowledge, and such an architecture must fit the properties required by the memory system.
We propose an implementation of this approach as the Variable module on MIND, an architecture designed for lifelong agent development which is based on a flexible modular system able to integrate new skills and progressively structure complex behaviours through a curriculum of training tasks. As a regular component of the architecture, the Variable module is able to integrate closely with skills and their learning process, making it suitable for the progressive acquisition of emergent representations.
Section 4.1 presents MIND and its specific coordination mechanism, the Influence. Section 4.2 presents the Variable module, our general purpose memory system compliant with the Influence mechanism. Section 4.3 details several of its possible implementations. Section 4.4 presents the drive module, an implementation solution providing an entry point into MIND hierarchies to enable human guidance and control.

4.1. MIND, an architecture for lifelong development

The MIND architecture (Modular Influence Network Development) [11] is an artificial agent control architecture suited to open-ended and cumulative learning. MIND encapsulates heterogeneous structures representing procedural knowledge (such as neural networks or programmed procedures) into skill modules dedicated to performing different sub behaviours. Skill modules are combined into hierarchies, from the low-level base skills performing simple sensorimotor behaviour, to complex and master skills performing the combination of concurrent lower level skills.
Compared to similar research, the main original aspect of MIND lies in its multi-layered hierarchy [52], [19] using a generic control signal, the Influence, allowing the use of vector composition [53], [54], [55] to obtain an efficient global behaviour. It offers significant improvements over methods such as the Robot Shaping techniques [24] and other hierarchical approaches [25] by allowing high-level controllers to interpolate between the subspaces of low-level policies. The Influence mechanism proves to be suited to direct neurocontrol [19], [56], [57], [22] as well as motor primitives or schemas (policies) [53]. Unlike cognitive architectures [43], [40], [25] using multiple sub-systems, the network of modules constructed by MIND should be viewed as a development of reactive architectures towards complex behaviours. The modular design and the encapsulation of the skill internal function make the choice of the controller independent of the implementation of MIND, and allow the use of different controllers for each skill, working together within the same hierarchy, which is an advantage over comparable systems [58], [19].

4.1.1. Base skill, complex skill, and Influence

We consider sensory information and motor commands represented as vectors of real numbers, normalized between 0 and 1. It is possible to create a module encapsulating a vector-valued function f that maps the input vector VI = [I1,I2,...,In] to the output vector VO = [O1,O2,...,Om] (Eq. 1). The function f can be implemented as a programming procedure, or it can be a function approximator such as a neural network, or any other kind of function that associates two vectors of real numbers. Such a module is called a skill, and the module whose output vector is used directly as motor commands is a base skill.
Several concurrent base skills can be trained to perform the subtasks of a complex task. Each base skill only associates the inputs and outputs necessary to accomplish its assigned task. To perform the complex task, a complex skill is created which coordinates several skills (its subskills) by means of a signal called Influence, which determines how much “weight” a subskill has on the overall behaviour. This approach can be understood as delegating the resolution of the current task to one subskill or a combination of subskills, in the same fashion as the Boid brain coordinates its sub-behaviours to perform the flocking behaviour [59].
A complex skill, as any other skill, encapsulates a function as defined in Equation (1). Its output vector is directed at its subskills and called the Influence vector VInfl = [Infl1,Infl2,...,Inflm]. A complex skill can be the subskill of a higher level complex skill, thus creating hierarchies of skills. The top level complex skill, the master skill, receives a constant Influence of 1.0, an impulse setting the whole decision process in motion.
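As an illustration, the skill and Influence mechanism described above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch: the class and method names (`Skill`, `ComplexSkill`, `receive`, `activate`) are ours and are not taken from the MIND implementation.

```python
from typing import Callable, List

Vector = List[float]

class Skill:
    """Encapsulates a vector-valued function f mapping V_I to V_O in [0, 1]."""
    def __init__(self, f: Callable[[Vector], Vector]):
        self.f = f
        self.received = 0.0            # sum of Influences received this step

    def receive(self, influence: float) -> None:
        self.received += influence

    def activate(self, sensors: Vector) -> Vector:
        # The output is scaled by the total Influence the skill received.
        return [o * self.received for o in self.f(sensors)]

class ComplexSkill(Skill):
    """A skill whose scaled outputs form the Influence vector for its subskills."""
    def __init__(self, f: Callable[[Vector], Vector], subskills: List[Skill]):
        super().__init__(f)
        self.subskills = subskills

    def activate(self, sensors: Vector) -> Vector:
        influence_vector = super().activate(sensors)
        for sub, infl in zip(self.subskills, influence_vector):
            sub.receive(infl)
        return influence_vector

# A toy two-level hierarchy: the master skill receives the constant
# Influence of 1.0 that sets the decision process in motion.
base = Skill(lambda s: [s[0], 1.0 - s[0]])
master = ComplexSkill(lambda s: [0.7], [base])
master.receive(1.0)
master.activate([0.4])            # base now holds an Influence of 0.7
commands = base.activate([0.4])   # base outputs weighted by that Influence
```

Here the master's single output acts as the Influence sent to its only subskill; in a full hierarchy the Influence vector would span several subskills over several levels.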
Fig. 5 shows a hierarchy of skills. The Influence flows along a vertical axis from the master skill down to the base skills to determine which skills (and with what magnitude) are in charge of the overall behaviour. The information from the sensors reaches all the skills of the hierarchy, and the motor commands are output from the base skills to the actuators, forming a horizontal information flow whose purpose is to determine how the behaviour is executed. Unlike other approaches (e.g. [24], [60]), sensory inputs are available to every skill, including complex skills. This enables a complex skill to perform subtle coordination based on information that may not be needed by its subskills.

Fig. 5. A skill hierarchy, a master skill influences complex skills which in turn influence the base skills.

4.1.2. Using Influence to determine motor commands

Starting from the master skill, each complex skill computes its output vector VO and multiplies each element by the sum of the Influences it received, forming the Influence vector VInfl. The skill then sends each element Infl of the Influence vector to the corresponding subskill (Fig. 6).

Fig. 6. Internal architecture of a skill.

The base skill computes its output vector and multiplies each element by the sum of the Influences it received, similarly to equation 2, forming the motor command vector VCom = [Com1,Com2,...,Comm]. The base skill then sends each element Comx of the motor command vector to the corresponding motor module along with the sum of the Influences (ΣInfl) the base skill received. Each motor module then computes the corresponding motor command for its actuator as a normalized weighted sum according to Equation 3. An example of the complete computation of a motor command from the master skill to the actuator in a three level hierarchy is shown in Equation 4.
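As a hedged illustration of the normalized weighted sum described above (the function name and argument format are our own assumptions, not the paper's code): each base skill k sends a command Comk, already scaled by the total Influence ΣInflk it received, together with that total, and the motor module divides the accumulated commands by the accumulated Influence.

```python
def motor_command(contributions):
    """contributions: (Com_k, sum_infl_k) pairs sent by the base skills,
    where Com_k is the skill's raw output already multiplied by sum_infl_k."""
    total_command = sum(com for com, _ in contributions)
    total_influence = sum(infl for _, infl in contributions)
    # Normalized weighted sum: equivalent to averaging the raw outputs,
    # weighted by the Influence each base skill received.
    return total_command / total_influence if total_influence > 0 else 0.0

# One base skill proposes 0.9 with a strong Influence of 0.8, another
# proposes 0.2 with a weak Influence of 0.1: the result leans towards 0.9.
cmd = motor_command([(0.9 * 0.8, 0.8), (0.2 * 0.1, 0.1)])
```

The normalization by the total Influence keeps the actuator command inside [0, 1] regardless of how many base skills contribute.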

4.1.3. Training MIND and experimental results

The MIND architecture was evaluated through a series of experiments on the acquisition of navigation and collection behaviour by a simulated robot. MIND hierarchies are trained following a curriculum, beginning with low-level base skills, which increases in complexity and requires combining previously acquired skills. The experimental setup involved a curriculum of six tasks to form a behaviour hierarchy of six skills (Fig. 7) enabling an agent to collect objects in an environment while avoiding obstacles. This controlled “shaping” [24] approach to development greatly reduces the complexity of new tasks through the categorizability of context [61]. The experiments also introduced a retraining and learning optimization strategy for skill hierarchies, along with strategies for the establishment of curriculums inspired by Robot Shaping [24] and coevolution [62].

Fig. 7. An overview of the division into skills. On the left: the original Collect task; on the right: the Collect hierarchy with the addition of energy management.

To demonstrate the suitability of MIND for lifelong development, the collect behaviour was then extended to include energy management constraints, along with new inputs and physical elements indicating the battery level of the agent and the position of a recharge zone. In MIND, each new skill has independent access to sensor data [52] and performs local coordination of the added elements (software, hardware), including elements not present or needed during the formation of the previous skills. This experiment shows that new behaviours and physical elements can be integrated without any impact on the previously acquired behaviours, which remain available for future combination and will not have to be learned again. Finally, a proof-of-concept was deployed on a real world robot using the skills trained in simulation.
Videos are available for each experiment [63], [64], [65] and the preliminary robotic application [66].

4.1.4. Benefits and limitations

MIND is by design an architecture supporting the lifelong development of artificial agents; its modularity reflects the modular and hierarchical nature of complex tasks and is the key to the progressive learning and accumulation of skills. Encapsulation into modules provides stable and identifiable skills [6]: it defines an area of responsibility for each skill within the global behaviour and makes it available for combination. This offers great flexibility in building hierarchies: other levels are not concerned with how a module accomplishes its function, and modules performing similar functions can be substituted. For instance, a skill relying on a reach target subskill is not concerned whether the target is reached by flying or swimming. Stability means subsequent training makes use of previously acquired skills without altering their purpose, thus avoiding conceptual drift and the risk of catastrophic forgetting, which would impair all other skills relying on them. Existing hierarchies can be built upon, sub-hierarchies can be retrained or replaced, and new modules can be added to interface additional physical elements. At each step, the target behaviour for which an agent is trained can become an element for the future creation of one or even several behaviours of greater complexity, whose purpose cannot be anticipated.
Compared to underlying architectures of other works focusing on autonomous development, such as GRAIL [60], the use of different decision techniques within the same system is extended to higher levels. Variants of GRAIL [67], [68] such as C-GRAIL, M-GRAIL can be implemented in MIND as different internal functions of complex skills. Since it does not introduce an explicit two-level separation, such as “goal selectors” and “experts”, the subdivision of areas of responsibilities of skills can be finer and gradually distributed over multiple levels. There are no restrictions on the decision technique used at each level, and neural based approaches can be used to control lower level skills. Finally, multi-level hierarchies enable MIND to overcome the limitation of M-GRAIL [67] on the reuse of previously learned “sequences”, as illustrated by the extension of the collect hierarchy with energy management (Fig. 7) and object counting (Section 5.3).
Further research involving MIND aims at the integration of mechanisms for intrinsic motivation, autonomous structuration of hierarchies and human-robot interfaces, taking further steps toward a complete architecture for developmental robotics, leaning towards the emergentist paradigm of cognitive architectures. Applications to robotics and drones, as well as research on multiagent problems, are being investigated. However, a more fundamental concern is the lack of any form of memory system or way to represent information inside hierarchies.
Fairly complex behaviours can be performed using a reactive control mechanism [51], and other works have shown that MIND reactive hierarchies are able to support coordinated multiagent behaviours [69]. However, some behaviours require the ability to memorize or represent information. This can be as simple as retaining information about an environment which is no longer observable, or involve much more complex mechanisms in which an internal state evolves over time, for purposes such as social specialization in a homogeneous population [70]. As discussed in Section 3.4, memory elements could be added to the skill internal function, using for instance RNNs, but those memory elements could not be accessed or shared between skills due to encapsulation. Each skill would have to learn and manage its own representations, and those representations could not be used for the creation of future behaviours.
To address this issue we provide a memory and representation system at the level of the hierarchy, where its elements can be reused by multiple skills in a developmental manner. This memory system conforms to the principles of MIND and interacts with skills through the Influence mechanism. It enables agents to commit to behaviours, beyond a purely reactive level, and supports lifelong development of internal representation. Each new concept is introduced along with related skills, the representations acquired are grounded in the behaviour, and remain available for future combination, supporting development in the same way skill modules do.
On a structural level, internal representations enable skills to exchange information through the Influence mechanism. This is immediately beneficial for basic navigation which can be generalized to a single skill using the concept of “target” assigned by higher level skills, instead of being bound to a specific sensory input (object, drop zone, power supply).
The following section shows how our memory and representation system is implemented as a new type of module for MIND: the Variable module.

4.2. The memory system supporting non-symbolic representation: the variable module

To meet the requirements of lifelong development and comply with architectures handling the development of procedural knowledge, we follow the guidelines highlighted in Section 3.4 and provide a (1) general-purpose non-symbolic memory system (2) integrated with procedural knowledge. Memory elements are (3) identifiable, stable and reusable and support the (4) structuration of behaviour. The development of values is grounded by the same learning process used by the skills, leading to (5) emergent representations and could support the design of (6) explainable systems. This is achieved by conforming to the modular principle of MIND and its signal approach for representing information, enabling the close integration with skills to form representations that are emergent and persistent.
Modularity: To support the development process of the agent, the information, internal states or concepts for which representations are formed are implemented as variable modules. They are accessible by skills at hierarchy level, and subjected to the same mechanisms for regulation, access and use as other information-carrying modules (i.e. sensor and motor modules). This ensures that variable modules are regular components preserving the unicity and scalability of the underlying approach. Variable modules can be used by skills as input to receive shared information, or as output to share information and synthesize several concurrent commands according to the Influence mechanism. Variable modules exist independently of skill modules, and skill modules depending on them are, to some extent, able to function independently of variable modules. Indeed, skills organized to exchange information through variable modules will not fail because of missing input when a skill is removed or rendered inactive, as they would if a direct exchange between skills were allowed. The variable module acts as a buffer for information exchanged between skills, which allows alternative structurations of hierarchies, such as in-line or branching information processing structures, while keeping the benefits of the Influence mechanism. As a module, it is identifiable and addressable, and its shared use in multiple behaviours ensures the stability of its dedicated purpose in the same way skill modules provide identifiable and stable behaviours. These internal states are observable, and could support the design of explainable systems.
Representation: Variable modules are intended as a general purpose representation system and must handle any kind of information, such as the orientation of a target, time, an alert level, a role, etc. An individual variable module can represent a simple concept, a conceptual space limited to a single quality dimension [28]. To form complex conceptual spaces, several variables can be provided together to a skill. Fig. 8 illustrates the notion of conceptual space, with the psychological representation of colour, and the position of a target in polar coordinates. Conforming to the signal approach of MIND, the concept is represented using a real number bound between 0 and 1, which allows for a rich non-symbolic representation. Applied to motor control, this approach is able to represent a binary choice (above 0.5 for activation, under 0.5 for deactivation) as well as the fine control of an actuator (from 0 to 1, from full speed forward to full reverse). Applied to declarative knowledge, it is independent of the type of knowledge it represents, and its continuous range can be divided into as many classes as needed: 4 seasons, 10 roles, 24 hours, 26 letters or 360 degrees; the limit is placed only on the precision of the function interpreting the information.
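A minimal sketch of such a variable module, under our own naming assumptions (the MIND implementation may differ), is a persistent scalar in [0, 1] written through the same normalized weighted synthesis as an actuator and read back as however many classes a skill needs:

```python
class VariableModule:
    """A persistent scalar in [0, 1] shared between skills."""
    def __init__(self, initial: float = 0.0):
        self.value = initial
        self._pending = []          # (scaled command, influence) pairs

    def write(self, command: float, influence: float) -> None:
        self._pending.append((command * influence, influence))

    def commit(self) -> None:
        # Synthesize concurrent writes as a normalized weighted sum,
        # like a motor module; the value persists across steps, which
        # is what makes the module a memory element.
        total_influence = sum(i for _, i in self._pending)
        if total_influence > 0:
            self.value = sum(c for c, _ in self._pending) / total_influence
        self._pending.clear()

def read_as_class(variable: VariableModule, n_classes: int) -> int:
    """A reading skill may divide the continuous range into n classes
    (e.g. 4 seasons, 10 roles, 24 hours)."""
    return min(int(variable.value * n_classes), n_classes - 1)

role = VariableModule()
role.write(0.9, 1.0)        # a skill assigns the third of three roles
role.commit()
current_role = read_as_class(role, 3)
```

Note that the same value could equally be read as one of 10 roles or 360 degrees; the interpretation belongs to the reading skill, not to the module.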

Fig. 8. On the left: the HSV representation of colour as an example of a conceptual space defined by the 3 dimensions of hue, saturation and value (from [28]). On the right: Polar coordinates relative to an agent. The 2 dimensions can be represented by 2 MIND variables.

Emergence: The integration of a medium for non-symbolic representations and its instantiation alongside new skills during the agent's development results in strong interactions between behaviour and internal representations in every new learning situation. Inferences made on the memory elements, their meaning and interpretation, depend on the skills manipulating them, and whatever particular form these may take, they are perceptually or behaviourally grounded by the same learning process used by the skills, leading to emergent representations. When learning sensorimotor behaviour, skills attempt to improve their efficiency by finding the appropriate commands to send to the actuators. These output commands are translated by a driver layer into the specific command codes of the actuator and, as this driver layer is given, skills have to adapt to its expected values. In the case of the variable module, skills will both provide and interpret the information of the variable. This means that skills writing into a variable and skills reading that same variable will have to converge on the meaning of its values through an emergent process, by subdividing this variable into classes whose bounds are grounded in experience. For instance, if a skill sets a variable to represent one of 3 possible roles an agent can play, it will pick 3 different values to represent these 3 roles. A skill which reads this variable to act according to its role will have to learn to classify this value into 3 different classes. The process is comparable to language games between agents [10], happening here between skills within a single agent. The agreed representation is valid as long as it improves the behaviour of the agent in the world.
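The role example above can be sketched as follows; the role names, the values picked by the writer and the thresholds learned by the reader are all illustrative, the point being that the two skills only need compatible, experience-grounded class bounds, not identical encodings.

```python
# Writer skill: picked (arbitrarily, during its own training) three
# values to stand for the roles "scout", "collector" and "guard".
ROLE_VALUES = {"scout": 0.15, "collector": 0.5, "guard": 0.9}

def write_role(role):
    return ROLE_VALUES[role]

# Reader skill: learned two class boundaries grounded in experience;
# only the thresholds matter, not the exact values the writer picked.
def read_role(value, low=0.33, high=0.7):
    if value < low:
        return "scout"
    if value < high:
        return "collector"
    return "guard"

# The skills "agree" as long as each written value falls into the
# matching learned class.
for role in ROLE_VALUES:
    assert read_role(write_role(role)) == role
```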
Persistence: The information contained in a variable module persists until overridden, and can be used as memory. Its application parallels iconic and fragile memory, up to working memory in humans, as a way to store past observations and information in an intermediate stage of processing. Beyond commitment to tasks and low-level cognitive behaviours using working memory, persistent information shared explicitly within the architecture offers the possibility of representing internal states subject to gradual evolution and reinforcement. It enables the design of agents emulating physiological or psychological states, which can be instantiated as identifiable elements. Such internal states can regulate behaviour and vary over time by aggregating information, such as feedback from various mental processes. This opens many possibilities for the future integration of various motivational systems such as those based on drive reduction theory [39].
Implementation: Variable modules provide information to skills in the same way sensor modules do. Variable modules can also receive commands from skills in the same way motor modules do. The transmission of a command to a variable module conforms to the Influence mechanism, the value of the command is determined in the same manner as for motor commands. Fig. 9 shows the diagram of a variable module and how it integrates into a skill hierarchy.

Fig. 9. Left: Internal architecture of a variable module. Right: Variable integration in a MIND hierarchy.

Once the input commands coming from skills have been processed, a regulation function is applied before making the result available to other skills. In its simplest form, it is a linear transfer function and outputs the normalized input commands. In the following section, we present possible implementations of the regulation function which allow variable modules to serve varied purposes.
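A minimal sketch of this processing chain, assuming an Influence-weighted average as the arbitration step (the class name and interface are illustrative, not the EvoAgents API):

```python
class VariableModule:
    """Sketch of a variable module: input commands are arbitrated by
    Influence-weighted average, then a pluggable regulation function
    shapes the value made available to reading skills."""

    def __init__(self, regulation=lambda x: x):
        self.regulation = regulation  # linear transfer by default
        self.value = 0.5              # illustrative initial value

    def update(self, commands):
        """commands: list of (command, influence) pairs from writers."""
        total = sum(influence for _, influence in commands)
        if total > 0:
            arbitrated = sum(c * i for c, i in commands) / total
            self.value = self.regulation(arbitrated)
        # With no incoming command, the value simply persists.
        return self.value
```

Subclassing or swapping `regulation` is how the counter and wave generator of the next section would fit into this sketch.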

4.3. Variable implementation

The regulation process offers many possible implementations for variables. As skills can accommodate many kinds of internal functions, variable modules support any regulation process that conforms to the input/output rules of the Influence mechanism. Thus, variable modules may serve many purposes such as memory management, counters, signal generators, internal clocks, etc. The choice of function is left to the designer, as is the possibility of adding new types of functions.
Here follows an overview of 3 implementations of the variable module developed for our experiments: the simple variable, the counter and the sinusoidal wave generator.

4.3.1. Simple variable

The simple variable uses a linear transfer function as its regulation function. Its output value is the direct result of arbitration between the commands of the input skills, computed following Equation 3 for motor commands (Section 4.1).
Other variable types extend the behaviour of the simple variable by applying their regulation function to this processed input. Fig. 10 shows the evolution of the value of a simple variable linked to two input skills. This example could represent the priority level for a mining/collector robot to return to its base, one skill writing its assessment according to energy consumption, the other according to its cargo load. As the relative Influence of the red skill increases, its effect on the value of the variable becomes noticeable; at T6 the effect of the red skill's command is clearly visible.

Fig. 10. Simple variable. On the right: a diagram of two skills writing to a simple variable. On the left: the evolution over time of the commands, Influences and resulting value of the variable. The top graph shows the commands sent by the two concurrent skills, the middle graph shows their respective Influence and the bottom graph shows the evolution of the output value of the variable.

The simple variable is the most general purpose variable implementation: it stores raw commands coming from skills and provides them as they are. As such, simple variables are very convenient for sharing rich representations between skills.
The simple variable can persist a value over time; however, it is continuously overridden by any input command with a non-null Influence value, no matter how weak. As a result, using simple variables for memorization requires carefully designed hierarchies and very accurate training of the associated skills.
This is related to the sensitivity issue of the signal based approach, discussed in [11]. Instead of punctual commands or messages between components of the hierarchy, the control mechanism simulates continuous signal updates, and even the weakest Influence becomes dominant in the absence of a greater Influence counteracting it. Although this has not proven to be a problem so far, weak Influences of insufficiently trained skills might accumulate in larger hierarchies and disrupt behaviour. Should training skills to the required accuracy be too costly, mechanisms such as activation thresholds for Influence could be added to address the issue.
The application of variables to memorization and persistence revives this reflection on the sensitivity problem and the signal approach of MIND. To address this issue, many simple solutions inspired by signal processing, biology or electronics come to mind, such as accumulators, capacitors, thresholds, etc. One such solution is implemented as the Counter variable described in the following section.

4.3.2. Counter

The Counter variable is a simple solution for memorization which records “events” from its input signal and provides a “discrete” count of these events as an output. The events recorded are rising and falling edges of the signal around given thresholds.
The counter outputs the current recorded count as a fraction of its maximum possible count. In our default implementation the maximum is set to 10, therefore the possible values for the counter are [0.0, 0.1, ..., 0.9, 1.0]. The counter is incremented on a rising edge of the input value above the increment threshold, and resets to zero on a falling edge of the input value below the reset threshold. In case of overflow, the counter resets to zero.
Fig. 11 shows the evolution of the value of a counter over time. Between T1 and T2, the first rising edge of the input value over the increment threshold brings the value of the counter to 0.1. Between T5 and T6, the second rising edge brings the value to 0.2. Between T7 and T8, the falling edge of the value below the reset threshold resets the value to 0.0.
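The edge-detection logic described above can be sketched as follows; the threshold values and the interface are illustrative defaults, not the EvoAgents implementation.

```python
class Counter:
    """Sketch of the Counter variable: counts rising edges of the input
    above an increment threshold, resets on falling edges below a reset
    threshold, and wraps to zero on overflow."""

    def __init__(self, maximum=10, inc_threshold=0.6, reset_threshold=0.4):
        self.maximum = maximum
        self.inc_threshold = inc_threshold
        self.reset_threshold = reset_threshold
        self.count = 0
        self.previous = 0.0

    def update(self, value):
        rising = self.previous <= self.inc_threshold < value
        falling = self.previous >= self.reset_threshold > value
        if rising:
            self.count += 1
            if self.count > self.maximum:   # overflow wraps to zero
                self.count = 0
        elif falling:
            self.count = 0
        self.previous = value
        # Output is the count as a fraction of the maximum, in [0, 1].
        return self.count / self.maximum
```

Moderate fluctuations that stay between the two thresholds leave the output untouched, which is precisely what relaxes the accuracy requirements on writing skills.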

Fig. 11. A Counter variable: the rising edges of the input increment the counter, the falling edges reset it.

Compared to a simple variable, the value is more stable over time: moderate fluctuations of the input which do not cross the thresholds do not affect the output value. This greatly diminishes the accuracy requirements of the skills manipulating these variables. However, the representation of the information is much poorer (partly due to this “discrete” form) and subject to some higher-level processing (counting “events”) which might not be suitable for all applications.

4.3.3. Wave generator

When learning reactive motor skills, providing a random value as input helps the agent free itself from behaviour loops. In these cases, slight variations of the behaviour are acceptable, and taking into account a random value allows the agent to produce different outputs for the same set of sensory inputs.
This idea of virtual input fits with the variable system as generators: the output is a signal generated as a function of time (or ticks), the input can remain unused or can control a parameter of the generator. In the case of the wave generator, the output is a sinusoidal function of a measure of time, the input value sets the frequency.
Fig. 12 shows the evolution of the value of a wave generator over time according to the normalized input value: the higher input value between T3 and T5 causes a faster oscillation of the output value, the lower input value between T6 and T8, a slower one.
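A minimal sketch of such a generator; the phase-accumulator formulation and the scaling constant are illustrative choices that keep the output continuous when the input frequency changes.

```python
import math

class WaveGenerator:
    """Sketch of the wave-generator variable: the input command sets the
    frequency of a sinusoid of the update tick. The phase is accumulated
    so that frequency changes do not cause jumps in the output."""

    def __init__(self, max_step=0.5):
        self.max_step = max_step  # largest phase step per tick (radians)
        self.phase = 0.0

    def update(self, frequency_input):
        # Normalise the sinusoid back into the [0, 1] signal range.
        value = 0.5 + 0.5 * math.sin(self.phase)
        self.phase += frequency_input * self.max_step
        return value
```

A higher input value advances the phase faster, producing the faster oscillation seen between T3 and T5 in Fig. 12.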

Fig. 12. A wave generator: the input value controls the wavelength.

Wave generators can have multiple uses, such as measuring time or providing a fluctuating value. Different types of oscillating functions are used in works on evolving virtual creatures [71], [19], for instance to synchronize movements between limbs for locomotion. In such cases, one can see the benefit of varying the frequency to accelerate or decelerate synchronized walk cycles.
In our current implementation, the measure of time used is the update tick of the system, a value incremented after each evaluation of the MIND hierarchy. This is a simple method with low computing costs which suits most uses. Such a measure is relative to the execution of the system: it starts at the same point (tick 0) for each execution, shares the same update rate, and is paused and resumed along with the execution of the agent's behaviour. This relative measure is well suited to the synchronization of behaviours within the hierarchy; for other purposes depending on a measure of real time (24-hour cycles, powered-down periods, etc.) a call to the real-time clock system can be substituted for update ticks.

4.4. An entry point for human guidance and control: the drive module

Exploring the new possibilities of representing and manipulating information within the hierarchy required a convenient way to understand and interact with variable modules. When shaping hierarchies, we needed to provide temporary structures and inputs and set up the arbitrary goals needed for such training, all of which required tricks and ad-hoc solutions, as shown in Section 5.2.
Ultimately, the master skill of a fully developed MIND agent should be able to define its own goals, fulfil its needs, and continue its development by itself, following ongoing emergence [6]. Another perspective on the purpose of such work is that artificial agents are built to help us achieve our goals and fulfil our needs, which requires a way to control what the agent is trying to learn or accomplish. Most control architectures offer a formal way to input the operator's commands: “mission modules”, “supervisors” [72], “mission layer” [73]. The name chosen for such a system often reflects the design philosophy of the architecture.
In MIND, this role is played by the drive module (a term borrowed from drive reduction theory [39]). The drive module (Fig. 13) is the entry point in the hierarchy for setting mission parameters and objectives, task configuration and constraints, and can provide adjustable autonomy, temporary teleoperation [74] or human guidance in planning (so-called “human-on-the-loop”). It enables the designer to define a programmed procedure called at each evaluation of the hierarchy. This procedure can load and access all modules available to the agent and can be used to set the value of a variable, add a concurrent command to an actuator or send Influence to skills, in the same manner as a skill controls subordinate components (actuator, variable, subskill). Appendix C discusses applications and a proof of concept for adjustable autonomy; a demonstration video of this behaviour is available at [75].
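The kind of procedure described above can be sketched as follows. The module names, the `Module` stand-in and its `write` interface are all hypothetical; the point is that the drive procedure addresses modules exactly as a skill would, here guiding learning by writing into the shared Target variable as in Fig. 13.

```python
class Module:
    """Minimal stand-in for an addressable MIND module (hypothetical
    interface, for illustration only)."""

    def __init__(self, value=0.0):
        self.value = value   # last value read from the module
        self.commands = []   # (command, Influence) pairs received

    def write(self, command, influence):
        self.commands.append((command, influence))

def drive_procedure(modules):
    """Sketch of a drive-module procedure, called at each evaluation of
    the hierarchy: when an object is detected, point the shared Target
    variable at it, standing in for the master skill during training."""
    if modules["object_sensor"].value > 0.5:
        modules["target_variable"].write(
            command=modules["object_orientation"].value, influence=1.0)
```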

Fig. 13. The drive module and its relation to a MIND hierarchy. In this example the drive module guides learning by reading from the sensors and writing into the shared variable, taking the place of the master skill which has not been learned yet (shown in grey).

In this article, the drive module is applied to the simplification of the behaviour shaping process in the following experiments.

5. Experiments with the variable system

To demonstrate the applications of the variable system of MIND, we present two scenarios showing different uses of variable modules. These experiments are conducted using the latest version of EvoAgents, a framework implementing MIND, along with the learning algorithms and simulation environments required to train agents. Section 5.1 briefly presents EvoAgents, the experimental context and training methods used. In keeping with the spirit of lifelong development, the new skills learned in these experiments will reuse as much as possible the skills and hierarchies acquired during the previous series of experiments described in Section 4.1.
The following sections present experiments on the use of variables to store, retrieve and share information. In Section 5.2, we investigate the use of variables as a means to exchange information between skills and organize in-line structures: the output of one skill is the input of another. We reorganize the different navigation skills of the Collect scenario around the “target” concept, represented by a variable whose value is set by another skill.
Section 5.3 investigates the use of variables as memory. In this experiment we learn to count steps in a process which shows no indications of its current state in the environment. The challenge is to learn a useful internal representation through interaction with the environment.

5.1. Experimental context

EvoAgents is a custom framework written in Java designed to conduct experiments with agents using the MIND hierarchy as their control system (Fig. 14). It includes the implementation of MIND and its different modules, a set of learning algorithms for neural networks and genetic programs based on the Encog library [76] to train the skills internal functions, and a number of interfaces (remote control and simulations) and physics based simulation environments (2D [77], 3D [78], multi-agent). EvoAgents is designed to run parallel learning algorithms in headless configuration and has been used on HPC clusters. EvoAgents is open source software distributed under GPL 3 licence, and available to the community for review or implementation of MIND hierarchies [79].

Fig. 14. Interfaces and simulation environments of EvoAgents. From left to right: 3D physics environment, our remote robot, multiagent environment, a remote environment in unity game engine.

For the purpose of this article we only consider the 2D physics environment. All dynamic objects have a simulated mass and friction properties, and behave according to laws of motion.
The simulated robot is composed of two drive wheels and a front claw to grab objects. In the physics environment, the effect of each wheel is represented by a force impulse vector on the robot's body at a position and with a direction corresponding to the relative position of the wheel (if only one of the wheels is set to forward motion, the robot rotates in place). The robot is equipped with 18 sensors providing information such as obstacle distance or target orientation. Sensor modules are in charge of interfacing sensors with the MIND architecture and enriching the sensory information. They provide a history of past values and derivative to be used as input by skills.
Motor commands, issued by a motor module to its corresponding actuator as real numbers in the [0,1] range, are interpreted by a driver layer as a percentage of the actuator's capability (power, speed, angular position...). For the wheels, 1 corresponds to full speed forward, 0 to full speed backwards and 0.5 to stopped. For the claw, the value corresponds to its state: above 0.5 is closed, under 0.5 is open.
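A minimal sketch of this driver-layer mapping, assuming the wheel output is expressed as a signed fraction of the wheel's maximum speed (the function names are illustrative):

```python
def wheel_drive(command):
    """Sketch of the driver mapping for a wheel: 0.5 is stopped,
    1.0 full speed forward, 0.0 full speed backwards. Returns a signed
    fraction of the wheel's maximum speed."""
    return (command - 0.5) * 2.0

def claw_drive(command):
    """Claw state: above 0.5 the claw is closed, otherwise open."""
    return "closed" if command > 0.5 else "open"
```

Skills never see these actuator-specific codes; they only learn to produce the [0,1] commands the driver layer expects.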
The learning skill modules are controlled by small multilayer perceptrons (MLP) which are trained by a simple genetic algorithm [80] using tournament selection and speciation. We chose this solution for its simplicity, exploratory properties and good performance with delayed rewards; however, other learning algorithms could be used. EvoAgents also includes Evolutionary Programming [81], which uses a genetic algorithm to evolve programs using a tree representation comparable to S-Expressions. These Genetic Programs (GP) can only provide one output; for skills requiring multiple outputs we altered the genetic algorithm to consider multiple GPs as a single individual, which we refer to as a Genetic Program Forest (GPF). Fig. 15 illustrates how MLPs and GPs are altered by the crossover operation of the genetic algorithm. Algorithm 1 details the genome evaluation process for an individual skill.
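As a simplified sketch of the forest idea: genomes are shown here as flat weight vectors with one-point crossover, which is a stand-in for the operations applied to the actual MLP and GP genomes in Fig. 15, not the Encog-based implementation.

```python
import random

def crossover_weights(parent_a, parent_b, rng):
    """One-point crossover on a flat weight vector (simplified stand-in
    for the genome crossover illustrated in Fig. 15)."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def crossover_forest(forest_a, forest_b, rng):
    """A Genetic Program Forest is treated as a single individual:
    sketched here as applying crossover output by output."""
    return [crossover_weights(a, b, rng)
            for a, b in zip(forest_a, forest_b)]

rng = random.Random(42)
child = crossover_forest([[0.0] * 5, [0.0] * 5],
                         [[1.0] * 5, [1.0] * 5], rng)
```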

Fig. 15. Examples of crossover operations for neural networks and evolutionary programs. In bright red: the crossover point or its equivalent.


Algorithm 1. Genome evaluation process for an individual skill.

Hierarchies are trained using a curriculum learning technique, which suits the developmental context and the hierarchical nature of MIND: new skills of increasing complexity are added through new “lessons” aiming at more complex goals.
Curriculum learning [82] refers to progressive learning methods intended to simplify complex learning problems through decomposition into simpler ones. These methods are used in a number of domains, from function approximation using neural networks to robotics and video games [83]. Curriculum learning subdivides a learning task into different but complementary subtasks (or source tasks) to be learned in a given order, generally of increasing complexity.
For our purpose, a curriculum consists of a series of tasks, each corresponding to a skill of the hierarchy to train, ordered from base skills to master skill. The curriculum is handcrafted; each task uses a simple environment and a set of reward functions, which both simplifies the process of designing learning environments and reduces the cost of supervision during training. The curriculum guides the development of the agent through the key skills to acquire and, as with the separation into different skill modules, each new task focuses on the additional complexity of the new environment and corresponding goals.
Fig. 16 shows how a hierarchy is initially trained using a curriculum: low-level skills are trained as separate tasks, each skill playing the role of master skill during its training episode and covering all the expected behaviour of the task. Higher level skills are then trained to use low-level skills on a corresponding task of higher complexity. Learning from low-level to high-level skills is the general rule; the actual order of the tasks only depends on the required subskills being available. For instance, in the example given in Fig. 16, the Go To Object + Avoid skill could be trained before the Go To DropZone skill, as both its subskills are already trained.
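The ordering constraint described above can be sketched as a simple training loop; the task dictionaries, the `train_skill` callback and the skill names are illustrative, the only real rule being that a task may only run once its subskills are trained.

```python
def train_hierarchy(curriculum, train_skill):
    """Sketch of curriculum-driven training: each task trains one skill,
    which acts as master skill for that episode; skills trained earlier
    are frozen and reused as subskills."""
    trained = {}
    for task in curriculum:  # ordered so that subskills come first
        missing = [s for s in task["subskills"] if s not in trained]
        assert not missing, f"task {task['skill']} needs {missing}"
        trained[task["skill"]] = train_skill(task, trained)
    return trained

# Illustrative three-task curriculum; the trainer is stubbed out.
curriculum = [
    {"skill": "Avoid", "subskills": []},
    {"skill": "GoToObject", "subskills": []},
    {"skill": "GoToObject+Avoid", "subskills": ["Avoid", "GoToObject"]},
]
skills = train_hierarchy(curriculum, lambda task, done: object())
```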

Fig. 16. Using a curriculum to build a MIND hierarchy composed of 6 skills. At each step one skill is trained using one task of the curriculum, the figure shows from left to right steps 1, 4 and 6. Lower level skills are trained first, at each step the learning skill has the role of master skill.

True to the spirit of lifelong development of artificial agents, the following experiments will make use of the skills and skill hierarchies acquired during the previous series of experiments [11]. Fig. 7 in section 4.1 shows the skills already available which will not need to be trained again.

5.2. Scenario 1: Target variable

This scenario demonstrates the use of variables for the centralization and exchange of information between skills. This example of application intends to show how the result of a computation made by one (or several) skill can be stored and made available as input for others, offering the possibility of forming in-line or branching information processing structures. The information exchanged can be the intermediate analysis of perceptions or the result of a more complex decision process. It is identified as a concept around which skill structures can be organized.
In this experiment we generalize navigation behaviours by reorganizing the different navigation skills of the Collect scenario around the concept of target. The master skill selects the relevant target and sets a variable with its orientation information. In turn this variable is provided as input to a dedicated navigation skill.

5.2.1. Protocol

The goal is to learn the collect behaviour of the previous experiments, described in Section 4.1. The collect task consists of picking up an object and bringing it back to a drop zone while avoiding collisions.
Hierarchy: The hierarchy is a modified version of the original Collect hierarchy shown in Fig. 7: the GoToObject and GoToDropZone skills are replaced by GoToTarget.
Our lowest level complex skills are:
  • 1.
    Avoid: Move while avoiding obstacles. Its inputs are the 10 proximity sensors. This skill is reused from previous experiments and did not require training.
  • 2.
    GoToTarget: Similar to GoToObject and GoToDropZone, the orientation to the target is given by a variable (Target in Fig. 17) instead of the object or drop zone sensor.

    Fig. 17. Collect hierarchy using a variable for the target.

GoToTarget + Avoid combines Avoid and GoToTarget, using the proximity sensors to navigate towards the target while avoiding collisions in an environment with obstacles. It replaces the GoToObject + Avoid and GoToDropZone + Avoid skills of the original Collect hierarchy (Fig. 7).
Finally, the CollectVariable skill combines GoToTarget + Avoid and the base skill ClawControl. It also outputs to the Target variable. Its inputs are the object and the drop zone presence sensors, to decide what must be done, but also both orientation sensors for the object and the drop zone to set the appropriate value for the Target variable.
The original Collect skill simply needs to delegate to the appropriate subskill, deciding “who” is in charge. Each of its subskills has access, via sensors, to the information needed to accomplish its subtask: the object sensor for one skill and the drop zone sensor for the other. The CollectVariable skill must instead relay through the variable the information needed to determine how to accomplish the task, i.e. the orientation of either the object or the drop zone, by identifying the appropriate target (object or drop zone).
Training environment: The initial training environments are similar to those of the original collect experiments. For instance, the environment for the master skill contains an object to collect, a zone in which to deposit the object and obstacles to avoid. The agent is given a limited time during which it can increase its score. A small reward is given for picking up the object, and a large reward for depositing the object in the zone. A collision ends the evaluation.
The original setup places a single object in the environment, which is always tracked by the agent's sensor. When the agent succeeds in collecting the object, the object is placed at random in the environment so that the task can be repeated. We found that this configuration can be exploited by the Influence mechanism to simplify the learning task. When the agent brings an object to the drop zone, the object tracked by the orientation sensor is held in the frontal claw. Under these conditions the orientation sensor always reports the object as being in front, and thus the output given by GoToObject is to move straight forward. Using vector composition, the GoToObject behaviour could therefore remain active while the agent brings the object back to the drop zone without any adverse effect on the trajectory, simply adding a constant forward movement.
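This exploit can be illustrated with a minimal sketch of Influence-weighted command composition (function names, command tuples and weights are illustrative, not taken from the EvoAgents platform):

```python
def compose(commands, influences):
    """Combine subskill motor commands (forward, turn) as an
    Influence-weighted vector sum."""
    total = sum(influences)
    if total == 0:
        return (0.0, 0.0)
    fwd = sum(w * c[0] for w, c in zip(influences, commands)) / total
    turn = sum(w * c[1] for w, c in zip(influences, commands)) / total
    return (fwd, turn)

# While the object sits in the frontal claw, the orientation sensor reports
# it straight ahead, so GoToObject emits constant forward motion; composed
# with GoToDropZone it only adds forward movement, leaving the trajectory
# toward the drop zone unchanged.
go_to_object = (1.0, 0.0)     # forward, no turn
go_to_dropzone = (0.5, 0.3)   # forward with a turn toward the zone
combined = compose([go_to_object, go_to_dropzone], [1.0, 1.0])
```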
To make sure the agent learns a strict exclusion between the different subtasks of the collect behaviour, it must not receive an input that always points forward from its object orientation sensor when carrying an object, so that it is forced to ignore this information in favour of the orientation of the drop zone. The experimental setup was thus altered in two ways:
  • Multiple target objects are present in the environment. The object orientation sensor will give the orientation of the closest one.
  • An object carried in the claw does not register on orientation sensors.
Skills internal functions: The default internal function for skills remains the Multilayer Perceptron. However, in this scenario one of the skills uses Genetic Programming to evolve its internal function. We will see in the results section why this type of controller is better suited for a particular task.
Shaping variable dependent skills and the need for a drive module:
Following the curriculum approach, lower level skills are trained first, and higher levels are progressively built on top of the previous ones, as illustrated in Fig. 16 (Section 5.1). In the CollectVariable hierarchy (Fig. 17) the lower level skill GoToTarget depends on a variable whose value is set by the higher level skill CollectVariable (Fig. 18). To shape GoToTarget we provide a value to the variable, the orientation of a target object, and train it with the GoToObject environments and its set of rewards.

Fig. 18. From left to right: The problem of the Target variable, the scaffolding solution using a programmed master skill and the drive module solution accessing the Target variable.

Our initial solution, inspired by the context retraining methods developed in [11], uses a master skill programmed to copy the value of the object orientation sensor to the Target variable and to activate GoToTarget by sending it a constant maximum Influence. However, creating such “scaffolding” skills quickly proved to be cumbersome and highlighted the need for direct control of higher level functions, either for learning or for setting a goal during exploitation, motivating the development of the drive module described in Section 4.4.

5.2.2. Results and analysis

Videos of the results are available [84], [85]. For quantitative analysis of the training process, 10 replications with random initial seeds were done for each skill. Fig. B.27 in Appendix B shows the evolution of the average score during training.
Using neural networks: In a single-object configuration of the environment the behaviour is similar to the original hierarchy (Fig. 19). Fig. 20 shows the behaviour in the alternate setup with multiple objects: each row shows the collection of an object; on the left the agent is reaching the object and on the right, the drop zone.

Fig. 19. Trajectory of the Collect Variable skill collecting a single object.


Fig. 20. Trajectory of the Collect Variable skill collecting 3 objects (neural network variant).

When the agent brings back the second object, we can see a loop on the right side of the screen. This is due to the sensory information given by the third object interfering with the orientation to the drop zone. Even if the agent does succeed in collecting several objects, it is clear from observation that this behaviour is far from efficient.
We suspect that this is a limitation of the combined use of neural networks and genetic algorithms on this specific task. The skill must perform a strict exclusion between setting the Target variable to either the orientation of the object or that of the drop zone. In addition, this exclusive decision has to let one of the input signals (orientation of the object or orientation of the drop zone) through the network without distortion, which is difficult for a neural network. Fig. A.26 in Appendix A shows that an optimal configuration exists; however, it is very specific and its neighbouring solutions have very poor performance, which does not suit our learning algorithm. The configurations we were able to learn are local optima, far from the global optimal solution in the configuration space.
We experimented with various topologies and training methods, which slightly improved performance, but specific combinations of inputs still cause erratic behaviour. In all instances we were unable to consistently train a network over several replications.
Using genetic programs: The encapsulation of controllers into skill modules enables MIND agents to combine heterogeneous sets of controllers within the same hierarchy. It is thus possible to select a control function more suited to this task, without losing any of the neural network based skills accumulated so far.
The problem of strict exclusion when selecting the appropriate sensor can be solved by a simple program that passes one orientation or the other through unchanged.
Evolutionary Programming can produce such a program, using the same training environment and reward functions. When evolving a single Genetic Program (GP) to set the value of the variable, optimal programs can be found in less than a few dozen generations. Two example programs were evolved, both exhibiting optimal behaviour as shown in Fig. 21.
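As a hedged illustration of the kind of program meant here (the function name and the carrying flag are hypothetical, not the evolved listings themselves):

```python
def set_target(object_orientation, dropzone_orientation, carrying_object):
    """Strict exclusion: pass exactly one orientation through, undistorted.

    carrying_object is a hypothetical boolean derived from the claw and
    presence sensors; the orientations come from the two orientation
    sensors feeding the CollectVariable skill."""
    if carrying_object:
        return dropzone_orientation   # head for the drop zone
    return object_orientation         # head for the nearest object
```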

Fig. 21. The optimal trajectory of the Collect Variable skill using a function evolved via genetic programming.

Simple GPs can only provide a single output. To generalize its application to skill modules we modified the GP algorithm to evolve a GP for each output, referred to as a Genetic Program Forest (GPF). The evolutionary algorithm treats this collection of GPs as a single individual.
The optimal behaviour for CollectVariable is a combination of the previously described program for the variable output and any non-null Influence values for the subskills GoToTarget + Avoid and Claw Control. Solving this task with a GPF adds only little complexity to the search space.
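A structural sketch of the GPF idea, assuming plain callables in place of real GP trees (the class and method names are ours, not the platform's):

```python
import random

class GPForest:
    """A Genetic Program Forest: one program per skill output, treated as
    a single individual by the evolutionary algorithm."""

    def __init__(self, trees):
        self.trees = trees  # one program per output channel

    def evaluate(self, inputs):
        # All outputs are produced jointly from the same input vector.
        return [tree(inputs) for tree in self.trees]

    def mutated(self, mutate_tree):
        # Mutation alters one randomly chosen output program; fitness is
        # still assigned to the forest as a whole.
        i = random.randrange(len(self.trees))
        trees = list(self.trees)
        trees[i] = mutate_tree(trees[i])
        return GPForest(trees)

# A two-output forest: one program sets the Target variable, the other an
# Influence value for a subskill (sensor names purely illustrative).
forest = GPForest([
    lambda s: s["dropzone"] if s["carrying"] else s["object"],  # Target
    lambda s: 1.0,                                              # Influence
])
out = forest.evaluate({"object": 0.4, "dropzone": -0.2, "carrying": True})
```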

5.3. Scenario 2: Counter variable

This scenario presents the use of variables as a memory and internal representation system to handle counting and memorizing events. Perceiving numbers and quantities is one of the most basic perceptual skills of humans and animals, and constitutes a suitable case study in the gradual development of abstract reasoning skills [86]. Embodied developmental systems have been used to implement psychological models of association between abstract representation and motor skills, for number comparison tasks and incremental counting of objects [26], [27], the latter task involving Recurrent Neural Networks.
Although we do not intend to implement any specific psychological model, our approach shares a similar experimental context: the goal for the agent is to learn a representation through physical interaction, which enables it to perform an incremental count using our variable mechanism.
In this experiment, the final desired behaviour of the agent is to perform a sequence of actions, without any indication in the environment of which step of the sequence it has already accomplished. By memorizing the current step of the sequence, we depart from purely reactive behaviour and begin to investigate emergent memory representations and the process of evolving low-level cognitive functions from the ground up.
To study the emergent aspect of representations, we use a protocol that does not allow the teaching entity direct access to the memory modules. The learning process remains a genetic algorithm dependent on a reward function determined solely by observing the behaviour of the agent in the simulation, and not the behaviour of its internal mechanism and states. Respecting the barrier of the interiority of the agent's mind allows the agent to form representations based on its individual experience.

5.3.1. Protocol

The agent must collect an object twice, then activate a validation trigger by entering a specific zone. This expresses, through a motor skill, that the agent has managed to count and memorize a given number of events. This sequence is repeated indefinitely (Collect-Collect-Validate-Collect-Collect-Validate...). A small reward is given for each object collected and a large reward is given for a full sequence. An incorrect sequence (3 objects collected) ends the simulation.
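The scoring logic of this protocol can be sketched as follows; the reward magnitudes are placeholders, not the values used in the experiments:

```python
def step_reward(state, event):
    """Scoring sketch for the Collect-Collect-Validate sequence.

    state["count"] holds the number of objects collected since the last
    validation; event is "collect" or "validate".
    Returns (reward, done); done ends the evaluation."""
    SMALL, LARGE = 1.0, 10.0        # placeholder reward magnitudes
    if event == "collect":
        state["count"] += 1
        if state["count"] > 2:      # third object: incorrect sequence
            return 0.0, True        # ends the simulation
        return SMALL, False         # small reward per collected object
    if event == "validate":
        if state["count"] == 2:     # full sequence completed
            state["count"] = 0      # the sequence repeats indefinitely
            return LARGE, False
        return 0.0, False
    return 0.0, False
```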
Hierarchy: The hierarchy (Fig. 22) extends the original Collect hierarchy. The master skill Count Objects combines the Collect skill with Go To Valid + Avoid, which is a duplicate of the Go To DropZone + Avoid skill using a different orientation sensor to drive the agent to the validation zone while avoiding obstacles.

Fig. 22. The Count Objects hierarchy.

Count Objects uses the object, drop zone and validation zone presence sensors. The Count variable is used as an input, to choose between the collect and the validation task, and as an output, to increment and reset the sequence count.
Training environment and skills internal functions: All skills use simple multilayer perceptrons. The environment contains the drop zone, the validation zone, obstacles and a single target object.

5.3.2. Results and analysis

Videos of the results are available at [87]. Fig. 23, Fig. 24 show the resulting behaviour and the state of the MIND hierarchy over 5 steps of the process.
  • 1.
    The agent brings the first object to the drop zone, VAR_COUNT is at 0.
  • 2.
    The agent just dropped the first object, VAR_COUNT is incremented by 0.1.
  • 3.
    The agent brings the second object to the drop zone.
  • 4.
    The agent just dropped the second object, VAR_COUNT is incremented by 0.1, bringing it to 0.2. The agent is now headed to the validation zone.
  • 5.
    The agent just reached the validation zone, VAR_COUNT is reset to 0. The agent is headed for the object.
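The counter behaviour read off this trace can be summarized in a small sketch (the 0.1 increments and the 0.2 threshold are the values learned in this particular run; the helper functions are ours):

```python
def update_count(var_count, event):
    """Counter update as learned in this run: 0 -> 0.1 -> 0.2 -> reset.
    (Another run learned an offset representation starting at 0.1 and
    resetting to 0.1; both support the same behaviour.)"""
    if event == "dropped_object":
        return round(var_count + 0.1, 1)   # increment by 0.1
    if event == "validated":
        return 0.0                         # reset at the validation zone
    return var_count

def choose_task(var_count):
    """Count variable as input: collect until "enough" (0.2), then validate."""
    return "validate" if var_count >= 0.2 else "collect"
```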

Fig. 23. Step 1-3 of the Count Objects behaviour. On the right the simulation, on the left the state of the MIND hierarchy.


Fig. 24. Step 4-5 of the Count Objects behaviour. On the right the simulation, on the left the state of the MIND hierarchy.

The agent was able to learn to count through exposure to a (simulated) real world problem. The learning supervision did not have access to the internal state of the agent, nor did it teach a predefined symbol or memory representation. Using MIND, our agent is able to go beyond simple reactive behaviour and base its decision process on an internal state independent of the environment, which persists over time. In turn, the decision process is able to affect the internal state.
The relation between internal states and decisions, and the effect of decisions on internal states are learned by the agent. Hence, the meaningful values of the internal states, the non-symbolic representations or “prototypes” (see Section 3.3), are formed in an emergent process and grounded in experience.
In the results shown in Fig. 23, Fig. 24, the agent learns to count from 0, successively increments until reaching 0.2, and resets the value to 0 when starting the sequence over. In other attempts at learning the same behaviour, the agent starts at a value of 0.1, increments to 0.3 and resets the counter to the value of 0.1. This can be paralleled with the formation of different semiotic networks within a population of agents that are nevertheless able to function together [10]: both representations are able to support the same behaviour. It is interesting to note the emergent aspect of these representations; the formation of one or the other is due to the genetic process and the bias induced by the random initialization of the networks.
The goal of this experiment was to collect a fixed number of objects. This constant (viz. 2) was given through the training process and reward functions, as part of the expected behaviour. Further experiments could investigate the collection of a variable number of objects, with the desired count provided as an input from the environment through a sensor, or as a request from an operator through the drive module.
The training would involve creating and interpreting representations that can match the different numbers of requested objects and deliver them (in effect, a task similar to the number comparison performed in the iCub experiment [26]). Moreover, we could investigate complex uses of counter variables such as the combination of several variables to form a multidimensional concept space. For instance, with counter variables limited to 11 distinct values (in the interval [0,10]), could it be possible to learn to count beyond 10 by combining two counter variables? Could we train a behaviour to count to 20 using a representation of 2 times 10? Or even to a hundred by using the two dimensions to represent the units and tens digits?
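The closing question can be made concrete with simple positional arithmetic (an illustration of the idea, not a trained behaviour):

```python
def combined_count(units, tens):
    """Combine two counter variables, each limited to 11 values in [0, 10],
    into one count by treating them as positional digits."""
    assert 0 <= units <= 10 and 0 <= tens <= 10
    return tens * 10 + units

def increment(units, tens):
    """Advance the combined count by one, carrying from units into tens."""
    units += 1
    if units > 9:              # carry: units wraps around, tens advances
        units, tens = 0, tens + 1
    return units, tens
```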

5.4. Assessment of variable module use in MIND

The CountObject behaviour did not require any alteration of the training methods used in previous experiments. The only real challenge in training hierarchies using variable modules was in scaffolding the behaviour for the CollectTarget task, and much of this process was streamlined by the introduction of the drive module. CollectTarget is also the first case where a simple multilayer perceptron was not suited as a skill internal function: it could not be trained to provide the specific values, or representations of information, required as output. Because of our choice of scaffolding method, the representation expected in the Target variable by GoToTarget is a direct copy of the value of a sensor. Other representations might exist, which could be learned by a multilayer perceptron. In any case, MIND offers the possibility of combining different types of skill functions, which completely bypassed this issue.
The final behaviour of each experiment could be learned without using variables: CollectTarget is functionally identical to its reactive counterpart, and since CountObjects is the only skill using the Count variable, it could probably be trained using an internal function containing its own memory system within the skill module (such as an RNN [27]). Nonetheless, the success of these experiments shows that MIND can use variables to learn such behaviours just as well. The actual benefit of the variable module resides in its potential in a lifelong development context, and the further acquisition of skills based on previously acquired representations.
In the GoToTarget experiment, a navigation skill which fits most cases is learned once and for all, independently of what could be considered a target later on. Indeed, any new navigation request will simply provide the required information through the Target variable using the appropriate representation, regardless of the kind of new objects or positions to consider, and the corresponding new sensors or sources of information. Furthermore, a “target” no longer has to be a direct perception of an element of the environment, and can be a position chosen by calculations and decisions based on multiple sources of information. For instance, the position of both flags forming a gate in a slalom race can be combined to give an orientation guiding the agent through the gate, while in a different context the position of goal posts and defending opponents could be combined. Learning generalized behaviour favours re-use of skills, which suits the philosophy of agent development, and from a designer's perspective, helps in establishing a robust library of skills.
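The slalom-gate example could be sketched as a small helper feeding the Target variable (a hypothetical function, assuming egocentric (x, y) flag positions):

```python
import math

def gate_heading(flag_a, flag_b):
    """Bearing toward the midpoint of two flag positions given in the
    agent's egocentric frame; the result could be written to the Target
    variable to guide the agent through the gate, regardless of which
    sensors produced the flag positions."""
    mid_x = (flag_a[0] + flag_b[0]) / 2.0
    mid_y = (flag_a[1] + flag_b[1]) / 2.0
    return math.atan2(mid_y, mid_x)
```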
In the case of CountObject, the information about the current step of the process can be observed and accessed by other skills (or human observers). When extending the agent's behaviour, this information could be used by other skills in planning and optimizing decisions, for instance in answering questions such as “Am I almost done with a rewarding sequence? Should I complete my task or abandon it to recharge batteries?”. It may be the case that further development of this agent would call for multiagent cooperation to accomplish the sequence task. In this case, agents would signal others when an object is collected so that each of them can increase their own count variable. Fig. 25 shows how the CountObject hierarchy could be extended for this task.

Fig. 25. A possible extension of the Count Objects hierarchy for multiagent coordination.

6. Discussion

6.1. Analysis

The purpose of the variable module for MIND is to allow the circulation of information in hierarchies and act as a general purpose memory system. Its integration is achieved by following key aspects of the architecture: the modularity providing flexibility, identifiability and stability; the signal approach of the Influence mechanism providing a general purpose non-symbolic representation of information; and its integration with skills and the learning of procedural knowledge to support the developmental process. The experiments with variable modules demonstrate different aspects of this contribution: how an independent module carrying information can be used and its effects on the structuration of hierarchies, and how representations can be learned and memorized to develop beyond reactive behaviour.
We have seen how variable modules can be used by skills to store, retrieve and share information, and how it affects the internal organization and structuration of hierarchies. We have shown how a higher level skill can share information with a lower level skill, which can now function as parameterized skills [23]. It is also possible for a lower level skill to send information, such as feedback on the result of its actions, to a higher level skill and affect its decision process, thus allowing the formation of hierarchies similar to Robot Shaping [24]. Used as a buffer for information, variable modules allow centralizing and redistributing intermediate analysis of the environment or the results of decision processes identified as concepts, which can be addressed by the variable's name. Compared to other connectionist approaches previously discussed, such as layered learning [21] or modular neural network policies [22], the use of variables as interface between controllers follows the same principle of lifelong development which guides the acquisition of skills: the concepts represented can become elements for the future creation of behaviours of greater complexity, whose purpose cannot be anticipated.
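The buffer role of a variable module, addressable by name and usable on both the input and output side of skills, can be sketched structurally (this mirrors the description above, not the actual MIND interface):

```python
class VariableModule:
    """A named buffer usable like a sensor (read) and an actuator (write),
    so any skill in the hierarchy can address the concept by name."""

    def __init__(self, name, initial=0.0):
        self.name = name
        self.value = initial

    def read(self):               # skill input, like a sensor sample
        return self.value

    def write(self, value):       # skill output, like an actuator command
        self.value = value

variables = {"Target": VariableModule("Target"),
             "Count": VariableModule("Count")}
variables["Target"].write(0.4)        # higher level skill sets the goal
heading = variables["Target"].read()  # lower level skill consumes it
```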
The generalization of behaviours is one of the benefits of sharing information between skills, as illustrated with navigation in Section 5.2. The hierarchy used in the example of the Target variable is a very simple one, but variable modules can be used like any sensor or actuator, which means several skills in the hierarchy can use the same variable as input and send concurrent commands to the same variable as output. For instance, skills such as GoToTarget and FleeFromTarget would use the same target variable for different purposes. Since the Target variable represents a heading, one could imagine a version of Avoid setting the heading of the agent to avoid collisions being concurrent with a GoToObject skill on the Target variable. Generalization also reduces the number of skills, which is beneficial from a designer's standpoint for organizing and maintaining libraries of skills, and reduces the memory requirements for applications in embedded systems.
Representing information in identifiable modules favours the explainability of global behaviour and supports the creation of systems for human control and interactions. We have seen in Section 5.2 how these modules can be manipulated through the drive module for training purposes, and Appendix C shows how it can be used for adjustable and semi-autonomous operation, where a human operator can set alternative goals for the agent through the manipulation of the Target variable. The idea behind this approach is to enable control of the agent through the manipulation of Influence signals and variable values. This is a subtle difference between asking the agent to “collect objects” and giving the agent a need for objects, the latter does not constrain the agent regarding how the objects are obtained. This approach is integrated and fits our future plans for the autonomous acquisition of a large collection of skills: several ways of accomplishing the task may be considered and, given an intrinsic motivational system [14], the agent may even evaluate the need for the acquisition of new skills to satisfy its drives.
Sensor modules already provided a low-level memory system, a history of past samplings, but with the variable module, any form of processed information can be memorized offering the agent the ability to commit to higher level behaviours. Since this information is shared by skills with the ability to learn on both sides of the process (storing and retrieving), the meaning of a variable's values has to be determined through an emergent process. The example given in Section 5.3 shows the emergence of the meaning of “enough” by subdividing the values of a variable into classes whose bounds are grounded in experience, and that different representations can emerge for the same meaning.
However, the scenarios we presented only constitute limited experimentations in learning grounded representations, as we bypassed the difficulty in converging on an acceptable representation shared by two separate skills. In the first experiment, the representation used for the Target variable shared by the two different skills cannot be considered as autonomously acquired. The receiving skill GoToTarget was first shaped using a direct copy of the orientation sensor as a representation, the CollectVariable skill then adapted to provide the representation expected by GoToTarget. The second experiment was intended, and did succeed in investigating emergent representations along with memorization, however the count variable is used by a single skill which has the opportunity to decide both how to represent information and how to interpret this representation.
Linking new skills to an already established shared representation should not cause any particular issue: the new skill will conform to the given representation, as was the case in the first experiment. This is comparable to symbol grounding in a population of agents, where new agents introduced into an existing population adapt to the pre-existing vocabulary [48], following the path of least resistance. Acquiring the original shared representation is a more delicate matter: if we are to compare the problem to symbol grounding again, the language game between agents [10] is a back and forth process between two learning entities. However, both sides of this interaction are identical in function and do not have assigned roles or areas of responsibility as skills do. Some clues may be found in coevolution [62], where multiple skills composing a global behaviour are learned together, sharing the same task and training episode, although in this work the responsibility of each skill (assigning what sub-behaviour each skill is in charge of) is given by the designer in the form of a decision tree. We pointed out [11] that assigning each skill its area of responsibility is part of the function of curriculum learning, creating a bias in the distribution which can be exploited by a coevolution mechanism.
A simple solution would be to initialize representations using a single skill, as with the Count hierarchy, and link other skills after a useful representation has started to form. Tasks dedicated to the initialization of a representation system can be included in curriculums, comparable to children's learning through play. From a more technical standpoint, this approach is comparable to auto-encoders forming their representation through a process having no practical purpose, the decoding part being then discarded to train a useful behaviour on the basis of this representation, an idea that inspired modular neural network policies [22].

6.2. Limits

Our platform implementing MIND is limited by the few control functions available at this time, and no attempt was made so far to implement available state-of-the-art learning techniques for training skills. Further iterations of the EvoAgents platform will integrate such components, as needed or per request, without the need for any change to the process described in this article. Indeed, over the years multiple ‘off-the-shelf’ software solutions and algorithms have been integrated, from simulation environments to machine learning libraries, while the process of building behaviour hierarchies using a curriculum remains the same.
The major concern remaining for the implementation is in bridging the gap between the skill learning techniques used and the curriculum established by the designer. While the curriculum approach is acceptable for high-level human guidance in development, a number of meta learning algorithms are needed to increase the autonomy and streamline the process of building hierarchies. These algorithms would perform such tasks as reviewing existing skills to determine if the given task requires a completely new skill or if it can be learned by duplicating an existing skill and optimizing it. They would also evaluate and select, through sensorimotor babbling for instance, the relevant input and output to use in creating a new skill. These two examples of meta learning algorithms face new challenges with the addition of the variable module: which existing variables should be part of the inputs and outputs of a new skill, and should an established concept be reused or a new variable created?
Finally, there remain concerns over the ability of MIND and the Influence mechanism to handle deep hierarchies, and over the accuracy requirements for skills to avoid noisy motor output. As variable modules are also recipients for skill commands, this effect might manifest itself there as well. The Target Variable scenario (Section 5.2) is an example of an application requiring precise values (even though these requirements were induced by our training method, as discussed before). This issue is closely related to the effect of Influence on a simple variable discussed in Section 4.3.1; should it become necessary to address it, we would follow up on the proposed solutions, using approaches inspired by signal processing and robust training methods involving the drive module.

7. Conclusions and future works

In this article, we identified the desirable properties of a system supporting internal representations for the lifelong development of agents, out of a wide range of memory systems and representation mechanisms, from low-level memory to symbols. Adhering to the developmental approach of reaching complex representations by building upon simpler ones, we proposed a mechanism for general purpose non-symbolic representation, discussed its place within the lifelong development process, and formulated a proposition for its integration within an architecture. We then explained the principles of MIND, the architecture for lifelong development used as a testbed, and summed up the previous experimental results on reactive behaviours, before introducing the variable module with several of its possible implementations and applications. Through experiments, we have demonstrated the use of variable modules. With skills able to share information, we have shown alternative ways to structure hierarchies, which generalize behaviour through shared concepts (Section 5.2), and that the close coupling with the skills and learning process favours the emergence of non-symbolic representations (Section 5.3). By providing a mechanism to manage internal representations, we addressed one of the limitations of the architecture and allow the development of agents to continue beyond simple reactive behaviour (Section 5.3).
Beyond the validation of our proposition, the positive experimental results are an encouraging first step in the investigation of emergent memory representations. We are confident it makes an excellent testbed for new research on evolving low-level cognitive functions from the ground up, and its connectionist and developmental approach gives a promising alternative to symbolic Artificial Intelligence. Specifically, the mechanism of encoding high dimensional states or intermediate elements of the working memory into a highly compressed representation which in turn can be processed into higher level representations offers a grounded and emergent way to reach symbolic representation, suitable to general purpose Artificial Intelligence, and high-level reasoning.
In the formation of such general-purpose long-term memory elements, social interaction will certainly play an important role. Individual experience and its synthesis will lead to the generation of many symbols. Through the exchange of such symbols in a community bound to encounter similar experiences, their validity can be cross-checked; the symbols can be refined or rejected, merged or simplified, and their expression standardized. This process constitutes a vastly distributed experimental machine in which a concept can undergo much more testing and refinement, under a much wider range of conditions, than would be possible in the lifetime of a single agent. Symbols surviving this process are adopted by the majority as a culture and form the basis of communication.
Keeping in mind the relation between emergent representations and social interactions, we want to focus our efforts on the development of multiagent behaviours. Our next immediate objective is the study of developmental agents relying on internal states to learn cooperative tasks in a society of agents. Based on our work with models of social specialization, which use reinforcement to form an internal representation [88], [70], we will investigate the possibility of learning the reinforcement behaviour itself: in addition to learning how to find the appropriate roles and their distribution within a group at the species level, each agent will use this mechanism to learn its own role within a society during its lifetime.
Additional experiments need to be conducted on the use of variables, such as increasing the number of variables involved in a behaviour and the complexity of the hierarchies and of the tasks to perform. The process of converting perceptual information into representations leads to perceptually grounded symbols, as explained by Steels [10]; but with the possibilities of structuration offered by variables, what about the conversion of deliberative and behavioural processes into representations? To use once more the image of psychological “drives”, skills could output information on their work process simulating “satisfaction” or “frustration” to inform motivational systems and meta-behaviours. Using a dual internal function associating the controller with a predictor, skills could output their “surprise” [16] at unexpected results. Such meta information has many potential applications, from autonomous learning through intrinsic motivation [14], [16] to multiagent coordination [54].
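The dual controller/predictor coupling evoked above could take a minimal form such as the following sketch, where a skill emits its command together with a “surprise” signal measured as the mismatch between its forward model's prediction and the observed state. All names here (`PredictiveSkill`, `step`) are hypothetical illustrations, not part of the MIND implementation.

```python
import math

class PredictiveSkill:
    """Sketch of a skill paired with a forward model (predictor).

    The controller maps a sensor state to a command; the predictor
    guesses the next sensor state. The mismatch between prediction
    and observation is emitted as a 'surprise' signal, usable as
    meta information by motivational systems or meta-behaviours.
    """

    def __init__(self, controller, predictor):
        self.controller = controller
        self.predictor = predictor
        self.expected = None  # prediction made at the previous step

    def step(self, state):
        command = self.controller(state)
        surprise = 0.0
        if self.expected is not None:
            # Euclidean distance between predicted and observed state.
            surprise = math.dist(self.expected, state)
        self.expected = self.predictor(state, command)
        return command, surprise
```

A motivational system could then treat a persistently high surprise as a cue for intrinsic reward or for triggering relearning.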
Scaffolding behaviour through the drive module is a step towards the integration of the training process within the control architecture, which so far was limited to the manipulation of skill hierarchies from without. The same approach could be used to increase the autonomy of development, by integrating processes such as IMAGINE [17] to generate new skills through intrinsically motivated goal discovery. We are particularly interested in the natural language (NL) interactions and descriptive social partner (SP) approach of IMAGINE, especially since it seems that direct human guidance can realistically be used in the role of the SP. By manipulating elements of the hierarchies, such an approach would help in identifying and naming procedures (skills), perceptions (sensors) and declarative knowledge (variables), and in exploiting permutations of elements in the language to explore new combinations of procedures and objects, considered as specific concepts through variables. In such a context, the specificity of MIND would introduce the idea of achieving simultaneous tasks into the natural language descriptions. For instance, a description of the goal reached by GoToObject+Avoid could be: “you fetched the object while avoiding collisions at the same time”.
In addition to the autonomous and emergent aspects of agent development, there is a real need to study the possibility of building control systems designed around variables for human control and interaction in adjustable, semi-autonomous operation contexts. Using the drive module to control variables, modify Influence, or even alter hierarchies and dynamically load libraries of skill modules would offer a very flexible control system for drones, UAVs or human-assistance robotic systems. In this context, lifelong development architectures would ease the addition of new skills and goals, and of new components and external agents, supporting the lifetime evolution of such complex projects.
Tangentially related to our work, the failure of neural networks to learn the function associated with the task in Section 5.2 raised questions about the generalization of the Influence mechanism to neural networks. In line with the No Free Lunch Theorem, neural networks have been considerably improved by making assumptions about the nature of the data they process; for instance, convolutional neural networks processing images use a topology which exploits the spatial proximity of pixels. The requirements for the function in Section 5.2 are very similar to the kind of problems our architecture was designed for: one higher-level decision inhibits or lets through lower-level signals without altering them. Since the inputs, the outputs and the Influence are all signals by nature, this structure could be generalized as a neural network topology with the addition of a simple operator dynamically adjusting the weights of connections. Naturally, the proposition of an Influence Neural Network would have to be accompanied by corresponding alterations to popular learning algorithms.
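The inhibit-or-pass-through operation at the heart of this idea can be sketched as a gating operator: a higher-level Influence value acts as a dynamic connection weight that either lets a lower-level signal through unaltered or suppresses it, without transforming the signal itself. This is an illustrative sketch of the principle, not a proposed network design.

```python
def influence_gate(signals, influences):
    """Sketch of the Influence idea as a network operation.

    Each lower-level signal is multiplied by a gating weight derived
    from a higher-level Influence value: an influence near 1 lets the
    signal through unaltered, near 0 inhibits it. The signal is only
    scaled by the gate, never otherwise transformed, mirroring the
    inhibit-or-pass-through behaviour described in the text.
    """
    gates = [min(max(i, 0.0), 1.0) for i in influences]  # clip to [0, 1]
    return [s * g for s, g in zip(signals, gates)]
```

For instance, `influence_gate([0.7, -0.3], [1.0, 0.0])` lets the first channel through unaltered and inhibits the second; a learning algorithm would then have to account for these dynamically gated connections.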

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work has been realized with the support of the High Performance Computing Platform HPC@LR, financed by the Occitanie / Pyrénées-Méditerranée Region, Montpellier Mediterranean Metropole and the University of Montpellier, France.

Appendix A. Handcrafted network for collectVariable


Fig. A.26. The optimal solution for the CollectVariable skill using a multilayer perceptron. Connections not shown have their weights set to 0. SENSOBJ gives the value 0.8 if an object is carried by the agent and 0.2 if not. This configuration is able to perform exclusion by shifting the excluded value of the orientation sensors to negative values and using a ReLU to bring the excluded value to 0, thus having no effect on the final sum. It is, however, very specific, and its neighbouring solutions have very poor performance, which is not suited to our learning algorithm.

Appendix B. Quantitative results for Target variable


Fig. B.27. Evolution of the fitness score during training for the skills of the Target variable experiments (Section 5.2) for 10 replications using random initial seeds. The Collect skill using Genetic Program Forests reaches its optimal score around 100 generations. The greater variation in scores for GoToTarget+Avoid is due to the greater importance of random elements in the training scenario (obstacle generation and random target placement).

Appendix C. Applications for the drive module

The drive module enables control of the agent through the manipulation of Influence signals and variable values, to set both exploitation goals and learning motivation. In its simplest form, it encapsulates a programmed procedure which has access to all the elements of a MIND hierarchy.
The drive module can be used in the context of learning, to simplify the process of shaping or scaffolding behaviour. From a development standpoint, this can be understood as creating drives for play. The arbitrary object of a game has no meaning to the agent on its own, but the skills learned from the pursuit of this goal can be transferred to other situations, once the agent learns to set its own goals. In future iterations of the architecture, the drive module could be used to autonomously guide the agent's development, placing it in a learning mode, hinting at corrections to existing behaviour, and generating artificial goals to pursue in order to acquire new skills, through mechanisms simulating intrinsic motivation [14].
Another example of an application to learning is improving the robustness of skill training. The drive module can be used to send parasitic commands and noise to actuators during learning, in order to deal with the sensitivity issues of the Influence mechanism; this answers the open question on the limits of MIND hierarchies previously discussed in Suro et al. [11].
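A minimal sketch of such parasitic noise injection follows; the uniform noise model, the amplitude, and the function name are assumptions for illustration, not the MIND implementation.

```python
import random

def noisy_command(command, noise_level=0.1, rng=random):
    """Perturb each actuator command with bounded uniform noise.

    Injecting such parasitic signals during training forces learned
    skills to tolerate small command perturbations, addressing the
    sensitivity of the Influence mechanism. The noise model and
    amplitude here are illustrative assumptions.
    """
    return [c + rng.uniform(-noise_level, noise_level) for c in command]
```

During training, a drive module would apply this perturbation between the hierarchy's output and the actuators, while evaluation still uses the clean commands.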
For exploitation purposes, it can be used to set mission objectives and parameters, altering behaviour through variables representing its drives and selecting the appropriate master skill by manipulating the Influence signal. It can also provide interactivity and offer an adjustable level of control, from fully autonomous to programmed or teleoperated, in a way that integrates with the principles of MIND. An example of application to adjustable autonomy is shown in Fig. C.28, where an agent performing a collect task can be given alternate drop zone coordinates in real time by a human operator. Human input is interpreted in a semi-autonomous fashion: the orientation of the alternate goal is given to the agent, but the avoidance of obstacles is still performed autonomously. If the alternate goal is removed by the operator, the agent resumes fully autonomous operation. A demonstration video of this behaviour is available [75]. Listing 1 shows the corresponding implementation of this drive module. Here, the operator input is given through a GUI element (l.6 & 12); other means could be used, such as network sockets or various I/O libraries for peripherals.
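The override pattern described above can be sketched as follows. The `hierarchy` accessors (`sensor`, `override`, `release`) and the class name are hypothetical illustrations, not the actual MIND API shown in Listing 1.

```python
class CollectDrive:
    """Sketch of a drive module for adjustable autonomy.

    Overrides the Target variable with an operator-supplied waypoint
    while an object is carried, and restores autonomous control when
    the waypoint is removed. The hierarchy accessors used here are
    hypothetical, not the actual MIND API.
    """

    def __init__(self, hierarchy):
        self.hierarchy = hierarchy

    def step(self, operator_waypoint):
        # Binary sensors use 0.8 for true and 0.2 for false.
        carrying = self.hierarchy.sensor("SENSOBJ") > 0.5
        if carrying and operator_waypoint is not None:
            # Semi-autonomous: the operator sets the goal orientation;
            # obstacle avoidance remains autonomous in the hierarchy.
            self.hierarchy.override("Target", operator_waypoint)
        else:
            # Fully autonomous: the hierarchy sets its own Target.
            self.hierarchy.release("Target")
```

Calling `step` on every control cycle keeps the override active only as long as both conditions hold, matching the resume-autonomy behaviour of Fig. C.28.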

Fig. C.28. An example of use for exploitation purposes: here the drive module takes operator input in real time and overrides variables to set alternate targets for delivering an object. The agent will only take operator input into account if it is carrying an object to deliver. If the waypoint is removed by the operator, the agent resumes its fully autonomous behaviour, setting its own values for variables through the hierarchy.


Listing 1. Abridged program for the drive module shown in Fig. C.28. L.12 computes the relative orientation of the operator marker given by the simulation viewer GUI. This value overrides the value of the target at l.13. Unlike the override procedure, l.17 & 18 show examples of sending a regular Influence command, which is integrated with the other commands exchanged in the hierarchy. Both skills receive a command of value 1.0, with a cumulative Influence of 1.0.

Data availability

As mentioned in the article, all research data and code are available in the following public repositories: https://gricad-gitlab.univ-grenoble-alpes.fr/surof/evoagents2020 or https://gite.lirmm.fr/suro/evoagents2020

References


1. For interpretation of the colours in the figure(s), the reader is referred to the web version of this article.
2. In the EvoAgents implementation of MIND, the maximum value N of a counter can be provided via the agent configuration file, and is set at instantiation. Its value increments in 1/N steps; the only limit on N is the limit of double precision.
3. Default increment threshold: 0.8. Default reset threshold: 0.2.
4. Values for binary sensors such as SENSOBJ are 0.2 for false and 0.8 for true.
5. The CollectVariable hierarchy was still under development at the time of this experiment; however, they are interchangeable.