
Three types of incremental learning

Received: 1 October 2021

Accepted: 18 October 2022

Published online: 5 December 2022


Gido M. van de Ven, Tinne Tuytelaars & Andreas S. Tolias

Abstract

Incrementally learning new information from a non-stationary stream of data, referred to as 'continual learning', is a key feature of natural intelligence, but a challenging problem for deep neural networks. In recent years, numerous deep learning methods for continual learning have been proposed, but comparing their performances is difficult due to the lack of a common framework. To help address this, we describe three fundamental types, or 'scenarios', of continual learning: task-incremental, domain-incremental and class-incremental learning. Each of these scenarios has its own set of challenges. To illustrate this, we provide a comprehensive empirical comparison of currently used continual learning strategies, by performing the Split MNIST and Split CIFAR-100 protocols according to each scenario. We demonstrate substantial differences between the three scenarios in terms of difficulty and in terms of the effectiveness of different strategies. The proposed categorization aims to structure the continual learning field, by forming a key foundation for clearly defining benchmark problems.


An important open problem in deep learning is enabling neural networks to incrementally learn from non-stationary streams of data. For example, when deep neural networks are trained on samples from a new task or data distribution, they tend to rapidly lose previously acquired capabilities, a phenomenon referred to as catastrophic forgetting. In stark contrast, humans and other animals are able to incrementally learn new skills without compromising those that were already learned. The field of continual learning, also referred to as lifelong learning, is devoted to closing the gap in incremental learning ability between natural and artificial intelligence. In recent years, this area of machine learning research has been rapidly expanding, fuelled by the potential utility of deploying continual learning algorithms for applications such as medical diagnosis, autonomous driving or predicting financial markets.

Despite its scope, continual learning research is relatively unstructured and the field lacks a shared framework. Because of an abundance of subtle, but often important, differences between evaluation protocols, systematic comparison between continual learning algorithms is challenging, even when papers use the same datasets. It is therefore not surprising that numerous continual learning methods claim to be

state-of-the-art. To help address this, here we describe a structured and intuitive framework for continual learning.
We put forward the view that, at the computational level, there are three fundamental types, or 'scenarios', of supervised continual learning. Informally, (a) in task-incremental learning, an algorithm must incrementally learn a set of clearly distinguishable tasks; (b) in domain-incremental learning, an algorithm must learn the same kind of problem but in different contexts; and (c) in class-incremental learning, an algorithm must incrementally learn to distinguish between a growing number of objects or classes. In this article, we formally define these three scenarios and point out different challenges associated with each one of them. We also review existing strategies for continual learning with deep neural networks and we provide a comprehensive, empirical comparison to test how suitable these different strategies are for each scenario.

Three continual learning scenarios

In classical machine learning, an algorithm has access to all training data at the same time. In continual learning, the data instead arrives in a sequence, or in a number of steps, and the underlying distribution of the data changes over time. In this article, we propose that, depending on how the aspect of the data that changes over time relates to the function or mapping that must be learned, there are three fundamental ways in which a supervised learning problem can be incremental (Table 1). Below, we start by describing the resulting three continual learning scenarios intuitively. After that we define them more formally: first in a restricted, 'academic' setting, before generalizing them to more flexible continual learning settings.

Intuitive descriptions and each scenario's challenges

The first continual learning scenario we refer to as 'task-incremental learning' (or Task-IL). This scenario is best described as the case where an algorithm must incrementally learn a set of distinct tasks (see refs. for examples from the literature). The defining characteristic of task-incremental learning is that it is always clear to the algorithm, also at test time, which task must be performed. In practice, this could mean that task identity is explicitly provided, or that the tasks are clearly distinguishable. In this scenario it is possible to train models with task-specific components (for example, a separate output layer per task), or even to have a completely separate network for each task to be learned. In this last case there is no forgetting at all. The challenge with task-incremental learning, therefore, is not, or should not be, to simply prevent catastrophic forgetting, but rather to find effective ways to share learned representations across tasks, to optimize the trade-off between performance and computational complexity and to use information learned in one task to improve performance on other tasks (that is, to achieve positive forward or even backward transfer between tasks). These are still open challenges. Real-world examples of task-incremental learning are learning to play different sports or different musical instruments, because typically it is always clear which sport or instrument should be played.
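The 'task-specific components' option can be made concrete with a small sketch. This is an illustrative toy model, not the paper's implementation: a fixed, randomly initialized shared feature extractor with one output head per task, where the task label provided at test time selects the head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared feature extractor (here just a fixed random projection, for illustration).
W_shared = rng.normal(size=(784, 128))

# One output head per task. With task identity given at test time, only the
# head of the current task is used, so the heads of earlier tasks can never
# be overwritten by training on later ones.
heads = {task_id: rng.normal(size=(128, 2)) for task_id in range(5)}

def predict(x, task_id):
    """Task-IL prediction: the task label selects the output head."""
    features = np.maximum(x @ W_shared, 0.0)  # shared ReLU features
    logits = features @ heads[task_id]        # task-specific head
    return int(np.argmax(logits))
```

Because an earlier task's head is never touched when a later task is trained, its behaviour cannot be overwritten; the remaining challenges are sharing the representation and transferring knowledge between heads.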
We call the second scenario 'domain-incremental learning' (or Domain-IL). In this scenario, the structure of the problem is always the same, but the context or input distribution changes (for example, there are domain shifts; see refs.). Similarly to task-incremental learning, this scenario can be described as one in which an algorithm must incrementally learn a set of 'tasks' (although now it might be more intuitive to think of them as 'domains'), but with the crucial difference that, at least at test time, the algorithm does not know to which task a sample belongs. However, identifying the task is not necessary, because each task has the same possible outputs (for example, the same classes are used in each task). Using task-specific components in this scenario is, however, only possible if an algorithm first identifies the task, but that is not necessarily the most efficient strategy. Preventing forgetting 'by design' is therefore not possible with domain-incremental learning, and alleviating catastrophic forgetting is still an important unsolved challenge. Examples of this scenario are incrementally learning to recognize objects under variable lighting conditions (for example, indoors versus outdoors) or learning to drive in different weather conditions.
The third continual learning scenario is 'class-incremental learning' (or Class-IL). This scenario is best described as the case where an algorithm must incrementally learn to discriminate between a growing number of objects or classes (for example, see refs.). An often used set-up for this scenario is that a sequence of classification-based tasks (although now it might be more intuitive to think of them as 'episodes') is encountered, whereby each task contains different classes and the algorithm must learn to distinguish between all classes. In this case, task identification is necessary to solve the problem, as it determines which possible classes the current sample might belong to. In other words, the algorithm should be able both to solve each individual task (that is, distinguish between classes within an episode) and to identify which task a sample belongs to (that is, distinguish between classes from different episodes). For example, an agent might first learn about cats and dogs, and later about cows and horses; while with task-incremental learning

Table 1 | Overview of the three continual learning scenarios

Scenario                    | Intuitive description                                  | Mapping to learn
Task-incremental learning   | Sequentially learn to solve a number of distinct tasks | f: X × C → Y
Domain-incremental learning | Learn to solve the same problem in different contexts  | f: X → Y
Class-incremental learning  | Discriminate between incrementally observed classes    | f: X → C × Y

Notation: X is the input space, Y is the within-context output space and C is the context space. In this article, the term 'context' refers to an underlying distribution from which observations are sampled. The context changes over time. In the continual learning literature, the term 'task' is often used in a way analogous to how the term 'context' is used here.
the agent would not be expected to distinguish between animals encountered in different episodes (for example, between cats and cows), with class-incremental learning this is required. An important challenge in this scenario is learning to discriminate between classes that are not observed together, which has turned out to be very challenging for deep neural networks, especially when storing examples of previously seen classes is not allowed.

Formalization in a restricted, 'academic' setting

To more formally define these three scenarios, we start by considering the simple, but frequently studied, continual learning setting in which a classification problem is split up into multiple parts or episodes that must be learned sequentially, with no overlap between the different episodes. In the continual learning literature, these episodes are often called tasks, but in this article we will refer to them as 'contexts'. The term task is problematic because in the literature it is used with several different meanings or connotations. From here on, we will use the term task only to refer to a context when it is always clear to the learning algorithm when a sample belongs to that context (as is the case with task-incremental learning).
In the 'academic continual learning setting' sketched above (that is, classification-based, non-overlapping contexts encountered sequentially), a clear distinction can be drawn between the three scenarios. To formalize this, we express each sample as consisting of three components: an input x, a within-context label y and a context label c. The three scenarios can then be defined based on how the function or mapping that must be learned relates to the context space C. With task-incremental learning, an algorithm is expected to learn a mapping of the form f: X × C → Y, with domain-incremental learning a mapping of the form f: X → Y must be learned and with class-incremental learning the shape of the mapping to be learned is f: X → C × Y. For class-incremental learning this mapping can also be written as f: X → G, with the 'global label space' G obtained by combining C and Y.
These definitions imply that the three scenarios can be distinguished based on whether at test time context identity information is known to the algorithm and, in case it is not, whether it must be inferred (Fig. 1). Each scenario thus specifies whether context labels are available during testing, but not necessarily whether they are available during training. With task- and class-incremental learning, it is often implicit that context labels are provided during training (for example, in the case of supervised learning), but with domain-incremental learning it is good practice to explicitly state whether context labels (or context boundaries) are provided during training.
To illustrate the continual learning scenarios with an example, Fig. 2 shows how Split MNIST, which is a popular toy problem for continual learning, can be performed according to each of the three scenarios. Further examples illustrating these scenarios with other context sequences are provided in Supplementary Note 1.
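In code, the three ways of labelling the same Split MNIST digit can be sketched as follows (plain functions on the digit label; illustrative only):

```python
# Split MNIST: digits 0-9 are split into five contexts of two digits each.
# Given a digit label, each scenario asks the learner a different question.

def context_of(digit):
    return digit // 2                  # context label c in {0, ..., 4}

def task_il_target(digit):
    # within-context label y; context identity is also given as input
    return context_of(digit), digit % 2

def domain_il_target(digit):
    # same two outputs in every context (first vs second digit of each pair,
    # i.e. even vs odd); context identity is neither given nor needed
    return digit % 2

def class_il_target(digit):
    # global label: the learner must separate all ten classes
    return digit
```

For the digit 5, the learner must answer '1' within context 2 (Task-IL), 'odd' (Domain-IL), or '5' out of all ten classes (Class-IL).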
Fig. 1 | Decision tree for the three continual learning scenarios. The scenarios can be defined based on whether at test time context identity is known and, if it is not, whether it must be inferred.
It might be unintuitive to distinguish domain- and class-incremental learning by whether context identity must be inferred, because with class-incremental learning context identification is often not explicitly performed, as typically a direct mapping is learned from the input space X to the set of global labels G. Another way to tell these two scenarios apart is by whether different contexts contain the same classes (domain-incremental learning) or different classes (class-incremental learning). However, it should then be realized that whether two samples belong to the same class can change depending on perspective: in the Split MNIST example (Fig. 2), with domain-incremental learning the digits '0' and '2' belong to the same class (as they are both even digits), but with class-incremental learning they are considered different classes.

Generalization to more flexible settings

The clear separation between the three scenarios makes the academic continual learning setting convenient for studying these scenarios and their different challenges in isolation. However, this setting does not reflect well the arbitrary non-stationarity that can be observed in the real world. To generalize the three scenarios to more flexible continual learning settings, we first introduce a distinction between the concepts 'context set' and 'data stream':
The 'context set' is defined as a collection of underlying distributions, denoted by {D^(c) : c ∈ C}, from which the observations presented to the algorithm are sampled. For a supervised continual learning problem, for each context c, samples from D^(c) consist of an input x and a within-context label y. (With class-incremental learning each context could also contain a single class, in which case the within-context label is not used.)
The 'data stream' is defined as a (possibly unbounded) stream of experiences that are sequentially presented to the algorithm: e_1, e_2, .... Each experience consists of a set of observations sampled from one or more of the underlying distributions of the context set. These experiences are the incremental steps of a continual learning problem, in the sense that at each step, the algorithm has free access to the data of the current experience, but not to the data from past or future experiences (see also ref.).
In the academic continual-learning setting, there is no distinction between the context set and the data stream, because each experience consists of all the training data of a particular context. In general, however, such a direct relation is not needed. Every observation within each experience can in principle be sampled from any combination of underlying datasets from the context set. This can be formalized as:
x_j^(i) ~ Σ_{c ∈ C} p_c^(i,j) D^(c),        (1)

whereby x_j^(i) is observation j of experience e_i and p_c^(i,j) is the probability that this observation is sampled from D^(c). Importantly, in this framework, from a probabilistic perspective, two observations at different points in time can only differ from each other with respect to the (combination of) contexts from which they are sampled. With this formulation, the context set describes the aspects of the data that 'can' change over time and the probabilities p_c^(i,j) describe 'how' they change over time.
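A minimal sketch of this sampling scheme, with toy one-dimensional Gaussian contexts (the context means and mixture probabilities are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Context set: one underlying distribution per context (toy 1-D Gaussians).
context_means = [0.0, 5.0, 10.0]

def sample_experience(probs, n_obs):
    """Draw one experience: each observation is sampled by first picking a
    context c with probability probs[c], then sampling from D^(c)."""
    contexts = rng.choice(len(context_means), size=n_obs, p=probs)
    return np.array([rng.normal(context_means[c], 1.0) for c in contexts])

# Academic setting: each experience holds data of exactly one context...
e1 = sample_experience([1.0, 0.0, 0.0], n_obs=100)
# ...but gradual transitions or revisited contexts are just different probabilities:
e2 = sample_experience([0.3, 0.7, 0.0], n_obs=100)
```

In the academic setting the mixture weights are one-hot per experience; relaxing them yields the 'task-free' streams discussed below.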
An advantage of distinguishing between the context set and the data stream is that it makes it possible to describe continual learning problems with gradual transitions between contexts or whereby contexts are revisited. In this framework, which is suitable for so-called 'task-free' continual learning, generalized versions of the three scenarios can be defined based on how the mapping that must be learned relates to the context space C, which describes the non-stationary aspect of the data. Supplementary Note 2 illustrates how a 'task-free' data stream can be performed according to each of the generalized versions of the three scenarios.
We note that for complex real-world incremental learning problems, it might not be straightforward to express the mapping that must be learned in terms of the context space C, for example, because there are different aspects of the data that change over time. To accommodate this, a multidimensional context space can be used, whereby each dimension could adhere to a different scenario. This allows for continual learning problems that are mixtures of scenarios (Supplementary Note 3). Another generalization is that contexts do not need to be discrete, but can be continuous (in that case the summation in equation (1) becomes an integral); an example of a continuous context set is Rotated MNIST with arbitrary rotation (Supplementary Note 3).

Empirical comparison

To further explore the differences between the three continual learning scenarios, here we provide an empirical comparison of the performance of different deep learning strategies. To do this comparison in a structured manner, in Supplementary Note 4 we discuss and distinguish five computational strategies for continual learning (Fig. 3). For each strategy, we included a few representative methods in our comparison.

Compared methods

The use of context-specific components was represented by context-dependent gating (XdG), which masks for each context a randomly selected subset of hidden units; and the separate networks approach, where the available parameter budget is divided over all contexts and a separate network is learned for each context.
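Context-dependent gating can be sketched as follows; the layer size and keep fraction here are illustrative, not the paper's settings. Note that applying the right mask requires context identity, which is why this strategy suits task-incremental learning:

```python
import numpy as np

rng = np.random.default_rng(2)
n_hidden, n_contexts, keep_fraction = 100, 5, 0.2

# For each context, a fixed random binary mask over the hidden units;
# during training and testing on that context, only unmasked units are used,
# so different contexts mostly use disjoint parts of the layer.
masks = {c: rng.random(n_hidden) < keep_fraction for c in range(n_contexts)}

def gated_hidden(h, context):
    """Apply the context's mask to a hidden activation vector."""
    return h * masks[context]
```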
Included parameter regularization methods were elastic weight consolidation (EWC), which estimates parameter importance using a diagonal approximation to the Fisher information; and synaptic intelligence (SI), which estimates parameter importance online based on the training trajectory.
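Both methods add a quadratic penalty of the same shape to the loss. A minimal sketch, with a generic per-parameter importance estimate `fisher_diag` standing in for EWC's Fisher approximation or SI's online estimate:

```python
import numpy as np

def ewc_penalty(params, old_params, fisher_diag, lam=1.0):
    """Quadratic penalty anchoring parameters important for past contexts:
    (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2, where theta* are the
    parameter values after training on the previous context."""
    return 0.5 * lam * np.sum(fisher_diag * (params - old_params) ** 2)

# When training on a new context, the total loss would be:
#   task_loss + ewc_penalty(params, old_params, fisher_diag, lam)
```

Parameters with high importance are pulled back towards their old values, while unimportant ones remain free to change.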
For functional regularization, compared were learning without forgetting (LwF), which uses the inputs from the current context as anchor points; and functional regularization of the memorable past (FROMP), which uses stored examples from past contexts as anchor points.
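The shared idea is a distillation-style penalty that keeps the new model's outputs close to the old model's outputs at the anchor points. A minimal numpy sketch (the temperature T and the soft cross-entropy form are illustrative choices, not the exact losses of either paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(new_logits, old_logits, T=2.0):
    """Soft cross-entropy between the old model's softened outputs (targets)
    and the new model's softened outputs, evaluated at the anchor points.
    For LwF the anchors are the current context's inputs; for FROMP they are
    stored examples from past contexts."""
    p_old = softmax(old_logits / T)
    p_new = softmax(new_logits / T)
    return -np.mean(np.sum(p_old * np.log(p_new + 1e-12), axis=-1))
```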
The included replay methods were deep generative replay (DGR), which replays generated representations at the input level; brain-inspired replay (BI-R), which replays generated representations at the latent feature level; experience replay (ER), which replays stored samples in the 'standard way' (that is, loss on replayed data added to loss on current data); and averaged gradient episodic memory (A-GEM), which replays the same stored samples but uses the loss on replayed data as an inequality constraint (that is, the loss on current data is optimized under the constraint that the loss on replayed data cannot increase).
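'Standard' replay as used by ER can be sketched as follows. The buffer policy and the loss function are placeholders; real implementations typically use reservoir sampling or herding to select what to store:

```python
import numpy as np

rng = np.random.default_rng(3)

class ReplayBuffer:
    """Fixed-budget memory of past observations (here we simply keep the
    first `budget` items, for illustration)."""
    def __init__(self, budget=100):
        self.budget, self.data = budget, []

    def add(self, batch):
        self.data.extend(batch)
        self.data = self.data[:self.budget]

    def sample(self, n):
        idx = rng.choice(len(self.data), size=n, replace=False)
        return [self.data[i] for i in idx]

def er_step(loss_fn, current_batch, buffer, n_replay=10):
    # 'Standard' replay: loss on current data + loss on replayed data.
    # (A-GEM would instead use the replayed loss only as an inequality
    # constraint on the gradient step.)
    return loss_fn(current_batch) + loss_fn(buffer.sample(n_replay))
```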
The compared template-based methods were iCaRL, with mean latent feature representations of stored examples as templates; and the generative classifier from ref., which uses class-specific generative models as templates.
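Template-based classification with class-mean templates (the rule underlying iCaRL's nearest-mean-of-exemplars classifier) can be sketched as follows; in iCaRL the features would come from the learned network rather than being raw inputs:

```python
import numpy as np

def class_means(features, labels):
    """One template per class: the mean feature vector of its stored examples."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(x_feat, templates):
    # Assign the class whose template is closest in feature space.
    return min(templates, key=lambda c: np.linalg.norm(x_feat - templates[c]))
```

Because each class gets its own template, classes from different episodes can be compared directly at test time, which is what class-incremental learning requires.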

Fig. 2 | Split MNIST according to the three scenarios. a, The Split MNIST protocol is obtained by splitting the original MNIST dataset into five contexts, with each context consisting of two digits. b, Overview of what is expected of the algorithm at test time when the Split MNIST protocol is performed according to each continual learning scenario:

Scenario                    | Input (at test time)  | Expected output      | Intuitive description
Task-incremental learning   | Image + context label | Within-context label | Choice between two digits of the same context (e.g. 0 or 1)
Domain-incremental learning | Image                 | Within-context label | Is the digit odd or even?
Class-incremental learning  | Image                 | Global label         | Choice between all ten digits

With task-incremental learning, at the computational level, there is no difference between whether the algorithm must return the within-context label or the global label, because the within-context label can be combined with the context label (which is provided as input) to get the global label.
Fig. 3 | Schematic illustrations of different continual learning strategies. a, Context-specific components. Certain parts of the network are only used for specific contexts. b, Parameter regularization. Parameters important for past contexts are encouraged not to change too much when learning new contexts. c, Functional regularization. The input-output mapping learned previously is encouraged not to change too much at a particular set of inputs (the 'anchor points') when training on new contexts. d, Replay. The training data of a new context is complemented with data representative of past contexts. The replayed data can be sampled from a memory buffer or produced by a generative model. e, Template-based classification. A 'template' is learned for each class (for example, a prototype, an energy value or a generative model), and classification is performed based on which template is most suitable for the sample to be classified. See Supplementary Note 4 for a detailed discussion of these strategies.

Finally, two baselines were included. As a lower target, referred to as 'none', the model was incrementally trained on all contexts in the standard way. As an upper target, referred to as 'joint', the model was trained on the data of all contexts at the same time.

Set-up

We performed both the Split MNIST and the Split CIFAR-100 protocol according to each of the three scenarios. All experiments used the academic continual learning setting and context identity information was available during training. To make the comparisons as informative as possible, we used similar network architectures and similar training protocols for all compared methods. Depending on the continual learning scenario, the output layer of the network was treated differently. With task-incremental learning, a multi-headed output layer was used whereby each context had its own output units and only the units of the context under consideration were used. For the other two scenarios, single-headed output layers were used, with the number of output units equal to the number of classes per context (domain-incremental learning) or to the total number of classes (class-incremental learning). See Methods for more detail.

Results

For both Split MNIST (Table 2) and Split CIFAR-100 (Table 3), we found clear differences between the three continual learning scenarios. With task-incremental learning, almost all tested methods performed well compared to the 'none' and 'joint' baselines; with domain-incremental learning the relative performances of many methods dropped considerably; and with class-incremental learning they decreased even further.
The decline in performance across the three scenarios was most pronounced for the parameter regularization methods. On both protocols, EWC and SI performed close to the upper target when context identity was known during testing (that is, task-incremental learning); with domain-incremental learning the performance of both methods was substantially lower, but remained above the lower target of sequentially training a network in the standard way; and with class-incremental learning the performance of EWC and SI was similar to the lower target, indicating that in this scenario these methods failed completely. There was a similar trend across the three scenarios for the functional regularization methods, albeit less pronounced for FROMP than for LwF.
Replay-based methods performed relatively well in all three scenarios. Although on both protocols their performance still decreased from task- to domain- to class-incremental learning, the decline was less sharp than for the regularization-based methods, and replay-based methods were among the top performers in each scenario. Template-based classification also performed well with class-incremental learning, with iCaRL and the Generative Classifier among the best performing methods on both protocols.
For class-incremental learning, the methods that performed best either used a generative model or they stored previously seen data in a memory buffer. Directly comparing methods using these two
Table 2 | Results on Split MNIST

Strategy                      | Method                | Budget | GM  | Task-IL | Domain-IL | Class-IL
Baselines                     | None (lower target)   | -      | -   |         |           |
Baselines                     | Joint (upper target)  | -      | -   |         |           |
Context-specific components   | Separate Networks     | -      | -   |         |           |
Context-specific components   | XdG                   | -      | -   |         |           |
Parameter regularization      | EWC                   | -      | -   |         |           |
Parameter regularization      | SI                    | -      | -   |         |           |
Functional regularization     | LwF                   | -      | -   |         |           |
Functional regularization     | FROMP                 | 100    | -   |         |           |
Replay                        | DGR                   | -      | Yes |         |           |
Replay                        | BI-R                  | -      | Yes |         |           |
Replay                        | ER                    | 100    | -   |         |           |
Replay                        | A-GEM                 | 100    | -   |         |           |
Template-based classification | Generative Classifier | -      | Yes |         |           |
Template-based classification | iCaRL                 | 100    | -   |         |           |

Each experiment was performed 20 times with different random seeds; reported is the mean (± s.e.m.) over these runs.
Table 3 | Results on Split CIFAR-100

Strategy                      | Method                | Budget | GM  | Task-IL | Domain-IL | Class-IL
Baselines                     | None (lower target)   | -      | -   |         |           |
Baselines                     | Joint (upper target)  | -      | -   |         |           |
Context-specific components   | Separate Networks     | -      | -   |         |           |
Context-specific components   | XdG                   | -      | -   |         |           |
Parameter regularization      | EWC                   | -      | -   |         |           |
Parameter regularization      | SI                    | -      | -   |         |           |
Functional regularization     | LwF                   | -      | -   |         |           |
Replay                        | DGR                   | -      | Yes |         |           |
Replay                        | BI-R                  | -      | Yes |         |           |
Replay                        | ER                    | 100    | -   |         |           |
Replay                        | A-GEM                 | 100    | -   |         |           |
Template-based classification | Generative Classifier | -      | Yes |         |           |
Template-based classification | iCaRL                 | 100    | -   |         |           |

Each experiment was performed 20 times with different random seeds; reported is the mean (± s.e.m.) over these runs. All compared methods used convolutional layers that were pre-trained on CIFAR-10; see Methods for full details.
types of memories can be arbitrary, as their performance can heavily depend on the number of stored examples or on the kind of generative model. We instead focus on comparing methods that use the same type of memory.
For methods using generative models, the largest differences were observed with Split CIFAR-100. In particular, DGR did not perform well on this protocol, indicating that standard generative replay (that is, at the input level) is not a good approach when the input data are complex (see also refs. ). There was also a substantial gap in performance between BI-R and the Generative Classifier. As both methods had a generative model on the latent features, this suggests that the way in which a generative model is used (that is, for generating replay or as templates) is important as well.

For methods using stored data, we found that replaying stored data in the standard way (as is done by ER) was not often outperformed by more complex ways of using stored data. In fact, perhaps surprisingly, on all experiments ER comfortably outperformed A-GEM, and ER performed significantly better than FROMP on two of the three scenarios of Split MNIST. These results held for different sizes of the memory buffer (Extended Data Fig. 1). A clear improvement over ER was only observed with iCaRL, and only when the size of the memory buffer was relatively small.

Discussion

Continual learning is a key feature of natural intelligence, but an open challenge for deep learning. Standard deep neural networks tend to catastrophically forget previous tasks or data distributions when trained on a new one. Enabling these networks to incrementally learn, and retain, information from different contexts has become a topic of intense research. Yet, despite its scope, the continual learning field lacks structure, and direct comparisons between published papers can be misleading. Here, we pointed out that an important difference between continual learning set-ups is whether context identity is known to the algorithm and, if it is not, whether it must be inferred. Based on these two distinctions, we identified three scenarios for continual learning: task-incremental learning, domain-incremental learning and class-incremental learning.
These three scenarios and their different challenges can be conveniently studied in an academic continual learning setting, in which a classification-based problem is split up into discrete, non-overlapping contexts (often called 'tasks') that are encountered in sequence. We showed that in this setting there is a clear separation between the three scenarios. At least in part because of two preprints of this article, the terms 'task-incremental learning', 'domain-incremental learning' and 'class-incremental learning' are sometimes used in the recent literature in a way that restricts them to this academic setting. Here, by interpreting these three scenarios as specifying how the non-stationary aspect of the data relates to the mapping that must be learned, we propose that they generalize to more flexible continual-learning settings. To demonstrate the value of such generalized versions of these three scenarios, Supplementary Note 2 shows how a 'task-free' data stream without sharp context boundaries can also be performed in three different ways.
A key insight of this article is that a useful way to categorize continual-learning problems is based on how the non-stationary aspect of the data relates to the mapping to be learned. For supervised classification this leads to the three continual learning scenarios discussed here, but the same perspective might also be useful for unsupervised or reinforcement learning (Supplementary Note 5). Continual learning can be categorized in other ways as well, some of which we discuss in Supplementary Note 6.
Using the academic continual-learning setting, for each scenario we performed an empirical comparison of a representative selection of continual learning algorithms. This comparison revealed marked differences between the three scenarios in overall difficulty level and in the relative effectiveness of different continual learning strategies. The only strategy among the top performers in all three scenarios is replay, with the replayed data sampled either from a memory buffer or from a generative model. Surprisingly, within the class of methods using stored data, the strongest performance is often obtained by the method ER, which replays stored data in the standard way. In our experiments, popular methods such as A-GEM and FROMP, which use stored data in more complex ways, are almost always outperformed by ER, even though the computational costs of A-GEM and FROMP are strictly higher than those of ER.
In the class-incremental learning scenario, we found that parameter regularization methods such as EWC and SI fail almost completely, even on Split MNIST. Functional regularization typically works better, especially when using stored data as anchor points, but this strategy also does not work optimally. We hypothesize that regularization-based strategies, at least by themselves, are not well suited for class-incremental learning because they do not provide a way to compare between classes that are not observed together. Regularization-based methods aim to learn new contexts while preserving the function or parameters learned in previous contexts. However, with class-incremental learning, learning a new context (for example, distinguishing '2' and '3') while preserving what was learned before (for example, distinguishing '0' and '1') is not enough; it is also necessary to combine information from different contexts (for example, for distinguishing '1' and '2'). For learning to distinguish between classes not observed together, it might be unavoidable to use either replay, which allows for comparing between classes from different contexts during training, or template-based classification, which allows for comparing between classes from different contexts during inference (that is, during the classification decision).
Task-incremental learning is sometimes considered 'easy', and it has been argued that the continual-learning community should move away from the assumption that context identities (or task labels, as they are often called) are provided at test time. A reason for this notion might be that with task-incremental learning, the bar is often set too low. In our experiments, while all methods indeed perform substantially better than the usual 'lower target' in which a single shared neural network is sequentially trained on all contexts, most methods perform worse than the more appropriate lower target in which a smaller, separate network is trained for each context. To do better than this 'Separate Networks' approach, positive forward or backward transfer between contexts is necessary, but achieving such positive transfer is not trivial.
Domain-incremental learning might be the least studied continual learning scenario. A few years ago this scenario was regularly studied with Permuted MNIST, but this protocol is not often used anymore as it is considered too artificial. The continual learning field is currently dominated by context sets created by splitting up existing image classification datasets based on class labels (for example, Split MNIST, Split CIFAR-100). Although in theory these context sets can be performed according to all three scenarios, they are typically less intuitive and/or realistic under the assumptions of domain-incremental learning. However, in recent years, several resources have been created that provide, or enable the generation of, more realistic context sets well suited for domain-incremental learning. These resources might help to renew the community's interest in this scenario.
The three continual learning scenarios described in this article provide a useful basis for defining clear and unambiguous benchmark problems for continual learning. We hope this will accelerate progress to bridge the gap between natural and artificial intelligence. Moreover, we believe that it is an important conceptual insight that, at the computational level, a supervised learning problem can be incremental in these three different ways. Perhaps especially in the real world, where continual learning problems are often complex and 'mixtures' of scenarios, it might be fruitful to approach problems as consisting of a combination of these three fundamental types of incremental learning.

Methods

All experiments were run using custom-written code based on the Python machine learning framework PyTorch.

Context sets

For the Split MNIST protocol, the MNIST dataset was split into five contexts, such that each context contained two digits. The digits were randomly divided over the five contexts, so the order of the digits was different for each random seed. The original 28 × 28 pixel greyscale images were used without pre-processing. The standard training/test-split was used, which resulted in 60,000 training images (approximately 6,000 per digit) and 10,000 test images (approximately 1,000 per digit).
For the Split CIFAR-100 protocol, the CIFAR-100 dataset was split up into ten contexts, such that each context contained ten image classes. The classes were randomly divided over the contexts, with a different class order for each random seed. The original 32 × 32 pixel RGB-colour images were normalized (that is, each pixel value was subtracted by the relevant channel-wise mean and divided by the channel-wise standard deviation, with means and standard deviations calculated over all training images). No other pre-processing or augmentation was applied. The standard training/test-split was used, which resulted in 50,000 training images (500 per class) and 10,000 test images (100 per class).
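As an illustration, the random class-to-context assignment used by these split protocols can be sketched as follows (a minimal sketch; the function and variable names are our own, not taken from the paper's code):

```python
import random

def make_context_splits(num_classes, num_contexts, seed):
    """Randomly divide class labels over contexts, e.g. the 10 MNIST
    digits over 5 contexts, or the 100 CIFAR-100 classes over 10."""
    assert num_classes % num_contexts == 0
    classes_per_context = num_classes // num_contexts
    rng = random.Random(seed)
    labels = list(range(num_classes))
    rng.shuffle(labels)  # a different class order for each random seed
    return [labels[i * classes_per_context:(i + 1) * classes_per_context]
            for i in range(num_contexts)]

# Split MNIST: five contexts with two digits each.
contexts = make_context_splits(num_classes=10, num_contexts=5, seed=0)
```

Each dataset image is then assigned to the context that contains its class label.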

Base neural network architecture

To make the comparisons as informative as possible, we used the same base neural network architecture for all methods as much as possible. For Split MNIST, the base network had two fully connected hidden layers of 400 ReLUs each and a softmax output layer. For Split CIFAR-100, the base network had five pre-trained convolutional layers followed by two fully connected layers with 2,000 ReLUs each and a softmax output layer. The number of channels in the convolutional layers increased with depth, up to 256 channels in the final convolutional layer. Each convolutional layer used a 3 × 3 kernel and a padding of 1, and there was a stride of 1 in the first layer (that is, no downsampling) and a stride of 2 in the other layers (that is, the image size was halved in each of those layers). Batch norm was used in all convolutional layers, followed by a ReLU non-linearity. No pooling was used. The convolutional layers were pre-trained on CIFAR-10, which is a dataset containing similar but non-overlapping images and image classes compared with CIFAR-100. To pre-train the convolutional layers, the base neural network was trained to classify the 10 classes of CIFAR-10 for 100 epochs, using the ADAM-optimizer with a learning rate of 0.0001 and a mini-batch size of 256. For the pre-training on CIFAR-10, images were normalized and augmented by random cropping and horizontal flipping. A similar pre-training protocol was used in ref. . During the incremental training on CIFAR-100, the parameters of the pre-trained convolutional layers were frozen. For all compared methods, freezing these parameters resulted in similar or better performance compared with not freezing them.

Output layer

The softmax output layer of the network was treated differently depending on the continual learning scenario that was performed. With task-incremental learning, a multi-headed output layer was used, meaning that each context had its own output units and only the output units of the context under consideration (that is, either the current context or the replayed context) were set to 'active' (see next paragraph). With domain- and class-incremental learning, a single-headed output layer was used. For domain-incremental learning, this meant that all contexts used the same output units (that is, there were 2 output units for Split MNIST and 10 for Split CIFAR-100); for class-incremental learning, this meant that each class had its own output unit (that is, there were 10 output units for Split MNIST and 100 for Split CIFAR-100). With both domain- and class-incremental learning, always all output units were set to 'active'. Note that with class-incremental learning another possibility is to use an 'expanding head' and only set the output units of classes seen so far to active (for example, see refs. ). We found that for our experiments there was not much difference in performance between these two options. Because all output units should always be active for the Bayesian interpretation of the parameter regularization methods, we decided to use that approach in this study.
Whether an output unit was set to 'active' controlled whether the network could assign a positive probability to its corresponding class. The probability predicted by a neural network with parameters θ that an input x belongs to output class y was calculated as:

p_θ(Y = y | x) = exp( z_y(x; θ) ) / Σ_{c ∈ A} exp( z_c(x; θ) )    (2)

whereby z_y(x; θ) was the logit of output class y obtained by putting input x through the neural network with parameters θ, and the summation in the denominator was over the set A of all active classes in the output layer. Importantly, with task- and class-incremental learning, output class y refers to the 'global class' that is obtained by combining the within-context label c and the context label k (that is, each global class corresponds to a pair (c, k)). With domain-incremental learning, output class y refers to the within-context label c.
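The role of the 'active' output units in equation (2) can be sketched as follows: only active classes take part in the softmax normalization, so inactive classes always receive probability zero (a minimal pure-Python sketch; the names are our own):

```python
import math

def masked_softmax(logits, active):
    """Softmax over the active output classes only; classes outside
    'active' are excluded from the normalization and get probability 0."""
    exps = {c: math.exp(logits[c]) for c in active}
    z = sum(exps.values())
    return [exps[c] / z if c in active else 0.0 for c in range(len(logits))]

# Multi-headed (task-incremental) use: only the considered context's
# two output units (here classes 0 and 1) are active.
logits = [2.0, 1.0, 0.5, 0.1]
p = masked_softmax(logits, active={0, 1})
```

With a single-headed output layer (domain- or class-incremental learning), `active` would simply contain all output classes.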

Data stream

All experiments in this article used the academic continual learning setting, meaning that the different contexts were presented to the algorithm one after the other. Within each context, the training data was fed to the algorithm in a stream of independent and identically distributed experiences (or iterations). For Split MNIST, each context was trained for 2,000 iterations with a mini-batch size of 128. For Split CIFAR-100, there were 5,000 iterations per context with a mini-batch size of 256. Some of the compared methods (EWC, FROMP and iCaRL) performed an additional pass over each context's training data upon finishing training on that context.

Loss function and optimization

For all compared methods, the parameters θ of the neural network were sequentially trained on each context by optimizing a loss function (denoted by ℒ_total) using stochastic gradient descent. In each iteration, the loss was calculated as the average over all samples in the mini-batch, and a single gradient step was taken with the ADAM-optimizer (β₁ = 0.9, β₂ = 0.999) and a learning rate of either 0.001 (Split MNIST) or 0.0001 (Split CIFAR-100).
For most compared methods, a central component of the loss function was the multi-class cross-entropy classification loss on the data of the current context. For an input x labeled with a hard target y, this classification loss was given by:

ℒ_current(x, y; θ) = −log p_θ(Y = y | x)    (3)

with p_θ(Y = y | x) the conditional probability distribution defined by the neural network with parameters θ, as given in equation (2).

Memory buffer and generative models

Several of the compared methods (FROMP, ER, A-GEM and iCaRL) maintained a memory buffer in which examples of previously seen classes were stored. Except for the experiments in Extended Data Fig. 1, 100 examples per class were allowed to be stored in the memory buffer (that is, the per-class memory budget b was set to 100). Some other methods (DGR, BI-R and the Generative Classifier) instead learned generative models; these methods used up to three times as many parameters compared with the other methods.

Baselines

For the baseline 'None', which was included as a lower target, the base neural network was sequentially trained on each context in the standard way, meaning that the loss function to be optimized was always just the classification loss on the current data (that is, ℒ_total = ℒ_current).
For the baseline 'Joint', which was included as an upper target, the base neural network was trained on the data from all contexts at the same time. For this baseline, the same total number of iterations was used as with the sequential training protocol (that is, 10,000 iterations for Split MNIST and 50,000 iterations for Split CIFAR-100), but each mini-batch was always sampled jointly from the data of all contexts.

Approaches using context-specific components

For XdG and the 'Separate Networks' approach, not all parts of the network were used for each context. These approaches require knowledge of which context a sample belongs to (to select the correct context-specific components), which meant that they could only be used in the task-incremental learning scenario. For both approaches, training was performed using just the classification loss on the current data (that is, ℒ_total = ℒ_current).
In the task-incremental learning scenario, the other methods (that is, all methods except XdG and Separate Networks) used the available context identity information only in the form of a separate output layer for each context. This is a common and often sensible way to use context identity information, although in Supplementary Note 7 we show that sometimes it is more efficient to use context identity information in other ways.
在任务增量学习场景中,其他方法(即除 XdG 和 Separate Networks 之外的所有方法)仅以为每个上下文建立单独输出层的形式使用可用的上下文身份信息。这是使用上下文身份信息的常见方法,通常也是合理的方法,不过我们在补充说明 7 中指出,有时以其他方式使用上下文身份信息会更有效。
Separate Networks. For the Separate Networks approach, the available parameter budget was equally divided over all contexts to be learned, and a separate sub-network was trained for each context. For Split MNIST, each context-specific sub-network had two fully connected hidden layers of 100 ReLUs each and a softmax output layer. For Split CIFAR-100, the pre-trained and frozen convolutional layers were shared between all contexts, and only the fully connected part of the network was split up into context-specific sub-networks. Each context-specific sub-network had two fully connected layers with 400 ReLUs each and a softmax output layer.
XdG. With XdG (ref. ), the base neural network was used, and for each context a different, randomly selected subset of X% of the units in each hidden layer was fully gated (that is, their activations were set to zero), with X a hyperparameter whose value was set by a grid search (Supplementary Note 8).
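The context-dependent gating of XdG can be sketched as follows: for each context, a fixed random subset of each hidden layer's units is selected once, and those units' activations are zeroed whenever that context is processed (an illustrative sketch with our own names, not the paper's code):

```python
import random

def make_gate_masks(num_units, percent_gated, num_contexts, seed=0):
    """For each context, pick a random subset of hidden units to gate
    (set to zero); the masks stay fixed for the lifetime of the network."""
    rng = random.Random(seed)
    n_gated = int(num_units * percent_gated / 100)
    return [set(rng.sample(range(num_units), n_gated))
            for _ in range(num_contexts)]

def apply_gate(activations, mask):
    """Zero out the gated units' activations for the given context."""
    return [0.0 if i in mask else a for i, a in enumerate(activations)]

# For example: 400-unit hidden layer, 80% of units gated per context.
masks = make_gate_masks(num_units=400, percent_gated=80, num_contexts=5)
```

Because a different subset survives for each context, the contexts interfere less with each other, but the correct mask (and hence the context identity) must be known at test time.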

Parameter regularization methods

For the parameter regularization methods EWC and SI, a regularization term was added to the classification loss: ℒ_total = ℒ_current + ℒ_reg. This regularization term penalized changes to parameters thought to be important for previously learned contexts.
EWC. The regularization term of EWC (ref. ) consisted of a quadratic penalty term for each previously learned context, whereby the term of each context penalized parameters for how different they were compared to their values directly after finishing training on that context. When training on context K, the EWC regularization term was given by:

ℒ_EWC = λ Σ_{k=1}^{K−1} Σ_i F_i^(k) ( θ_i − θ̂_i^(k) )²    (4)

with λ a hyperparameter controlling the regularization strength (which was set based on a grid search, Supplementary Note 8), θ̂_i^(k) the value of parameter θ_i at the end of training on context k, and F_i^(k) the estimated importance of parameter θ_i for context k. This importance estimate was calculated as the i-th diagonal element of the Fisher information matrix of context k:

F_i^(k) = (1 / |D_k|) Σ_{x ∈ D_k} Σ_y p_{θ̂^(k)}(Y = y | x) ( ∂ log p_{θ̂^(k)}(Y = y | x) / ∂θ_i )²    (5)

whereby D_k was the training data of context k and p_{θ̂^(k)}(Y = y | x) was the probability that x belongs to output class y, as predicted by the network after finishing training on context k (that is, with parameters θ̂^(k)). The inner summation in equation (5) was over all output classes that were active during training on context k.
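For a toy model whose gradients are available in closed form, the diagonal Fisher estimate of equation (5) can be computed directly; the sketch below uses a one-parameter logistic model (our own illustrative example, not the paper's network):

```python
import math

def fisher_diagonal_logistic(theta, data):
    """Diagonal Fisher estimate for a 1-parameter logistic model
    p(y=1|x) = sigmoid(theta * x).

    For this model, d/dtheta log p(y=1|x) = (1 - p1) * x and
    d/dtheta log p(y=0|x) = -p1 * x; the expectation over labels is
    taken under the model's own predicted distribution, as in eq. (5).
    """
    fisher = 0.0
    for x in data:
        p1 = 1.0 / (1.0 + math.exp(-theta * x))
        grad_y1 = (1.0 - p1) * x   # gradient of log p(y=1|x)
        grad_y0 = -p1 * x          # gradient of log p(y=0|x)
        fisher += p1 * grad_y1 ** 2 + (1.0 - p1) * grad_y0 ** 2
    return fisher / len(data)

F = fisher_diagonal_logistic(theta=0.5, data=[1.0, -2.0, 0.3])
```

For this model the estimate reduces to the closed form (1/|D|) Σ_x p1(1−p1)x², which makes the sketch easy to check; for a deep network the squared log-likelihood gradients are instead obtained by backpropagation.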
SI. The regularization term of SI (ref. ) consisted of a single quadratic term that penalized changes to the parameters away from the values they had after finishing training on the previous context. When training on context K, the SI regularization term was given by:

ℒ_SI = c Σ_i Ω_i^(K−1) ( θ_i − θ̂_i^(K−1) )²    (6)

with c a hyperparameter controlling the regularization strength (which was set based on a grid search, see Supplementary Note 8), θ̂_i^(K−1) the value of parameter θ_i at the end of training on context K−1, and Ω_i^(K−1) the estimated importance of parameter θ_i after the first K−1 contexts have been learned. To compute these parameter importance estimates, after each context k, a per-parameter contribution to the change of the loss was calculated for each parameter θ_i as follows:

ω_i^(k) = Σ_{t=1}^{N_iter} ( θ_i^(k,t) − θ_i^(k,t−1) ) ( −g_i^(k,t) )    (7)

with N_iter the number of iterations per context, θ_i^(k,t) the value of parameter θ_i after the t-th training iteration on context k and g_i^(k,t) the gradient of the loss with respect to parameter θ_i during the t-th training iteration on context k. For every context, these per-parameter contributions were normalized by the square of the total change of that parameter during training on that context plus a small dampening term ξ (set to 0.1, to bound the resulting normalized contributions when a parameter's total change goes to zero), after which they were summed over all contexts so far:

Ω_i^(K) = Σ_{k=1}^{K} ω_i^(k) / ( ( Δ_i^(k) )² + ξ )    (8)

with Δ_i^(k) = θ̂_i^(k) − θ_i^(k,0), where θ_i^(k,0) was the value of parameter θ_i right before starting training on context k.

Functional regularization methods

Similar to parameter regularization, the functional regularization methods LwF and FROMP had a regularization term added to the classification loss: ℒ_total = ℒ_current + ℒ_reg. This regularization term encouraged the input-output mapping of the network not to change too much at a set of anchor points.
LwF. The method LwF (ref. ) used the inputs from the current context as anchor points, in combination with knowledge distillation. During training on context K, the LwF regularization term was given by:

ℒ_LwF = − Σ_{k=1}^{K−1} Σ_{c ∈ C_k} p̃_{θ̂^(K−1)}(Y = c | x) log p̃_θ(Y = c | x)    (9)

whereby C_k was the set of output classes in context k, θ̂^(K−1) was the parameter vector with values as they were at the end of training on context K−1 and p̃_θ(Y = c | x) was the 'temperature-raised' probability that input x belongs to output class c, as predicted by the network with parameters θ. These temperature-raised probabilities were defined as:

p̃_θ(Y = c | x) = exp( z_c(x; θ) / T ) / Σ_{c′ ∈ A} exp( z_{c′}(x; θ) / T )    (10)

with T the temperature, which was set to 2, and z_c(x; θ) the logit of output class c obtained by putting input x through the neural network with parameters θ. The summation in the denominator was over the set A of all active classes in the output layer. With task-incremental learning, for each context's term in the outer summation of equation (9), only the output classes contained in that context were active. With domain- and class-incremental learning, always all output classes were active. In each iteration, the LwF regularization term was computed as the average over the same inputs that were used to compute ℒ_current.
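The LwF regularization amounts to a cross-entropy between the temperature-softened predictions of the old network and those of the current network; a minimal pure-Python sketch over one context's classes (the names are our own):

```python
import math

def temperature_softmax(logits, T=2.0):
    """Softmax of the logits divided by temperature T; a higher T
    softens the distribution, exposing more of the old network's
    relative class preferences."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(old_logits, new_logits, T=2.0):
    """Cross-entropy between the old network's and the current
    network's temperature-raised output distributions."""
    p_old = temperature_softmax(old_logits, T)
    p_new = temperature_softmax(new_logits, T)
    return -sum(po * math.log(pn) for po, pn in zip(p_old, p_new))

# The loss is minimal when the current network matches the old one.
loss_same = distillation_loss([2.0, 0.5], [2.0, 0.5])
loss_diff = distillation_loss([2.0, 0.5], [0.5, 2.0])
```

During training this loss is evaluated on the current context's inputs, so no stored data from previous contexts is needed.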
We note that this implementation of LwF differs slightly from the implementation of LwF used in ref. . Compared with that implementation, the regularization term here was weighted less strongly, which substantially improved the performance of LwF on Split CIFAR-100. Initial experiments indicated that by reducing the weight of the replay term in equation (16) it is also possible to improve the performance of several of the replay methods on Split CIFAR-100, but at the cost of impaired performance on Split MNIST.
FROMP. The method FROMP (ref. ) performed functional regularization in a Bayesian framework and used stored data from previous contexts, referred to as memorable inputs, as anchor points. During training on context K, the regularization term of FROMP was given by:

ℒ_FROMP = (τ / 2) Σ_{k=1}^{K−1} ( m_θ^(k) − m_{θ̂^(K−1)}^(k) )ᵀ ( K^(k) )^{−1} ( m_θ^(k) − m_{θ̂^(K−1)}^(k) )    (11)

with τ a hyperparameter controlling the regularization strength (which was set based on a grid search, see Supplementary Note 8) and θ̂^(K−1) the parameter vector with values as they were at the end of training on context K−1. Further, m_θ^(k) was a vector containing, for each memorable input from context k, the probability (as predicted by the network with parameters θ) that this input belongs to the output class for which it was selected; that is, the j-th element of m_θ^(k) was this predicted probability for x_j^(k), the j-th memorable input of context k. Finally, K^(k) was a kernel matrix evaluated at the memorable inputs of context k, whose elements were computed from the Jacobians of the network's logits with respect to the parameters, combined with a diagonal matrix Λ whose diagonal elements depended on the predicted probabilities for the memorable inputs through the derivative of the link function.
The selection of memorable inputs, which are FROMP's anchor points, took place after finishing training on each context. After finishing training on context k, for each input x in that context's training set, a relevance score was calculated from the probabilities p_{θ̂^(k)}(Y = c | x) predicted for the output classes c ∈ C_k of that context, such that inputs close to the decision boundary tended to receive high scores; whereby C_k was the set of output classes in context k and θ̂^(k) were the parameters after training on context k. Then, for each output class in context k, the b inputs with the highest relevance scores were selected as the memorable inputs for that class and stored in the memory buffer.

Replay-based methods

The replay-based methods had two separate loss terms: one for the data of the current context, denoted $\mathcal{L}^{\text{current}}$, and one for the replayed data, denoted $\mathcal{L}^{\text{replay}}$. Except with A-GEM, during training on context $\tau$ the objective was to optimize an overall loss function that was a weighted sum of these two terms, with the weights depending on how many contexts had been seen so far:

$$\mathcal{L}^{\text{total}} = \frac{1}{\tau}\,\mathcal{L}^{\text{current}} + \left(1 - \frac{1}{\tau}\right)\mathcal{L}^{\text{replay}}$$
In each iteration, the number of replayed samples was always equal to the number of samples from the current context (that is, 128 for Split MNIST and 256 for Split CIFAR-100).
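As an illustration of this weighting scheme, here is a minimal sketch in plain Python (the function and argument names are ours, not from the released code; `context_id` is 1-based):

```python
def total_loss(loss_current, loss_replay, context_id):
    """Weighted sum of the current-context and replay loss terms.

    During the first context (context_id == 1) there is nothing to
    replay yet, so the current loss receives full weight.
    """
    w = 1.0 / context_id
    if context_id == 1:
        return loss_current
    return w * loss_current + (1.0 - w) * loss_replay
```

Note that as more contexts are seen, the weight of the current context shrinks, so every context contributes equally to the overall objective.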
ER. With ER, the term $\mathcal{L}^{\text{current}}$ was the standard classification loss on the data of the current context. The term $\mathcal{L}^{\text{replay}}$ was also the standard classification loss, but on the replayed data. In each iteration, the samples to be replayed were randomly sampled from the memory buffer. The memory buffer was updated after each context, when for each new class a fixed number of samples were randomly selected from the training data and added to the buffer.
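A minimal sketch of ER's buffer logic in plain Python (the `(input, label)` tuple format, the per-class budget `b` and the function names are illustrative assumptions):

```python
import random

def update_buffer(buffer, train_data, new_classes, b):
    """After a context, store b randomly chosen samples per new class."""
    for c in new_classes:
        candidates = [(x, y) for (x, y) in train_data if y == c]
        buffer.extend(random.sample(candidates, min(b, len(candidates))))
    return buffer

def sample_replay(buffer, batch_size):
    """Randomly draw a mini-batch of stored samples to replay."""
    return random.sample(buffer, min(batch_size, len(buffer)))
```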
A-GEM. For the method A-GEM (ref. ), the loss terms $\mathcal{L}^{\text{current}}$ and $\mathcal{L}^{\text{replay}}$ were defined similarly as for ER. The population of the memory buffer and the sampling of the data to be replayed from the memory buffer were also the same. The only difference compared to ER was that with A-GEM, the objective was not to minimize the combined loss $\mathcal{L}^{\text{total}}$; instead, the objective was to minimize the loss on the current data (that is, $\mathcal{L}^{\text{current}}$) under the constraint that the loss on the replayed data (that is, $\mathcal{L}^{\text{replay}}$) did not increase. To achieve this, in every iteration, the gradient vector that was used to update the parameters (that is, the gradient vector that was put into the ADAM optimizer) was required to have a positive angle with the gradient of $\mathcal{L}^{\text{replay}}$. Therefore, whenever the angle between the gradient of $\mathcal{L}^{\text{current}}$ and the gradient of $\mathcal{L}^{\text{replay}}$ was negative, the gradient of $\mathcal{L}^{\text{current}}$ was projected onto the orthogonal complement of the gradient of $\mathcal{L}^{\text{replay}}$. Let $(x, y)$ be the mini-batch of data from the current context and $\left(x^{\text{replay}}, y^{\text{replay}}\right)$ the mini-batch of replayed data from the memory buffer. The gradient of $\mathcal{L}^{\text{current}}$ was then:

$$g^{\text{current}} = \nabla_{\theta}\,\mathcal{L}^{\text{current}}(x, y; \theta)$$

and the gradient of $\mathcal{L}^{\text{replay}}$ was given by:

$$g^{\text{replay}} = \nabla_{\theta}\,\mathcal{L}^{\text{replay}}\left(x^{\text{replay}}, y^{\text{replay}}; \theta\right)$$

The gradient $\tilde{g}$ used to update the parameters was then given by:

$$\tilde{g} = \begin{cases} g^{\text{current}} & \text{if } \left\langle g^{\text{current}}, g^{\text{replay}} \right\rangle \geq 0 \\ g^{\text{current}} - \dfrac{\left\langle g^{\text{current}}, g^{\text{replay}} \right\rangle}{\left\langle g^{\text{replay}}, g^{\text{replay}} \right\rangle + \epsilon}\, g^{\text{replay}} & \text{otherwise} \end{cases}$$

with $\epsilon$ a small constant to ensure numerical stability. In A-GEM's original formulation there was no $\epsilon$-term, but we found that without it, performance was unstable. We used a small positive value for $\epsilon$.
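The projection step above can be sketched with NumPy as follows (a sketch of the stated rule; the default value of `eps` here is an arbitrary placeholder, as the constant used in the experiments is not specified in this excerpt):

```python
import numpy as np

def agem_project(g_current, g_replay, eps=1e-7):
    """A-GEM gradient correction: if the current-context gradient
    conflicts with the replay gradient (negative inner product),
    project it onto the orthogonal complement of the replay
    gradient; otherwise use it unchanged."""
    dot = np.dot(g_current, g_replay)
    if dot >= 0:
        return g_current
    return g_current - (dot / (np.dot(g_replay, g_replay) + eps)) * g_replay
```

After the correction, the inner product of the returned gradient with the replay gradient is non-negative, so a small step along it does not (to first order) increase the replay loss.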
DGR. With DGR, two neural networks were sequentially trained on all contexts: a classifier, for which we used the base neural network, and a separate generative model.
For training of the classifier, as with ER and A-GEM, $\mathcal{L}^{\text{current}}$ and $\mathcal{L}^{\text{replay}}$ were the standard classification loss on the data of the current context and the replayed data, respectively. With DGR, the replayed data were obtained by sampling inputs from a copy of the generative model and labelling them as the most likely class predicted for those inputs by a copy of the classifier. The samples replayed during training on context $\tau$ were generated by copies of the generator and classifier stored directly after finishing training on context $\tau-1$. With task-incremental learning, each replayed sample was labelled and evaluated separately for all previous contexts, and $\mathcal{L}^{\text{replay}}$ was the average over those contexts.
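The replay-labelling step can be sketched in plain Python (the two callables stand in for the frozen copies of the generator and the classifier; all names here are illustrative):

```python
def generate_replay(sample_inputs, classify, n):
    """Sketch of DGR-style replay: draw n inputs from a frozen copy
    of the generator and label each with the most likely class
    predicted by a frozen copy of the classifier."""
    inputs = sample_inputs(n)
    return [(x, classify(x)) for x in inputs]
```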
As generative model, a variational autoencoder (VAE; ref. ) was used, which consisted of an encoder network that mapped an input vector $x$ to a vector of latent variables $z$, and a decoder network that mapped those latent variables back to a reconstructed or decoded input vector $\tilde{x}$. The architecture of these two networks was kept similar to that of the base neural network: for Split MNIST, the encoder and the decoder were both fully connected networks with two hidden layers of 400 ReLU units each; for Split CIFAR-100, the encoder consisted of the same five pre-trained convolutional layers as the base neural network followed by two fully connected layers with 2,000 ReLU units, and the decoder consisted of two fully connected layers with 2,000 ReLU units followed by five deconvolutional (or transposed convolutional) layers that mirrored the convolutional layers, with the final deconvolutional layer having 3 output channels. The first four deconvolutional layers used a kernel with a padding of 1 and a stride of 2 (that is, the image size was doubled in each of those layers), while the final layer used a kernel with a padding of 1 and a stride of 1 (that is, no upsampling). Batch-norm and ReLU non-linearities were used in all deconvolutional layers except for the last one. For both context sets, the VAE's latent variable layer had 100 Gaussian units. The prior over the latent variables was the standard normal distribution.
For a given input $x$, the loss function for training the parameters of the VAE was:

$$\mathcal{L}^{\text{generative}}(x) = \mathcal{L}^{\text{latent}}(x) + \mathcal{L}^{\text{recon}}(x) \qquad (20)$$
The first term in equation (20), the 'latent variable regularization term', was given by:

$$\mathcal{L}^{\text{latent}}(x) = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log\left(\sigma_j^2\right) - \mu_j^2 - \sigma_j^2 \right)$$
with $d$ the number of latent variables, and $\mu_j$ and $\sigma_j$ the $j$th elements of $\mu^{(x)}$ and $\sigma^{(x)}$, which were the outputs of the encoder network for input $x$. The second term in equation (20), the 'reconstruction term', was given by the squared error between the original and decoded pixel values:

$$\mathcal{L}^{\text{recon}}(x) = \sum_{p=1}^{P} \left( x_p - \tilde{x}_p \right)^2$$
whereby $x_p$ was the value of the $p$th pixel of the original input image $x$, $\tilde{x}_p$ was the value of the $p$th pixel of the decoded image $\tilde{x}$, and $P$ was the number of pixels, with $\tilde{x}$ obtained by decoding $z = \mu^{(x)} + \sigma^{(x)} \odot \epsilon$ and $\epsilon$ sampled from $\mathcal{N}(0, I)$.
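A sketch of the two loss terms with NumPy (function names are ours; `mu` and `log_var` stand for the encoder outputs, with the variance parameterized on the log scale, a common but here assumed convention):

```python
import numpy as np

def latent_reg_term(mu, log_var):
    """Latent variable regularization term: KL-divergence of the
    diagonal Gaussian N(mu, sigma^2) from the standard normal prior,
    summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def recon_term(x, x_decoded):
    """Reconstruction term: squared error between the original and
    the decoded pixel values."""
    return np.sum((x - x_decoded) ** 2)

def vae_loss(x, mu, log_var, x_decoded):
    """Total VAE loss for a single input, as in equation (20)."""
    return latent_reg_term(mu, log_var) + recon_term(x, x_decoded)
```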
Training of the generative model was also done with generative replay, which was provided by its own copy stored after finishing training on the previous context. The loss terms on the current and the replayed data were weighted similarly as for the classifier:

$$\mathcal{L}^{\text{total}} = \frac{1}{\tau}\,\mathcal{L}^{\text{generative, current}} + \left(1 - \frac{1}{\tau}\right)\mathcal{L}^{\text{generative, replay}}$$
BI-R. For the method BI-R, we followed the protocol as described in the original paper. For Split CIFAR-100, all five of the proposed modifications relative to DGR were used: distillation, replay-through-feedback, conditional replay, gating based on internal context and internal replay. For Split MNIST, internal replay was not used, but the other four components were used. We did not combine BI-R with SI. The hyperparameter that controlled the proportion of hidden units in the decoder gated per class was set based on a grid search (Supplementary Note 8).
Compared with ref. there were two slight differences: (1) here we used a different set of pre-trained convolutional layers for each random seed, while ref. always used the same pre-trained convolutional layers; and (2) in the class-incremental learning scenario, here we used a softmax layer with the output units of all classes always set to active, while ref. used an 'expanding head' (that is, only the output units of classes seen so far were set to active).

Template-based classification methods

Although for the context sets considered in this article, the template-based classification methods could, in theory, be used for all three continual learning scenarios, we considered them only for class-incremental learning. This was because, from an incremental-learning perspective, the specific benefit of template-based classification (that is, rephrasing a class-incremental learning problem as a task-incremental learning problem, see Supplementary Note 4) is only relevant in that scenario.
Generative classifier. For the generative classifier, a separate VAE model was trained for each class to be learned. Training of these models was done as described above for DGR, except that no replay was used and each VAE was only trained on the examples from its own class. Each class-specific VAE was trained for either 1,000 iterations (Split MNIST) or 500 iterations (Split CIFAR-100), which meant that the total number of training iterations was the same as for the other methods. The mini-batch size was also the same: 128 for Split MNIST and 256 for Split CIFAR-100.
The architecture of the VAE models was chosen so that the total number of parameters of the generative classifier was similar to the number of parameters used by generative replay. For Split MNIST, the encoder and the decoder were both fully connected networks with two hidden layers of 85 ReLU units each and the latent variable layer had five units. For Split CIFAR-100, the pre-trained convolutional layers were used as a feature extractor, and the VAE models were trained on the extracted features rather than on the raw inputs (that is, the reconstruction loss was in the feature space instead of at the pixel level). The encoder and decoder both had one fully connected hidden layer with 85 ReLU and a latent variable layer with 20 units.
Classification was performed based on Bayes' rule: a test sample was classified as the class under whose generative model it was estimated to be most likely. That is, the output class label $\hat{y}$ predicted for an input $x$ was given by:

$$\hat{y} = \underset{c}{\arg\max}\; \hat{p}(x \mid y = c)$$
whereby $\hat{p}(x \mid y = c)$ was the estimated likelihood of input $x$ under the generative model of class $c$. These likelihoods were estimated using importance sampling:

$$\hat{p}(x \mid y = c) = \frac{1}{S} \sum_{s=1}^{S} \frac{p\left(x \mid z^{(s)}\right)\, p\left(z^{(s)}\right)}{q\left(z^{(s)} \mid x\right)}$$

with $\mu^{(x)}$ and $\sigma^{(x)}$ the outputs of the encoder network for input $x$, $S$ the number of importance samples and $z^{(s)}$ the $s$th importance sample drawn from the proposal distribution $q(z \mid x) = \mathcal{N}\left(z; \mu^{(x)}, \mathrm{diag}\left(\left(\sigma^{(x)}\right)^2\right)\right)$, with the decoder network providing the decoded input for each importance sample. In this notation, $\mathcal{N}(\cdot\,; m, \Sigma)$ indicates the probability density under the multivariate normal distribution with mean $m$ and covariance matrix $\Sigma$. Similar to ref. , we used a fixed number $S$ of importance samples per likelihood estimation.
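The importance-sampling estimate can be sketched with NumPy (the `encode`/`decode` callables and `recon_var`, the variance assumed for p(x|z), are illustrative assumptions; the log-sum-exp trick is used for numerical stability):

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance (multivariate) normal."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def estimate_log_likelihood(x, encode, decode, n_samples, recon_var=1.0):
    """Importance-sampling estimate of log p(x) under one class-VAE.

    encode(x) returns (mu, var) of the proposal q(z|x); decode(z)
    returns the decoded input. The prior over z is N(0, I) and
    p(x|z) is taken to be Gaussian around the decoded input.
    """
    mu, var = encode(x)
    log_ws = []
    for _ in range(n_samples):
        z = mu + np.sqrt(var) * np.random.randn(*mu.shape)
        log_w = (log_gauss(x, decode(z), recon_var)    # log p(x|z)
                 + log_gauss(z, 0.0, np.ones_like(z))  # log p(z)
                 - log_gauss(z, mu, var))              # -log q(z|x)
        log_ws.append(log_w)
    # log of the average importance weight (log-sum-exp for stability)
    m = max(log_ws)
    return m + np.log(np.mean(np.exp(np.array(log_ws) - m)))
```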
iCaRL. The method iCaRL (ref. ) used a neural network for feature extraction and then performed classification based on a nearest-class-mean rule in that feature space, whereby the class means were calculated from stored data. To protect the feature extractor network from becoming unsuitable for previously learned contexts, iCaRL also replayed the stored data, as well as the inputs from the current context with a special form of distillation, during training of the feature extractor.
For the feature extractor we used the base neural network, except with the softmax output layer removed. We denote this feature extractor by $f(\cdot\,; \theta)$. Its parameters were trained based on a binary classification/distillation loss. For this, during training only, a sigmoid output layer was appended to $f$. The resulting extended network outputs, for any output class $c$, a binary probability $\hat{p}_c(x; \vartheta)$ of whether input $x$ belongs to that class, with $\vartheta$ a vector containing all the trainable parameters of iCaRL. Whenever a new output class was encountered, new parameters were added to $\vartheta$.
In each context, the parameters in $\vartheta$ were trained on an extended dataset containing the current context's training data as well as all stored data in the memory buffer. When training on context $\tau$, each input $x$ with hard target $y$ in this extended dataset was paired with a new target vector $\bar{o}$ whose element $\bar{o}_c$ was given by:

$$\bar{o}_c = \begin{cases} \hat{p}_c\left(x; \hat{\vartheta}^{(\tau-1)}\right) & \text{if } c \text{ is a class from a previous context} \\ 1_{[y = c]} & \text{if } c \text{ is a class from the current context} \end{cases}$$

where $1_{[y = c]}$ equals 1 if $y = c$ and 0 otherwise,
and whereby $\hat{\vartheta}^{(\tau-1)}$ is the vector with parameter values at the end of training on context $\tau-1$. The binary classification/distillation loss function for an input $x$ labelled with such an 'old-context-soft-target/new-context-hard-target' vector $\bar{o}$ was then given by:

$$\mathcal{L}(x, \bar{o}; \vartheta) = -\sum_{c} \left[\, \bar{o}_c \log \hat{p}_c(x; \vartheta) + \left(1 - \bar{o}_c\right) \log\left(1 - \hat{p}_c(x; \vartheta)\right) \right]$$
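The binary classification/distillation loss can be sketched with NumPy (names are ours; `targets` is the mixed soft/hard target vector described above):

```python
import numpy as np

def binary_cls_distill_loss(logits, targets):
    """Binary cross-entropy, summed over all output classes, between
    the sigmoid of each class's output unit and the mixed
    soft-target (old classes) / hard-target (new classes) vector."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.sum(targets * np.log(p) + (1 - targets) * np.log(1 - p))
```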
After finishing training on a context, data to be added to the memory buffer were selected as follows. For each new output class $c$, a set of samples (or 'exemplars') was iteratively selected based on their extracted feature vectors, according to a procedure referred to as 'herding'. In each iteration, a new sample from output class $c$ was selected such that the average feature vector over all selected exemplars was as close as possible to the average feature vector over all available examples of class $c$. Let $P_c$ be the set of all available examples of class $c$ and let $\mu_c$ be the average feature vector over the set $P_c$. The $i$th exemplar $e_c^{(i)}$ to be selected for output class $c$ was then given by:

$$e_c^{(i)} = \underset{x \in P_c}{\arg\min} \left\| \mu_c - \frac{1}{i}\left( f(x; \theta) + \sum_{j=1}^{i-1} f\left(e_c^{(j)}; \theta\right) \right) \right\|$$
This resulted in an ordered set of exemplars for each new output class that was stored in the memory buffer.
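Herding can be sketched with NumPy as follows (a sketch of the selection rule above; like the rule itself, this version does not explicitly exclude already-selected examples):

```python
import numpy as np

def herd(features, n_store):
    """Herding: iteratively pick the example that keeps the running
    mean of the selected feature vectors closest to the class mean.

    features: (n_examples, d) array of extracted feature vectors of
    one class. Returns the indices of the chosen exemplars, in order.
    """
    class_mean = features.mean(axis=0)
    chosen, running_sum = [], np.zeros_like(class_mean)
    for i in range(1, n_store + 1):
        # mean feature vector if each candidate were selected next
        candidate_means = (running_sum + features) / i
        dists = np.linalg.norm(class_mean - candidate_means, axis=1)
        idx = int(np.argmin(dists))
        chosen.append(idx)
        running_sum += features[idx]
    return chosen
```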
Finally, classification was performed based on a nearest-class-mean rule in feature space, whereby the class means were calculated from the stored exemplars. For this, let $\bar{\mu}_c$ be the average feature vector over the stored exemplars of class $c$. The output class label $\hat{y}$ predicted for a new input $x$ was then given by:

$$\hat{y} = \underset{c}{\arg\min} \left\| f(x; \theta) - \bar{\mu}_c \right\|$$
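The nearest-class-mean rule can be sketched with NumPy (`class_means` maps each class label to the mean stored feature vector; names are illustrative):

```python
import numpy as np

def ncm_classify(feature, class_means):
    """Nearest-class-mean rule: predict the class whose mean stored
    feature vector is closest (in Euclidean distance) to the
    extracted feature vector of the test input."""
    classes = sorted(class_means)
    dists = [np.linalg.norm(feature - class_means[c]) for c in classes]
    return classes[int(np.argmin(dists))]
```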

Data availability

All datasets used in this study are freely available online resources: http://yann.lecun.com/exdb/mnist/ (MNIST) and https://www.cs.toronto.edu/~kriz/cifar.html (CIFAR-10 and CIFAR-100).

Code availability

Documented code that can be used to reproduce or build upon the reported experiments is available online under an MIT licence: https://github.com/GMvandeVen/continual-learning.

References 参考资料

  1. Chen, Z. & Liu, B. Lifelong machine learning. Synth. Lect. Artif. Intell. Mach. Learn. 12, 1-207 (2018).
    Chen, Z. & Liu, B. Lifelong machine learning.Synth.Lect.Artif.Intell.Mach.Learn.12, 1-207 (2018).
  2. Hadsell, R., Rao, D., Rusu, A. A. & Pascanu, R. Embracing change: continual learning in deep neural networks. Trends Cognit. Sci. 24,1028-1040 (2020).
    Hadsell, R., Rao, D., Rusu, A. A. & Pascanu, R. Embracing change: Continual learning in deep neural networks.Trends Cognit.24,1028-1040 (2020).
  3. McCloskey, M. & Cohen, N. J. In Psychology of Learning and Motivation Vol. 24, 109-165 (Elsevier, 1989).
    McCloskey, M. & Cohen, N. J. In Psychology of Learning and Motivation Vol. 24, 109-165 (Elsevier, 1989).
  4. French, R. M. Catastrophic forgetting in connectionist networks. Trends Cognit. Sci. 3, 128-135 (1999).
    French, R. M. Catastrophic forgetting in connectionist networks.Trends Cognit.3, 128-135 (1999).
  5. Kudithipudi, D. et al. Biological underpinnings for lifelong learning machines. Nat. Mach. Intell. 4, 196-210 (2022).
    Kudithipudi, D. 等人. 终身学习机器的生物学基础。Nat.Mach.Intell.4, 196-210 (2022).
  6. Lee, C. S. & Lee, A. Y. Clinical applications of continual learning machine learning. Lancet Digital Health 2, e279-e281 (2020).
    Lee, C. S. & Lee, A. Y. 《持续学习机器学习的临床应用》。Lancet Digital Health 2, e279-e281 (2020).
  7. Shaheen, K., Hanif, M. A., Hasan, O. & Shafique, M. Continual learning for real-world autonomous systems: Algorithms, challenges and frameworks. J. Intell. Robot. Syst. 105, 9 (2022).
    Shaheen, K., Hanif, M. A., Hasan, O. & Shafique, M. Continual learning for real-world autonomous systems:算法、挑战和框架。J. Intell.Robot.Syst.105, 9 (2022).
  8. Philps, D., Weyde, T., Garcez, A. d. & Batchelor, R. Continual learning augmented investment decisions. Preprint at https://arxiv.org/abs/1812.02340 (2018).
    Philps, D., Weyde, T., Garcez, A. d. & Batchelor, R. Continual learning augmented investment decisions.Preprint at https://arxiv.org/abs/1812.02340 (2018)。
  9. Mundt, M., Lang, S., Delfosse, Q. & Kersting, K. CLEVA-compass: A continual learning evaluation assessment compass to promote research transparency and comparability. In International Conference on Learning Representations (2022).
    Mundt, M., Lang, S., Delfosse, Q. & Kersting, K. CLEVA-compass:促进研究透明度和可比性的持续学习评估指南针。国际学习表征会议(2022 年)。
  10. Marr, D. Vision: A computational investigation into the human representation and processing of visual information (WH Freeman, 1982).
    Marr, D. Vision:人类视觉信息表征与处理的计算研究》(WH Freeman,1982 年)。
  11. Ruvolo, P. & Eaton, E. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning 507-515 (PMLR, 2013).
    Ruvolo, P. & Eaton, E. ELLA:高效的终身学习算法。国际机器学习大会,507-515(PMLR,2013 年)。
  12. Masse, N. Y., Grant, G. D. & Freedman, D. J. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proc. Natl Acad. Sci. USA 115, E10467-E1O475 (2018).
    Masse, N. Y., Grant, G. D. & Freedman, D. J. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization.Proc.Natl Acad.USA 115, E10467-E1O475 (2018)。
  13. Ramesh, R. & Chaudhari, P. Model Zoo: A growing brain that learns continually. In International Conference on Learning Representations (2O22).
    Ramesh, R. & Chaudhari, P. 《动物园模型》:不断学习的成长大脑。学习表征国际会议(2O22)。
  14. Lopez-Paz, D. & Ranzato, M. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems Vol. 30, 6470-6479 (2017).
    Lopez-Paz, D. & Ranzato, M. Gradient episodic memory for continual learning.神经信息处理系统进展》第 30 卷,6470-6479(2017 年)。
  15. Vogelstein, J. T. et al. Representation ensembling for synergistic lifelong learning with quasilinear complexity. Preprint at https:// arxiv.org/abs/2004.12908 (2020).
    Vogelstein, J. T. 等人. 具有准线性复杂性的协同终身学习的表征集合。Preprint at https:// arxiv.org/abs/2004.12908 (2020).
  16. Ke, Z., Liu, B., Xu, H. & Shu, L. CLASSIC: Continual and contrastive learning of aspect sentiment classification tasks. In Proc. 2021 Conference on Empirical Methods in Natural Language Processing 6871-6883 (Association for Computational Linguistics, 2021).
    Ke, Z., Liu, B., Xu, H. & Shu, L. CLASSIC: Continual and contrastive learning of aspect sentiment classification tasks.自然语言处理实证方法 2021 年会议论文集》,6871-6883(计算语言学协会,2021 年)。
  17. Mirza, M. J., Masana, M., Possegger, H. & Bischof, H. An efficient domain-incremental learning approach to drive in all weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 3001-3011 (2022).
    Mirza, M. J., Masana, M., Possegger, H. & Bischof, H. An efficient domain-incremental learning approach to drive in all weather conditions.在 IEEE/CVF 计算机视觉和模式识别(CVPR)研讨会论文集 3001-3011 (2022) 中。
  18. Aljundi, R., Chakravarty, P. & Tuytelaars, T. Expert gate: Lifelong learning with a network of experts. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3366-3375 (2017).
    Aljundi, R., Chakravarty, P. & Tuytelaars, T. Expert gate:专家网络终身学习。In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3366-3375 (2017).
  19. von Oswald, J., Henning, C., Sacramento, J. & Grewe, B. F. Continual learning with hypernetworks. In International Conference on Learning Representations (2020).
    von Oswald, J., Henning, C., Sacramento, J. & Grewe, B. F. Continual learning with hypernetworks.学习表征国际会议(2020)。
  20. Wortsman, M. et al. Supermasks in superposition. In Advances in Neural Information Processing Systems Vol. 33, 15173-15184 (2020).
    Wortsman, M. 等人. 叠加中的超级任务。In Advances in Neural Information Processing Systems Vol. 33, 15173-15184 (2020).
  21. Henning, C. et al. Posterior meta-replay for continual learning. In Advances in Neural Information Processing Systems Vol. 34, 14135-14149 (2021).
    亨宁(Henning)、C.等人. 持续学习的后元回放。神经信息处理系统进展》第 34 卷,14135-14149(2021 年)。
  22. Verma, V. K., Liang, K. J., Mehta, N., Rai, P. & Carin, L. Efficient feature transformations for discriminative and generative continual learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 13865-13875 (2021).
    Verma, V. K., Liang, K. J., Mehta, N., Rai, P. & Carin, L. Efficient feature transformations for discriminative and generative continual learning.In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 13865-13875 (2021).
  23. Heald, J. B., Lengyel, M. & Wolpert, D. M. Contextual inference underlies the learning of sensorimotor repertoires. Nature 600, 489-493 (2021).
    Heald, J. B., Lengyel, M. & Wolpert, D. M. Contextual inference underlies the learning of sensorimotor repertoires.自然》600 卷,489-493 页(2021 年)。
  24. Lomonaco, V. & Maltoni, D. Core5O: a new dataset and benchmark for continuous object recognition. In Conference on Robot Learning 17-26 (PMLR, 2017).
    Lomonaco, V. & Maltoni, D. Core5O:连续物体识别的新数据集和基准。第 17-26 届机器人学习大会(PMLR,2017 年)。
  25. Rebuffi, S.-A., Kolesnikov, A., Sperl, G. & Lampert, C. H. icarl: Incremental classifier and representation learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2001-2010 (2017).
    Rebuffi, S.-A., Kolesnikov, A., Sperl, G. & Lampert, C. H. icarl: Incremental classifier and representation learning.In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2001-2010 (2017).
  26. Tao, X. et al. Few-shot class-incremental learning. In Proc. IEEE/ CVF Conference on Computer Vision and Pattern Recognition 12183-12192 (2020).
    Tao, X. 等人. 少量类递增学习.In Proc. IEEE/ CVF Conference on Computer Vision and Pattern Recognition 12183-12192 (2020).
  27. Shin, H., Lee, J. K., Kim, J. & Kim, J. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems Vol. 30, 2994-3003 (2017).
    Shin, H., Lee, J. K., Kim, J. & Kim, J. Continual learning with deep generative replay.In Advances in Neural Information Processing Systems Vol. 30, 2994-3003 (2017).
  28. van de Ven, G. M., Siegelmann, H. T. & Tolias, A. S. Brain-inspired replay for continual learning with artificial neural networks. Nat. Commun. 11, 4069 (2020).
    van de Ven, G. M., Siegelmann, H. T. & Tolias, A. S. Brain-inspired replay for continual learning with artificial neural networks.Nat.Nat.11, 4069 (2020).
  29. Belouadah, E., Popescu, A. & Kanellos, I. A comprehensive study of class incremental learning algorithms for visual tasks. Neural Networks 135, 38-54 (2021).
    Belouadah, E., Popescu, A. & Kanellos, I. A comprehensive study of class incremental learning algorithms for visual tasks.神经网络 135, 38-54 (2021).
  30. Masana, M. et al. Class-incremental learning: survey and performance evaluation on image classification. In IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE, 2O22). https://doi.org/10.1109/TPAMI.2O22.3213473
    Masana, M. 等人. 类别递增学习:图像分类调查与性能评估.In IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE, 2O22). https://doi.org/10.1109/TPAMI.2O22.3213473
  31. Zenke, F., Poole, B. & Ganguli, S. Continual learning through synaptic intelligence. In International Conference on Machine Learning 3987-3995 (PMLR, 2017).
    Zenke, F., Poole, B. & Ganguli, S. 通过突触智能持续学习。国际机器学习大会,3987-3995(PMLR,2017 年)。
  32. Zeng, G., Chen, Y., Cui, B. & Yu, S. Continual learning of context-dependent processing in neural networks. Nat. Mach. Intell. 1, 364-372 (2019).
    Zeng, G., Chen, Y., Cui, B. & Yu, S. Continual learning of context-dependent processing in neural networks.Nat.Mach.Intell.1, 364-372 (2019).
  33. Aljundi, R., Kelchtermans, K. & Tuytelaars, T. Task-free continual learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 11254-11263 (2019).
    Aljundi, R., Kelchtermans, K. & Tuytelaars, T. Task-free continual learning.In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 11254-11263 (2019).
  34. Zeno, C., Golan, I., Hoffer, E. & Soudry, D. Task agnostic continual learning using online variational bayes. Preprint at https://arxiv.org/abs/1803.10123v3 (2019).
    Zeno, C., Golan, I., Hoffer, E. & Soudry, D. Task agnostic continual learning using online variational bayes.Preprint at https://arxiv.org/abs/1803.10123v3 (2019)。
  35. Rao, D. et al. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems Vol. 32, 7647-7657 (2019).
    Rao, D. 等人. 持续无监督表征学习。In Advances in Neural Information Processing Systems Vol. 32, 7647-7657 (2019).
  36. De Lange, M. & Tuytelaars, T. Continual prototype evolution: Learning online from non-stationary data streams. In Proc. IEEE/CVF International Conference on Computer Vision 8250-8259 (2021).
    De Lange, M. & Tuytelaars, T. Continual prototype evolution:从非稳态数据流中在线学习。In Proc. IEEE/CVF International Conference on Computer Vision 8250-8259 (2021).
  37. Li, S., Du, Y., van de Ven, G. M. & Mordatch, I. Energy-based models for continual learning. Preprint at https://arxiv.org/ (2020).
    Li, S., Du, Y., van de Ven, G. M. & Mordatch, I. 基于能量的持续学习模型。预印本:https://arxiv.org/ (2020)。
  38. Hayes, T. L. & Kanan, C. Lifelong machine learning with deep streaming linear discriminant analysis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 220-221 (2020).
    Hayes, T. L. & Kanan, C. 使用深度流线性判别分析的终身机器学习。In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 220-221 (2020).
  39. Mai, Z. et al. Online continual learning in image classification: An empirical survey. Neurocomputing 469, 28-51 (2022).
    Mai, Z. et al. 图像分类中的在线持续学习:实证调查。Neurocomputing 469, 28-51 (2022).
  40. Lesort, T., Caccia, M. & Rish, I. Understanding continual learning settings with data distribution drift analysis. Preprint at https://arxiv.org/abs/2104.01678 (2021).
    Lesort, T., Caccia, M. & Rish, I. 利用数据分布漂移分析理解持续学习设置。预印本:https://arxiv.org/abs/2104.01678 (2021)。
  41. Lomonaco, V. et al. Avalanche: an end-to-end library for continual learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 3600-3610 (2021).
    Lomonaco, V. et al. Avalanche: an end-to-end library for continual learning.In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 3600-3610 (2021).
  42. Gepperth, A. & Hammer, B. Incremental learning algorithms and applications. In European Symposium on Artificial Neural Networks (ESANN) (2016).
    Gepperth, A. & Hammer, B. Incremental learning algorithms and applications.欧洲人工神经网络研讨会(ESANN)(2016 年)。
  43. Stojanov, S. et al. Incremental object learning from contiguous views. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 8777-8786 (2019).
    Stojanov, S. 等人. 来自连续视图的增量对象学习。In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 8777-8786 (2019).
  44. Caccia, L., Belilovsky, E., Caccia, M. & Pineau, J. Online learned continual compression with adaptive quantization modules. In International Conference on Machine Learning 1240-1250 (PMLR, 2020).
    Caccia, L., Belilovsky, E., Caccia, M. & Pineau, J. Online learned continual compression with adaptive quantization modules.国际机器学习大会 1240-1250 (PMLR, 2020).
  45. Cossu, A. et al. Is class-incremental enough for continual learning? Front. Artif. Intell. 5, 829842 (2022).
    Cossu, A. et al.Front.Artif.Intell.5, 829842 (2022).
  46. Lee, S., Ha, J., Zhang, D. & Kim, G.A neural dirichlet process mixture model for task-free continual learning. In International Conference on Learning Representations (2020).
    Lee, S., Ha, J., Zhang, D. & Kim, G. A neural dirichlet process mixture model for task-free continual learning.国际学习表征会议(2020)。
  47. Jin, X., Sadhu, A., Du, J. & Ren, X. Gradient-based editing of memory examples for online task-free continual learning. In Advances in Neural Information Processing Systems Vol. 34, 29193-29205 (2021).
    Jin, X., Sadhu, A., Du, J. & Ren, X. Gradient-based editing of memory examples for online task-free continual learning.神经信息处理系统进展》第 34 卷,29193-29205(2021 年)。
  48. Shanahan, M., Kaplanis, C. & Mitrović, J. Encoders and ensembles for task-free continual learning. Preprint at https://arxiv.org/ abs/2105.13327 (2021).
    Shanahan, M., Kaplanis, C. & Mitrović, J. Encoders and ensembles for task-free continual learning.预印本:https://arxiv.org/ abs/2105.13327 (2021)。
  49. Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl Acad. Sci. USA 114, 3521-3526 (2017).
    Kirkpatrick, J. 等人. 克服神经网络中的灾难性遗忘.Proc.Natl Acad.USA 114, 3521-3526 (2017)。
  50. Li, Z. & Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 40, 2935-2947 (2017).
    Li, Z. & Hoiem, D. Learning without forgetting.IEEE Trans.Pattern Anal.Mach.Intell.40, 2935-2947 (2017).
  51. Pan, P. et al. Continual deep learning by functional regularisation of memorable past. In Advances in Neural Information Processing Systems Vol. 33, 4453-4464 (2020).
    Pan, P. 等人. 通过记忆性过去的功能正则化进行持续深度学习。In Advances in Neural Information Processing Systems Vol. 33, 4453-4464 (2020).
  52. Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T. & Wayne, G. Experience replay for continual learning. In Advances in Neural Information Processing Systems Vol. 32 (2019).
    Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T. & Wayne, G. Experience replay for continual learning.神经信息处理系统进展》第 32 卷(2019 年)。
  53. Chaudhry, A. et al. On tiny episodic memories in continual learning. Preprint at https://arxiv.org/abs/1902.10486 (2019).
    Chaudhry, A. 等人. 论持续学习中的微小外显记忆.Preprint at https://arxiv.org/abs/1902.10486 (2019).
  54. Chaudhry, A., Ranzato, M., Rohrbach, M. & Elhoseiny, M. Efficient lifelong learning with a-gem. In International Conference on Learning Representations (2019).
    Chaudhry, A., Ranzato, M., Rohrbach, M. & Elhoseiny, M. Efficient lifelong learning with a-gem.学习表征国际会议(2019)。
  55. van de Ven, G. M., Li, Z. & Tolias, A. S. Class-incremental learning with generative classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 3611-3620 (2021).
  56. Lesort, T., Caselles-Dupré, H., Garcia-Ortiz, M., Stoian, A. & Filliat, D. Generative models from the perspective of continual learning. In International Joint Conference on Neural Networks (IEEE, 2019).
  57. Aljundi, R. et al. Online continual learning with maximally interfered retrieval. In Advances in Neural Information Processing Systems Vol. 32 (2019).
  58. van de Ven, G. M. & Tolias, A. S. Generative replay with feedback connections as a general strategy for continual learning. Preprint at https://arxiv.org/abs/1809.10635 (2018).
  59. van de Ven, G. M. & Tolias, A. S. Three scenarios for continual learning. Preprint at https://arxiv.org/abs/1904.07734 (2019).
  60. Farquhar, S. & Gal, Y. Towards robust evaluations of continual learning. Preprint at https://arxiv.org/abs/1805.09733 (2018).
  61. Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A. & Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. Preprint at https://arxiv.org/ (2013).
  62. Douillard, A. & Lesort, T. Continuum: Simple management of complex continual learning scenarios. Preprint at https://arxiv.org/abs/2102.06253 (2021).
  63. Normandin, F. et al. Sequoia: A software framework to unify continual learning research. Preprint at https://arxiv.org/abs/2108.01005 (2021).
  64. Hess, T., Mundt, M., Pliushch, I. & Ramesh, V. A procedural world generation framework for systematic evaluation of continual learning. In Thirty-fifth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (2021).
  65. Paszke, A. et al. Automatic differentiation in pytorch. In NeurIPS Autodiff Workshop (2017).
  66. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278-2324 (1998).
  67. Krizhevsky, A., Hinton, G. et al. Learning Multiple Layers of Features from Tiny Images (University of Toronto, 2009).
  68. Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning 448-456 (PMLR, 2015).
  69. Maltoni, D. & Lomonaco, V. Continuous learning in single-incremental-task scenarios. Neural Networks 116, 56-73 (2019).
  70. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
  71. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at https://arxiv.org/abs/1503.02531 (2015).
  72. Kingma, D. P. & Welling, M. Auto-encoding variational bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).
  73. Zeiler, M. D., Taylor, G. W., Fergus, R. et al. Adaptive deconvolutional networks for mid and high level feature learning. In International Conference on Computer Vision 2018-2025 (IEEE, 2011).
  74. Rezende, D. & Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning 1530-1538 (PMLR, 2015).
  75. van de Ven, G. M. GMvandeVen/continual-learning: v1.0.0 (2022). https://doi.org/10.5281/zenodo.7189378

Acknowledgements

We thank K. Jensen, M. Mundt, W. Barfuss, T. Hess and M. De Lange for insightful comments. This research project was supported by an IBRO-ISN Research Fellowship (to G.M.v.d.V.), by the ERC-funded project KeepOnLearning (reference number 101021347; to T.T.), by the National Institutes of Health (NIH) under awards R01MH109556 (NIH/NIMH; to A.S.T.) and P30EY002520 (NIH/NEI; to A.S.T.), by the Lifelong Learning Machines (L2M) programme of the Defense Advanced Research Projects Agency (DARPA) via contract number HR0011-18-2-0025 (to A.S.T.) and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003 (to A.S.T.). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NIH, DARPA, IARPA, DoI/IBC or the US government.

Author contributions

Conceptualization, G.M.v.d.V, T.T. and A.S.T.; formal analysis, G.M.v.d.V.; funding acquisition, A.S.T., T.T. and G.M.v.d.V.; investigation, G.M.v.d.V.; methodology, G.M.v.d.V.; resources, A.S.T.; software, G.M.v.d.V.; supervision, A.S.T. and T.T.; visualization, G.M.v.d.V.; writing - original draft, G.M.v.d.V.; writing - review and editing, G.M.v.d.V., T.T. and A.S.T.

Competing interests

The authors declare no competing interests.

Additional information

Extended data is available for this paper at
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s42256-022-00568-3.
Correspondence and requests for materials should be addressed to Gido M. van de Ven.
Peer review information Nature Machine Intelligence thanks Thomas Miconi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2022
Extended Data Fig. 1 | Comparison of methods using stored data with different buffer sizes. Shown is the average test accuracy (over all contexts) of different methods that use stored data on Split MNIST (a) and on Split CIFAR-100 (b) as a function of the number of examples per class that is allowed to be stored in memory. Due to its high computational costs, we were not able to run FROMP on Split CIFAR-100, or on Split MNIST with a memory budget above 1,000 samples per class. Displayed are the means over 5 repetitions; shaded areas are SEM. ER: exact replay, A-GEM: averaged gradient episodic memory, FROMP: functional regularization of the memorable past, iCaRL: incremental classifier and representation learning, None: sequential training in the standard way, Joint: training on all data at the same time.

Center for Neuroscience and Artificial Intelligence, Department of Neuroscience, Baylor College of Medicine, Houston, TX, USA. Computational and Biological Learning Lab, Department of Engineering, University of Cambridge, Cambridge, UK. Processing Speech and Images, Department of Electrical Engineering, KU Leuven, Leuven, Belgium. Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA. e-mail: gido.vandeven@kuleuven.be