
Deep learning

Yann LeCun$^{1,2}$, Yoshua Bengio$^{3}$ & Geoffrey Hinton$^{4,5}$

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.
Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition$^{1-4}$ and speech recognition$^{5-7}$, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules$^{8}$, analysing particle accelerator data$^{9,10}$, reconstructing brain circuits$^{11}$, and predicting the effects of mutations in non-coding DNA on gene expression and disease$^{12,13}$. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding$^{14}$, particularly topic classification, sentiment analysis, question answering$^{15}$ and language translation$^{16,17}$.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

Supervised learning

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input-output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.
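To make the scores-and-objective setup concrete, here is a minimal Python sketch; the category names, score values and squared-error objective are illustrative choices for this example only, not values from the article.

```python
import numpy as np

# Hypothetical output scores produced by the machine for one training image,
# one score per category.
categories = ["house", "car", "person", "pet"]
scores = np.array([1.2, 0.3, -0.5, 2.1])

# Desired pattern of scores for an image labelled "house", encoded one-hot:
# the correct category should receive the highest score.
target = np.array([1.0, 0.0, 0.0, 0.0])

# The predicted category is simply the one with the highest score.
predicted = categories[int(np.argmax(scores))]

# One possible objective: squared error between output and desired scores.
error = 0.5 * np.sum((scores - target) ** 2)

print(predicted, error)  # before training, the prediction is often wrong
```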

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.
In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques$^{18}$. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine - its ability to produce sensible answers on new inputs that it has never seen during training.
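A minimal sketch of this SGD procedure for a linear model with a squared-error objective; the synthetic data, learning rate and batch size below are illustrative assumptions, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: inputs X and targets y from a noisy linear rule.
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)            # adjustable parameters ('knobs')
learning_rate = 0.1
batch_size = 32

for step in range(500):
    # Show the input vectors for a few examples...
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    # ...compute the outputs and the errors...
    outputs = xb @ w
    errors = outputs - yb
    # ...compute the average gradient for those examples...
    grad = xb.T @ errors / batch_size
    # ...and adjust the weights in the opposite direction to the gradient.
    w -= learning_rate * grad
```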
Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
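Such a two-class linear classifier amounts to a dot product followed by a threshold; a minimal sketch, with made-up feature values, weights and threshold:

```python
import numpy as np

def linear_classify(features, weights, threshold):
    """Return True if the weighted sum of the feature-vector components
    exceeds the threshold, i.e. the input is assigned to the category."""
    return float(np.dot(weights, features)) > threshold

# Hypothetical hand-engineered feature vector and learned weights.
features = np.array([0.8, -1.2, 0.5])
weights = np.array([2.0, 0.7, -1.1])

print(linear_classify(features, weights, threshold=0.0))
```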

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane$^{19}$. But problems such as image and speech recognition require the input-output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other.


Figure 1 | Multilayer neural networks and backpropagation. a, A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid (shown on the left) in input space is also transformed (shown in the middle panel) by hidden units. This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/). b, The chain rule of derivatives tells us how two small effects (that of a small change of $x$ on $y$, and that of $y$ on $z$) are composed. A small change $\Delta x$ in $x$ gets transformed first into a small change $\Delta y$ in $y$ by getting multiplied by $\partial y/\partial x$ (that is, the definition of partial derivative). Similarly, the change $\Delta y$ creates a change $\Delta z$ in $z$. Substituting one equation into the other gives the chain rule of derivatives: how $\Delta x$ gets turned into $\Delta z$ through multiplication by the product of $\partial y/\partial x$ and $\partial z/\partial y$. It also works when $x$, $y$ and $z$ are vectors (and the derivatives are Jacobian matrices). c, The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input $z$ to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function $f(\cdot)$ is applied to $z$ to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) $f(z)=\max(0,z)$, commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent, $f(z)=(\exp(z)-\exp(-z))/(\exp(z)+\exp(-z))$, and the logistic function, $f(z)=1/(1+\exp(-z))$. d, The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of $f(z)$. At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives $y_l - t_l$ if the cost function for unit $l$ is $0.5(y_l - t_l)^2$, where $t_l$ is the target value. Once $\partial E/\partial z_k$ is known, the error derivative for the weight $w_{jk}$ on the connection from unit $j$ in the layer below is just $y_j \, \partial E/\partial z_k$.

Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.

A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity-invariance dilemma - one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods$^{20}$, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples$^{21}$. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input-output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details - distinguishing Samoyeds from white wolves - and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

Backpropagation to train multilayer architectures

From the earliest days of pattern recognition$^{22,23}$, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s$^{24-27}$.
The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.
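As an illustrative sketch of these forward and backward computations, following the equations summarized in Fig. 1 (ReLU hidden units, squared-error cost, bias terms omitted), here is a minimal NumPy version; the layer sizes, random weights, single input and linear output units are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Illustrative layer sizes: input -> hidden1 -> hidden2 -> output.
W1 = rng.normal(scale=0.1, size=(4, 8))
W2 = rng.normal(scale=0.1, size=(8, 8))
W3 = rng.normal(scale=0.1, size=(8, 3))

x = rng.normal(size=4)         # one input vector
t = np.array([1.0, 0.0, 0.0])  # target output

# Forward pass: each layer computes a weighted sum z, then applies f(z).
z1 = x @ W1;  h1 = relu(z1)
z2 = h1 @ W2; h2 = relu(z2)
z3 = h2 @ W3; y  = z3          # linear output units for simplicity

# Cost 0.5 * sum_l (y_l - t_l)^2, so dE/dy_l = y_l - t_l.
dE_dz3 = y - t

# Backward pass: propagate error derivatives back through each module.
# The gradient for a weight matrix is the outer product of the layer's
# input with the error derivative of its total input.
dE_dW3 = np.outer(h2, dE_dz3)
dE_dh2 = dE_dz3 @ W3.T
dE_dz2 = dE_dh2 * (z2 > 0)     # multiply by the gradient of the ReLU
dE_dW2 = np.outer(h1, dE_dz2)
dE_dh1 = dE_dz2 @ W2.T
dE_dz1 = dE_dh1 * (z1 > 0)
dE_dW1 = np.outer(x, dE_dz1)
```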

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier $f(z)=\max(z,0)$. In past decades, neural nets used smoother non-linearities, such as $\tanh(z)$ or $1/(1+\exp(-z))$, but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training$^{28}$. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima - weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the

$^{1}$Facebook AI Research, 770 Broadway, New York, New York 10003, USA. $^{2}$New York University, 715 Broadway, New York, New York 10003, USA. $^{3}$Department of Computer Science and Operations Research, Université de Montréal, Pavillon André-Aisenstadt, PO Box 6128 Centre-Ville STN, Montréal, Quebec H3C 3J7, Canada. $^{4}$Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, USA. $^{5}$Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 3G4, Canada.