
Chapter 4

Numerical Computation

Machine learning algorithms usually require a high amount of numerical computation. This typically refers to algorithms that solve mathematical problems by methods that update estimates of the solution via an iterative process, rather than analytically deriving a formula to provide a symbolic expression for the correct solution. Common operations include optimization (finding the value of an argument that minimizes or maximizes a function) and solving systems of linear equations. Even just evaluating a mathematical function on a digital computer can be difficult when the function involves real numbers, which cannot be represented precisely using a finite amount of memory.

4.1 Overflow and Underflow

The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.
One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number. For example, we usually want to avoid division by zero (some software environments will raise exceptions when this occurs, others will return a result with a placeholder not-a-number value) or taking the logarithm of zero (this is usually treated as \(-\infty\), which then becomes not-a-number if it is used for many further arithmetic operations).
Another highly damaging form of numerical error is overflow. Overflow occurs when numbers with large magnitude are approximated as \(\infty\) or \(-\infty\). Further arithmetic will usually change these infinite values into not-a-number values.
One example of a function that must be stabilized against underflow and overflow is the softmax function. The softmax function is often used to predict the probabilities associated with a multinoulli distribution. The softmax function is defined to be
\[
\operatorname{softmax}(\boldsymbol{x})_{i}=\frac{\exp \left(x_{i}\right)}{\sum_{j=1}^{n} \exp \left(x_{j}\right)} .
\]
Consider what happens when all the \(x_{i}\) are equal to some constant \(c\). Analytically, we can see that all the outputs should be equal to \(\frac{1}{n}\). Numerically, this may not occur when \(c\) has large magnitude. If \(c\) is very negative, then \(\exp (c)\) will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When \(c\) is very large and positive, \(\exp (c)\) will overflow, again resulting in the expression as a whole being undefined. Both of these difficulties can be resolved by instead evaluating \(\operatorname{softmax}(\boldsymbol{z})\) where \(\boldsymbol{z}=\boldsymbol{x}-\max _{i} x_{i}\). Simple algebra shows that the value of the softmax function is not changed analytically by adding or subtracting a scalar from the input vector. Subtracting \(\max _{i} x_{i}\) results in the largest argument to exp being 0, which rules out the possibility of overflow. Likewise, at least one term in the denominator has a value of 1, which rules out the possibility of underflow in the denominator leading to a division by zero.
There is still one small problem. Underflow in the numerator can still cause the expression as a whole to evaluate to zero. This means that if we implement \(\log \operatorname{softmax}(\boldsymbol{x})\) by first running the softmax subroutine then passing the result to the log function, we could erroneously obtain \(-\infty\). Instead, we must implement a separate function that calculates log softmax in a numerically stable way. The log softmax function can be stabilized using the same trick as we used to stabilize the softmax function.
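As a concrete illustration, the following is a minimal NumPy sketch of the stabilization trick described above; the function names are ours, not from any particular library.

import numpy as np

def stable_softmax(x):
    # Subtracting the maximum leaves the analytical result unchanged
    # but guarantees the largest argument to exp is 0, preventing overflow.
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

def stable_log_softmax(x):
    # Work in log space directly instead of computing log(softmax(x)),
    # so underflow in the numerator cannot produce log(0) = -inf.
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

x = np.array([1000.0, 1000.0, 1000.0])   # naive exp(x) would overflow
print(stable_softmax(x))                 # [1/3, 1/3, 1/3]
print(stable_log_softmax(x))             # [log(1/3)] * 3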
For the most part, we do not explicitly detail all the numerical considerations involved in implementing the various algorithms described in this book. Developers of low-level libraries should keep numerical issues in mind when implementing deep learning algorithms. Most readers of this book can simply rely on low-level libraries that provide stable implementations. In some cases, it is possible to implement a new algorithm and have the new implementation automatically stabilized. Theano (Bergstra et al., 2010; Bastien et al., 2012) is an example of a software package that automatically detects and stabilizes many common numerically unstable expressions that arise in the context of deep learning.

4.2 Poor Conditioning

Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.
Consider the function \(f(\boldsymbol{x})=\boldsymbol{A}^{-1} \boldsymbol{x}\). When \(\boldsymbol{A} \in \mathbb{R}^{n \times n}\) has an eigenvalue decomposition, its condition number is
\[
\max _{i, j}\left|\frac{\lambda_{i}}{\lambda_{j}}\right| .
\]
This is the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input.
This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself.
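As a brief numeric sketch (the matrix below is an assumed example, not from the text), the condition number can be computed from the eigenvalues, and a poorly conditioned matrix visibly amplifies a small input error when we apply its inverse:

import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1e-4]])              # eigenvalues 1 and 1e-4
lam = np.abs(np.linalg.eigvals(A))
print(np.max(lam) / np.min(lam))         # condition number of about 1e4

x = np.array([1.0, 1.0])
x_noisy = x + 1e-6                       # tiny rounding-style error in the input
print(np.linalg.solve(A, x))             # [1., 10000.]
print(np.linalg.solve(A, x_noisy))       # the 1e-6 input error grows to about 1e-2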

4.3 Gradient-Based Optimization

Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function \(f(\boldsymbol{x})\) by altering \(\boldsymbol{x}\). We usually phrase most optimization problems in terms of minimizing \(f(\boldsymbol{x})\). Maximization may be accomplished via a minimization algorithm by minimizing \(-f(\boldsymbol{x})\).
The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.
We often denote the value that minimizes or maximizes a function with a superscript \(*\). For example, we might say \(\boldsymbol{x}^{*}=\arg \min f(\boldsymbol{x})\).
Figure 4.1: Gradient descent. An illustration of how the gradient descent algorithm uses the derivatives of a function to follow the function downhill to a minimum.
We assume the reader is already familiar with calculus but provide a brief review of how calculus concepts relate to optimization here.
Suppose we have a function \(y=f(x)\), where both \(x\) and \(y\) are real numbers. The derivative of this function is denoted as \(f^{\prime}(x)\) or as \(\frac{d y}{d x}\). The derivative \(f^{\prime}(x)\) gives the slope of \(f(x)\) at the point \(x\). In other words, it specifies how to scale a small change in the input to obtain the corresponding change in the output: \(f(x+\epsilon) \approx f(x)+\epsilon f^{\prime}(x)\).
The derivative is therefore useful for minimizing a function because it tells us how to change \(x\) in order to make a small improvement in \(y\). For example, we know that \(f\left(x-\epsilon \operatorname{sign}\left(f^{\prime}(x)\right)\right)\) is less than \(f(x)\) for small enough \(\epsilon\). We can thus reduce \(f(x)\) by moving \(x\) in small steps with the opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847). See figure 4.1 for an example of this technique.
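A minimal sketch of this procedure in one dimension, using an assumed quadratic objective purely for illustration:

f = lambda x: (x - 3.0) ** 2        # objective with its minimum at x = 3
df = lambda x: 2.0 * (x - 3.0)      # derivative of the objective

x = 0.0                             # arbitrary starting point
epsilon = 0.1                       # step size
for _ in range(100):
    x = x - epsilon * df(x)         # step against the sign of the derivative
print(x)                            # approximately 3.0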
When \(f^{\prime}(x)=0\), the derivative provides no information about which direction to move. Points where \(f^{\prime}(x)=0\) are known as critical points, or stationary points. A local minimum is a point where \(f(x)\) is lower than at all neighboring points, so it is no longer possible to decrease \(f(x)\) by making infinitesimal steps. A local maximum is a point where \(f(x)\) is higher than at all neighboring points, so it is not possible to increase \(f(x)\) by making infinitesimal steps. Some critical points are neither maxima nor minima. These are known as saddle points. See figure 4.2 for examples of each type of critical point.

Figure 4.2: Types of critical points. Examples of the three types of critical points in one dimension. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points; a local maximum, which is higher than the neighboring points; or a saddle point, which has neighbors that are both higher and lower than the point itself.
A point that obtains the absolute lowest value of \(f(x)\) is a global minimum. It is possible for there to be only one global minimum or multiple global minima of the function. It is also possible for there to be local minima that are not globally optimal. In the context of deep learning, we optimize functions that may have many local minima that are not optimal and many saddle points surrounded by very flat regions. All of this makes optimization difficult, especially when the input to the function is multidimensional. We therefore usually settle for finding a value of \(f\) that is very low but not necessarily minimal in any formal sense. See figure 4.3 for an example.
We often minimize functions that have multiple inputs: \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\). For the concept of "minimization" to make sense, there must still be only one (scalar) output.
For functions with multiple inputs, we must make use of the concept of partial derivatives. The partial derivative \(\frac{\partial}{\partial x_{i}} f(\boldsymbol{x})\) measures how \(f\) changes as only the variable \(x_{i}\) increases at point \(\boldsymbol{x}\). The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector: the gradient of \(f\) is the vector containing all the partial derivatives, denoted \(\nabla_{\boldsymbol{x}} f(\boldsymbol{x})\). Element \(i\) of the gradient is the partial derivative of \(f\) with respect to \(x_{i}\). In multiple dimensions, critical points are points where every element of the gradient is equal to zero.

Figure 4.3: Approximate minimization. Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.
The directional derivative in direction \(\boldsymbol{u}\) (a unit vector) is the slope of the function \(f\) in direction \(\boldsymbol{u}\). In other words, the directional derivative is the derivative of the function \(f(\boldsymbol{x}+\alpha \boldsymbol{u})\) with respect to \(\alpha\), evaluated at \(\alpha=0\). Using the chain rule, we can see that \(\frac{\partial}{\partial \alpha} f(\boldsymbol{x}+\alpha \boldsymbol{u})\) evaluates to \(\boldsymbol{u}^{\top} \nabla_{\boldsymbol{x}} f(\boldsymbol{x})\) when \(\alpha=0\).
To minimize \(f\), we would like to find the direction in which \(f\) decreases the fastest. We can do this using the directional derivative:
\[
\min _{\boldsymbol{u}, \boldsymbol{u}^{\top} \boldsymbol{u}=1} \boldsymbol{u}^{\top} \nabla_{\boldsymbol{x}} f(\boldsymbol{x})=\min _{\boldsymbol{u}, \boldsymbol{u}^{\top} \boldsymbol{u}=1}\|\boldsymbol{u}\|_{2}\left\|\nabla_{\boldsymbol{x}} f(\boldsymbol{x})\right\|_{2} \cos \theta,
\]
where \(\theta\) is the angle between \(\boldsymbol{u}\) and the gradient. Substituting in \(\|\boldsymbol{u}\|_{2}=1\) and ignoring factors that do not depend on \(\boldsymbol{u}\), this simplifies to \(\min _{\boldsymbol{u}} \cos \theta\). This is minimized when \(\boldsymbol{u}\) points in the opposite direction as the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease \(f\) by moving in the direction of the negative gradient. This is known as the method of steepest descent, or gradient descent.
Steepest descent proposes a new point
\[
\boldsymbol{x}^{\prime}=\boldsymbol{x}-\epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x}),
\]
where \(\epsilon\) is the learning rate, a positive scalar determining the size of the step. We can choose \(\epsilon\) in several different ways. A popular approach is to set \(\epsilon\) to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate \(f\left(\boldsymbol{x}-\epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x})\right)\) for several values of \(\epsilon\) and choose the one that results in the smallest objective function value. This last strategy is called a line search.
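A sketch of this last strategy, searching over a small assumed grid of candidate step sizes; the helper name and the test objective are our own illustrations:

import numpy as np

def grid_line_search(f, grad_f, x, candidates=(1e-3, 1e-2, 1e-1, 1.0)):
    # Evaluate f(x - epsilon * gradient) for several step sizes and keep the best.
    g = grad_f(x)
    best_eps = min(candidates, key=lambda eps: f(x - eps * g))
    return x - best_eps * g

f = lambda x: np.sum(x ** 2)            # simple quadratic bowl
grad_f = lambda x: 2.0 * x
x = np.array([4.0, -2.0])
for _ in range(50):
    x = grid_line_search(f, grad_f, x)
print(x)                                # close to the minimum at the origin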
Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases, we may be able to avoid running this iterative algorithm and just jump directly to the critical point by solving the equation \(\nabla_{\boldsymbol{x}} f(\boldsymbol{x})=0\) for \(\boldsymbol{x}\).
Although gradient descent is limited to optimization in continuous spaces, the general concept of repeatedly making a small move (that is approximately the best small move) toward better configurations can be generalized to discrete spaces. Ascending an objective function of discrete parameters is called hill climbing (Russel and Norvig, 2003).

4.3.1 Beyond the Gradient: Jacobian and Hessian Matrices

Sometimes we need to find all the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as a Jacobian matrix. Specifically, if we have a function \(\boldsymbol{f}: \mathbb{R}^{m} \rightarrow \mathbb{R}^{n}\), then the Jacobian matrix \(\boldsymbol{J} \in \mathbb{R}^{n \times m}\) of \(\boldsymbol{f}\) is defined such that \(J_{i, j}=\frac{\partial}{\partial x_{j}} f(\boldsymbol{x})_{i}\).
We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, for a function \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\), the derivative with respect to \(x_{i}\) of the derivative of \(f\) with respect to \(x_{j}\) is denoted as \(\frac{\partial^{2}}{\partial x_{i} \partial x_{j}} f\). In a single dimension, we can denote \(\frac{d^{2}}{d x^{2}} f\) by \(f^{\prime \prime}(x)\). The second derivative tells us how the first derivative will change as we vary the input. This is important because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone. We can think of the second derivative as measuring curvature. Suppose we have a quadratic function (many functions that arise in practice are not quadratic but can be approximated well as quadratic, at least locally). If such a function has a second derivative of zero, then there is no curvature. It is a perfectly flat line, and its value can be predicted using only the gradient. If the gradient is 1, then we can make a step of size \(\epsilon\) along the negative gradient, and the cost function will decrease by \(\epsilon\). If the second derivative is negative, the function curves downward, so the cost function will actually decrease by more than \(\epsilon\). Finally, if the second derivative is positive, the function curves upward, so the cost function can decrease by less than \(\epsilon\). See figure 4.4 to see how different forms of curvature affect the relationship between the value of the cost function predicted by the gradient and the true value.

Figure 4.4: The second derivative determines the curvature of a function. Here we show quadratic functions with various curvature. The dashed line indicates the value of the cost function we would expect based on the gradient information alone as we make a gradient step downhill. With negative curvature, the cost function actually decreases faster than the gradient predicts. With no curvature, the gradient predicts the decrease correctly. With positive curvature, the function decreases more slowly than expected and eventually begins to increase, so steps that are too large can actually increase the function inadvertently.
When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix \(\boldsymbol{H}(f)(\boldsymbol{x})\) is defined such that
\[
\boldsymbol{H}(f)(\boldsymbol{x})_{i, j}=\frac{\partial^{2}}{\partial x_{i} \partial x_{j}} f(\boldsymbol{x}) .
\]
Equivalently, the Hessian is the Jacobian of the gradient.
Anywhere that the second partial derivatives are continuous, the differential operators are commutative; that is, their order can be swapped:
\[
\frac{\partial^{2}}{\partial x_{i} \partial x_{j}} f(\boldsymbol{x})=\frac{\partial^{2}}{\partial x_{j} \partial x_{i}} f(\boldsymbol{x}) .
\]
This implies that \(H_{i, j}=H_{j, i}\), so the Hessian matrix is symmetric at such points. Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors. The second derivative in a specific direction represented by a unit vector \(\boldsymbol{d}\) is given by \(\boldsymbol{d}^{\top} \boldsymbol{H} \boldsymbol{d}\). When \(\boldsymbol{d}\) is an eigenvector of \(\boldsymbol{H}\), the second derivative in that direction is given by the corresponding eigenvalue. For other directions of \(\boldsymbol{d}\), the directional second derivative is a weighted average of all the eigenvalues, with weights between 0 and 1, and eigenvectors that have a smaller angle with \(\boldsymbol{d}\) receiving more weight. The maximum eigenvalue determines the maximum second derivative, and the minimum eigenvalue determines the minimum second derivative.
The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function \(f(\boldsymbol{x})\) around the current point \(\boldsymbol{x}^{(0)}\):
\[
f(\boldsymbol{x}) \approx f\left(\boldsymbol{x}^{(0)}\right)+\left(\boldsymbol{x}-\boldsymbol{x}^{(0)}\right)^{\top} \boldsymbol{g}+\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{x}^{(0)}\right)^{\top} \boldsymbol{H}\left(\boldsymbol{x}-\boldsymbol{x}^{(0)}\right),
\]
where \(\boldsymbol{g}\) is the gradient and \(\boldsymbol{H}\) is the Hessian at \(\boldsymbol{x}^{(0)}\). If we use a learning rate of \(\epsilon\), then the new point \(\boldsymbol{x}\) will be given by \(\boldsymbol{x}^{(0)}-\epsilon \boldsymbol{g}\). Substituting this into our approximation, we obtain
\[
f\left(\boldsymbol{x}^{(0)}-\epsilon \boldsymbol{g}\right) \approx f\left(\boldsymbol{x}^{(0)}\right)-\epsilon \boldsymbol{g}^{\top} \boldsymbol{g}+\frac{1}{2} \epsilon^{2} \boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g} .
\]
There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When \(\boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g}\) is zero or negative, the Taylor series approximation predicts that increasing \(\epsilon\) forever will decrease \(f\) forever. In practice, the Taylor series is unlikely to remain accurate for large \(\epsilon\), so one must resort to more heuristic choices of \(\epsilon\) in this case. When \(\boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g}\) is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields
\[
\epsilon^{*}=\frac{\boldsymbol{g}^{\top} \boldsymbol{g}}{\boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g}} .
\]
In the worst case, when \(\boldsymbol{g}\) aligns with the eigenvector of \(\boldsymbol{H}\) corresponding to the maximal eigenvalue \(\lambda_{\max }\), then this optimal step size is given by \(\frac{1}{\lambda_{\max }}\). To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate.
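A quick numeric sketch of this formula on an assumed quadratic \(f(\boldsymbol{x})=\frac{1}{2} \boldsymbol{x}^{\top} \boldsymbol{H} \boldsymbol{x}\):

import numpy as np

H = np.array([[1.0, 0.0],
              [0.0, 10.0]])            # eigenvalues 1 and 10
x = np.array([1.0, 1.0])
g = H @ x                              # gradient of the quadratic at x

eps_star = (g @ g) / (g @ H @ g)       # optimal step under the Taylor model
print(eps_star)                        # about 0.101, close to 1 / lambda_max = 0.1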
The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point. Recall that on a critical point, \(f^{\prime}(x)=0\). When the second derivative \(f^{\prime \prime}(x)>0\), the first derivative \(f^{\prime}(x)\) increases as we move to the right and decreases as we move to the left. This means \(f^{\prime}(x-\epsilon)<0\) and \(f^{\prime}(x+\epsilon)>0\) for small enough \(\epsilon\). In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when \(f^{\prime}(x)=0\) and \(f^{\prime \prime}(x)>0\), we can conclude that \(x\) is a local minimum. Similarly, when \(f^{\prime}(x)=0\) and \(f^{\prime \prime}(x)<0\), we can conclude that \(x\) is a local maximum. This is known as the second derivative test. Unfortunately, when \(f^{\prime \prime}(x)=0\), the test is inconclusive. In this case \(x\) may be a saddle point or a part of a flat region.
In multiple dimensions, we need to examine all the second derivatives of the function. Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where \(\nabla_{\boldsymbol{x}} f(\boldsymbol{x})=0\), we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point. When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum. In multiple dimensions, it is actually possible to find positive evidence of saddle points in some cases. When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that \(\boldsymbol{x}\) is a local maximum on one cross section of \(f\) but a local minimum on another cross section. See figure 4.5 for an example. Finally, the multidimensional second derivative test can be inconclusive, just as the univariate version can. The test is inconclusive whenever all the nonzero eigenvalues have the same sign but at least one eigenvalue is zero. This is because the univariate second derivative test is inconclusive in the cross section corresponding to the zero eigenvalue.
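A minimal sketch of this multidimensional test, assuming the Hessian at the critical point is already available as a NumPy array; the function name is illustrative:

import numpy as np

def classify_critical_point(H, tol=1e-8):
    # Assumes the gradient is already zero at the point being examined.
    eigvals = np.linalg.eigvalsh(H)          # H is symmetric, so use eigvalsh
    if np.all(eigvals > tol):
        return "local minimum"               # positive definite Hessian
    if np.all(eigvals < -tol):
        return "local maximum"               # negative definite Hessian
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"                # mixed-sign eigenvalues
    return "inconclusive"                    # some eigenvalue is (numerically) zero

# Hessian of f(x) = x_1^2 - x_2^2, the saddle of figure 4.5.
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))   # saddle point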
In multiple dimensions, there is a different second derivative for each direction at a single point. The condition number of the Hessian at this point measures how much the second derivatives differ from each other. When the Hessian has a poor condition number, gradient descent performs poorly. This is because in one direction, the derivative increases rapidly, while in another direction, it increases slowly. Gradient descent is unaware of this change in the derivative, so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer. Poor condition number also makes choosing a good step size difficult. The step size must be small enough to avoid overshooting the minimum and going uphill in directions with strong positive curvature. This usually means that the step size is too small to make significant progress in other directions with less curvature. See figure 4.6 for an example.
Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is \(f(\boldsymbol{x})=x_{1}^{2}-x_{2}^{2}\). Along the axis corresponding to \(x_{1}\), the function curves upward. This axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to \(x_{2}\), the function curves downward. This direction is an eigenvector of the Hessian with negative eigenvalue. The name "saddle point" derives from the saddle-like shape of this function. This is the quintessential example of a function with a saddle point. In more than one dimension, it is not necessary to have an eigenvalue of 0 to get a saddle point: it is only necessary to have both positive and negative eigenvalues. We can think of a saddle point with both signs of eigenvalues as being a local maximum within one cross section and a local minimum within another cross section.
Figure 4.6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Here we use gradient descent to minimize a quadratic function \(f(\boldsymbol{x})\) whose Hessian matrix has condition number 5. This means that the direction of most curvature has five times more curvature than the direction of least curvature. In this case, the most curvature is in the direction \([1,1]^{\top}\), and the least curvature is in the direction \([1,-1]^{\top}\). The red lines indicate the path followed by gradient descent. This very elongated quadratic function resembles a long canyon. Gradient descent wastes time repeatedly descending canyon walls because they are the steepest feature. Since the step size is somewhat too large, it has a tendency to overshoot the bottom of the function and thus needs to descend the opposite canyon wall on the next iteration. The large positive eigenvalue of the Hessian corresponding to the eigenvector pointed in this direction indicates that this directional derivative is rapidly increasing, so an optimization algorithm based on the Hessian could predict that the steepest direction is not actually a promising search direction in this context.
This issue can be resolved by using information from the Hessian matrix to guide the search. The simplest method for doing so is known as Newton's method. Newton's method is based on using a second-order Taylor series expansion to approximate \(f(\boldsymbol{x})\) near some point \(\boldsymbol{x}^{(0)}\):
\[
f(\boldsymbol{x}) \approx f\left(\boldsymbol{x}^{(0)}\right)+\left(\boldsymbol{x}-\boldsymbol{x}^{(0)}\right)^{\top} \nabla_{\boldsymbol{x}} f\left(\boldsymbol{x}^{(0)}\right)+\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{x}^{(0)}\right)^{\top} \boldsymbol{H}(f)\left(\boldsymbol{x}^{(0)}\right)\left(\boldsymbol{x}-\boldsymbol{x}^{(0)}\right) .
\]
If we then solve for the critical point of this function, we obtain
\[
\boldsymbol{x}^{*}=\boldsymbol{x}^{(0)}-\boldsymbol{H}(f)\left(\boldsymbol{x}^{(0)}\right)^{-1} \nabla_{\boldsymbol{x}} f\left(\boldsymbol{x}^{(0)}\right) . \tag{4.12}
\]
When \(f\) is a positive definite quadratic function, Newton's method consists of applying equation 4.12 once to jump to the minimum of the function directly. When \(f\) is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton's method consists of applying equation 4.12 multiple times. Iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point. As discussed in section 8.2.3, Newton's method is only appropriate when the nearby critical point is a minimum (all the eigenvalues of the Hessian are positive), whereas gradient descent is not attracted to saddle points unless the gradient points toward them.
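A minimal sketch of the Newton update of equation 4.12, applied to an assumed quadratic objective so that a single step lands on the exact minimum:

import numpy as np

def newton_step(grad, hess, x):
    # x' = x - H(f)(x)^{-1} * gradient of f at x  (equation 4.12)
    return x - np.linalg.solve(hess(x), grad(x))

# Quadratic example f(x) = 1/2 x^T A x - b^T x, whose Hessian is the constant A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x = newton_step(grad, hess, np.zeros(2))   # one step from an arbitrary start
print(np.allclose(grad(x), 0.0))           # True: the gradient vanishes here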
Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that also use the Hessian matrix, such as Newton's method, are called second-order optimization algorithms (Nocedal and Wright, 2006).
The optimization algorithms employed in most contexts in this book are applicable to a wide variety of functions but come with almost no guarantees. Deep learning algorithms tend to lack guarantees because the family of functions used in deep learning is quite complicated. In many other fields, the dominant approach to optimization is to design optimization algorithms for a limited family of functions.
In the context of deep learning, we sometimes gain some guarantees by restricting ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous derivatives. A Lipschitz continuous function is a function \(f\) whose rate of change is bounded by a Lipschitz constant \(\mathcal{L}\):
\[
\forall \boldsymbol{x}, \forall \boldsymbol{y},|f(\boldsymbol{x})-f(\boldsymbol{y})| \leq \mathcal{L}\|\boldsymbol{x}-\boldsymbol{y}\|_{2} .
\]
This property is useful because it enables us to quantify our assumption that a small change in the input made by an algorithm such as gradient descent will have a small change in the output. Lipschitz continuity is also a fairly weak constraint, and many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications.
Perhaps the most successful field of specialized optimization is convex optimization. Convex optimization algorithms are able to provide many more guarantees by making stronger restrictions. These algorithms are applicable only to convex functions, that is, functions for which the Hessian is positive semidefinite everywhere. Such functions are well-behaved because they lack saddle points, and all their local minima are necessarily global minima. However, most problems in deep learning are difficult to express in terms of convex optimization. Convex optimization is used only as a subroutine of some deep learning algorithms. Ideas from the analysis of convex optimization algorithms can be useful for proving the convergence of deep learning algorithms, but in general, the importance of convex optimization is greatly diminished in the context of deep learning. For more information about convex optimization, see Boyd and Vandenberghe (2004) or Rockafellar (1997).

4.4 Constrained Optimization

Sometimes we wish not only to maximize or minimize a function \(f(\boldsymbol{x})\) over all possible values of \(\boldsymbol{x}\). Instead we may wish to find the maximal or minimal value of \(f(\boldsymbol{x})\) for values of \(\boldsymbol{x}\) in some set \(\mathbb{S}\). This is known as constrained optimization. Points \(\boldsymbol{x}\) that lie within the set \(\mathbb{S}\) are called feasible points in constrained optimization terminology.
We often wish to find a solution that is small in some sense. A common approach in such situations is to impose a norm constraint, such as \(\|\boldsymbol{x}\| \leq 1\).
One simple approach to constrained optimization is simply to modify gradient descent taking the constraint into account. If we use a small constant step size \(\epsilon\), we can make gradient descent steps, then project the result back into \(\mathbb{S}\). If we use a line search, we can search only over step sizes \(\epsilon\) that yield new \(\boldsymbol{x}\) points that are feasible, or we can project each point on the line back into the constraint region. When possible, this method can be made more efficient by projecting the gradient into the tangent space of the feasible region before taking the step or beginning the line search (Rosen, 1960).
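A sketch of the project-after-each-step idea for the particular constraint set \(\|\boldsymbol{x}\|_{2} \leq 1\); the projection and the test problem are assumed examples:

import numpy as np

def project_to_unit_ball(x):
    # Euclidean projection onto the feasible set {x : ||x||_2 <= 1}.
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

def projected_gradient_descent(grad_f, x, epsilon=0.1, steps=200):
    for _ in range(steps):
        x = x - epsilon * grad_f(x)      # ordinary gradient step
        x = project_to_unit_ball(x)      # project the result back into S
    return x

# Minimize 1/2 ||x - c||_2^2 subject to ||x||_2 <= 1, with c outside the ball.
c = np.array([3.0, 0.0])
x = projected_gradient_descent(lambda x: x - c, np.zeros(2))
print(x)                                 # approximately [1, 0], the closest feasible point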
A more sophisticated approach is to design a different, unconstrained optimization problem whose solution can be converted into a solution to the original, constrained optimization problem. For example, if we want to minimize \(f(\boldsymbol{x})\) for \(\boldsymbol{x} \in \mathbb{R}^{2}\) with \(\boldsymbol{x}\) constrained to have exactly unit \(L^{2}\) norm, we can instead minimize \(g(\theta)=f\left([\cos \theta, \sin \theta]^{\top}\right)\) with respect to \(\theta\), then return \([\cos \theta, \sin \theta]\) as the solution to the original problem. This approach requires creativity; the transformation between optimization problems must be designed specifically for each case we encounter.
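A sketch of this reparameterization; the quadratic objective and the step size are our own illustrative choices:

import numpy as np

# Minimize f(x) = ||x - c||_2^2 subject to ||x||_2 = 1 by optimizing over theta.
c = np.array([3.0, 4.0])
f = lambda x: np.sum((x - c) ** 2)

to_x = lambda theta: np.array([np.cos(theta), np.sin(theta)])    # always unit norm
g = lambda theta: f(to_x(theta))
# Chain rule: dg/dtheta = 2 (x(theta) - c) . [-sin(theta), cos(theta)]
dg = lambda theta: 2.0 * (to_x(theta) - c) @ np.array([-np.sin(theta), np.cos(theta)])

theta = 0.0
for _ in range(200):
    theta = theta - 0.05 * dg(theta)     # unconstrained gradient descent on theta
print(to_x(theta))                       # approximately c / ||c|| = [0.6, 0.8]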
The Karush-Kuhn-Tucker (KKT) approach provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.
To define the Lagrangian, we first need to describe \(\mathbb{S}\) in terms of equations and inequalities. We want a description of \(\mathbb{S}\) in terms of \(m\) functions \(g^{(i)}\) and \(n\) functions \(h^{(j)}\) so that \(\mathbb{S}=\left\{\boldsymbol{x} \mid \forall i, g^{(i)}(\boldsymbol{x})=0\right.\) and \(\left.\forall j, h^{(j)}(\boldsymbol{x}) \leq 0\right\}\). The equations involving \(g^{(i)}\) are called the equality constraints, and the inequalities involving \(h^{(j)}\) are called inequality constraints.
We introduce new variables \(\lambda_{i}\) and \(\alpha_{j}\) for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as
\[
L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})=f(\boldsymbol{x})+\sum_{i} \lambda_{i} g^{(i)}(\boldsymbol{x})+\sum_{j} \alpha_{j} h^{(j)}(\boldsymbol{x}) .
\]
We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. As long as at least one feasible point exists and \(f(\boldsymbol{x})\) is not permitted to have value \(\infty\), then
\[
\min _{\boldsymbol{x}} \max _{\boldsymbol{\lambda}} \max _{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})
\]
has the same optimal objective function value and set of optimal points \(\boldsymbol{x}\) as
\[
\min _{\boldsymbol{x} \in \mathbb{S}} f(\boldsymbol{x}) .
\]
This follows because any time the constraints are satisfied,
\[
\max _{\boldsymbol{\lambda}} \max _{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})=f(\boldsymbol{x}),
\]
while any time a constraint is violated,
\[
\max _{\boldsymbol{\lambda}} \max _{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})=\infty .
\]
These properties guarantee that no infeasible point can be optimal, and that the optimum within the feasible points is unchanged.
To perform constrained maximization, we can construct the generalized Lagrange function of \(-f(\boldsymbol{x})\), which leads to this optimization problem:
\[
\min _{\boldsymbol{x}} \max _{\boldsymbol{\lambda}} \max _{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0}-f(\boldsymbol{x})+\sum_{i} \lambda_{i} g^{(i)}(\boldsymbol{x})+\sum_{j} \alpha_{j} h^{(j)}(\boldsymbol{x}) .
\]
We may also convert this to a problem with maximization in the outer loop:
\[
\max _{\boldsymbol{x}} \min _{\boldsymbol{\lambda}} \min _{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} f(\boldsymbol{x})+\sum_{i} \lambda_{i} g^{(i)}(\boldsymbol{x})-\sum_{j} \alpha_{j} h^{(j)}(\boldsymbol{x}) .
\]
The sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization is free to choose any sign for each \(\lambda_{i}\).
The inequality constraints are particularly interesting. We say that a constraint \(h^{(i)}(\boldsymbol{x})\) is active if \(h^{(i)}\left(\boldsymbol{x}^{*}\right)=0\). If a constraint is not active, then the solution to the problem found using that constraint would remain at least a local solution if that constraint were removed. It is possible that an inactive constraint excludes other solutions. For example, a convex problem with an entire region of globally optimal points (a wide, flat region of equal cost) could have a subset of this region eliminated by constraints, or a nonconvex problem could have better local stationary points excluded by a constraint that is inactive at convergence. Yet the point found at convergence remains a stationary point whether or not the inactive constraints are included. Because an inactive \(h^{(i)}\) has negative value, then the solution to \(\min _{\boldsymbol{x}} \max _{\boldsymbol{\lambda}} \max _{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})\) will have \(\alpha_{i}=0\). We can thus observe that at the solution, \(\boldsymbol{\alpha} \odot \boldsymbol{h}(\boldsymbol{x})=\mathbf{0}\). In other words, for all \(i\), we know that at least one of the constraints \(\alpha_{i} \geq 0\) or \(h^{(i)}(\boldsymbol{x}) \leq 0\) must be active at the solution. To gain some intuition for this idea, we can say that either the solution is on the boundary imposed by the inequality and we must use its KKT multiplier to influence the solution to \(\boldsymbol{x}\), or the inequality has no influence on the solution and we represent this by zeroing out its KKT multiplier.
A simple set of properties describe the optimal points of constrained optimization problems. These properties are called the Karush-Kuhn-Tucker (KKT) conditions (Karush, 1939; Kuhn and Tucker, 1951). They are necessary conditions, but not always sufficient conditions, for a point to be optimal. The conditions are:
  • The gradient of the generalized Lagrangian is zero.
  • All constraints on both \(\boldsymbol{x}\) and the KKT multipliers are satisfied.
  • The inequality constraints exhibit "complementary slackness": \(\boldsymbol{\alpha} \odot \boldsymbol{h}(\boldsymbol{x})=\mathbf{0}\).
For more information about the KKT approach, see Nocedal and Wright (2006).

4.5 Example: Linear Least Squares

Suppose we want to find the value of \(\boldsymbol{x}\) that minimizes
\[
f(\boldsymbol{x})=\frac{1}{2}\|\boldsymbol{A} \boldsymbol{x}-\boldsymbol{b}\|_{2}^{2} .
\]
Specialized linear algebra algorithms can solve this problem efficiently; however, we can also explore how to solve it using gradient-based optimization as a simple example of how these techniques work.
First, we need to obtain the gradient:
\[
\nabla_{\boldsymbol{x}} f(\boldsymbol{x})=\boldsymbol{A}^{\top}(\boldsymbol{A} \boldsymbol{x}-\boldsymbol{b})=\boldsymbol{A}^{\top} \boldsymbol{A} \boldsymbol{x}-\boldsymbol{A}^{\top} \boldsymbol{b} .
\]
We can then follow this gradient downhill, taking small steps. See algorithm 4.1 for details.
Algorithm 4.1 An algorithm to minimize \(f(\boldsymbol{x})=\frac{1}{2}\|\boldsymbol{A} \boldsymbol{x}-\boldsymbol{b}\|_{2}^{2}\) with respect to \(\boldsymbol{x}\)
using gradient descent, starting from an arbitrary value of \(\boldsymbol{x}\).
Set the step size \((\epsilon)\) and tolerance \((\delta)\) to small, positive numbers.
while \(\left\|\boldsymbol{A}^{\top} \boldsymbol{A} \boldsymbol{x}-\boldsymbol{A}^{\top} \boldsymbol{b}\right\|_{2}>\delta\) do
\(\boldsymbol{x} \leftarrow \boldsymbol{x}-\epsilon\left(\boldsymbol{A}^{\top} \boldsymbol{A} \boldsymbol{x}-\boldsymbol{A}^{\top} \boldsymbol{b}\right)\)
end while
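A direct NumPy transcription of algorithm 4.1; the step size, tolerance, and test problem below are illustrative choices, not values from the text.

import numpy as np

def least_squares_gradient_descent(A, b, epsilon=1e-2, delta=1e-8):
    x = np.zeros(A.shape[1])                    # arbitrary starting value of x
    grad = A.T @ A @ x - A.T @ b
    while np.linalg.norm(grad) > delta:
        x = x - epsilon * grad                  # gradient descent step
        grad = A.T @ A @ x - A.T @ b
    return x

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 2.0])
x = least_squares_gradient_descent(A, b)
print(x)   # approximately [-0.667, 0.917], matching np.linalg.lstsq(A, b, rcond=None)[0]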
One can also solve this problem using Newton's method. In this case, because the true function is quadratic, the quadratic approximation employed by Newton's method is exact, and the algorithm converges to the global minimum in a single step.
Now suppose we wish to minimize the same function, but subject to the constraint \(\boldsymbol{x}^{\top} \boldsymbol{x} \leq 1\). To do so, we introduce the Lagrangian
\[
L(\boldsymbol{x}, \lambda)=f(\boldsymbol{x})+\lambda\left(\boldsymbol{x}^{\top} \boldsymbol{x}-1\right) .
\]
We can now solve the problem
\[
\min _{\boldsymbol{x}} \max _{\lambda, \lambda \geq 0} L(\boldsymbol{x}, \lambda) .
\]
The smallest-norm solution to the unconstrained least-squares problem may be found using the Moore-Penrose pseudoinverse: \(\boldsymbol{x}=\boldsymbol{A}^{+} \boldsymbol{b}\). If this point is feasible, then it is the solution to the constrained problem. Otherwise, we must find a solution where the constraint is active. By differentiating the Lagrangian with respect to \(\boldsymbol{x}\), we obtain the equation
\[
\boldsymbol{A}^{\top} \boldsymbol{A} \boldsymbol{x}-\boldsymbol{A}^{\top} \boldsymbol{b}+2 \lambda \boldsymbol{x}=0 .
\]
This tells us that the solution will take the form
\[
\boldsymbol{x}=\left(\boldsymbol{A}^{\top} \boldsymbol{A}+2 \lambda \boldsymbol{I}\right)^{-1} \boldsymbol{A}^{\top} \boldsymbol{b} .
\]
The magnitude of \(\lambda\) must be chosen such that the result obeys the constraint. We can find this value by performing gradient ascent on \(\lambda\). To do so, observe
\[
\frac{\partial}{\partial \lambda} L(\boldsymbol{x}, \lambda)=\boldsymbol{x}^{\top} \boldsymbol{x}-1 .
\]
When the norm of \(\boldsymbol{x}\) exceeds 1, this derivative is positive, so to follow the derivative uphill and increase the Lagrangian with respect to \(\lambda\), we increase \(\lambda\). Because the coefficient on the \(\boldsymbol{x}^{\top} \boldsymbol{x}\) penalty has increased, solving the linear equation for \(\boldsymbol{x}\) will now yield a solution with a smaller norm. The process of solving the linear equation and adjusting \(\lambda\) continues until \(\boldsymbol{x}\) has the correct norm and the derivative on \(\lambda\) is 0.
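A sketch of this solve-and-adjust loop; the learning rate, tolerance, and test data are assumed values chosen only to make the example run:

import numpy as np

def constrained_least_squares(A, b, lr=0.1, tol=1e-6):
    # Alternate between solving (A^T A + 2*lambda*I) x = A^T b for x
    # and taking a gradient ascent step on lambda, keeping lambda >= 0.
    n = A.shape[1]
    lam = 0.0
    while True:
        x = np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(n), A.T @ b)
        grad_lam = x @ x - 1.0                     # dL/dlambda = x^T x - 1
        if abs(grad_lam) < tol or (lam == 0.0 and grad_lam <= 0.0):
            return x, lam                          # feasible and stationary
        lam = max(0.0, lam + lr * grad_lam)        # ascend on lambda

A = np.eye(2)
b = np.array([3.0, 4.0])                           # unconstrained solution [3, 4] is infeasible
x, lam = constrained_least_squares(A, b)
print(x, np.linalg.norm(x))                        # approximately [0.6, 0.8] with norm 1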
This concludes the mathematical preliminaries that we use to develop machine learning algorithms. We are now ready to build and analyze some full-fledged learning systems.

  1. The KKT approach generalizes the method of Lagrange multipliers, which allows equality constraints but not inequality constraints.