This post is the first of a three-part series in which we set out to derive the mathematics behind feedforward neural networks. They have

  • an input and an output layer with at least one hidden layer in between,
  • fully-connected layers, which means that each node in one layer connects to every node in the following layer, and
  • ways to introduce nonlinearity by means of activation functions.

We start with forward propagation, which involves computing predictions and the associated cost of these predictions.

Forward Propagation

Settling on what notations to use is tricky since we only have so many letters in the Roman alphabet. As you browse the Internet, you will likely find derivations that have used different notations than the ones we are about to introduce. However, and fortunately, there is no right or wrong here; it is just a matter of taste. In particular, the notations used in this series take inspiration from Andrew Ng’s Standard notations for Deep Learning. If you make a comparison, you will find that we only change a couple of the details.

Now, whatever we come up with, we have to support

  • multiple layers,
  • several nodes in each layer,
  • various activation functions,
  • various types of cost functions, and
  • mini-batches of training examples.

As a result, our definition of a node ends up introducing a fairly large number of notations:

$$z_{j,i}^{[l]} = \sum_{k} w_{j,k}^{[l]}\, a_{k,i}^{[l-1]} + b_j^{[l]}, \tag{1}$$

$$a_{j,i}^{[l]} = g_j^{[l]}\!\left(z_{1,i}^{[l]}, \dots, z_{j,i}^{[l]}, \dots, z_{n^{[l]},i}^{[l]}\right). \tag{2}$$

Does the node definition look intimidating to you at first glance? Do not worry. Hopefully, it will make more sense once we have explained the notations, which we shall do next:

Entity  Description
$l$  The current layer, $l = 1, \dots, L$, where $L$ is the number of layers that have weights and biases. We use $l = 0$ and $l = L$ to denote the input and output layers.
$n^{[l]}$  The number of nodes in the current layer.
$n^{[l-1]}$  The number of nodes in the previous layer.
$j$  The $j$th node of the current layer, $j = 1, \dots, n^{[l]}$.
$k$  The $k$th node of the previous layer, $k = 1, \dots, n^{[l-1]}$.
$i$  The current training example, $i = 1, \dots, m$, where $m$ is the number of training examples.
$z_{j,i}^{[l]}$  A weighted sum of the activations of the previous layer, shifted by a bias.
$w_{j,k}^{[l]}$  A weight that scales the $k$th activation of the previous layer.
$b_j^{[l]}$  A bias in the current layer.
$a_{j,i}^{[l]}$  An activation in the current layer.
$a_{k,i}^{[l-1]}$  An activation in the previous layer.
$g_j^{[l]}$  An activation function $g_j^{[l]} : \mathbb{R}^{n^{[l]}} \to \mathbb{R}$ used in the current layer.

To put it concisely, a node in the current layer depends on every node in the previous layer, and the following visualization can help us see that more clearly:

Figure 1: A node in the current layer.

Moreover, a node in the previous layer affects every node in the current layer, and with a change in highlighting, we will also be able to see that more clearly:

Figure 2: A node in the previous layer.

In the future, we might want to write an implementation from scratch in, for example, Python. To take advantage of the heavily optimized vector and matrix operations that come bundled with libraries such as NumPy, we need to vectorize (1) and (2).
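For reference, here is what a naive, loop-based Python sketch of (1) and (2) might look like before any vectorization; the layer sizes and the tanh activation are assumptions made purely for illustration, and the explicit loops over $j$ and $k$ are exactly what we want to replace with matrix operations.

```python
import math
import random

random.seed(0)

n_prev, n_curr = 4, 3  # assumed layer sizes: n^[l-1] = 4 and n^[l] = 3

a_prev = [random.gauss(0, 1) for _ in range(n_prev)]                      # a_{k,i}^[l-1]
W = [[random.gauss(0, 1) for _ in range(n_prev)] for _ in range(n_curr)]  # w_{j,k}^[l]
b = [random.gauss(0, 1) for _ in range(n_curr)]                           # b_j^[l]

# Equation (1): for each node j, a weighted sum over the previous layer plus a bias.
z = [sum(W[j][k] * a_prev[k] for k in range(n_prev)) + b[j] for j in range(n_curr)]

# Equation (2), assuming an elementwise tanh activation for this sketch.
a = [math.tanh(z_j) for z_j in z]

print(z)
print(a)
```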

To begin with, we vectorize the nodes:

$$
\begin{bmatrix} z_{1,i}^{[l]} \\ \vdots \\ z_{j,i}^{[l]} \\ \vdots \\ z_{n^{[l]},i}^{[l]} \end{bmatrix}
=
\begin{bmatrix}
w_{1,1}^{[l]} & \cdots & w_{1,k}^{[l]} & \cdots & w_{1,n^{[l-1]}}^{[l]} \\
\vdots & & \vdots & & \vdots \\
w_{j,1}^{[l]} & \cdots & w_{j,k}^{[l]} & \cdots & w_{j,n^{[l-1]}}^{[l]} \\
\vdots & & \vdots & & \vdots \\
w_{n^{[l]},1}^{[l]} & \cdots & w_{n^{[l]},k}^{[l]} & \cdots & w_{n^{[l]},n^{[l-1]}}^{[l]}
\end{bmatrix}
\begin{bmatrix} a_{1,i}^{[l-1]} \\ \vdots \\ a_{k,i}^{[l-1]} \\ \vdots \\ a_{n^{[l-1]},i}^{[l-1]} \end{bmatrix}
+
\begin{bmatrix} b_{1}^{[l]} \\ \vdots \\ b_{j}^{[l]} \\ \vdots \\ b_{n^{[l]}}^{[l]} \end{bmatrix},
$$

$$
\begin{bmatrix} a_{1,i}^{[l]} \\ \vdots \\ a_{j,i}^{[l]} \\ \vdots \\ a_{n^{[l]},i}^{[l]} \end{bmatrix}
=
\begin{bmatrix}
g_{1}^{[l]}\!\left(z_{1,i}^{[l]}, \dots, z_{j,i}^{[l]}, \dots, z_{n^{[l]},i}^{[l]}\right) \\
\vdots \\
g_{j}^{[l]}\!\left(z_{1,i}^{[l]}, \dots, z_{j,i}^{[l]}, \dots, z_{n^{[l]},i}^{[l]}\right) \\
\vdots \\
g_{n^{[l]}}^{[l]}\!\left(z_{1,i}^{[l]}, \dots, z_{j,i}^{[l]}, \dots, z_{n^{[l]},i}^{[l]}\right)
\end{bmatrix},
$$

which we can write as

$$z_{:,i}^{[l]} = W^{[l]} a_{:,i}^{[l-1]} + b^{[l]}, \tag{3}$$

$$a_{:,i}^{[l]} = g^{[l]}\!\left(z_{:,i}^{[l]}\right), \tag{4}$$

where $z_{:,i}^{[l]} \in \mathbb{R}^{n^{[l]}}$, $W^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}$, $b^{[l]} \in \mathbb{R}^{n^{[l]}}$, $a_{:,i}^{[l]} \in \mathbb{R}^{n^{[l]}}$, $a_{:,i}^{[l-1]} \in \mathbb{R}^{n^{[l-1]}}$, and lastly, $g^{[l]} : \mathbb{R}^{n^{[l]}} \to \mathbb{R}^{n^{[l]}}$. We have used a colon to clarify that $z_{:,i}^{[l]}$ is the $i$th column of $Z^{[l]}$, and so on.
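As a minimal NumPy sketch of (3) and (4) for a single training example (the layer sizes and the tanh activation are, again, assumptions made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr = 4, 3                        # assumed layer sizes: n^[l-1] and n^[l]

a_prev = rng.standard_normal((n_prev, 1))    # a_{:,i}^[l-1], the ith column of A^[l-1]
W = rng.standard_normal((n_curr, n_prev))    # W^[l]
b = rng.standard_normal((n_curr, 1))         # b^[l]

z = W @ a_prev + b                           # equation (3)
a = np.tanh(z)                               # equation (4), assuming an elementwise tanh

print(z.shape, a.shape)                      # (3, 1) (3, 1)
```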

Next, we vectorize the training examples:

$$Z^{[l]} = \begin{bmatrix} z_{:,1}^{[l]} & \cdots & z_{:,i}^{[l]} & \cdots & z_{:,m}^{[l]} \end{bmatrix} = W^{[l]} \begin{bmatrix} a_{:,1}^{[l-1]} & \cdots & a_{:,i}^{[l-1]} & \cdots & a_{:,m}^{[l-1]} \end{bmatrix} + \begin{bmatrix} b^{[l]} & \cdots & b^{[l]} & \cdots & b^{[l]} \end{bmatrix} = W^{[l]} A^{[l-1]} + \operatorname{broadcast}\!\left(b^{[l]}\right), \tag{5}$$

$$A^{[l]} = \begin{bmatrix} a_{:,1}^{[l]} & \cdots & a_{:,i}^{[l]} & \cdots & a_{:,m}^{[l]} \end{bmatrix}, \tag{6}$$

where $Z^{[l]} \in \mathbb{R}^{n^{[l]} \times m}$, $A^{[l]} \in \mathbb{R}^{n^{[l]} \times m}$, and $A^{[l-1]} \in \mathbb{R}^{n^{[l-1]} \times m}$. In addition, have a look at the NumPy documentation if you want to read a well-written explanation of broadcasting.
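To make the broadcasting in (5) concrete, here is a small NumPy sketch with made-up sizes; keeping $b^{[l]}$ as an $n^{[l]} \times 1$ column vector lets NumPy replicate it across all $m$ columns for us, which is exactly what $\operatorname{broadcast}(b^{[l]})$ denotes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr, m = 4, 3, 5                  # assumed sizes: n^[l-1], n^[l], and m examples

A_prev = rng.standard_normal((n_prev, m))    # A^[l-1]
W = rng.standard_normal((n_curr, n_prev))    # W^[l]
b = rng.standard_normal((n_curr, 1))         # b^[l], an n^[l]-by-1 column vector

Z = W @ A_prev + b                           # equation (5); b is broadcast across the m columns
A = np.tanh(Z)                               # equation (6), assuming an elementwise tanh

# Broadcasting gives the same result as explicitly tiling b, i.e. broadcast(b^[l]) in (5).
print(np.allclose(Z, W @ A_prev + np.tile(b, (1, m))))   # True
```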

We would also like to establish two additional notations:

$$A^{[0]} = X, \tag{7}$$

$$A^{[L]} = \hat{Y}, \tag{8}$$

where $X \in \mathbb{R}^{n^{[0]} \times m}$ denotes the inputs and $\hat{Y} \in \mathbb{R}^{n^{[L]} \times m}$ denotes the predictions/outputs.

Finally, we are ready to define the cost function:

$$J = f\!\left(\hat{Y}, Y\right) = f\!\left(A^{[L]}, Y\right), \tag{9}$$

where $Y \in \mathbb{R}^{n^{[L]} \times m}$ denotes the targets and $f : \mathbb{R}^{n^{[L]} \times m} \times \mathbb{R}^{n^{[L]} \times m} \to \mathbb{R}$ can be tailored to our needs.
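Cost functions are the subject of the third post, but to make (9) a little more concrete, one possible (assumed) choice of $f$ is the mean squared error with a $1/(2m)$ normalization,

$$J = f\!\left(\hat{Y}, Y\right) = \frac{1}{2m} \sum_{i=1}^{m} \sum_{j=1}^{n^{[L]}} \left(\hat{y}_{j,i} - y_{j,i}\right)^{2},$$

where $\hat{y}_{j,i}$ and $y_{j,i}$ denote the entries of $\hat{Y}$ and $Y$; nothing in this series depends on this particular choice.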

We are done with forward propagation! Next up: backward propagation, also known as backpropagation, which involves computing the gradient of the cost function with respect to the weights and biases.

Backward Propagation

We will make heavy use of the chain rule in this section, and to understand better how it works, we first apply the chain rule to the following example:

$$u_i = g_i\!\left(x_1, \dots, x_j, \dots, x_n\right), \tag{10}$$

$$y_k = f_k\!\left(u_1, \dots, u_i, \dots, u_m\right). \tag{11}$$

Note that $x_j$ may affect $u_1, \dots, u_i, \dots, u_m$, and $y_k$ may depend on $u_1, \dots, u_i, \dots, u_m$; thus,

$$\frac{\partial y_k}{\partial x_j} = \sum_i \frac{\partial y_k}{\partial u_i} \frac{\partial u_i}{\partial x_j}. \tag{12}$$

Great! If we ever get stuck trying to compute or understand some partial derivative, we can always go back to (10), (11), and (12). Hopefully, these equations will provide the clues necessary to move forward. However, be extra careful not to confuse the notation used for the chain rule example with the notation we use elsewhere in this series. The overlap is unintentional.
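As a quick sanity check (using the chain rule example's notation, not the network notation), suppose $u_1 = x_1 + x_2$, $u_2 = x_1 x_2$, and $y_1 = u_1 u_2$. Then (12) gives

$$\frac{\partial y_1}{\partial x_1} = \frac{\partial y_1}{\partial u_1} \frac{\partial u_1}{\partial x_1} + \frac{\partial y_1}{\partial u_2} \frac{\partial u_2}{\partial x_1} = u_2 \cdot 1 + u_1 \cdot x_2 = 2 x_1 x_2 + x_2^2,$$

which matches what we get by differentiating $y_1 = x_1^2 x_2 + x_1 x_2^2$ directly.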

Now, let us concentrate on the task at hand:

$$\frac{\partial J}{\partial w_{j,k}^{[l]}} = \sum_i \frac{\partial J}{\partial z_{j,i}^{[l]}} \frac{\partial z_{j,i}^{[l]}}{\partial w_{j,k}^{[l]}} = \sum_i \frac{\partial J}{\partial z_{j,i}^{[l]}}\, a_{k,i}^{[l-1]}, \tag{13}$$

$$\frac{\partial J}{\partial b_j^{[l]}} = \sum_i \frac{\partial J}{\partial z_{j,i}^{[l]}} \frac{\partial z_{j,i}^{[l]}}{\partial b_j^{[l]}} = \sum_i \frac{\partial J}{\partial z_{j,i}^{[l]}}. \tag{14}$$

Vectorization results in

$$
\begin{bmatrix}
\frac{\partial J}{\partial w_{1,1}^{[l]}} & \cdots & \frac{\partial J}{\partial w_{1,k}^{[l]}} & \cdots & \frac{\partial J}{\partial w_{1,n^{[l-1]}}^{[l]}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial J}{\partial w_{j,1}^{[l]}} & \cdots & \frac{\partial J}{\partial w_{j,k}^{[l]}} & \cdots & \frac{\partial J}{\partial w_{j,n^{[l-1]}}^{[l]}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial J}{\partial w_{n^{[l]},1}^{[l]}} & \cdots & \frac{\partial J}{\partial w_{n^{[l]},k}^{[l]}} & \cdots & \frac{\partial J}{\partial w_{n^{[l]},n^{[l-1]}}^{[l]}}
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial J}{\partial z_{1,1}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{1,i}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{1,m}^{[l]}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial J}{\partial z_{j,1}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{j,i}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{j,m}^{[l]}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial J}{\partial z_{n^{[l]},1}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{n^{[l]},i}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{n^{[l]},m}^{[l]}}
\end{bmatrix}
\begin{bmatrix}
a_{1,1}^{[l-1]} & \cdots & a_{k,1}^{[l-1]} & \cdots & a_{n^{[l-1]},1}^{[l-1]} \\
\vdots & & \vdots & & \vdots \\
a_{1,i}^{[l-1]} & \cdots & a_{k,i}^{[l-1]} & \cdots & a_{n^{[l-1]},i}^{[l-1]} \\
\vdots & & \vdots & & \vdots \\
a_{1,m}^{[l-1]} & \cdots & a_{k,m}^{[l-1]} & \cdots & a_{n^{[l-1]},m}^{[l-1]}
\end{bmatrix},
$$

$$
\begin{bmatrix}
\frac{\partial J}{\partial b_{1}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial b_{j}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial b_{n^{[l]}}^{[l]}}
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial J}{\partial z_{1,1}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial z_{j,1}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial z_{n^{[l]},1}^{[l]}}
\end{bmatrix}
+ \cdots +
\begin{bmatrix}
\frac{\partial J}{\partial z_{1,i}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial z_{j,i}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial z_{n^{[l]},i}^{[l]}}
\end{bmatrix}
+ \cdots +
\begin{bmatrix}
\frac{\partial J}{\partial z_{1,m}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial z_{j,m}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial z_{n^{[l]},m}^{[l]}}
\end{bmatrix},
$$

which we can write as

$$\frac{\partial J}{\partial W^{[l]}} = \sum_i \frac{\partial J}{\partial z_{:,i}^{[l]}} \left(a_{:,i}^{[l-1]}\right)^{\!\top} = \frac{\partial J}{\partial Z^{[l]}} \left(A^{[l-1]}\right)^{\!\top}, \tag{15}$$

$$\frac{\partial J}{\partial b^{[l]}} = \sum_i \frac{\partial J}{\partial z_{:,i}^{[l]}} = \underbrace{\sum_{\text{axis}=1} \frac{\partial J}{\partial Z^{[l]}}}_{\text{column vector}}, \tag{16}$$

where $\partial J / \partial z_{:,i}^{[l]} \in \mathbb{R}^{n^{[l]}}$, $\partial J / \partial Z^{[l]} \in \mathbb{R}^{n^{[l]} \times m}$, $\partial J / \partial W^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}$, and $\partial J / \partial b^{[l]} \in \mathbb{R}^{n^{[l]}}$.
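In NumPy, (15) and (16) might be sketched as follows, with dZ standing in for $\partial J / \partial Z^{[l]}$ (a made-up placeholder, since we have not yet computed it):

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr, m = 4, 3, 5

A_prev = rng.standard_normal((n_prev, m))    # A^[l-1]
dZ = rng.standard_normal((n_curr, m))        # stand-in for dJ/dZ^[l]

dW = dZ @ A_prev.T                           # equation (15): dJ/dW^[l]
db = np.sum(dZ, axis=1, keepdims=True)       # equation (16): dJ/db^[l], kept as a column vector

print(dW.shape, db.shape)                    # (3, 4) (3, 1)
```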

Looking back at (13) and (14), we see that the only unknown entity is $\partial J / \partial z_{j,i}^{[l]}$. By applying the chain rule once again, we get

$$\frac{\partial J}{\partial z_{j,i}^{[l]}} = \sum_p \frac{\partial J}{\partial a_{p,i}^{[l]}} \frac{\partial a_{p,i}^{[l]}}{\partial z_{j,i}^{[l]}}, \tag{17}$$

where $p = 1, \dots, n^{[l]}$.

Next, we present the vectorized version:

$$
\begin{bmatrix}
\frac{\partial J}{\partial z_{1,i}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial z_{j,i}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial z_{n^{[l]},i}^{[l]}}
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial a_{1,i}^{[l]}}{\partial z_{1,i}^{[l]}} & \cdots & \frac{\partial a_{j,i}^{[l]}}{\partial z_{1,i}^{[l]}} & \cdots & \frac{\partial a_{n^{[l]},i}^{[l]}}{\partial z_{1,i}^{[l]}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial a_{1,i}^{[l]}}{\partial z_{j,i}^{[l]}} & \cdots & \frac{\partial a_{j,i}^{[l]}}{\partial z_{j,i}^{[l]}} & \cdots & \frac{\partial a_{n^{[l]},i}^{[l]}}{\partial z_{j,i}^{[l]}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial a_{1,i}^{[l]}}{\partial z_{n^{[l]},i}^{[l]}} & \cdots & \frac{\partial a_{j,i}^{[l]}}{\partial z_{n^{[l]},i}^{[l]}} & \cdots & \frac{\partial a_{n^{[l]},i}^{[l]}}{\partial z_{n^{[l]},i}^{[l]}}
\end{bmatrix}
\begin{bmatrix}
\frac{\partial J}{\partial a_{1,i}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial a_{j,i}^{[l]}} \\ \vdots \\ \frac{\partial J}{\partial a_{n^{[l]},i}^{[l]}}
\end{bmatrix},
$$

which compresses into

$$\frac{\partial J}{\partial z_{:,i}^{[l]}} = \frac{\partial a_{:,i}^{[l]}}{\partial z_{:,i}^{[l]}} \frac{\partial J}{\partial a_{:,i}^{[l]}}, \tag{18}$$

where $\partial J / \partial a_{:,i}^{[l]} \in \mathbb{R}^{n^{[l]}}$ and $\partial a_{:,i}^{[l]} / \partial z_{:,i}^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l]}}$.

We have already encountered

$$\frac{\partial J}{\partial Z^{[l]}} = \begin{bmatrix} \frac{\partial J}{\partial z_{:,1}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{:,i}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{:,m}^{[l]}} \end{bmatrix}, \tag{19}$$

and for the sake of completeness, we also clarify that

$$\frac{\partial J}{\partial A^{[l]}} = \begin{bmatrix} \frac{\partial J}{\partial a_{:,1}^{[l]}} & \cdots & \frac{\partial J}{\partial a_{:,i}^{[l]}} & \cdots & \frac{\partial J}{\partial a_{:,m}^{[l]}} \end{bmatrix}, \tag{20}$$

where $\partial J / \partial A^{[l]} \in \mathbb{R}^{n^{[l]} \times m}$.

We have deliberately omitted the details of $g_j^{[l]}(z_{1,i}^{[l]}, \dots, z_{j,i}^{[l]}, \dots, z_{n^{[l]},i}^{[l]})$; consequently, we cannot yet derive an analytic expression for $\partial a_{j,i}^{[l]} / \partial z_{j,i}^{[l]}$, which we depend on in (17). However, since the second post of this series will be dedicated to activation functions, we will instead derive $\partial a_{j,i}^{[l]} / \partial z_{j,i}^{[l]}$ there.

Furthermore, according to (17), we see that $\partial J / \partial z_{j,i}^{[l]}$ also depends on $\partial J / \partial a_{j,i}^{[l]}$. Now, it might come as a surprise, but $\partial J / \partial a_{j,i}^{[l]}$ has already been computed when we reach the $l$th layer during backward propagation. How did that happen, you may ask. The answer is that every layer paves the way for the previous layer by also computing $\partial J / \partial a_{k,i}^{[l-1]}$, which we shall do now:

$$\frac{\partial J}{\partial a_{k,i}^{[l-1]}} = \sum_j \frac{\partial J}{\partial z_{j,i}^{[l]}} \frac{\partial z_{j,i}^{[l]}}{\partial a_{k,i}^{[l-1]}} = \sum_j \frac{\partial J}{\partial z_{j,i}^{[l]}}\, w_{j,k}^{[l]}. \tag{21}$$

As usual, our next step is vectorization:

$$
\begin{bmatrix}
\frac{\partial J}{\partial a_{1,1}^{[l-1]}} & \cdots & \frac{\partial J}{\partial a_{1,i}^{[l-1]}} & \cdots & \frac{\partial J}{\partial a_{1,m}^{[l-1]}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial J}{\partial a_{k,1}^{[l-1]}} & \cdots & \frac{\partial J}{\partial a_{k,i}^{[l-1]}} & \cdots & \frac{\partial J}{\partial a_{k,m}^{[l-1]}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial J}{\partial a_{n^{[l-1]},1}^{[l-1]}} & \cdots & \frac{\partial J}{\partial a_{n^{[l-1]},i}^{[l-1]}} & \cdots & \frac{\partial J}{\partial a_{n^{[l-1]},m}^{[l-1]}}
\end{bmatrix}
=
\begin{bmatrix}
w_{1,1}^{[l]} & \cdots & w_{j,1}^{[l]} & \cdots & w_{n^{[l]},1}^{[l]} \\
\vdots & & \vdots & & \vdots \\
w_{1,k}^{[l]} & \cdots & w_{j,k}^{[l]} & \cdots & w_{n^{[l]},k}^{[l]} \\
\vdots & & \vdots & & \vdots \\
w_{1,n^{[l-1]}}^{[l]} & \cdots & w_{j,n^{[l-1]}}^{[l]} & \cdots & w_{n^{[l]},n^{[l-1]}}^{[l]}
\end{bmatrix}
\begin{bmatrix}
\frac{\partial J}{\partial z_{1,1}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{1,i}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{1,m}^{[l]}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial J}{\partial z_{j,1}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{j,i}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{j,m}^{[l]}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial J}{\partial z_{n^{[l]},1}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{n^{[l]},i}^{[l]}} & \cdots & \frac{\partial J}{\partial z_{n^{[l]},m}^{[l]}}
\end{bmatrix},
$$

which we can write as

$$\frac{\partial J}{\partial A^{[l-1]}} = \left(W^{[l]}\right)^{\!\top} \frac{\partial J}{\partial Z^{[l]}}, \tag{22}$$

where $\partial J / \partial A^{[l-1]} \in \mathbb{R}^{n^{[l-1]} \times m}$.
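In NumPy, (22) amounts to a single transposed matrix product; as before, dZ is a made-up stand-in for $\partial J / \partial Z^{[l]}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr, m = 4, 3, 5

W = rng.standard_normal((n_curr, n_prev))    # W^[l]
dZ = rng.standard_normal((n_curr, m))        # stand-in for dJ/dZ^[l]

dA_prev = W.T @ dZ                           # equation (22): dJ/dA^[l-1]

print(dA_prev.shape)                         # (4, 5)
```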

Summary

Forward propagation is seeded with $A^{[0]} = X$ and evaluates a set of recurrence relations to compute the predictions $A^{[L]} = \hat{Y}$. We also compute the cost $J = f(\hat{Y}, Y) = f(A^{[L]}, Y)$.

Backward propagation, on the other hand, is seeded with $\partial J / \partial A^{[L]} = \partial J / \partial \hat{Y}$ and evaluates a different set of recurrence relations to compute $\partial J / \partial W^{[l]}$ and $\partial J / \partial b^{[l]}$. If not stopped prematurely, it eventually computes $\partial J / \partial A^{[0]} = \partial J / \partial X$, a partial derivative we usually ignore.

Moreover, let us visualize the inputs we use and the outputs we produce during the forward and backward propagations:

[Figure: the quantities involved in layer $l$ during the forward pass ($A^{[l-1]}$, $W^{[l]}$, $b^{[l]}$, $Z^{[l]}$, $A^{[l]}$, and a cache$^{[l]}$) and during the backward pass ($dA^{[l]}$, $dZ^{[l]}$, $dW^{[l]}$, $db^{[l]}$, and $dA^{[l-1]}$).]
Figure 3: An overview of inputs and outputs.
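To tie the pieces together, here is a minimal sketch of one layer's forward and backward pass in NumPy, mirroring Figure 3. The function names are made up for illustration, and an elementwise sigmoid activation is assumed throughout so that (18) reduces to an elementwise product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(A_prev, W, b):
    """Equations (5) and (6) for one layer, assuming a sigmoid activation."""
    Z = W @ A_prev + b                       # equation (5); b broadcasts over the columns
    A = sigmoid(Z)                           # equation (6)
    cache = (A_prev, W, Z, A)                # cache^[l] in Figure 3
    return A, cache

def layer_backward(dA, cache):
    """Equations (15), (16), (18), and (22) for one layer, given dJ/dA^[l]."""
    A_prev, W, Z, A = cache
    dZ = dA * A * (1.0 - A)                  # equation (18), assuming the sigmoid
    dW = dZ @ A_prev.T                       # equation (15)
    db = np.sum(dZ, axis=1, keepdims=True)   # equation (16)
    dA_prev = W.T @ dZ                       # equation (22)
    return dA_prev, dW, db

# A quick shape check with made-up sizes.
rng = np.random.default_rng(0)
n_prev, n_curr, m = 4, 3, 5
A_prev = rng.standard_normal((n_prev, m))
W = rng.standard_normal((n_curr, n_prev))
b = rng.standard_normal((n_curr, 1))

A, cache = layer_forward(A_prev, W, b)
dA_prev, dW, db = layer_backward(rng.standard_normal((n_curr, m)), cache)
print(A.shape, dW.shape, db.shape, dA_prev.shape)   # (3, 5) (3, 4) (3, 1) (4, 5)
```

Chaining layer_forward over $l = 1, \dots, L$ and layer_backward in the reverse order gives exactly the recurrences summarized above.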

Now, you might have noticed that we have yet to derive an analytic expression for the backpropagation seed $\partial J / \partial A^{[L]} = \partial J / \partial \hat{Y}$. To recap, we have deferred the derivations that concern activation functions to the second post of this series. Similarly, since the third post will be dedicated to cost functions, we will instead address the derivation of the backpropagation seed there.

Last but not least: congratulations! You have made it to the end (of the first post). 🏅