Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

Translations: Chinese (Simplified), French, Japanese, Korean, Persian, Russian, Turkish, Uzbek

Watch: MIT’s Deep Learning State of the Art lecture referencing this post

May 25th update: New graphics (RNN animation, word embedding graph), color coding, elaborated on the final attention example.

Note: The animations below are videos. Touch or hover on them (if you’re using a mouse) to get play controls so you can pause if needed.

Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014, Cho et al., 2014).

I found, however, that understanding the model well enough to implement it requires unraveling a series of concepts that build on top of each other. I thought that a bunch of these ideas would be more accessible if expressed visually. That’s what I aim to do in this post. You’ll need some previous understanding of deep learning to get through this post. I hope it can be a useful companion to reading the papers mentioned above (and the attention papers linked later in the post).

A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items. A trained model would work like this:


In neural machine translation, a sequence is a series of words, processed one after another. The output is, likewise, a series of words:

Looking under the hood

Under the hood, the model is composed of an encoder and a decoder.

The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.
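To make that hand-off concrete, here is a toy sketch of the data flow only. The `encode` and `decode` functions below are stand-ins made up for illustration; the real components are RNNs, which we look at next.

```python
# A toy stand-in for the encoder/decoder hand-off (data flow only, no learning).
def encode(input_items):
    """Fold the entire input sequence into a single context (here, just a number)."""
    context = 0.0
    for item in input_items:
        context += len(item)      # a real encoder updates an RNN hidden state here
    return context

def decode(context, num_steps=3):
    """Produce the output sequence item by item, using only the context."""
    return [f"output_{step}_from_context_{context:.0f}" for step in range(num_steps)]

print(decode(encode(["je", "suis", "étudiant"])))
```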


The same applies in the case of machine translation.

The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks (Be sure to check out Luis Serrano’s A friendly introduction to Recurrent Neural Networks for an intro to RNNs).

The context is a vector of floats. Later in this post we will visualize vectors in color by assigning brighter colors to the cells with higher values.

You can set the size of the context vector when you set up your model. It is basically the number of hidden units in the encoder RNN. These visualizations show a vector of size 4, but in real world applications the context vector would be of a size like 256, 512, or 1024.
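As a quick illustrative sketch (with made-up numbers), the context is just a 1-D array of floats whose length is whatever hidden size you configure:

```python
import numpy as np

hidden_units = 4                      # 256, 512, or 1024 in real applications
context = np.random.randn(hidden_units).astype(np.float32)
print(context)          # e.g. [ 0.21 -1.03  0.55  0.02] -- a vector of 4 floats
print(context.shape)    # (4,)
```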


By design, an RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence), and a hidden state. The word, however, needs to be represented by a vector. To transform a word into a vector, we turn to the class of methods called “word embedding” algorithms. These turn words into vector representations that capture a lot of the meaning/semantic information of the words (e.g. king - man + woman = queen).


We need to turn the input words into vectors before processing them. That transformation is done using a word embedding algorithm. We can use pre-trained embeddings or train our own embedding on our dataset. Embedding vectors of size 200 or 300 are typical; we're showing a vector of size four for simplicity.
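Here is a minimal sketch of that lookup, assuming a tiny made-up vocabulary and a randomly initialized embedding matrix; a real system would use trained embeddings (e.g. word2vec or GloVe) with sizes like 200 or 300.

```python
import numpy as np

# Toy vocabulary and randomly initialized embedding matrix (illustrative only).
vocab = {"je": 0, "suis": 1, "étudiant": 2}
embedding_size = 4                          # 4 to match the visualizations above
embedding_matrix = np.random.randn(len(vocab), embedding_size)

def embed(word):
    # Turning a word into its vector is a row lookup in the embedding matrix.
    return embedding_matrix[vocab[word]]

input_vectors = [embed(w) for w in ["je", "suis", "étudiant"]]
print(input_vectors[0].shape)               # (4,)
```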

Now that we’ve introduced our main vectors/tensors, let’s recap the mechanics of an RNN and establish a visual language to describe these models:


The next RNN step takes the second input vector and hidden state #1 to create the output of that time step. Later in the post, we’ll use an animation like this to describe the vectors inside a neural machine translation model.
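Here is a minimal sketch of that step for a vanilla RNN, with toy sizes and random weights. An actual NMT model would typically use LSTM or GRU cells, but the input/hidden-state bookkeeping is the same.

```python
import numpy as np

input_size, hidden_size = 4, 4
W_x = np.random.randn(hidden_size, input_size)    # input-to-hidden weights
W_h = np.random.randn(hidden_size, hidden_size)   # hidden-to-hidden weights
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # new hidden state = tanh(W_x @ input vector + W_h @ previous hidden state + b)
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h0 = np.zeros(hidden_size)                        # initial hidden state
x1, x2 = np.random.randn(input_size), np.random.randn(input_size)
h1 = rnn_step(x1, h0)                             # time step #1
h2 = rnn_step(x2, h1)                             # time step #2, as described above
```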


In the following visualization, each pulse for the encoder or decoder is that RNN processing its inputs and generating an output for that time step. Since the encoder and decoder are both RNNs, at each time step one of the RNNs does some processing: it updates its hidden state based on its current input and the previous inputs it has seen.

Let’s look at the hidden states for the encoder. Notice how the last hidden state is actually the context we pass along to the decoder.
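A minimal sketch of the encoder pass, again with random stand-in weights and embeddings: we collect one hidden state per input word, and the context handed to the decoder is simply the last one, so its size equals the number of hidden units.

```python
import numpy as np

hidden_size = 4                                   # 256/512/1024 in practice
W_x = np.random.randn(hidden_size, 4)             # random stand-in weights
W_h = np.random.randn(hidden_size, hidden_size)

# Stand-in embedded inputs for "je suis étudiant".
embedded_inputs = [np.random.randn(4) for _ in range(3)]

hidden_states = []
h = np.zeros(hidden_size)
for x in embedded_inputs:
    h = np.tanh(W_x @ x + W_h @ h)                # one encoder time step
    hidden_states.append(h)

context = hidden_states[-1]                       # last hidden state = the context
print(context.shape)                              # (4,) -- same as hidden_size
```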


The decoder also maintains a hidden state that it passes from one time step to the next. We just didn’t visualize it in this graphic because we’re concerned with the major parts of the model for now.

Let’s now look at another way to visualize a sequence-to-sequence model. This animation will make it easier to understand the static graphics that describe these models. This is called an “unrolled” view where instead of showing the one decoder, we show a copy of it for each time step. This way we can look at the inputs and outputs of each time step.


Let’s Pay Attention Now

The context vector turned out to be a bottleneck for these types of models. It made it challenging for the models to deal with long sentences. A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called “Attention”, which greatly improved the quality of machine translation systems. Attention allows the model to focus on the relevant parts of the input sequence as needed.

At time step 7, the attention mechanism enables the decoder to focus on the word "étudiant" ("student" in French) before it generates the English translation. This ability to amplify the signal from the relevant part of the input sequence makes attention models produce better results than models without attention.


Let’s continue looking at attention models at this high level of abstraction. An attention model differs from a classic sequence-to-sequence model in two main ways:

First, the encoder passes a lot more data to the decoder. Instead of passing the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder:


Second, an attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder does the following (a minimal code sketch follows the list):

  1. Look at the set of encoder hidden states it received – each encoder hidden state is most associated with a certain word in the input sentence
  2. Give each hidden state a score (let’s ignore how the scoring is done for now)
  3. Multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores, and drowning out hidden states with low scores
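Here is a minimal sketch of steps 2 and 3, assuming dot-product scoring (one of the scoring variants discussed in Luong et al., 2015) and random stand-in vectors:

```python
import numpy as np

encoder_states = np.random.randn(3, 4)   # stand-in: one hidden state per input word
decoder_hidden = np.random.randn(4)      # stand-in: current decoder hidden state

scores = encoder_states @ decoder_hidden             # step 2: one score per state
weights = np.exp(scores) / np.exp(scores).sum()      # softmaxed scores
context = weights @ encoder_states                   # step 3: weighted sum of states
print(weights.round(2), context.shape)               # weights sum to 1.0; (4,)
```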



This scoring exercise is done at each time step on the decoder side.

Let us now bring the whole thing together in the following visualization and look at how the attention process works (a code sketch of one decoding step follows the list):

  1. The attention decoder RNN takes in the embedding of the <END> token, and an initial decoder hidden state.
  2. The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
  3. Attention Step: We use the encoder hidden states and the h4 vector to calculate a context vector (C4) for this time step.
  4. We concatenate h4 and C4 into one vector.
  5. We pass this vector through a feedforward neural network (one trained jointly with the model).
  6. The output of the feedforward neural network indicates the output word of this time step.
  7. Repeat for the next time steps.
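Here is a minimal sketch of one such decoding step, with toy sizes, random weights, and dot-product scoring; a real model uses trained weights, an actual output vocabulary, and typically LSTM/GRU cells.

```python
import numpy as np

hidden_size, vocab_size = 4, 10
encoder_states = np.random.randn(3, hidden_size)      # stand-in encoder hidden states
W_x = np.random.randn(hidden_size, hidden_size)       # decoder RNN weights (random)
W_h = np.random.randn(hidden_size, hidden_size)
W_out = np.random.randn(vocab_size, 2 * hidden_size)  # stand-in for the jointly trained feedforward layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_decoder_step(prev_word_embedding, h_prev):
    h = np.tanh(W_x @ prev_word_embedding + W_h @ h_prev)  # 2. RNN step -> new hidden state (h4)
    scores = encoder_states @ h                             # 3. attention: score encoder states...
    context = softmax(scores) @ encoder_states              #    ...and build the context vector (C4)
    combined = np.concatenate([h, context])                 # 4. concatenate h4 and C4
    logits = W_out @ combined                               # 5. feedforward network
    word_id = int(np.argmax(logits))                        # 6. pick the output word
    return word_id, h

end_token_embedding = np.random.randn(hidden_size)          # 1. embedding of the <END> token
word_id, h = attention_decoder_step(end_token_embedding, np.zeros(hidden_size))
print(word_id)
```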



This is another way to look at which part of the input sentence we’re paying attention to at each decoding step:

Note that the model isn’t just mindlessly aligning the first word at the output with the first word from the input. It actually learned from the training phase how to align words in that language pair (French and English in our example). An example of how precise this mechanism can be comes from the attention papers listed above:

You can see how the model paid attention correctly when outputting "European Economic Area". In French, the order of these words is reversed ("européenne économique zone") as compared to English. Every other word in the sentence is in a similar order.


If you feel you’re ready to learn the implementation, be sure to check TensorFlow’s Neural Machine Translation (seq2seq) Tutorial.



I hope you’ve found this useful. These visuals are early iterations of a lesson on attention that is part of the Udacity Natural Language Processing Nanodegree Program. We go into more details in the lesson, including discussing applications and touching on more recent attention methods like the Transformer model from Attention Is All You Need.

Check out the trailer of the NLP Nanodegree Program:

I’ve also created a few lessons as a part of Udacity’s Machine Learning Nanodegree Program. The lessons I’ve created cover Unsupervised Learning, as well as a Jupyter notebook on movie recommendations using collaborative filtering.

I’d love any feedback you may have. Please reach me at @JayAlammar.

Written on May 9, 2018