Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

Translations: Chinese (Simplified), French, Japanese, Korean, Persian, Russian, Turkish, Uzbek

Watch: MIT’s Deep Learning State of the Art lecture referencing this post

May 25th update: New graphics (RNN animation, word embedding graph), color coding, elaborated on the final attention example.

Note: The animations below are videos. Touch or hover on them (if you’re using a mouse) to get play controls so you can pause if needed.

Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014, Cho et al., 2014).

I found, however, that understanding the model well enough to implement it requires unraveling a series of concepts that build on top of each other. I thought that a bunch of these ideas would be more accessible if expressed visually. That’s what I aim to do in this post. You’ll need some previous understanding of deep learning to get through this post. I hope it can be a useful companion to reading the papers mentioned above (and the attention papers linked later in the post).

A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items. A trained model would work like this:

In neural machine translation, a sequence is a series of words, processed one after another. The output is, likewise, a series of words:

Looking under the hood

Under the hood, the model is composed of an encoder and a decoder.

The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.

The same applies in the case of machine translation.

The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks (be sure to check out Luis Serrano’s A friendly introduction to Recurrent Neural Networks for an intro to RNNs).

The context is a vector of floats. Later in this post we will visualize vectors in color by assigning brighter colors to the cells with higher values.

You can set the size of the context vector when you set up your model. It is basically the number of hidden units in the encoder RNN. These visualizations show a vector of size 4, but in real-world applications the context vector would be of a size like 256, 512, or 1024.

By design, an RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence), and a hidden state. The word, however, needs to be represented by a vector. To transform a word into a vector, we turn to the class of methods called “word embedding” algorithms. These map words into a vector space that captures a lot of the meaning/semantic information of the words (e.g. king - man + woman = queen).
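
To make the king/queen arithmetic concrete, here is a minimal sketch using hand-made two-dimensional vectors. The vectors, the `embeddings` dictionary, and the `analogy` helper are all made up for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math

# Hand-made 2-d "embeddings" (axes: royalty, gender). These numbers are
# illustrative assumptions, not values from any trained model.
embeddings = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [0.2,  0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))

result = analogy("king", "man", "woman")
print(result)  # queen
```

With these toy vectors the arithmetic lands exactly on "queen"; with learned embeddings the result is only the nearest neighbor, not an exact match.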

We need to turn the input words into vectors before processing them. That transformation is done using a word embedding algorithm. We can use pre-trained embeddings or train our own embedding on our dataset. Embedding vectors of size 200 or 300 are typical; we're showing a vector of size four for simplicity.
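
In code, the word-to-vector step is often just a table lookup. The sketch below assumes a tiny three-word vocabulary and a randomly initialized table; `vocab`, `embedding_table`, and `embed` are hypothetical names, and a real model would load pre-trained rows or learn them during training.

```python
import random

random.seed(42)

# Hypothetical three-word vocabulary for the input sentence.
vocab = {"je": 0, "suis": 1, "étudiant": 2}
embedding_size = 4  # 200 or 300 would be more typical

# One row per word. The rows are random here; in practice they would be
# pre-trained or learned along with the rest of the model.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_size)]
    for _ in vocab
]

def embed(sentence):
    """Map a whitespace-separated sentence to a list of embedding vectors."""
    return [embedding_table[vocab[word]] for word in sentence.split()]

vectors = embed("je suis étudiant")
print(len(vectors), len(vectors[0]))  # 3 4
```

Each word becomes one vector of size four, and that list of vectors is what the encoder consumes, one per time step.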

Now that we’ve introduced our main vectors/tensors, let’s recap the mechanics of an RNN and establish a visual language to describe these models:

The next RNN step takes the second input vector and hidden state #1 to create the output of that time step. Later in the post, we’ll use an animation like this to describe the vectors inside a neural machine translation model.
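
Here is a rough sketch of two RNN steps, assuming the simplest (vanilla) RNN cell, where the output of a step is the new hidden state itself. The weight names `W_xh`, `W_hh`, and `b` are my own labels, and the weights are random rather than trained.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One vanilla RNN step: h = tanh(W_xh x + W_hh h_prev + b)."""
    pre = [a + c + d
           for a, c, d in zip(matvec(W_xh, x), matvec(W_hh, h_prev), b)]
    return [math.tanh(p) for p in pre]

random.seed(0)
input_size, hidden_size = 4, 4
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(input_size)]
        for _ in range(hidden_size)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        for _ in range(hidden_size)]
b = [0.0] * hidden_size

# Two input vectors standing in for two embedded words.
inputs = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(2)]

h0 = [0.0] * hidden_size                      # hidden state #0 (all zeros)
h1 = rnn_step(inputs[0], h0, W_xh, W_hh, b)   # step 1: input #1 + hidden state #0
h2 = rnn_step(inputs[1], h1, W_xh, W_hh, b)   # step 2: input #2 + hidden state #1
```

LSTM and GRU cells, which are more common in practice, follow the same pattern of taking an input and a previous state, but with more machinery inside the step.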

In the following visualization, each pulse for the encoder or decoder is that RNN processing its inputs and generating an output for that time step. Since the encoder and decoder are both RNNs, each time one of them does some processing, it updates its hidden state based on its current input and the previous inputs it has seen.

Let’s look at the hidden states for the encoder. Notice how the last hidden state is actually the context we pass along to the decoder.
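
As a minimal sketch of that idea, the encoder below is a plain loop over the input vectors that keeps every hidden state and hands back the last one as the context. It assumes a vanilla RNN cell with random, untrained weights; `encode`, `W_xh`, and `W_hh` are names I picked for illustration.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x, h_prev, W_xh, W_hh):
    """One vanilla RNN step: h = tanh(W_xh x + W_hh h_prev)."""
    pre = [a + c for a, c in zip(matvec(W_xh, x), matvec(W_hh, h_prev))]
    return [math.tanh(p) for p in pre]

def encode(input_vectors, W_xh, W_hh):
    """Run the encoder over the inputs; return all hidden states and the context."""
    h = [0.0] * len(W_hh)  # initial hidden state
    hidden_states = []
    for x in input_vectors:
        h = rnn_step(x, h, W_xh, W_hh)
        hidden_states.append(h)
    context = hidden_states[-1]  # the last hidden state is the context
    return hidden_states, context

random.seed(1)
size = 4
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(size)] for _ in range(size)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(size)] for _ in range(size)]

# Three random vectors standing in for a three-word embedded sentence.
sentence = [[random.uniform(-1, 1) for _ in range(size)] for _ in range(3)]
hidden_states, context = encode(sentence, W_xh, W_hh)
```

The context has the same size as the hidden state, which is why the number of hidden units determines the context vector size.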

The decoder also maintains a hidden state that it passes from one time step to the next. We just didn’t visualize it in this graphic because we’re concerned with the major parts of the model for now.

Let’s now look at another way to visualize a sequence-to-sequence model. This animation will make it easier to understand the static graphics that describe these models. This is called an “unrolled” view where instead of showing the one decoder, we show a copy of it for each time step. This way we can look at the inputs and outputs of each time step.

Let’s Pay Attention Now

The context vector turned out to be a bottleneck for these types of models. It made it challenging for the models to deal with long sentences, because the whole input has to be squeezed into one fixed-size vector. A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called “Attention”, which greatly improved the quality of machine translation systems. Attention allows the model to focus on the relevant parts of the input sequence as needed.

At time step 7, the attention mechanism enables the decoder to focus on the word "étudiant" ("student" in French) before it generates the English translation. This ability to amplify the signal from the relevant part of the input sequence makes attention models produce better results than models without attention.

Let’s continue looking at attention models at this high level of abstraction. An attention model differs from a classic sequence-to-sequence model in two main ways:

First, the encoder passes a lot more data to the decoder. Instead of passing the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder:

Second, an attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder does the following:

  1. Look at the set of encoder hidden states it received – each encoder hidden state is most associated with a certain word in the input sentence
  2. Give each hidden state a score (let’s ignore how the scoring is done for now)
  3. Multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores, and drowning out hidden states with low scores

This scoring exercise is done at each time step on the decoder side.
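
The three steps above can be sketched in a few lines. Two things here are my assumptions rather than anything stated so far: the score is a plain dot product between the decoder hidden state and each encoder hidden state (one of several scoring options in the attention papers), and the weighted hidden states are summed into a single context vector. The example numbers are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(encoder_states, decoder_state):
    # Step 2: score each encoder hidden state (dot product, an assumption).
    scores = [sum(e * d for e, d in zip(state, decoder_state))
              for state in encoder_states]
    # Step 3: multiply each hidden state by its softmaxed score.
    weights = softmax(scores)
    weighted = [[w * v for v in state]
                for w, state in zip(weights, encoder_states)]
    # Sum the weighted states into one context vector (an assumption).
    context = [sum(column) for column in zip(*weighted)]
    return weights, context

# Step 1: the set of encoder hidden states (made-up values, one per input word).
encoder_states = [
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.3, 0.0, 0.1],
    [0.9, 0.8, 0.7, 0.9],
]
decoder_state = [1.0, 1.0, 1.0, 1.0]

weights, context = attention_context(encoder_states, decoder_state)
```

With these numbers the third hidden state gets by far the largest weight, so the context vector ends up dominated by it while the other two are mostly drowned out.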

Let us now bring the whole thing together in the following visualization and look at how the attention process works:

  1. The attention decoder RNN takes in the embedding of the <END> token, and an initial decoder hidden state.
  2. The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
  3. Attention Step: We use the encoder hidden states and the h4 vector to calculate a context vector (C4) for this time step.
  4. We concatenate h4 and C4 into one vector.
  5. We pass this vector through a feedforward neural network (one trained jointly with the model).
  6. The output of the feedforward neural network indicates the output word of this time step.
  7. Repeat for the next time steps
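
Steps 4 to 6 can be sketched as follows. This is a toy under stated assumptions: h4 and C4 are random stand-ins, the feedforward network is a single untrained layer, the softmax over a four-word vocabulary is my choice of output layer, and `target_vocab` and `W_out` are hypothetical names.

```python
import math
import random

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(7)
hidden_size = 4
target_vocab = ["I", "am", "a", "student"]  # hypothetical output vocabulary

# Stand-ins for the decoder hidden state (h4) and the attention context (C4).
h4 = [random.uniform(-1, 1) for _ in range(hidden_size)]
C4 = [random.uniform(-1, 1) for _ in range(hidden_size)]

# Step 4: concatenate h4 and C4 into one vector.
combined = h4 + C4

# Step 5: a single feedforward layer with random (untrained) weights,
# one row of weights per word in the target vocabulary.
W_out = [[random.uniform(-0.5, 0.5) for _ in range(len(combined))]
         for _ in target_vocab]
logits = [sum(w * v for w, v in zip(row, combined)) for row in W_out]

# Step 6: the highest-probability entry indicates the output word.
probabilities = softmax(logits)
word = target_vocab[probabilities.index(max(probabilities))]
```

Because the weights are untrained, the chosen word is arbitrary; the point is only the shape of the computation, where a vector of size 8 goes in and one score per vocabulary word comes out.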

This is another way to look at which part of the input sentence we’re paying attention to at each decoding step:

Note that the model isn’t just mindlessly aligning the first word at the output with the first word from the input. It actually learned from the training phase how to align words in that language pair (French and English in our example). An example of how precise this mechanism can be comes from the attention papers listed above:

You can see how the model paid attention correctly when outputting "European Economic Area". In French, the order of these words is reversed ("zone économique européenne") as compared to English. Every other word in the sentence is in a similar order.

If you feel you’re ready to learn the implementation, be sure to check TensorFlow’s Neural Machine Translation (seq2seq) Tutorial.

I hope you’ve found this useful. These visuals are early iterations of a lesson on attention that is part of the Udacity Natural Language Processing Nanodegree Program. We go into more detail in the lesson, including discussing applications and touching on more recent attention methods like the Transformer model from Attention Is All You Need.

Check out the trailer of the NLP Nanodegree Program:

I’ve also created a few lessons as a part of Udacity’s Machine Learning Nanodegree Program. The lessons I’ve created cover Unsupervised Learning, as well as a Jupyter notebook on movie recommendations using collaborative filtering.

I’d love any feedback you may have. Please reach me at @JayAlammmar.

Written on May 9, 2018