
Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov
Google Inc., Mountain View, CA
tmikolov@google.com

Kai Chen
Google Inc., Mountain View, CA
kaichen@google.com

Greg Corrado
Google Inc., Mountain View, CA
gcorrado@google.com

Jeffrey Dean
Google Inc., Mountain View, CA
jeff@google.com

Abstract

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

1 Introduction

Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary. This choice has several good reasons - simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. An example is the popular N-gram model used for statistical language modeling - today, it is possible to train N-grams on virtually all available data (trillions of words [3]).
However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques.
With progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data sets, and they typically outperform the simple models. Probably the most successful concept is to use distributed representations of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].

1.1 Goals of the Paper

The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred of millions of words, with a modest dimensionality of the word vectors between 50 - 100.
We use recently proposed techniques for measuring the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity [20]. This has been observed earlier in the context of inflectional languages - for example, nouns can have multiple word endings, and if we search for similar words in a subspace of the original vector space, it is possible to find words that have similar endings [13, 14].
Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen [20].
In this paper, we try to maximize accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words. We design a new comprehensive test set for measuring both syntactic and semantic regularities and show that many such regularities can be learned with high accuracy. Moreover, we discuss how training time and accuracy depends on the dimensionality of the word vectors and on the amount of the training data.

1.2 Previous Work

Representation of words as continuous vectors has a long history [10, 26, 8]. A very popular model architecture for estimating neural network language model (NNLM) was proposed in [1], where a feedforward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others.
Another interesting architecture of NNLM was presented in [13, 14], where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model.
It was later shown that the word vectors can be used to significantly improve and simplify many NLP applications [4, 5, 29]. Estimation of the word vectors itself was performed using different model architectures and trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word vectors were made available for future research and comparison. However, as far as we know, these architectures were significantly more computationally expensive for training than the one proposed in [13], with the exception of certain version of log-bilinear model where diagonal weight matrices are used [23].

2 Model Architectures

Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words [20, 31]; LDA moreover becomes computationally very expensive on large data sets.
Similar to [18], to compare different model architectures we define first the computational complexity of a model as the number of parameters that need to be accessed to fully train the model. Next, we will try to maximize the accuracy, while minimizing the computational complexity.
For all the following models, the training complexity is proportional to

O = E × T × Q,     (1)

where E is number of the training epochs, T is the number of the words in the training set and Q is defined further for each model architecture. Common choice is E = 3 - 50 and T up to one billion. All models are trained using stochastic gradient descent and backpropagation [26].

2.1 Feedforward Neural Net Language Model (NNLM)

The probabilistic feedforward neural network language model has been proposed in [1]. It consists of input, projection, hidden and output layers. At the input layer, N previous words are encoded using 1-of-V coding, where V is size of the vocabulary. The input layer is then projected to a projection layer P that has dimensionality N × D, using a shared projection matrix. As only N inputs are active at any given time, composition of the projection layer is a relatively cheap operation.
The NNLM architecture becomes complex for computation between the projection and the hidden layer, as values in the projection layer are dense. For a common choice of N = 10, the size of the projection layer (N × D) might be 500 to 2000, while the hidden layer size H is typically 500 to 1000 units. Moreover, the hidden layer is used to compute probability distribution over all the words in the vocabulary, resulting in an output layer with dimensionality V. Thus, the computational complexity per each training example is

Q = N × D + N × D × H + H × V,     (2)
where the dominating term is H × V. However, several practical solutions were proposed for avoiding it; either using hierarchical versions of the softmax [25, 23, 18], or avoiding normalized models completely by using models that are not normalized during training [4, 9]. With binary tree representations of the vocabulary, the number of output units that need to be evaluated can go down to around log2(V). Thus, most of the complexity is caused by the term N × D × H.
In our models, we use hierarchical softmax where the vocabulary is represented as a Huffman binary tree. This follows previous observations that the frequency of words works well for obtaining classes in neural net language models [16]. Huffman trees assign short binary codes to frequent words, and this further reduces the number of output units that need to be evaluated: while balanced binary tree would require log2(V) outputs to be evaluated, the Huffman tree based hierarchical softmax requires only about log2(Unigram_perplexity(V)). For example when the vocabulary size is one million words, this results in about two times speedup in evaluation. While this is not crucial speedup for neural network LMs as the computational bottleneck is in the N × D × H term, we will later propose architectures that do not have hidden layers and thus depend heavily on the efficiency of the softmax normalization.
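As a toy illustration of the Huffman coding idea (a sketch, not the paper's implementation), the following Python snippet builds binary codes for a small made-up vocabulary from word counts; frequent words receive short codes, so the expected number of tree nodes visited per output falls below log2(V) when the frequency distribution is skewed.

```python
import heapq

def huffman_codes(freqs):
    """Build Huffman binary codes for a {word: count} dictionary."""
    # Heap items are (count, tie_breaker, tree); a tree is either a word or a (left, right) pair.
    heap = [(c, i, w) for i, (w, c) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    assign(heap[0][2], "")
    return codes

# Toy Zipf-like counts, purely illustrative.
freqs = {"the": 1000, "of": 700, "cat": 50, "sat": 40, "mat": 30, "astrolabe": 2}
codes = huffman_codes(freqs)
avg_len = sum(freqs[w] * len(codes[w]) for w in freqs) / sum(freqs.values())
print(codes)     # frequent words get short codes
print(avg_len)   # expected code length is well below log2(len(freqs)) for skewed counts
```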

2.2 Recurrent Neural Net Language Model (RNNLM)

Recurrent neural network based language model has been proposed to overcome certain limitations of the feedforward NNLM, such as the need to specify the context length (the order of the model N), and because theoretically RNNs can efficiently represent more complex patterns than the shallow neural networks [15, 2]. The RNN model does not have a projection layer; only input, hidden and output layer. What is special for this type of model is the recurrent matrix that connects hidden layer to itself, using time-delayed connections. This allows the recurrent model to form some kind of short term memory, as information from the past can be represented by the hidden layer state that gets updated based on the current input and the state of the hidden layer in the previous time step.
The complexity per training example of the RNN model is

Q = H × H + H × V,     (3)

where the word representations D have the same dimensionality as the hidden layer H. Again, the term H × V can be efficiently reduced to H × log2(V) by using hierarchical softmax. Most of the complexity then comes from H × H.
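To make these terms concrete, here is a quick back-of-the-envelope comparison using illustrative values from the ranges quoted above (N = 10, D = H = 500, V = 1M are assumptions for the example, not values fixed by the paper):

```python
import math

N, D, H, V = 10, 500, 500, 1_000_000   # illustrative values from the ranges above
logV = math.log2(V)                    # ~20 outputs evaluated with a binary-tree softmax

q_nnlm = N * D + N * D * H + H * logV  # Eq. (2) with the H*V term reduced to H*log2(V)
q_rnn = H * H + H * logV               # Eq. (3) with the same reduction
print(f"NNLM per-example cost  ~ {q_nnlm:,.0f}")   # dominated by N*D*H = 2,500,000
print(f"RNNLM per-example cost ~ {q_rnn:,.0f}")    # dominated by H*H = 250,000
```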

2.3 Parallel Training of Neural Networks

To train models on huge data sets, we have implemented several models on top of a large-scale distributed framework called DistBelief [6], including the feedforward NNLM and the new models proposed in this paper. The framework allows us to run multiple replicas of the same model in parallel, and each replica synchronizes its gradient updates through a centralized server that keeps all the parameters. For this parallel training, we use mini-batch asynchronous gradient descent with an adaptive learning rate procedure called Adagrad [7]. Under this framework, it is common to use one hundred or more model replicas, each using many CPU cores at different machines in a data center.
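For reference, a minimal single-machine sketch of the Adagrad update used here (the asynchronous parameter-server machinery of DistBelief is not reproduced; the learning rate and array shapes are illustrative):

```python
import numpy as np

def adagrad_update(params, grads, cache, lr=0.025, eps=1e-8):
    """One Adagrad step: per-parameter step sizes shrink with the accumulated squared gradients."""
    cache += grads ** 2
    params -= lr * grads / (np.sqrt(cache) + eps)
    return params, cache

# Usage: keep one cache array per parameter matrix, initialized to zeros.
W = np.random.randn(1000, 300) * 0.01
cache = np.zeros_like(W)
grad = np.random.randn(*W.shape) * 0.001      # stand-in for a real mini-batch gradient
W, cache = adagrad_update(W, grad, cache)
```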

3 New Log-linear Models

In this section, we propose two new model architectures for learning distributed representations of words that try to minimize computational complexity. The main observation from the previous section was that most of the complexity is caused by the non-linear hidden layer in the model. While this is what makes neural networks so attractive, we decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can possibly be trained on much more data efficiently.
The new architectures directly follow those proposed in our earlier work [13, 14], where it was found that neural network language model can be successfully trained in two steps: first, continuous word vectors are learned using simple model, and then the N-gram NNLM is trained on top of these distributed representations of words. While there has been later substantial amount of work that focuses on learning word vectors, we consider the approach proposed in [13] to be the simplest one. Note that related models have been proposed also much earlier [26, 8].

3.1 Continuous Bag-of-Words Model

The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged). We call this architecture a bag-of-words model as the order of words in the history does not influence the projection. Furthermore, we also use words from the future; we have obtained the best performance on the task introduced in the next section by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word. Training complexity is then

Q = N × D + D × log2(V).     (4)
We denote this model further as CBOW, as unlike standard bag-of-words model, it uses continuous distributed representation of the context. The model architecture is shown at Figure 1. Note that the weight matrix between the input and the projection layer is shared for all word positions in the same way as in the NNLM.
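The following is a minimal numpy sketch of one CBOW update, written with a plain softmax output for readability (the models in this paper use hierarchical softmax over a Huffman tree); the vocabulary size, dimensionality and word ids are toy values.

```python
import numpy as np

def cbow_step(W_in, W_out, context_ids, center_id, lr=0.025):
    """One CBOW update: average the context word vectors and predict the middle word.
    Plain softmax is used here for readability; the paper's models use hierarchical softmax."""
    h = W_in[context_ids].mean(axis=0)        # shared projection; word order is ignored
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over the whole vocabulary
    probs[center_id] -= 1.0                   # d(cross-entropy)/d(scores)
    grad_h = W_out @ probs                    # backpropagate into the averaged projection
    W_out -= lr * np.outer(h, probs)
    W_in[context_ids] -= lr * grad_h / len(context_ids)

V, D = 5000, 100                              # toy vocabulary size and dimensionality
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.01, (V, D))            # input->projection matrix, i.e. the word vectors
W_out = np.zeros((D, V))
# Four history and four future word ids form the context of the middle word.
cbow_step(W_in, W_out, [11, 42, 7, 99, 3, 8, 15, 23], 256)
```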

3.2 Continuous Skip-gram Model

The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence. More precisely, we use each current word as an input to a log-linear classifier with continuous projection layer, and predict words within a certain range before and after the current word. We found that increasing the range improves quality of the resulting word vectors, but it also increases the computational complexity. Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples.
The training complexity of this architecture is proportional to

Q = C × (D + D × log2(V)),     (5)
where C is the maximum distance of the words. Thus, if we choose C = 5, for each training word we will select randomly a number R in range < 1; C >, and then use R words from history and R words from the future of the current word as correct labels. This will require us to do R × 2 word classifications, with the current word as input, and each of the R + R words as output. In the following experiments, we use C = 10.

Figure 1: New model architectures. The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word.
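A small sketch of the window sampling described above: for each position an effective window R is drawn uniformly from < 1; C >, which gives distant context words proportionally less weight. The tokenized sentence is invented for the example.

```python
import random

def skipgram_pairs(tokens, C=5, seed=1):
    """Yield (input_word, context_word) training pairs with a randomly shrunk window."""
    rnd = random.Random(seed)
    for i, center in enumerate(tokens):
        R = rnd.randint(1, C)                    # effective window for this position
        for j in range(max(0, i - R), min(len(tokens), i + R + 1)):
            if j != i:
                yield center, tokens[j]          # predict the context word from the center word

sentence = "the quick brown fox jumps over the lazy dog".split()
for pair in skipgram_pairs(sentence, C=5):
    print(pair)
```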

4 Results

To compare the quality of different versions of word vectors, previous papers typically use a table showing example words and their most similar words, and understand them intuitively. Although it is easy to show that word France is similar to Italy and perhaps some other countries, it is much more challenging when subjecting those vectors in a more complex similarity task, as follows. We follow previous observation that there can be many different types of similarities between words, for example, word big is similar to bigger in the same sense that small is similar to smaller. Example of another type of relationship can be word pairs big - biggest and small - smallest [20]. We further denote two pairs of words with the same relationship as a question, as we can ask: "What is the word that is similar to small in the same sense as biggest is similar to big?"
Somewhat surprisingly, these questions can be answered by performing simple algebraic operations with the vector representation of words. To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute vector X = vector("biggest") - vector("big") + vector("small"). Then, we search in the vector space for the word closest to X measured by cosine distance, and use it as the answer to the question (we discard the input question words during this search). When the word vectors are well trained, it is possible to find the correct answer (word smallest) using this method.
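The search itself is simple vector arithmetic followed by a cosine nearest-neighbour lookup; a minimal sketch with a made-up toy embedding dictionary:

```python
import numpy as np

def analogy(emb, a, b, c):
    """Return the word closest to vector(b) - vector(a) + vector(c), excluding the inputs."""
    x = emb[b] - emb[a] + emb[c]
    x /= np.linalg.norm(x)
    best, best_sim = None, -1.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue                              # question words are discarded from the search
        sim = v @ x / np.linalg.norm(v)           # cosine similarity
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Toy 3-dimensional "embeddings"; real vectors would come from a trained model.
emb = {w: np.array(v, dtype=float) for w, v in {
    "big": [1, 0, 0], "biggest": [1, 1, 0],
    "small": [0, 0, 1], "smallest": [0, 1, 1]}.items()}
print(analogy(emb, "big", "biggest", "small"))    # expected: "smallest"
```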
Finally, we found that when we train high dimensional word vectors on a large amount of data, the resulting vectors can be used to answer very subtle semantic relationships between words, such as a city and the country it belongs to, e.g. France is to Paris as Germany is to Berlin. Word vectors with such semantic relationships could be used to improve many existing NLP applications, such as machine translation, information retrieval and question answering systems, and may enable other future applications yet to be invented.
Table 1: Examples of five types of semantic and nine types of syntactic questions in the Semantic-Syntactic Word Relationship test set.

Type of relationship     Word Pair 1               Word Pair 2
Common capital city      Athens : Greece           Oslo : Norway
All capital cities       Astana : Kazakhstan       Harare : Zimbabwe
Currency                 Angola : kwanza           Iran : rial
City-in-state            Chicago : Illinois        Stockton : California
Man-Woman                brother : sister          grandson : granddaughter
Adjective to adverb      apparent : apparently     rapid : rapidly
Opposite                 possibly : impossibly     ethical : unethical
Comparative              great : greater           tough : tougher
Superlative              easy : easiest            lucky : luckiest
Present Participle       think : thinking          read : reading
Nationality adjective    Switzerland : Swiss       Cambodia : Cambodian
Past tense               walking : walked          swimming : swam
Plural nouns             mouse : mice              dollar : dollars
Plural verbs             work : works              speak : speaks

4.1 Task Description

To measure quality of the word vectors, we define a comprehensive test set that contains five types of semantic questions, and nine types of syntactic questions. Two examples from each category are shown in Table 1. Overall, there are 8869 semantic and 10675 syntactic questions. The questions in each category were created in two steps: first, a list of similar word pairs was created manually. Then, a large list of questions is formed by connecting two word pairs. For example, we made a list of 68 large American cities and the states they belong to, and formed about 2.5K questions by picking two word pairs at random. We have included in our test set only single token words, thus multi-word entities are not present (such as New York).
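Forming questions by connecting two word pairs can be sketched as follows (the pairs shown are taken from Table 1; the paper picks such combinations at random, while this sketch simply enumerates every ordered combination):

```python
from itertools import permutations

# A hand-made list of similar word pairs (here, three capital-city pairs from Table 1).
pairs = [("Athens", "Greece"), ("Oslo", "Norway"), ("Astana", "Kazakhstan")]

# Connecting two distinct pairs yields one analogy question:
# a is to b as c is to ?, with d as the expected answer.
questions = [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]
for a, b, c, d in questions:
    print(f"{a} : {b} :: {c} : {d}")
```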
We evaluate the overall accuracy for all question types, and for each question type separately (semantic, syntactic). Question is assumed to be correctly answered only if the closest word to the vector computed using the above method is exactly the same as the correct word in the question; synonyms are thus counted as mistakes. This also means that reaching 100% accuracy is likely to be impossible, as the current models do not have any input information about word morphology. However, we believe that usefulness of the word vectors for certain applications should be positively correlated with this accuracy metric. Further progress can be achieved by incorporating information about structure of words, especially for the syntactic questions.
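Under this protocol, the evaluation reduces to the analogy search sketched earlier plus an exact string match; a minimal sketch (the analogy helper and the embedding dictionary are assumed to exist):

```python
def evaluate(emb, questions, analogy_fn):
    """Exact-match accuracy over (a, b, c, d) questions: a is to b as c is to d."""
    correct = 0
    for a, b, c, d in questions:
        if not all(w in emb for w in (a, b, c, d)):
            continue                          # out-of-vocabulary questions count as errors here
        predicted = analogy_fn(emb, a, b, c)  # closest word to vector(b) - vector(a) + vector(c)
        correct += int(predicted == d)        # a synonym of d still counts as a mistake
    return correct / len(questions)
```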

4.2 Maximization of Accuracy

We have used a Google News corpus for training the word vectors. This corpus contains about 6B tokens. We have restricted the vocabulary size to 1 million most frequent words. Clearly, we are facing time constrained optimization problem, as it can be expected that both using more data and higher dimensional word vectors will improve the accuracy. To estimate the best choice of model architecture for obtaining as good as possible results quickly, we have first evaluated models trained on subsets of the training data, with vocabulary restricted to the most frequent 30k words. The results using the CBOW architecture with different choice of word vector dimensionality and increasing amount of the training data are shown in Table 2.
It can be seen that after some point, adding more dimensions or adding more training data provides diminishing improvements. So, we have to increase both vector dimensionality and the amount of the training data together. While this observation might seem trivial, it must be noted that it is currently popular to train word vectors on relatively large amounts of data, but with insufficient size (such as 50 - 100).
Table 2: Accuracy on subset of the Semantic-Syntactic Word Relationship test set, using word vectors from the CBOW architecture with limited vocabulary. Only questions containing words from the most frequent 30k words are used.

Dimensionality / Training words    24M     49M     98M     196M    391M    783M
 50                                13.4    15.7    18.6    19.1    22.5    23.2
100                                19.4    23.1    27.8    28.7    33.4    32.2
300                                23.2    29.2    35.3    38.6    43.7    45.9
600                                24.0    30.1    36.5    40.8    46.6    50.4
Table 3: Comparison of architectures using models trained on the same data, with 640-dimensional word vectors. The accuracies are reported on our Semantic-Syntactic Word Relationship test set, and on the syntactic relationship test set of [20].

Model Architecture    Semantic Accuracy [%]    Syntactic Accuracy [%]    MSR Word Relatedness Test Set [20]
RNNLM                  9                       36                        35
NNLM                  23                       53                        47
CBOW                  24                       64                        61
Skip-gram             55                       59                        56
Given Equation 4, increasing amount of training data twice results in about the same increase of computational complexity as increasing vector size twice.
For the experiments reported in Tables 2 and 4, we used three training epochs with stochastic gradient descent and backpropagation. We chose starting learning rate 0.025 and decreased it linearly, so that it approaches zero at the end of the last training epoch.
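The schedule amounts to a simple linear decay over the words (or examples) processed, e.g. the sketch below; the small floor is a common practical safeguard, not something specified in the paper.

```python
def learning_rate(words_processed, total_words, start_lr=0.025):
    """Linearly decay the learning rate so that it approaches zero at the end of training."""
    remaining = 1.0 - words_processed / total_words
    return max(start_lr * remaining, start_lr * 1e-4)  # tiny floor keeps updates from vanishing
```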

4.3 Comparison of Model Architectures

First we compare different model architectures for deriving the word vectors using the same training data and using the same dimensionality of 640 of the word vectors. In the further experiments, we use full set of questions in the new Semantic-Syntactic Word Relationship test set, i.e. unrestricted to the 30k vocabulary. We also include results on a test set introduced in [20] that focuses on syntactic similarity between words.
The training data consists of several LDC corpora and is described in detail in [18] (320M words, 82K vocabulary). We used these data to provide a comparison to a previously trained recurrent neural network language model that took about 8 weeks to train on a single CPU. We trained a feedforward NNLM with the same number of 640 hidden units using the DistBelief parallel training [6], using a history of 8 previous words (thus, the NNLM has more parameters than the RNNLM, as the projection layer has size 640 × 8).
In Table 3, it can be seen that the word vectors from the RNN (as used in [20]) perform well mostly on the syntactic questions. The NNLM vectors perform significantly better than the RNN - this is not surprising, as the word vectors in the RNNLM are directly connected to a non-linear hidden layer. The CBOW architecture works better than the NNLM on the syntactic tasks, and about the same on the semantic one. Finally, the Skip-gram architecture works slightly worse on the syntactic task than the CBOW model (but still better than the NNLM), and much better on the semantic part of the test than all the other models.
Next, we evaluated our models trained using one CPU only and compared the results against publicly available word vectors. The comparison is given in Table 4.
Table 4: Comparison of publicly available word vectors on the Semantic-Syntactic Word Relationship test set, and word vectors from our models. Full vocabularies are used.

Model                    Vector Dimensionality   Training words   Semantic [%]   Syntactic [%]   Total [%]
Collobert-Weston NNLM     50                      660M             9.3            12.3            11.0
Turian NNLM               50                       37M             1.4             2.6             2.1
Turian NNLM              200                       37M             1.4             2.2             1.8
Mnih NNLM                 50                       37M             1.8             9.1             5.8
Mnih NNLM                100                       37M             3.3            13.2             8.8
Mikolov RNNLM             80                      320M             4.9            18.4            12.7
Mikolov RNNLM            640                      320M             8.6            36.5            24.6
Huang NNLM                50                      990M            13.3            11.6            12.3
Our NNLM                  20                        6B            12.9            26.4            20.3
Our NNLM                  50                        6B            27.9            55.8            43.2
Our NNLM                 100                        6B            34.2            64.5            50.8
CBOW                     300                      783M            15.5            53.1            36.1
Skip-gram                300                      783M            50.0            55.9            53.3
Table 5: Comparison of models trained for three epochs on the same data and models trained for one epoch. Accuracy is reported on the full Semantic-Syntactic data set.

Model                Vector Dimensionality   Training words   Semantic [%]   Syntactic [%]   Total [%]   Training time [days]
3 epoch CBOW          300                     783M            15.5           53.1            36.1        1
3 epoch Skip-gram     300                     783M            50.0           55.9            53.3        3
1 epoch CBOW          300                     783M            13.8           49.9            33.6        0.3
1 epoch CBOW          300                     1.6B            16.1           52.6            36.1        0.6
1 epoch CBOW          600                     783M            15.4           53.3            36.2        0.7
1 epoch Skip-gram     300                     783M            45.6           52.2            49.2        1
1 epoch Skip-gram     300                     1.6B            52.2           55.1            53.8        2
1 epoch Skip-gram     600                     783M            56.7           54.5            55.5        2.5
The CBOW model was trained on a subset of the Google News data in about a day, while training time for the Skip-gram model was about three days.
For experiments reported further, we used just one training epoch (again, we decrease the learning rate linearly so that it approaches zero at the end of training). Training a model on twice as much data using one epoch gives comparable or better results than iterating over the same data for three epochs, as is shown in Table 5, and provides additional small speedup.

4.4 Large Scale Parallel Training of Models

As mentioned earlier, we have implemented various models in a distributed framework called DistBelief. Below we report the results of several models trained on the Google News 6B data set, with mini-batch asynchronous gradient descent and the adaptive learning rate procedure called Adagrad [7]. We used 50 to 100 model replicas during the training.
Table 6: Comparison of models trained using the DistBelief distributed framework. Note that training of NNLM with 1000-dimensional vectors would take too long to complete.

Model        Vector Dimensionality   Training words   Semantic [%]   Syntactic [%]   Total [%]   Training time [days x CPU cores]
NNLM          100                     6B               34.2           64.5            50.8        14 x 180
CBOW         1000                     6B               57.3           68.9            63.7         2 x 140
Skip-gram    1000                     6B               66.1           65.1            65.6         2.5 x 125
Table 7: Comparison and combination of models on the Microsoft Sentence Completion Challenge.

Architecture                     Accuracy [%]
4-gram [32]                      39
Average LSA similarity [32]      49
Log-bilinear model [24]          54.8
RNNLMs [19]                      55.4
Skip-gram                        48.0
Skip-gram + RNNLMs               58.9
The number of CPU cores is an estimate, since the data center machines are shared with other production tasks, and the usage can fluctuate quite a bit. Note that due to the overhead of the distributed framework, the CPU usage of the CBOW model and the Skip-gram model are much closer to each other than their single-machine implementations. The results are reported in Table 6.

4.5 Microsoft Research Sentence Completion Challenge

The Microsoft Sentence Completion Challenge has been recently introduced as a task for advancing language modeling and other NLP techniques [32]. This task consists of 1040 sentences, where one word is missing in each sentence and the goal is to select word that is the most coherent with the rest of the sentence, given a list of five reasonable choices. Performance of several techniques has been already reported on this set, including N-gram models, LSA-based model [32], log-bilinear model [24] and a combination of recurrent neural networks that currently holds the state of the art performance of 55.4% accuracy on this benchmark [19].
We have explored the performance of Skip-gram architecture on this task. First, we train the 640-dimensional model on 50M words provided in [32]. Then, we compute score of each sentence in the test set by using the unknown word at the input, and predict all surrounding words in a sentence. The final sentence score is then the sum of these individual predictions. Using the sentence scores, we choose the most likely sentence.
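A sketch of this scoring rule, assuming a hypothetical skipgram_score(center, context) helper that returns the trained model's (log-)probability of a context word given the candidate word; the window size and helper are placeholders, not part of the paper:

```python
def sentence_score(tokens, blank_index, candidate, skipgram_score, window=5):
    """Insert the candidate at the blank and sum its Skip-gram scores against the surrounding words."""
    filled = tokens[:blank_index] + [candidate] + tokens[blank_index + 1:]
    total = 0.0
    for j in range(max(0, blank_index - window), min(len(filled), blank_index + window + 1)):
        if j != blank_index:
            total += skipgram_score(candidate, filled[j])   # predict each surrounding word
    return total

def best_choice(tokens, blank_index, choices, skipgram_score):
    """Pick the candidate with the highest summed prediction score."""
    return max(choices, key=lambda c: sentence_score(tokens, blank_index, c, skipgram_score))
```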
A short summary of some previous results together with the new results is presented in Table 7. While the Skip-gram model itself does not perform on this task better than LSA similarity, the scores from this model are complementary to scores obtained with RNNLMs, and a weighted combination leads to a new state of the art result of 58.9% accuracy (59.2% on the development part of the set and 58.7% on the test part of the set).

5 Examples of the Learned Relationships

Table 8 shows words that follow various relationships. We follow the approach described above: the relationship is defined by subtracting two word vectors, and the result is added to another word. Thus for example, Paris - France + Italy = Rome. As it can be seen, accuracy is quite good, although there is clearly a lot of room for further improvements.
Table 8: Examples of the word pair relationships, using the best word vectors from Table 4 (Skip-gram model trained on 783M words with 300 dimensionality).

Relationship            Example 1               Example 2            Example 3
France - Paris          Italy: Rome             Japan: Tokyo         Florida: Tallahassee
big - bigger            small: larger           cold: colder         quick: quicker
Miami - Florida         Baltimore: Maryland     Dallas: Texas        Kona: Hawaii
Einstein - scientist    Messi: midfielder       Mozart: violinist    Picasso: painter
Sarkozy - France        Berlusconi: Italy       Merkel: Germany      Koizumi: Japan
copper - Cu             zinc: Zn                gold: Au             uranium: plutonium
Berlusconi - Silvio     Sarkozy: Nicolas        Putin: Medvedev      Obama: Barack
Microsoft - Windows     Google: Android         IBM: Linux           Apple: iPhone
Microsoft - Ballmer     Google: Yahoo           IBM: McNealy         Apple: Jobs
Japan - sushi           Germany: bratwurst      France: tapas        USA: pizza
Note that using our accuracy metric that assumes exact match, the results in Table 8 would score only about 60%. We believe that word vectors trained on even larger data sets with larger dimensionality will perform significantly better, and will enable the development of new innovative applications. Another way to improve accuracy is to provide more than one example of the relationship. By using ten examples instead of one to form the relationship vector (we average the individual vectors together), we have observed improvement of accuracy of our best models by about 10% absolutely on the semantic-syntactic test.
It is also possible to apply the vector operations to solve different tasks. For example, we have observed good accuracy for selecting out-of-the-list words, by computing average vector for a list of words, and finding the most distant word vector. This is a popular type of problems in certain human intelligence tests. Clearly, there is still a lot of discoveries to be made using these techniques.
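The out-of-the-list selection mentioned above can be sketched in the same spirit: average the list's vectors and return the word whose vector is furthest (by cosine) from that average. The embedding dictionary is assumed to come from a trained model.

```python
import numpy as np

def odd_one_out(emb, words):
    """Return the word least similar (by cosine) to the average vector of the list."""
    vectors = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in words])
    mean = vectors.mean(axis=0)
    mean /= np.linalg.norm(mean)
    similarities = vectors @ mean
    return words[int(np.argmin(similarities))]

# e.g. odd_one_out(emb, ["breakfast", "lunch", "dinner", "chair"]) should ideally return "chair"
```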

6 Conclusion

In this paper we studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks. We observed that it is possible to train high quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high dimensional word vectors from a much larger data set. Using the DistBelief distributed framework, it should be possible to train the CBOW and Skip-gram models even on corpora with one trillion words, for basically unlimited size of the vocabulary. That is several orders of magnitude larger than the best previously published results for similar models.
An interesting task where the word vectors have recently been shown to significantly outperform the previous state of the art is the SemEval-2012 Task 2 [11]. The publicly available RNN vectors were used together with other techniques to achieve over 50% increase in Spearman's rank correlation over the previous best result [31]. The neural network based word vectors were previously applied to many other NLP tasks, for example sentiment analysis [12] and paraphrase detection [28]. It can be expected that these applications can benefit from the model architectures described in this paper.
Our ongoing work shows that the word vectors can be successfully applied to automatic extension of facts in Knowledge Bases, and also for verification of correctness of existing facts. Results from machine translation experiments also look very promising. In the future, it would be also interesting to compare our techniques to Latent Relational Analysis [30] and others. We believe that our comprehensive test set will help the research community to improve the existing techniques for estimating the word vectors. We also expect that high quality word vectors will become an important building block for future NLP applications.

7 Follow-Up Work

After the initial version of this paper was written, we published single-machine multi-threaded C++ code for computing the word vectors, using both the continuous bag-of-words and skip-gram architectures. The training speed is significantly higher than reported earlier in this paper, i.e. it is in the order of billions of words per hour for typical hyperparameter choices. We also published more than 1.4 million vectors that represent named entities, trained on more than 100 billion words. Some of our follow-up work will be published in an upcoming NIPS 2013 paper [21].

References

[1] Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
[2] Y. Bengio, Y. LeCun. Scaling learning algorithms towards AI. In: Large-Scale Kernel Machines, MIT Press, 2007.
[3] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning, 2007.
[4] R. Collobert and J. Weston. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In International Conference on Machine Learning, ICML, 2008.
[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.
[6] J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M.A. Ranzato, A. Senior, P. Tucker, K. Yang, A. Y. Ng. Large Scale Distributed Deep Networks, NIPS, 2012.
[7] J.C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
[8] J. Elman. Finding Structure in Time. Cognitive Science, 14, 179-211, 1990.
[9] Eric H. Huang, R. Socher, C. D. Manning and Andrew Y. Ng. Improving Word Representations via Global Context and Multiple Word Prototypes. In: Proc. Association for Computational Linguistics, 2012.
[10] G.E. Hinton, J.L. McClelland, D.E. Rumelhart. Distributed representations. In: Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations, MIT Press, 1986.
[11] D.A. Jurgens, S.M. Mohammad, P.D. Turney, K.J. Holyoak. Semeval-2012 task 2: Measuring degrees of relational similarity. In: Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), 2012.
[12] A.L. Maas, R.E. Daly, P.T. Pham, D. Huang, A.Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of ACL, 2011.
[13] T. Mikolov. Language Modeling for Speech Recognition in Czech, Masters thesis, Brno University of Technology, 2007.
[14] T. Mikolov, J. Kopecký, L. Burget, O. Glembek and J. Černocký. Neural network based language models for highly inflective languages, In: Proc. ICASSP 2009.
[15] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur. Recurrent neural network based language model, In: Proceedings of Interspeech, 2010.
[16] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, S. Khudanpur. Extensions of recurrent neural network language model, In: Proceedings of ICASSP 2011.
[17] T. Mikolov, A. Deoras, S. Kombrink, L. Burget, J. Černocký. Empirical Evaluation and Combination of Advanced Language Modeling Techniques, In: Proceedings of Interspeech, 2011.
[18] T. Mikolov, A. Deoras, D. Povey, L. Burget, J. Černocký. Strategies for Training Large Scale Neural Network Language Models, In: Proc. Automatic Speech Recognition and Understanding, 2011.
[19] T. Mikolov. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
[20] T. Mikolov, W.T. Yih, G. Zweig. Linguistic Regularities in Continuous Space Word Representations. NAACL HLT 2013.
[21] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. Accepted to NIPS 2013.
[22] A. Mnih, G. Hinton. Three new graphical models for statistical language modelling. ICML, 2007.
[23] A. Mnih, G. Hinton. A Scalable Hierarchical Distributed Language Model. Advances in Neural Information Processing Systems 21, MIT Press, 2009.
[24] A. Mnih, Y.W. Teh. A fast and simple algorithm for training neural probabilistic language models. ICML, 2012.
[25] F. Morin, Y. Bengio. Hierarchical Probabilistic Neural Network Language Model. AISTATS, 2005.
[26] D. E. Rumelhart, G. E. Hinton, R. J. Williams. Learning internal representations by backpropagating errors. Nature, 323:533-536, 1986.
[27] H. Schwenk. Continuous space language models. Computer Speech and Language, vol. 21, 2007.
[28] R. Socher, E.H. Huang, J. Pennington, A.Y. Ng, and C.D. Manning. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS, 2011.
[29] J. Turian, L. Ratinov, Y. Bengio. Word Representations: A Simple and General Method for Semi-Supervised Learning. In: Proc. Association for Computational Linguistics, 2010.
[30] P. D. Turney. Measuring Semantic Similarity by Latent Relational Analysis. In: Proc. International Joint Conference on Artificial Intelligence, 2005.
[31] A. Zhila, W.T. Yih, C. Meek, G. Zweig, T. Mikolov. Combining Heterogeneous Models for Measuring Relational Similarity. NAACL HLT 2013.
[32] G. Zweig, C.J.C. Burges. The Microsoft Research Sentence Completion Challenge, Microsoft Research Technical Report MSR-TR-2011-129, 2011.

  1. The test set is available at www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt
  2. We thank Geoff Zweig for providing us the test set.
  3. The code is available at https://code.google.com/p/word2vec/