
Chapter 5

Machine Learning Basics

Deep learning is a specific kind of machine learning. To understand deep learning well, one must have a solid understanding of the basic principles of machine learning. This chapter provides a brief course in the most important general principles that are applied throughout the rest of the book. Novice readers or those who want a wider perspective are encouraged to consider machine learning textbooks with a more comprehensive coverage of the fundamentals, such as Murphy (2012) or Bishop (2006). If you are already familiar with machine learning basics, feel free to skip ahead to section 5.11. That section covers some perspectives on traditional machine learning techniques that have strongly influenced the development of deep learning algorithms.
We begin with a definition of what a learning algorithm is and present an example: the linear regression algorithm. We then proceed to describe how the challenge of fitting the training data differs from the challenge of finding patterns that generalize to new data. Most machine learning algorithms have settings called hyperparameters, which must be determined outside the learning algorithm itself; we discuss how to set these using additional data. Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions; we therefore present the two central approaches to statistics: frequentist estimators and Bayesian inference. Most machine learning algorithms can be divided into the categories of supervised learning and unsupervised learning; we describe these categories and give some examples of simple learning algorithms from each category. Most deep learning algorithms are based on an optimization algorithm called stochastic gradient descent. We describe how to combine various algorithm components, such as an optimization algorithm, a cost function, a model, and a dataset, to build a machine learning algorithm. Finally, in section 5.11, we describe some of the factors that have limited the ability of traditional machine learning to generalize. These challenges have motivated the development of deep learning algorithms that overcome these obstacles.

5.1 Learning Algorithms

A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides a succinct definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." One can imagine a wide variety of experiences E, tasks T, and performance measures P, and we do not attempt in this book to formally define what may be used for each of these entities. Instead, in the following sections, we provide intuitive descriptions and examples of the different kinds of tasks, performance measures, and experiences that can be used to construct machine learning algorithms.

5.1.1 The Task, T

Machine learning enables us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings. From a scientific and philosophical point of view, machine learning is interesting because developing our understanding of it entails developing our understanding of the principles that underlie intelligence.
In this relatively formal definition of the word "task," the process of learning itself is not the task. Learning is our means of attaining the ability to perform the task. For example, if we want a robot to be able to walk, then walking is the task. We could program the robot to learn to walk, or we could attempt to directly write a program that specifies how to walk manually.
Machine learning tasks are usually described in terms of how the machine learning system should process an example. An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector x ∈ ℝ^n where each entry x_i of the vector is another feature. For example, the features of an image are usually the values of the pixels in the image.
Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:
  • Classification: In this type of task, the computer program is asked to specify which of k categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function f : ℝ^n → {1, …, k}. When y = f(x), the model assigns an input described by vector x to a category identified by numeric code y. There are other variants of the classification task, for example, where f outputs a probability distribution over classes. An example of a classification task is object recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image. For example, the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command (Goodfellow et al., 2010). Modern object recognition is best accomplished with deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object recognition is the same basic technology that enables computers to recognize faces (Taigman et al., 2014), which can be used to automatically tag people in photo collections and for computers to interact more naturally with their users.
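The classification mapping just described, from a feature vector in ℝ^n to one of k numeric category codes, can be sketched as an argmax over per-class scores. The score matrix W and the input below are made-up illustrative values, not a trained model:

```python
import numpy as np

n, k = 4, 3                      # n input features, k categories
rng = np.random.default_rng(0)
W = rng.standard_normal((k, n))  # one (hypothetical) score vector per category

def f(x):
    """Assign input x to the category with the highest linear score."""
    scores = W @ x                    # shape (k,)
    return int(np.argmax(scores)) + 1 # numeric code y in {1, ..., k}

x = rng.standard_normal(n)
y = f(x)
```

A variant that outputs a probability distribution over classes would normalize the scores (e.g. with a softmax) instead of taking the argmax.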
  • Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. To solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying x with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all the relevant variables, then solve the classification task by marginalizing out the missing variables. With n input variables, we can now obtain all 2^n different classification functions needed for each possible set of missing inputs, but the computer program needs to learn only a single function describing the joint probability distribution. See Goodfellow et al. (2013b) for an example of a deep probabilistic model applied to such a task in this way. Many of the other tasks described in this section can also be generalized to work with missing inputs; classification with missing inputs is just one example of what machine learning can do.
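The marginalization idea can be made concrete on a toy scale. Here we posit a made-up joint distribution p(y, x1, x2) over a binary class and two binary features (the numbers are arbitrary, chosen only so the table sums to 1); one learned table yields a classifier both with and without the second feature observed:

```python
import numpy as np

p = np.array([  # p[y, x1, x2], an assumed already-learned joint distribution
    [[0.20, 0.10],
     [0.05, 0.05]],
    [[0.05, 0.05],
     [0.10, 0.40]],
])

def classify(x1, x2=None):
    """Return argmax_y p(y | observed features), marginalizing x2 if missing."""
    if x2 is None:
        scores = p[:, x1, :].sum(axis=-1)  # marginalize out the missing x2
    else:
        scores = p[:, x1, x2]
    return int(np.argmax(scores))

y_full = classify(x1=1, x2=1)   # both features observed
y_missing = classify(x1=1)      # x2 missing: same joint, different function
```

With n features this single table implicitly defines all 2^n classification functions, one per pattern of missing inputs.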
  • Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function f : ℝ^n → ℝ. This type of task is similar to classification, except that the format of output is different. An example of a regression task is the prediction of the expected claim amount that an insured person will make (used to set insurance premiums), or the prediction of future prices of securities. These kinds of predictions are also used for algorithmic trading.
  • Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe the information into discrete textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format). Google Street View uses deep learning to process address numbers in this way (Goodfellow et al., 2014d). Another example is speech recognition, where the computer program is provided an audio waveform and emits a sequence of characters or word ID codes describing the words that were spoken in the audio recording. Deep learning is a crucial component of modern speech recognition systems used at major companies, including Microsoft, IBM and Google (Hinton et al., 2012b).
  • Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language. This is commonly applied to natural languages, such as translating from English to French. Deep learning has recently begun to have an important impact on this kind of task (Sutskever et al., 2014; Bahdanau et al., 2015).
  • Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category and subsumes the transcription and translation tasks described above, as well as many other tasks. One example is parsing - mapping a natural language sentence into a tree that describes its grammatical structure by tagging nodes of the trees as being verbs, nouns, adverbs, and so on. See Collobert (2011) for an example of deep learning applied to a parsing task. Another example is pixel-wise segmentation of images, where the computer program assigns every pixel in an image to a specific category.
For example, deep learning can be used to annotate the locations of roads in aerial photographs (Mnih and Hinton, 2010). The output form need not mirror the structure of the input as closely as in these annotation-style tasks. For example, in image captioning, the computer program observes an image and outputs a natural language sentence describing the image (Kiros et al., 2014a,b; Mao et al., 2015; Vinyals et al., 2015b; Donahue et al., 2014; Karpathy and Li, 2015; Fang et al., 2015; Xu et al., 2015). These tasks are called structured output tasks because the program must output several values that are all tightly interrelated. For example, the words produced by an image captioning program must form a valid sentence.
  • Anomaly detection: In this type of task, the computer program sifts through a set of events or objects and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection. By modeling your purchasing habits, a credit card company can detect misuse of your cards. If a thief steals your credit card or credit card information, the thief's purchases will often come from a different probability distribution over purchase types than your own. The credit card company can prevent fraud by placing a hold on an account as soon as that card has been used for an uncharacteristic purchase. See Chandola et al. (2009) for a survey of anomaly detection methods.
  • Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data. Synthesis and sampling via machine learning can be useful for media applications when generating large volumes of content by hand would be expensive, boring, or require too much time. For example, video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel (Luo et al., 2013). In some cases, we want the sampling or synthesis procedure to generate a specific kind of output given the input. For example, in a speech synthesis task, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence. This is a kind of structured output task, but with the added qualification that there is no single correct output for each input, and we explicitly desire a large amount of variation in the output, in order for the output to seem more natural and realistic.
  • Imputation of missing values: In this type of task, the machine learning algorithm is given a new example x ∈ ℝ^n, but with some entries x_i of x missing. The algorithm must provide a prediction of the values of the missing entries.
  • Denoising: In this type of task, the machine learning algorithm is given as input a corrupted example x̃ ∈ ℝ^n obtained by an unknown corruption process from a clean example x ∈ ℝ^n. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).
  • Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function p_model : ℝ^n → ℝ, where p_model(x) can be interpreted as a probability density function (if x is continuous) or a probability mass function (if x is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures P), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur. Most of the tasks described above require the learning algorithm to at least implicitly capture the structure of the probability distribution. Density estimation enables us to explicitly capture that distribution. In principle, we can then perform computations on that distribution to solve the other tasks as well. For example, if we have performed density estimation to obtain a probability distribution p(x), we can use that distribution to solve the missing value imputation task. If a value x_i is missing, and all the other values, denoted x_{−i}, are given, then we know the distribution over it is given by p(x_i | x_{−i}). In practice, density estimation does not always enable us to solve all these related tasks, because in many cases the required operations on p(x) are computationally intractable.
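The step from an estimated density to imputation can be sketched with a tiny discrete joint (the table below is made up, standing in for a distribution a model has already estimated): the distribution over a missing x1 given an observed x2 is p(x1 | x2) = p(x1, x2) / Σ_{x1′} p(x1′, x2).

```python
import numpy as np

# Assumed already-estimated joint p(x1, x2) over two binary variables.
p = np.array([[0.30, 0.10],   # rows index x1, columns index x2
              [0.20, 0.40]])

def conditional_x1(x2):
    """p(x1 | x2), obtained by normalizing one column of the joint."""
    col = p[:, x2]
    return col / col.sum()

dist = conditional_x1(x2=1)      # distribution over the missing value x1
imputed = int(np.argmax(dist))   # one choice: impute the most probable value
```

For a binary variable this normalization is trivial; the computational difficulty noted above arises when x has many dimensions and the sums or integrals over the missing variables become intractable.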
Of course, many other tasks and types of tasks are possible. The types of tasks we list here are intended only to provide examples of what machine learning can do, not to define a rigid taxonomy of tasks.

5.1.2 The Performance Measure, P

To evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.
For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model. Accuracy is just the proportion of examples for which the model produces the correct output. We can also obtain equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output. We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not. For tasks such as density estimation, it does not make sense to measure accuracy, error rate, or any other kind of 0-1 loss. Instead, we must use a different performance metric that gives the model a continuous-valued score for each example. The most common approach is to report the average log-probability the model assigns to some examples.
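These measures are one-liners in practice. The labels, predictions, and model probabilities below are invented for illustration:

```python
import numpy as np

y_true = np.array([0, 1, 2, 1, 0])   # made-up ground-truth labels
y_pred = np.array([0, 1, 1, 1, 0])   # made-up model outputs

accuracy = np.mean(y_pred == y_true)     # proportion correct
error_rate = np.mean(y_pred != y_true)   # expected 0-1 loss

# For a density model, score each example by log p_model(x);
# these probabilities are likewise invented.
p_model = np.array([0.9, 0.8, 0.2, 0.7, 0.95])
avg_log_prob = np.mean(np.log(p_model))
```

Accuracy and error rate always sum to 1, which is why the two report equivalent information.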
Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.
The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.
In some cases, this is because it is difficult to decide what should be measured. For example, when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct? When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes? These kinds of design choices depend on the application.
In other cases, we know what quantity we would ideally like to measure, but measuring it is impractical. For example, this arises frequently in the context of density estimation. Many of the best probabilistic models represent probability distributions only implicitly. Computing the actual probability value assigned to a specific point in space in many such models is intractable. In these cases, one must design an alternative criterion that still corresponds to the design objectives, or design a good approximation to the desired criterion.

5.1.3 The Experience, E

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.
Most of the learning algorithms in this book can be understood as being allowed to experience an entire dataset. A dataset is a collection of many examples, as defined in section 5.1.1. Sometimes we call examples data points.
One of the oldest datasets studied by statisticians and machine learning researchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of different parts of 150 iris plants. Each individual plant corresponds to one example. The features within each example are the measurements of each part of the plant: the sepal length, sepal width, petal length and petal width. The dataset also records which species each plant belonged to. Three different species are represented in the dataset.
Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly, as in density estimation, or implicitly, for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.
Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target. For example, the Iris dataset is annotated with the species of each iris plant. A supervised learning algorithm can study the Iris dataset and learn to classify iris plants into three different species based on their measurements.
Roughly speaking, unsupervised learning involves observing several examples of a random vector x and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution; while supervised learning involves observing several examples of a random vector x and an associated value or vector y, then learning to predict y from x, usually by estimating p(y | x). The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.
Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that for a vector x ∈ ℝ^n, the joint distribution can be decomposed as

p(x) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i−1}).
This decomposition means that we can solve the ostensibly unsupervised problem of modeling p(x) by splitting it into n supervised learning problems. Alternatively, we can solve the supervised learning problem of learning p(y | x) by using traditional unsupervised learning technologies to learn the joint distribution p(x, y), then inferring

p(y | x) = p(x, y) / Σ_{y′} p(x, y′).
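Both identities are easy to check numerically on a made-up joint over two binary variables:

```python
import numpy as np

# Arbitrary joint p(x, y) over binary x (rows) and binary y (columns).
pxy = np.array([[0.25, 0.15],
                [0.10, 0.50]])

# Chain rule: p(x, y) = p(x) p(y | x).
px = pxy.sum(axis=1)             # marginal p(x)
py_given_x = pxy / px[:, None]   # conditional p(y | x)
recomposed = px[:, None] * py_given_x
chain_rule_ok = np.allclose(recomposed, pxy)

# Supervised quantity from unsupervised modeling:
# p(y | x) = p(x, y) / sum_{y'} p(x, y'), so each row must normalize to 1.
rows_normalized = np.allclose(py_given_x.sum(axis=1), 1.0)
```

The same bookkeeping scales to n variables, where each conditional in the product becomes its own supervised prediction problem.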
Though unsupervised learning and supervised learning are not completely formal or distinct concepts, they do help roughly categorize some of the things we do with machine learning algorithms. Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.
Other variants of the learning paradigm are possible. For example, in semisupervised learning, some examples include a supervision target but others do not. In multi-instance learning, an entire collection of examples is labeled as containing or not containing an example of a class, but the individual members of the collection are not labeled. For a recent example of multi-instance learning with deep models, see Kotzias et al. (2015).
Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences. Such algorithms are beyond the scope of this book. Please see Sutton and Barto (1998) or Bertsekas and Tsitsiklis (1996) for information about reinforcement learning, and Mnih et al. (2013) for the deep learning approach to reinforcement learning.
Most machine learning algorithms simply experience a dataset. A dataset can be described in many ways. In all cases, a dataset is a collection of examples, which are in turn collections of features.
One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix X ∈ ℝ^{150×4}, where X_{i,1} is the sepal length of plant i, X_{i,2} is the sepal width of plant i, etc. We describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets.
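A design matrix in the Iris format is just a 150 × 4 array, one row per example and one column per feature. The values below are random stand-ins rather than Fisher's actual measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(low=0.1, high=8.0, size=(150, 4))  # X in R^{150x4}, fake data

# Row i is example i; column j is feature j.
sepal_length_of_plant_3 = X[3, 0]   # plays the role of X_{i,1}
sepal_width_of_plant_3 = X[3, 1]    # plays the role of X_{i,2}
```

Indexing a row gives one example's feature vector; indexing a column gives one feature's values across the whole dataset.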
Of course, to describe a dataset as a design matrix, it must be possible to describe each example as a vector, and each of these vectors must be the same size. This is not always possible. For example, if you have a collection of photographs with different widths and heights, then different photographs will contain different numbers of pixels, so not all the photographs may be described with the same length of vector. In section 9.7 and chapter 10, we describe how to handle different types of such heterogeneous data. In cases like these, rather than describing the dataset as a matrix with m rows, we describe it as a set containing m elements: {x^{(1)}, x^{(2)}, …, x^{(m)}}. This notation does not imply that any two example vectors x^{(i)} and x^{(j)} have the same size.
In the case of supervised learning, the example contains a label or target as well as a collection of features. For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos. We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, and so forth. Often when working with a dataset containing a design matrix of feature observations X, we also provide a vector of labels y, with y_i providing the label for example i.
Of course, sometimes the label may be more than just a single number. For example, if we want to train a speech recognition system to transcribe entire sentences, then the label for each example sentence is a sequence of words.
Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. The structures described here cover most cases, but it is always possible to design new ones for new applications.

5.1.4 Example: Linear Regression

Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program's performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression. We will return to this example repeatedly as we introduce more machine learning concepts that help to understand the algorithm's behavior.
As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector $\boldsymbol{x} \in \mathbb{R}^n$ as input and predict the value of a scalar $y \in \mathbb{R}$ as its output. The output of linear regression is a linear function of the input. Let $\hat{y}$ be the value that our model predicts $y$ should take on. We define the output to be
$$\hat{y} = \boldsymbol{w}^\top \boldsymbol{x},$$
where $\boldsymbol{w} \in \mathbb{R}^n$ is a vector of parameters.
Parameters are values that control the behavior of the system. In this case, $w_i$ is the coefficient that we multiply by feature $x_i$ before summing up the contributions from all the features. We can think of $\boldsymbol{w}$ as a set of weights that determine how each feature affects the prediction. If a feature $x_i$ receives a positive weight $w_i$, then increasing the value of that feature increases the value of our prediction $\hat{y}$. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature's weight is large in magnitude, then it has a large effect on the prediction. If a feature's weight is zero, it has no effect on the prediction.
We thus have a definition of our task $T$: to predict $y$ from $\boldsymbol{x}$ by outputting $\hat{y} = \boldsymbol{w}^\top \boldsymbol{x}$. Next we need a definition of our performance measure, $P$.
Suppose that we have a design matrix of $m$ example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of $y$ for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as $\boldsymbol{X}^{(\text{test})}$ and the vector of regression targets as $\boldsymbol{y}^{(\text{test})}$.
One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If $\hat{\boldsymbol{y}}^{(\text{test})}$ gives the predictions of the model on the test set, then the mean squared error is given by
$$\text{MSE}_{\text{test}} = \frac{1}{m} \sum_i \left( \hat{\boldsymbol{y}}^{(\text{test})} - \boldsymbol{y}^{(\text{test})} \right)_i^2.$$
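The mean squared error is simple to compute directly. A minimal sketch in NumPy (the array values here are illustrative, not from the text):

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error: (1/m) * sum over i of (y_pred - y_true)_i^2."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Illustrative test-set predictions and targets.
y_hat_test = np.array([1.0, 2.0, 3.0])
y_test = np.array([1.0, 2.5, 2.0])

print(mse(y_hat_test, y_test))  # ((0)^2 + (0.5)^2 + (1)^2) / 3
```

When the predictions equal the targets exactly, `mse` returns 0, matching the intuition described next.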
Intuitively, one can see that this error measure decreases to 0 when $\hat{\boldsymbol{y}}^{(\text{test})} = \boldsymbol{y}^{(\text{test})}$. We can also see that
$$\text{MSE}_{\text{test}} = \frac{1}{m} \left\| \hat{\boldsymbol{y}}^{(\text{test})} - \boldsymbol{y}^{(\text{test})} \right\|_2^2,$$
so the error increases whenever the Euclidean distance between the predictions and the targets increases.
因此,只要预测值与目标之间的欧氏距离增大,误差就会增大。
To make a machine learning algorithm, we need to design an algorithm that will improve the weights $\boldsymbol{w}$ in a way that reduces $\text{MSE}_{\text{test}}$ when the algorithm is allowed to gain experience by observing a training set $(\boldsymbol{X}^{(\text{train})}, \boldsymbol{y}^{(\text{train})})$. One intuitive way of doing this (which we justify later, in section 5.5.1) is just to minimize the mean squared error on the training set, $\text{MSE}_{\text{train}}$.
To minimize $\text{MSE}_{\text{train}}$, we can simply solve for where its gradient is 0:
$$\nabla_{\boldsymbol{w}} \text{MSE}_{\text{train}} = 0,$$
which yields
$$\boldsymbol{w} = \left( \boldsymbol{X}^{(\text{train})\top} \boldsymbol{X}^{(\text{train})} \right)^{-1} \boldsymbol{X}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})}. \tag{5.12}$$
The system of equations whose solution is given by equation 5.12 is known as the normal equations. Evaluating equation 5.12 constitutes a simple learning algorithm. For an example of the linear regression learning algorithm in action, see figure 5.1.
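Evaluating equation 5.12 takes only a few lines. The sketch below uses synthetic, noiseless data (names and values are illustrative), and solves the normal equations with a linear solve rather than an explicit matrix inverse, which is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: m examples, n features, generated from known weights.
m, n = 100, 3
X_train = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0, 0.5])
y_train = X_train @ w_true  # noiseless, so the fit should recover w_true

# Normal equations: w = (X^T X)^{-1} X^T y, computed via a linear solve.
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

print(w)  # close to w_true
```

With noisy targets the recovered weights would only approximate the generating weights; here the data is noiseless purely to make the example easy to check.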
It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter - an intercept term $b$. In this model
$$\hat{y} = \boldsymbol{w}^\top \boldsymbol{x} + b,$$
so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function. This extension to
因此,从参数到预测的映射仍然是线性函数,但从特征到预测的映射现在是仿射函数。这种扩展

Figure 5.1: A linear regression problem, with a training set consisting of ten data points, each containing one feature. Because there is only one feature, the weight vector $\boldsymbol{w}$ contains only a single parameter to learn, $w_1$. (Left) Observe that linear regression learns to set $w_1$ such that the line $y = w_1 x$ comes as close as possible to passing through all the training points. (Right) The plotted point indicates the value of $w_1$ found by the normal equations, which we can see minimizes the mean squared error on the training set.

affine functions means that the plot of the model's predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter $b$, one can continue to use the model with only weights but augment $\boldsymbol{x}$ with an extra entry that is always set to 1. The weight corresponding to the extra 1 entry plays the role of the bias parameter. We frequently use the term "linear" when referring to affine functions throughout this book.
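The augmentation trick just described, appending a constant feature that is always 1 so that its weight plays the role of the bias, can be sketched as follows (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Data generated by an affine function y = w^T x + b.
m, n = 50, 2
X = rng.normal(size=(m, n))
w_true, b_true = np.array([1.5, -2.0]), 3.0
y = X @ w_true + b_true

# Augment X with a column of ones; the last learned weight acts as the bias.
X_aug = np.hstack([X, np.ones((m, 1))])
w_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

print(w_aug[:-1])  # recovers w_true
print(w_aug[-1])   # recovers b_true
```

This lets the same purely linear machinery (the normal equations on the augmented design matrix) fit an affine model.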
The intercept term $b$ is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being $b$ in the absence of any input. This term is different from the idea of a statistical bias, in which a statistical estimation algorithm's expected estimate of a quantity is not equal to the true quantity.
Linear regression is of course an extremely simple and limited learning algorithm, but it provides an example of how a learning algorithm can work. In subsequent sections we describe some of the basic principles underlying learning algorithm design and demonstrate how these principles can be used to build more complicated learning algorithms.

5.2 Capacity, Overfitting and Underfitting

The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs - not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.
Typically, when training a machine learning model, we have access to a training set; we can compute some error measure on the training set, called the training error; and we reduce this training error. So far, what we have described is simply an optimization problem. What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well. The generalization error is defined as the expected value of the error on a new input. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice.
We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.
In our linear regression example, we trained the model by minimizing the training error,
$$\frac{1}{m^{(\text{train})}} \left\| \boldsymbol{X}^{(\text{train})} \boldsymbol{w} - \boldsymbol{y}^{(\text{train})} \right\|_2^2,$$
but we actually care about the test error, $\frac{1}{m^{(\text{test})}} \left\| \boldsymbol{X}^{(\text{test})} \boldsymbol{w} - \boldsymbol{y}^{(\text{test})} \right\|_2^2$.
How can we affect performance on the test set when we can observe only the training set? The field of statistical learning theory provides some answers. If the training and the test set are collected arbitrarily, there is indeed little we can do. If we are allowed to make some assumptions about how the training and test set are collected, then we can make some progress.
The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other, and that the training set and test set are identically distributed, drawn from the same probability distribution as each other. This assumption enables us to describe the data-generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data-generating distribution, denoted $p_{\text{data}}$. This probabilistic framework and the i.i.d. assumptions enable us to mathematically study the relationship between training error and test error.
One immediate connection we can observe between training error and test error is that the expected training error of a randomly selected model is equal to the expected test error of that model. Suppose we have a probability distribution $p(\boldsymbol{x}, y)$ and we sample from it repeatedly to generate the training set and the test set. For some fixed value $\boldsymbol{w}$, the expected training set error is exactly the same as the expected test set error, because both expectations are formed using the same dataset sampling process. The only difference between the two conditions is the name we assign to the dataset we sample.
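This equality of expectations for a fixed, untrained $\boldsymbol{w}$ can be checked with a quick Monte Carlo simulation. The distribution and weights below are our own illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_dataset(m):
    """Draw m i.i.d. examples from one fixed data-generating distribution."""
    x = rng.normal(size=(m, 2))
    y = x @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=m)
    return x, y

w_fixed = np.array([0.8, -1.2])  # chosen before seeing any data

train_errs, test_errs = [], []
for _ in range(2000):
    X_tr, y_tr = sample_dataset(20)  # call one sample the "training" set
    X_te, y_te = sample_dataset(20)  # and the other the "test" set
    train_errs.append(np.mean((X_tr @ w_fixed - y_tr) ** 2))
    test_errs.append(np.mean((X_te @ w_fixed - y_te) ** 2))

# For a fixed w, the two averages converge to the same expected error.
print(np.mean(train_errs), np.mean(test_errs))
```

The gap appears only once we choose $\boldsymbol{w}$ by fitting the training sample, which is exactly the process described next.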
Of course, when we use a machine learning algorithm, we do not fix the parameters ahead of time, then sample both datasets. We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, the expected test error is greater than or equal to the expected value of training error. The factors determining how well a machine learning algorithm will perform are its ability to
  1. Make the training error small.
  2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.
We can control whether a model is more likely to overfit or underfit by altering its capacity. Informally, a model's capacity is its ability to fit a wide variety of functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.
One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution. For example, the linear regression algorithm has the set of all linear functions of its input as its hypothesis space. We can generalize linear regression to include polynomials, rather than just linear functions, in its hypothesis space. Doing so increases the model's capacity.
A polynomial of degree 1 gives us the linear regression model with which we are already familiar, with the prediction
$$\hat{y} = b + w x.$$
By introducing $x^2$ as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of $x$:
$$\hat{y} = b + w_1 x + w_2 x^2.$$
Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form. We can continue to add more powers of $x$ as additional features, for example, to obtain a polynomial of degree 9:
$$\hat{y} = b + \sum_{i=1}^{9} w_i x^i.$$
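Because these polynomial models are still linear in the parameters, the same closed-form solver applies once the extra features are built. A sketch of the feature construction (the helper name is ours):

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D input array to columns [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x ** p for p in range(degree + 1)], axis=1)

x = np.array([0.0, 1.0, 2.0])
Phi = poly_features(x, degree=2)
print(Phi)
# Each row is [1, x, x^2]. Fitting weights on Phi gives a model that is
# quadratic in x but linear in the parameters, so the normal equations
# still solve it in closed form.
```

The constant-1 column is the same bias trick used earlier for affine models.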
Machine learning algorithms will generally perform best when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task, they may overfit.
Figure 5.2 shows this principle in action. We compare a linear, quadratic and degree-9 predictor attempting to fit a problem where the true underlying
Figure 5.2: We fit three models to this example training set. The training data was generated synthetically, by randomly sampling $x$ values and choosing $y$ deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting: it cannot capture the curvature that is present in the data. (Center) A quadratic function fit to the data generalizes well to unseen points. It does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve the underdetermined normal equations. The solution passes through all the training points exactly, but we have not been lucky enough for it to extract the correct structure. It now has a deep valley between two training points that does not appear in the true underlying function. It also increases sharply on the left side of the data, while the true function decreases in this area.
function is quadratic. The linear function is unable to capture the curvature in the true underlying problem, so it underfits. The degree-9 predictor is capable of representing the correct function, but it is also capable of representing infinitely many other functions that pass exactly through the training points, because we have more parameters than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions exist. In this example, the quadratic model is perfectly matched to the true structure of the task, so it generalizes well to new data.
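This comparison can be reproduced in miniature: fit degree-1, degree-2, and degree-9 polynomials via the Moore-Penrose pseudoinverse (as in figure 5.2) to a handful of points from a noisy quadratic, then compare train and test error. The particular quadratic, noise level, and point counts below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def features(x, degree):
    """Polynomial feature map: columns [1, x, ..., x^degree]."""
    return np.stack([x ** p for p in range(degree + 1)], axis=1)

def true_fn(x):
    return 1.0 - 2.0 * x + 0.5 * x ** 2  # the true underlying quadratic

x_train = rng.uniform(-3, 3, size=10)
y_train = true_fn(x_train) + rng.normal(scale=0.1, size=10)
x_test = rng.uniform(-3, 3, size=100)
y_test = true_fn(x_test) + rng.normal(scale=0.1, size=100)

results = {}
for degree in (1, 2, 9):
    Phi = features(x_train, degree)
    # Pseudoinverse fit, as used for the degree-9 case in figure 5.2.
    w = np.linalg.pinv(Phi) @ y_train
    train_mse = np.mean((Phi @ w - y_train) ** 2)
    test_mse = np.mean((features(x_test, degree) @ w - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-9 fit (nearly) interpolates the ten training points, so its training error is essentially zero, while the quadratic model achieves the best test error: exactly the underfitting/overfitting pattern described above.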
So far we have described only one way of changing a model's capacity: by changing the number of input features it has, and simultaneously adding new parameters associated with those features. There are in fact many ways to change a model's capacity. Capacity is not determined only by the choice of model. The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the representational capacity of the model. In many cases, finding the best function within this family is a difficult optimization problem. In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error. These additional limitations, such as the imperfection of the optimization algorithm, mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.
Our modern ideas about improving the generalization of machine learning models are refinements of thought dating back to philosophers at least as early as Ptolemy. Many early scholars invoke a principle of parsimony that is now most widely known as Occam's razor (c. 1287-1347). This principle states that among competing hypotheses that explain known observations equally well, we should choose the "simplest" one. This idea was formalized and made more precise in the twentieth century by the founders of statistical learning theory (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).
Statistical learning theory provides various means of quantifying model capacity. Among these, the most well known is the Vapnik-Chervonenkis dimension, or VC dimension. The VC dimension measures the capacity of a binary classifier. The VC dimension is defined as being the largest possible value of $m$ for which there exists a training set of $m$ different $\boldsymbol{x}$ points that the classifier can label arbitrarily.
Quantifying the capacity of the model enables statistical learning theory to make quantitative predictions. The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995). These bounds provide intellectual justification that machine learning algorithms can work, but they are rarely used in practice when working with deep learning algorithms. This is in part because the bounds are often quite loose and in part because it can be quite difficult to determine the capacity of deep learning algorithms. The problem of determining the capacity of a deep learning model is especially difficult because the effective capacity is limited by the capabilities of the optimization algorithm, and we have little theoretical understanding of the general nonconvex optimization problems involved in deep learning.
We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test error), we must still choose a sufficiently complex hypothesis to achieve low training error. Typically, training error decreases until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum value). Typically,
Figure 5.3: Typical relationship between capacity and error. Training and test error behave differently. At the left end of the graph, training error and generalization error are both high. This is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.
generalization error has a U-shaped curve as a function of model capacity. This is illustrated in figure 5.3.
To reach the most extreme case of arbitrarily high capacity, we introduce the concept of nonparametric models. So far, we have seen only parametric models, such as linear regression. Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed. Nonparametric models have no such limitation.
Sometimes, nonparametric models are just theoretical abstractions (such as an algorithm that searches over all possible probability distributions) that cannot be implemented in practice. However, we can also design practical nonparametric models by making their complexity a function of the training set size. One example of such an algorithm is nearest neighbor regression. Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the $\boldsymbol{X}$ and $\boldsymbol{y}$ from the training set. When asked to classify a test point $\boldsymbol{x}$, the model looks up the nearest entry in the training set and returns the associated regression target. In other words, $\hat{y} = y_i$ where $i = \arg\min \left\| \boldsymbol{X}_{i,:} - \boldsymbol{x} \right\|_2^2$. The algorithm can also be generalized to distance metrics other than the $L^2$ norm, such as learned distance metrics (Goldberger et al., 2005). If the algorithm is allowed to break ties by averaging the $y_i$ values for all $\boldsymbol{X}_{i,:}$ that are tied for nearest, then this algorithm is able to achieve the minimum possible training error (which