
Chapter 2


Regression


Supervised learning can be divided into regression and classification problems. Whereas the outputs for classification are discrete class labels, regression is concerned with the prediction of continuous quantities. For example, in a financial application, one may attempt to predict the price of a commodity as a function of interest rates, currency exchange rates, availability and demand. In this chapter we describe Gaussian process methods for regression problems; classification problems are discussed in chapter 3.

There are several ways to interpret Gaussian process (GP) regression models. One can think of a Gaussian process as defining a distribution over functions, and inference taking place directly in the space of functions, the function-space view. Although this view is appealing it may initially be difficult to grasp, so we start our exposition in section 2.1 with the equivalent weight-space view which may be more familiar and accessible to many, and continue in section 2.2 with the function-space view. Gaussian processes often have characteristics that can be changed by setting certain parameters and in section 2.3 we discuss how the properties change as these parameters are varied. The predictions from a GP model take the form of a full predictive distribution; in section 2.4 we discuss how to combine a loss function with the predictive distributions using decision theory to make point predictions in an optimal way. A practical comparative example involving the learning of the inverse dynamics of a robot arm is presented in section 2.5. We give some theoretical analysis of Gaussian process regression in section 2.6, and discuss how to incorporate explicit basis functions into the models in section 2.7. As much of the material in this chapter can be considered fairly standard, we postpone most references to the historical overview in section 2.8.

2.1 Weight-space View


The simple linear regression model where the output is a linear combination of the inputs has been studied and used extensively. Its main virtues are simplicity of implementation and interpretability. Its main drawback is that it only allows a limited flexibility; if the relationship between input and output cannot reasonably be approximated by a linear function, the model will give poor predictions.

two equivalent views

In this section we first discuss the Bayesian treatment of the linear model. We then make a simple enhancement to this class of models by projecting the inputs into a high-dimensional feature space and applying the linear model there. We show that in some feature spaces one can apply the "kernel trick" to carry out computations implicitly in the high dimensional space; this last step leads to computational savings when the dimensionality of the feature space is large compared to the number of data points.



training set



We have a training set $\mathcal{D}$ of $n$ observations, $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \dots, n\}$, where $\mathbf{x}$ denotes an input vector (covariates) of dimension $D$ and $y$ denotes a scalar output or target (dependent variable); the column vector inputs for all $n$ cases are aggregated in the $D \times n$ design matrix$^1$ $X$, and the targets are collected in the vector $\mathbf{y}$, so we can write $\mathcal{D} = (X, \mathbf{y})$. In the regression setting the targets are real values. We are interested in making inferences about the relationship between inputs and targets, i.e. the conditional distribution of the targets given the inputs (but we are not interested in modelling the input distribution itself).



design matrix



2.1.1 The Standard Linear Model


We will review the Bayesian analysis of the standard linear regression model with Gaussian noise

(2.1) $\quad f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w}, \qquad y = f(\mathbf{x}) + \varepsilon,$

where $\mathbf{x}$ is the input vector, $\mathbf{w}$ is a vector of weights (parameters) of the linear model, $f$ is the function value and $y$ is the observed target value. Often a bias weight or offset is included, but as this can be implemented by augmenting the input vector $\mathbf{x}$ with an additional element whose value is always one, we do not explicitly include it in our notation. We have assumed that the observed values $y$ differ from the function values $f(\mathbf{x})$ by additive noise, and we will further assume that this noise follows an independent, identically distributed Gaussian distribution with zero mean and variance $\sigma_n^2$

(2.2) $\quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2).$



bias, offset

likelihood



This noise assumption together with the model directly gives rise to the likelihood, the probability density of the observations given the parameters, which is




1 In statistics texts the design matrix is usually taken to be the transpose of our definition, but our choice is deliberate and has the advantage that a data point is a standard (column) vector.







factored over cases in the training set (because of the independence assumption) to give

(2.3) $\quad p(\mathbf{y}\,|\,X,\mathbf{w}) = \prod_{i=1}^{n} p(y_i\,|\,\mathbf{x}_i,\mathbf{w}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\Big(-\frac{(y_i-\mathbf{x}_i^\top\mathbf{w})^2}{2\sigma_n^2}\Big) = \frac{1}{(2\pi\sigma_n^2)^{n/2}}\exp\Big(-\frac{1}{2\sigma_n^2}|\mathbf{y}-X^\top\mathbf{w}|^2\Big) = \mathcal{N}(X^\top\mathbf{w},\,\sigma_n^2 I),$

where $|\mathbf{z}|$ denotes the Euclidean length of vector $\mathbf{z}$. In the Bayesian formalism we need to specify a prior over the parameters, expressing our beliefs about the parameters before we look at the observations. We put a zero mean Gaussian prior with covariance matrix $\Sigma_p$ on the weights

(2.4) $\quad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p).$



prior



The rôle and properties of this prior will be discussed in section 2.2; for now we will continue the derivation with the prior as specified.

Inference in the Bayesian linear model is based on the posterior distribution over the weights, computed by Bayes' rule (see eq. (A.3))$^2$

(2.5) $\quad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}, \qquad p(\mathbf{w}\,|\,\mathbf{y},X) = \frac{p(\mathbf{y}\,|\,X,\mathbf{w})\,p(\mathbf{w})}{p(\mathbf{y}\,|\,X)},$



posterior



where the normalizing constant, also known as the marginal likelihood (see page 19), is independent of the weights and given by

(2.6) $\quad p(\mathbf{y}\,|\,X) = \int p(\mathbf{y}\,|\,X,\mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}.$



marginal likelihood



The posterior in eq. (2.5) combines the likelihood and the prior, and captures everything we know about the parameters. Writing only the terms from the likelihood and prior which depend on the weights, and "completing the square" we obtain

$p(\mathbf{w}\,|\,X,\mathbf{y}) \propto \exp\Big(-\frac{1}{2\sigma_n^2}(\mathbf{y}-X^\top\mathbf{w})^\top(\mathbf{y}-X^\top\mathbf{w})\Big)\exp\Big(-\frac{1}{2}\mathbf{w}^\top\Sigma_p^{-1}\mathbf{w}\Big)$

(2.7) $\quad \propto \exp\Big(-\frac{1}{2}(\mathbf{w}-\bar{\mathbf{w}})^\top\big(\tfrac{1}{\sigma_n^2}XX^\top+\Sigma_p^{-1}\big)(\mathbf{w}-\bar{\mathbf{w}})\Big),$

where $\bar{\mathbf{w}} = \sigma_n^{-2}\big(\sigma_n^{-2}XX^\top+\Sigma_p^{-1}\big)^{-1}X\mathbf{y}$, and we recognize the form of the posterior distribution as Gaussian with mean $\bar{\mathbf{w}}$ and covariance matrix $A^{-1}$

(2.8) $\quad p(\mathbf{w}\,|\,X,\mathbf{y}) \sim \mathcal{N}\big(\bar{\mathbf{w}} = \tfrac{1}{\sigma_n^2} A^{-1} X \mathbf{y},\; A^{-1}\big),$

where $A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}$. Notice that for this model (and indeed for any Gaussian posterior) the mean of the posterior distribution $p(\mathbf{w}\,|\,\mathbf{y},X)$ is also its mode, which is also called the maximum a posteriori (MAP) estimate of $\mathbf{w}$. In a non-Bayesian setting the negative log prior is sometimes thought of as a penalty term, and the MAP point is known as the penalized maximum likelihood estimate of the weights, and this may cause some confusion between the two approaches. Note, however, that in the Bayesian setting the MAP estimate plays no special rôle.$^3$ The penalized maximum likelihood procedure



MAP estimate






2 Often Bayes' rule is stated as $p(a|b) = p(b|a)\,p(a)/p(b)$; here we use it in a form where we additionally condition everywhere on the inputs $X$ (but neglect this extra conditioning for the prior which is independent of the inputs).








Figure 2.1: Example of Bayesian linear model $f(x) = w_1 + w_2 x$ with intercept $w_1$ and slope parameter $w_2$. Panel (a) shows the contours of the prior distribution $p(\mathbf{w}) \sim \mathcal{N}(\mathbf{0}, I)$, eq. (2.4). Panel (b) shows three training points marked by crosses. Panel (c) shows contours of the likelihood $p(\mathbf{y}\,|\,X,\mathbf{w})$, eq. (2.3), assuming a noise level of $\sigma_n = 1$; note that the slope is much more "well determined" than the intercept. Panel (d) shows the posterior, $p(\mathbf{w}\,|\,X,\mathbf{y})$, eq. (2.7); comparing the maximum of the posterior to the likelihood, we see that the intercept has been shrunk towards zero whereas the more 'well determined' slope is almost unchanged. All contour plots give the 1 and 2 standard deviation equi-probability contours. Superimposed on the data in panel (b) are the predictive mean plus/minus two standard deviations of the (noise-free) predictive distribution $p(f_*\,|\,\mathbf{x}_*, X, \mathbf{y})$, eq. (2.9).





3 In this case, due to symmetries in the model and posterior, it happens that the mean of the predictive distribution is the same as the prediction at the mean of the posterior. However, this is not the case in general.





C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. (c) 2006 Massachusetts Institute of Technology. www. GaussianProcess.org/gpml
C·E·拉斯穆森 & C·K·I·威廉姆斯,《机器学习中的高斯过程》,麻省理工学院出版社,2006 年,ISBN 026218253X。(©)2006 麻省理工学院。网址:www.GaussianProcess.org/gpml



is known in this case as ridge regression [Hoerl and Kennard, 1970] because of the effect of the quadratic penalty term $\frac{1}{2}\mathbf{w}^\top \Sigma_p^{-1} \mathbf{w}$ from the log prior.



ridge regression



To make predictions for a test case we average over all possible parameter values, weighted by their posterior probability. This is in contrast to non-Bayesian schemes, where a single parameter is typically chosen by some criterion. Thus the predictive distribution for $f_* \triangleq f(\mathbf{x}_*)$ at $\mathbf{x}_*$ is given by averaging the output of all possible linear models w.r.t. the Gaussian posterior

(2.9) $\quad p(f_*\,|\,\mathbf{x}_*, X, \mathbf{y}) = \int p(f_*\,|\,\mathbf{x}_*, \mathbf{w})\, p(\mathbf{w}\,|\,X, \mathbf{y})\, d\mathbf{w} = \mathcal{N}\big(\tfrac{1}{\sigma_n^2}\mathbf{x}_*^\top A^{-1} X \mathbf{y},\; \mathbf{x}_*^\top A^{-1} \mathbf{x}_*\big).$



predictive distribution



The predictive distribution is again Gaussian, with a mean given by the posterior mean of the weights from eq. (2.8) multiplied by the test input, as one would expect from symmetry considerations. The predictive variance is a quadratic form of the test input with the posterior covariance matrix, showing that the predictive uncertainties grow with the magnitude of the test input, as one would expect for a linear model.

An example of Bayesian linear regression is given in Figure 2.1. Here we have chosen a 1-d input space so that the weight-space is two-dimensional and can be easily visualized. Contours of the Gaussian prior are shown in panel (a). The data are depicted as crosses in panel (b). This gives rise to the likelihood shown in panel (c) and the posterior distribution in panel (d). The predictive distribution and its error bars are also marked in panel (b).
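To make the weight-space computations concrete, the following is a minimal NumPy sketch of eq. (2.8) and eq. (2.9) for a model with intercept and slope in the spirit of Figure 2.1; the prior covariance, noise level and data values are illustrative choices, not those used in the figure.

```python
import numpy as np

def blr_posterior_predict(X, y, X_star, Sigma_p, sigma_n):
    """Bayesian linear regression: posterior over w, eq. (2.8), and
    predictive distribution, eq. (2.9). Data points are the columns of X."""
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)   # A = sigma_n^{-2} X X^T + Sigma_p^{-1}
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ X @ y / sigma_n**2                   # posterior mean of the weights
    mean = X_star.T @ w_bar                              # predictive mean, eq. (2.9)
    cov = X_star.T @ A_inv @ X_star                      # predictive covariance, eq. (2.9)
    return w_bar, mean, cov

# toy 1-d data; inputs are augmented with a constant 1 to absorb the intercept
rng = np.random.default_rng(0)
x = np.array([-1.0, 0.5, 1.5])
X = np.vstack([np.ones_like(x), x])                      # 2 x 3 design matrix
y = 0.5 + 1.0 * x + 0.3 * rng.standard_normal(3)         # noisy targets
X_star = np.vstack([np.ones(5), np.linspace(-2.0, 2.0, 5)])
w_bar, mean, cov = blr_posterior_predict(X, y, X_star, Sigma_p=np.eye(2), sigma_n=0.3)
```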

2.1.2 Projections of Inputs into Feature Space


In the previous section we reviewed the Bayesian linear model which suffers from limited expressiveness. A very simple idea to overcome this problem is to first project the inputs into some high dimensional space using a set of basis functions and then apply the linear model in this space instead of directly on the inputs themselves. For example, a scalar input $x$ could be projected into the space of powers of $x$: $\phi(x) = (1, x, x^2, x^3, \dots)^\top$ to implement polynomial regression. As long as the projections are fixed functions (i.e. independent of the parameters $\mathbf{w}$) the model is still linear in the parameters, and therefore analytically tractable.$^4$ This idea is also used in classification, where a dataset which is not linearly separable in the original data space may become linearly separable in a high dimensional feature space, see section 3.3. Application of this idea begs the question of how to choose the basis functions? As we shall demonstrate (in chapter 5), the Gaussian process formalism allows us to answer this question. For now, we assume that the basis functions are given.



feature space

polynomial regression



Specifically, we introduce the function $\phi(\mathbf{x})$ which maps a $D$-dimensional input vector $\mathbf{x}$ into an $N$-dimensional feature space. Further let the matrix




4 Models with adaptive basis functions, such as e.g. multilayer perceptrons, may at first seem like a useful extension, but they are much harder to treat, except in the limit of an infinite number of hidden units, see section 4.2.3.





C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. (C) 2006 Massachusetts Institute of Technology. www. GaussianProcess.org/gpml Φ(X) be the aggregation of columns Φ(x) for all cases in the training set. Now the model is
C. E. 拉斯穆森 & C. K. I. 威廉姆斯,《机器学习中的高斯过程》,麻省理工学院出版社,2006 年,ISBN 026218253X。(版权所有)2006 麻省理工学院。www.GaussianProcess.org/gpml Φ(X) 是训练集中所有案例 Φ(x) 列的聚合。现在模型为

(2.10) $\quad f(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{w},$

where the vector of parameters now has length $N$. The analysis for this model is analogous to the standard linear model, except that everywhere $\Phi(X)$ is substituted for $X$. Thus the predictive distribution becomes

(2.11) $\quad f_*\,|\,\mathbf{x}_*, X, \mathbf{y} \sim \mathcal{N}\big(\tfrac{1}{\sigma_n^2}\phi(\mathbf{x}_*)^\top A^{-1} \Phi \mathbf{y},\; \phi(\mathbf{x}_*)^\top A^{-1} \phi(\mathbf{x}_*)\big)$



explicit feature space formulation



with $\Phi = \Phi(X)$ and $A = \sigma_n^{-2}\Phi\Phi^\top + \Sigma_p^{-1}$. To make predictions using this equation we need to invert the $A$ matrix of size $N \times N$ which may not be convenient if $N$, the dimension of the feature space, is large. However, we can rewrite the equation in the following way

(2.12) $\quad f_*\,|\,\mathbf{x}_*, X, \mathbf{y} \sim \mathcal{N}\big(\phi_*^\top \Sigma_p \Phi\, (K + \sigma_n^2 I)^{-1} \mathbf{y},\; \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi\, (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p \phi_*\big),$



alternative formulation



where we have used the shorthand $\phi(\mathbf{x}_*) = \phi_*$ and defined $K = \Phi^\top\Sigma_p\Phi$. To show this for the mean, first note that using the definitions of $A$ and $K$ we have $\sigma_n^{-2}\Phi(K+\sigma_n^2 I) = \sigma_n^{-2}\Phi(\Phi^\top\Sigma_p\Phi+\sigma_n^2 I) = A\Sigma_p\Phi$. Now multiplying through by $A^{-1}$ from left and $(K+\sigma_n^2 I)^{-1}$ from the right gives $\sigma_n^{-2}A^{-1}\Phi = \Sigma_p\Phi(K+\sigma_n^2 I)^{-1}$, showing the equivalence of the mean expressions in eq. (2.11) and eq. (2.12). For the variance we use the matrix inversion lemma, eq. (A.9), setting $Z^{-1} = \Sigma_p$, $W^{-1} = \sigma_n^2 I$ and $V = U = \Phi$ therein. In eq. (2.12) we need to invert matrices of size $n \times n$ which is more convenient when $n < N$. Geometrically, note that $n$ datapoints can span at most $n$ dimensions in the feature space.
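The equivalence of the two formulations is also easy to confirm numerically. The sketch below compares eq. (2.11), which inverts the $N \times N$ matrix $A$, with eq. (2.12), which inverts the $n \times n$ matrix $K + \sigma_n^2 I$; the polynomial feature map, prior covariance and data are arbitrary illustrative choices.

```python
import numpy as np

def phi(x):
    # example feature map: polynomial basis (1, x, x^2, x^3), so N = 4
    return np.array([1.0, x, x**2, x**3])

rng = np.random.default_rng(1)
x_train = rng.uniform(-1.0, 1.0, size=6)
y = rng.standard_normal(6)
x_star, sigma_n, Sigma_p = 0.3, 0.1, np.eye(4)

Phi = np.stack([phi(xi) for xi in x_train], axis=1)      # N x n matrix of training features
phi_s = phi(x_star)
N, n = Phi.shape

# eq. (2.11): invert the N x N matrix A
A = Phi @ Phi.T / sigma_n**2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
mean_w = phi_s @ A_inv @ Phi @ y / sigma_n**2
var_w = phi_s @ A_inv @ phi_s

# eq. (2.12): invert the n x n matrix K + sigma_n^2 I, with K = Phi^T Sigma_p Phi
K = Phi.T @ Sigma_p @ Phi
B_inv = np.linalg.inv(K + sigma_n**2 * np.eye(n))
mean_k = phi_s @ Sigma_p @ Phi @ B_inv @ y
var_k = phi_s @ Sigma_p @ phi_s - phi_s @ Sigma_p @ Phi @ B_inv @ Phi.T @ Sigma_p @ phi_s

assert np.allclose(mean_w, mean_k) and np.allclose(var_w, var_k)
```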



computational load



Notice that in eq. (2.12) the feature space always enters in the form of $\Phi^\top\Sigma_p\Phi$, $\phi_*^\top\Sigma_p\Phi$, or $\phi_*^\top\Sigma_p\phi_*$; thus the entries of these matrices are invariably of the form $\phi(\mathbf{x})^\top\Sigma_p\,\phi(\mathbf{x}')$ where $\mathbf{x}$ and $\mathbf{x}'$ are in either the training or the test sets. Let us define $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top\Sigma_p\,\phi(\mathbf{x}')$. For reasons that will become clear later we call $k(\cdot, \cdot)$ a covariance function or kernel. Notice that $\phi(\mathbf{x})^\top\Sigma_p\,\phi(\mathbf{x}')$ is an inner product (with respect to $\Sigma_p$). As $\Sigma_p$ is positive definite we can define $\Sigma_p^{1/2}$ so that $(\Sigma_p^{1/2})^2 = \Sigma_p$; for example if the SVD (singular value decomposition) of $\Sigma_p = U D U^\top$, where $D$ is diagonal, then one form for $\Sigma_p^{1/2}$ is $U D^{1/2} U^\top$. Then defining $\psi(\mathbf{x}) = \Sigma_p^{1/2}\phi(\mathbf{x})$ we obtain a simple dot product representation $k(\mathbf{x}, \mathbf{x}') = \psi(\mathbf{x}) \cdot \psi(\mathbf{x}')$.



kernel



If an algorithm is defined solely in terms of inner products in input space then it can be lifted into feature space by replacing occurrences of those inner products by $k(\mathbf{x}, \mathbf{x}')$; this is sometimes called the kernel trick. This technique is particularly valuable in situations where it is more convenient to compute the kernel than the feature vectors themselves. As we will see in the coming sections, this often leads to considering the kernel as the object of primary interest, and its corresponding feature space as having secondary practical importance.



kernel trick




C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006,
C. E. 拉斯穆森 & C. K. I. 威廉姆斯,《机器学习中的高斯过程》,麻省理工学院出版社,2006 年,

ISBN 026218253X. (c) 2006 Massachusetts Institute of Technology. www. GaussianProcess.org/gpml
ISBN 026218253X。(版权所有)2006 马萨诸塞理工学院。www.GaussianProcess.org/gpml



2.2 Function-space View


An alternative and equivalent way of reaching identical results to the previous section is possible by considering inference directly in function space. We use a Gaussian process (GP) to describe a distribution over functions. Formally:

Definition 2.1  A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.



Gaussian process



A Gaussian process is completely specified by its mean function and covariance function. We define the mean function $m(\mathbf{x})$ and the covariance function $k(\mathbf{x}, \mathbf{x}')$ of a real process $f(\mathbf{x})$ as

(2.13) $\quad m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})],$

$\qquad\; k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\big[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))\big],$



covariance and mean function



and will write the Gaussian process as

(2.14) $\quad f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}')\big).$

Usually, for notational simplicity we will take the mean function to be zero, although this need not be done, see section 2.7.

In our case the random variables represent the value of the function $f(\mathbf{x})$ at location $\mathbf{x}$. Often, Gaussian processes are defined over time, i.e. where the index set of the random variables is time. This is not (normally) the case in our use of GPs; here the index set $\mathcal{X}$ is the set of possible inputs, which could be more general, e.g. $\mathbb{R}^D$. For notational convenience we use the (arbitrary) enumeration of the cases in the training set to identify the random variables such that $f_i \triangleq f(\mathbf{x}_i)$ is the random variable corresponding to the case $(\mathbf{x}_i, y_i)$ as would be expected.



index set input domain



A Gaussian process is defined as a collection of random variables. Thus, the definition automatically implies a consistency requirement, which is also sometimes known as the marginalization property. This property simply means that if the GP e.g. specifies $(y_1, y_2) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, then it must also specify $y_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$ where $\Sigma_{11}$ is the relevant submatrix of $\Sigma$, see eq. (A.6). In other words, examination of a larger set of variables does not change the distribution of the smaller set. Notice that the consistency requirement is automatically fulfilled if the covariance function specifies entries of the covariance matrix.$^5$ The definition does not exclude Gaussian processes with finite index sets (which would be simply Gaussian distributions), but these are not particularly interesting for our purposes.



marginalization property

finite index set






5 Note, however, that if you instead specified e.g. a function for the entries of the inverse covariance matrix, then the marginalization property would no longer be fulfilled, and one could not think of this as a consistent collection of random variables - this would not qualify as a Gaussian process.





C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. (C) 2006 Massachusetts Institute of Technology. www. GaussianProcess.org/gpml
C. E. 拉斯穆森 & C. K. I. 威廉姆斯,《机器学习中的高斯过程》,麻省理工学院出版社,2006 年,ISBN 026218253X。版权所有 © 2006 麻省理工学院。www.GaussianProcess.org/gpml

A simple example of a Gaussian process can be obtained from our Bayesian linear regression model $f(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{w}$ with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$. We have for the mean and covariance

(2.15) $\quad \mathbb{E}[f(\mathbf{x})] = \phi(\mathbf{x})^\top \mathbb{E}[\mathbf{w}] = 0,$

$\qquad\; \mathbb{E}[f(\mathbf{x}) f(\mathbf{x}')] = \phi(\mathbf{x})^\top \mathbb{E}[\mathbf{w}\mathbf{w}^\top]\, \phi(\mathbf{x}') = \phi(\mathbf{x})^\top \Sigma_p\, \phi(\mathbf{x}').$

Thus $f(\mathbf{x})$ and $f(\mathbf{x}')$ are jointly Gaussian with zero mean and covariance given by $\phi(\mathbf{x})^\top \Sigma_p\, \phi(\mathbf{x}')$. Indeed, the function values $f(\mathbf{x}_1), \dots, f(\mathbf{x}_n)$ corresponding to any number of input points $n$ are jointly Gaussian, although if $N < n$ then this Gaussian is singular (as the joint covariance matrix will be of rank $N$).

In this chapter our running example of a covariance function will be the squared exponential$^6$ (SE) covariance function; other covariance functions are discussed in chapter 4. The covariance function specifies the covariance between pairs of random variables

(2.16) $\quad \mathrm{cov}\big(f(\mathbf{x}_p), f(\mathbf{x}_q)\big) = k(\mathbf{x}_p, \mathbf{x}_q) = \exp\big(-\tfrac{1}{2}|\mathbf{x}_p - \mathbf{x}_q|^2\big).$

Note that the covariance between the outputs is written as a function of the inputs. For this particular covariance function, we see that the covariance is almost unity between variables whose corresponding inputs are very close, and decreases as their distance in the input space increases.

It can be shown (see section 4.3.1) that the squared exponential covariance function corresponds to a Bayesian linear regression model with an infinite number of basis functions. Indeed for every positive definite covariance function $k(\cdot, \cdot)$, there exists a (possibly infinite) expansion in terms of basis functions (see Mercer's theorem in section 4.3). We can also obtain the SE covariance function from the linear combination of an infinite number of Gaussian-shaped basis functions, see eq. (4.13) and eq. (4.30).

The specification of the covariance function implies a distribution over functions. To see this, we can draw samples from the distribution of functions evaluated at any number of points; in detail, we choose a number of input points,$^7$ $X_*$, and write out the corresponding covariance matrix using eq. (2.16) elementwise. Then we generate a random Gaussian vector with this covariance matrix

(2.17) $\quad \mathbf{f}_* \sim \mathcal{N}\big(\mathbf{0},\, K(X_*, X_*)\big),$

and plot the generated values as a function of the inputs. Figure 2.2(a) shows three such samples. The generation of multivariate Gaussian samples is described in section A.2.
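A minimal sketch of this procedure, using the SE covariance of eq. (2.16) with unit length-scale and the Cholesky-based sampling of section A.2; the grid of test inputs and the small jitter added for numerical stability are our choices.

```python
import numpy as np

def k_se(x1, x2):
    """Squared exponential covariance, eq. (2.16), for 1-d inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * d**2)

x_star = np.linspace(-5.0, 5.0, 101)          # test inputs X_*
K_ss = k_se(x_star, x_star)                   # covariance matrix K(X_*, X_*)

# draw three functions from the prior, eq. (2.17)
rng = np.random.default_rng(2)
L = np.linalg.cholesky(K_ss + 1e-8 * np.eye(len(x_star)))   # jitter for stability
f_prior = L @ rng.standard_normal((len(x_star), 3))          # each column is one sample
```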

In the example in Figure 2.2 the input values were equidistant, but this need not be the case. Notice that "informally" the functions look smooth. In fact the squared exponential covariance function is infinitely differentiable, leading to the process being infinitely mean-square differentiable (see section 4.1). We also see that the functions seem to have a characteristic length-scale, which informally can be thought of as roughly the distance you have to move in input space before the function value can change significantly, see section 4.2.1. For eq. (2.16) the characteristic length-scale is around one unit. By replacing $|\mathbf{x}_p - \mathbf{x}_q|$ by $|\mathbf{x}_p - \mathbf{x}_q|/\ell$ in eq. (2.16) for some positive constant $\ell$ we could change the characteristic length-scale of the process. Also, the overall variance of the



Bayesian linear model is a Gaussian process

basis functions

smoothness

characteristic length-scale






6 Sometimes this covariance function is called the Radial Basis Function (RBF) or Gaussian; here we prefer squared exponential.

7 Technically, these input points play the rôle of test inputs and therefore carry a subscript asterisk; this will become clearer later when both training and test points are involved.








Figure 2.2: Panel (a) shows three functions drawn at random from a GP prior; the dots indicate values of $y$ actually generated; the two other functions have (less correctly) been drawn as lines by joining a large number of evaluated points. Panel (b) shows three random functions drawn from the posterior, i.e. the prior conditioned on the five noise free observations indicated. In both plots the shaded area represents the pointwise mean plus and minus two times the standard deviation for each input value (corresponding to the 95% confidence region), for the prior and posterior respectively.


magnitude


random function can be controlled by a positive pre-factor before the exp in eq. (2.16). We will discuss more about how such factors affect the predictions in section 2.3, and say more about how to set such scale parameters in chapter 5.

Prediction with Noise-free Observations


We are usually not primarily interested in drawing random functions from the prior, but want to incorporate the knowledge that the training data provides about the function. Initially, we will consider the simple special case where the observations are noise free, that is we know $\{(\mathbf{x}_i, f_i) \mid i = 1, \dots, n\}$. The joint distribution of the training outputs, $\mathbf{f}$, and the test outputs $\mathbf{f}_*$ according to the

prior is  

(2.18) $\quad \begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\, \begin{bmatrix} K(X,X) &amp; K(X,X_*) \\ K(X_*,X) &amp; K(X_*,X_*) \end{bmatrix}\right).$

If there are $n$ training points and $n_*$ test points then $K(X, X_*)$ denotes the $n \times n_*$ matrix of the covariances evaluated at all pairs of training and test points, and similarly for the other entries $K(X,X)$, $K(X_*,X_*)$ and $K(X_*,X)$. To get the posterior distribution over functions we need to restrict this joint prior distribution to contain only those functions which agree with the observed data points. Graphically in Figure 2.2 you may think of generating functions from the prior, and rejecting the ones that disagree with the observations, although this strategy would not be computationally very efficient. Fortunately, in probabilistic terms this operation is extremely simple, corresponding to conditioning the joint Gaussian prior distribution on the observations (see section A.2 for further details) to give

(2.19) $\quad \mathbf{f}_* \mid X_*, X, \mathbf{f} \sim \mathcal{N}\big(K(X_*,X)\,K(X,X)^{-1}\mathbf{f},\; K(X_*,X_*) - K(X_*,X)\,K(X,X)^{-1}K(X,X_*)\big).$



joint prior

graphical rejection




Function values $\mathbf{f}_*$ (corresponding to test inputs $X_*$) can be sampled from the joint posterior distribution by evaluating the mean and covariance matrix from eq. (2.19) and generating samples according to the method described in section A.2.
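Continuing the sketch from above, conditioning on a handful of noise-free observations via eq. (2.19) and sampling from the resulting posterior might look as follows; the observation locations and values are illustrative, not those of Figure 2.2.

```python
import numpy as np

def k_se(x1, x2):
    # squared exponential covariance, eq. (2.16), for 1-d inputs
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * d**2)

x_train = np.array([-4.0, -3.0, -1.0, 0.0, 2.0])   # training inputs X
f_train = np.array([-2.0, 0.0, 1.0, 2.0, -1.0])    # noise-free observations f
x_star = np.linspace(-5.0, 5.0, 101)               # test inputs X_*

K = k_se(x_train, x_train)
K_s = k_se(x_train, x_star)                        # K(X, X_*)
K_ss = k_se(x_star, x_star)

# posterior mean and covariance, eq. (2.19)
K_inv = np.linalg.inv(K + 1e-10 * np.eye(len(x_train)))
mean_post = K_s.T @ K_inv @ f_train
cov_post = K_ss - K_s.T @ K_inv @ K_s

# draw three functions from the posterior
rng = np.random.default_rng(3)
L = np.linalg.cholesky(cov_post + 1e-8 * np.eye(len(x_star)))
f_post = mean_post[:, None] + L @ rng.standard_normal((len(x_star), 3))
```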

Figure 2.2(b) shows the results of these computations given the five data-points marked with + symbols. Notice that it is trivial to extend these computations to multidimensional inputs - one simply needs to change the evaluation of the covariance function in accordance with eq. (2.16), although the resulting functions may be harder to display graphically.

Prediction using Noisy Observations


It is typical for more realistic modelling situations that we do not have access to function values themselves, but only noisy versions thereof $y = f(\mathbf{x}) + \varepsilon$.$^8$ Assuming additive independent identically distributed Gaussian noise $\varepsilon$ with variance $\sigma_n^2$, the prior on the noisy observations becomes

(2.20) $\quad \mathrm{cov}(y_p, y_q) = k(\mathbf{x}_p, \mathbf{x}_q) + \sigma_n^2\,\delta_{pq} \quad \text{or} \quad \mathrm{cov}(\mathbf{y}) = K(X,X) + \sigma_n^2 I,$

where $\delta_{pq}$ is a Kronecker delta which is one iff $p = q$ and zero otherwise. It follows from the independence$^9$ assumption about the noise, that a diagonal matrix$^{10}$ is added, in comparison to the noise free case, eq. (2.16). Introducing the noise term in eq. (2.18) we can write the joint distribution of the observed target values and the function values at the test locations under the prior as

(2.21) $\quad \begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\, \begin{bmatrix} K(X,X) + \sigma_n^2 I &amp; K(X,X_*) \\ K(X_*,X) &amp; K(X_*,X_*) \end{bmatrix}\right).$

Deriving the conditional distribution corresponding to eq. (2.19) we arrive at the key predictive equations for Gaussian process regression

(2.22) $\quad \mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}\big(\bar{\mathbf{f}}_*,\, \mathrm{cov}(\mathbf{f}_*)\big), \quad \text{where}$

(2.23) $\quad \bar{\mathbf{f}}_* \triangleq \mathbb{E}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K(X_*,X)\,[K(X,X) + \sigma_n^2 I]^{-1}\mathbf{y},$

(2.24) $\quad \mathrm{cov}(\mathbf{f}_*) = K(X_*,X_*) - K(X_*,X)\,[K(X,X) + \sigma_n^2 I]^{-1}K(X,X_*).$

Notice that we now have exact correspondence with the weight space view in eq. (2.12) when identifying $K(C,D) = \Phi(C)^\top \Sigma_p \Phi(D)$, where $C, D$ stand for either $X$ or $X_*$. For any set of basis functions, we can compute the corresponding covariance function as $k(\mathbf{x}_p, \mathbf{x}_q) = \phi(\mathbf{x}_p)^\top \Sigma_p\, \phi(\mathbf{x}_q)$; conversely, for every (positive definite) covariance function $k$, there exists a (possibly infinite) expansion in terms of basis functions, see section 4.3.



noise-free predictive distribution

predictive distribution






8 There are some situations where it is reasonable to assume that the observations are noise-free, for example for computer simulations, see e.g. Sacks et al. [1989].

9 More complicated noise models with non-trivial covariance structure can also be handled, see section 9.2.

10 Notice that the Kronecker delta is on the index of the cases, not the value of the input; for the signal part of the covariance function the input value is the index set to the random variables describing the function, for the noise part it is the identity of the point.








Figure 2.3: Graphical model (chain graph) for a GP for regression. Squares represent observed variables and circles represent unknowns. The thick horizontal bar represents a set of fully connected nodes. Note that an observation $y_i$ is conditionally independent of all other nodes given the corresponding latent variable, $f_i$. Because of the marginalization property of GPs addition of further inputs, $\mathbf{x}_*$, latent variables, $f_*$, and unobserved targets, $y_*$, does not change the distribution of any other variables.




correspondence with weight-space view



The expressions involving $K(X,X)$, $K(X,X_*)$ and $K(X_*,X_*)$ etc. can look rather unwieldy, so we now introduce a compact form of the notation setting $K = K(X,X)$ and $K_* = K(X,X_*)$. In the case that there is only one test point $\mathbf{x}_*$ we write $\mathbf{k}(\mathbf{x}_*) = \mathbf{k}_*$ to denote the vector of covariances between the test point and the $n$ training points. Using this compact notation and for a single test point $\mathbf{x}_*$, equations 2.23 and 2.24 reduce to

(2.25) $\quad \bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{y},$

(2.26) $\quad \mathbb{V}[f_*] = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{k}_*.$



compact notation



Let us examine the predictive distribution as given by equations 2.25 and 2.26. Note first that the mean prediction eq. (2.25) is a linear combination of observations $\mathbf{y}$; this is sometimes referred to as a linear predictor. Another way to look at this equation is to see it as a linear combination of $n$ kernel functions, each one centered on a training point, by writing

(2.27) $\quad \bar{f}(\mathbf{x}_*) = \sum_{i=1}^{n} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}_*),$



predictive distribution

linear predictor



where $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1}\mathbf{y}$. The fact that the mean prediction for $f(\mathbf{x}_*)$ can be written as eq. (2.27) despite the fact that the GP can be represented in terms of a (possibly infinite) number of basis functions is one manifestation of the representer theorem; see section 6.2 for more on this point. We can understand this result intuitively because although the GP defines a joint Gaussian distribution over all of the $y$ variables, one for each point in the index set $\mathcal{X}$, for making predictions at $\mathbf{x}_*$ we only care about the $(n+1)$-dimensional distribution defined by the $n$ training points and the test point. As a Gaussian distribution is marginalized by just taking the relevant block of the joint covariance matrix (see section A.2) it is clear that conditioning this $(n+1)$-dimensional distribution on the observations gives us the desired result. A graphical model representation of a GP is given in Figure 2.3.



representer theorem







Figure 2.4: Panel (a) is identical to Figure 2.2(b) showing three random functions drawn from the posterior. Panel (b) shows the posterior covariance between $f(\mathbf{x})$ and $f(\mathbf{x}')$ for the same data for three different values of $\mathbf{x}'$. Note that the covariance at close points is high, falling to zero at the training points (where there is no variance, since it is a noise-free process), then becomes negative, etc. This happens because if the smooth function happens to be less than the mean on one side of the data point, it tends to exceed the mean on the other side, causing a reversal of the sign of the covariance at the data points. Note for contrast that the prior covariance is simply of Gaussian shape and never negative.


Note also that the variance in eq. (2.24) does not depend on the observed targets, but only on the inputs; this is a property of the Gaussian distribution. The variance is the difference between two terms: the first term $K(X_*, X_*)$ is simply the prior covariance; from that is subtracted a (positive) term, representing the information the observations give us about the function. We can very simply compute the predictive distribution of test targets $\mathbf{y}_*$ by adding $\sigma_n^2 I$ to the variance in the expression for $\mathrm{cov}(\mathbf{f}_*)$.

noisy predictions

The predictive distribution for the GP model gives more than just pointwise errorbars of the simplified eq. (2.26). Although not stated explicitly, eq. (2.24) holds unchanged when $X_*$ denotes multiple test inputs; in this case the covariance of the test targets are computed (whose diagonal elements are the pointwise variances). In fact, eq. (2.23) is the mean function and eq. (2.24) the covariance function of the (Gaussian) posterior process; recall the definition of Gaussian process from page 13. The posterior covariance is illustrated in Figure 2.4(b).

joint predictions

posterior process

It will be useful (particularly for chapter 5) to introduce the marginal likelihood (or evidence) $p(\mathbf{y}|X)$ at this point. The marginal likelihood is the integral

marginal likelihood







input: $X$ (inputs), $\mathbf{y}$ (targets), $k$ (covariance function), $\sigma_n^2$ (noise level), $\mathbf{x}_*$ (test input)
2: $L := \mathrm{cholesky}(K + \sigma_n^2 I)$
3: $\boldsymbol{\alpha} := L^\top \backslash (L \backslash \mathbf{y})$
4: $\bar{f}_* := \mathbf{k}_*^\top \boldsymbol{\alpha}$   (predictive mean, eq. (2.25))
5: $\mathbf{v} := L \backslash \mathbf{k}_*$
6: $\mathbb{V}[f_*] := k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{v}^\top \mathbf{v}$   (predictive variance, eq. (2.26))
7: $\log p(\mathbf{y}|X) := -\tfrac{1}{2}\mathbf{y}^\top\boldsymbol{\alpha} - \sum_i \log L_{ii} - \tfrac{n}{2}\log 2\pi$   (eq. (2.30))
8: return: $\bar{f}_*$ (mean), $\mathbb{V}[f_*]$ (variance), $\log p(\mathbf{y}|X)$ (log marginal likelihood)



Algorithm 2.1: Predictions and log marginal likelihood for Gaussian process regression. The implementation addresses the matrix inversion required by eq. (2.25) and (2.26) using Cholesky factorization, see section A.4. For multiple test cases lines 4-6 are repeated. The log determinant required in eq. (2.30) is computed from the Cholesky factor (for large $n$ it may not be possible to represent the determinant itself). The computational complexity is $n^3/6$ for the Cholesky decomposition in line 2, and $n^2/2$ for solving triangular systems in line 3 and (for each test case) in line 5.


of the likelihood times the prior

(2.28) $\quad p(\mathbf{y}|X) = \int p(\mathbf{y}|\mathbf{f},X)\, p(\mathbf{f}|X)\, d\mathbf{f}.$

The term marginal likelihood refers to the marginalization over the function values $\mathbf{f}$. Under the Gaussian process model the prior is Gaussian, $\mathbf{f}|X \sim \mathcal{N}(\mathbf{0}, K)$, or

(2.29) $\quad \log p(\mathbf{f}|X) = -\tfrac{1}{2}\mathbf{f}^\top K^{-1}\mathbf{f} - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi,$

and the likelihood is a factorized Gaussian $\mathbf{y}|\mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma_n^2 I)$ so we can make use of equations A.7 and A.8 to perform the integration yielding the log marginal likelihood

(2.30) $\quad \log p(\mathbf{y}|X) = -\tfrac{1}{2}\mathbf{y}^\top(K + \sigma_n^2 I)^{-1}\mathbf{y} - \tfrac{1}{2}\log|K + \sigma_n^2 I| - \tfrac{n}{2}\log 2\pi.$

This result can also be obtained directly by observing that $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$.

A practical implementation of Gaussian process regression (GPR) is shown in Algorithm 2.1. The algorithm uses Cholesky decomposition, instead of directly inverting the matrix, since it is faster and numerically more stable, see section A.4. The algorithm returns the predictive mean and variance for noise free test data; to compute the predictive distribution for noisy test data $\mathbf{y}_*$, simply add the noise variance $\sigma_n^2$ to the predictive variance of $f_*$.
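For concreteness, one possible NumPy/SciPy transcription of Algorithm 2.1 is sketched below; the SE kernel and the toy data at the end are illustrative choices and not part of the algorithm itself.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gpr_predict(X, y, k, sigma_n2, X_star):
    """Algorithm 2.1: GP regression predictions and log marginal likelihood.

    X: (n, D) training inputs, y: (n,) targets, k: covariance function
    returning a matrix for two sets of inputs, sigma_n2: noise variance,
    X_star: (m, D) test inputs."""
    n = len(y)
    K = k(X, X)
    L = cholesky(K + sigma_n2 * np.eye(n), lower=True)                   # line 2
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))    # line 3
    K_s = k(X, X_star)
    f_bar = K_s.T @ alpha                                                # line 4, eq. (2.25)
    v = solve_triangular(L, K_s, lower=True)                             # line 5
    var = np.diag(k(X_star, X_star)) - np.sum(v**2, axis=0)              # line 6, eq. (2.26)
    log_ml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
              - 0.5 * n * np.log(2.0 * np.pi))                           # line 7, eq. (2.30)
    return f_bar, var, log_ml

def k_se(A, B):
    """SE covariance, eq. (2.16), for inputs stored as rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

X = np.linspace(-4.0, 4.0, 20).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(4).standard_normal(20)
X_star = np.linspace(-5.0, 5.0, 50).reshape(-1, 1)
f_bar, var, log_ml = gpr_predict(X, y, k_se, 0.01, X_star)
```

Note that, as in the algorithm, the noise variance is added only to the training covariance; for noisy test targets one would add $\sigma_n^2$ to `var`.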

2.3 Varying the Hyperparameters


Typically the covariance functions that we use will have some free parameters. For example, the squared-exponential covariance function in one dimension has the following form

(2.31) $\quad k_y(x_p, x_q) = \sigma_f^2 \exp\!\Big(-\frac{1}{2\ell^2}(x_p - x_q)^2\Big) + \sigma_n^2\,\delta_{pq}.$


The covariance is denoted $k_y$ as it is for the noisy targets $y$ rather than for the underlying function $f$. Observe that the length-scale $\ell$, the signal variance $\sigma_f^2$ and the noise variance $\sigma_n^2$ can be varied. In general we call the free parameters hyperparameters.$^{11}$
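Written as code, eq. (2.31) builds the covariance matrix of the noisy targets as follows (a small sketch for 1-d inputs; the default hyperparameter values are arbitrary). The noise term sits on the diagonal, i.e. the Kronecker delta acts on the index of the cases rather than on the input values.

```python
import numpy as np

def K_y(x, ell=1.0, sigma_f=1.0, sigma_n=0.1):
    """Covariance matrix of the noisy targets y, eq. (2.31), for a 1-d input vector x."""
    d = x[:, None] - x[None, :]
    signal = sigma_f**2 * np.exp(-0.5 * d**2 / ell**2)   # SE part with length-scale ell
    noise = sigma_n**2 * np.eye(len(x))                   # independent noise on the diagonal
    return signal + noise
```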




Figure 2.5: (a) Data is generated from a GP with hyperparameters $(\ell, \sigma_f, \sigma_n) = (1, 1, 0.1)$, as shown by the + symbols. Using Gaussian process prediction with these hyperparameters we obtain a 95% confidence region for the underlying function $f$ (shown in grey). Panels (b) and (c) again show the 95% confidence region, but this time for hyperparameter values $(0.3, 1.08, 0.00005)$ and $(3.0, 1.16, 0.89)$ respectively.




hyperparameters



In chapter 5 we will consider various methods for determining the hyperparameters from training data. However, in this section our aim is more simply to explore the effects of varying the hyperparameters on GP prediction. Consider the data shown by + signs in Figure 2.5(a). This was generated from a GP with the SE kernel with $(\ell, \sigma_f, \sigma_n) = (1, 1, 0.1)$. The figure also shows the 2 standard-deviation error bars for the predictions obtained using these values of the hyperparameters, as per eq. (2.24). Notice how the error bars get larger for input values that are distant from any training points. Indeed if the x-axis




11 We refer to the parameters of the covariance function as hyperparameters to emphasize that they are parameters of a non-parametric model; in accordance with the weight-space view, section 2.1, the parameters (weights) of the underlying parametric model have been integrated out.





C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. (C) 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml were extended one would see the error bars reflect the prior standard deviation of the process σf away from the data.
C. E. Rasmussen 和 C. K. I. Williams 所著的《机器学习中的高斯过程》(麻省理工学院出版社,2006 年,ISBN 026218253X)。(C) 2006 麻省理工学院。www.GaussianProcess.org/gpml 若扩展观察范围,会看到误差线反映出过程 σf 在数据之外区域的先验标准差。

If we set the length-scale shorter so that $\ell = 0.3$ and kept the other parameters the same, then generating from this process we would expect to see plots like those in Figure 2.5(a) except that the x-axis should be rescaled by a factor of 0.3; equivalently if the same x-axis was kept as in Figure 2.5(a) then a sample function would look much more wiggly.

If we make predictions with a process with $\ell = 0.3$ on the data generated from the $\ell = 1$ process then we obtain the result in Figure 2.5(b). The remaining two parameters were set by optimizing the marginal likelihood, as explained in chapter 5. In this case the noise parameter is reduced to $\sigma_n = 0.00005$ as the greater flexibility of the "signal" means that the noise level can be reduced. This can be observed at the two datapoints near $x = 2.5$ in the plots. In Figure 2.5(a) ($\ell = 1$) these are essentially explained as a similar function value with differing noise. However, in Figure 2.5(b) ($\ell = 0.3$) the noise level is very low, so these two points have to be explained by a sharp variation in the value of the underlying function $f$. Notice also that the short length-scale means that the error bars in Figure 2.5(b) grow rapidly away from the datapoints.

In contrast, we can set the length-scale longer, for example to $\ell = 3$, as shown in Figure 2.5(c). Again the remaining two parameters were set by optimizing the marginal likelihood. In this case the noise level has been increased to $\sigma_n = 0.89$ and we see that the data is now explained by a slowly varying function with a lot of noise.

Of course we can take the position of a quickly-varying signal with low noise, or a slowly-varying signal with high noise to extremes; the former would give rise to a white-noise process model for the signal, while the latter would give rise to a constant signal with added white noise. Under both these models the datapoints produced should look like white noise. However, studying Figure 2.5(a) we see that white noise is not a convincing model of the data, as the sequence of $y$'s does not alternate sufficiently quickly but has correlations due to the variability of the underlying function. Of course this is relatively easy to see in one dimension, but methods such as the marginal likelihood discussed in chapter 5 generalize to higher dimensions and allow us to score the various models. In this case the marginal likelihood gives a clear preference for $(\ell, \sigma_f, \sigma_n) = (1, 1, 0.1)$ over the other two alternatives.

2.4 Decision Theory for Regression


In the previous sections we have shown how to compute predictive distributions for the outputs $y_*$ corresponding to the novel test input $\mathbf{x}_*$. The predictive distribution is Gaussian with mean and variance given by eq. (2.25) and eq. (2.26). In practical applications, however, we are often forced to make a decision about how to act, i.e. we need a point-like prediction which is optimal in some sense. To this end we need a loss function, $\mathcal{L}(y_{\text{true}}, y_{\text{guess}})$, which specifies the loss (or



too short length-scale

too long length-scale
model comparison

optimal predictions
loss function




C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. (c) 2006 Massachusetts Institute of Technology. www. GaussianProcess.org/gpml penalty) incurred by guessing the value yguess  when the true value is ytrue  . For example, the loss function could equal the absolute deviation between the guess and the truth.
C. E. Rasmussen 和 C. K. I. Williams 所著《机器学习中的高斯过程》,麻省理工学院出版社,2006 年,ISBN 026218253X。©2006 麻省理工学院。www.GaussianProcess.org/gpml 惩罚)当真实值为 ytrue  时猜测值为 yguess  所产生的损失。例如,损失函数可以等于猜测值与真实值之间的绝对偏差。

Notice that we computed the predictive distribution without reference to the loss function. In non-Bayesian paradigms, the model is typically trained by minimizing the empirical risk (or loss). In contrast, in the Bayesian setting there is a clear separation between the likelihood function (used for training, in addition to the prior) and the loss function. The likelihood function describes how the noisy measurements are assumed to deviate from the underlying noise-free function. The loss function, on the other hand, captures the consequences of making a specific choice, given an actual true state. The likelihood and loss function need not have anything in common.$^{12}$



non-Bayesian paradigm
Bayesian paradigm



Our goal is to make the point prediction $y_{\text{guess}}$ which incurs the smallest loss, but how can we achieve that when we don't know $y_{\text{true}}$? Instead, we minimize the expected loss or risk, by averaging w.r.t. our model's opinion as to what the truth might be

(2.32) $\quad \tilde{R}_{\mathcal{L}}(y_{\text{guess}}\,|\,\mathbf{x}_*) = \int \mathcal{L}(y_*, y_{\text{guess}})\, p(y_*\,|\,\mathbf{x}_*, \mathcal{D})\, dy_*.$



expected loss, risk



Thus our best guess, in the sense that it minimizes the expected loss, is

(2.33) $\quad y_{\text{optimal}}\,|\,\mathbf{x}_* = \operatorname*{argmin}_{y_{\text{guess}}} \tilde{R}_{\mathcal{L}}(y_{\text{guess}}\,|\,\mathbf{x}_*).$



absolute error loss
squared error loss



In general the value of $y_{\text{guess}}$ that minimizes the risk for the loss function $|y_{\text{guess}} - y_*|$ is the median of $p(y_*\,|\,\mathbf{x}_*, \mathcal{D})$, while for the squared loss $(y_{\text{guess}} - y_*)^2$ it is the mean of this distribution. When the predictive distribution is Gaussian the mean and the median coincide, and indeed for any symmetric loss function and symmetric predictive distribution we always get $y_{\text{guess}}$ as the mean of the predictive distribution. However, in many practical problems the loss functions can be asymmetric, e.g. in safety critical applications, and point predictions may be computed directly from eq. (2.32) and eq. (2.33). A comprehensive treatment of decision theory can be found in Berger [1985].
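For a Gaussian predictive distribution the risk of eq. (2.32) can be approximated on a grid of outcomes and minimized by brute force over candidate guesses, as in the sketch below; the asymmetric loss used here is purely illustrative.

```python
import numpy as np

def optimal_point_prediction(mean, std, loss, n_grid=2001):
    """Minimize the expected loss, eqs. (2.32)-(2.33), for a Gaussian
    predictive distribution N(mean, std^2) by searching over a grid."""
    ys = np.linspace(mean - 6 * std, mean + 6 * std, n_grid)   # plausible y_* values
    dy = ys[1] - ys[0]
    p = np.exp(-0.5 * ((ys - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
    risks = [np.sum(loss(ys, g) * p) * dy for g in ys]          # eq. (2.32) on the grid
    return ys[int(np.argmin(risks))]                            # eq. (2.33)

squared = lambda y, g: (g - y) ** 2
# asymmetric loss: over-predicting costs three times as much as under-predicting
asym = lambda y, g: np.where(g > y, 3.0 * (g - y), y - g)

print(optimal_point_prediction(0.0, 1.0, squared))  # close to the mean, 0
print(optimal_point_prediction(0.0, 1.0, asym))     # shifted below the mean
```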

2.5 An Example Application


In this section we use Gaussian process regression to learn the inverse dynamics of a seven degrees-of-freedom SARCOS anthropomorphic robot arm. The task is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. This task has previously been used to study regression algorithms by Vijayakumar and Schaal [2000], Vijayakumar et al. [2002] and Vijayakumar et al. [2005].$^{13}$ Following



robot arm






12 Beware of fallacious arguments like: a Gaussian likelihood implies a squared error loss function.

13 We thank Sethu Vijayakumar for providing us with the data.





C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. (c) 2006 Massachusetts Institute of Technology. www. GaussianProcess.org/gpml
C. E. Rasmussen 与 C. K. I. Williams 合著,《机器学习中的高斯过程》,麻省理工学院出版社,2006 年,ISBN 026218253X。版权所有©2006 麻省理工学院。网址:www.GaussianProcess.org/gpml



this previous work we present results below on just one of the seven mappings, from the 21 input variables to the first of the seven torques.

One might ask why it is necessary to learn this mapping; indeed there exist physics-based rigid-body-dynamics models which allow us to obtain the torques from the position, velocity and acceleration variables. However, the real robot arm is actuated hydraulically and is rather lightweight and compliant, so the assumptions of the rigid-body-dynamics model are violated (as we see below). It is worth noting that the rigid-body-dynamics model is nonlinear, involving trigonometric functions and squares of the input variables.



why learning?



An inverse dynamics model can be used in the following manner: a planning module decides on a trajectory that takes the robot from its start to goal states, and this specifies the desired positions, velocities and accelerations at each time. The inverse dynamics model is used to compute the torques needed to achieve this trajectory and errors are corrected using a feedback controller.

The dataset consists of 48,933 input-output pairs, of which 44,484 were used as a training set and the remaining 4,449 were used as a test set. The inputs were linearly rescaled to have zero mean and unit variance on the training set. The outputs were centered so as to have zero mean on the training set.

Given a prediction method, we can evaluate the quality of predictions in several ways. Perhaps the simplest is the squared error loss, where we compute the squared residual $(y_* - \bar f(x_*))^2$ between the mean prediction and the target at each test point. This can be summarized by the mean squared error (MSE) by averaging over the test set. However, this quantity is sensitive to the overall scale of the target values, so it makes sense to normalize by the variance of the targets of the test cases to obtain the standardized mean squared error (SMSE). This causes the trivial method of guessing the mean of the training targets to have a SMSE of approximately 1.

Additionally if we produce a predictive distribution at each test input we can evaluate the negative log probability of the target under the model. 14 As GPR produces a Gaussian predictive density, one obtains

(2.34)  $-\log p(y_*\,|\,\mathcal{D}, x_*) = \tfrac{1}{2}\log(2\pi\sigma_*^2) + \dfrac{(y_* - \bar f(x_*))^2}{2\sigma_*^2},$

where the predictive variance $\sigma_*^2$ for GPR is computed as $\sigma_*^2 = \mathbb{V}[f_*] + \sigma_n^2$, where $\mathbb{V}[f_*]$ is given by eq. (2.26); we must include the noise variance $\sigma_n^2$ as we are predicting the noisy target $y_*$. This loss can be standardized by subtracting the loss that would be obtained under the trivial model which predicts using a Gaussian with the mean and variance of the training data. We denote this the standardized log loss (SLL). The mean SLL is denoted MSLL. Thus the MSLL will be approximately zero for simple methods and negative for better methods.
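As a small sketch of how these two scores might be computed (the array names below are placeholders, not from the text; the predictive variances are assumed to already include the noise variance $\sigma_n^2$):

    import numpy as np

    def smse(y_test, f_mean):
        # Standardized mean squared error: MSE normalized by the variance of the test targets.
        return np.mean((y_test - f_mean) ** 2) / np.var(y_test)

    def msll(y_test, f_mean, f_var, y_train):
        # Mean standardized log loss: the negative log density of eq. (2.34) minus the loss of
        # the trivial Gaussian model fitted to the training targets; negative values are better.
        # f_var must already include the noise variance sigma_n^2, since we predict noisy targets.
        nlp = 0.5 * np.log(2 * np.pi * f_var) + (y_test - f_mean) ** 2 / (2 * f_var)
        mu0, var0 = np.mean(y_train), np.var(y_train)
        nlp0 = 0.5 * np.log(2 * np.pi * var0) + (y_test - mu0) ** 2 / (2 * var0)
        return np.mean(nlp - nlp0)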

A number of models were tested on the data. A linear regression (LR) model provides a simple baseline for the SMSE.




14 It makes sense to use the negative log probability so as to obtain a loss, not a utility.





By estimating the noise level from the residuals on the training set one can also obtain a predictive variance and thus get a MSLL value for LR. The rigid-body-dynamics (RBD) model has a number of free parameters; these were estimated by Vijayakumar et al. [2005] using a least-squares fitting procedure. We also give results for the locally weighted projection regression (LWPR) method of Vijayakumar et al. [2005] which is an on-line method that cycles through the dataset multiple times. For the GP models it is computationally expensive to make use of all 44,484 training cases due to the $\mathcal{O}(n^3)$ scaling of the basic algorithm. In chapter 8 we present several different approximate GP methods for large datasets. The result given in Table 2.1 was obtained with the subset of regressors (SR) approximation with a subset size of 4096. This result is taken from Table 8.1, which gives full results of the various approximation methods applied to the inverse dynamics problem. The squared exponential covariance function was used with a separate length-scale parameter for each of the 21 input dimensions, plus the signal and noise variance parameters $\sigma_f^2$ and $\sigma_n^2$. These parameters were set by optimizing the marginal likelihood eq. (2.30) on a subset of the data (see also chapter 5).


Method    SMSE     MSLL
LR        0.075    -1.29
RBD       0.104      -
LWPR      0.040      -
GPR       0.011    -2.25

Table 2.1: Test results on the inverse dynamics problem for a number of different methods. The "-" denotes a missing entry, caused by two methods not producing full predictive distributions, so MSLL could not be evaluated.


The results for the various methods are presented in Table 2.1. Notice that the problem is quite non-linear, so the linear regression model does poorly in comparison to non-linear methods. 15 The non-linear method LWPR improves over linear regression, but is outperformed by GPR.

2.6 Smoothing, Weight Functions and Equivalent Kernels


Gaussian process regression aims to reconstruct the underlying signal $f$ by removing the contaminating noise $\varepsilon$. To do this it computes a weighted average of the noisy observations $\mathbf{y}$ as $\bar f(x_*) = \mathbf{k}(x_*)^\top (K + \sigma_n^2 I)^{-1}\mathbf{y}$; as $\bar f(x_*)$ is a linear combination of the $y$ values, Gaussian process regression is a linear smoother (see Hastie and Tibshirani [1990, sec. 2.8] for further details). In this section we study smoothing first in terms of a matrix analysis of the predictions at the training points, and then in terms of the equivalent kernel.









15 It is perhaps surprising that RBD does worse than linear regression. However, Stefan Schaal (pers. comm., 2004) states that the RBD parameters were optimized on a very large dataset, of which the training data used here is a subset, and if the RBD model were optimized w.r.t. this training set one might well expect it to outperform linear regression.







The predicted mean values $\bar{\mathbf{f}}$ at the training points are given by

(2.35)  $\bar{\mathbf{f}} = K(K + \sigma_n^2 I)^{-1}\mathbf{y}.$

Let $K$ have the eigendecomposition $K = \sum_{i=1}^n \lambda_i \mathbf{u}_i \mathbf{u}_i^\top$, where $\lambda_i$ is the $i$th eigenvalue and $\mathbf{u}_i$ is the corresponding eigenvector. As $K$ is real and symmetric positive semidefinite, its eigenvalues are real and non-negative, and its eigenvectors are mutually orthogonal. Let $\mathbf{y} = \sum_{i=1}^n \gamma_i \mathbf{u}_i$ for some coefficients $\gamma_i = \mathbf{u}_i^\top \mathbf{y}$. Then

(2.36)  $\bar{\mathbf{f}} = \sum_{i=1}^n \dfrac{\gamma_i \lambda_i}{\lambda_i + \sigma_n^2}\,\mathbf{u}_i.$






Notice that if $\lambda_i/(\lambda_i + \sigma_n^2) \ll 1$ then the component in $\mathbf{y}$ along $\mathbf{u}_i$ is effectively eliminated. For most covariance functions that are used in practice the eigenvalues are larger for more slowly varying eigenvectors (e.g. fewer zero-crossings), so this means that high-frequency components in $\mathbf{y}$ are smoothed out. The effective number of parameters or degrees of freedom of the smoother is defined as $\mathrm{tr}\big(K(K + \sigma_n^2 I)^{-1}\big) = \sum_{i=1}^n \lambda_i/(\lambda_i + \sigma_n^2)$, see Hastie and Tibshirani [1990, sec. 3.5]. Notice that this counts the number of eigenvectors which are not eliminated.
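The shrinkage in eq. (2.36) and the degrees-of-freedom count are easy to verify numerically; the sketch below is a toy setup with a squared exponential covariance, where the inputs, length-scale and noise level are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n, ell, sigma_n2 = 50, 0.1, 0.1
    X = np.sort(rng.uniform(0.0, 1.0, n))
    y = np.sin(2 * np.pi * X) + np.sqrt(sigma_n2) * rng.standard_normal(n)

    # Squared exponential covariance matrix on the training inputs.
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell ** 2)

    # Direct smoother, eq. (2.35): f_bar = K (K + sigma_n^2 I)^{-1} y.
    f_bar = K @ np.linalg.solve(K + sigma_n2 * np.eye(n), y)

    # Eigen-expansion, eq. (2.36): the component of y along u_i is shrunk by lambda_i / (lambda_i + sigma_n^2).
    lam, U = np.linalg.eigh(K)
    gamma = U.T @ y
    f_eig = U @ (gamma * lam / (lam + sigma_n2))
    print("max difference between eq. (2.35) and eq. (2.36):", np.max(np.abs(f_bar - f_eig)))

    # Effective degrees of freedom of the smoother: tr(K (K + sigma_n^2 I)^{-1}).
    print("effective degrees of freedom:", np.sum(lam / (lam + sigma_n2)))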






We can define a vector of functions $\mathbf{h}(x_*) = (K + \sigma_n^2 I)^{-1}\mathbf{k}(x_*)$. Thus we have $\bar f(x_*) = \mathbf{h}(x_*)^\top \mathbf{y}$, making it clear that the mean prediction at a point $x_*$ is a linear combination of the target values $\mathbf{y}$. For a fixed test point $x_*$, $\mathbf{h}(x_*)$ gives the vector of weights applied to the targets $\mathbf{y}$; $\mathbf{h}(x_*)$ is called the weight function [Silverman, 1984]. As Gaussian process regression is a linear smoother, the weight function does not depend on $\mathbf{y}$. Note the difference between a linear model, where the prediction is a linear combination of the inputs, and a linear smoother, where the prediction is a linear combination of the training set targets.
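A quick self-contained check of this identity (toy data and a squared exponential kernel; all numbers are invented) forms $\mathbf{h}(x_*)$ explicitly and confirms that $\mathbf{h}(x_*)^\top\mathbf{y}$ matches the usual predictive mean:

    import numpy as np

    rng = np.random.default_rng(1)
    n, ell, sigma_n2 = 30, 0.15, 0.05
    X = rng.uniform(0.0, 1.0, n)
    y = np.cos(3 * X) + np.sqrt(sigma_n2) * rng.standard_normal(n)

    def se_kernel(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

    K = se_kernel(X, X)
    k_star = se_kernel(X, np.array([0.5]))[:, 0]                  # k(x*) for x* = 0.5

    h = np.linalg.solve(K + sigma_n2 * np.eye(n), k_star)         # weight function h(x*)
    f_bar = h @ y                                                 # prediction as a weighted sum of targets
    f_bar_direct = k_star @ np.linalg.solve(K + sigma_n2 * np.eye(n), y)
    print(f_bar, f_bar_direct)                                    # identical up to round-off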

Understanding the form of the weight function is made complicated by the matrix inversion of $K + \sigma_n^2 I$ and the fact that $K$ depends on the specific locations of the $n$ datapoints. Idealizing the situation one can consider the observations to be "smeared out" in $x$-space at some density of observations. In this case analytic tools can be brought to bear on the problem, as shown in section 7.1. By analogy to kernel smoothing, Silverman [1984] called the idealized weight function the equivalent kernel; see also Girosi et al. [1995, sec. 2.1].






A kernel smoother centres a kernel function 16 $\kappa$ on $x_*$ and then computes $\kappa_i = \kappa(|x_i - x_*|/\ell)$ for each data point $(x_i, y_i)$, where $\ell$ is a length-scale. The Gaussian is a commonly used kernel function. The prediction for $f(x_*)$ is computed as $\hat f(x_*) = \sum_{i=1}^n w_i y_i$ where $w_i = \kappa_i / \sum_{j=1}^n \kappa_j$. This is also known as the Nadaraya-Watson estimator, see e.g. Scott [1992, sec. 8.1].
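For comparison with the GP weight function, here is a minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the data and the length-scale are arbitrary choices for illustration):

    import numpy as np

    def nadaraya_watson(x_star, X, y, ell):
        # Kernel smoother prediction: f_hat(x*) = sum_i w_i y_i with w_i = kappa_i / sum_j kappa_j.
        kappa = np.exp(-0.5 * ((X - x_star) / ell) ** 2)   # Gaussian kernel kappa(|x_i - x*| / ell)
        w = kappa / kappa.sum()
        return w @ y

    rng = np.random.default_rng(2)
    X = np.sort(rng.uniform(0.0, 1.0, 50))
    y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(50)
    print(nadaraya_watson(0.5, X, y, ell=0.05))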

The weight function and equivalent kernel for a Gaussian process are illustrated in Figure 2.6 for a one-dimensional input variable $x$. We have used the squared exponential covariance function and have set the length-scale $\ell = 0.0632$ (so that $\ell^2 = 0.004$). There are $n = 50$ training points spaced randomly along the $x$-axis. Figures 2.6(a) and 2.6(b) show the weight function and equivalent kernel for $x_* = 0.5$ and $x_* = 0.05$ respectively, for $\sigma_n^2 = 0.1$. Figure 2.6(c) is also for $x_* = 0.5$ but uses $\sigma_n^2 = 10$. In each case the dots correspond to the weight function $\mathbf{h}(x_*)$ and the solid line is the equivalent kernel, whose construction is explained below. The dashed line shows a squared exponential kernel centered on the test point, scaled to have the same height as the maximum value in the equivalent kernel. Figure 2.6(d) shows the variation in the equivalent kernel as a function of $n$, the number of datapoints in the unit interval.




16 Note that this kernel function does not need to be a valid covariance function.








Figure 2.6: Panels (a)-(c) show the weight function $\mathbf{h}(x_*)$ (dots) corresponding to the $n = 50$ training points, the equivalent kernel (solid) and the original squared exponential kernel (dashed). Panel (d) shows the equivalent kernels for two different data densities. See text for further details. The small cross at the test point is to scale in all four plots.


Many interesting observations can be made from these plots. Observe that the equivalent kernel has (in general) a shape quite different to the original SE kernel. In Figure 2.6(a) the equivalent kernel is clearly oscillatory (with negative sidelobes) and has a higher spatial frequency than the original kernel. Figure 2.6(b) shows similar behaviour although due to edge effects the equivalent kernel is truncated relative to that in Figure 2.6(a). In Figure 2.6(c) we see that at higher noise levels the negative sidelobes are reduced and the width of the equivalent kernel is similar to the original kernel. Also note that the overall height of the equivalent kernel in (c) is reduced compared to that in (a) and (b) - it averages over a wider area.




The more oscillatory equivalent kernel for lower noise levels can be understood in terms of the eigenanalysis above; at higher noise levels only the large $\lambda_i$ (slowly varying) components of $\mathbf{y}$ remain, while for smaller noise levels the more oscillatory components are also retained.

In Figure 2.6(d) we have plotted the equivalent kernel for $n = 10$ and $n = 250$ datapoints in $[0, 1]$; notice how the width of the equivalent kernel decreases as $n$ increases. We discuss this behaviour further in section 7.1.

The plots of equivalent kernels in Figure 2.6 were made by using a dense grid of $n_{\text{grid}}$ points on $[0, 1]$ and then computing the smoother matrix $K(K + \sigma_{\text{grid}}^2 I)^{-1}$. Each row of this matrix is the equivalent kernel at the appropriate location. However, in order to get the scaling right one has to set $\sigma_{\text{grid}}^2 = \sigma_n^2\, n_{\text{grid}}/n$; for $n_{\text{grid}} > n$ this means that the effective variance at each of the $n_{\text{grid}}$ points is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation if there were $n_{\text{grid}}/n$ independent Gaussian observations with variance $\sigma_{\text{grid}}^2$ at a single $x$-position; this would be equivalent to one Gaussian observation with variance $\sigma_n^2$. In effect the $n$ observations have been smoothed out uniformly along the interval. The form of the equivalent kernel can be obtained analytically if we go to the continuum limit and look to smooth a noisy function. The relevant theory and some example equivalent kernels are given in section 7.1.
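The grid construction just described is only a few lines of code; the sketch below (the grid size is an arbitrary choice, while the length-scale and noise variance follow the Figure 2.6(a) setting) forms the smoother matrix with the rescaled noise variance and reads off the equivalent kernel as one of its rows.

    import numpy as np

    n, n_grid = 50, 500              # number of (imagined) datapoints and grid points on [0, 1]
    ell, sigma_n2 = 0.0632, 0.1      # length-scale and noise level as in Figure 2.6(a)

    x = np.linspace(0.0, 1.0, n_grid)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)

    # Rescaled noise variance, so that n observations are smeared out over the n_grid grid points.
    sigma_grid2 = sigma_n2 * n_grid / n

    # Smoother matrix; its i-th row is the equivalent kernel centred at x[i].
    S = K @ np.linalg.inv(K + sigma_grid2 * np.eye(n_grid))
    eq_kernel = S[n_grid // 2]       # equivalent kernel at x* = 0.5, cf. Figure 2.6(a)
    print("equivalent kernel weights sum to", eq_kernel.sum())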

2.7 Incorporating Explicit Basis Functions


It is common but by no means necessary to consider GPs with a zero mean function. Note that this is not necessarily a drastic limitation, since the mean of the posterior process is not confined to be zero. Yet there are several reasons why one might wish to explicitly model a mean function, including interpretability of the model, convenience of expressing prior information and a number of analytical limits which we will need in subsequent chapters. The use of explicit basis functions is a way to specify a non-zero mean over functions, but as we will see in this section, one can also use them to achieve other interesting effects.

Using a fixed (deterministic) mean function m(x) is trivial: Simply apply the usual zero mean GP to the difference between the observations and the fixed mean function. With

(2.37)  $f(x) \sim \mathcal{GP}\big(m(x),\ k(x, x')\big),$






the predictive mean becomes

(2.38)  $\bar{\mathbf{f}}_* = \mathbf{m}(X_*) + K(X_*, X)\, K_y^{-1}\big(\mathbf{y} - \mathbf{m}(X)\big),$

where $K_y = K + \sigma_n^2 I$, and the predictive variance remains unchanged from eq. (2.24).
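A minimal sketch of eq. (2.38) follows; the mean function, kernel and data below are invented for illustration.

    import numpy as np

    def se_kernel(a, b, ell=0.2):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

    def predict_with_fixed_mean(X, y, X_star, m, sigma_n2):
        # Predictive mean of eq. (2.38): m(X*) + K(X*, X) K_y^{-1} (y - m(X)).
        Ky = se_kernel(X, X) + sigma_n2 * np.eye(len(X))
        return m(X_star) + se_kernel(X_star, X) @ np.linalg.solve(Ky, y - m(X))

    m = lambda x: 1.0 + 2.0 * x                     # an invented linear trend as the fixed mean
    rng = np.random.default_rng(3)
    X = np.sort(rng.uniform(0.0, 1.0, 40))
    y = m(X) + 0.3 * np.sin(8 * X) + 0.05 * rng.standard_normal(40)
    print(predict_with_fixed_mean(X, y, np.linspace(0.0, 1.0, 5), m, sigma_n2=0.01))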

However, in practice it can often be difficult to specify a fixed mean function. In many cases it may be more convenient to specify a few fixed basis functions whose coefficients, $\boldsymbol\beta$, are to be inferred from the data. Consider


(2.39)  $g(x) = f(x) + \mathbf{h}(x)^\top \boldsymbol\beta, \quad \text{where } f(x) \sim \mathcal{GP}\big(0, k(x, x')\big),$

here $f(x)$ is a zero mean GP, $\mathbf{h}(x)$ are a set of fixed basis functions, and $\boldsymbol\beta$ are additional parameters. This formulation expresses that the data is close to a global linear model with the residuals being modelled by a GP. This idea was explored explicitly as early as 1975 by Blight and Ott [1975], who used the GP to model the residuals from a polynomial regression, i.e. $\mathbf{h}(x) = (1, x, x^2, \ldots)^\top$. When fitting the model, one could optimize over the parameters $\boldsymbol\beta$ jointly with the hyperparameters of the covariance function. Alternatively, if we take the prior on $\boldsymbol\beta$ to be Gaussian, $\boldsymbol\beta \sim \mathcal{N}(\mathbf{b}, B)$, we can also integrate out these parameters. Following O'Hagan [1978] we obtain another GP

(2.40)  $g(x) \sim \mathcal{GP}\big(\mathbf{h}(x)^\top \mathbf{b},\ k(x, x') + \mathbf{h}(x)^\top B\, \mathbf{h}(x')\big),$

now with an added contribution in the covariance function caused by the uncertainty in the parameters of the mean. Predictions are made by plugging the mean and covariance functions of $g(x)$ into eq. (2.39) and eq. (2.24). After rearranging, we obtain

(2.41)  $\bar{\mathbf{g}}(X_*) = H_*^\top \bar{\boldsymbol\beta} + K_*^\top K_y^{-1}\big(\mathbf{y} - H^\top \bar{\boldsymbol\beta}\big) = \bar{\mathbf{f}}(X_*) + R^\top \bar{\boldsymbol\beta},$

$\mathrm{cov}(\mathbf{g}_*) = \mathrm{cov}(\mathbf{f}_*) + R^\top \big(B^{-1} + H K_y^{-1} H^\top\big)^{-1} R,$

where the $H$ matrix collects the $\mathbf{h}(x)$ vectors for all training (and $H_*$ all test) cases, $\bar{\boldsymbol\beta} = (B^{-1} + H K_y^{-1} H^\top)^{-1}(H K_y^{-1}\mathbf{y} + B^{-1}\mathbf{b})$, and $R = H_* - H K_y^{-1} K_*$. Notice the nice interpretation of the mean expression, eq. (2.41) top line: $\bar{\boldsymbol\beta}$ is the mean of the global linear model parameters, being a compromise between the data term and prior, and the predictive mean is simply the mean linear output plus what the GP model predicts from the residuals. The covariance is the sum of the usual covariance term and a new non-negative contribution.

Exploring the limit of the above expressions as the prior on the $\boldsymbol\beta$ parameter becomes vague, $B^{-1} \to O$ (where $O$ is the matrix of zeros), we obtain a predictive distribution which is independent of $\mathbf{b}$

(2.42)  $\bar{\mathbf{g}}(X_*) = \bar{\mathbf{f}}(X_*) + R^\top \bar{\boldsymbol\beta},$

$\mathrm{cov}(\mathbf{g}_*) = \mathrm{cov}(\mathbf{f}_*) + R^\top \big(H K_y^{-1} H^\top\big)^{-1} R,$

where the limiting $\bar{\boldsymbol\beta} = (H K_y^{-1} H^\top)^{-1} H K_y^{-1}\mathbf{y}$. Notice that predictions under the limit $B^{-1} \to O$ should not be implemented naïvely by plugging the modified covariance function from eq. (2.40) into the standard prediction equations, since the entries of the covariance function tend to infinity, thus making it unsuitable for numerical implementation. Instead eq. (2.42) must be used. Even if the non-limiting case is of interest, eq. (2.41) is numerically preferable to a direct implementation based on eq. (2.40), since the global linear part will often add some very large eigenvalues to the covariance matrix, affecting its condition number.
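The following sketch implements eq. (2.41) for a global linear basis $\mathbf{h}(x) = (1, x)^\top$; the data, kernel and prior $\mathcal{N}(\mathbf{b}, B)$ are invented for illustration. The vague-prior limit of eq. (2.42) would be obtained by replacing $B^{-1} + H K_y^{-1} H^\top$ with $H K_y^{-1} H^\top$ and dropping the $B^{-1}\mathbf{b}$ term.

    import numpy as np

    def se_kernel(a, b, ell=0.2):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

    def basis(x):
        # h(x) = (1, x): a global linear model as the explicit part; column i of H is h(x_i).
        return np.vstack([np.ones_like(x), x])

    rng = np.random.default_rng(4)
    n, sigma_n2 = 40, 0.01
    X = np.sort(rng.uniform(0.0, 1.0, n))
    y = 0.5 + 1.5 * X + 0.2 * np.sin(10 * X) + 0.1 * rng.standard_normal(n)
    X_star = np.linspace(0.0, 1.0, 5)

    K, K_star, K_ss = se_kernel(X, X), se_kernel(X, X_star), se_kernel(X_star, X_star)
    Ky = K + sigma_n2 * np.eye(n)
    H, H_star = basis(X), basis(X_star)
    b, B = np.zeros(2), 100.0 * np.eye(2)            # Gaussian prior N(b, B) on beta (broad, illustrative)

    Kyinv_y = np.linalg.solve(Ky, y)
    Kyinv_Ht = np.linalg.solve(Ky, H.T)              # K_y^{-1} H^T
    A = np.linalg.inv(B) + H @ Kyinv_Ht              # B^{-1} + H K_y^{-1} H^T
    beta_bar = np.linalg.solve(A, H @ Kyinv_y + np.linalg.solve(B, b))
    R = H_star - H @ np.linalg.solve(Ky, K_star)     # R = H* - H K_y^{-1} K*

    f_bar = K_star.T @ Kyinv_y                       # zero-mean GP prediction
    g_bar = f_bar + R.T @ beta_bar                   # eq. (2.41), mean
    cov_f = K_ss - K_star.T @ np.linalg.solve(Ky, K_star)
    cov_g = cov_f + R.T @ np.linalg.solve(A, R)      # eq. (2.41), covariance
    print(g_bar)
    print(np.diag(cov_g))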







2.7.1 Marginal Likelihood


In this short section we briefly discuss the marginal likelihood for the model with a Gaussian prior $\boldsymbol\beta \sim \mathcal{N}(\mathbf{b}, B)$ on the explicit parameters from eq. (2.40), as this will be useful later, particularly in section 6.3.1. We can express the marginal likelihood from eq. (2.30) as

(2.43)  $\log p(\mathbf{y}\,|\,X, \mathbf{b}, B) = -\tfrac{1}{2}\big(H^\top\mathbf{b} - \mathbf{y}\big)^\top \big(K_y + H^\top B H\big)^{-1}\big(H^\top\mathbf{b} - \mathbf{y}\big) - \tfrac{1}{2}\log\big|K_y + H^\top B H\big| - \tfrac{n}{2}\log 2\pi,$

where we have included the explicit mean. We are interested in exploring the limit where $B^{-1} \to O$, i.e. when the prior is vague. In this limit the mean of the prior is irrelevant (as was the case in eq. (2.42)), so without loss of generality (for the limiting case) we assume for now that the mean is zero, $\mathbf{b} = \mathbf{0}$, giving

(2.44)  $\log p(\mathbf{y}\,|\,X, \mathbf{b}=\mathbf{0}, B) = -\tfrac{1}{2}\mathbf{y}^\top K_y^{-1}\mathbf{y} + \tfrac{1}{2}\mathbf{y}^\top C\,\mathbf{y} - \tfrac{1}{2}\log|K_y| - \tfrac{1}{2}\log|B| - \tfrac{1}{2}\log|A| - \tfrac{n}{2}\log 2\pi,$

where $A = B^{-1} + H K_y^{-1} H^\top$ and $C = K_y^{-1} H^\top A^{-1} H K_y^{-1}$ and we have used the matrix inversion lemma, eq. (A.9) and eq. (A.10).

We now explore the behaviour of the log marginal likelihood in the limit of vague priors on $\boldsymbol\beta$. In this limit the variances of the Gaussian in the directions spanned by the columns of $H$ will become infinite, and it is clear that this will require special treatment. The log marginal likelihood consists of three terms: a quadratic form in $\mathbf{y}$, a log determinant term, and a term involving $\log 2\pi$. Performing an eigendecomposition of the covariance matrix we see that the contributions to the quadratic form term from the infinite-variance directions will be zero. However, the log determinant term will tend to minus infinity. The standard solution [Wahba, 1985, Ansley and Kohn, 1985] in this case is to project $\mathbf{y}$ onto the directions orthogonal to the span of $H$ and compute the marginal likelihood in this subspace. Let the rank of $H$ be $m$. Then as shown in Ansley and Kohn [1985] this means that we must discard the terms $-\tfrac{1}{2}\log|B| - \tfrac{m}{2}\log 2\pi$ from eq. (2.44) to give

(2.45)  $\log p(\mathbf{y}\,|\,X) = -\tfrac{1}{2}\mathbf{y}^\top K_y^{-1}\mathbf{y} + \tfrac{1}{2}\mathbf{y}^\top C\,\mathbf{y} - \tfrac{1}{2}\log|K_y| - \tfrac{1}{2}\log|A| - \tfrac{n-m}{2}\log 2\pi,$

where $A = H K_y^{-1} H^\top$ and $C = K_y^{-1} H^\top A^{-1} H K_y^{-1}$.
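Eq. (2.45) translates directly into code; a minimal sketch follows (the function and argument names are placeholders), where $K_y = K + \sigma_n^2 I$ and row $i$ of $H$ holds the $i$th basis function evaluated at the training inputs.

    import numpy as np

    def log_marginal_likelihood_vague_prior(Ky, H, y):
        # Eq. (2.45): log marginal likelihood with the explicit-basis coefficients integrated out
        # under a vague prior.  Ky = K + sigma_n^2 I; H has one row per basis function.
        n, m = Ky.shape[0], H.shape[0]
        Kyinv_y = np.linalg.solve(Ky, y)
        A = H @ np.linalg.solve(Ky, H.T)                           # A = H K_y^{-1} H^T
        yCy = (H @ Kyinv_y) @ np.linalg.solve(A, H @ Kyinv_y)      # y^T C y, C = K_y^{-1} H^T A^{-1} H K_y^{-1}
        _, logdet_Ky = np.linalg.slogdet(Ky)
        _, logdet_A = np.linalg.slogdet(A)
        return (-0.5 * y @ Kyinv_y + 0.5 * yCy
                - 0.5 * logdet_Ky - 0.5 * logdet_A - 0.5 * (n - m) * np.log(2 * np.pi))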

2.8 History and Related Work


Prediction with Gaussian processes is certainly not a very recent topic, especially for time series analysis; the basic theory goes back at least as far as the work of Wiener [1949] and Kolmogorov [1941] in the 1940's. Indeed Lauritzen [1981] discusses relevant work by the Danish astronomer T. N. Thiele dating from 1880.









Gaussian process prediction is also well known in the geostatistics field (see, e.g. Matheron, 1973; Journel and Huijbregts, 1978) where it is known as kriging, 17 and in meteorology [Thompson, 1956, Daley, 1991] although this literature naturally has focussed mostly on two- and three-dimensional input spaces. Whittle [1963, sec. 5.4] also suggests the use of such methods for spatial prediction. Ripley [1981] and Cressie [1993] provide useful overviews of Gaussian process prediction in spatial statistics.

Gradually it was realized that Gaussian process prediction could be used in a general regression context. For example O'Hagan [1978] presents the general theory as given in our equations 2.23 and 2.24, and applies it to a number of one-dimensional regression problems. Sacks et al. [1989] describe GPR in the context of computer experiments (where the observations y are noise free) and discuss a number of interesting directions such as the optimization of parameters in the covariance function (see our chapter 5 ) and experimental design (i.e. the choice of x -points that provide most information on f ). The authors describe a number of computer simulations that were modelled, including an example where the response variable was the clock asynchronization in a circuit and the inputs were six transistor widths. Santner et al. [2003] is a recent book on the use of GPs for the design and analysis of computer experiments.

Williams and Rasmussen [1996] described Gaussian process regression in a machine learning context, and described optimization of the parameters in the covariance function, see also Rasmussen [1996]. They were inspired to use Gaussian process by the connection to infinite neural networks as described in section 4.2.3 and in Neal [1996]. The "kernelization" of linear ridge regression described above is also known as kernel ridge regression see e.g. Saunders et al. [1998].

Relationships between Gaussian process prediction and regularization theory, splines, support vector machines (SVMs) and relevance vector machines (RVMs) are discussed in chapter 6 .

2.9 Exercises


  1. Replicate the generation of random functions from Figure 2.2. Use a regular (or random) grid of scalar inputs and the covariance function from eq. (2.16). Hints on how to generate random samples from multi-variate Gaussian distributions are given in section A.2. Invent some training data points, and make random draws from the resulting GP posterior using eq. (2.19).

  2. In eq. (2.11) we saw that the predictive variance at $x_*$ under the feature space regression model was $\mathrm{var}(f(x_*)) = \boldsymbol\phi(x_*)^\top A^{-1}\boldsymbol\phi(x_*)$. Show that $\mathrm{cov}(f(x_*), f(x_*')) = \boldsymbol\phi(x_*)^\top A^{-1}\boldsymbol\phi(x_*')$. Check that this is compatible with the expression given in eq. (2.24).









17 Matheron named the method after the South African mining engineer D. G. Krige.





  3. The Wiener process is defined for $x \geq 0$ and has $f(0) = 0$. (See section B.2.1 for further details.) It has mean zero and a non-stationary covariance function $k(x, x') = \min(x, x')$. If we condition on the Wiener process passing through $f(1) = 0$ we obtain a process known as the Brownian bridge (or tied-down Wiener process). Show that this process has covariance $k(x, x') = \min(x, x') - x x'$ for $0 \leq x, x' \leq 1$ and mean $0$. Write a computer program to draw samples from this process at a finite grid of $x$ points in $[0, 1]$.

  4. Let $\mathrm{var}_n(f(x_*))$ be the predictive variance of a Gaussian process regression model at $x_*$ given a dataset of size $n$. The corresponding predictive variance using a dataset of only the first $n-1$ training points is denoted $\mathrm{var}_{n-1}(f(x_*))$. Show that $\mathrm{var}_n(f(x_*)) \leq \mathrm{var}_{n-1}(f(x_*))$, i.e. that the predictive variance at $x_*$ cannot increase as more training data is obtained. One way to approach this problem is to use the partitioned matrix equations given in section A.3 to decompose $\mathrm{var}_n(f(x_*)) = k(x_*, x_*) - \mathbf{k}(x_*)^\top (K + \sigma_n^2 I)^{-1}\mathbf{k}(x_*)$. An alternative information theoretic argument is given in Williams and Vivarelli [2000]. Note that while this conclusion is true for Gaussian process priors and Gaussian noise models it does not hold generally, see Barber and Saad [1996].