
Gaussian Processes for Big Data

James Hensman*

Dept. Computer Science

The University of Sheffield

Sheffield, UK

Nicolò Fusi*

Dept. Computer Science

The University of Sheffield

Sheffield, UK

Neil D. Lawrence*

Dept. Computer Science

The University of Sheffield

Sheffield, UK

Abstract


We introduce stochastic variational inference for Gaussian process models. This enables the application of Gaussian process (GP) models to data sets containing millions of data points. We show how GPs can be variationally decomposed to depend on a set of globally relevant inducing variables which factorize the model in the necessary manner to perform variational inference. Our approach is readily extended to models with non-Gaussian likelihoods and latent variable models based around Gaussian processes. We demonstrate the approach on a simple toy problem and two real world data sets.

1 Introduction


Gaussian processes [GPs, Rasmussen and Williams, 2006] are perhaps the dominant approach for inference on functions. They underpin a range of algorithms for regression, classification and unsupervised learning. Unfortunately, when applying a Gaussian process to a data set of size n, exact inference has complexity O(n^3) with storage demands of O(n^2). This hinders the application of these models in many domains: in particular, large spatiotemporal data sets, video, large social network data (e.g. from Facebook), population scale medical data sets, and models that correlate across multiple outputs or tasks (for which complexity is O(n^3 p^3) and storage is O(n^2 p^2), where p is the number of outputs or tasks). Collectively we can think of these applications as belonging to the domain of 'big data'.

Traditionally in Gaussian process modelling, a large data set is one that contains over a few thousand data points. Even to accommodate data sets of this size, various approximate techniques are required. One approach is to partition the data set into separate groups [e.g. Snelson and Ghahramani, 2007, Urtasun and Darrell, 2008]. An alternative is to build a low rank approximation to the covariance matrix based around 'inducing variables' [see e.g. Csató and Opper, 2002, Seeger et al., 2003, Quiñonero Candela and Rasmussen, 2005, Titsias, 2009]. These approaches lead to a computational complexity of O(nm^2) and storage demands of O(nm), where m is a user selected parameter governing the number of inducing variables. However, even these reduced storage demands are prohibitive for big data, where n can be many millions or billions. For parametric models, stochastic gradient descent is often applied to resolve this storage issue, but in the GP domain it hasn't been clear how this should be performed. In this paper we show how recent advances in variational inference [Hensman et al., 2012, Hoffman et al., 2012] can be combined with the idea of inducing variables to develop a practical algorithm for fitting GPs using stochastic variational inference (SVI).

2 Sparse GPs Revisited


We start with a succinct rederivation of the variational approach to inducing variables of Titsias [2009]. This allows us to introduce notation and derive the expressions needed to formulate an SVI algorithm.

Consider a data vector¹ y, where each entry y_i is a noisy observation of the function f(x_i), for all the points X = {x_i}_{i=1}^{n}. We consider the noise to be independent Gaussian with precision β. Introducing a Gaussian process prior over f(·), let the vector f contain values of the function at the points X. We shall also introduce a set of inducing variables: let the vector u contain values of the function f at the points Z = {z_i}_{i=1}^{m}, which live in the same space as X. Using standard Gaussian process methodologies, we can write

p(y | f) = N(y | f, β^{-1} I),

p(f | u) = N(f | K_{nm} K_{mm}^{-1} u, \tilde{K}),

p(u) = N(u | 0, K_{mm}),




*Also at Sheffield Institute for Translational Neuroscience, SITraN

1 Our derivation trivially extends to multiple independent output dimensions, but we omit them here for clarity.





where K_{mm} is the covariance function evaluated between all the inducing points, K_{nm} is the covariance function evaluated between all the training points and the inducing points, and we have defined \tilde{K} = K_{nn} - K_{nm} K_{mm}^{-1} K_{mn}.
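
To make the notation concrete, the following is a minimal numpy sketch (illustrative only, not the code used for the experiments below) that builds K_{mm}, K_{nm} and the diagonal of \tilde{K} for a one-dimensional toy problem, assuming a squared exponential kernel k(x, x') = sf2·exp(-|x - x'|²/(2ℓ²)); all variable names are our own.

    import numpy as np

    def rbf(A, B, sf2=1.0, ell=1.0):
        # squared exponential covariance k(a, b) = sf2 * exp(-|a - b|^2 / (2 ell^2))
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return sf2 * np.exp(-0.5 * d2 / ell**2)

    n, m, beta = 500, 20, 10.0                       # data size, inducing points, noise precision
    X = np.random.uniform(0.0, 10.0, (n, 1))         # training inputs
    Z = np.linspace(0.0, 10.0, m)[:, None]           # inducing inputs, living in the same space as X

    Kmm = rbf(Z, Z) + 1e-6 * np.eye(m)               # covariance among the inducing points (with jitter)
    Knm = rbf(X, Z)                                  # cross covariance between training and inducing points
    Knn_diag = np.full(n, 1.0)                       # diag of Knn; for this kernel it is simply sf2
    # diag of Ktilde = Knn - Knm Kmm^{-1} Kmn -- only the diagonal is needed in the bounds that follow
    Ktilde_diag = Knn_diag - np.sum((Knm @ np.linalg.inv(Kmm)) * Knm, axis=1)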

We first apply Jensen's inequality to the conditional probability p(y | u):

log p(y | u) = log ⟨p(y | f)⟩_{p(f | u)}

≥ ⟨log p(y | f)⟩_{p(f | u)} ≜ L_1.    (1)

where ⟨·⟩_{p(x)} denotes an expectation under p(x). For Gaussian noise, taking the expectation inside the log is tractable, but it results in an expression containing K_{nn}^{-1}, which has a computational complexity of O(n^3). Bringing the expectation outside the log gives a lower bound, L_1, which can be computed with complexity O(m^3). Further, when p(y | f) factorises across the data,

p(y | f) = ∏_{i=1}^{n} p(y_i | f_i),

then this lower bound can be shown to be separable across y giving

exp(L_1) = ∏_{i=1}^{n} N(y_i | μ_i, β^{-1}) exp(-½ β \tilde{k}_{i,i})    (2)

where μ = K_{nm} K_{mm}^{-1} u and \tilde{k}_{i,i} is the i-th diagonal element of \tilde{K}. Note that the difference between our bound and the original log likelihood is the Kullback-Leibler (KL) divergence between the posterior over the mapping function given only the inducing variables and the posterior of the mapping function given both the data and the inducing variables,
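
As a concrete check on (2), the toy sketch above can be extended to evaluate L_1 for a given value of the inducing variables u; μ and the per-point terms below are exactly those of (2) (again a sketch under the same toy assumptions):

    # continues the toy sketch above: L1 = sum_i [ log N(y_i | mu_i, beta^{-1}) - 0.5*beta*ktilde_ii ]
    f_true = np.sin(X[:, 0])                          # a toy latent function
    y = f_true + np.random.randn(n) / np.sqrt(beta)   # noisy observations with precision beta
    u = np.sin(Z[:, 0])                               # an arbitrary setting of the inducing variables

    Kmm_inv = np.linalg.inv(Kmm)
    mu = Knm @ Kmm_inv @ u                            # mu = Knm Kmm^{-1} u
    log_gauss = -0.5 * np.log(2.0 * np.pi / beta) - 0.5 * beta * (y - mu) ** 2
    L1 = np.sum(log_gauss - 0.5 * beta * Ktilde_diag)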

KL(p(f | u) ‖ p(f | u, y)).

This KL divergence is minimized when there are m = n inducing variables and they are placed at the training data locations. This means that u = f and K_{mm} = K_{nm} = K_{nn}, so that \tilde{K} = 0. In this case we recover exp(L_1) = p(y | f) and the bound becomes an equality, because p(f | u) is degenerate. However, since m = n there would be no computational or storage advantage from the representation. When m < n the bound can be maximised with respect to Z (which are variational parameters). This minimises the KL divergence and ensures that Z are distributed amongst the training data X such that all \tilde{k}_{i,i} are small. In practice this means that the expectations in (1) are only taken across a narrow domain (\tilde{k}_{i,i} is the marginal variance of p(f_i | u)), keeping Jensen's bound tight.
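
The tightness argument is easy to verify numerically: placing the inducing inputs on the training inputs drives every \tilde{k}_{i,i} to (numerically) zero, so the Jensen gap vanishes. A small check, continuing the toy sketch above:

    # sanity check: with Z = X (so m = n), the diagonal of Ktilde should vanish up to the jitter level
    Kff = rbf(X, X) + 1e-6 * np.eye(n)                # Kmm = Knn when the inducing points sit on the data
    Ktilde_full = 1.0 - np.sum((rbf(X, X) @ np.linalg.inv(Kff)) * rbf(X, X), axis=1)
    print(np.abs(Ktilde_full).max())                  # of the order of the 1e-6 jitter: the bound is tight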

Before deriving the expressions for stochastic variational inference using L_1, we recover the bound of Titsias [2009] by marginalising the inducing variables,

log p(y | X) = log ∫ p(y | u) p(u) du

≥ log ∫ exp{L_1} p(u) du ≜ L_2,    (3)

which with some linear algebraic manipulation leads to

L_2 = log N(y | 0, K_{nm} K_{mm}^{-1} K_{mn} + β^{-1} I) - ½ β tr(\tilde{K}),

matching the result of Titsias, with the implicit approximating distribution q(u) having precision

Λ = β K_{mm}^{-1} K_{mn} K_{nm} K_{mm}^{-1} + K_{mm}^{-1}

and mean

û = β Λ^{-1} K_{mm}^{-1} K_{mn} y.
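
For reference, the collapsed bound L_2 and the implicit optimal distribution q(u) = N(u | û, Λ^{-1}) can be written down directly. A sketch continuing the toy example above (a practical implementation would use Cholesky factorisations rather than explicit inverses):

    from scipy.stats import multivariate_normal

    # L2 = log N(y | 0, Knm Kmm^{-1} Kmn + beta^{-1} I) - 0.5 * beta * tr(Ktilde)
    Qnn = Knm @ Kmm_inv @ Knm.T
    L2 = (multivariate_normal.logpdf(y, mean=np.zeros(n), cov=Qnn + np.eye(n) / beta)
          - 0.5 * beta * np.sum(Ktilde_diag))

    # implicit approximating distribution q(u): precision Lambda and mean u_hat
    Lam = beta * Kmm_inv @ Knm.T @ Knm @ Kmm_inv + Kmm_inv
    u_hat = beta * np.linalg.solve(Lam, Kmm_inv @ Knm.T @ y)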

3 SVI for GPs


One of the novelties of the Titsias bound was that, rather than explicitly representing a variational distribution for q(u), these variables are 'collapsed' [Hensman et al., 2012]. However, for stochastic variational inference to work on Gaussian processes, it turns out we need to maintain an explicit representation of these inducing variables.

Stochastic variational inference (SVI) allows variational inference for very large data sets, but it can only be applied to probabilistic models which have a set of global variables, and which factorise across the observations and latent variables as in Figure 1(a). Gaussian processes have no global variables and exhibit no such factorisation (Figure 1(b)). By introducing inducing variables u, we obtain an appropriate model for SVI (Figure 1(c)). Unfortunately, marginalising u re-introduces dependencies between the observations, and eliminates the global parameters. In the following, we derive a lower bound on L_2 which includes an explicit variational distribution q(u), enabling SVI. We then derive the required natural gradients and discuss how latent variables might be used.

Because there are a fixed number of inducing variables (specified by the user at algorithm design time) we can perform stochastic variational inference, greatly increasing the size of data sets to which we can apply Gaussian processes.




https://cdn.noedgeai.com/0196b502-66e1-7204-b819-84a886bed035_2.jpg?x=316&y=200&w=1242&h=411&r=0

Figure 1: Graphical models showing (a) the required form for a probabilistic model for SVI (reproduced from [Hoffman et al., 2012]), with global variables g and latent variables z. (b) The graphical model corresponding to Gaussian process regression, where connectivity between the values of the function f_i is denoted by a loop around the plate. (c) The graphical model corresponding to the sparse GP model, with inducing variables u working as global variables, and the term L_1 acting as log p(y_i | u, x_i). Marginalisation of u leads to the variational DTC formulation, introducing dependencies between the observations.


3.1 Global Variables


To apply stochastic variational inference to a Gaussian process model, we must have a set of global variables. The variables u will perform this role: we introduce a variational distribution q(u), and use it to lower bound the quantity p(y | X).

log p(y | X) ≥ ⟨L_1 + log p(u) - log q(u)⟩_{q(u)} ≜ L_3.

From the above we know that the optimal distribution is Gaussian, and we parametrise it as q(u) = N(u | m, S). The bound L_3 becomes

L_3 = Σ_{i=1}^{n} { log N(y_i | k_i^T K_{mm}^{-1} m, β^{-1}) - ½ β \tilde{k}_{i,i} - ½ tr(S Λ_i) } - KL(q(u) ‖ p(u))    (4)

with k_i being the i-th column of K_{mn} and Λ_i = β K_{mm}^{-1} k_i k_i^T K_{mm}^{-1}. The gradients of L_3 with respect to the parameters of q(u) are

∂L_3/∂m = β K_{mm}^{-1} K_{mn} y - Λ m,

∂L_3/∂S = ½ S^{-1} - ½ Λ.    (5)

Setting the derivatives to zero recovers the optimal solution found in the previous section, namely S = Λ^{-1}, m = û. It follows that L_2 ≥ L_3, with equality at this unique maximum.

The key property of L_3 is that it can be written as a sum of n terms, each corresponding to one input-output pair {x_i, y_i}: we have induced the necessary factorisation to perform stochastic gradient methods on the distribution q(u).
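
The factorised form of (4) is what makes minibatching possible. As a sketch (continuing the toy quantities of Section 2, with q(u) = N(u | m, S) and the minibatch sum rescaled to the full data set size), the bound can be estimated as:

    def L3_estimate(idx, m_vec, S, rescale=True):
        # stochastic estimate of L3 from the data points in idx, for q(u) = N(u | m_vec, S)
        ki = Knm[idx]                                     # rows k_i^T of Knm for this batch
        A = ki @ Kmm_inv                                  # each row is k_i^T Kmm^{-1}
        mu_i = A @ m_vec
        log_lik = -0.5 * np.log(2.0 * np.pi / beta) - 0.5 * beta * (y[idx] - mu_i) ** 2
        trace_term = 0.5 * beta * np.sum((A @ S) * A, axis=1)   # 0.5 * tr(S Lambda_i)
        per_point = log_lik - 0.5 * beta * Ktilde_diag[idx] - trace_term
        total = per_point.sum() * (n / len(idx) if rescale else 1.0)
        # KL(q(u) || p(u)) with p(u) = N(0, Kmm); m is the number of inducing variables
        kl = 0.5 * (np.trace(Kmm_inv @ S) + m_vec @ Kmm_inv @ m_vec - m
                    + np.linalg.slogdet(Kmm)[1] - np.linalg.slogdet(S)[1])
        return total - kl

    # evaluated on the full data set with the optimal q(u), this should recover L2 up to numerical error:
    # L3_estimate(np.arange(n), u_hat, np.linalg.inv(Lam), rescale=False)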

3.2 Natural Gradients


Stochastic variational inference works by taking steps in the direction of the approximate natural gradient \tilde{g}(θ), which is given by the usual gradient rescaled by the inverse Fisher information: \tilde{g}(θ) = G(θ)^{-1} ∂L/∂θ. To work with the natural gradients of the distribution q(u), we first recall the canonical and expectation parameters

θ_1 = S^{-1} m,    θ_2 = -½ S^{-1}

and  

η_1 = m,    η_2 = m m^T + S.

In the exponential family, properties of the Fisher information reveal the following simplification of the natural gradient [Hensman et al., 2012],

\tilde{g}(θ) = G(θ)^{-1} ∂L_3/∂θ = ∂L_3/∂η.    (6)

A step of length ℓ in the natural gradient direction, using θ_{(t+1)} = θ_{(t)} + ℓ ∂L_3/∂η, yields

θ_2^{(t+1)} = -½ S_{(t+1)}^{-1} = -½ S_{(t)}^{-1} + ℓ (-½ Λ + ½ S_{(t)}^{-1}),

θ_1^{(t+1)} = S_{(t+1)}^{-1} m_{(t+1)} = S_{(t)}^{-1} m_{(t)} + ℓ (β K_{mm}^{-1} K_{mn} y - S_{(t)}^{-1} m_{(t)}),

and taking a step of unit length (ℓ = 1) then recovers the same solution as above, by either (3) or (5). This confirms the result discussed in Hensman et al. [2012] and Hoffman et al. [2012], that taking this unit step is the same as performing a VB update. We can now obtain stochastic approximations to the natural gradient by considering the data either individually or in mini-batches.
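
A sketch of a single natural gradient step of length ℓ on a minibatch, working directly in the canonical parameters (θ_1, θ_2) as in the updates above (again illustrative only, continuing the toy quantities defined earlier):

    def natural_gradient_step(m_vec, S, idx, step):
        # one SVI update of q(u) = N(u | m_vec, S) from the minibatch idx, with step length `step`
        theta1 = np.linalg.solve(S, m_vec)                # theta_1 = S^{-1} m
        theta2 = -0.5 * np.linalg.inv(S)                  # theta_2 = -0.5 S^{-1}

        scale = n / len(idx)                              # rescale the minibatch sum to the full data set
        ki = Knm[idx]
        A = ki @ Kmm_inv
        Lam_hat = Kmm_inv + scale * beta * A.T @ A        # stochastic estimate of Lambda
        lin_hat = scale * beta * Kmm_inv @ ki.T @ y[idx]  # stochastic estimate of beta Kmm^{-1} Kmn y

        theta2_new = theta2 + step * (-0.5 * Lam_hat - theta2)
        theta1_new = theta1 + step * (lin_hat - theta1)

        S_new = np.linalg.inv(-2.0 * theta2_new)          # map back to (m, S)
        m_new = S_new @ theta1_new
        return m_new, S_new

    # with step = 1 and the full data set as the batch, a single update lands on m = u_hat, S = Lambda^{-1}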


We note the convenient result that the natural gradient for θ_2 is positive definite (note Λ = K_{mm}^{-1} + Σ_i Λ_i). This means that taking a step in that direction always leads to a positive definite matrix, and our implementation need not parameterise S in any way so as to ensure positive-definiteness, cf. standard gradient approaches on covariance matrices.

To optimise the kernel hyper-parameters and noise precision β, we take derivatives of the bound L_3 and perform standard stochastic gradient descent alongside the variational parameters. An illustration is presented in Figure 2.

3.3 Latent Variables


The above derivations enable online learning for Gaussian process regression using SVI. Several GP based models involve inference of X, such as the GP latent variable model [Lawrence, 2005, Titsias and Lawrence, 2010] and its extensions [e.g. Damianou et al., 2011, 2012].

To perform stochastic variational inference with latent variables, we require a factorisation as illustrated by Figure 1(a): this factorisation is provided by (4). To get a model like the Bayesian GPLVM, we need a lower bound on log p(y). In Titsias and Lawrence [2010] this was achieved through approximate marginalisation of L_2 with respect to X, which leads to an expression depending only on the parameters of q(X). However, this formulation scales poorly, and the variables of the optimisation are closely coupled due to the marginalisation of u. The above enables a lower bound to which SVI is immediately applicable:

log p(y) = log ∫ p(y | X) p(X) dX

≥ ∫ q(X) { L_3 + log p(X) - log q(X) } dX.

It is straightforward to introduce p output dimensions for the data Y, and following Titsias and Lawrence [2010], we use a factorising normal distribution q(X) = ∏_{i=1}^{n} q(x_i). The relevant expectations of L_3 are tractable for various choices of covariance function.

To perform SVI in this model, we now alternate between selecting a minibatch of data, optimising the relevant variables of q(X) with q(u) fixed, and updating q(u) using the approximate natural gradient. We note that the form of (4) means that each of the latent variable distributions may be updated individually, enabling parallelisation across the minibatch.

3.4 Non-Gaussian likelihoods


Another advantage of the factorisation of (4) is that it enables a simple routine for inference with non-Gaussian likelihoods. The usual procedure for fitting GPs with non-Gaussian likelihoods is to approximate the likelihood using either a local variational lower bound [Gibbs and MacKay, 2000], or by expectation propagation [Kuss and Rasmussen, 2005]. These approximations to the likelihood are required because of the connections between the variables f .

In L_3, the bound factorises in such a way that some non-Gaussian likelihoods may be marginalised exactly, given the existing approximations. To see this, consider that we are presented not with the vector y, but with a binary vector t with t_i ∈ {0, 1}, and the likelihood p(t | y) = ∏_{i=1}^{n} σ(y_i)^{t_i} (1 - σ(y_i))^{(1 - t_i)}, as in the case of classification. We can bound the marginal likelihood using p(t | X) ≥ ∫ p(t | y) exp{L_3} dy, which involves n independent one-dimensional integrals due to the factorising nature of L_3. For the probit likelihood each of these integrals is tractable.

This kind of approximation, where the likelihood is integrated exactly, is amenable to SVI in the same manner as the regression case above, through computation of the natural gradient.
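
For the probit case σ(·) = Φ(·), the required one-dimensional integrals have the standard closed form ∫ Φ(y) N(y | μ, v) dy = Φ(μ / √(1 + v)). A small standalone sketch of this identity (illustrative, not the classification model itself):

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def probit_expectation(mu, var, t):
        # E_{N(y | mu, var)}[ Phi(y)^t (1 - Phi(y))^(1 - t) ] = Phi((2t - 1) mu / sqrt(1 + var))
        return norm.cdf((2 * t - 1) * mu / np.sqrt(1.0 + var))

    # quick numerical check of the identity against quadrature
    mu, var, t = 0.3, 0.5, 1
    closed_form = probit_expectation(mu, var, t)
    numerical, _ = quad(lambda y: norm.cdf(y) * norm.pdf(y, mu, np.sqrt(var)), -10.0, 10.0)
    # closed_form and numerical agree to several decimal places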

4 Experiments


4.1 Toy Data


To demonstrate our algorithm we begin with two simple toy datasets based on sinusoidal functions. In the first experiment we show how the approximation converges towards the true solution as mini-batches are included. Figure 2 illustrates the nature of our approximation, showing the variational approximation to the inducing function variables.

The second toy problem (Figure 3) illustrates the convergence of the algorithm on a two dimensional problem, again based on sinusoids. Here, we start with a random initialisation for q(u), and the model converges after 2000 iterations. We found empirically that holding the covariance parameters fixed for the first epoch results in more reliable convergence, as can be seen in Figure 4.

4.2 UK Apartment Price Data


Our first large scale Gaussian process models the changing cost of apartments in the UK. We downloaded the monthly price paid data for the period February to October 2012 from http://data.gov.uk/dataset/land-registry-monthly-price-paid-data/, which covers England and Wales, and filtered for apartments.




https://cdn.noedgeai.com/0196b502-66e1-7204-b819-84a886bed035_4.jpg?x=215&y=371&w=1433&h=455&r=0

Figure 2: Stochastic variational inference on a trivial GP regression problem. Each pane shows the posterior of the GP after a batch of data, marked as solid points. Previously seen (and discarded) data are marked as empty points; the distribution q(u) is represented by vertical error bars.


https://cdn.noedgeai.com/0196b502-66e1-7204-b819-84a886bed035_4.jpg?x=221&y=1292&w=1429&h=597&r=0

Figure 3: A two dimensional toy demo, showing the initial condition and final condition of the model. Data are marked as colored points, and the model's prediction is shown as (similarly colored) contour lines. The positions of the inducing variables are marked as empty circles.



https://cdn.noedgeai.com/0196b502-66e1-7204-b819-84a886bed035_5.jpg?x=211&y=200&w=700&h=525&r=0

Figure 4: Convergence of the SVIGP algorithm on the two dimensional toy data


This resulted in a data set with 75,000 entries, which we cross-referenced against a postcode database to obtain latitude and longitude, on which we regressed the normalised logarithm of the apartment prices.

Randomly selecting 10,000 data points as a test set, we built a GP as described, with a covariance function k(·,·) consisting of four parts: two squared exponential covariances, initialised with different length scales, to account for national and regional variations in property prices; a constant (or 'bias') term to allow for non-zero-mean data; and a noise variance to account for variation that could not be modelled using latitude and longitude alone.

We selected 800 inducing input sites using a k-means algorithm, and optimised the parameters of the covariance function alongside the variational parameters. We performed some manual tuning of the learning rates: empirically we found that the step length should be much higher for the variational parameters of q(u) than for the covariance function parameters; we used 0.01 and 1×10^{-5} respectively. Also, we included a momentum term for the covariance function parameters (set to 0.9). We tried including momentum terms for the variational parameters, but we found this hindered performance. A large mini-batch size (1000) reduced the stochasticity of the gradient computations. We judged that the algorithm had converged after 750 iterations, as the stochastic estimate of the lower bound on the marginal likelihood failed to increase further.
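
The inducing input initialisation described above can be reproduced with a standard k-means clustering of the (latitude, longitude) inputs. A sketch using scikit-learn; the placeholder data and the exact settings below are assumptions, not the authors' pipeline:

    import numpy as np
    from sklearn.cluster import KMeans

    # X_train: (n, 2) latitude/longitude inputs, y_train: normalised log prices (placeholders here)
    X_train = np.random.randn(75000, 2)
    Z = KMeans(n_clusters=800, n_init=4, random_state=0).fit(X_train).cluster_centers_

    # settings quoted in the text
    lr_variational = 0.01      # step length for the variational parameters of q(u)
    lr_hyper = 1e-5            # step length for the covariance function parameters
    momentum_hyper = 0.9       # momentum on the covariance function parameters only
    minibatch_size = 1000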

For comparison to our model, we constructed a series of GPs on subsets of the training data. Splitting the data into sets of 500, 800, 1000 and 1200, we fitted a GP with the same covariance function as our stochastic GP. Parameters of the covariance function were optimised using type-II maximum likelihood for each batch. Table 1 reports the mean squared error in our model's prediction of the held-out prices, as well as the same for the random sub-set approach (along with two standard deviations of the inter-sub-set variability).



https://cdn.noedgeai.com/0196b502-66e1-7204-b819-84a886bed035_5.jpg?x=954&y=200&w=706&h=616&r=0

Figure 5: Variability of apartment price (logarithmically!) throughout England and Wales.

Table 1: Mean squared errors in predicting the log-apartment prices across England and Wales by latitude and longitude

Method                       Mean squared error
SVIGP                        0.426
Random sub-set (N=500)       0.522 +/- 0.018
Random sub-set (N=800)       0.510 +/- 0.015
Random sub-set (N=1000)      0.503 +/- 0.011
Random sub-set (N=1200)      0.502 +/- 1.012


4.3 Airline Delays


The second large scale dataset we considered consists of flight arrival and departure times for every commercial flight in the USA from January 2008 to April 2008. This dataset contains extensive information about almost 2 million flights, including the delay (in minutes) in reaching the destination. The average delay of a flight in the first 4 months of 2008 was 30 minutes. Of course, much better estimates can be given by exploiting the enormous wealth of data available, but rich models are often overlooked in these cases due to the sheer size of the dataset. We randomly selected 800,000 datapoints², using a random subset of 700,000 samples to train the model and 100,000 to test it. We chose to include 8 of the many variables available for this dataset in our model: the age of the aircraft (number of years since deployment), the distance that needs to be covered, airtime, departure time, arrival time, day of the week, day of the month, and month.




https://cdn.noedgeai.com/0196b502-66e1-7204-b819-84a886bed035_6.jpg?x=212&y=199&w=713&h=623&r=0

Figure 6: Posterior variance of apartment prices.


We built a Gaussian process with a squared exponential covariance function together with bias and noise terms. In order to discard irrelevant input dimensions, we allowed a separate lengthscale for each input. For our experiments, we used m = 1000 inducing inputs and a mini-batch size of 5000. The learning rate for the variational parameters of q(u) was set to 0.01, while the learning rate for the covariance function parameters was set to 1×10^{-5}. We also used a momentum term of 0.9 for the covariance parameters.
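
The per-input lengthscales ('automatic relevance determination') correspond to a squared exponential kernel of the form k(x, x') = sf2·exp(-½ Σ_d (x_d - x'_d)² / ℓ_d²); a minimal sketch of this kernel for the eight features listed above:

    import numpy as np

    def ard_rbf(A, B, sf2, lengthscales):
        # ARD squared exponential kernel: one lengthscale per input dimension
        As, Bs = A / lengthscales, B / lengthscales
        d2 = np.sum(As**2, 1)[:, None] + np.sum(Bs**2, 1)[None, :] - 2.0 * As @ Bs.T
        return sf2 * np.exp(-0.5 * d2)

    # eight features as in the text: aircraft age, distance, airtime, departure time,
    # arrival time, day of week, day of month, month
    ell = np.ones(8)                                   # initial lengthscales, learned during training
    K = ard_rbf(np.random.randn(5, 8), np.random.randn(3, 8), sf2=1.0, lengthscales=ell)
    # a large learned lengthscale means the corresponding feature is effectively switched off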

For the purpose of comparison, we fitted several GPs with an identical covariance function on subsets of the data. We split the data into sets of 800, 1000 and 1200 samples and optimised the parameters using type-II maximum likelihood. We repeated this procedure 10 times.

The left pane of Figure 7 shows the root mean squared error (RMSE) obtained by fitting GPs on subsets of the data. The right pane of Figure 7 shows the RMSE obtained by fitting 10 SVI GPs as a function of the iteration. The individual runs are shown in light gray, while the blue line shows the average RMSE across runs.



https://cdn.noedgeai.com/0196b502-66e1-7204-b819-84a886bed035_6.jpg?x=998&y=242&w=608&h=462&r=0

Figure 8: Root mean square errors for models with different numbers of inducing variables.


https://cdn.noedgeai.com/0196b502-66e1-7204-b819-84a886bed035_6.jpg?x=991&y=898&w=601&h=477&r=0

Figure 9: Automatic relevance determination parameters for the features used for predicting flight delays.




One of the main advantages of the approach presented here is that the computational complexity is independent of the number of samples n. This allowed us to use a much larger number of inducing inputs than has traditionally been possible. Conventional sparse GPs have a computational complexity of O(nm^2), so for large n the typical upper bound for m is between 50 and 100. The impact on prediction performance is quite significant, as highlighted in Figure 8, where we fit several SVI GPs using different numbers of inducing inputs.

Looking at the inverse lengthscales in Figure 9, it's possible to get a better idea of the relevance of the different features available in this dataset. The most relevant variable turned out to be the time of departure of the flight, closely followed by the distance that needs to be covered. Distance and airtime should in theory be correlated, but they have very different relevances. This can be intuitively explained by considering that on longer flights it's easier to make up for delays at departure.




2 Subsampling wasn't technically necessary, but we didn't want to overburden the memory of a shared compute node just before a submission deadline.