
A Framework for Bayesian Optimization in Embedded Subspaces


Alexander Munteanu 1   Amin Nayebi 2   Matthias Poloczek 3 2

Abstract


We present a theoretically founded approach for high-dimensional Bayesian optimization based on low-dimensional subspace embeddings. We prove that the error in the Gaussian process model is bounded tightly when going from the original high-dimensional search domain to the low-dimensional embedding. This implies that the optimization process in the low-dimensional embedding proceeds essentially as if it were run directly on an unknown active subspace of low dimensionality. The argument applies to a large class of algorithms and GP models, including non-stationary kernels. Moreover, we provide an efficient implementation based on hashing and demonstrate empirically that this subspace embedding achieves considerably better results than the previously proposed methods for high-dimensional BO based on Gaussian matrix projections and structure-learning.

1. Introduction


Bayesian optimization (BO) has recently emerged as a powerful technique for the global optimization of expensive-to-evaluate black-box functions (Brochu et al., 2010; Shahriari et al., 2016; Frazier, 2018). Here 'black-box' means that we may evaluate the objective at any point to observe its value, possibly with noise but without derivative information. The advantages of Bayesian optimization are sample-efficiency, convergence to a global optimum, and a low computational overhead. A critical limitation is the number of parameters that BO can optimize. The limit is usually seen around 15 parameters for the most common form of BO that uses Gaussian process (GP) regression as a surrogate model for the objective function. Thus, it is not surprising that expanding BO to higher-dimensional search spaces is widely acknowledged as one of the most important goals in the field. For example, Frazier (2018, p. 16) states that "developing Bayesian optimization methods that work well in high dimensions is of great practical and theoretical interest".

This paper advances the theory of high-dimensional Bayesian optimization and improves the practical performance. Specifically, the contributions are:

  1. A theoretically founded framework for Bayesian optimization based on subspace embeddings. The core is a rigorous proof that any GP-based BO algorithm proceeds on the embedded space as it would if it were run directly on an unknown active subspace of low dimensionality.

  2. An extension of this result to large classes of parametrized GP models. Note that the argument is agnostic to the acquisition criterion used by the BO algorithm and thus can be easily combined with many state-of-the-art methods, including batch acquisition, multifidelity modeling, network architecture search, and many more.

  3. The Hashing-enhanced Subspace BO (HeSBO) method for high-dimensional problems has i) strong accuracy guarantees and ii) low computational overhead, while being iii) conceptually simple. In particular, it avoids complex corrections of projections that caused complications previously.

  4. An experimental evaluation that demonstrates state-of-the-art performance when the proposed embedding is combined with a low-dimensional BO algorithm, e.g., Knowledge Gradient (KG) (Frazier et al., 2009) or BLOSSOM (BM) (McLeod et al., 2018).

Related Work. Bayesian optimization has recently received significant interest for the optimization of expensive-to-evaluate black-box functions. We refer to (Brochu et al., 2010; Shahriari et al., 2016; Frazier, 2018) for an overview. Aside from the general setting of optimizing an objective black-box function, several extensions have been studied, including constrained (Gardner et al., 2014; Hernández-Lobato et al., 2015; Picheny et al., 2016) and multifidelity optimization (Kennedy & O'Hagan, 2000; Huang et al., 2006; Swersky et al., 2013; Kandasamy et al., 2016; Poloczek et al., 2017), expensive functions with derivative information (Wu et al., 2017; Eriksson et al., 2018), and neural network architecture search (Klein et al., 2017; Kandasamy et al., 2018).




Equal contribution 1 Dortmund Data Science Center, Faculties of Statistics and Computer Science, TU Dortmund, Dortmund, Germany 2 Department of Systems and Industrial Engineering, University of Arizona, Tucson, AZ, USA 3 Uber AI Labs, San Francisco, CA, USA. Correspondence to: Matthias Poloczek .

Proceedings of the 36th  International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).





Kandasamy et al. (2015) extended Bayesian optimization to higher-dimensional search spaces composed of disjoint low-dimensional subspaces that can be optimized over separately if the latent additive structure is learned. The ADD algorithm of Gardner et al. (2017) and SKL of Wang et al. (2017) use sophisticated sampling procedures to learn the decomposition, and Rolland et al. (2018) and Mutny & Krause (2018) extended the idea to overlapping subspaces. Inferring the latent structure of the search space from data via extensive sampling introduces a considerable computational load in practice, which limits the applicability to objective functions with high evaluation cost (see Sect. 3). This bottleneck was recently addressed by Wang et al. (2018) whose ensemble BO method (EBO) uses an ensemble of additive GP models for scalability. We evaluated ADD, EBO and SKL from this line of research and found EBO indeed more scalable than SKL, but still considerably more expensive than the other approaches that we evaluated.

Chen et al. (2012); Wang et al. (2016b) bypassed scalability issues by leveraging the observation that for many applications only a small number of tunable parameters have a significant effect on the objective function. This phenomenon is known as "active subspace" and "effective dimensionality". Wang et al. (2016b) used a projection matrix with standard Gaussian entries and showed that the active subspace is rank-preserved but strongly dilated. Thus, they expanded the low-dimensional search space to ensure that its projection to the high-dimensional space contains an optimal solution with reasonable probability. Their REMBO algorithm fits a GP model to the low-dimensional space and projects points back to the high-dimensional space using the inverse random projection. Points whose high-dimensional image is outside the search domain $X$ are convex-projected to the boundary of $X$, which may lead to over-exploration of the boundary regions. Binois et al. (2015) suggested a new warping kernel $k_{\psi}$ to address this issue by an improved preservation of distances among such points. Recently, Binois et al. (2019) proposed a novel choice of the low-dimensional search space to handle the distortion of Gaussian projections. Djolonga et al. (2013) and Eriksson et al. (2018) propose learning the active subspace, in the latter paper from derivative information, and therefore avoid the random projection. We compare to the methods in (Wang et al., 2016b; Binois et al., 2015) in Sect. 3. We also compare to the BOCK algorithm of Oh et al. (2018) that is based on a cylindrical transformation. This provides scalability to high dimensional problems and an implicit prior that the optimizer is in the interior of the search space.

Outline. Sect. 2 introduces probabilistic subspace embeddings and the HeSBO method, followed by the theoretical foundation. The experimental evaluation is given in Sect. 3 and the discussion is in Sect. 4. Sections denoted by letters are found in the Supplement.

2. Probabilistic Subspace Embeddings of Gaussian processes


This section introduces probabilistic subspace embeddings and their use in BO. Sect. 2.2 motivates the approach. The theoretical foundation is given in Sect. 2.3 and generalized to large classes of popular kernels in Sect. 2.4.

2.1. The HeSBO Algorithm


Given an objective function $f$ defined on the high-dimensional domain $X = [-1,+1]^D$, we suppose that there exists an unknown $d_e$-dimensional active subspace $Z$. Informally, the objective $f$ is invariant to coordinate changes outside $Z$. Note that we do not suppose that $Z$ is axis-aligned with $X$. Our approach is to choose a $d$-dimensional subspace $Y$ of $X$ randomly such that the optimization process proceeds on $Y$ essentially as it would on $Z$ with good probability. This enables us to run a BO method on $Y$ in order to find an $x^* \in \arg\max_{x \in X} f(x)$. The proposed embedding can easily be combined with many GP-based BO methods: Algorithm 1 shows the generic BO algorithm (Shahriari et al., 2016; Frazier, 2018) with the embedding incorporated via the highlighted steps. We call this Hashing-enhanced Subspace Bayesian Optimization (HeSBO).

The embedding is constructed in Algorithm 2 that initializes two hash functions which implicitly represent the embedding matrix $S$ (see Sect. 2.2 for details): The hash function $h$ chooses one single non-zero entry for each of the $D$ dimensions, and the function $\sigma$ determines the sign of the corresponding non-zero entry. Algorithm 3 maps the low-dimensional vector $y \in Y$ to the high-dimensional domain $X$. The algorithm iterates over the high-dimensional coordinates: for each entry it invokes the two above hash functions to determine the associated coordinate in the low-dimensional vector $y$ and the sign that the corresponding value of $y$ is multiplied with. Note that by construction the columns of the matrix are orthogonal. Moreover, for any fixed coordinate (dimension) $d_i$ of the low-dimensional space we have that the number of coordinates in the high-dimensional space that map to $d_i$ is concentrated at $D/d$, i.e., this number has mean $D/d$ and a variance that drops (quickly) as $D$ increases. The reason is that $h$ picks the target of each dimension uniformly at random. The dilation is thus roughly $\sqrt{D/d}$, and the search domain $Y = [-1,1]^d$ is mapped to $SY = \{Sy \mid y \in Y\} \subseteq X = [-1,1]^D$ with low distortion. We contrast this with REMBO's search space that was chosen as $[-\sqrt{d}, +\sqrt{d}]^d$ to cope with dilations; see Sect. B. It is critical to note that the operations of copying the coordinates and multiplying them by $\{-1,1\}$ assert that no point is projected outside $X$. Thus, HeSBO avoids complex corrections in contrast to the REMBO variants. The analysis in Sect. C in the supplement shows that REMBO applies these corrections frequently.



Algorithm 1 The Generic BO Algorithm with the probabilistic subspace embedding

1: Input: Objective $f: X \to \mathbb{R}$; acquisition criterion $\alpha$ (e.g., EI); target dimension $d \in \mathbb{N}$.
2: Output: An optimizer $x^* \in \arg\max_{x \in X} f(x)$.
3: Construct embedding $S: Y \to X$ with Algorithm 2.
4: Sample initial points $Y_0 \subset Y$ using a space filling design and let $D_0 = (Y_0, \{\text{observation for } f(Sy) \mid y \in Y_0\})$.
5: Estimate the hyperparameters $\theta_0$ for the GP prior on $Y$ given $D_0$. Then calculate the posterior conditioned on $\theta_0$ and $D_0$.
6: for $n = 1$ to $N$ do
7:   Compute $y_{n+1} \in \arg\max_{y \in Y} \alpha(y, D_n)$.
8:   Evaluate $f(Sy_{n+1})$, for the projection of $y_{n+1}$ to $X$ via Algorithm 3. Let $z_{n+1}$ be the (noisy) observation.
9:   Update the posterior distribution with the new observation $D_{n+1} = D_n \cup (y_{n+1}, z_{n+1})$.
10: end for
11: Return $x^* \in \arg\max_{x \in X} f(x)$ if observations are noise-free, or a point with maximum expected value under the posterior given $D_N$ otherwise.

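To make Algorithm 1 concrete, the following is a minimal Python sketch of one way the loop could be instantiated. The GP surrogate (GPy), the EI acquisition maximized over random candidates, and all function and variable names are illustrative assumptions rather than the reference implementation (which is available at github.com/aminnayebi/HesBO).

```python
# A minimal sketch of Algorithm 1 (not the reference implementation):
# GP surrogate via GPy, EI acquisition maximized over random candidates.
import numpy as np
import GPy
from scipy.stats import norm

def hesbo_ei_sketch(f, D, d, n_init=5, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    # Algorithm 2: hash functions h and sigma implicitly represent the embedding S.
    h = rng.integers(0, d, size=D)
    sigma = rng.choice([-1.0, 1.0], size=D)
    project = lambda y: sigma * y[h]               # Algorithm 3: x_i = sigma(i) * y_{h(i)}

    Y = rng.uniform(-1.0, 1.0, size=(n_init, d))   # initial design in [-1, 1]^d
    Z = np.array([[f(project(y))] for y in Y])     # observations of f(Sy)

    for _ in range(n_iter):
        model = GPy.models.GPRegression(Y, Z, GPy.kern.Matern52(input_dim=d, ARD=True))
        model.optimize()                           # estimate hyperparameters on the d-dim data
        cand = rng.uniform(-1.0, 1.0, size=(2000, d))
        mu, var = model.predict(cand)
        sd = np.sqrt(np.maximum(var, 1e-12))
        gamma = (mu - Z.max()) / sd                # expected improvement (maximization)
        ei = sd * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
        y_next = cand[int(np.argmax(ei))]
        Y = np.vstack([Y, y_next])
        Z = np.vstack([Z, [[f(project(y_next))]]])

    i = int(np.argmax(Z))
    return project(Y[i]), float(Z[i, 0])
```

Any other low-dimensional BO method (e.g., KG or BLOSSOM) can be substituted for the model-fitting and acquisition steps without touching the embedding.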



2.2. Motivation for the Embedding


Next we provide the motivation and background for the proposed embedding, before we prove its theoretical properties in Sect. 2.3. Random projections have often been used to reduce the dependence on the dimension of an algorithm's running time, memory, or communication cost. The seminal result of Johnson & Lindenstrauss (1984), JL for short, states that any set of $n$ points in a $D$-dimensional space can be embedded into $d \in O(\log n / \varepsilon^2)$ dimensions such that all pairwise $\ell_2$ distances are preserved up to a $(1 \pm \varepsilon)$-factor. This can be achieved probabilistically by a linear map where each entry of the embedding $G \in \mathbb{R}^{d \times D}$ is an i.i.d. rescaled standard Gaussian. Much effort was put into making the construction computationally more efficient and sparse (Achlioptas, 2003; Ailon & Chazelle, 2009; Ailon & Liberty, 2009; Kane & Nelson, 2014). Sarlós (2006) was the first who generalized such embeddings to entire linear subspaces with target dimension roughly $O(d_e/\varepsilon^2)$, which is optimal, see (Nelson & Nguyên, 2014). Charikar et al. (2004) introduced an alternative projection, called count-sketch, based on hashing to identify heavy hitters in a large vector, often a data stream, based on a summary stored in a low-dimensional vector. The count-sketch can be formalized as a sparse matrix with only one non-zero entry per column that is drawn randomly in $\{-1, 1\}$. We give an example of such a map reducing from five to three dimensions:

$$
\begin{pmatrix} 0 & 0 & 1 & -1 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}
=
\begin{pmatrix} x_3 - x_4 \\ x_1 \\ x_2 + x_5 \end{pmatrix}.
$$


Algorithm 2 Construction of an inverse subspace embedding $S$ implicitly given by $(h, \sigma)$, see details in the text.

1: Input: Input dimension $D$, target dimension $d$.
2: Output: Implicit representation of a linear map $S: \mathbb{R}^d \to \mathbb{R}^D$, by two hash functions $h$ and $\sigma$.
3: Initialize a pairwise independent and uniform hash function $h: [D] \to [d]$.
4: Initialize a 4-wise independent and uniform hash function $\sigma: [D] \to \{-1, 1\}$.
5: Return $(h, \sigma)$, which implicitly defines $S$, see detailed description in Sect. 2.2.



Algorithm 3 Computes $Sy \in X$ via the inverse subspace embedding $S$ implicitly given by $(h, \sigma)$

1: Input: Low-dimensional vector $y \in Y \subseteq \mathbb{R}^d$.
2: Output: High-dimensional vector $x = Sy \in X \subseteq \mathbb{R}^D$.
3: for $i = 1$ to $D$ do
4:   $x_i = \sigma(i) \cdot y_{h(i)}$
5: end for
6: Return $x$.




This is equivalent to uniformly hashing each entry $x_i$ to a random coordinate $j = h(i) \in [d]$ and multiplying it by a random sign $\sigma(i) \in \{-1, 1\}$ for all $i \in [D]$. The low-dimensional summary $y = Sx$ is given by $\forall j \in [d]: y_j = \sum_{i: h(i)=j} \sigma(i)\,x_i$. Charikar et al. (2004) showed that $\tilde{x}_i = \sigma(i)\,y_{h(i)}$ is a good estimate for $x_i$. We recover this idea for the projection from the low-dimensional to the high-dimensional domain, $x = Sy$, in Algorithms 2 and 3. This type of embedding is not capable of preserving pairwise distances for arbitrary sets of points, as JL does. However, Clarkson & Woodruff (2013) showed that it yields a subspace embedding for an entire linear subspace with constant probability. Informally, they used the fact that a $d_e$-dimensional (active) subspace can have only $O(d_e)$ coordinates that are large (have high leverage) and thus important. These are exactly the heavy hitters which can be estimated reasonably well. The other low values mostly cancel in the sum with random signs (up to a small error). Nelson & Nguyen (2013) improved and simplified their analysis and complemented it with a tight lower bound, settling the target dimension of subspace embeddings via the count-sketch to $\Theta(d_e^2/(\varepsilon^2 \delta))$, where $\delta$ is the failure probability. We define $\varepsilon$-subspace embeddings more formally.

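The following short numpy sketch (an illustration under the stated assumptions, not the authors' code) spells out both directions: the pair $(h, \sigma)$ of Algorithm 2, the up-projection $x = Sy$ of Algorithm 3, and the count-sketch summary with the estimate $\tilde{x}_i = \sigma(i)\,y_{h(i)}$ of Charikar et al. (2004).

```python
# Illustrative numpy sketch (not the authors' implementation) of Algorithms 2 and 3
# and of the count-sketch summary / estimate discussed above.
import numpy as np

rng = np.random.default_rng(1)
D, d = 100, 4

# Algorithm 2: the pair (h, sigma) implicitly represents the embedding matrix S.
h = rng.integers(0, d, size=D)           # h: [D] -> [d]
sigma = rng.choice([-1.0, 1.0], size=D)  # sigma: [D] -> {-1, 1}

# Algorithm 3: map a low-dimensional y in [-1,1]^d to x = Sy in [-1,1]^D.
y = rng.uniform(-1.0, 1.0, size=d)
x = sigma * y[h]                          # x_i = sigma(i) * y_{h(i)}
assert np.all(np.abs(x) <= 1.0)           # no correction step is ever needed

# Count-sketch direction: summarize a high-dimensional vector and estimate its entries.
v = rng.normal(size=D)
summary = np.zeros(d)
np.add.at(summary, h, sigma * v)          # y_j = sum_{i: h(i)=j} sigma(i) * v_i
v_est = sigma * summary[h]                # tilde{v}_i = sigma(i) * y_{h(i)}
```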

Definition 1 ($\varepsilon$-subspace embedding, cf. Sarlós (2006); Woodruff (2014)). Given a matrix $V \in \mathbb{R}^{D \times d_e}$ with orthonormal columns, an integer $d \le D$ and an approximation parameter $\varepsilon \in (0, \frac{1}{2})$, an $\varepsilon$-subspace embedding for $V$ is a map $S: \mathbb{R}^D \to \mathbb{R}^d$ such that $\forall x \in \mathbb{R}^{d_e}$:

$$(1) \qquad (1-\varepsilon)\,\|Vx\|_2^2 \;\le\; \|SVx\|_2^2 \;\le\; (1+\varepsilon)\,\|Vx\|_2^2,$$

or, equivalently (cf. Paul et al. (2014); Cohen et al. (2016))

$$(2) \qquad \|V^{\top}S^{\top}SV - I_{d_e}\|_2 \;\le\; \varepsilon.$$

Note that (1) is closer to the original JL property preserving Euclidean norms and thus distances, but generalized to the entire subspace spanned by $V$, while (2) is often analytically more convenient and intuitively states that the isometry property $V^{\top}V = I_{d_e}$ is approximately preserved under the random projection. Note that the definition directly implies that the singular values of an embedded subspace are preserved up to $(1 \pm \varepsilon)$ multiplicative distortion, which in particular means that its rank is preserved.

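As a quick illustration of Definition 1, the following hypothetical check (not part of the paper's experiments) draws an orthonormal basis $V$, forms a count-sketch matrix $S$, and measures the spectral-norm distortion of property (2).

```python
# Illustrative check of Definition 1, property (2), for a count-sketch embedding.
import numpy as np

rng = np.random.default_rng(2)
D, d_e, d = 1000, 4, 400

# Orthonormal basis of a random d_e-dimensional (active) subspace.
V, _ = np.linalg.qr(rng.normal(size=(D, d_e)))

# Count-sketch S in R^{d x D}: one +-1 entry per column.
h = rng.integers(0, d, size=D)
sigma = rng.choice([-1.0, 1.0], size=D)
S = np.zeros((d, D))
S[h, np.arange(D)] = sigma

M = V.T @ S.T @ S @ V
eps = np.linalg.norm(M - np.eye(d_e), 2)   # spectral-norm distortion of property (2)
print(f"distortion epsilon ~= {eps:.3f}")
```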
2.3. Theoretical Foundation for the Embedding


Assuming the existence of a low-dimensional active subspace, Wang et al. (2016b) showed that preserving the rank and controlling the dilation ensures that an optimal point is contained in a low-dimensional projection of this subspace. We improve on their analysis by leveraging the above definition. Our main theoretical contribution is to show that any ε -subspace embedding preserves the mean and variance functions of a Gaussian process with linear kernel, i.e., the standard inner product in Euclidean space. In Sect. 2.4 we further extend this result to other important classes of kernel functions like polynomial kernels, (squared) exponential and Matérn kernels. Note that the guarantee is independent of how the subspace embedding is achieved algorithmically.

Priors based on Gaussian process regression are popular in BO. Let the error distribution be normal with mean 0 and positive semidefinite kernel $K(x,y)$. Then the predictive distribution, given the data matrix $X = (x_1, \ldots, x_l)^{\top}$ and the vector of their function evaluations $f = (f(x_1), \ldots, f(x_l))^{\top}$, is also normal with mean and variance functions

$$(2) \qquad \mu(x) = k(x)^{\top} K^{\dagger} f,$$

$$(3) \qquad \sigma^2(x) = K(x,x) - k(x)^{\top} K^{\dagger} k(x),$$

where $x \in X$ and $k(x)$ denotes the vector given by $(K(x,x_1), \ldots, K(x,x_l))^{\top}$, and the kernel matrix is defined by $K_{i,j} = K(x_i, x_j)$, $i,j \in [l]$. By $K^{\dagger}$ we denote the Moore-Penrose pseudoinverse which coincides with the standard inverse but generalizes to non-invertible cases. For the sake of illustration, assume for the moment that we have the simple linear kernel $K(x,y) = x^{\top}y$, i.e., the standard inner product. Then Equations (2) and (3) simplify to

$$(4) \qquad \mu(x) = x^{\top} X^{\top} (XX^{\top})^{\dagger} f,$$

$$(5) \qquad \sigma^2(x) = x^{\top}x - x^{\top} X^{\top} (XX^{\top})^{\dagger} X x.$$

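For reference, a few lines of numpy suffice to evaluate (4) and (5); the helper below is an illustrative sketch assuming a data matrix with rows $x_i^{\top}$.

```python
# Illustrative numpy computation of the linear-kernel GP posterior (4) and (5).
import numpy as np

def linear_gp_posterior(X, f, x):
    """X: (l, D) data matrix with rows x_i; f: (l,) observations; x: (D,) query point."""
    K_pinv = np.linalg.pinv(X @ X.T)            # Moore-Penrose pseudoinverse of the kernel matrix
    mu = x @ X.T @ K_pinv @ f                   # eq. (4)
    var = x @ x - x @ X.T @ K_pinv @ (X @ x)    # eq. (5)
    return mu, var
```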
Now, to bound the effect of an $\varepsilon$-subspace embedding $S$ under the assumption of low effective dimensionality, consider $V$, an orthonormal basis for the active subspace, and $V_{\perp}$, an orthonormal basis for its nullspace. We can replace any point $x = Vy + V_{\perp}z$ by $x' = Vy$, its projection to the subspace, since $f(x) = f(Vy + V_{\perp}z) = f(Vy) = f(x')$. We stress that $V$ and $V_{\perp}$ can be arbitrary bases for those subspaces. While Definition 1 and our results do not depend on the representative, it will be convenient to choose $V$ derived from a singular value decomposition (SVD). Our main result states that the Gaussian process in this subspace is simulated very accurately conditioned on the (probabilistic) event that we have an $\varepsilon$-subspace embedding.

Theorem 2. Consider a Gaussian process that acts directly in the unknown active subspace of dimension $d_e$ with mean and variance functions $\mu(\cdot), \sigma^2(\cdot)$. Let $\tilde{\mu}(\cdot), \tilde{\sigma}^2(\cdot)$ be their approximations using an $\varepsilon$-subspace embedding for the active subspace. Then we have for every $x \in X$

  1. $|\mu(x) - \tilde{\mu}(x)| \le 5\varepsilon\,\|x\|\,\|X^{\dagger}f\|$

  2. $|\sigma^2(x) - \tilde{\sigma}^2(x)| \le 12\varepsilon\,\|x\|^2$.

Proof. We can express $x = Vy$ and by the SVD we have $X^{\top} = VU^{\top}$. Then for the mean function and its approximation we have by the triangle inequality

$$|\mu(x) - \tilde{\mu}(x)| = \left| x^{\top}X^{\top}(XX^{\top})^{\dagger}f - x^{\top}S^{\top}SX^{\top}(XS^{\top}SX^{\top})^{\dagger}f \right|$$

$$(6) \qquad \le \left| x^{\top}X^{\top}(XX^{\top})^{\dagger}f - x^{\top}S^{\top}SX^{\top}(XX^{\top})^{\dagger}f \right|$$

$$(7) \qquad\quad + \left| x^{\top}S^{\top}SX^{\top}(XX^{\top})^{\dagger}f - x^{\top}S^{\top}SX^{\top}(XS^{\top}SX^{\top})^{\dagger}f \right|.$$

We bound both terms separately

$$(6) = \left| y^{\top}V^{\top}VU^{\top}(UV^{\top}VU^{\top})^{\dagger}f - y^{\top}V^{\top}S^{\top}SVU^{\top}(UV^{\top}VU^{\top})^{\dagger}f \right|$$
$$\le \|y\|\,\|V^{\top}S^{\top}SV - I\|\,\|U^{\top}(UV^{\top}VU^{\top})^{\dagger}f\|$$
$$\overset{(2)}{\le} \varepsilon\,\|y\|\,\|VU^{\top}(UU^{\top})^{\dagger}f\| = \varepsilon\,\|y\|\,\|VU^{\dagger}f\|$$
$$= \varepsilon\,\|y\|\,\|X^{\dagger}f\|$$

and for the second summand (7),

$$(7) = \left| y^{\top}V^{\top}S^{\top}SVU^{\top}(UV^{\top}VU^{\top})^{\dagger}f - y^{\top}V^{\top}S^{\top}SVU^{\top}(UV^{\top}S^{\top}SVU^{\top})^{\dagger}f \right|$$
$$= \left| y^{\top}V^{\top}S^{\top}SV\,U^{\top}(UV^{\top}VU^{\top})^{\dagger}f - y^{\top}V^{\top}S^{\top}SV\,U^{\top}\big(U(V^{\top}S^{\top}SV)U^{\top}\big)^{\dagger}f \right|$$
$$\le \|y^{\top}V^{\top}S^{\top}SV\|\,\|(V^{\top}S^{\top}SV)^{\dagger} - V^{\top}V\|\,\|U^{\dagger}f\|$$
$$\le (1+\varepsilon)\,2\varepsilon\,\|y\|\,\|X^{\dagger}f\|.$$

The last inequality follows from the following two observations. First, the inverse singular values are approximated to within a factor $(1 \pm 2\varepsilon)$ by Observation 1 in (Geppert et al., 2017). Thus $\|(V^{\top}S^{\top}SV)^{\dagger} - V^{\top}V\| \le 2\varepsilon$. Second, we have

$$\|y^{\top}V^{\top}S^{\top}SV\| \le \|y^{\top}V^{\top}S^{\top}SV - y^{\top}V^{\top}V\| + \|y\|$$
$$\le \|y\|\,\|V^{\top}S^{\top}SV - I\| + \|y\| \le (1+\varepsilon)\,\|y\|.$$

Summing up, the triangle inequality implies

$$|\mu(x) - \tilde{\mu}(x)| \le (6) + (7) \le 5\varepsilon\,\|y\|\,\|X^{\dagger}f\|.$$

Similarly, we can show for the variance function that

$$|\sigma^2(x) - \tilde{\sigma}^2(x)|$$
$$= \left| x^{\top}x - x^{\top}X^{\top}(XX^{\top})^{\dagger}Xx - x^{\top}S^{\top}Sx + x^{\top}S^{\top}SX^{\top}(XS^{\top}SX^{\top})^{\dagger}XS^{\top}Sx \right|$$
$$\le \left| x^{\top}x - x^{\top}S^{\top}Sx \right| + \left| x^{\top}X^{\top}(XX^{\top})^{\dagger}Xx - x^{\top}S^{\top}SX^{\top}(XS^{\top}SX^{\top})^{\dagger}XS^{\top}Sx \right|$$
$$(8) \qquad \le \left| x^{\top}x - x^{\top}S^{\top}Sx \right|$$
$$(9) \qquad\quad + \left| x^{\top}X^{\top}(XX^{\top})^{\dagger}Xx - x^{\top}S^{\top}SX^{\top}(XX^{\top})^{\dagger}Xx \right|$$
$$(10) \qquad\quad + \left| x^{\top}S^{\top}SX^{\top}(XX^{\top})^{\dagger}Xx - x^{\top}S^{\top}SX^{\top}(XX^{\top})^{\dagger}XS^{\top}Sx \right|$$
$$(11) \qquad\quad + \left| x^{\top}S^{\top}SX^{\top}(XX^{\top})^{\dagger}XS^{\top}Sx - x^{\top}S^{\top}SX^{\top}(XS^{\top}SX^{\top})^{\dagger}XS^{\top}Sx \right|$$

Again we bound items (8)-(11) separately. We have

$$(8) = \left| y^{\top}V^{\top}Vy - y^{\top}V^{\top}S^{\top}SVy \right| \le \|y\|^2\,\|V^{\top}S^{\top}SV - I\| \le \varepsilon\,\|y\|^2,$$

and

$$(9) = \left| y^{\top}V^{\top}VU^{\top}(XX^{\top})^{\dagger}Xx - y^{\top}V^{\top}S^{\top}SVU^{\top}(XX^{\top})^{\dagger}Xx \right|$$
$$\le \|y\|\,\|V^{\top}S^{\top}SV - I\|\,\|U^{\top}(XX^{\top})^{\dagger}Xx\|$$
$$\le \varepsilon\,\|y\|\,\|U^{\top}(UV^{\top}VU^{\top})^{\dagger}UV^{\top}Vy\| \le \varepsilon\,\|y\|\,\|U^{\top}(UV^{\top}VU^{\top})^{\dagger}U\|\,\|y\| \le \varepsilon\,\|y\|^2,$$

and

$$(10) = \left| x^{\top}S^{\top}SX^{\top}(XX^{\top})^{\dagger}UV^{\top}Vy - x^{\top}S^{\top}SX^{\top}(XX^{\top})^{\dagger}UV^{\top}S^{\top}SVy \right|$$
$$\le \|x^{\top}S^{\top}SX^{\top}(XX^{\top})^{\dagger}U\|\,\|V^{\top}S^{\top}SV - I\|\,\|y\|$$
$$\le \varepsilon\,\|y\|\,\|y^{\top}V^{\top}S^{\top}SVU^{\top}(XX^{\top})^{\dagger}U\|$$
$$\le \varepsilon\,\|y\|\,\big( \|y^{\top}(V^{\top}S^{\top}SV - I)U^{\top}(XX^{\top})^{\dagger}U\| + \|y^{\top}V^{\top}VU^{\top}(XX^{\top})^{\dagger}U\| \big)$$
$$\le \varepsilon\,\|y\|\,\big( \|y\|\,\|V^{\top}S^{\top}SV - I\|\,\|U^{\top}(XX^{\top})^{\dagger}U\| + \|y\|\,\|U^{\top}(XX^{\top})^{\dagger}U\| \big)$$
$$\le \varepsilon(1+\varepsilon)\,\|y\|^2\,\|U^{\top}(UV^{\top}VU^{\top})^{\dagger}U\| \le 2\varepsilon\,\|y\|^2.$$

For the last summand, observe

$$(11) = \left| x^{\top}S^{\top}SX^{\top}(UV^{\top}VU^{\top})^{\dagger}XS^{\top}Sx - x^{\top}S^{\top}SX^{\top}\big(U(V^{\top}S^{\top}SV)U^{\top}\big)^{\dagger}XS^{\top}Sx \right|$$
$$\le \|(V^{\top}S^{\top}SV)^{\dagger} - I\|\,\|x^{\top}S^{\top}SX^{\top}(U^{\top})^{\dagger}\|^2$$
$$\le 2\varepsilon\,\|y^{\top}V^{\top}S^{\top}SVU^{\top}(U^{\top})^{\dagger}\|^2$$
$$\le 2\varepsilon\,\big( \|y^{\top}(V^{\top}S^{\top}SV - I)U^{\top}(U^{\top})^{\dagger}\| + \|y^{\top}V^{\top}VU^{\top}(U^{\top})^{\dagger}\| \big)^2$$
$$\le 2\varepsilon\,\big( \|y\|\,\|V^{\top}S^{\top}SV - I\|\,\|U^{\top}(U^{\top})^{\dagger}\| + \|y\|\,\|U^{\top}(U^{\top})^{\dagger}\| \big)^2$$
$$\le 2\varepsilon(1+\varepsilon)^2\,\|y\|^2\,\|U^{\top}(U^{\top})^{\dagger}\|^2 \le 8\varepsilon\,\|y\|^2.$$

The second part of Theorem 2 follows by the triangle inequality

$$|\sigma^2(x) - \tilde{\sigma}^2(x)| \le (8) + (9) + (10) + (11) \le 12\varepsilon\,\|y\|^2.$$

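An informal way to see Theorem 2 at work is to compare the linear-kernel posterior computed in the original space with the one computed after a count-sketch embedding. The following numerical sketch is illustrative only (random data in a low-dimensional active subspace, not a benchmark from the paper).

```python
# Illustrative numerical check of Theorem 2 for the linear kernel (not from the paper):
# compare the exact posterior (4)-(5) with the posterior computed after a count-sketch embedding.
import numpy as np

rng = np.random.default_rng(3)
D, d_e, d, l = 500, 3, 200, 25

V, _ = np.linalg.qr(rng.normal(size=(D, d_e)))     # active subspace
U = rng.normal(size=(l, d_e))
X = U @ V.T                                        # data points lie in the active subspace
f = rng.normal(size=l)                             # observed values
x = V @ rng.normal(size=d_e)                       # query point in the subspace

# Count-sketch embedding S: R^D -> R^d.
h = rng.integers(0, d, size=D)
sigma = rng.choice([-1.0, 1.0], size=D)
S = np.zeros((d, D)); S[h, np.arange(D)] = sigma

def posterior(Xm, xq):
    Kp = np.linalg.pinv(Xm @ Xm.T)
    mu = xq @ Xm.T @ Kp @ f
    var = xq @ xq - xq @ Xm.T @ Kp @ (Xm @ xq)
    return mu, var

mu, var = posterior(X, x)                  # in the high-dimensional space
mu_t, var_t = posterior(X @ S.T, S @ x)    # after the embedding
print(abs(mu - mu_t), abs(var - var_t))    # small, in line with Theorem 2
```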
2.4. Generalization to other Popular Kernels


Sarlós' (2006) original argument to construct a subspace embedding relies on an additive approximation for inner products. The analysis is conducted on unit norm vectors and extends to the entire space by linearity. The assumption is thus not restrictive but simplifies the presentation. We show small error approximation guarantees for several popular classes of kernels. For the linear kernel $K(x,y) = x^{\top}y$, this is an immediate consequence of Definition 1 via the parallelogram rule (Arriaga & Vempala, 2006):

Corollary 3. If $S$ is an $\varepsilon$-subspace embedding for $V$ then for all $x, y \in \{v \in \mathbb{R}^D \mid v = Vu,\ u \in \mathbb{R}^{d_e}\}$ we have

$$x^{\top}y - \varepsilon\,\|x\|\,\|y\| \;\le\; x^{\top}S^{\top}Sy \;\le\; x^{\top}y + \varepsilon\,\|x\|\,\|y\|.$$

We begin with the polynomial kernel $K(x,y) = (x^{\top}y + c)^p$.

Lemma 4. Let $K(x,y) = (x^{\top}y + c)^p$, $p \in \mathbb{N}$ be the polynomial kernel with $c \ge 2$. Let $S$ be an $\varepsilon$-subspace embedding for $V$. Then for all $x, y \in \{v \in \mathbb{R}^D \mid v = Vu,\ u \in \mathbb{R}^{d_e}\}$ with $\|x\| = \|y\| = 1$ we have $|K(x,y) - K(Sx,Sy)| \le \varepsilon p\,K(x,y)$.

Proof. By Corollary 3 we have

$$\tilde{K}(x,y) = (x^{\top}S^{\top}Sy + c)^p \le (x^{\top}y + \varepsilon\|x\|\|y\| + c)^p = (x^{\top}y + c)^p \left(1 + \frac{\varepsilon\|x\|\|y\|}{x^{\top}y + c}\right)^p$$

and

$$\tilde{K}(x,y) = (x^{\top}S^{\top}Sy + c)^p \ge (x^{\top}y - \varepsilon\|x\|\|y\| + c)^p = (x^{\top}y + c)^p \left(1 - \frac{\varepsilon\|x\|\|y\|}{x^{\top}y + c}\right)^p.$$

Thus by Bernoulli's inequality

$$|K(x,y) - \tilde{K}(x,y)| = \left| (x^{\top}y + c)^p - (x^{\top}S^{\top}Sy + c)^p \right|$$
$$\le \left| (x^{\top}y + c)^p \left( 1 - \Big(1 \pm \frac{\varepsilon\|x\|\|y\|}{x^{\top}y + c}\Big)^{p} \right) \right|$$
$$\le \left| (x^{\top}y + c)^p \left( 1 - 1 + \frac{\varepsilon p\,\|x\|\|y\|}{x^{\top}y + c} \right) \right|$$
$$= (x^{\top}y + c)^{p-1}\,\varepsilon p\,\|x\|\|y\| \;\le\; \varepsilon p\,K(x,y).$$

Next we bound the approximation guarantee under $\varepsilon$-subspace embeddings for the (squared) exponential kernel $K(x,y) = \exp(-\|x-y\|^p / l^p)$ for $p \in \{1,2\}$. To this end, we begin with a claim proven in the Supplement.

Claim 5. Let $z \in [0,1]$. Fix $\delta \in (0, \frac{1}{2})$. Then we have $0 \le z - z^{1+\delta} \le \delta$ and $0 \le z^{1-\delta} - z \le \delta$.

Lemma 6. Let $K(x,y) = \exp(-\|x-y\|^p / l^p)$, $p \in \{1,2\}$ be the (squared) exponential kernel. Let $S$ be an $\varepsilon$-subspace embedding for $V$. Then for all $x, y \in \{v \in \mathbb{R}^D \mid v = Vu,\ u \in \mathbb{R}^{d_e}\}$ we have $|K(x,y) - K(Sx,Sy)| \le \varepsilon$.

Proof. $S$ is an $\varepsilon$-subspace embedding. Thus for any $x, y$ in the embedded subspace there exists some $\delta \in (-\frac{1}{2}, \frac{1}{2})$ with $|\delta| \le \varepsilon$ such that

$$|K(x,y) - \tilde{K}(x,y)| = \left| \exp\!\Big(-\frac{\|x-y\|^2}{l^2}\Big) - \exp\!\Big(-\frac{\|Sx-Sy\|^2}{l^2}\Big) \right|$$
$$= \left| \exp\!\Big(-\frac{\|x-y\|^2}{l^2}\Big) - \exp\!\Big(-(1+\delta)\frac{\|x-y\|^2}{l^2}\Big) \right|$$
$$= \left| K(x,y) - K(x,y)^{1+\delta} \right| \le |\delta| \le \varepsilon,$$

where the last line holds by Claim 5 since $K(x,y) = \exp(-\|x-y\|^2/l^2) \in [0,1]$.

Note that the approximation guarantee of Definition 1 implies that the (non-squared) Euclidean norms are approximated to within $\sqrt{1+\varepsilon} \le 1+\varepsilon$ and $\sqrt{1-\varepsilon} \ge 1-\varepsilon$ respectively. The proof thus applies unchanged to the simple exponential kernel $K(x,y) = \exp(-\|x-y\|/l)$.

The Matérn kernel is defined as

$$K(x,y) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}}{l}\,\|x-y\| \right)^{\nu} B_{\nu}\!\left( \frac{\sqrt{2\nu}}{l}\,\|x-y\| \right),$$

where $\nu > 0$ is a parameter and $B_{\nu}$ is a modified Bessel function of the second kind. For parameters of the form $\nu = p + \frac{1}{2}$, $p \in \mathbb{N}$ the above characterization simplifies and can in particular be expressed as the product of a polynomial of degree $p$ and an exponential function. To this end for $x, y$ and length scale parameter $l$ let $\rho = \sqrt{2\nu}\,\|x-y\|/l$. Then

$$K(x,y) = K(\rho) = (1 + \mathrm{poly}(\rho))\,\exp(-\rho),$$

where $\mathrm{poly}(\rho) = \sum_{i=1}^{p} c_i \rho^i$ for coefficients $c_i > 0$, $i \in [p]$, depending on $\nu$ and $p$. The limiting cases are the simple exponential kernel for $p = 0$, $\nu = \frac{1}{2}$ and the squared exponential kernel for $\nu \to \infty$. The Matérn kernel converges quickly for larger values $p \ge 3$, $\nu \ge \frac{7}{2}$ and is nearly indistinguishable from the limiting case for finite and noisy data. We thus give two examples which are most interesting in machine learning (Rasmussen & Williams, 2006):

$$\text{for } p = 1: \quad K_{3/2}(\rho) = (1 + \rho)\,\exp(-\rho),$$
$$\text{for } p = 2: \quad K_{5/2}(\rho) = (1 + \rho + \rho^2/3)\,\exp(-\rho).$$

Lemma 7. Let $K(x,y)$ be the Matérn kernel with parameter $\nu = p + \frac{1}{2}$, $p \in \mathbb{N}$. Let $S$ be an $\varepsilon$-subspace embedding for $V$. Then for all $x, y \in \{v \in \mathbb{R}^D \mid v = Vu,\ u \in \mathbb{R}^{d_e}\}$ we have $|K(x,y) - K(Sx,Sy)| \le 2\varepsilon$.

Proof. $K(x,y)$ is normalized since it is bounded by the limiting squared exponential kernel $K(\rho) \le \exp(-C\rho^2) \le 1$ for some absolute constant $C > 0$, cf. (Rasmussen & Williams, 2006). $S$ is an $\varepsilon$-subspace embedding. Thus for any $x, y$ in the embedded subspace there exists some $\delta \in (-\frac{1}{2}, \frac{1}{2})$ with $|\delta| \le \varepsilon$ such that $\|Sx - Sy\| = (1+\delta)\,\|x - y\|$. Thus we have

$$|K(\rho) - \tilde{K}(\rho)| = \left| \frac{1 + \mathrm{poly}(\rho)}{\exp(\rho)} - \frac{1 + \mathrm{poly}((1+\delta)\rho)}{\exp((1+\delta)\rho)} \right|$$
$$(12) \qquad \le \left| \frac{1 + \mathrm{poly}(\rho)}{\exp(\rho)} - \left( \frac{1 + \mathrm{poly}(\rho)}{\exp(\rho)} \right)^{1+\delta} \right|$$
$$(13) \qquad\quad + \left| \left( \frac{1 + \mathrm{poly}(\rho)}{\exp(\rho)} \right)^{1+\delta} - \frac{1 + \mathrm{poly}((1+\delta)\rho)}{\exp((1+\delta)\rho)} \right|.$$

Since $K(\rho) \in [0,1]$ it follows again from Claim 5 that $(12) = |K(\rho) - K(\rho)^{1+\delta}| \le |\delta| \le \varepsilon$. In the second term the denominators are equal, since $(\exp(\rho))^{1+\delta} = \exp(\rho(1+\delta))$, so we continue only with the numerators of (13). The first term is always greater than the second for positive $\delta$. Moreover note that $\mathrm{poly}((1+\delta)\rho) \ge (1+\delta)\,\mathrm{poly}(\rho)$ since in each term we have $(1+\delta)^i \ge (1+\delta)$. For negative $\delta$ this is reversed but then also $(1+\delta)^i \le (1+\delta)$ is reversed. It can be shown similarly to Claim 5 that

$$\left| (1 + \mathrm{poly}(\rho))^{1+\delta} - (1 + (1+\delta)\,\mathrm{poly}(\rho)) \right| \le |\delta|\,\exp((1+\delta)\rho) \le \varepsilon\,\exp((1+\delta)\rho).$$

So $(13) \le \varepsilon$ and thus $|K(\rho) - \tilde{K}(\rho)| \le 2\varepsilon$.

When generalizing Theorem 2 for the kernel functions above, we face two problems. First, since the implicit feature space may have infinite dimension it is unclear whether the singular value decomposition (SVD) exists. To see that it does, note that the rank of the kernel matrix might grow as large as the number of points. But it remains bounded and it was shown in (Bell, 2014) that any bounded rank linear operator has an SVD. Second, we have to argue that we have a subspace embedding restricted to the space spanned by the feature vectors. Recall that in the original proof of Sarlós (2006), it is sufficient to have an additive error guarantee for the inner product of unit vectors as shown above.

Corollary 8. The claims of Theorem 2 hold when using the normalized kernel functions with ε replaced by the additive error bounds given in Lemmas 4 - 7.

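A simple empirical illustration of the guarantees in Lemmas 4-7 and Corollary 8 (again an assumption-laden sketch, not the paper's code) is to sample points from a low-dimensional subspace and measure the additive kernel error under a count-sketch embedding:

```python
# Illustrative empirical check of the kernel guarantees in Sect. 2.4:
# measure |K(x, y) - K(Sx, Sy)| for the squared exponential and Matern-5/2 kernels
# when x, y lie in a low-dimensional subspace and S is a count-sketch embedding.
import numpy as np

rng = np.random.default_rng(4)
D, d_e, d, ell = 1000, 4, 300, 2.0

V, _ = np.linalg.qr(rng.normal(size=(D, d_e)))
h = rng.integers(0, d, size=D)
sigma = rng.choice([-1.0, 1.0], size=D)
S = np.zeros((d, D)); S[h, np.arange(D)] = sigma

def k_se(a, b):                       # squared exponential kernel
    return np.exp(-np.sum((a - b) ** 2) / ell ** 2)

def k_matern52(a, b):                 # Matern kernel with nu = 5/2
    rho = np.sqrt(5.0) * np.linalg.norm(a - b) / ell
    return (1.0 + rho + rho ** 2 / 3.0) * np.exp(-rho)

errs_se, errs_m = [], []
for _ in range(200):
    x, y = V @ rng.normal(size=d_e), V @ rng.normal(size=d_e)
    errs_se.append(abs(k_se(x, y) - k_se(S @ x, S @ y)))
    errs_m.append(abs(k_matern52(x, y) - k_matern52(S @ x, S @ y)))
print(max(errs_se), max(errs_m))      # small additive errors, as in Lemmas 6 and 7
```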
3. Numerical Results


We demonstrate the performance of the proposed embedding technique on a variety of functions, combining it with KG of Cornell-MOE (Wu & Frazier, 2016; Wu et al., 2017), BM (McLeod et al., 2018), and EI that was also used for REMBO, referred to as HeSBO-KG, HeSBO-BM, and HeSBO-EI respectively. The evaluation compares HeSBO to the state-of-the-art for high-dimensional BO (see Sect. 1): REMBO-$k_Y$ that fits a GP model to $Y$ using the distances in $Y$, and REMBO-$k_X$ that uses distances after projection to $X$. This addresses the problem that REMBO-$k_Y$ tends to over-explore the boundary of $X$ since those points may appear distant when selected in $Y$ (Wang et al., 2016b, cf. p. 371). REMBO-$k_{\psi}$ (Binois et al., 2015) uses a more sophisticated warping to handle points mapped to the boundary of $X$ better. All REMBO variants use the EI criterion. ADD (Gardner et al., 2017), SKL (Wang et al., 2017), and EBO (Wang et al., 2018) represent methods that learn the structure of the search space. They were omitted for 1000-dimensional test functions due to high computational cost. BOCK (Oh et al., 2018) uses a cylindrical transformation of the space to achieve an excellent performance on problems of higher dimension. We implemented HeSBO-EI and all REMBO variants in Python 3 using GPy. The code for the embedding is available at github.com/aminnayebi/HesBO. For all other algorithms we used the authors' reference implementations (see the bibliography).

Experimental Setup. We report the function values of the recommended solutions as a function of (1) the number of steps performed by the algorithm, or (2) of the wall-clock time. For the latter it is important to note that all experiments were performed on dedicated machines with identical resources. Error bars are shown at the median ± two standard errors of the median, averaged over at least 100 replications. See Sect. D for more details. When algorithms show large discrepancies in the performance, we only plot the best for clarity. See Sect. E for additional plots that show all algorithms.

Performance on Test Functions. We compared the algorithms on the following test functions: (1) Branin, (2) Hartmann-6, (3) Rosenbrock, and (4) Styblinski-Tang (StybTang) with input dimension $D \in \{25, 100, 1000\}$. The first three have a low-dimensional active subspace: 2 dimensions for Branin and Rosenbrock, 6 for Hartmann-6. StybTang is defined on all $D$ dimensions. The proposed subspace embedding in combination with KG and BM, two state-of-the-art methods for low-dimensional BO, outperforms the state-of-the-art on all benchmarks, including the high-dimensional StybTang, except for Hartmann-6. Looking closer, we found that EI, KG and BM (without the embedding) did not converge on the 6-d Hartmann function within this budget, hence we believe that this is not caused by the embedding but likely due to the algorithms' default settings. Fig. 1 shows the results for $D = 100$; see Sect. E for $D \in \{25, 1000\}$. ADD's performance is comparable to HeSBO-EI but has considerably higher computational cost. We will revisit this aspect below.




Figure 1. Performances on Branin (ul) with target dimension $d = 4$, on Hartmann-6 (ur) with $d = 6$, on Rosenbrock (bl) with $d = 4$, and on StybTang (br) with $d = 12$. The input dimension is $D = 100$. We see that the proposed embedding in combination with KG and BM achieves considerably better solutions at the same number of function evaluations than the state-of-the-art. An exception is Hartmann, where BOCK, SKL, and ADD perform best.



Robustness to Target Dimension $d$ and Input Dimension $D$. We study the robustness of the embedding based algorithms with respect to the target dimension $d$ chosen for the embedding and the input dimension $D$. Fig. 2 shows the performances on Hartmann-6 with input dimensions $D \in \{25, 1000\}$. We see that HeSBO-EI achieves




Figure 2. Robustness regarding the target dimension $d$ on Hartmann-6 with input dimensions $D = 25$ (l) and $D = 1000$ (r). The performance of the new subspace embedding HeSBO-EI is robust for the different target dimensions and outperforms the best REMBO variant $k_X$ for all choices of $d$.


essentially the same performance for all target dimensions $d$ across the different input dimensions $D$. Note that in particular HeSBO-EI's performance does not degrade when the target dimension $d$ is chosen smaller than six, the dimensionality of Hartmann-6. We also tested the robustness on Hartmann-6 with input dimension $D = 100$ and for Branin, and observed a similar robustness in both cases. HeSBO-EI and the three REMBO variants use the same GP model and the same acquisition function EI, thus we attribute the considerably better robustness to the hashing based subspace embedding proposed in Sect. 2.

Analysis of the Scalability. We study the scalability of the algorithms in Fig. 3. Here we examine the performance as a function of the wall-clock time that essentially equals the computational overhead of the respective methods. We




Figure 3. Comparison of the running times on Branin with target dimension $d = 4$ (l) and on 100d-StybTang with $d = 12$ (r). The input dimension is $D = 100$. We see that the proposed subspace embedding achieves considerably better solutions at the same computational cost compared to the other baselines, in particular REMBO. HeSBO-EI and HeSBO-BM perform best here.


observe that the embedding-based approaches HeSBO-EI and HeSBO-BM obtain considerably better solutions than the other algorithms at the same cost.

Neural Network Parameter Search. We evaluate the algorithms on a 100-dimensional neural network (NN) optimization task proposed by Oh et al. (2018). Here we are given a NN with one hidden layer of size ten. The goal is to choose the weights between the hidden layer and the outputs in order to minimize the loss on the MNIST data set (LeCun et al., 2017). The other weights and biases are optimized by Adam (Kingma & Ba, 2014). We refer to their paper for details. Fig. 4 summarizes the observed performances on this benchmark for target dimension $d = 12$. We note in passing that $d = 6$ and $d = 24$ gave similar results. HeSBO-KG and BOCK obtain the best performance,

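For concreteness, a rough sketch of how such a 100-dimensional objective can be wired up is given below; it assumes PyTorch and torchvision, and the network layout, training protocol, and reported loss are illustrative guesses rather than the exact benchmark of Oh et al. (2018).

```python
# Rough sketch (illustrative, assuming PyTorch/torchvision; not the benchmark's exact protocol):
# the 100 BO parameters are the 10x10 hidden-to-output weights of a one-hidden-layer MNIST net.
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms

def nn_objective(w_out, epochs=1):
    """w_out: 100 values supplied by the optimizer, used as fixed hidden-to-output weights."""
    data = datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor())
    loader = torch.utils.data.DataLoader(data, batch_size=128, shuffle=True)

    w1 = (0.01 * torch.randn(10, 28 * 28)).requires_grad_()        # input-to-hidden weights (trained)
    b1 = torch.zeros(10, requires_grad=True)                       # hidden biases (trained)
    b2 = torch.zeros(10, requires_grad=True)                       # output biases (trained)
    w2 = torch.tensor(w_out, dtype=torch.float32).reshape(10, 10)  # fixed by the BO algorithm

    opt = torch.optim.Adam([w1, b1, b2])                           # Adam trains the remaining parameters
    loss = torch.tensor(0.0)
    for _ in range(epochs):
        for xb, yb in loader:
            hidden = torch.tanh(xb.view(xb.size(0), -1) @ w1.T + b1)
            logits = hidden @ w2.T + b2
            loss = F.cross_entropy(logits, yb)
            opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)   # value exposed to the optimizer; the benchmark's exact loss may differ
```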



Figure 4. The NN benchmark with target dimension $d = 12$. We observe that HeSBO-KG and BOCK achieve the best performance. Note that HeSBO-EI outperforms EBO that operates on the high-dimensional domain.


followed by HeSBO-EI. Particularly remarkable for this benchmark is that the simple, embedding-based HeSBO-EI finds better solutions than EBO, although the latter operates directly on the high-dimensional search space. This suggests that this high-dimensional problem possesses a low-dimensional approximation of sufficient accuracy.

4. Conclusion


Enabling Bayesian optimization for high-dimensional problems is seen as being of great practical and theoretical interest, e.g., see (Frazier, 2018). This paper advances both directions. We have proven for a large class of GP-based BO methods that the optimization process would proceed essentially as it would if invoked directly on the active subspace. The experimental evaluation demonstrated the practical value of the proposed embedding technique on a variety of benchmarks. In particular, HeSBO performs better than the state-of-the-art for high-dimensional BO based on random embeddings and structure learning. It is important to note that the subspace embedding has a negligible computational overhead and is agnostic to the way data is acquired. It is thus straightforward to combine it with existing methods, e.g., for batch acquisition (Chevalier & Ginsbourger, 2013; Marmin et al., 2015; Shah & Ghahramani, 2015; Wu & Frazier, 2016; Wang et al., 2016a; Wu et al., 2017). We look forward to "exotic" applications, e.g., in multifidelity optimization.


Acknowledgements


The authors would like to thank Peter Frazier, Jian Wu, Pu Yang, Yunxiang Duke Zhang, Jake Gardner, and Mark McLeod for their help with their respective algorithms. Moreover, they are grateful to the anonymous reviewers for their valuable comments. Alexander Munteanu was partly supported by the German Science Foundation (DFG), Collaborative Research Center SFB 876, project C4 and by the Dortmund Data Science Center (DoDSc).

References


Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.

Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39 (1):302-322, 2009.

Ailon, N. and Liberty, E. Fast dimension reduction using Rademacher series on dual BCH codes. Discrete & Computational Geometry, 42(4):615-630, 2009.

Arriaga, R. I. and Vempala, S. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63 (2):161-182, 2006.

Bell, J. The singular value decomposition of compact operators on Hilbert spaces. unpublished, 2014.

Binois, M., Ginsbourger, D., and Roustant, O. A warped kernel improving robustness in Bayesian optimization via random embeddings. In International Conference on Learning and Intelligent Optimization, pp. 281-286. Springer, 2015.

Binois, M., Ginsbourger, D., and Roustant, O. On the choice of the low-dimensional domain for global optimization via random embeddings. Journal of Global Optimization, 2019. Accepted for Publication.

Brochu, E., Cora, V. M., and De Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

Charikar, M., Chen, K. C., and Farach-Colton, M. Finding frequent items in data streams. Theoretical Computer Science, 312(1): 3-15, 2004.

Chen, B., Castro, R. M., and Krause, A. Joint optimization and variable selection of high-dimensional Gaussian processes. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 1379-1386. Omni-press, 2012.

Chevalier, C. and Ginsbourger, D. Fast computation of the multi-points expected improvement with applications in batch selection. In Learning and Intelligent Optimization, pp. 59-69. Springer, 2013.

Clarkson, K. L. and Woodruff, D. P. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pp. 81-90, 2013.

Cohen, M. B., Nelson, J., and Woodruff, D. P. Optimal approximate matrix product in terms of stable rank. In Proceedings of the 43rd International Colloquium on Automata, Languages, and Programming (ICALP), pp. 11:1-11:14, 2016.

Djolonga, J., Krause, A., and Cevher, V. High-dimensional gaussian process bandits. In Advances in Neural Information Processing Systems, pp. 1025-1033, 2013.

Eriksson, D., Dong, K., Lee, E., Bindel, D., and Wilson, A. G. Scaling Gaussian process regression with derivatives. In Advances in Neural Information Processing Systems, pp. 6868-6878, 2018.

Frazier, P. I. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

Frazier, P. I., Powell, W. B., and Dayanik, S. The Knowledge Gradient Policy for Correlated Normal Beliefs. INFORMS Journal on Computing, 21(4):599-613, 2009.

Frean, M. and Boyle, P. Using Gaussian processes to optimize expensive functions. In Australasian Joint Conference on Artificial Intelligence, pp. 258-267. Springer, 2008.

Gardner, J., Guo, C., Weinberger, K., Garnett, R., and Grosse, R. Discovering and exploiting additive structure for Bayesian optimization. In Artificial Intelligence and Statistics, pp. 1311- 1319, 2017.

Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. Bayesian optimization with inequality constraints. In Proceedings of the 31th International Conference on Machine Learning, pp. 937-945, 2014.

Geppert, L. N., Ickstadt, K., Munteanu, A., Quedenfeld, J., and Sohler, C. Random projections for Bayesian regression. Statistics and Computing, 27(1):79-101, 2017.

Hernández-Lobato, J. M., Gelbart, M. A., Hoffman, M. W., Adams, R. P., and Ghahramani, Z. Predictive entropy search for Bayesian optimization with unknown constraints. In ICML, 2015.

Huang, D., Allen, T., Notz, W., and Miller, R. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 32(5):369-382, 2006.

Johnson, W. B. and Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26 (1):189-206, 1984.

Kandasamy, K., Schneider, J., and Póczos, B. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pp. 295-304, 2015.

Kandasamy, K., Dasarathy, G., Oliva, J. B., Schneider, J., and Poczos, B. Gaussian process bandit optimisation with multi-fidelity evaluations. In Advances in Neural Information Processing Systems, 2016. The code is available at https: //github.com/kirthevasank/mf-gp-ucb. Last Accessed on 05/01/2019.

Kandasamy, K., Neiswanger, W., Schneider, J., Poczos, B., and Xing, E. Neural architecture search with Bayesian optimisation and optimal transport. arXiv preprint arXiv:1802.07191, 2018.

Kane, D. M. and Nelson, J. Sparser Johnson-Lindenstrauss transforms. J. ACM, 61(1):4:1-4:23, 2014.


Kennedy, M. C. and O'Hagan, A. Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1):1-13, 2000.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Klein, A., Falkner, S., Bartels, S., Hennig, P., and Hutter, F. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, pp. 528-536, 2017.

LeCun, Y., Cortes, C., and Burges, C. J. The MNIST database of handwritten digits, 2017. http://yann.lecun.com/ exdb/mnist/. Last Accessed on 05/13/2019.

Marmin, S., Chevalier, C., and Ginsbourger, D. Differentiating the multipoint expected improvement for optimal batch design. hal-01133220v2, 2015.

McLeod, M., Roberts, S., and Osborne, M. A. Optimization, fast and slow: Optimally switching between local and Bayesian optimization. In International Conference on Machine Learning, pp. 3440-3449, 2018.

Mutny, M. and Krause, A. Efficient high dimensional bayesian optimization with additivity and quadrature fourier features. In Advances in Neural Information Processing Systems, pp. 9005- 9016, 2018.

Nelson, J. and Nguyen, H. L. Sparsity lower bounds for dimensionality reducing maps. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pp. 101-110, 2013.

Nelson, J. and Nguyên, H. L. Lower bounds for oblivious subspace embeddings. In Proceedings of the 41st International Colloquium on Automata, Languages, and Programming (ICALP), pp. 883-894, 2014.

Oh, C., Gavves, E., and Welling, M. BOCK: Bayesian optimization with cylindrical kernels. In International Conference on Machine Learning, 2018. The implementation is available at https://github.com/ChangYong-Oh/ HyperSphere. Last Accessed on 05/10/2019.

Paul, S., Boutsidis, C., Magdon-Ismail, M., and Drineas, P. Random projections for linear support vector machines. ACM Transactions on Knowledge Discovery from Data, 8(4):22:1-22:25, 2014.

Picheny, V., Gramacy, R. B., Wild, S., and Le Digabel, S. Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian. In Advances in Neural Information Processing Systems, pp. 1435-1443, 2016.

Poloczek, M., Wang, J., and Frazier, P. I. Multi-information source optimization. In Advances in Neural Information Processing Systems, pp. 4288-4298, 2017.

Rasmussen, C. E. and Williams, C. K. I. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, 2006.

Rolland, P., Scarlett, J., Bogunovic, I., and Cevher, V. High-dimensional Bayesian optimization via additive models with overlapping groups. In International Conference on Artificial Intelligence and Statistics, pp. 298-307, 2018.

Sarlós, T. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 143-152, 2006.

Shah, A. and Ghahramani, Z. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pp. 3330-3338, 2015.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016.

Swersky, K., Snoek, J., and Adams, R. P. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pp. 2004-2012, 2013.

Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, 2016a.

Wang, Z., Hutter, F., Zoghi, M., Matheson, D., and de Freitas, N. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55: 361-387, 2016b.

Wang, Z., Li, C., Jegelka, S., and Kohli, P. Batched high-dimensional Bayesian optimization via structural kernel learning. In International Conference on Machine Learning, pp. 3656-3664, 2017. The implementation is available at https://github.com/zi-w/ Structural-Kernel-Learning-for-HDBBO. Last Accessed on 05/10/2019.

Wang, Z., Gehring, C., Kohli, P., and Jegelka, S. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, pp. 745-754, 2018. The implementation is available at https://github.com/zi-w/ Ensemble-Bayesian-Optimization. Last Accessed on 05/10/2019.

Woodruff, D. P. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10 (1-2):1-157, 2014.

Wu, J. and Frazier, P. The parallel knowledge gradient method for batch Bayesian optimization. In Advances in Neural Information Processing Systems, pp. 3126-3134, 2016. Cornell-MOE is available at https://github.com/wujian16/ Cornell-MOE. Last Accessed on 05/10/2019.

Wu, J., Poloczek, M., Wilson, A. G., and Frazier, P. I. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, 2017. Cornell-MOE is available at https://github.com/wujian16/ Cornell-MOE. Last Accessed on 05/10/2019.