
Taking the Human Out of the Loop: A Review of Bayesian Optimization


Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams and Nando de Freitas

Abstract-Big data applications are typically associated with systems involving large numbers of users, massive complex software systems, and large-scale heterogeneous computing and storage architectures. The construction of such systems involves many distributed design choices. The end products (e.g., recommendation systems, medical analysis tools, real-time game engines, speech recognizers) thus involve many tunable configuration parameters. These parameters are often specified and hard-coded into the software by various developers or teams. If optimized jointly, these parameters can result in significant improvements. Bayesian optimization is a powerful tool for the joint optimization of design choices that has been gaining great popularity in recent years. It promises greater automation so as to increase both product quality and human productivity. This review paper introduces Bayesian optimization, highlights some of its methodological aspects, and showcases a wide range of applications.

I. INTRODUCTION


Design problems are pervasive in scientific and industrial endeavours: scientists design experiments to gain insights into physical and social phenomena, engineers design machines to execute tasks more efficiently, pharmaceutical researchers design new drugs to fight disease, companies design websites to enhance user experience and increase advertising revenue, geologists design exploration strategies to harness natural resources, environmentalists design sensor networks to monitor ecological systems, and developers design software to drive computers and electronic devices. All these design problems are fraught with choices, choices that are often complex and high-dimensional, with interactions that make them difficult for individuals to reason about.

For example, many organizations routinely use the popular mixed integer programming solver IBM ILOG CPLEX 1 for scheduling and planning. This solver has 76 free parameters, which the designers must tune manually - an overwhelming number to deal with by hand. This search space is too vast for anyone to effectively navigate.

More generally, consider teams in large companies that develop software libraries for other teams to use. These libraries have hundreds or thousands of free choices and parameters that interact in complex ways. In fact, the level of complexity is often so high that it becomes impossible to find domain experts capable of tuning these libraries to generate a new product.

As a second example, consider massive online games involving the following three parties: content providers, users, and the analytics company that sits between them. The analytics company must develop procedures to automatically design game variants across millions of users; the objective is to enhance user experience and maximize the content provider's revenue.

The preceding examples highlight the importance of automating design choices. For a nurse scheduling application, we would like to have a tool that automatically chooses the 76 CPLEX parameters so as to improve healthcare delivery. When launching a mobile game, we would like to use the data gathered from millions of users in real-time to automatically adjust and improve the game. When a data scientist uses a machine learning library to forecast energy demand, we would like to automate the process of choosing the best forecasting technique and its associated parameters.

Any significant advances in automated design can result in immediate product improvements and innovation in a wide area of domains, including advertising, health-care informatics, banking, information mining, life sciences, control engineering, computing systems, manufacturing, e-commerce, and entertainment.

Bayesian optimization has emerged as a powerful solution for these varied design problems. In academia, it is impacting a wide range of areas, including interactive user-interfaces [26], robotics [101], [110], environmental monitoring [106], information extraction [158], combinatorial optimisation [79], [159], automatic machine learning [16], [143], [148], [151], [72], sensor networks [55], [146], adaptive Monte Carlo [105], experimental design [11] and reinforcement learning [27].

When software engineers develop programs, they are often faced with myriad choices. By making these choices explicit, Bayesian optimization can be used to construct optimal programs [74]: that is to say, programs that run faster or compute better solutions. Furthermore, since different components of software are typically integrated to build larger systems, this framework offers the opportunity to automate integrated products consisting of many parametrized software modules.

Mathematically, we are considering the problem of finding a global maximizer (or minimizer) of an unknown objective function f:

(1)  x^\star = \arg\max_{x \in \mathcal{X}} f(x),

where X is some design space of interest; in global optimization, X is often a compact subset of R^d, but the Bayesian optimization framework can be applied to more unusual search spaces that involve categorical or conditional inputs, or even combinatorial search spaces with multiple categorical inputs. Furthermore, we will assume the black-box function f has no simple closed form, but can be evaluated at any arbitrary query point x in the domain. This evaluation produces noise-corrupted (stochastic) outputs y ∈ R such that E[y | f(x)] = f(x). In other words, we can only observe the function f through unbiased noisy point-wise observations y. Although this is the minimum requirement for Bayesian optimization, when gradients are available, they can be incorporated in the algorithm as well; see for example Sections 4.2.1 and 5.2.4 of [99]. In this setting, we consider a sequential search algorithm which, at iteration n, selects a location x_{n+1} at which to query f and observe y_{n+1}. After N queries, the algorithm makes a final recommendation x_N, which represents the algorithm's best estimate of the optimizer.




1 http://www.ibm.com/software/commerce/optimization/cplex-optimizer/





In the context of big data applications for instance, the function f can be an object recognition system (e.g., a deep neural network) with tunable parameters x (e.g., architectural choices, learning rates, etc.) with a stochastic observable classification accuracy y = f(x) on a particular dataset such as ImageNet. Because the Bayesian optimization framework is very data efficient, it is particularly useful in situations like these where evaluations of f are costly, where one does not have access to derivatives with respect to x, and where f is non-convex and multimodal. In these situations, Bayesian optimization is able to take advantage of the full information provided by the history of the optimization to make this search efficient.

Fundamentally, Bayesian optimization is a sequential model-based approach to solving problem (1). In particular, we prescribe a prior belief over the possible objective functions and then sequentially refine this model as data are observed via Bayesian posterior updating. The Bayesian posterior represents our updated beliefs - given data - on the likely objective function we are optimizing. Equipped with this probabilistic model, we can sequentially induce acquisition functions α_n : X → R that leverage the uncertainty in the posterior to guide exploration. Intuitively, the acquisition function evaluates the utility of candidate points for the next evaluation of f; therefore x_{n+1} is selected by maximizing α_n, where the index n indicates the implicit dependence on the currently available data. Here the "data" refers to previous locations where f has been evaluated, and the corresponding noisy outputs.

In summary, the Bayesian optimization framework has two key ingredients. The first ingredient is a probabilistic surrogate model, which consists of a prior distribution that captures our beliefs about the behavior of the unknown objective function and an observation model that describes the data generation mechanism. The second ingredient is a loss function that describes how optimal a sequence of queries is; in practice, these loss functions often take the form of regret, either simple or cumulative. Ideally, the expected loss is then minimized to select an optimal sequence of queries. After observing the output of each query of the objective, the prior is updated to produce a more informative posterior distribution over the space of objective functions; see Figure 1 and Algorithm 1 for an illustration and pseudo-code of this framework. See Section 4 of [64] for another introduction.

One problem with this minimum expected risk framework is that the true sequential risk, up to the full evaluation budget, is typically computationally intractable. This has led to the introduction of many myopic heuristics known as acquisition functions, e.g., Thompson sampling, probability of improvement, expected improvement, upper-confidence-bounds, and entropy search. These acquisition functions trade off exploration and exploitation; their optima are located where the uncertainty in the surrogate model is large (exploration) and/or where the model prediction is high (exploitation). Bayesian optimization algorithms then select the next query point by maximizing such acquisition functions. Naturally, these acquisition functions are often even more multimodal and difficult to optimize, in terms of querying efficiency, than the original black-box function f. Therefore it is critical that the acquisition functions be cheap to evaluate or approximate: cheap in relation to the expense of evaluating the black-box f. Since acquisition functions have analytical forms that are easy to evaluate or at least approximate, it is usually much easier to optimize them than the original objective function.


Algorithm 1 Bayesian optimization



for n = 1, 2, ... do
    select new x_{n+1} by optimizing acquisition function α:
        x_{n+1} = argmax_x α(x; D_n)
    query objective function to obtain y_{n+1}
    augment data D_{n+1} = {D_n, (x_{n+1}, y_{n+1})}
    update statistical model
end for
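To make the loop concrete, the following Python sketch runs Algorithm 1 over a finite set of candidate points. The objective `f`, the acquisition `alpha`, and the surrogate constructor `fit_model` are hypothetical placeholders for whatever the practitioner chooses (e.g., a Gaussian process surrogate paired with one of the acquisition functions of Section IV).

```python
import numpy as np

def bayes_opt_loop(f, alpha, fit_model, candidates, n_iter, rng=np.random):
    """Sketch of Algorithm 1 over a finite set of candidate points."""
    # Start from one randomly chosen query so the surrogate has data to fit.
    X = [candidates[rng.randint(len(candidates))]]
    y = [f(X[0])]
    for n in range(n_iter):
        model = fit_model(np.array(X), np.array(y))      # update statistical model
        scores = [alpha(x, model) for x in candidates]   # acquisition alpha(x; D_n)
        x_next = candidates[int(np.argmax(scores))]      # select new x_{n+1}
        y_next = f(x_next)                               # query objective, observe y_{n+1}
        X.append(x_next)                                 # augment data D_{n+1}
        y.append(y_next)
    return X, y
```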




A. Paper overview


In this paper, we introduce the ingredients of Bayesian optimization in depth. Our presentation is unique in that we aim to disentangle the multiple components that determine the success of Bayesian optimization implementations. In particular, we focus on statistical modelling as this leads to general algorithms to solve a broad range of tasks. We also provide an extensive comparison among popular acquisition functions. We will see that the careful choice of statistical model is often far more important than the choice of acquisition function heuristic.

We begin in Sections II and III, with an introduction to parametric and non-parametric models, respectively, for binary- and real-valued objective functions. In Section IV, we will introduce many acquisition functions, compare them, and even combine them into portfolios. Several practical and implementation details, including available software packages, are discussed in Section V. A survey of theoretical results and a brief history of model-based optimization are provided in Sections VI and VII, respectively. Finally, we introduce more recent developments in Section VIII.

B. Applications of Bayesian optimization


Before embarking on a detailed introduction to Bayesian optimization, the following sections provide an overview of the many and varied successful applications of Bayesian optimization that should be of interest to data scientists.





Fig. 1. Illustration of the Bayesian optimization procedure over three iterations. The plots show the mean and confidence intervals estimated with a probabilistic model of the objective function. Although the objective function is shown, in practice it is unknown. The plots also show the acquisition functions in the lower shaded plots. The acquisition is high where the model predicts a high objective (exploitation) and where the prediction uncertainty is high (exploration). Note that the area on the far left remains unsampled, as while it has high uncertainty, it is correctly predicted to offer little improvement over the highest observation [27].


1) A/B testing: Though the idea of A/B testing dates back to the early days of advertising in the form of so-called focus groups, the advent of the internet and smartphones has given web and app developers a new forum for implementing these tests at unprecedented scales. By redirecting small fractions of user traffic to experimental designs of an ad, app, game, or website, the developers can utilize noisy feedback to optimize any observable metric with respect to the product's configuration. In fact, depending on the particular phase of a product's life, new subscriptions may be more valuable than revenue or user retention, or vice versa; the click-through rate might be the relevant objective to optimize for an ad, whereas for a game it may be some measure of user engagement.

The crucial problem is how to optimally query these subsets of users in order to find the best product with high probability within a predetermined query budget, or how to redirect traffic sequentially in order to optimize a cumulative metric while incurring the least opportunity cost [88], [135], [38].

2) Recommender systems: In a similar setting, online content providers make product recommendations to their subscribers in order to optimize either revenue in the case of e-commerce sites, readership for news sites, or consumption for video and music streaming websites. In contrast to A/B testing, the content provider can make multiple suggestions to any given subscriber. The techniques reviewed in this work have been successfully used for the recommendation of news articles [97], [38], [153].

3) Robotics and Reinforcement learning: Bayesian optimization has also been successfully applied to policy search. For example, by parameterizing a robot's gait it is possible to optimize it for velocity or smoothness as was done on the Sony AIBO ERS-7 in [101]. Similar policy parameterization and search techniques have been used to navigate a robot through landmarks, minimizing uncertainty about its own location and map estimate [110], [108]. See [27] for an example of applying Bayesian optimization to hierarchical reinforcement learning, where the technique is used to automatically tune the parameters of a neural network policy and to learn value functions at higher levels of the hierarchy. Bayesian optimization has also been applied to learn attention policies in image tracking with deep networks [44].

4) Environmental monitoring and sensor networks: Sensor networks are used to monitor environmentally relevant quantities: temperature, concentration of pollutants in the atmosphere, soil, oceans, etc. Whether inside a building or at a planetary scale, these networks make noisy local measurements that are interpolated to produce a global model of the quantity of interest. In some cases, these sensors are expensive to activate but one can answer important questions like what is the hottest or coldest spot in a building by activating a relatively small number of sensors. Bayesian optimization was used for this task and the similar one of finding the location of greatest highway traffic congestion [146]. Also, see [55] for a meteorological application.

When the sensor is mobile, there is a cost associated with making a measurement which relates to the distance travelled by a vehicle on which the sensor is mounted (e.g., a drone). This cost can be incorporated in the decision making process as in [106].

5) Preference learning and interactive interfaces: The computer graphics and animation fields are filled with applications that require the setting of tricky parameters. In many cases, the models are complex and the parameters unintuitive for non-experts. In [28], [26], the authors use Bayesian optimization to set the parameters of several animation systems by showing the user examples of different parametrized animations and asking for feedback. This interactive Bayesian optimization strategy is particularly effective as humans can be very good at comparing examples, but unable to produce an objective function whose optimum is the example of interest.

6) Automatic machine learning and hyperparameter tuning: In this application, the goal is to automatically select the best model (e.g., random forests, support vector machines, neural networks, etc.) and its associated hyperparameters for solving a task on a given dataset. For big datasets or when considering many alternatives, cross-validation is very expensive and hence it is important to find the best technique within a fixed budget of cross-validation tests. The objective function here is the generalization performance of the models and hyperparameter settings; a noisy evaluation of the objective corresponds to training a single model on all but one of the cross-validation folds and returning, e.g., the empirical error on the held-out fold.

The traditional alternatives to cross-validation include racing algorithms that use conservative concentration bounds to rule out underperforming models [107], [113]. Recently, the Bayesian optimization approach for the model selection and tuning task has received much attention in tuning deep belief networks [16], Markov chain Monte Carlo methods [105], [65], convolutional neural networks [143], [148], and automatically selecting among WEKA and scikit-learn offerings [151], [72].


7) Combinatorial optimization: Bayesian optimization has been used to solve difficult combinatorial optimization problems in several applications. One notable approach is called empirical hardness models (EHMs) that use a set of problem features to predict the performance of an algorithm on a specific problem instance [96]. Bayesian optimization with an EHM amounts to finding the best algorithm and configuration for a given problem. This concept has been applied to e.g., tuning mixed integer solvers [78], [159], and tuning approximate nearest neighbour algorithms [109]. Bayesian optimization has also been applied to fast object localization in images [163].

8) Natural language processing and text: Bayesian optimization has been applied to improve text extraction in [158] and to tune text representations for more general text and language tasks in [162].

II. BAYESIAN OPTIMIZATION WITH PARAMETRIC MODELS


The central idea of Bayesian optimization is to build a model that can be updated and queried to drive optimization decisions. In this section, we cover several such models, but for the sake of clarity, we first consider a generic family of models parameterized by w. Let D denote the available data. We will generalize to the non-parametric situation in the following section.

Since w is an unobserved quantity, we treat it as a latent random variable with a prior distribution p(w), which captures our a priori beliefs about probable values for w before any data is observed. Given data D and a likelihood model p(D | w), we can then infer a posterior distribution p(w | D) using Bayes' rule:

(2)  p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}.

This posterior represents our updated beliefs about w after observing data D. The denominator p(D) is the marginal likelihood, or evidence, and is usually computationally intractable. Fortunately, it does not depend on w and is therefore simply a normalizing constant. A typical modelling choice is to use conjugacy to match the prior and likelihood so that the posterior (and often the normalizing constant) can be computed analytically.

A. Thompson sampling in the Beta-Bernoulli bandit model


We begin our discussion with a treatment of perhaps the simplest statistical model, the Beta-Bernoulli. Imagine that there are K drugs that have unknown effectiveness, where we define "effectiveness" as the probability of a successful cure. We wish to cure patients, but we must also identify which drugs are effective. Such a problem is often called a Bernoulli (or binomial) bandit problem by analogy to a group of slot machines, which each yield a prize with some unknown probability. In addition to clinical drug settings, this formalism is useful for A/B testing [135], advertising, and recommender systems [97], [38], among a wide variety of applications. The objective is to identify which arm of the bandit to pull, e.g., which drug to administer, which movie to recommend, or which advertisement to display. Initially, we consider the simple case where the arms are independent insofar as observing the success or failure of one provides no information about another.

Returning to the drug application, we can imagine the effectiveness of different drugs (arms on the bandit) as being determined by a function f that takes an index a ∈ {1, ..., K} and returns a Bernoulli parameter in the interval (0, 1). With y_i ∈ {0, 1} we denote the Bernoulli outcome of the treatment of patient i, and this has mean parameter f(a_i) if the drug administered was a_i. Note that we are assuming stochastic feedback, in contrast to deterministic or adversarial feedback [9], [10]. With only K arms, we can fully describe the function f with a parameter w ∈ (0, 1)^K so that f_w(a) := w_a.

Over time, we will see outcomes from different patients and different drugs. We can denote these data as a set of tuples D_n = {(a_i, y_i)}_{i=1}^{n}, where a_i indicates which of the K drugs was administered and y_i is 1 if the patient was cured and 0 otherwise. In a Bayesian setting, we will use these data to compute a posterior distribution over w. A natural choice for the prior distribution is a product of K beta distributions:

(3)  p(w \mid \alpha, \beta) = \prod_{a=1}^{K} \mathrm{Beta}(w_a \mid \alpha, \beta),

as this is the conjugate prior to the Bernoulli likelihood, and it leads to efficient posterior updating. We denote by n_{a,1} the number of patients cured by drug a and by n_{a,0} the number of patients who received a but were unfortunately not cured; that is

(4)  n_{a,0} = \sum_{i=1}^{n} \mathbb{I}(y_i = 0,\, a_i = a)

(5)  n_{a,1} = \sum_{i=1}^{n} \mathbb{I}(y_i = 1,\, a_i = a).

The convenient conjugate prior then leads to a posterior distribution which is also a product of betas:

(6)  p(w \mid D) = \prod_{a=1}^{K} \mathrm{Beta}(w_a \mid \alpha + n_{a,1},\, \beta + n_{a,0}).

Note that this makes it clear how the hyperparameters α, β > 0 in the prior can be interpreted as pseudo-counts. Figure 2 provides a visualization of the posterior of a three-armed Beta-Bernoulli bandit model with a Beta(2, 2) prior.

In Section IV, we will introduce various strategies for selecting the next arm to pull within models like the Beta-Bernoulli, but for the sake of illustration, we introduce Thompson sampling [150], the earliest and perhaps the simplest nontrivial bandit strategy. This strategy is also commonly known as randomized probability matching [135] because it selects the arm based on the posterior probability of optimality, here given by a beta distribution. In simple models like the Beta-Bernoulli, it is possible to compute this distribution in closed form, but more often it must be estimated via, e.g., Monte Carlo.





Fig. 2. Example of the Beta-Bernoulli model for A/B testing. Three different buttons are being tested with various colours and text. Each option is given 2 successes (click-throughs) and 2 failures as a prior (top). As data are observed, each option updates its posterior over w. Option A is the current best with 5 successes and only 1 observed failure.


After observing n patients in our drug example, we can think of a bandit strategy as being a rule for choosing which drug to administer to patient n+1, i.e., choosing a_{n+1} among the K options. In the case of Thompson sampling, this can be done by drawing a single sample w̃ from the posterior and then maximizing the resulting surrogate f_{w̃}, i.e.,

(7)  a_{n+1} = \arg\max_a f_{\tilde{w}}(a) \quad \text{where } \tilde{w} \sim p(w \mid D_n).

For the Beta-Bernoulli, this corresponds to simply drawing w̃ from (6) and then choosing the action with the largest w̃_a. This procedure, shown in pseudo-code in Algorithm 2, is also commonly called posterior sampling [127]. It is popular for several reasons: 1) there are no free parameters other than the prior hyperparameters of the Bayesian model, 2) the strategy naturally trades off between exploration and exploitation based on its posterior beliefs on w; arms are explored only if they are likely (under the posterior) to be optimal, 3) the strategy is relatively easy to implement as long as Monte Carlo sampling mechanisms are available for the posterior model, and 4) the randomization in Thompson sampling makes it particularly appropriate for batch or delayed feedback settings where many selections a_{n+1} are based on the identical posterior [135], [38].

B. Linear models


In many applications, the designs available to the experimenter have components that can be varied independently. For example, in designing an advertisement, one has choices such as artwork, font style, and size; if there are five choices for each, the total number of possible configurations is 125. In general, this number grows combinatorially in the number of components. This presents challenges for approaches such as the independent Beta-Bernoulli model discussed in the previous section: modelling the arms as independent will lead to strategies that must try every option at least once. This rapidly becomes infeasible in the large spaces of real-world problems. In this section, we discuss a parametric approach that captures dependence between the arms via a linear model. For simplicity, we first consider the case of real-valued outputs y and generalize this model to binary outputs in the succeeding section.


Algorithm 2 Thompson sampling for Beta-Bernoulli bandit



Require: α, β: hyperparameters of the beta prior
Initialize n_{a,0} = n_{a,1} = 0 for all a, and i = 0
repeat
    for a = 1, ..., K do
        w̃_a ~ Beta(α + n_{a,1}, β + n_{a,0})
    end for
    a_i = argmax_a w̃_a
    Observe y_i by pulling arm a_i
    if y_i = 0 then
        n_{a_i,0} = n_{a_i,0} + 1
    else
        n_{a_i,1} = n_{a_i,1} + 1
    end if
    i = i + 1
until stopping criterion reached
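A direct Python transcription of Algorithm 2 is given below; `pull_arm` is a hypothetical stand-in for the environment (e.g., administering drug a and recording success or failure), and the prior pseudo-counts default to the Beta(2, 2) choice of Figure 2.

```python
import numpy as np

def thompson_beta_bernoulli(pull_arm, K, alpha=2.0, beta=2.0, n_rounds=1000,
                            rng=np.random.default_rng(0)):
    """Thompson sampling for the Beta-Bernoulli bandit (Algorithm 2)."""
    n1 = np.zeros(K)                                # successes n_{a,1} per arm
    n0 = np.zeros(K)                                # failures  n_{a,0} per arm
    for i in range(n_rounds):
        w_tilde = rng.beta(alpha + n1, beta + n0)   # one posterior draw per arm
        a = int(np.argmax(w_tilde))                 # act greedily on the sample
        y = pull_arm(a)                             # observe Bernoulli outcome y_i
        if y == 0:
            n0[a] += 1
        else:
            n1[a] += 1
    return n1, n0
```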




As before, we begin by specifying a likelihood and a prior. In the linear model, it is natural to assume that each possible arm a has an associated feature vector x_a ∈ R^d. We can then express the expected payout (reward) of each arm as a function of this vector, i.e., f(a) = f(x_a). Our objective is to learn this function f : R^d → R for the purpose of choosing the best arm, and in the linear model we require f to be of the form f_w(a) = x_a^T w, where the parameters w are now feature weights. This forms the basis of our likelihood model, in which the observations for arm a are drawn from a Gaussian distribution with mean x_a^T w and variance σ².

We use X to denote the n × d design matrix in which row i is the feature vector associated with the arm pulled in the ith iteration, x_{a_i}. We denote by y the n-vector of observations. In this case, there is also a natural conjugate prior for w and σ²: the normal-inverse-gamma, with density given by

(8)  \mathrm{NIG}(w, \sigma^2 \mid w_0, V_0, \alpha_0, \beta_0) = |2\pi\sigma^2 V_0|^{-1/2} \exp\Big\{ -\tfrac{1}{2\sigma^2} (w - w_0)^T V_0^{-1} (w - w_0) \Big\}
     \times \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)} (\sigma^2)^{-(\alpha_0 + 1)} \exp\Big\{ -\tfrac{\beta_0}{\sigma^2} \Big\}.

There are four prior hyperparameters in this case: w_0, V_0, α_0, and β_0. As in the Beta-Bernoulli case, this conjugate prior enables the posterior distribution to be computed easily, leading to another normal-inverse-gamma distribution, now with parameters


(9)   w_n = V_n (V_0^{-1} w_0 + X^T y)

(10)  V_n = (V_0^{-1} + X^T X)^{-1}

(11)  \alpha_n = \alpha_0 + n/2

(12)  \beta_n = \beta_0 + \tfrac{1}{2}\big( w_0^T V_0^{-1} w_0 + y^T y - w_n^T V_n^{-1} w_n \big).

Integrating out the weight parameter w leads to coupling between the arms and makes it possible for the model to generalize observations of reward from one arm to another.

In this linear model, Thompson sampling draws a w̃ from the posterior p(w | D_n) and selects the arm with the highest expected reward under that parameter, i.e.,

(13)  a_{n+1} = \arg\max_a x_a^T \tilde{w} \quad \text{where } \tilde{w} \sim p(w \mid D_n).

After arm a_{n+1} is pulled and y_{n+1} is observed, the posterior model can be readily updated using equations (9)-(12).
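As an illustration, the conjugate update (9)-(12) and the Thompson step (13) take only a few lines; the sketch below assumes the normal-inverse-gamma parameterization of Equation (8), under which, a posteriori, σ² follows an inverse-gamma distribution and w | σ² ~ N(w_n, σ² V_n).

```python
import numpy as np

def nig_posterior(X, y, w0, V0, a0, b0):
    """Normal-inverse-gamma posterior update, Eqs. (9)-(12)."""
    V0_inv = np.linalg.inv(V0)
    Vn_inv = V0_inv + X.T @ X
    Vn = np.linalg.inv(Vn_inv)                                          # Eq. (10)
    wn = Vn @ (V0_inv @ w0 + X.T @ y)                                   # Eq. (9)
    an = a0 + len(y) / 2.0                                              # Eq. (11)
    bn = b0 + 0.5 * (w0 @ V0_inv @ w0 + y @ y - wn @ Vn_inv @ wn)       # Eq. (12)
    return wn, Vn, an, bn

def thompson_linear(arm_features, wn, Vn, an, bn, rng=np.random.default_rng()):
    """Eq. (13): draw (sigma^2, w~) from the posterior, then act greedily."""
    sigma2 = 1.0 / rng.gamma(shape=an, scale=1.0 / bn)    # sigma^2 ~ InvGamma(a_n, b_n)
    w_tilde = rng.multivariate_normal(wn, sigma2 * Vn)    # w~ | sigma^2 ~ N(w_n, sigma^2 V_n)
    return int(np.argmax(arm_features @ w_tilde))         # a_{n+1}
```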

Various generalizations can be immediately seen. For example, by embedding the arms of a multi-armed bandit into a feature space denoted X, we can generalize to objective functions f defined on the entire domain X, thus unifying the multi-armed bandit problem with that of general global optimization:

(14)  \text{maximize } f(x) \quad \text{s.t. } x \in \mathcal{X}.

In the multi-armed bandit, the optimization is over a discrete and finite set {x_a}_{a=1}^{K} ⊂ X, while global optimization seeks to solve the problem on, e.g., a compact set X ⊂ R^d.

As in other forms of regression, it is natural to increase the expressiveness of the model with non-linear basis functions. In particular, we can use J basis functions φ_j : X → R, for j = 1, ..., J, and model the function f with a linear combination

(15)  f(x) = \Phi(x)^T w,

where Φ(x) is the column vector of concatenated features {φ_j(x)}_{j=1}^{J}. Common classical examples of such φ_j include radial basis functions such as

(16)  \phi_j(x) = \exp\Big\{ -\tfrac{1}{2} (x - z_j)^T \Lambda (x - z_j) \Big\},

where Λ and {z_j}_{j=1}^{J} are model hyperparameters, and Fourier bases

(17)  \phi_j(x) = \exp\{ i\, x^T \omega_j \},

with hyperparameters {ω_j}_{j=1}^{J}.

Recently, such basis functions have also been learned from data by training deep belief networks [71], deep neural networks [93], [144], or by factoring the empirical covariance matrix of historical data [146], [72]. For example, in [34] each sigmoidal layer of an L-layer neural network is defined as ℒ(x) := σ(Wx + B), where σ is some sigmoidal non-linearity, and W and B are the layer parameters. Then the feature map Φ : R^d → R^J can be expressed as Φ(x) = ℒ_L ∘ ⋯ ∘ ℒ_1(x), where the final layer ℒ_L has J output units. In [144], the weights of the last layer of a deep neural network are integrated out to result in a tractable Bayesian model with flexible learned basis functions.

Regardless of the feature map Φ, when conditioned on these basis functions, the posterior over the weights w can be computed analytically using (9)-(12). Let Φ(X) denote the n × J matrix where [Φ(X)]_{i,j} = φ_j(x_i); then the posterior is as in Bayesian linear regression, substituting Φ(X) for the design matrix X.
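For instance, the radial basis features of Equation (16) can be computed as in the short sketch below and the resulting Φ(X) passed to the update (9)-(12) in place of the design matrix X; the centres z_j and the matrix Λ are assumed to be given.

```python
import numpy as np

def rbf_features(X, Z, Lam):
    """Radial basis features, Eq. (16): phi_j(x) = exp(-0.5 (x - z_j)^T Lam (x - z_j)).

    X: (n, d) inputs, Z: (J, d) centres, Lam: (d, d) matrix. Returns Phi(X) of shape (n, J)."""
    diff = X[:, None, :] - Z[None, :, :]                  # pairwise differences, (n, J, d)
    quad = np.einsum('njd,de,nje->nj', diff, Lam, diff)   # quadratic forms (x - z_j)^T Lam (x - z_j)
    return np.exp(-0.5 * quad)
```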

C. Generalized linear models


While simple linear models capture the dependence between bandit arms in a straightforward and expressive way, the model as described does not immediately apply to other types of observations, such as binary or count data. Generalized linear models (GLMs) [119] allow more flexibility in the response variable through the introduction of a link function. Here we examine the GLM for binary data such as might arise from drug trials or A/B testing.

The generalized linear model introduces a link function g that maps from the observation space into the reals. Most often, we consider the mean function g^{-1}, which defines the expected value of the response as a function of the underlying linear model: E[y | x] = g^{-1}(x^T w) = f(x). In the case of binary data, a common choice is the logit link function, which leads to the familiar logistic regression model in which g^{-1}(z) = 1/(1 + exp{-z}). In probit regression, the logistic mean function is replaced with the CDF of a standard normal. In either case, the observations y_i are taken to be Bernoulli random variables with parameter g^{-1}(x_i^T w).

Unfortunately, there is no conjugate prior for the parameters w when such a likelihood is used and so we must resort to approximate inference. Markov chain Monte Carlo (MCMC) methods [4] approximate the posterior with a sequence of samples that converge to the posterior; this is the approach taken in [135] on the probit model. In contrast, the Laplace approximation fits a Gaussian distribution to the posterior by matching the curvature of the posterior distribution at the mode. For example, in [38], Bayesian logistic regression with a Laplace approximation was used to model click-throughs for the recommendation of news articles in a live experiment. In the generalized linear model, Thompson sampling draws a w̃ from the posterior p(w | D_n) using MCMC or a Laplace approximation, and then selects the arm with the highest expected reward given the sampled parameter w̃, i.e., a_{n+1} = argmax_a g^{-1}(x_a^T w̃).
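One common concrete recipe (a sketch, not the only choice) places a N(0, τI) prior on w, finds the posterior mode by Newton's method, and uses the curvature at the mode as the Laplace approximation from which Thompson sampling draws; the version below is for the logistic link.

```python
import numpy as np

def laplace_logistic(X, y, prior_var=1.0, n_newton=25):
    """Laplace approximation to p(w | D) for Bayesian logistic regression
    with a N(0, prior_var * I) prior: returns the mode and the Gaussian covariance."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_newton):                            # Newton ascent to the posterior mode
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (y - p) - w / prior_var
        H = -(X.T * (p * (1.0 - p))) @ X - np.eye(d) / prior_var
        w = w - np.linalg.solve(H, grad)
    p = 1.0 / (1.0 + np.exp(-X @ w))                     # curvature at the mode
    H = -(X.T * (p * (1.0 - p))) @ X - np.eye(d) / prior_var
    return w, np.linalg.inv(-H)

def thompson_glm(arm_features, w_hat, cov, rng=np.random.default_rng()):
    """Draw w~ from the Laplace approximation and pick argmax_a g^{-1}(x_a^T w~)."""
    w_tilde = rng.multivariate_normal(w_hat, cov)
    return int(np.argmax(1.0 / (1.0 + np.exp(-arm_features @ w_tilde))))
```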

D. Related literature


There are various strategies beyond Thompson sampling for Bayesian optimization that will be discussed in succeeding sections of the paper. However, before we can reason about which selection strategy is optimal, we need to establish what the goal of the series of sequential experiments will be. Historically, these goals have been quantified using the principle of maximum expected utility. In this framework, a utility function U is prescribed over a set of experiments X := {x_i}_{i=1}^{n}, their outcomes y := {y_i}_{i=1}^{n}, and the model parameter w. The unknown model parameter and outcomes are marginalized out to produce the expected utility

(18)  \alpha(X) := \mathbb{E}_w \mathbb{E}_{y \mid X, w}\big[ U(X, y, w) \big],


which is then maximized to obtain the best set of experiments with respect to the given utility U and the current posterior. The expected utility α is related to acquisition functions in Bayesian optimization, reviewed in Section IV. Depending on the literature, researchers have focussed on different goals, which we briefly discuss here.

1) Active learning and experimental design: In this setting, we are usually concerned with learning about w, which can be framed in terms of improving an estimator of w given the data. One popular approach is to select points that are expected to minimize the differential entropy of the posterior distribution p(w | X, y), i.e., maximize:

\alpha(X) = \mathbb{E}_w \mathbb{E}_{y \mid X, w}\Big[ \int p(w \mid X, y) \log p(w \mid X, y)\, dw \Big].
In the Bayesian experimental design literature, this criterion is known as the D-optimality utility and was first introduced by Lindley [98]. Since this seminal work, many alternative utilities have been proposed in the experimental design literature. See [37] for a detailed survey.

In the context of A/B testing, following this strategy would result in exploring all possible combinations of artwork, font, and sizes, no matter how bad initial outcomes were. This is due to the fact that the D-optimality utility assigns equal value to any information provided about any advertisement configuration, no matter how effective.

In contrast to optimal experimental design, Bayesian optimization explores uncertain arms a ∈ {1, ..., K}, or areas of the search space X, only until they can confidently be ruled out as being suboptimal. Additional impressions of suboptimal ads would be a waste of our evaluation budget. In Section IV, we will introduce another differential entropy based utility that is better suited for the task of optimization and that partially bridges the gap between optimization and improvement of estimator quality.

2) Multi-armed bandit: Until recently, the multi-armed bandit literature has focussed on maximizing the sum of rewards y_i, possibly discounted by a discount factor γ ∈ (0, 1]:

(19)  \alpha(X) = \mathbb{E}_w \mathbb{E}_{y \mid X, w}\Big[ \sum_{i=1}^{n} \gamma^{i-1} y_i \Big].

When γ < 1, a Bayes-optimal sequence X can be computed for the Bernoulli bandit via dynamic programming, due to Gittins [59]. However, this solution is intractable for general reward distributions, and so in practice sequential heuristics are used and analyzed in terms of a frequentist measure, namely cumulative regret [92], [135], [146], [38], [127].

Cumulative regret is a frequentist measure defined as

(20)  R_n(w) = \sum_{i=1}^{n} \big( f_w^\star - f_w(x_{a_i}) \big),

where f_w^⋆ := max_a f_w(x_a) denotes the best possible expected reward. Whereas the D-optimality utility leads to too much exploration, the cumulative regret encourages exploitation by including intermediate selections a_i in the final loss function R_n. For certain tasks, this is an appropriate loss function: for example, when sequentially selecting ads, each impression incurs an opportunity cost. Meanwhile, for other tasks such as model selection, we typically have a predetermined evaluation budget for optimization and only the performance of the final recommended model should be assessed by the loss function.

Recently, there has been growing interest in the best arm identification problem, which is more suitable for the model selection task [104], [30], [7], [51], [50], [72]. When using Bayesian surrogate models, this is equivalent to performing Bayesian optimization on a finite, discrete domain. In this so-called pure exploration setting, in addition to a selection strategy, a recommendation strategy ρ is specified to recommend an arm (or ad or drug) at the end of the experimentation based on observed data. The experiment is then judged via the simple regret, which depends on the recommendation ā = ρ(D):

(21)  r_n(w) = f_w^\star - f_w(x_{\bar{a}}).

III. NON-PARAMETRIC MODELS


In this section, we show how it is possible to marginalize away the weights in Bayesian linear regression and apply the kernel trick to construct a Bayesian non-parametric regression model. As our starting point, we assume the observation variance σ² is fixed and place a zero-mean Gaussian prior on the regression coefficients, p(w | V_0) = N(0, V_0). In this case, we notice that it is possible to analytically integrate out the weights, and in doing so we preserve Gaussianity:

p(y \mid X, \sigma^2) = \int p(y \mid X, w, \sigma^2)\, p(w \mid 0, V_0)\, dw

     = \int \mathcal{N}(y \mid Xw, \sigma^2 I)\, \mathcal{N}(w \mid 0, V_0)\, dw

(22)  = \mathcal{N}(y \mid 0,\; X V_0 X^T + \sigma^2 I).

As noted earlier, it can be useful to introduce basis functions φ, and in the context of Bayesian linear regression we in effect replace the design matrix X with a feature mapping matrix Φ = Φ(X). In Equation (22), this results in a slightly different Gaussian for weights in feature space:

(23)  p(y \mid X, \sigma^2) = \mathcal{N}(y \mid 0,\; \Phi V_0 \Phi^T + \sigma^2 I)

Note that Φ V_0 Φ^T ∈ R^{n×n} is a symmetric positive semidefinite matrix made up of pairwise inner products between each of the data in their basis function representations. The celebrated kernel trick emerges from the observation that these inner products can be equivalently computed by evaluating the corresponding kernel function k for all pairs to form the matrix K

(24)  K_{i,j} = k(x_i, x_j) = \Phi(x_i)\, V_0\, \Phi(x_j)^T

(25)          = \big\langle \Phi(x_i), \Phi(x_j) \big\rangle_{V_0}.

The kernel trick allows us to specify an intuitive similarity between pairs of points, rather than a feature map Φ, which in practice can be hard to define. In other words, we can either think of predictions as depending directly on features Φ, as in the linear regression problem, or on kernels k, as in the lifted variant, depending on which paradigm is more interpretable or computationally tractable. Indeed, the former requires a J × J matrix inversion compared to the latter's n × n.


Note also that this approach not only allows us to compute the marginal likelihood of data that have already been seen, but it enables us to make predictions of outputs y⋆ at new locations X⋆. This can be done by observing that

(26)  p(y^\star \mid X^\star, X, y, \sigma^2) = \frac{p(y^\star, y \mid X^\star, X, \sigma^2)}{p(y \mid X, \sigma^2)}.

Both the numerator and the denominator are Gaussian with the form appearing in Equation (23), and so the predictions are jointly Gaussian and can be computed via some simple linear algebra. Critically, given a kernel k, it becomes unnecessary to explicitly define or compute the features Φ because both the predictions and the marginal likelihood only depend on K.

A. The Gaussian process


By kernelizing a marginalized version of Bayesian linear regression, what we have really done is construct an object called a Gaussian process. The Gaussian process GP(μ_0, k) is a non-parametric model that is fully characterized by its prior mean function μ_0 : X → R and its positive-definite kernel, or covariance function, k : X × X → R [126]. Consider any finite collection² of n points x_{1:n}, and define variables f_i := f(x_i) and y_{1:n} to represent the unknown function values and noisy observations, respectively. In Gaussian process regression, we assume that f := f_{1:n} are jointly Gaussian and the observations y := y_{1:n} are normally distributed given f, resulting in the following generative model:

(27)  f \mid X \sim \mathcal{N}(m, K)

(28)  y \mid f, \sigma^2 \sim \mathcal{N}(f, \sigma^2 I),

where the elements of the mean vector and covariance matrix are defined as m_i := μ_0(x_i) and K_{i,j} := k(x_i, x_j), respectively. Equation (27) represents the prior distribution p(f) induced by the GP.

Let D_n = {(x_i, y_i)}_{i=1}^{n} denote the set of observations and x denote an arbitrary test point. As mentioned when kernelizing linear regression, the random variable f(x) conditioned on observations D_n is also normally distributed with the following posterior mean and variance functions

(29)  \mu_n(x) = \mu_0(x) + k(x)^T (K + \sigma^2 I)^{-1} (y - m)

(30)  \sigma_n^2(x) = k(x, x) - k(x)^T (K + \sigma^2 I)^{-1} k(x),

where k(x) is a vector of covariance terms between x and x_{1:n}.

The posterior mean and variance evaluated at any point x represent the model's prediction and uncertainty, respectively, in the objective function at the point x. These posterior functions are used to select the next query point x_{n+1} as detailed in Section IV.
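A minimal sketch of the predictive equations (29)-(30), assuming a constant prior mean μ_0 and a kernel function that returns the matrix of pairwise covariances:

```python
import numpy as np

def gp_posterior(X, y, X_star, kernel, sigma2, mu0=0.0):
    """GP posterior mean and variance, Eqs. (29)-(30), with a constant prior mean mu0.

    `kernel(A, B)` must return the matrix of pairwise covariances k(a, b)."""
    K = kernel(X, X) + sigma2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                        # O(n^3), done once per fit
    k_star = kernel(X, X_star)                       # covariances between data and test points
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu0))
    mu = mu0 + k_star.T @ alpha                      # Eq. (29)
    v = np.linalg.solve(L, k_star)
    var = kernel(X_star, X_star).diagonal() - np.sum(v ** 2, axis=0)   # Eq. (30), diagonal only
    return mu, var
```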

B. Common kernels


In Gaussian process regression, the covariance function k dictates the structure of the response functions we can fit. For instance, if we expect our response function to be periodic, we can prescribe a periodic kernel. In this review, we focus on stationary kernels, which are shift invariant.




Fig. 3. Left: Visualization of various kernel profiles. The horizontal axis represents the distance r > 0. Middle: Samples from GP priors with the corresponding kernels. Right: Samples from GP posteriors given two data points (black circles). Note the sharper drop in the Matérn1 kernel leads to rough features in the associated samples, while samples from a GP with the Matérn3 and Matérn5 kernels are increasingly smooth.


Matérn kernels are a very flexible class of stationary kernels. These kernels are parameterized by a smoothness parameter ν > 0, so called because samples from a GP with such a kernel are differentiable ⌈ν⌉ − 1 times [126]. The exponential kernel is a special case of the Matérn kernel with ν = 1/2, and the squared exponential kernel is the limiting kernel when ν → ∞. The following are the most commonly used kernels, labelled by the smoothness parameter and omitting the factor of 1/2.

(31)  k_{\text{MATÉRN1}}(x, x') = \theta_0^2 \exp(-r)

(32)  k_{\text{MATÉRN3}}(x, x') = \theta_0^2 \exp(-\sqrt{3}\, r)\,(1 + \sqrt{3}\, r)

(33)  k_{\text{MATÉRN5}}(x, x') = \theta_0^2 \exp(-\sqrt{5}\, r)\,\big(1 + \sqrt{5}\, r + \tfrac{5}{3} r^2\big)

(34)  k_{\text{SQ-EXP}}(x, x') = \theta_0^2 \exp\big(-\tfrac{1}{2} r^2\big),

where r² = (x − x′)^T Λ (x − x′) and Λ is a diagonal matrix of d squared length scales θ_i². This family of covariance functions is therefore parameterized by an amplitude and d length scale hyperparameters, jointly denoted θ. Covariance functions with learnable length scale parameters are also known as automatic relevance determination (ARD) kernels. Figure 3 provides a visualization of the kernel profiles and samples from the corresponding priors and posteriors.
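These kernels are straightforward to implement; the sketch below assumes the usual ARD convention Λ = diag(θ_i^{-2}), i.e., r² = Σ_i (x_i − x′_i)² / θ_i².

```python
import numpy as np

def ard_dist(X, X2, lengthscales):
    """r in Eqs. (31)-(34), assuming Lambda = diag(1/theta_i^2) (the ARD convention)."""
    diff = (X[:, None, :] - X2[None, :, :]) / np.asarray(lengthscales)
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def matern52(X, X2, theta0=1.0, lengthscales=1.0):
    r = ard_dist(X, X2, lengthscales)
    return theta0 ** 2 * np.exp(-np.sqrt(5) * r) * (1 + np.sqrt(5) * r + 5.0 / 3.0 * r ** 2)

def sqexp(X, X2, theta0=1.0, lengthscales=1.0):
    r = ard_dist(X, X2, lengthscales)
    return theta0 ** 2 * np.exp(-0.5 * r ** 2)
```

With functools.partial used to fix the hyperparameters, either function can serve as the kernel argument of the prediction sketch above.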

C. Prior mean functions


While the kernel function controls the smoothness and amplitude of samples from the GP, the prior mean provides a possible offset. In practice, this function is set to a constant μ_0(x) ≡ μ_0 and inferred from data using techniques covered in Section V-A. Unless otherwise specified, in what follows we assume a constant prior mean function for convenience. However, the prior mean function is a principled way of incorporating expert knowledge of the objective function, if it is available, and the following analysis can be readily applied to non-constant functions μ_0.

D. Marginal likelihood


Another attractive property of the Gaussian process model is that it provides an analytical expression for the marginal likelihood of the data, where marginal refers to the fact that the unknown latent function f is marginalized out. The expression for the log marginal likelihood is simply given by:

\log p(y \mid x_{1:n}, \theta) = -\tfrac{1}{2} (y - m_\theta)^T (K^\theta + \sigma^2 I)^{-1} (y - m_\theta)

(35)\qquad\qquad - \tfrac{1}{2} \log |K^\theta + \sigma^2 I| - \tfrac{n}{2} \log(2\pi),




² We use the notation z_{i:j} = {z_i, ..., z_j}.





where in a slight abuse of notation we augment the vector θ := (θ_{0:d}, μ_0, σ²); and the dependence on θ is made explicit by adding a superscript to the covariance matrix K^θ. The marginal likelihood is very useful in learning the hyperparameters, as we will see in Section V-A. The right hand side of (35) can be broken into three terms: the first term quantifies how well the model fits the data, which is simply a Mahalanobis distance between the model predictions and the data; the second term quantifies the model complexity - smoother covariance matrices will have smaller determinants and therefore lower complexity penalties; finally, the last term is simply a linear function of the number of data points n, indicating that the likelihood of data tends to decrease with larger datasets.

Conveniently, as long as the kernel is differentiable with respect to its hyperparameters $\theta$, the marginal likelihood can be differentiated and can therefore be optimized in an off-the-shelf way to obtain a type II maximum likelihood (MLII) or empirical Bayes estimate of the kernel parameters. When data is scarce, however, this approach can overfit the available data. In Section V-A we will review various practical strategies for learning hyperparameters, all of which use the marginal likelihood.
方便的是,只要核函数对其超参数 θ 可微,边缘似然也同样可微,因此可以通过现成的方法进行优化,以获得核参数的 II 型最大似然(MLII)或经验贝叶斯估计。当数据稀缺时,这可能会对现有数据过拟合。在第五节 A 部分,我们将回顾各种实际学习超参数的策略,这些策略均利用了边缘似然。

E. Computational costs and other regression models
E. 计算成本与其他回归模型


Although we have analytic expressions, exact inference in Gaussian process regression is $\mathcal{O}(n^3)$, where $n$ is the number of observations. This cost is due to the inversion of the covariance matrix. In practice, the Cholesky decomposition can be computed once and saved so that subsequent predictions are $\mathcal{O}(n^2)$. However, this Cholesky decomposition must be recomputed every time the kernel hyperparameters are changed, which usually happens at every iteration (see Section V-A). For large datasets, or large function evaluation budgets in the Bayesian optimization setting, the cubic cost of exact inference is prohibitive and there have been many attempts at reducing this computational burden via approximation techniques. In this section we review two sparsification techniques for Gaussian processes and the alternative random forest regression.
尽管我们拥有解析表达式,但在高斯过程回归中进行精确推断的计算复杂度为 O(n3) ,其中 n 代表观测数量。这一成本源于协方差矩阵的求逆运算。实际应用中,可预先计算并保存 Cholesky 分解结果,使得后续预测的计算复杂度降至 O(n2) 。然而,每当核超参数发生变化时(通常发生在每次迭代中,参见第 V-A 节),都必须重新计算该 Cholesky 分解。对于大规模数据集或贝叶斯优化场景中较大的函数评估预算,精确推断的三次方计算成本令人望而却步,因此已有诸多研究尝试通过近似技术来减轻这一计算负担。本节我们将回顾高斯过程的两种稀疏化技术以及替代方案——随机森林回归。
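To make the cost structure concrete, the following minimal sketch (Python with NumPy/SciPy, assuming a zero prior mean, a fixed noise variance, and the ARD squared exponential kernel of Eq. (34); all function and parameter names are ours, not from any particular library) factors the covariance matrix once at $\mathcal{O}(n^3)$ cost and reuses the factor for $\mathcal{O}(n^2)$ predictions. Whenever the hyperparameters change, the fit step must be rerun, which is exactly the recomputation discussed above.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sq_exp_kernel(X1, X2, theta0=1.0, length_scales=None):
    """ARD squared exponential kernel, Eq. (34) (conventional
    inverse-length-scale parameterization)."""
    if length_scales is None:
        length_scales = np.ones(X1.shape[1])
    diff = (X1[:, None, :] - X2[None, :, :]) / length_scales
    return theta0 ** 2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_fit(X, y, noise=1e-2):
    """O(n^3): factor K + sigma^2 I once per hyperparameter setting."""
    K = sq_exp_kernel(X, X)
    L = cho_factor(K + noise * np.eye(len(X)), lower=True)
    alpha = cho_solve(L, y)                      # (K + sigma^2 I)^{-1} y
    return L, alpha

def gp_predict(Xstar, X, L, alpha):
    """O(n^2) per test point: posterior mean and (latent) variance."""
    Ks = sq_exp_kernel(Xstar, X)                 # n* x n cross-covariances
    mean = Ks @ alpha
    v = cho_solve(L, Ks.T)                       # (K + sigma^2 I)^{-1} k_*
    var = np.diag(sq_exp_kernel(Xstar, Xstar)) - np.sum(Ks * v.T, axis=1)
    return mean, var
```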

1) Sparse pseudo-input Gaussian processes (SPGP): One early approach to modelling large $n$ with Gaussian processes considered using $m < n$ inducing pseudo-inputs to reduce the rank of the covariance matrix to $m$, resulting in a significant reduction in computational cost [137], [140]. By forcing the interaction between the $n$ data points $\mathbf{x}_{1:n}$ and any test point $\mathbf{x}$ to go through this set of $m$ inducing pseudo-inputs, these methods can compute an approximate posterior in $\mathcal{O}(nm^2 + m^3)$ time. Pseudo-input methods have since been unified in a single theory based on the following overarching approximation.
1) 稀疏伪输入高斯过程(SPGP):早期针对大规模 n 建模的高斯过程方法之一,考虑使用 m<n 诱导伪输入来将协方差矩阵的秩降低至 m ,从而显著降低计算成本[137][140]。通过强制 n 数据点 x1:n 与任何测试点 x 之间的交互经由这组 m 诱导伪输入进行,这些方法能够在 O(nm2+m3) 时间内计算出近似后验分布。此后,伪输入方法被统一于基于以下总体近似的单一理论框架中。

Let $\mathbf{f}$ and $\mathbf{f}_\star$ denote two sets of latent function values, commonly representing the function at training and test locations, respectively. The simplifying assumption is that $\mathbf{f}$ and $\mathbf{f}_\star$ are independent given a third set of variables $\mathbf{u}$, such that
ff 表示两组潜在函数值,通常分别代表训练位置和测试位置处的函数。其简化假设是:给定第三组变量 u 时, ff 条件独立,即满足

(36) $p(\mathbf{f}, \mathbf{f}_\star) = \int p(\mathbf{f}, \mathbf{f}_\star, \mathbf{u})\, \mathrm{d}\mathbf{u}$

(37) $\approx \int q(\mathbf{f}\mid\mathbf{u})\, q(\mathbf{f}_\star\mid\mathbf{u})\, p(\mathbf{u})\, \mathrm{d}\mathbf{u} = q(\mathbf{f}, \mathbf{f}_\star)$

where $\mathbf{u}$ is the vector of function values at the pseudo-inputs. All sparse pseudo-input GP approximations can be specified in terms of the form used for the training and test conditionals, $q(\mathbf{f}\mid\mathbf{u})$ and $q(\mathbf{f}_\star\mid\mathbf{u})$, respectively [124].
其中 u 表示伪输入处的函数值向量。所有稀疏伪输入高斯过程近似均可根据训练条件 q(fu) 与测试条件 q(fu) 的形式进行定义[124]。
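As a hedged illustration of this family, the sketch below implements one of its simplest members, the deterministic training conditional (DTC) approximation, reusing a kernel function with the same call signature as the earlier GP sketch. The pseudo-input locations Xu are assumed given rather than optimized, and the formulas follow the standard unified treatment rather than any particular reference implementation.

```python
import numpy as np

def dtc_posterior(Xstar, X, y, Xu, kernel, noise=1e-2, jitter=1e-8):
    """Deterministic-training-conditional (DTC) approximate posterior with
    m pseudo-inputs Xu; cost O(n m^2 + m^3) instead of O(n^3)."""
    Kuu = kernel(Xu, Xu) + jitter * np.eye(len(Xu))   # m x m
    Kuf = kernel(Xu, X)                               # m x n
    Ksu = kernel(Xstar, Xu)                           # n* x m
    Sigma = np.linalg.inv(Kuu + Kuf @ Kuf.T / noise)  # m x m
    mean = Ksu @ (Sigma @ (Kuf @ y)) / noise
    Kuu_inv = np.linalg.inv(Kuu)
    var = (np.diag(kernel(Xstar, Xstar))
           - np.sum((Ksu @ Kuu_inv) * Ksu, axis=1)    # subtract Q_**
           + np.sum((Ksu @ Sigma) * Ksu, axis=1))     # add K_*u Sigma K_u*
    return mean, var
```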

In the seminal works on pseudo-input methods, the locations of the pseudo-inputs were selected to optimize the marginal likelihood of the SPGP [137], [140]. In contrast, a variational approach has since been proposed to marginalize the pseudo-inputs to maximize fidelity to the original exact GP [152] rather than the likelihood of the approximate GP.
在关于伪输入方法的开创性研究中,伪输入的位置选择是为了优化 SPGP 的边缘似然[137][140]。相比之下,后来提出了一种变分方法,通过对伪输入进行边缘化处理,以最大化对原始精确高斯过程的保真度[152],而非近似高斯过程的似然性。

The computational savings in the pseudo-input approach to approximating the GP come at the cost of poor variance estimates. As can be observed in Figure 4, the uncertainty (blue shaded area) exhibits unwanted pinching at pseudo-inputs, while it is overly conservative in between and away from pseudo-inputs. In this instance, the 10 inducing points, indicated with black crosses, were deliberately left unoptimized in order to emphasize the potential pathologies of the method. Since in Bayesian optimization we use the credible intervals to guide exploration, these artefacts can mislead our search.
伪输入方法在近似高斯过程中带来的计算节省是以方差估计不佳为代价的。如图 4 所示,不确定性(蓝色阴影区域)在伪输入处表现出不希望的收缩现象,而在伪输入之间及远离伪输入的区域则过于保守。此例中,10 个诱导点(以黑色十字标记)未经过优化以突出该方法潜在的病态特征。由于在贝叶斯优化中我们依赖可信区间来指导探索,这些人为假象可能会误导搜索过程。

2) Sparse spectrum Gaussian processes (SSGP): While inducing pseudo-inputs reduce computational complexity by using a fixed number of points in the search space, sparse spectrum Gaussian processes (SSGP) take a similar approach to the kernel's spectral space [94]. Bochner's theorem states that any stationary kernel $k(\mathbf{x},\mathbf{x}') = k(\mathbf{x}-\mathbf{x}')$ has a positive and finite Fourier spectrum $s(\omega)$, i.e.,
2) 稀疏谱高斯过程(SSGP):虽然通过使用搜索空间中的固定点数,诱导伪输入降低了计算复杂度,但稀疏谱高斯过程(SSGP)对核的谱空间采取了类似的方法[94]。Bochner 定理指出,任何平稳核 k(x,x)=k(xx) 都具有正有限傅里叶谱 s(ω) ,即,

(38) $k(\mathbf{x}) = \frac{1}{(2\pi)^d} \int e^{i\omega^{\mathsf{T}}\mathbf{x}}\, s(\omega)\, \mathrm{d}\omega.$

Since the spectrum is positive and bounded, it can be normalized such that $p(\omega) := s(\omega)/\nu$ is a valid probability density function. In this formulation, evaluating the stationary kernel is equivalent to computing the expectation of the Fourier basis with respect to its specific spectral density $p(\omega)$, as in the following,
由于谱是正且有界的,可以将其归一化,使得 p(ω):=s(ω)/ν 成为有效的概率密度函数。在此表述中,评估平稳核等价于计算傅里叶基关于其特定谱密度 p(ω) 的期望,如下所示,

(39) $k(\mathbf{x},\mathbf{x}') = \nu\, \mathbb{E}_{\omega}\!\left[e^{i\omega^{\mathsf{T}}(\mathbf{x}-\mathbf{x}')}\right].$

As the name suggests, SSGP approximates this expectation via Monte Carlo estimation using m samples drawn from the spectral density so that
顾名思义,SSGP 通过蒙特卡洛估计来近似这一期望值,使用从谱密度中抽取的 m 个样本进行计算

(40) $k(\mathbf{x},\mathbf{x}') \approx \frac{\nu}{m} \sum_{i=1}^{m} e^{i\omega^{(i)\mathsf{T}}\mathbf{x}}\, e^{-i\omega^{(i)\mathsf{T}}\mathbf{x}'}$

where $\omega^{(i)} \sim s(\omega)/\nu$. The resulting finite dimensional problem is equivalent to Bayesian linear regression with $m$ basis functions and the computational cost is once again reduced to $\mathcal{O}(nm^2 + m^3)$.
其中 ω(i)s(ω)/ν 。由此得到的有限维问题等价于采用 m 基函数的贝叶斯线性回归,计算成本再次降至 O(nm2+m3)
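A minimal sketch of this construction for the squared exponential kernel, whose normalized spectral density $p(\omega)$ is Gaussian, is given below. It uses real-valued random Fourier features with random phases (one common variant) and then performs Bayesian linear regression on the $m$ basis functions; names and defaults are illustrative assumptions.

```python
import numpy as np

def make_rff(dim, m, theta0=1.0, length_scales=None, seed=0):
    """Random Fourier features whose inner products approximate the
    SQ-EXP kernel of Eq. (34): its normalized spectrum p(w) is Gaussian."""
    rng = np.random.default_rng(seed)
    if length_scales is None:
        length_scales = np.ones(dim)
    W = rng.normal(size=(m, dim)) / length_scales     # w^(i) ~ p(w)
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)         # random phases
    return lambda X: theta0 * np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

def ssgp_posterior(features, X, y, noise=1e-2):
    """Bayesian linear regression on m basis functions: O(n m^2 + m^3)."""
    Phi = features(X)                                 # n x m design matrix
    A = Phi.T @ Phi / noise + np.eye(Phi.shape[1])
    Sigma_w = np.linalg.inv(A)                        # posterior weight cov.
    mu_w = Sigma_w @ Phi.T @ y / noise                # posterior weight mean
    def predict(Xstar):
        Ps = features(Xstar)
        return Ps @ mu_w, np.sum((Ps @ Sigma_w) * Ps, axis=1)
    return predict, mu_w, Sigma_w
```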

As with the pseudo-inputs, the spectral points can also be tuned via marginal likelihood optimization. Although this violates the Monte Carlo assumption and introduces a risk of overfitting, it allows for a smaller number of basis functions with good predictive power [94]. Once again, in Figure 4 we have not tuned the 80 spectral points in this way. Whereas around observed data (red crosses) the uncertainty estimates are smoother than those of the pseudo-input method, away from observations both the prediction and uncertainty regions exhibit spurious oscillations. This is highly undesirable for Bayesian optimization, where we expect our surrogate model to fall back on the prior away from observed data.
与伪输入类似,频谱点也可通过边缘似然优化进行调参。尽管这违背了蒙特卡洛假设并可能引发过拟合风险,但能以较少的基函数数量获得良好预测能力[94]。需说明的是,图 4 中我们未对 80 个频谱点进行此类调参。在观测数据附近(红色十字标记处),其不确定性估计比伪输入方法更为平滑;但在远离观测区域时,预测值与不确定带均出现虚假振荡。这对贝叶斯优化极为不利——我们期望代理模型在远离观测数据时能回归先验。




https://cdn.noedgeai.com/019684af-e35c-7a84-b274-13b47593f96c_9.jpg?x=133&y=149&w=1531&h=311&r=0

Fig. 4. Comparison of surrogate regression models. Four different surrogate model posteriors are shown in blue (shaded area delimits 95% credible intervals), given noisy evaluations (red crosses) of a synthetic function (dashed line). The 10 pseudo-inputs for the SPGP method are shown as black crosses. The SSGP model used a basis of 80 Fourier features.
图 4. 代理回归模型对比。图中展示了四种不同的代理模型后验分布(蓝色区域为 95%置信区间),基于合成函数(虚线)的噪声评估结果(红色叉号)。SPGP 方法的 10 个伪输入点以黑色叉号标示。SSGP 模型采用了包含 80 个傅里叶特征的基函数组。


3) Random forests: Finally, as an alternative to Gaussian processes, random forest regression has been proposed as an expressive and flexible surrogate model in the context of sequential model-based algorithm configuration (SMAC) [79]. Introduced in 2001 [24], random forests are a class of scalable and highly parallelizable regression models that have been very successful in practice [42]. More precisely, the random forest is an ensemble method where the weak learners are decision trees trained on random subsamples of the data [24]. Averaging the predictions of the individual trees produces an accurate response surface.
3) 随机森林:作为高斯过程的替代方案,随机森林回归在基于序列模型的算法配置(SMAC)[79]中被提出为一种表现力强且灵活的代理模型。该模型于 2001 年[24]提出,是一类可扩展性强、高度并行化的回归模型,在实践中表现优异[42]。具体而言,随机森林是一种集成方法,其弱学习器为基于数据随机子集训练的决策树[24]。通过对各树预测结果取平均,可生成精确的响应曲面。

Subsampling the data and the inherent parallelism of the random forest regression model give SMAC the ability to readily scale to large evaluation budgets, beyond the point where the cubic cost of an exact GP becomes infeasible. Similarly, at every decision node of every tree, a fixed-size subset of the available dimensions is sampled to fit a decision rule; this subsampling also helps the random forest scale to high-dimensional search spaces. Perhaps most importantly, random forests inherit the flexibility of decision trees when dealing with various data types; they can easily handle categorical and conditional variables. For example, when considering a decision node, the algorithm can exclude certain search dimensions from consideration when the path leading up to said node includes a particular boolean feature that is turned off.
对数据进行子采样,以及随机森林回归模型固有的并行性,使得 SMAC 能够轻松扩展到较大的评估预算范围,超出精确高斯过程(GP)立方成本不可行的界限。同样,在每棵树的每个决策节点上,都会对可用维度的一个固定大小子集进行采样以拟合决策规则;这种子采样也有助于随机森林扩展到高维搜索空间。或许最重要的是,随机森林继承了决策树在处理各种数据类型时的灵活性;它们可以轻松处理分类变量和条件变量。例如,在考虑一个决策节点时,如果通往该节点的路径包含某个被关闭的布尔特征,算法可以排除某些搜索维度的考量。

The exploration strategy in SMAC still requires an uncertainty estimate for predictions at test points. While the random forest does not provide an estimate of the variance of its predictions, Hutter et al. proposed using the empirical variance in the predictions across trees in the ensemble [79]. Though these are not principled uncertainty estimates, this heuristic has been shown to work well in practice for the SMAC algorithm.
SMAC 中的探索策略仍需对测试点的预测进行不确定性估计。虽然随机森林不提供其预测的方差估计,但 Hutter 等人提出使用集成中各棵树预测间的经验方差[79]。尽管这些并非理论上的不确定性估计,但该启发式方法已被证明在 SMAC 算法的实际应用中效果良好。
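The sketch below illustrates this heuristic: an ensemble of trees fit to bootstrap subsamples, with the mean prediction used as the response surface and the empirical variance across trees as the uncertainty estimate. It is a toy stand-in built with scikit-learn decision trees, not the SMAC implementation itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forest_surrogate(X, y, n_trees=100, seed=0):
    """Toy random-forest surrogate: trees fit to bootstrap subsamples,
    with the across-tree empirical variance as the uncertainty heuristic."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))    # bootstrap subsample
        tree = DecisionTreeRegressor(max_features="sqrt")
        tree.fit(X[idx], y[idx])                      # random feature subsets
        trees.append(tree)
    def predict(Xstar):
        preds = np.stack([t.predict(Xstar) for t in trees])
        return preds.mean(axis=0), preds.var(axis=0)
    return predict
```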

Although random forests are good interpolators in the sense that they output good predictions in the neighbourhood of training data, they are very poor extrapolators. Indeed, far from the data, the predictions of all trees could be identical, resulting in a poor prediction; more importantly, using the variance estimate of SMAC results in extremely confident intervals. In Figure 4 for example, away from data the shaded area is very narrow around a very poor constant prediction. Even more troubling is the fact that in areas of missing data multiple conflicting predictions can cause the empirical variance to blow up sharply, as can be seen in Figure 4. While Gaussian processes are also poor extrapolators (when used with local kernels), they produce relatively uncertain predictions away from the data by reverting to the prior - a more desirable behavior when trading off exploration and exploitation.
尽管随机森林在训练数据附近能输出良好预测的意义上是优秀的插值器,但它们在预测远离数据点时表现极差。实际上,远离数据区域时,所有树的预测可能完全相同,导致预测效果不佳;更重要的是,使用 SMAC 的方差估计会产生过于自信的区间。例如图 4 所示,在数据缺失区域,阴影区围绕着一个极差的常数预测变得非常狭窄。更令人担忧的是,在数据缺失区域,多个相互冲突的预测可能导致经验方差急剧膨胀,如图 4 所示。而高斯过程(使用局部核时)同样是糟糕的外推器,但通过回归到先验分布,它们能在远离数据时产生相对不确定的预测——这在权衡探索与利用时是更可取的行为。

Finally, another drawback of random forests for Bayesian optimization is that the response surface is discontinuous and non-differentiable so gradient based optimization methods are not applicable. SMAC relies on a combination of local and random search when maximizing the acquisition function.
最后,随机森林在贝叶斯优化中的另一个缺点是响应曲面不连续且不可微,因此基于梯度的优化方法无法适用。SMAC 在最大化采集函数时依赖于局部搜索与随机搜索相结合的策略。

IV. ACQUISITION FUNCTIONS
四、采集函数


Thus far, we have described the statistical model used to represent our belief about the unknown function $f$ at iteration $n$. We have not described the exact mechanism or policy for selecting the sequence of query points $\mathbf{x}_{1:n}$. One could select these arbitrarily, but this would be wasteful. Instead, there is a rich literature on selection strategies that utilize the posterior model to guide the sequential search, i.e., the selection of the next query point $\mathbf{x}_{n+1}$ given $\mathcal{D}_n$.
至此,我们已经描述了用于表示我们对未知函数 f 在第 n 次迭代时的认知的统计模型。但尚未详细说明选择查询点序列 x1:n 的具体机制或策略。虽然可以随意选择这些点,但这将造成浪费。相反,现有大量文献探讨了利用后验模型指导顺序搜索(即根据 Dn 选择下一个查询点 xn+1 )的选择策略。

Consider the utility function $U : \mathbb{R}^d \times \mathbb{R} \times \Theta \to \mathbb{R}$ which maps an arbitrary query point $\mathbf{x}$, its corresponding function value $v = f(\mathbf{x})$, and a setting of the model hyperparameters $\theta$ to a measure of quality of the experiment, e.g., how much information this query will provide as in [98]. Given some data accumulated thus far, we can marginalize the unseen outcome $y$ and the unknown model hyperparameters $\theta$ to obtain the expected utility of a query point $\mathbf{x}$:
考虑效用函数 U:Rd×R×ΘR ,它将任意查询点 x 、对应的函数值 v=f(x) 以及模型超参数设置 θ 映射为实验质量的衡量标准,例如该查询能提供多少信息量,如文献[98]所述。给定目前已积累的数据,我们可以对未观测结果 y 和未知模型超参数 θ 进行边缘化处理,从而得到查询点 x 的期望效用:

(41) $\alpha(\mathbf{x}; \mathcal{D}_n) = \mathbb{E}_{\theta}\, \mathbb{E}_{v \mid \mathbf{x}, \theta}\left[U(\mathbf{x}, v, \theta)\right]$

For simplicity, in this section we will mostly ignore the $\theta$ dependence and we will discuss its marginalization in Section V-A.
为简化起见,本节我们主要忽略 θ 的依赖性,其边缘化处理将在第 V-A 节讨论。

Whereas in experimental design and decision theory, the function α is called the expected utility,in Bayesian optimization it is often called the acquisition or infill function. These acquisition functions are carefully designed to trade off exploration of the search space and exploitation of current promising areas. We first present traditional improvement-based and optimistic acquisition functions, followed by more recent information-based approaches.
在实验设计与决策理论中,函数 α 被称为期望效用,而在贝叶斯优化中常被称为采集函数或填充函数。这些采集函数经过精心设计,以权衡对搜索空间的探索与当前潜力区域的开发。我们首先介绍基于改进的传统采集函数和乐观采集函数,随后介绍更近期的基于信息的方法。


A. Improvement-based policies
A. 基于改进的策略


Improvement-based acquisition functions favour points that are likely to improve upon an incumbent target τ . An early strategy in the literature, probability of improvement (PI) [91], measures the probability that a point x leads to an improvement upon τ . Since the posterior distribution of v=f(x) is Gaussian, we can analytically compute this probability as follows:
基于改进的采集函数倾向于选择可能超越当前目标 τ 的点。早期文献中的概率改进策略(PI)[91]通过计算点 x 超越 τ 的概率来衡量改进可能性。由于 v=f(x) 的后验分布服从高斯分布,该概率可按如下公式解析计算:

(42) $\alpha_{\mathrm{PI}}(\mathbf{x}; \mathcal{D}_n) := P[v > \tau] = \Phi\!\left(\frac{\mu_n(\mathbf{x}) - \tau}{\sigma_n(\mathbf{x})}\right),$

where $\Phi$ is the standard normal cumulative distribution function. Recall that $\alpha_{\mathrm{PI}}$ is then maximized to select the next query point. For this criterion, the utility function is simply an indicator of improvement, $U(\mathbf{x}, v, \theta) = \mathbb{I}[v > \tau]$, where the utility function is expressed (and marginalized) with respect to the latent variable $v$. Therefore, all improvements are treated equally and PI simply accumulates the posterior probability mass above $\tau$ at $\mathbf{x}$.
其中 Φ 为标准正态累积分布函数。需注意,随后通过最大化 αPI 来选择下一个查询点。对于此准则,效用函数仅作为改进指标 U(x,v,θ)=I[v>τ] ,此处效用函数针对潜变量 v 进行表达(及边缘化)。因此,所有改进被视为等同,PI 仅累积在 x 处超过 τ 的后验概率质量。

Although probability of improvement can perform very well when the target is known, in general the heuristic used for an unknown target τ causes PI to exploit quite aggressively [81].
尽管当目标已知时改进概率可能表现极佳,但针对未知目标 τ 的启发式方法通常会导致 PI 表现出较强的 exploitation 倾向[81]。

One could instead measure the expected improvement (EI) [115], which incorporates the amount of improvement. This new criterion corresponds to a different utility that is called the improvement function, denoted by $I(\mathbf{x})$. Formally, the improvement function $I$ is defined as follows
另一种方法是测量预期改进量(EI)[115],该准则考虑了改进幅度。这一新准则对应着被称为改进函数的不同效用,记作 I(x) 。形式上,改进函数 I 定义如下

(43) $I(\mathbf{x}, v, \theta) := (v - \tau)\, \mathbb{I}(v > \tau).$

Note that I>0 only if there is an improvement. Once again, because the random variable v is normally distributed,the expectation can be computed analytically as follows
注意 I>0 仅当存在改进时成立。再次强调,由于随机变量 v 服从正态分布,期望值可按如下方式解析计算

(44) $\alpha_{\mathrm{EI}}(\mathbf{x}; \mathcal{D}_n) := \mathbb{E}\left[I(\mathbf{x}, v, \theta)\right] = \left(\mu_n(\mathbf{x}) - \tau\right) \Phi\!\left(\frac{\mu_n(\mathbf{x}) - \tau}{\sigma_n(\mathbf{x})}\right) + \sigma_n(\mathbf{x})\, \phi\!\left(\frac{\mu_n(\mathbf{x}) - \tau}{\sigma_n(\mathbf{x})}\right),$

when $\sigma_n(\mathbf{x}) > 0$, and it vanishes otherwise. Here, $\phi$ is the standard normal probability density function (not to be confused with its use in a previous section). These improvement strategies have been empirically studied in the literature [82], [81], [27] and recently convergence rates have been proven for EI [32].
σn>0 时存在,否则消失。此处需注意与前文区分, ϕ 表示标准正态概率密度函数。这些改进策略已在文献[82][81][27]中经过实证研究,近期针对 EI 的收敛速率也得到证明[32]。

Finally, although the target objective value (i.e., the best reachable objective value) is often unknown, in practice $\tau$ is adaptively set to the best observed value $y^+ = \max_{i=1:n} y_i$. Whereas for PI this heuristic can lead to an overly greedy optimization [81], it works reasonably well with EI in practice [143]. When the objective function being minimized is very noisy, using the lowest mean value as the target is reasonable [157].
最后,尽管目标函数值(即可达到的最佳目标值)通常未知,实践中 τ 会自适应地设为当前观测到的最佳值 y+=maxi=1:nyi 。虽然 PI 采用这种启发式方法可能导致过度贪婪的优化[81],但 EI 在实践中表现尚可[143]。当待最小化的目标函数噪声较大时,采用最低均值作为目标是合理的[157]。
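Both criteria have the closed forms above, so they are straightforward to evaluate given the posterior mean and standard deviation at a candidate point. A small sketch (with $\tau$ supplied by the caller, e.g. the best observed value $y^+$) might look as follows; the function names are ours.

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, tau):
    """Eq. (42): posterior probability mass above the incumbent tau."""
    return norm.cdf((mu - tau) / sigma)

def expected_improvement(mu, sigma, tau):
    """Eq. (44): closed-form EI, defined to vanish where sigma = 0."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    ei = np.zeros_like(mu)
    ok = sigma > 0
    z = (mu[ok] - tau) / sigma[ok]
    ei[ok] = (mu[ok] - tau) * norm.cdf(z) + sigma[ok] * norm.pdf(z)
    return ei

# e.g. tau = y_best, with (mu, sigma) the GP posterior at candidate points.
```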



https://cdn.noedgeai.com/019684af-e35c-7a84-b274-13b47593f96c_10.jpg?x=917&y=160&w=729&h=401&r=0

Fig. 5. Visualization of the surrogate regression model and various acquisition functions. (Top) The true objective function is shown as a dashed line and the probabilistic regression model is shown as a blue line with a shaded region delimiting the $2\sigma_n$ credible intervals. Finally, the observations are shown as red crosses. (Bottom) Four acquisition functions are shown. In the case of PI, the optimal mode is much closer to the best observation than in the alternative methods, which explains its greedy behaviour. In contrast, the randomization in TS allows it to explore more aggressively.
图 5. 代理回归模型及多种采集函数的可视化。(顶部)真实目标函数以虚线显示,概率回归模型以蓝色实线表示,阴影区域标出“ 2σn ”置信区间,观测值则用红色叉号标记。(底部)展示了四种采集函数。其中 PI 函数的最优模式更接近最佳观测点,这解释了其贪婪特性;而 TS 的随机性使其能进行更激进的探索。


B. Optimistic policies  B. 乐观策略


Dating back to the seminal work of Lai & Robbins [92] on the multi-armed bandit problem, the upper confidence bound criterion has been a popular way of negotiating exploration and exploitation, often with provable cumulative regret bounds. The guiding principle behind this class of strategies is to be optimistic in the face of uncertainty. Indeed, using the upper confidence bound for every query point $\mathbf{x}$ corresponds to effectively using a fixed-probability best-case scenario according to the model. Originally, the upper confidence was given by frequentist Chernoff-Hoeffding bounds [8].
追溯至 Lai & Robbins [92]关于多臂老虎机问题的开创性研究,上置信界准则已成为权衡探索与利用的流行方法,通常具备可证明的累积遗憾界限。这类策略背后的指导原则是在不确定性面前保持乐观。实际上,对每个查询点 x 使用上置信界相当于根据模型固定概率地采用最佳情况假设。最初的上置信界由频率学派的 Chernoff-Hoeffding 边界[8]给出。

More recently, the Gaussian process upper confidence bound (GP-UCB [146]) algorithm was proposed as a Bayesian optimistic algorithm with provable cumulative regret bounds. In the deterministic case, a branch-and-bound extension to GP-UCB was proven to have exponentially vanishing instantaneous regret [43]. The GP-UCB algorithm has since been generalized to other Bayesian models by considering upper quantiles [84] instead of Equation (45) defined below, which is more reminiscent of frequentist concentration bounds. In the GP case, since the posterior at any arbitrary point $\mathbf{x}$ is a Gaussian, any quantile of the distribution of $f(\mathbf{x})$ is computed with its corresponding value of $\beta_n$ as follows:
最近,高斯过程上置信界(GP-UCB [146])算法作为一种具有可证明累积遗憾界的贝叶斯乐观算法被提出。在确定性情况下,GP-UCB 的分支定界扩展被证明具有指数级递减的瞬时遗憾[43]。此后,GP-UCB 算法通过考虑上分位数[84]而非下文定义的方程(45),被推广至其他贝叶斯模型,这更类似于频率学派的集中界。在高斯过程情况下,由于任意点 x 处的后验为高斯分布, f(x) 分布的任意分位数均可通过对应的 βn 值计算如下:

(45) $\alpha_{\mathrm{UCB}}(\mathbf{x}; \mathcal{D}_n) := \mu_n(\mathbf{x}) + \beta_n \sigma_n(\mathbf{x}).$

There are theoretically motivated guidelines for setting and scheduling the hyperparameter $\beta_n$ to achieve optimal regret [146] and, as with $\tau$ in the improvement policies, tuning this parameter within these guidelines can offer a performance boost.
理论上存在设定和调度超参数 βn 以实现最优遗憾的指导原则[146],与改进策略中的 τ 类似,在这些原则范围内调整该参数可带来性能提升。
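A small illustrative sketch follows; the schedule shown is one common choice inspired by the finite-candidate-set analysis of [146], with $\delta$ and the constants treated as tunable assumptions rather than a prescription.

```python
import numpy as np

def gp_ucb(mu, sigma, n, num_candidates, delta=0.1):
    """Eq. (45). beta here follows the finite-candidate-set schedule
    beta = 2 log(|D| n^2 pi^2 / (6 delta)) of [146]; the beta_n of Eq. (45)
    corresponds to the square root of this constant."""
    beta = 2.0 * np.log(num_candidates * n ** 2 * np.pi ** 2 / (6.0 * delta))
    return mu + np.sqrt(beta) * sigma
```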

Finally, there also exist variants of these algorithms for the contextual bandits [153] (see Section VIII-D) and parallel querying [45] (see Section V-E).
最后,这些算法还存在针对上下文赌博机[153](参见第八节 D 部分)和并行查询[45](参见第五节 E 部分)的变体。

C. Information-based policies
C. 基于信息的策略


In contrast to the acquisition functions introduced so far, information-based policies consider the posterior distribution over the unknown minimizer $\mathbf{x}^\star$, denoted $p(\mathbf{x}^\star \mid \mathcal{D}_n)$. This distribution is implicitly induced by the posterior over objective functions $f$. There are two policies in this class, namely Thompson sampling and entropy search.
与目前介绍的获取函数不同,基于信息的策略会考虑未知极小值 x 的后验分布(记作 p(xDn) )。该分布由目标函数 f 的后验隐式导出。此类策略包含两种方法:汤普森采样与熵搜索。


Though it was introduced in 1933 [150], Thompson sampling has attracted renewed interest in the multi-armed bandit community, producing empirical evaluations [135], [38] as well as theoretical results [85], [2], [127]. Thompson sampling (TS) is a randomized strategy which samples a reward function from the posterior and selects the arm with the highest simulated reward. Therefore the selection made by TS can be expressed as the randomized acquisition function $\mathbf{x}_{n+1} \sim p(\mathbf{x}^\star \mid \mathcal{D}_n)$.
尽管汤普森采样(Thompson sampling)早在 1933 年就被提出[150],但它在多臂老虎机领域重新引发了研究热潮,产生了大量实证评估[135][38]及理论成果[85][2][127]。该策略通过从后验分布中随机抽取奖励函数样本,并选择模拟奖励最高的臂,其选择行为可表述为随机化采集函数 xn+1p(xDn)

However, in continuous search spaces, the analog of Thompson sampling is to draw a continuous function $f^{(n)}$ from the posterior GP and optimize it to obtain $\mathbf{x}_{n+1}$. In order to be optimized, the sample $f^{(n)}$ needs to be fixed so it can be queried at arbitrary points; unfortunately, it is not clear how to fix an exact sample from the GP. However, using recent spectral sampling techniques [20], [125], [94], we can draw an approximate sample from the posterior that can be evaluated at any arbitrary point $\mathbf{x}$ [69], which extends TS to continuous search spaces. As an acquisition function, TS can be formulated as
然而在连续搜索空间中,汤普森采样的类比方法是从高斯过程后验中抽取连续函数 f(n) 并进行优化以获得 xn+1 。为使样本 f(n) 可优化,需将其固定以便在任意点查询;但精确固定高斯过程样本存在技术困难。借助最新谱采样技术[20][125][94],我们可从后验分布中抽取可任意点 x 评估的近似样本[69],从而将 TS 扩展至连续空间。作为采集函数,TS 可表述为:

(46) $\alpha_{\mathrm{TS}}(\mathbf{x}; \mathcal{D}_n) := f^{(n)}(\mathbf{x}), \quad \text{where } f^{(n)} \overset{\text{s.s.}}{\sim} \mathrm{GP}(\mu_0, k \mid \mathcal{D}_n)$

where s.s. indicates approximate simulation via spectral sampling. Empirical evaluations show good performance which, however, seems to deteriorate in high dimensional problems, likely due to aggressive exploration [139].
其中“ s.s. ”表示通过频谱采样进行的近似模拟。实证评估显示其性能良好,但在高维问题中似乎有所下降,这可能是由于过于激进的探索策略所致[139]。
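Concretely, once a finite-basis (spectral) representation of the posterior is available, as in the SSGP sketch earlier, drawing one approximate sample and maximizing it reduces to a few lines. The sketch below assumes the features, mu_w, and Sigma_w produced by that earlier sketch and replaces the continuous inner optimization with a discrete candidate set for simplicity.

```python
import numpy as np

def thompson_query(features, mu_w, Sigma_w, candidates, seed=None):
    """Draw one approximate posterior sample f^(n) through its finite
    spectral representation and return the best candidate under it."""
    rng = np.random.default_rng(seed)
    w = rng.multivariate_normal(mu_w, Sigma_w)   # one weight sample
    f_sample = features(candidates) @ w          # f^(n)(x) = phi(x)^T w
    return candidates[np.argmax(f_sample)]
```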

Instead of sampling the distribution $p(\mathbf{x}^\star \mid \mathcal{D}_n)$, entropy search (ES) techniques aim to reduce the uncertainty in the location $\mathbf{x}^\star$ by selecting the point that is expected to cause the largest reduction in entropy of the distribution $p(\mathbf{x}^\star \mid \mathcal{D}_n)$ [156], [67], [69]. In terms of utility, entropy search methods use the information gain defined as follows
与直接采样分布“ p(xDn) ”不同,熵搜索(ES)技术旨在通过选择能最大程度减少分布“ p(xDn) ”熵值的点来降低位置“ x ”的不确定性[156],[67],[69]。在效用方面,熵搜索方法采用如下定义的信息增益:

(47) $U(\mathbf{x}, y, \theta) = H(\mathbf{x}^\star \mid \mathcal{D}_n) - H\!\left(\mathbf{x}^\star \mid \mathcal{D}_n \cup \{(\mathbf{x}, y)\}\right),$

where $\theta$ implicitly parameterizes the distribution of $y$.
其中 θ 隐式参数化了 y 的分布。

In other words, ES measures the expected information gain from querying an arbitrary point x and selects the point that offers the most information about the unknown x . The acquisition function for ES can be expressed formally as
换言之,ES 衡量了从查询任意点 x 中获得的预期信息增益,并选择能提供最多关于未知 x 信息的点。ES 的采集函数可正式表述为

$\alpha_{\mathrm{ES}}(\mathbf{x}; \mathcal{D}_n) := H(\mathbf{x}^\star \mid \mathcal{D}_n) - \mathbb{E}_{y \mid \mathcal{D}_n, \mathbf{x}}\, H\!\left(\mathbf{x}^\star \mid \mathcal{D}_n \cup \{(\mathbf{x}, y)\}\right)$

where $H(\mathbf{x}^\star \mid \mathcal{D}_n)$ denotes the differential entropy of the posterior distribution $p(\mathbf{x}^\star \mid \mathcal{D}_n)$, and the expectation is over the distribution of the random variable $y \sim \mathcal{N}\!\left(\mu_n(\mathbf{x}), \sigma_n^2(\mathbf{x}) + \sigma^2\right)$.
其中 H(xDn) 表示后验分布 p(xDn) 的微分熵,期望值则是针对随机变量 yN(μn(x),σn2(x)+σ2) 的分布计算。

Once again, this function is not tractable for continuous search spaces $\mathcal{X}$, so approximations must be made. Early work discretized the space $\mathcal{X}$ and computed the conditional entropy via Monte Carlo sampling [156]. More recent work uses a discretization of $\mathcal{X}$ to obtain a smooth approximation to $p(\mathbf{x}^\star \mid \mathcal{D}_n)$ and its expected information gain [67]. This method is unfortunately $\mathcal{O}(M^4)$, where $M$ is the number of discrete so-called representer points.
再次强调,该函数在连续搜索空间 X 中难以处理,因此必须进行近似。早期研究通过离散化空间 X 并借助蒙特卡洛采样计算条件熵[156]。近期工作则采用 X 的离散化方法,以获得对 p 及其预期信息增益的光滑近似[67]。遗憾的是,该方法在 O(M4) 情况下效率受限,其中 M 表示离散化代表点的数量。

Finally, predictive entropy search (PES) removes the need for a discretization and approximates the acquisition function in $\mathcal{O}((n+d)^3)$ time, which, for $d < n$, is of the same order as EI [69]. This is achieved by using the symmetric property of mutual information to rewrite $\alpha_{\mathrm{ES}}(\mathbf{x})$ as
最终,预测熵搜索(PES)消除了离散化需求,并在 O((n+d)3) 时间内近似获取函数,对于 d<n 而言其复杂度与 EI 相当[69]。这是通过利用互信息的对称特性重写 αES(x) 表达式实现的。

$\alpha_{\mathrm{PES}}(\mathbf{x}; \mathcal{D}_n) := H(y \mid \mathcal{D}_n, \mathbf{x}) - \mathbb{E}_{\mathbf{x}^\star \mid \mathcal{D}_n}\left[H(y \mid \mathcal{D}_n, \mathbf{x}, \mathbf{x}^\star)\right]$

The expectation can be approximated via Monte Carlo with Thompson samples, and three simplifying assumptions are made to compute $H(y \mid \mathcal{D}_n, \mathbf{x}, \mathbf{x}^\star)$. Empirically, this algorithm has been shown to perform as well as or better than the discretized version, without the unappealing quartic term [69], making it arguably the state of the art in entropy search approximation.
期望值可通过汤普森采样的蒙特卡洛方法近似计算;为求解 H(yDn,x,x) ,文中作出三项简化假设。实证研究表明,该算法性能至少不逊于离散化版本,且避免了不理想的四次项[69],堪称熵搜索近似领域的当前最优方法。

D. Portfolios of acquisition functions
D. 采集函数组合策略


No single acquisition strategy provides better performance over all problem instances. In fact, it has been empirically observed that the preferred strategy can change at various stages of the sequential optimization process. To address this issue, [73] proposed the use of a portfolio containing multiple acquisition strategies. At each iteration, each strategy in the portfolio provides a candidate query point and a meta-criterion is used to select the next query point among these candidates. The meta-criterion is analogous to an acquisition function at a higher level; whereas acquisition functions are optimized over the entire input space, a meta-criterion is only optimized within the set of candidates suggested by its base strategies.
没有任何单一采集策略能在所有问题实例上均表现最优。事实上,实证研究发现,在序列优化过程的不同阶段,优势策略可能发生变化。为解决此问题,[73]提出采用包含多种采集策略的组合方案。每次迭代时,组合中各策略会提供候选查询点,再通过元准则从这些候选中选定下一个查询点。元准则类似于更高层级的采集函数——采集函数在整个输入空间进行优化,而元准则仅在其基础策略推荐的候选点集内进行优化。

The earlier approach of Hoffman et al. is based on a modification of the well-known Hedge algorithm [9], designed for the full-information adversarial multi-armed bandit. This particular portfolio algorithm relies on using the past performance of each acquisition function to predict future performance, where performance is measured by the objective function. However, this performance metric does not account for valuable information that is gained through exploration.
Hoffman 等人早期的方法基于对著名 Hedge 算法[9]的改进,该算法专为全信息对抗性多臂老虎机问题设计。这一特定投资组合算法依赖于利用各采集函数的历史表现来预测未来表现,其中表现由目标函数衡量。然而,这一性能指标并未考虑通过探索获得的宝贵信息。
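A minimal sketch of the Hedge-style selection step is shown below; the softmax selection over accumulated gains is the core mechanism, while the gain update shown in the trailing comment is one choice in the spirit of GP-Hedge and should be treated as an assumption rather than a faithful reproduction of [73].

```python
import numpy as np

def hedge_select(gains, nominees, eta=1.0, seed=None):
    """Pick one nominee with probability proportional to exp(eta * gain)."""
    rng = np.random.default_rng(seed)
    p = np.exp(eta * (gains - gains.max()))      # subtract max for stability
    p /= p.sum()
    k = rng.choice(len(nominees), p=p)
    return k, nominees[k]

# After evaluating the chosen point and updating the surrogate, one possible
# (GP-Hedge-style) credit assignment is to add the posterior mean at each
# strategy's own nominee to its gain:  gains[i] += mu_n(nominees[i]).
```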

A more recent approach, the so-called entropy search portfolio (ESP), considers the use of an information-based metric instead [139]. In contrast to the GP-Hedge portfolio, ESP selects among different candidates by considering the gain of information towards the optimum. Removing the constant entropy at the current time, the ESP meta-criterion reduces to
一种较新的方法称为熵搜索组合(ESP),它考虑使用基于信息的度量标准[139]。与 GP-Hedge 组合不同,ESP 通过考量向最优解的信息增益来筛选不同候选点。在去除当前时刻的恒定熵后,ESP 元准则简化为

(48) $\alpha_{\mathrm{ESP}}(\mathbf{x}; \mathcal{D}_n) = -\mathbb{E}_{y \mid \mathcal{D}_n, \mathbf{x}}\left[H\!\left[\mathbf{x}^\star \mid \mathcal{D}_n \cup \{(\mathbf{x}, y)\}\right]\right]$

(49) $\mathbf{x}_n = \arg\max_{\mathbf{x} \in \mathbf{x}_{1:K,n}} \alpha_{\mathrm{ESP}}(\mathbf{x}; \mathcal{D}_n),$

where $\mathbf{x}_{1:K,n}$ represent the candidates provided by the $K$ base acquisition functions. In other words, the candidate selected by this criterion is the one that results in the greatest expected reduction in entropy about the minimizer $\mathbf{x}^\star$. If the meta-criterion $\alpha_{\mathrm{ESP}}(\mathbf{x}; \mathcal{D}_n)$ were maximized over the entire space $\mathcal{X}$, ESP would reduce to the acquisition functions proposed by [156], [67], [69]. However, ESP restricts this maximization to the set of candidates proposed by each portfolio member.
其中 x1:K,n 代表由 K 基础采集函数提供的候选点。换言之,该准则选出的候选点是能最大程度预期减少关于最小化器 x 熵值的那个。若在整个空间 X 上最小化元准则 αESP(xDn) ,ESP 便简化为[156]、[67]、[69]提出的采集函数。然而,ESP 将这一最小化过程限制在每个组合成员生成的候选点集合内。





Fig. 6. Absolute error of the best observation for the Branin and Hartmann 3 synthetic functions. Plotting the mean and standard error (shaded area) over 25 repeated runs.
图 6. Branin 和 Hartmann 3 合成函数最佳观测值的绝对误差。展示了 25 次重复运行的平均值及标准误差(阴影区域)。


V. PRACTICAL CONSIDERATIONS
五、实际应用考量


In this section, we discuss some implementation details and more advanced topics. In particular, we first describe how the unknown hyperparameters $\theta$ are dealt with; we then provide a survey of techniques used to optimize the acquisition functions, followed by a discussion of non-stationarity and Bayesian optimization with parallelizable queries.
本节将探讨部分实现细节及更进阶的主题。具体而言,我们首先阐述如何处理未知超参数 θ ,随后综述优化采集函数的技术方法,最后讨论非平稳性及支持并行查询的贝叶斯优化。

A. Handling hyperparameters
A. 超参数处理方法


Thus far in the discussion we have mostly ignored the kernel hyperparameters and assumed they were given. In this section we describe two data-driven ways of handling hyperparameters, namely point estimation and approximate marginalization. Consider a generic function $\alpha : \mathcal{X} \times \Theta \to \mathbb{R}$, where $\theta \in \Theta$ represents the hyperparameters of our GP. In the context of Bayesian optimization, this function could be our objective function or any function derived from the Gaussian process, but for concreteness, it may help to think of it specifically as the acquisition function, hence the symbol $\alpha$. We wish to marginalize out our uncertainty about $\theta$ with the following expression
在迄今为止的讨论中,我们大多忽略了核超参数并假设它们已给定。本节将介绍两种数据驱动的超参数处理方法:点估计和近似边缘化。考虑一个通用函数 α:X×ΘR ,其中 θΘ 代表高斯过程的超参数。在贝叶斯优化的语境下,该函数可能是目标函数或任何源自高斯过程的函数,但为具体起见,不妨将其特别视为采集函数,故用符号 α 表示。我们希望通过以下表达式消除对 θ 不确定性的边缘化

(50) $\alpha_n(\mathbf{x}) := \mathbb{E}_{\theta \mid \mathcal{D}_n}\left[\alpha(\mathbf{x}; \theta)\right] = \int \alpha(\mathbf{x}; \theta)\, p(\theta \mid \mathcal{D}_n)\, \mathrm{d}\theta.$

This integral is over our posterior belief about $\theta$ given observations $\mathcal{D}_n$, which can be decomposed via Bayes' rule as
该积分基于给定观测数据 Dn 后对 θ 的后验信念,可通过贝叶斯规则分解为

(51) $p(\theta \mid \mathcal{D}_n) = \frac{p(\mathbf{y} \mid \mathbf{X}, \theta)\, p(\theta)}{p(\mathcal{D}_n)}.$

The simplest approach to tackling (50) is to fit the hyperparameters to the observed data using a point estimate $\hat\theta_n^{\mathrm{ML}}$ or $\hat\theta_n^{\mathrm{MAP}}$, corresponding to type II maximum likelihood or maximum a posteriori estimates, respectively. The posterior is then replaced by a delta measure at the corresponding $\hat\theta_n$, which yields
处理(50)式最简方法是利用点估计 θ^nMLθ^nMAP (分别对应 II 型极大似然估计和最大后验估计)将超参数拟合到观测数据。随后用对应 θ^n 处的狄拉克测度替代后验分布,从而得到

(52) $\hat\alpha_n(\mathbf{x}) = \alpha(\mathbf{x}; \hat\theta_n).$

The estimators $\hat\theta_n^{\mathrm{ML}}$ and $\hat\theta_n^{\mathrm{MAP}}$ can be obtained by optimizing the marginal likelihood or the unnormalized posterior, respectively. For certain priors and likelihoods, these quantities as well as their gradients can be computed analytically. For example, the GP regression model yields the marginal likelihood defined in (35), which we denote here by $\mathcal{L}_n$. Therefore it is common to use multi-started quasi-Newton hill-climbers (e.g., the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method) on objectives such as the likelihood $\mathcal{L}_n$ or the unnormalized posterior.
估计量 θ^nMLθ^nMAP 可通过分别优化边际似然或未归一化后验概率获得。对于某些先验分布和似然函数,这些量及其梯度可解析计算。例如,高斯过程回归模型给出了如(35)式定义的边际似然,此处记为 Ln 。因此,通常会在目标函数(如似然函数 Ln 或未归一化后验概率)上使用多起点拟牛顿爬山算法(例如有限内存的 Broyden-Fletcher-Goldfarb-Shanno(L-BFGS)方法)。
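The sketch below shows this type II maximum likelihood procedure: the negative of Eq. (35) (with a zero mean function assumed) is minimized with multi-started L-BFGS-B over log-transformed hyperparameters, reusing a kernel with the same call signature as the earlier GP sketch. The parameterization and restart scheme are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y, kernel):
    """-log p(y | x_{1:n}, theta) from Eq. (35), with a zero mean function;
    hyperparameters are kept positive by working in log space."""
    theta0, noise, *ls = np.exp(log_params)
    K = kernel(X, X, theta0=theta0, length_scales=np.array(ls))
    L = cho_factor(K + noise * np.eye(len(X)), lower=True)
    alpha = cho_solve(L, y)
    log_det = 2.0 * np.sum(np.log(np.diag(L[0])))
    return 0.5 * y @ alpha + 0.5 * log_det + 0.5 * len(X) * np.log(2 * np.pi)

def fit_hyperparameters(X, y, kernel, n_restarts=10, seed=0):
    """Multi-started quasi-Newton (L-BFGS-B) type II maximum likelihood."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        x0 = rng.normal(size=2 + X.shape[1])     # log(theta0, sigma^2, ls)
        res = minimize(neg_log_marginal_likelihood, x0,
                       args=(X, y, kernel), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return np.exp(best.x)        # amplitude, noise variance, length scales
```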

In Bayesian optimization, our uncertainty about the response surface plays a key role in guiding exploration and therefore it is important to incorporate our uncertainty about θ in the regression model. Naturally, these point estimates cannot capture this uncertainty. For this reason we consider marginalizing out the hyperparameters using either quadrature or Monte Carlo [120], [26], [143].
在贝叶斯优化中,我们对响应面不确定性的认知对指导探索至关重要,因此在回归模型中纳入关于 θ 的不确定性十分重要。自然,这些点估计无法捕捉这种不确定性。为此,我们考虑通过数值积分或蒙特卡洛方法[120]、[26]、[143]对超参数进行边际化处理。

The common component in Monte Carlo (MC) methods is that they approximate the integral in (50) using $M$ samples $\{\theta_n^{(i)}\}_{i=1}^{M}$ from the posterior distribution $p(\theta \mid \mathcal{D}_n)$:
蒙特卡洛(MC)方法的共同点在于,它们使用来自后验分布 p(θDn)M 个样本 {θn(i)}i=1M 来近似计算(50)式中的积分:

(53) $\mathbb{E}_{\theta \mid \mathcal{D}_n}\left[\alpha(\mathbf{x}; \theta)\right] \approx \frac{1}{M} \sum_{i=1}^{M} \alpha\!\left(\mathbf{x}; \theta_n^{(i)}\right).$

However, in practice it is impossible to sample directly from the posterior, so Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) techniques are used to produce a sequence of samples that are marginally distributed according to $p(\theta \mid \mathcal{D}_n)$ in the limit of infinitely long chains. Once the $M$ hyperparameter samples are obtained, the acquisition function is evaluated and averaged over all samples; this marginal acquisition function incorporates the uncertainty in $\theta$. In addition to MC methods, one could also use quadrature as shown in [120]. Here, samples (not necessarily drawn from the posterior) are combined using a weighted mixture:
然而,在实际操作中,直接从后验分布中采样是不可能的,因此采用马尔可夫链蒙特卡洛(MCMC)和序贯蒙特卡洛(SMC)技术来生成一系列样本,这些样本在无限长链的极限下边际分布符合 p(θDn) 。一旦获得 M 超参数样本,便对所有样本进行评估并计算获取函数的平均值;这一边际获取函数整合了 θ 的不确定性。除蒙特卡洛方法外,亦可采用如文献[120]所示的数值积分法。此处,样本(未必来自后验分布)通过加权混合方式结合:

(54) $\mathbb{E}_{\theta \mid \mathcal{D}_n}\left[\alpha(\mathbf{x}; \theta)\right] \approx \sum_{i=1}^{M} \omega_i\, \alpha\!\left(\mathbf{x}; \theta_n^{(i)}\right).$
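Given hyperparameter samples, whether drawn by MCMC/SMC as in (53) or paired with quadrature weights as in (54), the marginalized acquisition is just an average, as in the small sketch below (the sampler is not shown; acq is assumed to be any acquisition function of a point and a hyperparameter setting).

```python
import numpy as np

def marginalized_acquisition(acq, x, theta_samples, weights=None):
    """Eqs. (53)-(54): average an acquisition over hyperparameter samples,
    optionally with quadrature weights; acq(x, theta) is any base criterion."""
    values = np.array([acq(x, theta) for theta in theta_samples])
    return np.average(values, axis=0, weights=weights)
```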

We could do away with samples entirely and approximately integrate out the hyperparameters as shown in [53]. To make the integral tractable, the authors adopted a linear approximation to the likelihood which enables them to derive an approximate posterior. This method, however, has not been demonstrated in the setting of Bayesian optimization.
我们可以完全摒弃样本,如文献[53]所示近似积分掉超参数。为使积分可解,作者采用了对似然函数的线性近似,从而推导出近似后验分布。然而,该方法尚未在贝叶斯优化的背景下得到验证。

Estimating the hyperparameters of GP kernels with very few function evaluations is a challenging task, often with disastrous consequences as illustrated by a simple example in [15]. The typical estimation of the hyperparameters by maximizing the marginal likelihood [126], [82] can easily fall into traps, as shown in [32]. Several authors have proposed to integrate out the hyperparameters using quadrature or Monte Carlo methods [120], [26], [143]. These more advanced techniques can still fall in traps as illustrated with a simple simulation example in [157], where theoretical bounds are used to ensure that Bayesian optimization is robust with respect to the choice of hyperparameters.
在极少函数评估次数下估计高斯过程核的超参数是一项极具挑战性的任务,文献[15]通过简单示例展示了其可能导致的灾难性后果。传统通过最大化边缘似然来估计超参数的方法[126][82]极易陷入陷阱,如文献[32]所示。多位学者提出采用数值积分或蒙特卡洛方法对超参数进行积分处理[120][26][143],但文献[157]的仿真实验表明,即使这些先进技术仍可能失效,该研究通过理论边界确保贝叶斯优化对超参数选择具有鲁棒性。


B. Optimizing acquisition functions
B. 优化采集函数


A central step of the Bayesian optimization framework is the maximization of the acquisition function. Naturally, an acquisition function is only useful if it is cheap to evaluate relative to the objective function $f$. Nevertheless, the acquisition function is often multimodal and maximizing it is not a trivial task. In practice, the community has resorted to using various techniques such as discretization [143] and adaptive grids [13], or similarly, the divided rectangles approach of [83], which was used in [28], [110], [105]. When gradients are available, or can be cheaply approximated, one can use a multi-started quasi-Newton hill-climbing approach [100], [143]. Alternatively, [16] and [159] use the CMA-ES method of [66], and [79] apply multi-start local search.
贝叶斯优化框架的一个核心步骤是采集函数的最大化。自然地,只有当采集函数的评估成本相对于目标函数 f 较低时,它才有实用价值。然而,采集函数往往是多峰的,最大化它并非易事。实践中,学界采用了多种技术,如离散化[143]和自适应网格[13],或类似地,[83]提出的分区矩形法,该方法被用于[28]、[110]、[105]。当梯度可用或能低成本近似时,可采用多起点拟牛顿爬山法[100]、[143]。此外,[16]和[159]使用了[66]的 CMA-ES 方法,而[79]则应用了多起点局部搜索。

Unfortunately, these auxiliary optimization techniques can be problematic for several reasons. First, in practice it is difficult to assess whether the auxiliary optimizer has found the global maximizer of the acquisition function. This raises important concerns about the convergence of Bayesian optimization algorithms because theoretical guarantees are only valid with the assumption that the exact optimizer is found and selected; see for example [146], [154] and [32]. Second, between any two consecutive iterations of the Bayesian optimization algorithm, the acquisition function may not change dramatically. Therefore, rerunning the auxiliary optimizer can be unnecessarily wasteful.
遗憾的是,这些辅助优化技术因多种原因可能存在问题。首先,在实践中难以评估辅助优化器是否找到了采集函数的全局最大值。这对贝叶斯优化算法的收敛性提出了重要关切,因为理论保证仅在假设找到并选择了精确优化器的情况下才成立;例如参见[146]、[154]和[32]。其次,在贝叶斯优化算法的任意两次连续迭代之间,采集函数可能不会发生剧烈变化。因此,重新运行辅助优化器可能造成不必要的浪费。

Recent proposed optimistic optimization methods provide an alternative to Bayesian optimization [87], [31], [116]. These methods sequentially build space-partitioning trees by splitting leaves with high function values or upper confidence bounds; the objective function is then evaluated at the centre of the chosen leaves. Simultaneous optimistic optimization (SOO) can reach the global optimum without knowledge of the function's smoothness [116]. Since SOO is optimistic at multiple scales (i.e., it expands several leaves simultaneously, with at most one leaf per level) it has also been referred to as multi-scale optimistic optimization [158].
最近提出的乐观优化方法为贝叶斯优化[87]、[31]、[116]提供了一种替代方案。这些方法通过分割具有高函数值或上置信边界的叶节点,逐步构建空间划分树;随后在选定叶节点的中心评估目标函数。同步乐观优化(SOO)无需了解函数平滑度即可达到全局最优[116]。由于 SOO 在多个尺度上保持乐观(即同时扩展多个叶节点,每层最多一个叶节点),它也被称为多尺度乐观优化[158]。

Though these optimistic optimization methods do not require any auxiliary optimization, they are not as competitive as Bayesian optimization in practical domains where prior knowledge is available. The Bayesian multi-scale SOO (BaMSOO) algorithm combines the tree partitioning idea of SOO with the surrogate model of Bayesian optimization [158], eliminating the need for auxiliary optimization. BaMSOO also boasts some theoretical guarantees that do not depend on the exact optimization of an acquisition function. Intuitively, the method implements SOO to optimize the objective function directly, but avoids querying points that are deemed unlikely to be optimal by the surrogate model's confidence bounds.
尽管这些乐观优化方法不需要任何辅助优化,但在已有先验知识的实际领域中,这些方法的竞争力不如贝叶斯优化。贝叶斯多尺度 SOO(BamSOO)算法将 SOO 的树状分区思想与贝叶斯优化的代理模型相结合[158],从而消除了对辅助优化的需求。BamSOO 还拥有一些不依赖于采集函数精确优化的理论保证。直观上,该方法通过 SOO 直接优化目标函数,但避免查询那些被代理模型置信边界认为不太可能是最优的点。



https://cdn.noedgeai.com/019684af-e35c-7a84-b274-13b47593f96c_13.jpg?x=916&y=162&w=729&h=543&r=0

Fig. 7. Conditioned on the unknown objective function (red) lying between the surrogate confidence bounds (green region) with high probability, we can discard regions of the space where the upper bound is lower than the best lower bound encountered thus far. Figure from [43].
图 7. 在假设未知目标函数(红色)以高概率位于代理置信区间(绿色区域)内的条件下,我们可以排除空间上界低于当前最佳下界的区域。图片引自[43]。


In other words, BaMSOO uses the surrogate model to reduce the number of function evaluations, increasing sample efficiency. This work is also reminiscent of the theoretical work in [43], which proposes to only search in regions where the upper bound on the objective is greater than the best lower bound encountered thus far. Figure 7 illustrates how regions are discarded. Guided by the probabilistic model, the most promising regions are explored first, which avoids covering the entire space. Figure 8 compares SOO and BaMSOO on a simple one-dimensional example. Incorporating the surrogate model leads to a better, more refined optimization for the same number of query points.
换言之,BaMSOO 利用代理模型减少函数评估次数,从而提升采样效率。该工作也与文献[43]中的理论研究相呼应,后者提出仅在目标函数上界高于当前遇到的最佳下界的区域进行搜索。图 7 展示了如何剔除无效区域。在概率模型的引导下,优先探索最具潜力的区域,避免了全空间覆盖。图 8 通过简单一维示例对比了 SOO 与 BaMSOO 的效果,可见引入代理模型后,在相同查询点数量下能实现更精细的优化。

C. Conditional Spaces  C. 条件空间


It is often the case that some variables will only influence the function being optimized when other variables take on certain values. These are called conditional variables and are said to be active or inactive. For example, when the function involves selecting between different algorithms as well as optimizing their hyperparameters, then certain sets of hyperparameters belonging to a given algorithm will be inactive if that algorithm is not selected [79], [16].
常见情况是,某些变量仅当其他变量取特定值时才会影响被优化函数,这类变量称为条件变量,其状态分为激活或未激活。例如,当函数涉及选择不同算法并优化其超参数时,若某算法未被选中,则属于该算法的特定超参数集合将处于未激活状态[79][16]。

More formally, consider a variable $x_1 \in \mathcal{X}_1$ and another variable $x_2 \in \mathcal{X}_2$. $x_1$ is said to be a child of $x_2$ if it is only active when $x_2$ takes on certain values in $\mathcal{X}_2$. This conditional structure can be extended with multiple variables to form more complicated tree or directed acyclic graph structures. This greatly extends the capabilities of the Bayesian optimization framework, allowing it to chain together individual algorithms to form sophisticated pipelines that can be jointly optimized [151], [19].
更正式地说,考虑一个变量 x1X2 ,当另一个变量 x2X2.x1 仅在 x2X2 中某些特定值时处于激活状态,则称其为 x2 的子变量。这种条件结构可通过引入多个变量扩展为更复杂的树状或有向无环图结构。这极大拓展了贝叶斯优化框架的能力,使其能够将独立算法串联形成可联合优化的复杂流程管道[151][19]。




https://cdn.noedgeai.com/019684af-e35c-7a84-b274-13b47593f96c_14.jpg?x=161&y=147&w=696&h=1067&r=0

Fig. 8. Comparison of SOO (top) and BaMSOO (bottom) on $f(x) = \tfrac{1}{2}\sin(15x)\sin(27x)$ in $[0,1]$. Blue dots represent nodes where the objective was evaluated. BaMSOO does not evaluate $f$ at points that are suboptimal with high probability under the surrogate model (not shown). Figure from [158].
图 8. SOO(上)与 BamSOO(下)在 f(x)= 12sin(15x)sin(27x) [0,1] 中的对比。蓝点表示目标函数被评估的节点。BaMSOO 不会在代理模型(未显示)高概率次优的点 f 进行评估。图片引自[158]。


Models such as random forests or the tree Parzen estimator (TPE) are naturally tailored to handle conditional spaces. Random forests are constructed using ensembles of decision trees that can learn to ignore inactive variables and the TPE itself is a graph-structured generative model that follows the conditional structure of the search space.
诸如随机森林或树结构 Parzen 估计器(TPE)等模型天然适合处理条件空间。随机森林通过集成决策树构建,能够学会忽略非活跃变量;而 TPE 本身是一种遵循搜索空间条件结构的图结构生成模型。

Gaussian processes are not immediately suitable for conditional spaces because standard kernels are not defined over variable-length spaces. A simple approach is to define a separate GP for each group of jointly active hyperparameters [16], however this ignores dependencies between groups. Recent work has focused on defining a fixed-length embedding of conditional spaces where a standard kernel using Euclidean distance can be applied [147]. This is currently a very new area of research and more work needs to be done before GPs can work in conditional spaces as well as tree-based models.
高斯过程并不直接适用于条件空间,因为标准核函数未定义在可变长度空间上。一种简单方法是为每组共同活跃的超参数单独定义一个高斯过程[16],但这忽略了组间依赖关系。近期研究致力于定义条件空间的固定长度嵌入,从而可应用基于欧氏距离的标准核函数[147]。这目前是一个崭新研究领域,在高斯过程能像基于树模型那样良好处理条件空间之前,仍需更多工作。

D. Non-stationarity  D. 非平稳性


A major assumption made by GP regression using the kernels suggested in Section III-B is that the underlying process is stationary. Formally, this assumption means that the kernel $k(\mathbf{x}, \mathbf{x}')$ can be equivalently written as a function of $\mathbf{x} - \mathbf{x}'$. Intuitively, a function whose length-scale does not change throughout the input space will be well modelled by a GP with a stationary kernel.
使用第三节 B 部分建议的核函数进行高斯过程回归时,一个重要假设是底层过程具有平稳性。严格来说,该假设意味着核函数 k(x,x) 可等价表示为 xx 的函数。直观而言,若某函数在输入空间中长度尺度恒定不变,则采用平稳核的高斯过程能对其良好建模。

In real world problems we often expect that the true underlying process will be non-stationary. In these cases, the GP prior is misspecified, which means that it will require more data in order to produce reasonable posterior estimates. For Bayesian optimization this is an issue, as the entire goal is to minimize the function in as few evaluations as possible. Here, we will discuss some of the ways in which Bayesian optimization can be modified to deal with non-stationarity.
在现实世界的问题中,我们常常预期真实的内在过程是非平稳的。这种情况下,高斯过程先验会被误设,这意味着需要更多数据才能生成合理的后验估计。对于贝叶斯优化而言,这是个问题,因为其核心目标是以尽可能少的评估次数最小化函数。在此,我们将讨论贝叶斯优化如何调整以应对非平稳性的一些方法。

a) Non-stationary kernels: One way to create a non-stationary process is to use a non-stationary kernel. One strategy is to convert a stationary kernel into a non-stationary one by transforming $\mathbf{x}$ using a parametric warping function $\mathbf{x}^{(w)} = w(\mathbf{x})$ and then applying a stationary kernel to $\mathbf{x}^{(w)}$ [129], [145]. If $w$ is chosen appropriately, the data will follow a stationary process in the transformed space.
a) 非平稳核函数:创建非平稳过程的一种方式是使用非平稳核函数。一种策略是通过参数化扭曲函数 x(w)=w(x) 转换 x ,然后对 x(w) 应用平稳核函数,从而将平稳核转化为非平稳核[129],[145]。若 w 选择得当,数据在变换后的空间中将遵循平稳过程。

In Bayesian optimization, the inputs are traditionally projected onto the unit hypercube and this fact was exploited in [145], who chose the warping function to be the cumulative distribution function (CDF) of the beta distribution,
在贝叶斯优化中,传统上输入会被投影到单位超立方体上,文献[145]利用这一特性,选择贝塔分布的累积分布函数(CDF)作为扭曲函数。

(55) $w_d(x) = \int_0^{x_d} \frac{u^{\alpha-1}(1-u)^{\beta-1}}{B(\alpha, \beta)}\, \mathrm{d}u,$

where $\alpha$ and $\beta$ are the shape parameters, and $B$ is the beta function. In this case, $w_d(x)$ is a warping function for the $d$th dimension of $\mathbf{x}$, and a separate warping is applied to each dimension.
其中 αβ 为形状参数, B 为贝塔函数。此情况下, wd(x) 作为 xdth  维度扭曲函数,且每个维度均单独施加扭曲。

Examples of functions before and after applying beta warping are shown in Figure 9. Despite having only two parameters, the beta CDF is able to express a wide variety of transformations. These transformations contract portions of the input space and expand others, which has the effect of decreasing and increasing the length scale in those portions, respectively. The beta warping approach has been shown to be highly effective on several benchmark problems as well as hyperparameter optimization for machine learning models [145], [18].
图 9 展示了应用贝塔扭曲前后的函数示例。尽管仅含两个参数,贝塔 CDF 能表达多种变换形式。这些变换会压缩输入空间的部分区域并扩展其他区域,从而分别缩短或延长相应区域的长度尺度。研究表明,贝塔扭曲方法在多个基准问题及机器学习模型超参数优化中效果显著[145][18]。

While the beta CDF is not the only choice, it is appealing for a number of reasons. For hyperparameter optimization, it mimics the kind of transformations practitioners tend to use when applying a grid search, such as searching over learning rates in the log-domain. It is compactly parameterized, so that learning the shape parameters is not too much more expensive than learning other kernel parameters. Finally, it is an invertible transformation, so that once the maximum of the acquisition function is found, it can easily be mapped back into the original space. For $\alpha = \beta = 1$, the transformation is the identity and the original, stationary GP is recovered.
虽然贝塔累积分布函数并非唯一选择,但其吸引力源于多重原因。在超参数优化中,它模拟了实践者进行网格搜索时惯用的转换方式,例如在对数域中搜索学习率。其参数化形式紧凑,使得学习形状参数的成本仅略高于学习其他核参数。此外,该变换可逆,一旦找到采集函数的极值,即可轻松映射回原始空间。当 α=β=1 时,变换为恒等映射,原始的平稳高斯过程得以恢复。

Learning $\alpha$ and $\beta$ via point estimates can be difficult when using gradient-based optimization, as the beta function and its derivatives with respect to $\alpha$ and $\beta$ do not have simple closed-form solutions. An appealing alternative in this case is the Kumaraswamy distribution, whose CDF takes the form
当使用基于梯度的优化方法时,通过点估计学习 αβ 可能较为困难,因为贝塔函数及其对 αβ 的导数没有简单的闭式解。此时库马拉斯瓦米分布是一个颇具吸引力的替代方案,其累积分布函数的形式为

(56) $w_d(x) = 1 - \left(1 - x_d^{\alpha}\right)^{\beta}.$
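Both warpings are one-line transformations of the (unit-hypercube) inputs, after which any stationary kernel can be applied to the warped coordinates. A small sketch follows, with the shape parameters assumed given; in practice they are learned along with the other kernel hyperparameters.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def beta_warp(x, a, b):
    """Eq. (55): beta-CDF warping of unit-hypercube inputs."""
    return beta_dist.cdf(x, a, b)

def kumaraswamy_warp(x, a, b):
    """Eq. (56): closed-form alternative with simple gradients."""
    return 1.0 - (1.0 - x ** a) ** b

# A warped stationary kernel is then evaluated as k(w(x), w(x')), e.g. by
# feeding warped inputs to the squared exponential kernel sketched earlier.
```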




https://cdn.noedgeai.com/019684af-e35c-7a84-b274-13b47593f96c_15.jpg?x=141&y=173&w=1512&h=424&r=0

Fig. 9. Left: Examples of beta CDF warpings under different settings of the shape parameters α and β . Right: Examples of functions after applying a beta CDF warping (originally from [145]). The regions where the CDF has a slope greater than 1 are expanded along the horizontal axis, while regions where the CDF has slope less than 1 are contracted.
图 9. 左图:不同形状参数 αβ 设置下的贝塔 CDF 扭曲示例。右图:应用贝塔 CDF 扭曲后的函数示例(原始数据来自[145])。CDF 斜率大于 1 的区域沿横轴扩展,而斜率小于 1 的区域则被压缩。


There are many other examples of non-stationary covariance functions [70], [121], [132], [126], [118], [22], [3], [54] that have been proposed for GP regression along with closely related output warping techniques [141] that can also model certain kinds of non-stationary processes.
还有许多其他非平稳协方差函数的例子[70]、[121]、[132]、[126]、[118]、[22]、[3]、[54]被提出用于高斯过程回归,同时密切相关的输出扭曲技术[141]也能模拟某些类型的非平稳过程。

b) Partitioning: An alternative approach to modelling non-stationarity that has been useful in practice is to partition the space into distinct regions and then to model each region as a separate stationary process. In a random forest model [79], [27], this is achieved by finer partitioning in regions of the space where the function changes rapidly, and coarser partitioning in regions where the function changes slowly. Partitioning can also be an effective strategy for GPs. For example, [63] proposed the treed GP model, which partitions the data and then applies a separate GP to each region.
b) 分区法:实践中另一种有效的非平稳性建模方法是将空间划分为不同区域,然后将每个区域视为独立的平稳过程进行处理。在随机森林模型[79]、[27]中,通过函数变化快速区域进行细粒度划分,而在变化缓慢区域采用更粗粒度的划分来实现这一点。分区策略同样适用于高斯过程。例如,[63]提出的树状高斯过程模型,先对数据进行分区,再对每个区域应用独立的高斯过程。

c) Heteroscedasticity: Heteroscedasticity is a close analogue of non-stationarity, but refers to non-stationary behaviour in the noise process governing the observation model, instead of the true process that we wish to capture. Standard GP regression using an isotropic noise kernel assumes by default that the noise process is constant everywhere, and is therefore stationary by definition. In practice, it is possible to have non-stationarity in both the true process and the noise process. Heteroscedasticity has been widely addressed in the GP literature, see e.g., [95], [86], [103].
c) 异方差性:异方差性是非平稳性的近亲,但它指的是观测模型中的噪声过程表现出非平稳行为,而非我们试图捕捉的真实过程。标准的高斯过程回归默认使用各向同性噪声核,这意味着噪声过程在所有地方都是恒定的,因此从定义上就是平稳的。实际上,真实过程和噪声过程都可能存在非平稳性。异方差性在高斯过程文献中已被广泛讨论,例如参见[95]、[86]、[103]。

For Bayesian optimization in particular, one approach to handling heteroscedastic noise was proposed in [5] using a partitioning approach. The idea is to build a partition using classification and regression trees (CART) [25]; however, splitting was restricted to occur at data points rather than between them. This ensured that the variance estimates of the GP would remain smooth between partitions.
特别是在贝叶斯优化中,文献[5]提出了一种处理异方差噪声的分区方法。其思路是利用分类与回归树(CART)[25]构建分区;然而,分割被限制在数据点处而非数据点之间进行。这确保了高斯过程的方差估计在分区之间保持平滑。

Another form of non-stationarity that is closely related to heteroscedasticity is a non-stationary amplitude [1], [54]. This is where the magnitude of the output process changes as a function of the input. To our knowledge this has not been directly addressed in the Bayesian optimization literature. There have, however, been attempts to be robust to this effect by integrating out the amplitude parameter of the GP kernel. This was done numerically in [143] and analytically using conjugate priors in [138], resulting in a latent GP with $t$-distributed predictions and an input-dependent noise covariance.
另一种与非平稳性密切相关且类似于异方差性的形式是非平稳幅度[1][54]。这种情况下,输出过程的幅度会随输入变化而变化。据我们所知,贝叶斯优化文献尚未直接探讨此问题。不过已有研究尝试通过积分掉高斯过程核的幅度参数来增强对此效应的鲁棒性。[143]中采用数值方法实现了这一点,而[138]则利用共轭先验进行解析处理,最终得到一个预测呈 t 分布且具有输入依赖噪声协方差的隐高斯过程。

E. Parallelization  E. 并行化


Bayesian optimization is conventionally posed as a sequential problem where each experiment is completed before a new one is proposed. In practice it may be possible and advantageous to run multiple function evaluations in parallel. Even if the number of experiments required to reach the minimum does not change, parallel approaches can yield a substantial reduction in terms of wall-clock time [80], [143].
贝叶斯优化传统上被视为一个顺序问题,即每个实验完成后才提出新的实验。然而在实际操作中,并行运行多个函数评估可能更为可行且具有优势。即使达到最小值所需的实验次数不变,并行方法也能显著缩短实际耗时[80][143]。

Ginsbourger et al. [57] proposed several approaches based on imputing the results of currently running experiments. The idea is that, given the current observations $\mathcal{D}_n = \{(\mathbf{x}_n, y_n)\}$ and pending experiments $\mathcal{D}_p = \{\mathbf{x}_p\}$, one can impute a set of experimental outcomes $\tilde{\mathcal{D}}_p = \{(\mathbf{x}_p, \tilde{y}_p)\}$ and then perform a step of Bayesian optimization using the augmented dataset $\mathcal{D}_n \cup \tilde{\mathcal{D}}_p$.
Ginsbourger 等人[57]提出了几种基于对当前运行实验结果的估算方法。其核心思想是:给定当前观测值 Dn={(xn,yn)} 和待处理实验 Dp={xp} ,可以估算出一组实验输出结果 D~p={(xp,y~p)} ,然后利用增强数据集 DtD~p 进行贝叶斯优化的一步操作。

One simple strategy is the constant liar, where a constant $L$ is chosen such that $\tilde{y}_p = L,\ \forall p$. Another strategy is the Kriging believer, which uses the GP predictive mean $\tilde{y}_p = \mu_n(\mathbf{x}_p)$. [143] used an approach where a set of $S$ fantasies are sampled for each unfinished experiment from the full GP posterior predictive distribution. These are then combined to estimate the following parallel integrated acquisition function,
一种简单策略是“恒定谎言者”,即选择一个固定值 L 使得 y~p=L,p 。另一种策略是克里金信徒,它使用高斯过程预测均值 y~p=μn(xp) 。[143]采用的方法是从完整高斯过程后验预测分布中为每个未完成实验采样一组 S 幻想值,随后将这些值组合起来估计以下并行集成采集函数。

(57) $\alpha(\mathbf{x}; \mathcal{D}_n, \mathcal{D}_p) = \int_{\mathbb{R}^J} \alpha\!\left(\mathbf{x}; \mathcal{D}_n \cup \tilde{\mathcal{D}}_p\right) P\!\left(\tilde{y}_{1:J}; \mathcal{D}_n\right) \mathrm{d}\tilde{y}_{1:J},$

(58) $\approx \frac{1}{S} \sum_{s=1}^{S} \alpha\!\left(\mathbf{x}; \mathcal{D}_n \cup \tilde{\mathcal{D}}_p^{(s)}\right),$

(59) $\tilde{\mathcal{D}}_p^{(s)} \sim P\!\left(\tilde{y}_{1:J}; \mathcal{D}_n\right),$

where $J$ is the number of currently pending experiments. This approach has been shown to be very effective in practice when $\alpha$ is chosen to be EI. Similar approaches are proposed in [58], [40], and a parallel extension to GP-UCB is proposed in [45].
其中 J 表示当前待处理实验的数量。实践表明,当选择 α 为期望改进(EI)时,该方法效果显著。类似方法在文献[58][40]中亦有提出,针对 GP-UCB 的并行扩展方案可参见[45]。
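The sketch below illustrates the fantasizing scheme of (57)-(59) in a generic way: acq_given_data and gp_sample_y are assumed callables (an acquisition evaluated under an augmented dataset, and a joint sample of outcomes at the pending points from the GP posterior predictive), so this is a schematic of the estimator rather than any specific package's API.

```python
import numpy as np

def fantasized_acquisition(acq_given_data, X, y, X_pending, gp_sample_y,
                           n_fantasies=10, seed=0):
    """Eqs. (57)-(59): average the acquisition over S fantasy outcomes drawn
    at the pending points from the GP posterior predictive distribution."""
    rng = np.random.default_rng(seed)
    def acq(xstar):
        total = 0.0
        for _ in range(n_fantasies):
            y_fake = gp_sample_y(X_pending, rng)          # one joint fantasy
            X_aug = np.vstack([X, X_pending])
            y_aug = np.concatenate([y, y_fake])
            total += acq_given_data(xstar, X_aug, y_aug)  # alpha(x; Dn u ~Dp)
        return total / n_fantasies
    return acq

# The constant liar and Kriging believer are the degenerate single-fantasy
# cases y_fake = L (a constant) and y_fake = mu_n(x_p), respectively.
```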

Although the imputation approaches deal with parallel experiments, the manner in which they propose candidates is still inherently sequential. A truly parallel approach would simultaneously propose a set of candidates. Jones [81] and Hutter et al. [80] proposed an approach based on GP-UCB where $\alpha_{\mathrm{UCB}}$ is optimized using a range of $\beta_n$ values, which produces a set of points that favour a range of exploration-exploitation trade-offs.
虽然插补方法处理的是并行实验,但其提出候选方案的本质仍是顺序性的。真正的并行方法应能同时提出一组候选方案。Jones[81]和 Hutter 等人[80]提出了一种基于 GP-UCB 的方法,通过优化 αUCB 并采用不同的 βn 参数值,生成一组兼顾探索与开发的候选点。


F. Software implementations
F. 软件实现


As of this writing, there are several open source packages implementing various forms of Bayesian optimization. We highlight several popular libraries in Table I.
在撰写本文时,已有多个开源包实现了不同形式的贝叶斯优化。我们在表 I 中重点介绍了几种常用库。

VI. THEORY OF BAYESIAN OPTIMIZATION
六、贝叶斯优化理论


There exists a vast literature on the theoretical properties of bandit algorithms in general. Theoretical properties of Bayesian optimization, however, have only been established recently. In this section, we focus on the results concerning Gaussian process based Bayesian optimization and defer detailed discussions of bandit algorithms to other dedicated surveys [29], [117].
关于多臂老虎机算法的理论特性已有大量文献研究,而贝叶斯优化的理论性质直到近期才得以确立。本节我们重点讨论基于高斯过程的贝叶斯优化相关成果,关于老虎机算法的详细论述将留待其他专题综述[29][117]展开。

There exist several early consistency proofs for Gaussian process based Bayesian optimization algorithms: one in the one-dimensional setting [102] and one for a simplification of the algorithm using simplicial partitioning in higher dimensions [164]. The consistency of the algorithm using multivariate Gaussian processes has been established in [155].
存在若干关于基于高斯过程的贝叶斯优化算法的早期一致性证明,包括一维场景下的研究[102],以及针对采用高维单纯形划分简化算法的证明[164]。而使用多元高斯过程的算法一致性已在文献[155]中得以确立。

More recently, [146] provided the first finite sample bound for Gaussian process based Bayesian optimization. In this work, the authors showed that the GP-UCB algorithm suffers from sub-linear cumulative regret in the stochastic setting. The regret bounds, however, allow only fixed hyperparameters. In [32], Bull provided both upper and lower bounds of simple regret for the EGO algorithm [82] in the deterministic setting. In addition to regret bounds concerning fixed hyperparameters, the author also provided simple regret bounds while allowing varying hyperparameters.
近期,文献[146]首次给出了基于高斯过程的贝叶斯优化有限样本界。该研究证明了在随机设置下,GP-UCB 算法具有次线性累积遗憾特性,但其遗憾界仅适用于固定超参数。Bull 在文献[32]中针对确定性场景下的 EGO 算法[82],同时给出了简单遗憾的上界和下界。除固定超参数的遗憾界外,作者还提供了允许超参数变化时的简单遗憾界。

Since the pioneering work of [146] and [32], there has emerged a large body of results on this topic, including exponentially vanishing simple regret bounds in the deterministic setting [43]; bounds for contextual Gaussian process bandits [89]; Bayes regret bounds for Thompson sampling [85], [127]; bounds for high-dimensional problems with an underlying low-rank structure [46]; bounds for parallel Bayesian optimization [45]; and improved regret bounds using mutual information [41].
自[146]和[32]的开创性工作以来,该领域涌现了大量研究成果,包括确定性设置下指数级衰减的简单遗憾界[43]、上下文高斯过程赌博机的边界[89]、汤普森采样的贝叶斯遗憾界[85][127]、具有潜在低维结构的高维问题边界[46]、并行贝叶斯优化的边界[45],以及利用互信息改进的遗憾界[41]。

Despite the recent surge in theoretical contributions, there is still a wide gap between theory and practice. Regret bounds or even consistency results, for example, have not been established for approaches that use a full Bayesian treatment of hyperparameters [143]. Such theoretical results could advance the field of Bayesian optimization and provide insight for practitioners.
尽管近期理论贡献激增,理论与实践之间仍存在巨大鸿沟。例如,对于采用超参数完全贝叶斯处理的方法[143],尚未建立遗憾界甚至一致性结果。此类理论成果或将推动贝叶斯优化领域发展,并为实践者提供洞见。

VII. HISTORY OF BAYESIAN OPTIMIZATION AND RELATED APPROACHES
第七章 贝叶斯优化及相关方法的历史沿革


Arguably the earliest work related to Bayesian optimization was that of William Thompson in 1933 where he considered the likelihood that one unknown Bernoulli probability is greater than another given observational data [150]. In his article, Thompson argues that when considering, for example, two alternative medical treatments one should not eliminate the worst one based on a single clinical trial. Instead, he proposes, one should estimate the probability that one treatment is better than the other and weigh future trials in favour of the seemingly better treatment while still trying the seemingly suboptimal one. Thompson rightly argues that by adopting a single treatment following a clinical trial, there is a fixed chance that all subsequent patients will be given suboptimal treatment. In contrast, by dynamically selecting a fraction of patients for each treatment, this sacrifice becomes vanishingly small.
可以说,最早与贝叶斯优化相关的工作是威廉·汤普森(William Thompson)在 1933 年完成的,他在研究中考虑了给定观测数据时一个未知伯努利概率大于另一个的概率[150]。汤普森在文章中提出,例如在考虑两种替代医疗方案时,不应仅凭一次临床试验就排除效果较差的那一种。相反,他建议应估计一种治疗方案优于另一种的概率,并在未来的试验中偏向看似更优的治疗方案,同时仍尝试看似次优的方案。汤普森正确地指出,如果在临床试验后仅采用单一治疗方案,那么所有后续患者接受次优治疗的概率是固定的。相比之下,通过动态地为每种治疗方案分配一部分患者,这种牺牲会变得微乎其微。

In modern terminology, Thompson was directly addressing the exploration-exploitation trade-off, referring to the tension between selecting the best known treatment for every future patient (the greedy strategy) and continuing the clinical trial for longer in order to more confidently assess the quality of both treatments. This is a recurring theme not only in the Bayesian optimization literature, but also the related fields of sequential experimental design, multi-armed bandits, and operations research.
在现代术语中,汤普森直接探讨了探索与利用之间的权衡,即指在为每位未来患者选择已知最佳治疗方案(贪婪策略)与延长临床试验以更准确地评估两种治疗质量之间的张力。这一主题不仅频繁出现在贝叶斯优化文献中,也贯穿于序列实验设计、多臂老虎机及运筹学等相关领域。

Although modern experimental design had been developed a decade earlier by Ronald Fisher's work on agricultural crops, Thompson introduced the idea of making design choices dynamically as new evidence becomes available; a general strategy known as sequential experimental design or, in the multi-armed bandit literature, adaptive or dynamic allocation rules [92], [59].
尽管现代实验设计早在十年前就由罗纳德·费希尔在农作物研究中奠定基础,但汤普森提出了随着新证据出现动态调整设计选择的理念;这一通用策略被称为序列实验设计,或在多臂老虎机文献中称作自适应或动态分配规则[92][59]。

The term Bayesian optimization was coined in the seventies [115], but a popular version of the method has been known as efficient global optimization in the experimental design literature since the nineties [134]. Since the approximation of the objective function is often obtained using Gaussian process priors, the technique is also referred to as Gaussian process bandits [146].
贝叶斯优化这一术语于七十年代被提出[115],但该方法的一个流行版本自九十年代以来在实验设计文献中被称为高效全局优化[134]。由于目标函数的近似通常通过高斯过程先验获得,该技术也被称为高斯过程赌博机[146]。

In the nonparametric setting, Kushner [91] used Wiener processes for unconstrained one-dimensional optimization problems. Kushner's decision model was based on maximizing the probability of improvement. He also included a parameter that controlled the trade-off between 'more global' and 'more local' optimization, in the same spirit as the exploration-exploitation trade-off. Meanwhile, in the former Soviet Union, Močkus and colleagues developed a multidimensional Bayesian optimization method using linear combinations of Wiener fields [115], [114]. Both of these methods, probability of improvement and expected improvement, were studied in detail in [81].
在非参数设定下,Kushner[91]将维纳过程用于无约束一维优化问题。Kushner 的决策模型基于最大化改进概率,并引入了一个参数来控制'更全局'与'更局部'优化之间的权衡,其思想与探索-利用权衡一致。与此同时,在前苏联,Močkus 及其同事利用维纳场的线性组合开发了多维贝叶斯优化方法[115][114]。这两种方法——改进概率与期望改进——在[81]中得到了详细研究。

At the same time, a large, related body of work emerged under the name kriging, in honour of the South African student who developed this technique at the University of the Witwatersrand [90], though largely popularized by Matheron and colleagues (e.g., [111]). In kriging, the goal is interpolation of a random field via a linear predictor. The errors on this model are typically assumed to not be independent, and are modelled with a Gaussian process.
与此同时,以克里金法(kriging)为名的大量相关研究涌现,这一方法由南非威特沃特斯兰德大学的学生提出[90],以表纪念,但主要由马特隆及其同事(如[111])推广普及。在克里金法中,目标是通过线性预测器对随机场进行插值。该模型的误差通常被假设为非独立,并通过高斯过程建模。

Kriging has been applied to experimental design under the name DACE, after design and analysis of computer experiments, the title of a paper by Sacks et al. [128] (and more recently a book by Santner et al. [130]). In DACE, the regression model is a best linear unbiased predictor (BLUP), and the residual model is a noise-free Gaussian process. The goal is to find a design point or points that optimizes some criterion. Experimental design is usually non-adaptive: the entire experiment is designed before data is collected. However, sequential design is an important and active subfield (e.g., [160], [33]).
克里金法在实验设计领域以 DACE(计算机实验设计与分析)之名得到应用,该名称源自 Sacks 等人[128]的论文标题(近期 Santner 等人的著作[130]亦有提及)。在 DACE 中,回归模型采用最佳线性无偏预测(BLUP),残差模型则为无噪声的高斯过程。其目标是找到一个或多个优化特定准则的设计点。实验设计通常是非自适应的:整个实验方案在数据收集前即已确定。然而,序贯设计是该领域一个重要且活跃的分支(如[160]、[33])。



TABLE I  表一

A LIST OF SEVERAL POPULAR OPEN SOURCE SOFTWARE LIBRARIES FOR BAYESIAN OPTIMIZATION AS OF MAY, 2015.
截至 2015 年 5 月,列举若干流行的开源贝叶斯优化软件库。

Package   | License                         | URL                                          | Language   | Model
SMAC      | Academic non-commercial license | http://www.cs.ubc.ca/labs/beta/Projects/SMAC | Java       | Random forest
Hyperopt  | BSD                             | https://github.com/hyperopt/hyperopt         | Python     | Tree Parzen estimator
Spearmint | Academic non-commercial license | https://github.com/HIPS/Spearmint            | Python     | Gaussian process
BayesOpt  | GPL                             | http://rmcantin.bitbucket.org/html           | C++        | Gaussian process
PyBO      | BSD                             | https://github.com/mwhoffman/pybo            | Python     | Gaussian process
MOE       | Apache 2.0                      | https://github.com/Yelp/MOE                  | Python/C++ | Gaussian process


The efficient global optimization (EGO) algorithm combines the DACE model with the sequential expected improvement acquisition criterion. It was published in a paper by Jones et al. [82] as a refinement of the SPACE algorithm (stochastic process analysis of computer experiments) [133]. Since EGO's publication, a body of work has evolved devoted to extending the algorithm, particularly by adding constraints to the optimization problem [6], [131], [23] and by modelling noisy functions [14], [75], [76].
高效全局优化(EGO)算法是将 DACE 模型与序列期望改进获取准则相结合的产物。该算法由 Jones 等人在一篇论文中发表[82],作为 SPACE 算法(计算机实验的随机过程分析)[133]的改进版。自 EGO 发表以来,已发展出一系列致力于扩展该算法的研究,特别是在优化问题中添加约束[6]、[131]、[23],以及建模含噪声函数[14]、[75]、[76]方面。

In the bandit setting, Lai and Robbins [92] introduced upper confidence bounds (UCB) as approximate alternatives to Gittins indices in 1985. Auer later studied such bounds using frequentist techniques, including in adversarial multi-armed bandit settings [9], [8].
在多臂老虎机设定中,Lai 和 Robbins 于 1985 年提出了上置信界(UCB)作为 Gittins 指数的近似替代方案。Auer 利用频率学派技术研究了这些界限,并应用于对抗性多臂老虎机场景[9]、[8]。

The literature on multi-armed bandits is vast. The book of Cesa-Bianchi and Lugosi [36] is a good reference on the topic of online learning with experts and bandits in adversarial settings. There are many results on exploration [30], [51], [50] and contextual bandits [97], [112], [2]. These contextual bandits may also be seen as myopic approximations to Markov decision processes.
多臂老虎机相关文献浩如烟海。Cesa-Bianchi 所著[36]是该领域关于对抗环境下在线学习与专家及老虎机问题的权威参考。关于探索策略的研究[30][51][50]和情境老虎机[97][112][2]成果丰硕。这些情境老虎机亦可视为马尔可夫决策过程的近视近似。

VIII. EXTENSIONS AND OPEN QUESTIONS
第八章 扩展与开放性问题


A. Constrained Bayesian optimization
A. 约束贝叶斯优化


In [56] a scenario was outlined in which a food company wished to design the best-tasting cookie subject to the number of calories being below a certain level. This is an example of constrained optimization, where certain regions of the design space X are invalid. In machine learning, this can arise when certain hyperparameter configurations result in models that diverge during training, or that run out of computer memory. When the constraints are known a priori, they can be incorporated into the optimization of the acquisition function. The more challenging case arises when it is not known in advance which configurations will result in a constraint violation. Several approaches deal with this problem by altering the acquisition function itself.
文献[56]中描述了一个场景:某食品公司希望在卡路里不超过特定水平的前提下设计出最美味的饼干。这是约束优化的一个实例,其中设计空间 X 的某些区域是无效的。在机器学习中,当某些超参数配置导致训练过程中模型发散或耗尽计算机内存时,也会出现这种情况。若约束条件已知,可将其纳入采集函数的优化过程。更具挑战性的情况是,无法预先知晓哪些配置会导致约束违反。针对此问题,已有多种方法通过修改采集函数本身来解决。

Gramacy and Lee [62] proposed the integrated expected conditional improvement (IECI) acquisition function:
Gramacy 和 Lee [62]提出了集成期望条件改进(IECI)获取函数:

\alpha_{\mathrm{IECI}}(x) = \int_{\mathcal{X}} \big( \alpha_{\mathrm{EI}}(x', \mathcal{D}_n) - \alpha_{\mathrm{EI}}(x' \mid \mathcal{D}_n, x) \big)\, h(x')\, dx'. \qquad (60)

This measures the expected change in EI at other locations x', weighted by the density h, that would result from observing the objective at x. Choosing h to model the probability of satisfying the constraint encourages IECI to favor regions with a high probability of being valid.
这给出了在密度 h 下观察 x 时预期改进的变化。选择 h 来模拟满足约束的概率,促使 IECI 倾向于具有高有效性概率的区域。
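To make Eq. (60) concrete, the following is a minimal numerical sketch, not the implementation of [62]: it assumes a zero-mean GP with a squared-exponential kernel, approximates the density h by a finite set of weighted reference points, and approximates the conditional EI term by refitting the GP with a 'fantasy' observation placed at the posterior mean of the candidate x (for a GP with fixed hyperparameters, the variance reduction caused by a new observation does not depend on the observed value). All names, kernel choices, and the minimization convention are illustrative.

import numpy as np
from scipy.stats import norm

def sqexp(A, B, ell=0.2, sf=1.0):
    # Squared-exponential kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Posterior mean and standard deviation of a zero-mean GP at the rows of Xs.
    K = sqexp(X, X) + noise * np.eye(len(X))
    Ks = sqexp(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    var = np.clip(np.diag(sqexp(Xs, Xs)) - (v ** 2).sum(axis=0), 1e-12, None)
    return Ks.T @ alpha, np.sqrt(var)

def expected_improvement(mu, sd, best):
    # EI for a minimization problem: E[max(best - f, 0)].
    z = (best - mu) / sd
    return (best - mu) * norm.cdf(z) + sd * norm.pdf(z)

def alpha_ieci(x, X, y, ref_points, ref_weights):
    # Eq. (60): expected reduction in EI at the reference points (a discrete
    # stand-in for the density h) caused by observing the objective at x.
    best = y.min()
    mu0, sd0 = gp_posterior(X, y, ref_points)
    ei_before = expected_improvement(mu0, sd0, best)
    mu_x, _ = gp_posterior(X, y, x[None, :])
    X_aug = np.vstack([X, x])        # fantasy observation at the posterior mean of x;
    y_aug = np.append(y, mu_x)       # only the induced variance shrinkage matters here
    mu1, sd1 = gp_posterior(X_aug, y_aug, ref_points)
    ei_after = expected_improvement(mu1, sd1, best)
    return float(np.sum(ref_weights * (ei_before - ei_after)))

Setting ref_weights proportional to an estimate of the probability that each reference point is feasible reproduces the constrained behaviour described above.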

Snoek [142] and Gelbart et al. [56] proposed the weighted expected improvement criterion (wEI) that multiplies EI by the probability of satisfying the constraints:
Snoek [142]和 Gelbart 等人[56]提出了加权预期改进准则(wEI),该准则将 EI 乘以满足约束的概率:

\alpha_{\mathrm{wEI}}(x) = \alpha_{\mathrm{EI}}(x, \mathcal{D}_n)\, h(x, \mathcal{D}_n). \qquad (61)

where h(x, D_n) is the probability that the constraints are satisfied, modelled with a Gaussian process with a Bernoulli observation model. This reduces EI in regions that are likely to violate the constraints.
其中 h(x,Dn) 是一个具有伯努利观测模型的高斯过程。这降低了在可能违反约束的区域中的 EI。

A variant of wEI was proposed in [52] to deal with the case where the function is constrained to be less than some value λ. They used h(x, D_t) = P(f(x) < λ | D_t), the posterior probability of satisfying this constraint under the Gaussian process model of the function.
文献[52]提出了一种改进的加权期望改进(wEI)方法,用于处理函数值需小于某特定约束条件 λ 的情况。该方法采用 h(x,Dt)=P(f(x)<λDt) ,即在高斯过程模型下满足该约束条件的后验概率。
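As a rough illustration of Eq. (61) and the λ-constrained variant of [52], the sketch below is a minimal example, not the implementation of any of the cited papers: it models a toy objective with a GP regressor and the binary feasibility observations with a GP classifier (a Gaussian process with a Bernoulli observation model), and multiplies EI by the predicted probability of feasibility. The toy data, kernels, and names are assumptions made only for the example.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor, GaussianProcessClassifier
from sklearn.gaussian_process.kernels import Matern

# A small design D_n: inputs, objective values, and binary feasibility labels.
xx, yy = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 4))
X = np.column_stack([xx.ravel(), yy.ravel()])          # 20 evaluated points
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2                 # toy objective (to be minimized)
feasible = (X.sum(axis=1) < 1.2).astype(int)           # toy constraint observations

objective_gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6).fit(X, y)
constraint_gp = GaussianProcessClassifier(kernel=Matern(nu=2.5)).fit(X, feasible)

def weighted_ei(X_cand, best):
    # alpha_wEI(x) = alpha_EI(x, D_n) * h(x, D_n), Eq. (61).
    mu, sd = objective_gp.predict(X_cand, return_std=True)
    sd = np.maximum(sd, 1e-9)
    z = (best - mu) / sd
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)   # EI for minimization
    p_feasible = constraint_gp.predict_proba(X_cand)[:, 1]
    return ei * p_feasible

X_cand = np.random.rand(500, 2)
best_feasible = y[feasible == 1].min()
x_next = X_cand[np.argmax(weighted_ei(X_cand, best_feasible))]

For the variant of [52], h would instead be the posterior probability P(f(x) < λ | D_t) computed from the objective model itself.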

Hernández-Lobato et al. [68] recently proposed a variation of the predictive entropy search acquisition function to deal with the decoupled case, where the function and constraints can be evaluated independently.
Hernández-Lobato 等人[68]近期提出了一种预测熵搜索采集函数的变体,以应对解耦场景——目标函数与约束条件可独立评估的情况。

In a different approach, Gramacy et al. [61] adapted the augmented Lagrangian approach to the Bayesian optimization setting, with unconstrained Bayesian optimization approximately solving the inner loop of the algorithm.
Gramacy 等学者[61]则采用不同思路,将增广拉格朗日法引入贝叶斯优化框架,通过无约束贝叶斯优化近似求解算法内循环问题。

B. Cost-sensitivity  B. 成本敏感性


In some cases, each function evaluation may return both a value and an associated cost. In other words, it may be more expensive to evaluate the function in some parts of the design space than in others. If there is a limited budget, then the search should be biased toward low-cost areas. In [143], the goal was to train a machine learning model and the cost was the time it took to train the model. They used expected improvement per second, α_EI(x, D_n)/c(x), in order to bias the search toward good models with fast training times. Here, c(x) was the estimated cost of querying the objective at x and was modelled using a Gaussian process with response log c(x).
在某些情况下,每次函数评估可能同时返回一个值及其相关成本。换言之,在设计空间的不同区域评估函数的代价可能更高。若预算有限,则搜索应偏向低成本区域。文献[143]中,目标为训练机器学习模型,其成本是模型训练所需时间。研究者采用每秒期望改进量( EI(x,Dn)/c(x) )来使搜索偏向于训练速度快且效果良好的模型。此处, c(x) 表示在 x 处查询目标函数的预估成本,并通过高斯过程建模,响应变量为 log(c(x))
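The following is a minimal sketch of the expected-improvement-per-second idea, under the assumption (as in the text) that log-cost is modelled with its own Gaussian process; the toy objective, cost model, and names are illustrative rather than the setup of [143].

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Observed objective values and wall-clock costs (seconds) at the same inputs.
X = np.linspace(0, 1, 15)[:, None]
y = np.sin(5 * X).ravel()                    # toy objective (to be minimized)
cost = 1.0 + 10.0 * X.ravel()                # toy cost: the right half is expensive

f_gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6).fit(X, y)
c_gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6).fit(X, np.log(cost))

def ei_per_second(X_cand, best):
    # Expected improvement divided by the predicted cost, alpha_EI(x, D_n) / c(x).
    mu, sd = f_gp.predict(X_cand, return_std=True)
    sd = np.maximum(sd, 1e-9)
    z = (best - mu) / sd
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)
    c_hat = np.exp(c_gp.predict(X_cand))     # the GP models log-cost, so exponentiate
    return ei / c_hat

X_cand = np.linspace(0, 1, 200)[:, None]
x_next = X_cand[np.argmax(ei_per_second(X_cand, y.min()))]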


C. High-dimensional problems
C. 高维问题


Despite many success stories, Bayesian optimization is restricted to problems of moderate dimension. To advance the state of the art, Bayesian optimization should be scaled to high-dimensional parameter spaces. This is a difficult problem: to ensure that a global optimum is found, we require good coverage of X, but as the dimensionality increases, the number of evaluations needed to cover X increases exponentially.
尽管成功案例众多,贝叶斯优化仍局限于中等维度问题。为推进技术前沿,需将其扩展至高维参数空间。这是个难题:要确保找到全局最优解,需良好覆盖 X ,但随着维度增加,覆盖 X 所需的评估次数呈指数级增长。

For linear bandits, Carpentier and Munos [35] recently proposed a compressed sensing strategy to attack problems with a high degree of sparsity. Also recently, Chen et al. [39] made significant progress by introducing a two-stage strategy for optimization and variable selection of high-dimensional GPs. In the first stage, sequential likelihood ratio tests with a couple of tuning parameters are used to select the relevant dimensions. This, however, requires the relevant dimensions to be axis-aligned with an ARD kernel. Chen et al. report empirical results only for synthetic examples (of up to 400 dimensions), but they do provide key theoretical guarantees.
对于线性赌博机问题,Carpentier 等人[35]最近提出了一种压缩感知策略来处理高度稀疏性问题。同样在近期,Chen 等人[39]通过引入一种两阶段策略在高维高斯过程的优化和变量选择方面取得了重要进展。在第一阶段,他们使用带有几个调谐参数的序贯似然比检验来选择相关维度。然而,这种方法要求相关维度必须与 ARD 核轴对齐。Chen 等人仅提供了合成示例(最多 400 维)的实证结果,但他们给出了关键的理论保证。

Hutter et al. [79] used Bayesian optimization with random forests based on frequentist uncertainty estimates. Their method does not have theoretical guarantees for continuous optimization, but it achieved state-of-the-art performance for tuning up to 76 parameters of algorithms for solving combinatorial problems. Note that in constructing the trees that make up the forest, one samples and selects the most promising features (dimensions). That is, random forests naturally select the relevant dimensions of the problem, and so, not surprisingly, they have worked well in practice.
Hutter 等人[79]使用了基于频率派不确定性估计的随机森林贝叶斯优化方法。他们的方法在连续优化方面缺乏理论保证,但在调整组合问题求解算法的多达 76 个参数时实现了最先进的性能。需要注意的是,在构建构成森林的决策树时,会对特征(维度)进行采样并选择最有前景的特征。也就是说,随机森林天然会选择问题的相关维度,因此在实践中表现良好并不令人意外。

Many researchers have noted that for certain classes of problems most dimensions do not change the objective function significantly; examples include hyperparameter optimization for neural networks and deep belief networks [17] and automatic configuration of state-of-the-art algorithms for solving NP-hard problems [77]. That is to say, these problems have low effective dimensionality. To take advantage of this property, Bergstra and Bengio [17] proposed to simply use random search for optimization - the rationale being that points sampled uniformly at random in each dimension can densely cover each low-dimensional subspace. As such, random search can exploit low effective dimensionality without knowing which dimensions are important. In [159], the authors exploit the same property, while still capitalizing on the strengths of Bayesian optimization. By combining randomization with Bayesian optimization, they were able to derive a new approach that outperforms each of the individual components.
许多研究者注意到,对于某些类型的问题,大多数维度对目标函数的影响并不显著;例如神经网络和深度信念网络的超参数优化[17],以及解决 NP 难题的最先进算法的自动配置[77]。也就是说,这些问题具有较低的有效维度。为了利用这一特性,Bergstra 和 Bengio[17]提出简单地使用随机搜索进行优化——其基本原理在于,在每个维度上均匀随机采样的点可以密集覆盖每个低维子空间。因此,随机搜索可以在不知道哪些维度重要的情况下利用低有效维度。在[159]中,作者利用了相同的特性,同时仍然发挥了贝叶斯优化的优势。通过将随机化与贝叶斯优化相结合,他们能够提出一种优于各自独立组件的新方法。

Figure 10 illustrates the approach in a nutshell. Assume we know that a given D = 2 dimensional black-box function f(x1, x2) only has d = 1 important dimension, but we do not know which of the two dimensions is the important one. We can then perform optimization in the embedded 1-dimensional subspace defined by x1 = x2, since this is guaranteed to include the optimum. This idea enables us to perform Bayesian optimization in a low-dimensional space to optimize a high-dimensional function with low intrinsic dimensionality. Importantly, it is not restricted to cases with axis-aligned intrinsic dimensions.
图 10 简要展示了该方法的核心思想。假设我们已知某个 D=2 维黑盒函数 f(x1,x2) 仅有 d=1 个重要维度,但无法确定具体是哪两个维度。此时我们可以在由 x1=x2 定义的 1 维嵌入子空间中进行优化,因为该子空间必然包含最优解。这一思路使我们能够通过在低维空间进行贝叶斯优化来优化具有低内在维度的高维函数。值得注意的是,该方法并不局限于轴对齐内在维度的情况。




Fig. 10. This function in D = 2 dimensions only has d = 1 effective dimension: the vertical axis indicated with the word important in the right-hand figure. Hence, the one-dimensional embedding includes the two-dimensional function's optimizer. It is more efficient to search for the optimum along the one-dimensional random embedding than in the original two-dimensional space.
图 10. 这个 D=2 维函数仅有 d=1 个有效维度:如右图所示,标有"重要"字样的垂直轴即为有效维度。因此,一维嵌入空间已包含该二维函数的优化器。相较于原始二维空间,沿一维随机嵌入空间搜索最优解效率更高。


To make the discussion more precise, a function f : R^D → R has effective dimensionality d_e, with d_e < D, if there exists a linear effective subspace T of dimension d_e such that for all x_⊤ ∈ T ⊆ R^D and x_⊥ ∈ T^⊥ ⊆ R^D, we have f(x) = f(x_⊤ + x_⊥) = f(x_⊤), where the so-called constant subspace T^⊥ denotes the orthogonal complement of T. This definition simply states that the function does not change along the coordinates x_⊥, and hence the name for T^⊥.
为了使讨论更加精确,一个函数 f:RDR 将具有有效维度 de ,其中 de<D ,如果存在一个维度为 de 的线性有效子空间 T ,使得对于所有 xTRDxTRD ,以及 f(x)=f(x+x)=f(x) ,其中所谓的常数子空间 T 表示 T 的正交补。这个定义简单地说明函数沿着坐标 x 不发生变化,因此得名 T

Given this definition, Theorem 1 of [159] shows that problems of low effective dimensionality can be solved via random embedding. The theorem assumes we are given a function f : R^D → R with effective dimensionality d_e and a random matrix A ∈ R^{D×d} with independent entries sampled according to N(0, 1) and d ≥ d_e. It then shows that, with probability 1, for any x ∈ R^D there exists a z ∈ R^d such that f(x) = f(Az).
根据这个定义,[159]中的定理 1 表明,低有效维度的问题可以通过随机嵌入来解决。该定理假设给定一个具有有效维度 de 的函数 f:RDR 和一个随机矩阵 ARD×d ,其独立条目按照 N(0,1)dde 采样。然后它表明,以概率 1,对于任何 xRD ,存在一个 zRd 使得 f(x)=f(Az)

Effectively, the theorem says that given any x ∈ R^D and a random matrix A ∈ R^{D×d}, with probability 1 there is a point z ∈ R^d such that f(x) = f(Az). This implies that for any optimizer x* ∈ R^D, there is a point z* ∈ R^d with f(x*) = f(Az*). Therefore, instead of optimizing in the high-dimensional space, we can optimize the function g(z) = f(Az) in the lower-dimensional space. This observation gives rise to an algorithm called Bayesian optimization with random embedding (REMBO), described in Algorithm 3. REMBO first draws a random embedding (given by A) and then performs Bayesian optimization in this embedded space.
该定理实质上表明,给定任意 xRD 和一个随机矩阵 ARD×d ,以概率 1 存在一个点 zRd 使得 f(x)=f(Az) 。这意味着对于任何优化器 xRD ,都存在一个点 zRd 满足 f(x)=f(Az) 。因此,我们无需在高维空间中进行优化,而可以在低维空间中优化函数 g(z)=f(Az) 。这一观察催生了一种称为随机嵌入贝叶斯优化(REMBO)的算法,如算法 3 所述。REMBO 首先生成一个随机嵌入(由 A 给出),然后在此嵌入空间内执行贝叶斯优化。


Algorithm 3 REMBO

Generate a random matrix A ∈ R^{D×d}
Choose the bounded region Z ⊂ R^d
for n = 1, 2, ... do
    select z_{n+1} by optimizing the acquisition function α:
        z_{n+1} = argmax_{z ∈ Z} α(z | D_n)
    augment the data D_{n+1} = {D_n, (z_{n+1}, f(Az_{n+1}))}
    update the kernel hyperparameters
end for




An important detail is how REMBO chooses the bounded region Z inside which it performs Bayesian optimization. This is important because its effectiveness depends on the size of Z. Locating the optimum within Z is easier if Z is small, but if we set Z too small it may not actually contain the global optimizer. We refer the readers to the original paper for details.
一个关键细节在于 REMBO 如何选择有界区域 Z ,并在其内部执行贝叶斯优化。这一点很重要,因为其效果取决于 Z 的大小。在 Z 内定位最优解会更容易,如果 Z 较小;但若将 Z 设得过小,可能无法包含全局最优解。具体细节我们建议读者参阅原论文。
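A minimal sketch of the random-embedding idea behind Algorithm 3 is given below, under several assumptions made only for illustration: a toy 25-dimensional objective that depends on two coordinates, the box X = [-1, 1]^D with Az clipped back into it, Z = [-sqrt(d), sqrt(d)]^d as one simple choice of bounded region, and plain random search over Z standing in for the Gaussian process inner loop. None of the names or constants come from the original paper.

import numpy as np

D, d = 25, 2                        # extrinsic dimension and assumed effective dimension
rng = np.random.default_rng(0)

def f(x):
    # Toy black-box in D dimensions whose value depends on only two coordinates.
    return (x[3] - 0.2) ** 2 + (x[17] + 0.4) ** 2

A = rng.standard_normal((D, d))     # random embedding with entries drawn from N(0, 1)
Z_bound = np.sqrt(d)                # one simple choice: Z = [-sqrt(d), sqrt(d)]^d

def g(z):
    # Low-dimensional objective g(z) = f(Az), with Az clipped back into X = [-1, 1]^D.
    x = np.clip(A @ z, -1.0, 1.0)
    return f(x)

# Stand-in for the Bayesian optimization loop of Algorithm 3: plain random search
# over Z keeps the example short; REMBO would use a GP surrogate and an acquisition.
Z_cand = rng.uniform(-Z_bound, Z_bound, size=(2000, d))
values = np.array([g(z) for z in Z_cand])
z_best = Z_cand[values.argmin()]
x_best = np.clip(A @ z_best, -1.0, 1.0)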





Fig. 11. Left: Three correlated functions drawn from a multi-output GP. Middle: The GP posterior predictive distribution of function (3) when the functions are assumed to be independent. This is equivalent to ignoring observations from functions (1) and (2). Right: Posterior predictive distribution of function (3) when the correlations are taken into account. Here, the observations from functions (1) and (2) act as weak observations for function (3). This results in a much more accurate prediction.
图 11。左:从多输出高斯过程中抽取的三个相关函数。中:假设函数间独立时,函数(3)的高斯过程后验预测分布。这等同于忽略函数(1)和(2)的观测数据。右:考虑相关性时的函数(3)后验预测分布。此时,函数(1)和(2)的观测数据作为函数(3)的弱观测信号,显著提升了预测精度。


D. Multi-Task  D. 多任务


When tuning the hyperparameters of a machine learning model on some data, it is unlikely that the hyperparameters will change very much if new data is added to the original data, especially if the new data represents a small fraction of the total amount. Likewise, if one were to train a model for object recognition, then good hyperparameter settings are likely to also be good on other object recognition datasets. Experts often exploit this property when applying their models to new datasets.
在针对某数据集调整机器学习模型的超参数时,若新增数据量占比较小,原始数据的超参数设置通常不会发生显著变化。同理,若某组超参数在物体识别任务中表现良好,它们很可能在其他物体识别数据集上同样有效。专家在将模型迁移至新数据集时,常利用这一特性。

There have been several attempts to exploit this property within the Bayesian optimization framework [89], [79], [12], [148], [161], [49]. The idea is that there are several correlated functions, T={1,2,,M} ,called tasks and that we are interested in optimizing some subset of these tasks. In essence, the data from one task can provide information about another task.
已有多次尝试在贝叶斯优化框架中利用这一特性[89]、[79]、[12]、[148]、[161]、[49]。其核心思想是存在多个被称为任务的关联函数 T={1,2,,M} ,而我们关注的是优化这些任务的某个子集。本质上,一个任务的数据能为其他任务提供信息。

One way to share information between tasks in a Bayesian optimization routine is to modify the underlying Gaussian process model. There has been a great deal of work on extending Gaussian processes to the multi-task scenario. These extensions are also known as multi-output Gaussian processes. The key is to define a valid covariance over input and task pairs, k((x,m),(x,m)) . One method is to use the intrinsic model of coregionalization (ICM) [60], [136], [21] that utilizes the product kernel,
在贝叶斯优化过程中共享任务信息的一种方法是修改底层高斯过程模型。关于将高斯过程扩展到多任务场景已有大量研究,这类扩展也被称为多输出高斯过程。关键在于定义输入与任务对之间的有效协方差 k((x,m),(x,m)) 。其中一种方法是采用内在共区域化模型(ICM)[60]、[136]、[21],该模型利用乘积核函数,

k\big((x, m), (x', m')\big) = k_{\mathcal{X}}(x, x')\, k_{\mathcal{T}}(m, m'). \qquad (62)

where m, m' ∈ T and k_T defines the covariance between tasks. There are many ways to parameterize the task covariance function [123]. Figure 11 illustrates how knowledge of correlations between tasks can be used with a multi-output GP to make more accurate predictions.
其中 m,mT.kT 定义了任务间的协方差。任务协方差函数的参数化方法多种多样[123]。图 11 展示了如何利用任务间相关性知识,结合多输出高斯过程实现更精准的预测。
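A small sketch of the product kernel in Eq. (62) follows; it uses a low-rank-plus-diagonal task covariance, which is one common way to parameterize k_T, and all numbers and names are illustrative.

import numpy as np

def k_x(x, x2, ell=0.5):
    # Input covariance k_X: squared-exponential on the inputs.
    return np.exp(-0.5 * np.sum((x - x2) ** 2) / ell ** 2)

# Task covariance k_T parameterized as a low-rank factor plus a diagonal:
# K_T = B B^T + diag(v), which guarantees a valid (positive semi-definite) matrix.
B = np.array([[1.0], [0.8], [0.3]])        # 3 tasks, rank-1 coregionalization (toy values)
v = np.array([0.05, 0.05, 0.05])
K_T = B @ B.T + np.diag(v)

def k_icm(x, m, x2, m2):
    # Eq. (62): k((x, m), (x', m')) = k_X(x, x') * k_T(m, m').
    return k_x(x, x2) * K_T[m, m2]

# Covariance between an observation of task 0 at x and a prediction for task 2 at x'.
cov = k_icm(np.array([0.1, 0.4]), 0, np.array([0.15, 0.35]), 2)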

An alternative view of the ICM model is that it defines a latent process that is rotated and scaled to produce each of the individual tasks. The problem of defining a multi-output GP can then be viewed as learning a latent function, or a set of latent functions, that can be transformed to produce the output tasks. [12] proposed an approach that learns a latent ranking function at each iteration using pairs of observations from within each task. By learning a single ranking function that works across tasks, the tasks are effectively jointly embedded in a latent space that is invariant to potentially different output scales across tasks.
ICM 模型的另一种观点认为,它定义了一个潜在过程,通过旋转和缩放该过程来生成各个独立任务。因此,定义多输出高斯过程的问题可视为学习一个或多个潜在函数,这些函数经转换后能生成输出任务。[12]提出了一种方法,在每次迭代中利用来自各任务内部的观测对学习潜在排序函数。通过跨任务学习单一排序函数,这些任务被有效地联合嵌入到一个潜在空间中,该空间对任务间可能存在的不同输出尺度具有不变性。

Each task may come with additional side information, or context features. In this case, it is possible to define a joint model that uses this context. This was considered for algorithm configuration in [79] using a random forest model. When starting a new task, [49] uses task features to find similar tasks. The best inputs from the most similar tasks are then used as the initial design for the new task.
每个任务可能附带额外的辅助信息或上下文特征。这种情况下,可以定义一个利用此类上下文的联合模型。[79]在算法配置中采用随机森林模型探讨了这一点。当启动新任务时,[49]利用任务特征寻找相似任务,并将最相似任务中的最优输入作为新任务的初始设计方案。

E. Freeze-Thaw  E. 冻结-解冻法


In some cases, the experiments selected by Bayesian optimization themselves require an inner loop of iterative optimization. For example, in the case of tuning machine learning hyperparameters, each experiment consists of training a model before evaluating it. It is often possible to evaluate the model during training in order to get an estimate of how it is performing. When tuning hyperparameters by hand, experts can use this information in order to estimate model performance at the end of training and can halt training early if this estimate looks unsatisfactory. This allows a far greater number of models to be trained in a given amount of time.
在某些情况下,贝叶斯优化选择的实验本身需要一个内部迭代优化循环。例如,在调整机器学习超参数的情况下,每个实验包括在评估模型之前对其进行训练。通常可以在训练过程中评估模型,以获取其性能的估计。手动调整超参数时,专家可以利用这些信息来估计训练结束时的模型性能,如果估计结果不理想,可以提前终止训练。这使得在给定时间内可以训练更多数量的模型。

An attempt to incorporate this into the Bayesian optimization framework is given in [149]. They identify that many loss functions in machine learning follow an exponential decay pattern during training, and construct a basis set of exponentially decaying functions of the form f(t, λ) = e^{-λt}, where t denotes training time and λ the rate of decay, in order to forecast model performance. It is possible to construct a nonstationary kernel from this basis set:
文献[149]尝试将这一思路融入贝叶斯优化框架。他们指出,机器学习中许多损失函数在训练过程中呈现指数衰减模式,并构建了一组以 f(t,λ)=eλt 形式表示的指数衰减基函数(其中 λ 代表随时间变化的衰减率,由 t 表示),用以预测模型性能。基于此基函数集可构造非平稳核函数:

k(t, t') = \frac{\beta^{\alpha}}{(t + t' + \beta)^{\alpha}}, \qquad (63)

where α and β are hyperparameters that control the shape of the kernel. This kernel is used within a Gaussian process to jointly model (x, t) pairs. Given the ability to forecast training curves, [149] then uses an entropy search-based acquisition function in order to determine whether to freeze a currently running experiment, thaw a previous experiment in order to resume training, or start a new experiment.
其中 αβ 是控制核函数形状的超参数。该核函数被用于高斯过程中联合建模(x,t)数据对。借助曲线预测能力,[149]随后采用基于熵搜索的采集函数来决定是冻结当前运行实验、解冻先前实验以继续训练,还是启动新实验。
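For concreteness, a minimal sketch of the kernel in Eq. (63) is shown below; the hyperparameter values are arbitrary. The closing comment records the standard interpretation of this kernel as an infinite mixture of the decay basis functions e^{-λt} with a Gamma prior over λ, which is what makes it a natural prior over training curves.

import numpy as np

def freeze_thaw_kernel(t, t2, alpha=1.0, beta=0.5):
    # Eq. (63): k(t, t') = beta^alpha / (t + t' + beta)^alpha.
    return beta ** alpha / (t + t2 + beta) ** alpha

# Covariance matrix over the first five epochs of a training curve.
steps = np.arange(1, 6, dtype=float)
K_time = freeze_thaw_kernel(steps[:, None], steps[None, :])

# The entries shrink as t + t' grows, reflecting that the kernel integrates the
# exponential-decay basis e^{-lambda t} e^{-lambda t'} against a Gamma(alpha, beta) prior.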

Rather than constructing a kernel, [48], [47] built a basis set manually based on previously collected training curves. This basis set is then used with Bayesian linear regression in order to forecast training curves, and an early stopping rule is given based on the probability of improvement using the forecasted value.
不同于构建核函数的方法,[48]和[47]基于历史训练曲线手动构建基函数集。该基函数集结合贝叶斯线性回归进行训练曲线预测,并基于预测值的改进概率给出了早停规则。
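To illustrate the basis-function alternative, here is a minimal Bayesian linear regression sketch under assumed priors, with a hand-picked set of decay rates and a constant offset as the basis; it is not the exact basis construction of [48], [47], and the data, prior variances, and the 0.05 stopping threshold are all illustrative.

import numpy as np
from scipy.stats import norm

# Partial training curve observed so far (e.g. validation loss per epoch).
t_obs = np.arange(1, 11, dtype=float)
y_obs = 0.3 + 0.7 * np.exp(-0.25 * t_obs) + 0.01 * np.random.default_rng(0).standard_normal(10)

# Hand-built basis: exponential decays phi_j(t) = exp(-lambda_j t) plus a constant offset.
lambdas = np.array([0.05, 0.1, 0.2, 0.4, 0.8])
def design(t):
    return np.column_stack([np.exp(-lam * t) for lam in lambdas] + [np.ones_like(t)])

# Bayesian linear regression with prior w ~ N(0, tau^2 I) and noise variance sigma^2.
tau2, sigma2 = 1.0, 0.01 ** 2
Phi = design(t_obs)
S = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / tau2)   # posterior covariance
w_mean = S @ Phi.T @ y_obs / sigma2                                     # posterior mean

# Forecast the loss at a late epoch and compute the probability that it improves
# on the best loss seen so far; stop the run early if that probability is tiny.
phi_T = design(np.array([100.0]))
mu_T = (phi_T @ w_mean).item()
var_T = (phi_T @ S @ phi_T.T).item() + sigma2
best_so_far = 0.28
p_improve = norm.cdf((best_so_far - mu_T) / np.sqrt(var_T))
stop_early = p_improve < 0.05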

An alternative view of this procedure is to consider Gaussian process models that incorporate partial feedback. This view is used in [122], where the authors construct a Gaussian process with a non-stationary noise process that starts high when the experiment begins and decays over time.
该过程的另一种视角是考虑融入部分反馈的高斯过程模型。这一观点在文献[122]中得到应用,其中构建了一个具有非平稳噪声过程的高斯过程,该噪声在实验开始时较高,并随时间逐渐衰减。

IX. Concluding Remarks  九、结论性评述


In this paper we have introduced Bayesian optimization from a modelling perspective. Beginning with the Beta-Bernoulli and linear models, and extending them to nonparametric models, we recover a wide range of approaches to Bayesian optimization that have been introduced in the literature. A great deal of work has focussed heavily on designing acquisition functions; however, we have taken the perspective that this choice plays a secondary role to the choice of the underlying surrogate model.
本文从建模角度介绍了贝叶斯优化方法。从 Beta-伯努利模型和线性模型出发,延伸至非参数模型,我们梳理了文献中提出的各类贝叶斯优化方法。尽管大量研究聚焦于获取函数的设计,但我们认为相较于底层代理模型的选择,其重要性居于次要地位。


In addition to outlining different modelling choices, we have considered many of the design decisions that are used to build Bayesian optimization systems. We further highlighted relevant theory as well as practical considerations that are used when applying these techniques to real-world problems. We provided a history of Bayesian optimization and related fields and surveyed some of the many successful applications of these methods. We finally discussed extensions of the basic framework to new problem domains, which often require new kinds of surrogate models.
除了概述不同的建模选择外,我们还探讨了构建贝叶斯优化系统时涉及的诸多设计决策。我们进一步强调了相关理论以及在现实问题中应用这些技术时的实际考量。文章回顾了贝叶斯优化及其相关领域的发展历程,并综述了该方法在众多成功案例中的应用。最后,我们讨论了基础框架向新问题领域的扩展,这些领域通常需要新型的代理模型。

Although the underpinnings of Bayesian optimization are quite old, the field itself is undergoing a resurgence, aided by new problems, models, theory, and software implementations. In this paper, we have attempted to summarize the current state of Bayesian optimization methods; however, it is clear that the field itself has only scratched the surface and that there will surely be many new problems, discoveries, and insights in the future.
尽管贝叶斯优化的理论基础由来已久,但该领域本身正借助新问题、新模型、新理论及软件实现迎来复兴。本文试图总结当前贝叶斯优化方法的发展现状,但显然该领域仅触及冰山一角,未来必将涌现更多新问题、新发现与新见解。

REFERENCES  参考文献


[1] R. P. Adams and O. Stegle. Gaussian process product models for nonparametric nonstationarity. In International Conference on Machine Learning, pages 1-8, 2008.
[1] R. P. Adams 与 O. Stegle. 非参数非平稳性的高斯过程乘积模型. 见《国际机器学习会议》, 第 1-8 页, 2008 年。

[2] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 2013.
[2] S. Agrawal 与 N. Goyal。基于线性收益的情境赌博机中的汤普森采样。发表于《国际机器学习会议》,2013 年。

[3] E. B. Anderes and M. L. Stein. Estimating deformations of isotropic Gaussian random fields on the plane. The Annals of Statistics, 36(2):719-741, 2008.
[3] E. B. Anderes 与 M. L. Stein。平面各向同性高斯随机场形变的估计。《统计学年刊》,36(2):719-741,2008 年。

[4] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5-43, 2003.
[4] C. Andrieu、N. de Freitas、A. Doucet 与 M. I. Jordan。机器学习中的 MCMC 方法导论。《机器学习》,50(1-2):5-43,2003 年。

[5] J.-A. M. Assael, Z. Wang, and N. de Freitas. Heteroscedastic treed Bayesian optimisation. arXiv preprint arXiv:1410.7172, 2014.
[5] J.-A. M. Assael、Z. Wang 与 N. de Freitas。异方差树贝叶斯优化。arXiv 预印本 arXiv:1410.7172,2014 年。

[6] C. Audet, J. Jr, Dennis, D. W. Moore, A. Booker, and P. D. Frank. Surrogate-model-based method for constrained optimization. In AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, 2000.
[6] C. 奥德特、J. 小丹尼斯、D. W. 摩尔、A. 布克与 P. D. 弗兰克。基于代理模型的约束优化方法。收录于《AIAA/美国空军/NASA/ISSMO 多学科分析与优化研讨会》,2000 年。

[7] J. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory, 2010.
[7] J. 奥迪伯特、S. 布贝克与 R. 穆诺斯。多臂老虎机中的最优臂识别。收录于《学习理论会议》,2010 年。

[8] P. Auer. Using confidence bounds for exploitation-exploration tradeoffs. Journal of Machine Learning Research, 3:397-422, 2003.
[8] P. 奥尔。利用置信边界进行开发-探索权衡。《机器学习研究期刊》第 3 期,397-422 页,2003 年。

[9] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Symposium on Foundations of Computer Science, pages 322-331, 1995.
[9] P. 奥尔、N. 切萨-比安奇、Y. 弗伦德与 R. E. 沙皮尔。在作弊赌场中赌博:对抗性多臂老虎机问题。收录于《计算机科学基础研讨会》,322-331 页,1995 年。

[10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The non-stochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
[10] P.奥尔、N.切萨-比安奇、Y.弗洛伊德与 R.E.沙皮尔。《非随机多臂老虎机问题》。SIAM 计算期刊,32(1):48-77,2002 年。

[11] J. Azimi, A. Jalali, and X. Fern. Hybrid batch Bayesian optimization. In International Conference on Machine Learning, 2012.
[11] J.阿齐米、A.贾拉利与 X.弗恩。《混合批次贝叶斯优化》。收录于国际机器学习大会,2012 年。

[12] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In International Conference on Machine Learning, 2013.
[12] R.巴登内、M.布伦德尔、B.凯格尔与 M.塞巴格。《协同超参数调优》。收录于国际机器学习大会,2013 年。

[13] R. Bardenet and B. Kégl. Surrogating the surrogate: accelerating Gaussian-process-based global optimization with a mixture cross-entropy algorithm. In International Conference on Machine Learning, pages 55-62, 2010.
[13] R.巴登内与 B.凯格尔。《替代替代者:利用混合交叉熵算法加速基于高斯过程的全局优化》。国际机器学习大会,55-62 页,2010 年。

[14] T. Bartz-Beielstein, C. Lasarczyk, and M. Preuss. Sequential parameter optimization. In Proc. CEC-05, 2005.
[14] T·巴茨-拜尔斯坦、C·拉萨尔奇克与 M·普勒斯。序列参数优化。收录于《CEC-05 会议论文集》,2005 年。

[15] R. Benassi, J. Bect, and E. Vazquez. Robust Gaussian process-based global optimization using a fully Bayesian expected improvement criterion. In C. Coello, editor, Learning and Intelligent Optimization, volume 6683 of Lecture Notes in Computer Science, pages 176-190. Springer Berlin / Heidelberg, 2011.

[16] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546-2554, 2011.
[16] J·伯格斯特拉、R·巴尔德内、Y·本吉奥与 B·凯格尔。超参数优化算法。收录于《神经信息处理系统进展》,第 2546-2554 页,2011 年。

[17] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281-305, 2012.
[17] J·伯格斯特拉与 Y·本吉奥。超参数优化的随机搜索。《机器学习研究期刊》,13 卷:281-305 页,2012 年。

[18] J. Bergstra, B. Komer, C. Eliasmith, and D. Warde-Farley. Preliminary evaluation of hyperopt algorithms on HPOLib. International Conference on Machine Learning AutoML Workshop, 2014.
[18] J·伯格斯特拉、B·科默、C·埃利亚史密斯及 D·沃德-法利。HPOLib 上 Hyperopt 算法的初步评估。《国际机器学习会议 AutoML 研讨会》,2014 年。

[19] J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, 2013.
[19] J·伯格斯特拉、D·亚明斯与 D·D·考克斯。将模型搜索科学化:面向视觉架构的数百维超参数优化。《国际机器学习会议》,2013 年。

[20] S. Bochner. Lectures on Fourier integrals. Princeton University Press, 1959.
[20] S·博赫纳。傅里叶积分讲义。普林斯顿大学出版社,1959 年。

[21] E. V. Bonilla, K. M. A. Chai, and C. K. I. Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, 2008.
[21] E·V·邦利亚,K·M·A·柴,与 C·K·I·威廉姆斯。多任务高斯过程预测。载于《神经信息处理系统进展》,2008 年。

[22] L. Bornn, G. Shaddick, and J. V. Zidek. Modeling nonstationary processes through dimension expansion. Journal of the American Statistical Society, 107(497), 2012.
[22] L·伯恩,G·沙迪克,与 J·V·齐德克。通过维度扩展建模非平稳过程。《美国统计学会杂志》,107(497),2012 年。

[23] P. Boyle. Gaussian Processes for Regression and Optimisation. PhD thesis, Victoria University of Wellington, Wellington, New Zealand, 2007.
[23] P·博伊尔。回归与优化的高斯过程。博士论文,新西兰惠灵顿维多利亚大学,2007 年。

[24] L. Breiman. Random forests. Machine learning, 45(1):5-32, 2001.
[24] L·布雷曼。随机森林。《机器学习》,45(1):5-32,2001 年。

[25] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.
[25] L·布雷曼、J·弗里德曼、R·奥尔申与 C·斯通。《分类与回归树》。沃兹沃思与布鲁克斯出版社,1984 年。

[26] E. Brochu, T. Brochu, and N. de Freitas. A Bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 103-112, 2010.
[26] E·布罗楚、T·布罗楚与 N·德弗雷塔斯。贝叶斯交互式优化方法在程序动画设计中的应用。收录于《2010 年 ACM SIGGRAPH/欧洲图形学研讨会计算机动画论文集》,第 103-112 页,2010 年。

[27] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report UBC TR-2009-23 and arXiv:1012.2599v1, Dept. of Computer Science, University of British Columbia, 2009.
[27] E·布罗楚、V·M·科拉与 N·德弗雷塔斯。高成本函数贝叶斯优化教程及其在主动用户建模与分层强化学习中的应用。技术报告 UBC TR-2009-23 及 arXiv:1012.2599v1,不列颠哥伦比亚大学计算机科学系,2009 年。

[28] E. Brochu, N. de Freitas, and A. Ghosh. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems, pages 409-416, 2007.
[28] E·布罗楚、N·德弗雷塔斯与 A·戈什。基于离散选择数据的主动偏好学习。收录于《神经信息处理系统进展》,第 409-416 页,2007 年。

[29] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
[29] S. Bubeck 与 N. Cesa-Bianchi。随机与非随机多臂老虎机问题的遗憾分析。《机器学习基础与趋势》,5(1):1-122,2012 年。

[30] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, 2009.
[30] S. Bubeck、R. Munos 和 G. Stoltz。多臂老虎机问题中的纯粹探索。载于《国际算法学习理论会议》,2009 年。

[31] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvari. X-armed bandits. Journal of Machine Learning Research, 12:1655-1695, 2011.
[31] S. Bubeck、R. Munos、G. Stoltz 和 C. Szepesvari。X-武装老虎机。《机器学习研究杂志》,12:1655-1695,2011 年。

[32] A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879-2904, 2011.
[32] A. D. Bull。高效全局优化算法的收敛速率。《机器学习研究杂志》,12:2879-2904,2011 年。

[33] D. Busby. Hierarchical adaptive experimental design for Gaussian process emulators. Reliability Engineering and System Safety, 94(7):1183- 1193, July 2009.
[33] D. Busby. 高斯过程仿真器的分层自适应实验设计. 可靠性工程与系统安全, 94(7):1183-1193, 2009 年 7 月.

[34] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisen-roth. Manifold Gaussian processes for regression. arXiv preprint arXiv:1402.5876, 2014.
[34] R. Calandra, J. Peters, C. E. Rasmussen, 和 M. P. Deisenroth. 流形高斯过程回归. arXiv 预印本 arXiv:1402.5876, 2014 年.

[35] A. Carpentier and R. Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In AI and Statistics,pages 190-198, 2012.
[35] A. Carpentier 和 R. Munos. 高维随机线性赌博机的压缩感知理论. 收录于 AI 与统计会议, 页码 190-198, 2012 年.

[36] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, 2006.
[36] N. Cesa-Bianchi 和 G. Lugosi. 预测、学习与博弈. 剑桥大学出版社, 纽约, 2006 年.

[37] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273-304, 1995.
[37] K. 查洛纳与 I. 韦尔迪内利。贝叶斯实验设计综述。《统计科学》,第 273-304 页,1995 年。

[38] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249- 2257, 2011.
[38] O. 沙佩尔与 L. 李。汤普森采样的实证评估。《神经信息处理系统进展》,第 2249-2257 页,2011 年。

[39] B. Chen, R. Castro, and A. Krause. Joint optimization and variable selection of high-dimensional Gaussian processes. In International Conference on Machine Learning, 2012.
[39] B. 陈、R. 卡斯特罗与 A. 克劳泽。高维高斯过程的联合优化与变量选择。《国际机器学习会议》,2012 年。

[40] S. Clark. Parallel Machine Learning Algorithms In Bioinformatics And Global Optimization. PhD thesis, Cornell University, 2012.
[40] S. 克拉克。生物信息学与全局优化中的并行机器学习算法。博士论文,康奈尔大学,2012 年。


[41] E. Contal, V. Perchet, and N. Vayatis. Gaussian process optimization with mutual information. In International Conference on Machine Learning, 2013.
[41] E. Contal, V. Perchet 和 N. Vayatis。基于互信息的高斯过程优化。收录于国际机器学习会议,2013 年。

[42] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, (7):81-227, 2011.

[43] N. de Freitas, A. Smola, and M. Zoghi. Exponential regret bounds for Gaussian process bandits with deterministic observations. In International Conference on Machine Learning, 2012.
[43] N. de Freitas, A. Smola, 和 M. Zoghi。确定性观测下高斯过程赌博机的指数遗憾界。《国际机器学习会议》,2012 年。

[44] M. Denil, L. Bazzani, H. Larochelle, and N. de Freitas. Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8):2151-2184, 2012.
[44] M. Denil, L. Bazzani, H. Larochelle, 和 N. de Freitas。通过深度架构学习图像跟踪中的注意力位置。《神经计算》,24(8):2151-2184,2012 年。

[45] T. Desautels, A. Krause, and J. Burdick. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. Journal of Machine Learning Research, 2014.
[45] T. Desautels, A. Krause, 和 J. Burdick。高斯过程赌博优化的并行探索-开发权衡。《机器学习研究期刊》,2014 年。

[46] J. Djolonga, A. Krause, and V. Cevher. High dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems, pages 1025-1033, 2013.
[46] J·焦隆加、A·克劳斯与 V·切夫赫尔。高维高斯过程赌博机。《神经信息处理系统进展》,第 1025-1033 页,2013 年。

[47] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), July 2015.
[47] T·多姆汉、J·T·斯普林伯格与 F·胡特。通过学习曲线外推加速深度神经网络超参数自动优化。《第 24 届国际人工智能联合会议论文集(IJCAI)》,2015 年 7 月。

[48] T. Domhan, T. Springenberg, and F. Hutter. Extrapolating learning curves of deep neural networks. In International Conference on Machine Learning AutoML Workshop, 2014.
[48] T·多姆汉、T·斯普林伯格与 F·胡特。深度神经网络学习曲线外推。《机器学习国际会议 AutoML 研讨会》,2014 年。

[49] M. Feurer, T. Springenberg, and F. Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In National Conference on Artificial Intelligence (AAAI), 2015.
[49] M·福伊雷尔、T·斯普林伯格与 F·胡特。通过元学习初始化贝叶斯超参数优化。《美国人工智能协会全国会议(AAAI)》,2015 年。

[50] V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pages 3212-3220, 2012.
[50] V. Gabillon, M. Ghavamzadeh, 与 A. Lazaric。最优臂识别:固定预算与固定置信度的统一方法。载于《神经信息处理系统进展》,第 3212-3220 页,2012 年。

[51] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-bandit best arm identification. In Advances in Neural Information Processing Systems, pages 2222-2230, 2011.
[51] V. Gabillon, M. Ghavamzadeh, A. Lazaric, 与 S. Bubeck。多臂老虎机最优臂识别。载于《神经信息处理系统进展》,第 2222-2230 页,2011 年。

[52] J. R. Gardner, M. J. Kusner, Z. Xu, K. Q. Weinberger, and J. P. Cunningham. Bayesian optimization with inequality constraints. In International Conference on Machine Learning, pages 937-945, 2014.
[52] J. R. Gardner, M. J. Kusner, Z. Xu, K. Q. Weinberger, 与 J. P. Cunningham。带不等式约束的贝叶斯优化。载于《国际机器学习会议》,第 937-945 页,2014 年。

[53] R. Garnett, M. A. Osborne, and P. Hennig. Active learning of linear embeddings for Gaussian processes. In Uncertainty in Artificial Intelligence, 2014.
[53] R. Garnett, M. A. Osborne, 与 P. Hennig。高斯过程线性嵌入的主动学习。载于《人工智能不确定性会议》,2014 年。

[54] R. Garnett, M. A. Osborne, S. Reece, A. Rogers, and S. J. Roberts. Sequential Bayesian prediction in the presence of changepoints and faults. The Computer Journal, 53(9):1430-1446, 2010.
[54] R. Garnett, M. A. Osborne, S. Reece, A. Rogers, 和 S. J. Roberts。存在变点和故障情况下的序贯贝叶斯预测。《计算机期刊》,53(9):1430-1446,2010 年。

[55] R. Garnett, M. A. Osborne, and S. J. Roberts. Bayesian optimization for sensor set selection. In ACM/IEEE International Conference on Information Processing in Sensor Networks, pages 209-219. ACM, 2010.
[55] R. Garnett, M. A. Osborne, 和 S. J. Roberts。传感器组选择的贝叶斯优化。收录于《ACM/IEEE 传感器网络信息处理国际会议》,第 209-219 页。ACM,2010 年。

[56] M. A. Gelbart, J. Snoek, and R. P. Adams. Bayesian optimization with unknown constraints. In Uncertainty in Artificial Intelligence, 2014.
[56] M. A. Gelbart, J. Snoek, 和 R. P. Adams。未知约束下的贝叶斯优化。收录于《不确定性人工智能》,2014 年。

[57] D. Ginsbourger, R. Le Riche, and L. Carraro. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pages 131-162. Springer, 2010.
[57] D. Ginsbourger, R. Le Riche, 和 L. Carraro。克里金法非常适合并行化优化。收录于《昂贵优化问题中的计算智能》,第 131-162 页。Springer,2010 年。

[58] D. Ginsbourger and R. L. Riche. Dealing with asynchronicity in parallel Gaussian process based global optimization. http://hal.archives-ouvertes.fr/hal-00507632, 2010.
[58] D. Ginsbourger 与 R. L. Riche。处理并行高斯过程全局优化中的异步性问题。http://hal.archives-ouvertes.fr/hal-00507632,2010 年。

[59] J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), pages 148- 177, 1979.
[59] J. C. Gittins。老虎机过程与动态分配指标。《皇家统计学会杂志》B 辑(方法论),148-177 页,1979 年。

[60] P. Goovaerts. Geostatistics for natural resources evaluation. Oxford University Press, 1997.
[60] P. Goovaerts。《自然资源评估中的地质统计学》。牛津大学出版社,1997 年。

[61] R. B. Gramacy, G. A. Gray, S. L. Digabel, H. K. Lee, P. Ranjan, G. Wells, and S. M. Wild. Modeling an augmented Lagrangian for improved blackbox constrained optimization. arXiv preprint arXiv:1403.4890, 2014.
[61] R. B. Gramacy、G. A. Gray、S. L. Digabel、H. K. Lee、P. Ranjan、G. Wells 与 S. M. Wild。改进黑盒约束优化的增广拉格朗日建模。arXiv 预印本 arXiv:1403.4890,2014 年。

[62] R. B. Gramacy and H. K. Lee. Optimization under unknown constraints. arXiv preprint arXiv:1004.4027, 2010.
[62] R·B·格拉梅西与 H·K·李。《未知约束下的优化》。arXiv 预印本 arXiv:1004.4027,2010 年。

[63] R. B. Gramacy, H. K. H. Lee, and W. G. Macready. Parameter space exploration with Gaussian process trees. In International Conference on Machine Learning, pages 45-52, 2004.
[63] R·B·格拉梅西、H·K·H·李与 W·G·麦克雷迪。《基于高斯过程树的参数空间探索》。载于《国际机器学习会议》,第 45-52 页,2004 年。

[64] S. Grunewalder, J. Audibert, M. Opper, and J. Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In AI and Statistics, pages 273-280, 2010.
[64] S·格鲁内瓦尔德、J·奥迪贝尔特、M·奥珀与 J·肖-泰勒。《高斯过程老虎机问题的遗憾界》。载于《 AI 与统计》,第 273-280 页,2010 年。

[65] F. Hamze, Z. Wang, and N. de Freitas. Self-avoiding random dynamics on integer complex systems. ACM Transactions on Modeling and Computer Simulation (TOMACS), 23(1):9, 2013.
[65] F·哈姆泽、Z·王与 N·德弗雷塔斯。《整数复杂系统上的自回避随机动力学》。载于《ACM 建模与计算机模拟汇刊》(TOMACS),第 23 卷第 1 期第 9 页,2013 年。

[66] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evol. Comput., 9(2):159-195, 2001.
[66] N. Hansen 与 A. Ostermeier。进化策略中完全去随机化的自我适应。《进化计算》,9(2):159-195,2001 年。

[67] P. Hennig and C. Schuler. Entropy search for information-efficient global optimization. The Journal of Machine Learning Research, 13(1):1809-1837, 2012.
[67] P. Hennig 与 C. Schuler。信息高效全局优化的熵搜索。《机器学习研究杂志》,13(1):1809-1837,2012 年。

[68] J. M. Hernández-Lobato, M. A. Gelbart, M. W. Hoffman, R. P. Adams, and Z. Ghahramani. Predictive entropy search for Bayesian optimization with unknown constraints. In International Conference on Machine Learning, 2015.
[68] J. M. Hernández-Lobato、M. A. Gelbart、M. W. Hoffman、R. P. Adams 与 Z. Ghahramani。未知约束下贝叶斯优化的预测熵搜索。载于《国际机器学习会议》,2015 年。

[69] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems. 2014.
[68] J. M. Hernández-Lobato、M. W. Hoffman 与 Z. Ghahramani。黑箱函数高效全局优化的预测熵搜索。载于《神经信息处理系统进展》,2014 年。

[70] D. Higdon, J. Swall, and J. Kern. Non-stationary spatial modeling. Bayesian Statistics, 6, 1998.
[70] D. Higdon、J. Swall 和 J. Kern。非平稳空间建模。《贝叶斯统计学》,第 6 卷,1998 年。

[71] G. E. Hinton and R. Salakhutdinov. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in Neural Information Processing Systems, pages 1249-1256, 2008.
[71] G. E. Hinton 和 R. Salakhutdinov。使用深度信念网络学习高斯过程的协方差核。《神经信息处理系统进展》,第 1249-1256 页,2008 年。

[72] M. Hoffman, B. Shahriari, and N. de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In AI and Statistics,pages 365-374, 2014.
[72] M. Hoffman、B. Shahriari 和 N. de Freitas。基于模型的带约束相关性和预算的赌博机优化及其在自动机器学习中的应用。《 AI 与统计》,第 365-374 页,2014 年。

[73] M. W. Hoffman, E. Brochu, and N. de Freitas. Portfolio allocation for Bayesian optimization. In Uncertainty in Artificial Intelligence, pages 327-336, 2011.
[73] M. W. Hoffman、E. Brochu 和 N. de Freitas。贝叶斯优化的投资组合分配。《人工智能中的不确定性》,第 327-336 页,2011 年。

[74] H. H. Hoos. Programming by optimization. Communications of the ACM, 55(2):70-80, 2012.
[74] H·H·胡斯. 通过优化进行编程. 通信于 ACM,55(2):7080,2012

[75] D. Huang, T. Allen, W. Notz, and N. Zeng. Global optimization of stochastic black-box systems via sequential Kriging meta-models. J. of Global Optimization, 34(3):441-466, 2006.
[75] 黄东、T·艾伦、W·诺茨与 N·曾. 基于序列克里金元模型的随机黑箱系统全局优化. 《全局优化杂志》, 34(3):441-466, 2006 年。

[76] F. Hutter. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD thesis, University of British Columbia, Vancouver, Canada, 2009.
[76] F·胡特. 自动配置解决复杂计算问题的算法. 博士论文, 不列颠哥伦比亚大学, 加拿大温哥华, 2009 年。

[77] F. Hutter, H. Hoos, and K. Leyton-Brown. Identifying key algorithm parameters and instance features using forward selection. In LION, 2013.
[77] F·胡特、H·胡斯与 K·莱顿-布朗. 利用前向选择识别关键算法参数与实例特征. 发表于 LION 会议, 2013 年。

[78] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Automated configuration of mixed integer programming solvers. In CPAIOR, pages 186-202, 2010.
[78] F. Hutter, H. H. Hoos 与 K. Leyton-Brown。混合整数规划求解器的自动配置。收录于 CPAIOR,第 186-202 页,2010 年。

[79] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In LION, pages 507- 523, 2011.
[79] F. Hutter, H. H. Hoos 与 K. Leyton-Brown。通用算法配置的序列化模型优化。收录于 LION,第 507-523 页,2011 年。

[80] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Parallel algorithm configuration. In LION, pages 55-70, 2012.
[80] F. Hutter, H. H. Hoos 与 K. Leyton-Brown。并行算法配置。收录于 LION ,第 5570,2012 页。

[81] D. Jones. A taxonomy of global optimization methods based on response surfaces. J. of Global Optimization, 21(4):345-383, 2001.
[81] D. Jones。基于响应面的全局优化方法分类。《全局优化杂志》,21(4):345-383,2001 年。

[82] D. Jones, M. Schonlau, and W. Welch. Efficient global optimization of expensive black-box functions. J. of Global optimization, 13(4):455- 492, 1998.
[82] D. Jones、M. Schonlau 和 W. Welch。昂贵黑箱函数的高效全局优化。《全局优化杂志》,13(4):455-492,1998 年。

[83] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157-181, 1993.
[83] D. R. Jones、C. D. Perttunen 和 B. E. Stuckman。无需 Lipschitz 常数的 Lipschitz 优化。《优化理论与应用杂志》,79(1):157-181,1993 年。

[84] E. Kaufmann, O. Cappé, and A. Garivier. On bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 592-600, 2012.
[84] E. Kaufmann、O. Cappé 和 A. Garivier。关于多臂老虎机问题中的贝叶斯上置信界。《人工智能与统计国际会议》,592-600 页,2012 年。

[85] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, volume 7568 of Lecture Notes in Computer Science, pages 199-213. Springer Berlin Heidelberg, 2012.
[85] E. Kaufmann、N. Korda 和 R. Munos。汤普森采样:一种渐近最优的有限时间分析。《算法学习理论》,第 7568 卷《计算机科学讲义》,199-213 页。Springer Berlin Heidelberg,2012 年。

[86] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. Most likely het-eroscedastic Gaussian process regression. In International Conference on Machine Learning, pages 393-400, 2007.
[86] K. Kersting, C. Plagemann, P. Pfaff 和 W. Burgard。最可能异方差高斯过程回归。载于《国际机器学习会议》,第 393-400 页,2007 年。

[87] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282-293. 2006.
[87] L. Kocsis 和 C. Szepesvári。基于赌博机的蒙特卡洛规划。载于《欧洲机器学习会议》,第 282-293 页,2006 年。

[88] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne. Controlled experiments on the web: survey and practical guide. Data mining and knowledge discovery, 18(1):140-181, 2009.
[88] R. Kohavi, R. Longbotham, D. Sommerfield 和 R. M. Henne。网络受控实验:调查与实践指南。《数据挖掘与知识发现》,第 18 卷第 1 期,第 140-181 页,2009 年。

[89] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447-2455, 2011.
[89] A. Krause 和 C. S. Ong。上下文高斯过程赌博机优化。载于《神经信息处理系统进展》,第 2447-2455 页,2011 年。

[90] D. G. Krige. A statistical approach to some basic mine valuation problems on the witwatersrand. Journal of Chemical, Metallurgical, and Mining Society of South Africa, 1951.
[90] D·G·克里格。南非威特沃特斯兰德地区基础矿山价值评估问题的统计学方法。《南非化学、冶金与矿业学会期刊》,1951 年。

[91] H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering, 86(1):97-106, 1964.
[91] H·J·库什纳。噪声环境下任意多峰曲线最大值点定位的新方法。《流体工程学报》,86(1):97-106,1964 年。


[92] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4-22, 1985.
[92] 赖泰立与 H·罗宾斯。渐近高效的自适应分配规则。《应用数学进展》,6(1):4-22,1985 年。

[93] M. Lázaro-Gredilla and A. R. Figueiras-Vidal. Marginalized neural network mixtures for large-scale regression. IEEE Transactions on Neural Networks, 21(8):1345-1351, 2010.

[94] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. The Journal of Machine Learning Research, 11:1865-1881, 2010.
[94] M·拉扎罗-格雷迪亚、J·奎农内罗-坎德拉、C·E·拉斯穆森与 A·R·菲格拉斯-维达尔合著。《稀疏谱高斯过程回归》,载《机器学习研究期刊》第 11 卷,页码 1865-1881,2010 年。

[95] Q. V. Le, A. J. Smola, and S. Canu. Heteroscedastic Gaussian process regression. In International Conference on Machine Learning, pages 489-496, 2005.
[95] 黎青龙、A·J·斯莫拉与 S·卡努合著。《异方差高斯过程回归》,收录于《国际机器学习会议论文集》,页码 489-496,2005 年。

[96] K. Leyton-Brown, E. Nudelman, and Y. Shoham. Learning the empirical hardness of optimization problems: The case of combinatorial auctions. In Principles and Practice of Constraint Programming, pages 556-572. Springer, 2002.
[96] K·莱顿-布朗、E·努德尔曼与 Y·肖汉合著。《优化问题的经验硬度学习:组合拍卖案例》,收录于《约束编程原理与实践》,页码 556-572,Springer 出版社,2002 年。

[97] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In World Wide Web, pages 661-670, 2010.
[97] 李力、楚文、约翰·兰福德和罗伯特·E·沙皮尔。一种基于上下文老虎机方法的个性化新闻文章推荐系统。载于《万维网》,第 661-670 页,2010 年。

[98] D. V. Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pages 986-1005, 1956.
[98] D·V·林德利。关于实验提供信息量的度量。《数理统计年鉴》,第 986-1005 页,1956 年。

[99] D. Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, Canada, 2008.
[99] 丹尼尔·利佐特。实用贝叶斯优化。博士论文,加拿大阿尔伯塔大学,2008 年。

[100] D. Lizotte, R. Greiner, and D. Schuurmans. An experimental methodology for response surface optimization methods. J. of Global Optimization, pages 1-38, 2011.
[100] 丹尼尔·利佐特、拉塞尔·格雷纳和戴尔·舒尔曼斯。响应面优化方法的实验方法论。《全局优化杂志》,第 1-38 页,2011 年。

[101] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Automatic gait optimization with Gaussian process regression. In Proc. of IJCAI, pages 944-949, 2007.
[101] D. Lizotte, T. Wang, M. Bowling 和 D. Schuurmans。基于高斯过程回归的自动步态优化。见《国际人工智能联合会议论文集》(IJCAI),第 944-949 页,2007 年。

[102] M. Locatelli. Bayesian algorithms for one-dimensional global optimization. J. Global Optimization, 1997.
[102] M. Locatelli。一维全局优化的贝叶斯算法。《全局优化杂志》,1997 年。

[103] M. Lázaro-Gredilla and M. K. Titsias. Variational heteroscedastic Gaussian process regression. In International Conference on Machine Learning, pages 841-848. ACM, 2011.
[103] M. Lzaro-gredilla 和 M. K. Titsias。异方差高斯过程回归的变分方法。见《国际机器学习会议》,第 841-848 页。ACM,2011 年。

[104] O. Madani, D. Lizotte, and R. Greiner. Active model selection. In Conference on Uncertainty in Artificial Intelligence (UAI), 2004.
[104] O. Madani, D. Lizotte 和 R. Greiner。主动模型选择。见《不确定性人工智能会议》(UAI),2004 年。

[105] N. Mahendran, Z. Wang, F. Hamze, and N. de Freitas. Adaptive MCMC with Bayesian optimization. Journal of Machine Learning Research - Proceedings Track, 22:751-760, 2012.
[105] N·马亨德兰、Z·王、F·哈姆泽与 N·德弗雷塔斯。《基于贝叶斯优化的自适应 MCMC 方法》。机器学习研究期刊-会议论文集,22 卷:751-760 页,2012 年。

[106] R. Marchant and F. Ramos. Bayesian optimisation for intelligent environmental monitoring. In NIPS workshop on Bayesian Optimization and Decision Making, 2012.
[106] R·马钱特与 F·拉莫斯。《面向智能环境监测的贝叶斯优化方法》。收录于 NIPS 贝叶斯优化与决策制定研讨会,2012 年。

[107] O. Maron and A. W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. Robotics Institute, page 263, 1993.
[107] O·马龙与 A·W·摩尔。《霍夫丁赛选法:加速分类与函数逼近的模型选择搜索》。机器人研究所,263 页,1993 年。

[108] R. Martinez-Cantin, N. de Freitas, E. Brochu, J. Castellanos, and A. Doucet. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots, 27(2):93-103, 2009.
[108] R·马丁内斯-坎廷、N·德弗雷塔斯、E·布罗舒、J·卡斯特利亚诺斯与 A·杜塞。《基于贝叶斯探索-开发策略的视觉导引移动机器人最优在线感知与规划》。自主机器人,27 卷 2 期:93-103 页,2009 年。

[109] J. Martinez, J. J. Little, and N. de Freitas. Bayesian optimization with an empirical hardness model for approximate nearest neighbour search. In IEEE Winter Conference on Applications of Computer Vision, 2014.
[109] J. Martinez、J. J. Little 和 N. de Freitas。基于经验硬度模型的贝叶斯优化在近似最近邻搜索中的应用。IEEE 计算机视觉应用冬季会议,2014 年。

[110] R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. A. Castellanos. Active policy learning for robot planning and exploration under uncertainty. Robotics Science and Systems, 2007.
[110] R. Martinez-Cantin、N. de Freitas、A. Doucet 和 J. A. Castellanos。不确定性下的机器人规划与探索主动策略学习。机器人科学与系统会议,2007 年。

[111] G. Matheron. The theory of regionalized variables and its applications. Cahier du Centre de Morphologie Mathematique, Ecoles des Mines, 1971.
[111] G. Matheron。区域化变量理论及其应用。数学形态学中心手册,矿业学院,1971 年。

[112] B. C. May, N. Korda, A. Lee, and D. S. Leslie. Optimistic Bayesian sampling in contextual bandit problems. Technical Report 11:01, Statistics Group, School of Mathematics, University of Bristol, 2011.
[112] B. C. May、N. Korda、A. Lee 和 D. S. Leslie。上下文赌博问题中的乐观贝叶斯采样。技术报告 11:01,布里斯托大学数学学院统计组,2011 年。

[113] V. Mnih, C. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In International Conference on Machine Learning, pages 672-679. ACM, 2008.
[113] V. Mnih, C. Szepesvári 和 J.-Y. Audibert。经验伯恩斯坦停止准则。收录于《国际机器学习会议》,第 672-679 页。ACM,2008 年。

[114] J. Močkus. Application of Bayesian approach to numerical methods of global and stochastic optimization. J. of Global Optimization, 4(4):347-365, 1994.
[114] J. Močkus。贝叶斯方法在全局与随机优化数值计算中的应用。《全局优化杂志》,4(4):347-365,1994 年。

[115] J. Močkus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. In L. Dixon and G. Szego, editors, Toward Global Optimization, volume 2. Elsevier, 1978.
[115] J. Močkus, V. Tiesis 和 A. Zilinskas。贝叶斯方法在极值搜索中的应用。载于 L. Dixon 和 G. Szego 编辑的《走向全局优化》第 2 卷。Elsevier,1978 年。

[116] R. Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Advances in Neural Information Processing Systems, pages 783-791, 2011.
[116] R. Munos。无需了解平滑性的确定性函数乐观优化。收录于《神经信息处理系统进展》,第 783-791 页,2011 年。

[117] R. Munos. From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning. Technical Report hal- 00747575, INRIA Lille, 2014.
[117] R. 穆诺斯。从多臂老虎机到蒙特卡洛树搜索:应用于优化与规划的乐观原则。技术报告 hal-00747575,法国国家信息与自动化研究所里尔分院,2014 年。

[118] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
[118] R. M. 尼尔。神经网络中的贝叶斯学习。多伦多大学博士论文,1995 年。

[119] J. Nelder and R. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society, Series A, 135(3):370-384, 1972.
[119] J. 内尔德与 R. 韦德伯恩。广义线性模型。《皇家统计学会杂志》A 辑,135(3):370-384,1972 年。

[120] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimisation. In LION, 2009.

[121] C. Paciorek and M. Schervish. Nonstationary covariance functions for gaussian process regression. In Advances in Neural Information Processing Systems, volume 16, pages 273-280, 2004.
[121] C. 帕乔雷克与 M. 舍维什。高斯过程回归中的非平稳协方差函数。刊于《神经信息处理系统进展》,第 16 卷,第 273-280 页,2004 年。

[122] V. Picheny and D. Ginsbourger. A nonstationary space-time Gaussian process model for partially converged simulations. SIAM/ASA Journal on Uncertainty Quantification, 1(1):57-78, 2013.
[122] V. 皮什尼与 D. 金斯伯格。针对部分收敛模拟的非平稳时空高斯过程模型。《SIAM/ASA 不确定性量化期刊》,第 1 卷第 1 期,第 57-78 页,2013 年。

[123] J. C. Pinheiro and D. M. Bates. Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6(3):289-296, 1996.

[124] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939-1959, 2005.

[125] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2007.

[126] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[127] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.

[128] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409-423, 1989.

[129] P. D. Sampson and P. Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87(417):108-119, 1992.

[130] T. J. Santner, B. Williams, and W. Notz. The Design and Analysis of Computer Experiments. Springer, 2003.

[131] M. J. Sasena. Flexibility and Efficiency Enhancement for Constrained Global Design Optimization with Kriging Approximations. PhD thesis, University of Michigan, 2002.

[132] A. M. Schmidt and A. O'Hagan. Bayesian inference for nonstationary spatial covariance structures via spatial deformations. Journal of the Royal Statistical Society, Series B, 65:743-758, 2003.

[133] M. Schonlau. Computer Experiments and Global Optimization. PhD thesis, University of Waterloo, Waterloo, Ontario, Canada, 1997.

[134] M. Schonlau, W. J. Welch, and D. R. Jones. Global versus local search in constrained optimization of computer models. Lecture Notes-Monograph Series, 34:11-25, 1998.

[135] S. L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639-658, 2010.

[136] M. Seeger, Y.-W. Teh, and M. I. Jordan. Semiparametric latent factor models. In AI and Statistics, 2005.

[137] M. Seeger, C. Williams, and N. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics 9, number EPFL-CONF-161318, 2003.

[138] A. Shah, A. G. Wilson, and Z. Ghahramani. Student-t processes as alternatives to Gaussian processes. In AI and Statistics, pages 877-885, 2014.

[139] B. Shahriari, Z. Wang, M. W. Hoffman, A. Bouchard-Côté, and N. de Freitas. An entropy search portfolio. In NIPS workshop on Bayesian Optimization, 2014.

[140] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257-1264, 2005.

[141] E. Snelson, C. E. Rasmussen, and Z. Ghahramani. Warped Gaussian processes. In Advances in Neural Information Processing Systems, 2003.

[142] J. Snoek. Bayesian Optimization and Semiparametric Models with Applications to Assistive Technology. PhD thesis, University of Toronto, Toronto, Canada, 2013.

[143] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959, 2012.

[144] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, Prabhat, and R. Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, 2015.

[145] J. Snoek, K. Swersky, R. S. Zemel, and R. P. Adams. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning, 2014.

[146] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, pages 1015-1022, 2010.

[147] K. Swersky, D. Duvenaud, J. Snoek, F. Hutter, and M. A. Osborne. Raiders of the lost architecture: Kernels for Bayesian optimization in conditional parameter spaces. arXiv preprint arXiv:1409.4011, 2014.

[148] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004-2012, 2013.

[149] K. Swersky, J. Snoek, and R. P. Adams. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896, 2014.

[150] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

[151] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Knowledge Discovery and Data Mining, pages 847-855, 2013.

[152] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567-574, 2009.

[153] H. P. Vanchinathan, I. Nikolic, F. De Bona, and A. Krause. Explore-exploit in top-N recommender systems via Gaussian processes. In Proceedings of the 8th ACM Conference on Recommender systems, pages 225-232. ACM, 2014.

[154] E. Vazquez and J. Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J. of Statistical Planning and Inference, 140:3088-3095, 2010.

[155] E. Vazquez and J. Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11):3088-3095, 2010.

[156] J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. J. of Global Optimization, 44(4):509-534, 2009.

[157] Z. Wang and N. de Freitas. Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters. arXiv preprint arXiv:1406.7758, 2014.

[158] Z. Wang, B. Shakibi, L. Jin, and N. de Freitas. Bayesian multi-scale optimistic optimization. In AI and Statistics, pages 1005-1014, 2014.

[159] Z. Wang, M. Zoghi, D. Matheson, F. Hutter, and N. de Freitas. Bayesian optimization in high dimensions via random embeddings. In International Joint Conference on Artificial Intelligence, pages 1778-1784, 2013.

[160] B. J. Williams, T. J. Santner, and W. I. Notz. Sequential design of computer experiments to minimize integrated response functions. Statistica Sinica, 10:1133-1152, 2000.

[161] D. Yogatama and G. Mann. Efficient transfer learning method for automatic hyperparameter tuning. In AI and Statistics, pages 1077-1085, 2014.

[162] D. Yogatama and N. A. Smith. Bayesian optimization of text representations. arXiv preprint arXiv:1503.00693, 2015.

[163] Y. Zhang, K. Sohn, R. Villegas, G. Pan, and H. Lee. Improving object detection with deep convolutional networks via Bayesian optimization and structured prediction. In IEEE Computer Vision and Pattern Recognition Conference, 2015.

[164] A. Zilinskas and J. Zilinskas. Global optimization based on a statistical model and simplical partitioning. Computers and Mathematics with Applications, 44:957-967, 2002.