
High Dimensional Bayesian Optimization Using Dropout


Cheng Li, Sunil Gupta, Santu Rana, Vu Nguyen, Svetha Venkatesh, Alistair Shilton

Centre for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Australia cheng.1@deakin.edu.au

Abstract


Scaling Bayesian optimization to high dimensions is a challenging task, as the global optimization of a high-dimensional acquisition function can be expensive and often infeasible. Existing methods depend either on a limited set of "active" variables or on an additive form of the objective function. We propose a new method for high-dimensional Bayesian optimization that uses a dropout strategy to optimize only a subset of variables at each iteration. We derive theoretical bounds for the regret and show how they inform the derivation of our algorithm. We demonstrate the efficacy of our algorithms on two benchmark functions and two real-world applications: training cascade classifiers and optimizing alloy composition.

1 Introduction


From mixture products (e.g. shampoos, alloys) to the processes that produce them (e.g. heat treatments for alloys), the need to find the optimal values of control variables to achieve a target product lies at the heart of most industrial processes. The complexity arises because we do not know the mathematical relationship between the control variables and the target: it is a black-box function. This exploration, done through experimental optimization, is a laborious process, limited by resource restrictions and cost.

The process of experimental optimization quickly hits its limit as soon as the number of control variables increases. For example, since the Bronze Age, fewer than 12 elements have been combined to make alloys, yet the periodic table contains 97 naturally occurring elements. Only a tiny fraction of the target space has been explored because of the underlying complexity. Increasing the number of elements to just 15, with 3 mixing levels per element, in order to find a high-strength alloy escalates the search space to more than 14 million choices [Xue et al., 2016]. Another illustrative problem is the wing configuration design of a high speed civil transport (HSCT) aircraft, which may include up to 26 variables to reach a targeted wing configuration [Koch et al., 1999].

Bayesian optimization (BO) [Snoek et al., 2012; Nguyen et al., 2016] is a powerful technique for optimizing expensive, black-box functions. Classical BO uses a Gaussian process (GP) [Rasmussen and Williams, 2005] to model the mean and variance of the target function. As the function is expensive to interrogate, a surrogate function (or acquisition function) is constructed from the GP to trade off exploitation (where the mean is high) and exploration (where the uncertainty is high). The next sample is determined by maximizing the acquisition function. Scaling BO methods to handle functions in high dimensions presents two main challenges. Firstly, the number of observations required by the GP grows exponentially as the input dimension increases. This implies that more experimental evaluations are required, which is often expensive and infeasible in real applications. Secondly, global optimization of high-dimensional acquisition functions is intrinsically a hard problem and can be prohibitively expensive [Kandasamy et al., 2015; Rana et al., 2017].

Solutions have been proposed to tackle high-dimensional Bayesian optimization. Wang et al. [2013] projected the high-dimensional space into a low-dimensional subspace and then optimized the acquisition function in that subspace (REMBO). The assumption that only some dimensions ($d \ll D$) are effective is often restrictive. Qian et al. [2016] studied the case where all dimensions are effective but many of them have only a small bounded effect, using sequential random embedding to reduce the embedding gap. These methods may not work if all dimensions of the high-dimensional function are similarly effective. The additive decomposition assumption is another approach to high-dimensional function analysis. Kandasamy et al. [2015] proposed the Add-GP-UCB model, in which the objective function is assumed to be the sum of a set of low-dimensional functions with disjoint dimensions, so that BO can be performed in the low-dimensional spaces. Add-GP-UCB allows the objective function to vary along the entire feature domain. Li et al. [2016] generalized Add-GP-UCB by eliminating the axis-aligned representation. However, in practice it is difficult to know the decomposition of a function in advance, especially for non-separable functions. The work most related to ours is DSA [Ulmasov et al., 2016], which reduces the number of variables at each iteration by PCA. There are two problems with DSA: (1) using PCA to select variables is effective only when there is a large number of data points; this is especially untrue for Bayesian optimization, where in the beginning we do not have many points, and eigenvector estimates from a small number of data points are often inaccurate and can be misleading; (2) DSA may get stuck in a local optimum since it only clamps the other coordinates to their current best values.


This paper proposes an alternative approach that does not rely on the assumption that the objective function depends on a limited set of "active" features, whether through projections into lower-dimensional subspaces (fixed [Djolonga et al., 2013; Wang et al., 2013] or updated [Ulmasov et al., 2016]) or through an additive decomposition of the objective function [Kandasamy et al., 2015; Li et al., 2016]. Motivated by the dropout algorithm in neural networks [Srivastava et al., 2014], we explore dimension dropout in high-dimensional Bayesian optimization. We choose $d$ out of $D$ dimensions ($d<D$) randomly at each iteration and only optimize variables from the chosen dimensions via Bayesian optimization. To "fill in" the variables from the left-out dimensions, we consider alternative strategies: random values, the values of these variables from the best function value found so far, and a mixture of these two methods. We formulate our dropout algorithms and apply them to benchmark functions and two real-world applications: training cascade classifiers [Viola and Jones, 2001] and aluminium alloy design. We compare them with baselines (random search, standard BO, REMBO [Wang et al., 2013] and Add-GP-UCB [Kandasamy et al., 2015]). The experimental results demonstrate the effectiveness of our algorithms. We derive a regret bound theoretically. As expected, the cost of the dropout algorithm is a remaining "regret gap", and we provide insights into how this gap can be reduced through the strategies we formulate to fill in the dropped-out variables. Our main contributions are:

  • Formulation of a novel variable dropout method for high-dimensional Bayesian optimization;

  • Theoretical analysis of the regret bound for our dropout algorithm, and the use of the regret bound to guide how to fill in the dropped-out variables;

  • Demonstration and comparison of our algorithms with baselines on two synthetic functions and two real applications: training cascade classifiers and designing an aluminium alloy through improved (phase) utility.

2 Formulation


Preliminaries Bayesian optimization is used to maximize or minimize a function $f$ in the input domain $\mathcal{X} \subset \mathbb{R}^D$. It includes two critical components: the prior and the acquisition function. A Gaussian process (GP) is a popular choice for the prior due to the tractability of its posterior and predictive distributions, and it is specified by its mean $m(\cdot)$ and covariance kernel function $k(\cdot,\cdot)$. Given a set of observations $\mathbf{x}_{1:t}$ and the corresponding values $f(\mathbf{x}_{1:t})$, the probability of any finite set of $f$ is Gaussian

$$f(\mathbf{x}) \sim \mathcal{N}\big(m(\mathbf{x}), K(\mathbf{x},\mathbf{x})\big) \tag{1}$$

where $K(\mathbf{x},\mathbf{x})_{i,j} = k(\mathbf{x}_i,\mathbf{x}_j)$ is the covariance matrix. Two popular choices of $k$ are the squared exponential (SE) kernel and the Matérn kernel. The predictive distribution of a new point $\mathbf{x}_{t+1}$ is given as

$$f_{t+1} \mid f_{1:t} \sim \mathcal{N}\big(\mu_{t+1}(\mathbf{x}_{t+1} \mid \mathbf{x}_{1:t}, f_{1:t}),\ \sigma^2_{t+1}(\mathbf{x}_{t+1} \mid \mathbf{x}_{1:t}, f_{1:t})\big) \tag{2}$$

where $f_{1:t} = f(\mathbf{x}_{1:t})$, $\mu_{t+1}(\cdot) = \mathbf{k}^T K^{-1} f_{1:t}$, $\sigma^2_{t+1}(\cdot) = k(\mathbf{x}_{t+1},\mathbf{x}_{t+1}) - \mathbf{k}^T K^{-1} \mathbf{k}$ and $\mathbf{k} = [k(\mathbf{x}_{t+1},\mathbf{x}_1), \ldots, k(\mathbf{x}_{t+1},\mathbf{x}_t)]$.
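To make Eq. (2) concrete, here is a minimal sketch (our illustration, not code from the paper) of the predictive mean and variance under an SE kernel; the lengthscale and noise values are illustrative assumptions.

    import numpy as np

    def se_kernel(A, B, lengthscale=0.1):
        # Squared exponential kernel: k(a, b) = exp(-||a - b||^2 / (2 * lengthscale^2))
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-0.5 * d2 / lengthscale**2)

    def gp_posterior(X_obs, f_obs, X_new, noise=1e-6, lengthscale=0.1):
        # Predictive mean and variance of Eq. (2) at the query points X_new.
        K = se_kernel(X_obs, X_obs, lengthscale) + noise * np.eye(len(X_obs))
        k = se_kernel(X_obs, X_new, lengthscale)          # k = [k(x_new, x_1), ..., k(x_new, x_t)]
        K_inv = np.linalg.inv(K)
        mu = k.T @ K_inv @ f_obs                          # k^T K^{-1} f_{1:t}
        var = np.diag(se_kernel(X_new, X_new, lengthscale)) - np.sum(k * (K_inv @ k), axis=0)
        return mu, np.maximum(var, 0.0)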

The acquisition function is a proxy function derived from the predictive mean and variance, and it determines the next sample point. We denote the acquisition function $a(\mathbf{x} \mid \{\mathbf{x}_{1:t}, f_{1:t}\})$ and the next sample point $\mathbf{x}_{t+1} = \arg\max_{\mathbf{x}\in\mathcal{X}} a(\mathbf{x} \mid \{\mathbf{x}_{1:t}, f_{1:t}\})$. Examples of acquisition functions include Expected Improvement (EI) and GP-UCB [Srinivas et al., 2010]. The EI-based acquisition function computes the expected improvement with respect to the current maximum $f(\mathbf{x}^+)$, i.e. $\mathrm{EI}(\mathbf{x}) = \mathbb{E}\big(\max\{0, f_{t+1}(\mathbf{x}) - f(\mathbf{x}^+)\} \mid \mathbf{x}_{1:t}, f_{1:t}\big)$. Its closed form has been derived in [Mockus et al., 1978; Jones et al., 1993]. GP-UCB [Srinivas et al., 2010] is defined as $\mathrm{UCB}(\mathbf{x}) = \mu(\mathbf{x}) + \sqrt{\beta}\,\sigma(\mathbf{x})$, where $\beta$ is a positive trade-off parameter. The first term contributes to exploitation and the second term to exploration. DIRECT [Jones et al., 1993] is often used to find the global maximum of the acquisition function.
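As a small illustration of the two acquisition functions above (a sketch under the assumption that mu and sigma come from a GP posterior such as the gp_posterior helper sketched earlier; the value of beta is an arbitrary example):

    import numpy as np
    from scipy.stats import norm

    def ucb(mu, sigma, beta=4.0):
        # GP-UCB: mu(x) + sqrt(beta) * sigma(x)
        return mu + np.sqrt(beta) * sigma

    def expected_improvement(mu, sigma, f_best):
        # Closed-form EI with respect to the current maximum f(x+)
        sigma = np.maximum(sigma, 1e-12)
        z = (mu - f_best) / sigma
        return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

The next sample is then the candidate point that maximizes the chosen acquisition value.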

We seek to maximize a function $f$ in the restricted domain $\mathcal{X} = [0,1]^D$ (this can always be achieved by scaling). We assume the maximal function value is achieved at a query point $\mathbf{x}^*$, i.e. $\mathbf{x}^* = \arg\max_{\mathbf{x}\in\mathcal{X}} f(\mathbf{x})$. At iteration $t$, with the corresponding query point $\mathbf{x}_t \in \mathcal{X}$, the instantaneous regret $r_t$ is defined as $r_t = f(\mathbf{x}^*) - f(\mathbf{x}_t)$ and the cumulative regret $R_T$ as $R_T = \sum_{t=1}^{T} r_t$. A desirable property of an algorithm is to have no regret: $\lim_{T\to\infty} \frac{1}{T} R_T = 0$.

Dropout Algorithms We refer to $\mathcal{I}^d$ as the indices of $d$ out of $D$ dimensions and $\mathcal{I}^{D-d}$ as the indices of the left-out $D-d$ dimensions, so that $\mathcal{I}^d \cup \mathcal{I}^{D-d} = \{1,\ldots,D\}$ and $\mathcal{I}^{D-d} \cap \mathcal{I}^d = \emptyset$. The corresponding variables from the $\mathcal{I}^d$ and $\mathcal{I}^{D-d}$ dimensions are denoted $\mathbf{x}^{\mathcal{I}^d}$ and $\mathbf{x}^{\mathcal{I}^{D-d}}$, respectively. For convenience we later write $\mathbf{x}^d = \mathbf{x}^{\mathcal{I}^d}$ and $\mathbf{x}^{D-d} = \mathbf{x}^{\mathcal{I}^{D-d}}$, and hence $\mathbf{x} = [\mathbf{x}^d, \mathbf{x}^{D-d}]$.

Motivated by the dropout algorithm in neural networks [Srivastava et al., 2014], we explore dimension dropout for high-dimensional Bayesian optimization. We randomly choose $d$ out of $D$ dimensions ($d<D$) at each iteration and only optimize the $d$-dimensional variables through Bayesian optimization. Specifically, we assume that in the $d$-dimensional space the observations are $y = f([\mathbf{x}^d, \mathbf{x}^{D-d}]) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0,\sigma^2)$ for all $\mathbf{x}^{D-d}$. A Gaussian process is then used to model the function values $f([\mathbf{x}^d, \mathbf{x}^{D-d}])$, $\forall \mathbf{x}^{D-d}$. The predictive mean $\mu(\mathbf{x}^d)$ and variance $\sigma(\mathbf{x}^d)$ can be computed. As with GP-UCB [Srinivas et al., 2010], we construct the acquisition function in the $d$-dimensional space

$$a(\mathbf{x}^d) = \mu_{t-1}(\mathbf{x}^d) + \sqrt{\beta_t^d}\,\sigma_{t-1}(\mathbf{x}^d) \tag{3}$$

where $\beta_t^d$ is a factor controlling the trade-off between exploitation $\mu_{t-1}(\mathbf{x}^d)$ and exploration $\sigma_{t-1}(\mathbf{x}^d)$. At each iteration, we determine a new $\mathbf{x}^d$ by maximizing Eq. (3).

Given $\mathbf{x}_t^d$, we still need to fill in the variables $\mathbf{x}_t^{D-d}$ from the left-out $D-d$ dimensions in order to evaluate the function. We devise three "fill-in" strategies for $\mathbf{x}_t^{D-d}$:

  • Dropout-Random: use a random value in the domain:

$$\mathbf{x}_t^{D-d} \sim u(\mathbf{x}^{D-d}) \tag{4}$$


where $u(\cdot)$ is a uniform distribution.

  • Dropout-Copy: copy the value of the variables from the best function value so far:

$$\mathbf{x}_t^+ = \arg\max_{t' \le t} f(\mathbf{x}_{t'}), \qquad \mathbf{x}_t^{D-d} = (\mathbf{x}_t^+)^{D-d} \tag{5}$$

where $\mathbf{x}_t^+$ denotes the variables of the best function value found up to iteration $t$.

  • Dropout-Mix: use a mixture of the above two methods. We use a random value with probability $p$, or copy the value from the variables of the best function value found so far with probability $1-p$.

Dropout-Random does not work effectively when a large number of dimensions are influential. However, it can still improve the optimization since we optimize $d$ variables at each iteration. Copying the values of the variables from the best function value found so far is an efficient strategy for consistently improving the previous best regret. However, this method may get stuck in a local optimum. This problem is solved by the third strategy, which helps the copy method escape the local optimum with probability $p$.
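The three fill-in strategies can be written compactly. The sketch below is our illustration (function and variable names are assumptions, not the authors' code); it operates on the index sets defined at the start of this section.

    import numpy as np

    def fill_in(left_out_idx, x_best, bounds, strategy="mix", p=0.1, rng=np.random):
        # Return values for the left-out dimensions x_t^{D-d}.
        #   left_out_idx : indices I^{D-d} of the dropped dimensions
        #   x_best       : x_t^+, the best point found so far (full D-dimensional)
        #   bounds       : (D, 2) array of [low, high] per dimension
        #   strategy     : "random", "copy", or "mix"
        #   p            : probability of a random fill-in under "mix"
        lo, hi = bounds[left_out_idx, 0], bounds[left_out_idx, 1]
        random_fill = rng.uniform(lo, hi)          # Eq. (4): Dropout-Random
        copy_fill = x_best[left_out_idx]           # Eq. (5): Dropout-Copy
        if strategy == "random":
            return random_fill
        if strategy == "copy":
            return copy_fill
        # Dropout-Mix: random with probability p, copy with probability 1 - p
        return random_fill if rng.rand() < p else copy_fill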

Our approach performs Bayesian optimization in the $d$-dimensional space, and thus DIRECT requires $O(\zeta^{-d})$ calls to the acquisition function to achieve $\zeta$ accuracy [Jones et al., 1998]. This is significantly better than full-dimensional BO, where DIRECT requires $O(\zeta^{-D})$ acquisition function calls. Both our approach and full-dimensional BO need $O(n^3)$ time to compute the inverse of the covariance matrix, where $n$ is the number of observations. We summarize our algorithm below.


Algorithm 1 Dropout Algorithm for High-dimensional Bayesian Optimization



Input: $D_1 = \{\mathbf{x}_0, y_0\}$
for $t = 1, 2, \ldots$ do
    randomly select $d$ dimensions
    $\mathbf{x}_t^d \leftarrow \arg\max_{\mathbf{x}^d \in \mathcal{X}^d} a(\mathbf{x}^d \mid D_t)$  (Eq. (3))
    $\mathbf{x}_t^{D-d} \leftarrow$ one of the three "fill-in" strategies (Sec. 2)
    $\mathbf{x}_t \leftarrow \mathbf{x}_t^d \cup \mathbf{x}_t^{D-d}$
    $y_t \leftarrow$ query of $f$ at $\mathbf{x}_t$
    $D_{t+1} = D_t \cup \{\mathbf{x}_t, y_t\}$
end for
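The loop below is a compact sketch of Algorithm 1 (our illustration, not the authors' implementation). It reuses the gp_posterior and fill_in helpers sketched earlier, replaces DIRECT with a crude random-candidate maximizer of the acquisition, and fixes beta rather than following the schedule of Lemma 2.

    import numpy as np

    def dropout_bo(f, D, d, n_iter=100, beta=4.0, p=0.1, n_cand=2000, seed=0):
        # Dropout Bayesian optimization of f over [0, 1]^D (sketch of Algorithm 1).
        rng = np.random.default_rng(seed)
        X = rng.uniform(size=(d + 1, D))                 # initial observations D_1
        y = np.array([f(x) for x in X])
        bounds = np.tile([0.0, 1.0], (D, 1))
        for t in range(n_iter):
            idx = rng.choice(D, size=d, replace=False)   # randomly select d dimensions
            left_out = np.setdiff1d(np.arange(D), idx)
            cand = rng.uniform(size=(n_cand, d))         # crude stand-in for DIRECT over X^d
            mu, var = gp_posterior(X[:, idx], y, cand)
            acq = mu + np.sqrt(beta) * np.sqrt(var)      # Eq. (3)
            x_d = cand[np.argmax(acq)]
            x_best = X[np.argmax(y)]                     # x_t^+
            x_new = np.empty(D)
            x_new[idx] = x_d
            x_new[left_out] = fill_in(left_out, x_best, bounds, "mix", p, np.random)
            X = np.vstack([X, x_new])                    # D_{t+1} = D_t U {x_t, y_t}
            y = np.append(y, f(x_new))
        return X[np.argmax(y)], y.max()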




3 Theoretical Analysis


Our main contribution is to derive a regret bound for our algorithm and to discuss heuristic strategies. We denote by $f(\mathbf{x}^d)$ the worst function value given $\mathbf{x}^d$, i.e. $f(\mathbf{x}^d) = f([\mathbf{x}^d, \mathbf{x}_w^{D-d}])$, where $\mathbf{x}_w^{D-d} = \arg\min_{\mathbf{x}^{D-d}} f([\mathbf{x}^d, \mathbf{x}^{D-d}])$.

Assumption 1. Let $f$ be a sample from a GP with kernel $k(\mathbf{x},\mathbf{x}')$, which is $L$-Lipschitz for all $\mathbf{x}$. Then the partial derivatives of $f$ satisfy the following high-probability bound for some constants $a, b > 0$,

$$P\Big(\forall j,\ \Big|\frac{\partial f}{\partial x_j}\Big| < L\Big) \ge 1 - a e^{-(L/b)^2}, \quad \forall t \ge 1 \tag{6}$$

The assumption implies that the following holds with probability greater than $1-\delta/2$ for all $\mathbf{x}$,

$$|f(\mathbf{x}) - f(\mathbf{x}^d)| = \big|f([\mathbf{x}^d, \mathbf{x}^{D-d}]) - f([\mathbf{x}^d, \mathbf{x}_w^{D-d}])\big| \le L\,\|\mathbf{x}^{D-d} - \mathbf{x}_w^{D-d}\|_1 \le L\,(D-d) \tag{7}$$

where $L = b\sqrt{\log\big(2(D-d)a/\delta\big)}$.

Lemma 2. Pick $\delta \in (0,1)$ and set $\beta_t^d = 2\log(4\pi_t/\delta) + 2d\log\big(dt^2 b r \sqrt{\log(4da/\delta)}\big)$, where $\sum_{t\ge1}\pi_t^{-1} = 1$ and $\pi_t > 0$. Then, in the $d$-dimensional space, with probability $\ge 1-\delta/2$,

$$|f(\mathbf{x}^d) - \mu_{t-1}(\mathbf{x}^d)| \le \sqrt{\beta_t^d}\,\sigma_{t-1}(\mathbf{x}^d), \quad \forall t \ge 1 \tag{8}$$

Lemma 2 follows from Bayesian optimization in the $d$-dimensional space. The proof is identical to that of Theorem 2 in [Srinivas et al., 2010].

Lemma 3. Let $\beta_t^d$ be defined as in Lemma 2 and set $\tilde{\sigma}_{t-1}(\mathbf{x}^d) = \sigma_{t-1}(\mathbf{x}^d) + \frac{L(D-d)}{\sqrt{\beta_t^d}}$. Then

$$|f(\mathbf{x}) - \mu_{t-1}(\mathbf{x}^d)| \le \sqrt{\beta_t^d}\,\tilde{\sigma}_{t-1}(\mathbf{x}^d) \tag{9}$$

holds with probability $1-\delta$.

Proof. The following holds for all $t\ge1$ and all $\mathbf{x}\in\mathcal{X}$ with probability $>1-\delta$:

$$\begin{aligned}
|f(\mathbf{x}) - \mu_{t-1}(\mathbf{x}^d)| &\le |f(\mathbf{x}) - f(\mathbf{x}^d)| + |f(\mathbf{x}^d) - \mu_{t-1}(\mathbf{x}^d)| \\
&\le L\,\|\mathbf{x} - [\mathbf{x}^d, \mathbf{x}_w^{D-d}]\|_1 + \sqrt{\beta_t^d}\,\sigma_{t-1}(\mathbf{x}^d) \\
&\le L\,(D-d) + \sqrt{\beta_t^d}\,\sigma_{t-1}(\mathbf{x}^d) \\
&= \sqrt{\beta_t^d}\,\tilde{\sigma}_{t-1}(\mathbf{x}^d)
\end{aligned} \tag{10}$$

where $\tilde{\sigma}_{t-1}(\mathbf{x}^d) = \sigma_{t-1}(\mathbf{x}^d) + \frac{L(D-d)}{\sqrt{\beta_t^d}}$. The step bounding the $\ell_1$ norm in Eq. (10) exploits Eq. (7). The variance gap $\frac{L(D-d)}{\sqrt{\beta_t^d}}$ reduces with iteration $t$ since $\beta_t^d$ is increasing.

Lemma 4. Pick $\delta \in (0,1)$ and let $\beta_t^d$ be defined as in Lemma 2. Then the regret $r_t$ is bounded by $2\sqrt{\beta_t^d}\,\tilde{\sigma}_{t-1}(\mathbf{x}_t^d) + \frac{1}{t^2}$.

Proof. By the definition of $\mathbf{x}_t^d$: $\mu_{t-1}(\mathbf{x}_t^d) + \sqrt{\beta_t^d}\,\tilde{\sigma}_{t-1}(\mathbf{x}_t^d) \ge \mu_{t-1}([\mathbf{x}^*]_t^d) + \sqrt{\beta_t^d}\,\tilde{\sigma}_{t-1}([\mathbf{x}^*]_t^d)$. According to Lemma 5.7 in [Srinivas et al., 2010], $\mu_{t-1}([\mathbf{x}^*]_t^d) + \sqrt{\beta_t^d}\,\tilde{\sigma}_{t-1}([\mathbf{x}^*]_t^d) + \frac{1}{t^2} \ge f(\mathbf{x}^*)$, where $[\mathbf{x}^*]_t$ denotes the closest point to $\mathbf{x}^*$ in the discretization $D_t \subset \mathcal{X}$. Then

$$\begin{aligned}
r_t &= f(\mathbf{x}^*) - f(\mathbf{x}_t) \\
&\le \sqrt{\beta_t^d}\,\tilde{\sigma}_{t-1}(\mathbf{x}_t^d) + \mu_{t-1}(\mathbf{x}_t^d) + \tfrac{1}{t^2} - f([\mathbf{x}_t^d, \mathbf{x}_t^{D-d}]) \\
&\le 2\sqrt{\beta_t^d}\,\tilde{\sigma}_{t-1}(\mathbf{x}_t^d) + \tfrac{1}{t^2}
\end{aligned}$$

Lemma 5. Pick $\delta \in (0,1)$ and let $\beta_t^d$ be defined as in Lemma 2. Then, with probability $1-\delta$ and with $C_1 = 8/\log(1+\sigma^{-2})$, the cumulative regret satisfies

$$R_T \le \sqrt{C_1 T \beta_T^d \gamma_T} + 2TL(D-d) + 2 \tag{11}$$




https://cdn.noedgeai.com/0196ab4f-9eb7-7c61-b614-466a7725f8dc_3.jpg?x=159&y=181&w=733&h=345&r=0

Figure 1: The effect of the number of dimensions $d$ in Dropout-Copy. The y-axis for the Gaussian mixture function shows the true function value (higher is better). The y-axis for Schwefel's 1.2 function shows the logarithm of the function value (lower is better).


Proof. We prove Lemma 5 as follows:

$$\begin{aligned}
R_T = \sum_{t\le T} r_t &\le \sum_{t\le T}\Big(2\sqrt{\beta_t^d}\,\tilde{\sigma}_{t-1}(\mathbf{x}_t^d) + \tfrac{1}{t^2}\Big) \\
&= \sum_{t\le T} 2\sqrt{\beta_t^d}\,\sigma_{t-1}(\mathbf{x}_t^d) + 2TL(D-d) + \tfrac{\pi^2}{6} \\
&\le \sqrt{C_1 T \beta_T^d \gamma_T} + 2TL(D-d) + 2
\end{aligned} \tag{12}$$

where $\sum_{t\le T} 2\sqrt{\beta_t^d}\,\sigma_{t-1}(\mathbf{x}_t^d)$ in the second line of Eq. (12) can be bounded via Theorem 1 in [Srinivas et al., 2010]. $\gamma_T$ can be bounded for different kernels; for the SE kernel, $\gamma_T = O((\log T)^{d+1})$.

Discussion


Lemma 5 indicates that $\lim_{T\to\infty} \frac{1}{T} R_T \le 2L(D-d)$; that is, a regret gap remains in the limit. This gap is introduced through the bound on the worst case of $\|\mathbf{x}^* - [\mathbf{x}^d, \mathbf{x}^{D-d}]\|_1$, $\forall \mathbf{x}^{D-d}$ (Eq. (7)). In reality, a judicious choice of the $D-d$ dimensions will improve this bound by reducing this difference. We have thus formulated three options for filling in the variables of the dropped-out dimensions: random, best value, and a mixture of the two. Intuitively, if the current optimum is far away from the global optimum, random values are an appropriate guess for the "fill-in" as there is no other information. If the current optimum is close to the global optimum, copying the values of the dropped-out variables from the best function value found so far is likely to improve the best regret obtained in previous iterations. This behaves like block coordinate descent [Nesterov, 2012], which optimizes a chosen block of coordinates while keeping the others fixed. The difference is that block coordinate descent assumes that the previous iteration has reached the best value, while Dropout-Copy starts from the best of the previous iterations. When the current optimum is close to a local optimum, Dropout-Copy may get stuck. To escape this local optimum, we use Dropout-Mix, which introduces a random fill-in into Dropout-Copy with a pre-specified probability.
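For the SE kernel, dividing Eq. (11) by $T$ makes this limiting gap explicit (a short expansion of the bound, using $\gamma_T = O((\log T)^{d+1})$ from Lemma 5):
$$\frac{R_T}{T} \le \sqrt{\frac{C_1 \beta_T^d \gamma_T}{T}} + 2L(D-d) + \frac{2}{T} \;\longrightarrow\; 2L(D-d) \quad \text{as } T \to \infty,$$
since the first and last terms vanish as $T$ grows.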



https://cdn.noedgeai.com/0196ab4f-9eb7-7c61-b614-466a7725f8dc_3.jpg?x=924&y=179&w=740&h=708&r=0

Figure 2: The effect of the probability $p$ in Dropout-Mix.


4 Experiments


We evaluate our methods on benchmark functions and two real applications. We compare our methods with four baselines: random search, which is a simple random sampling method; standard BO; REMBO [Wang et al., 2013], which projects high dimensions to lower dimensions; and Add-GP-UCB [Kandasamy et al., 2015], which optimizes disjoint groups of variables and combines them. For standard BO, we allocate a budget of 30 seconds (larger than the time required by our algorithms) to optimize the acquisition function at each iteration. The number of initial observations is set to $d+1$. We use the SE kernel with a lengthscale of 0.1 and DIRECT [Jones et al., 1993] to optimize acquisition functions. We run each algorithm 20 times with different initializations and report the average value with standard error.

4.1 Optimization of Benchmark Functions


To demonstrate that Dropout-Mix can deal with local convergence, we choose a bimodal Gaussian mixture function as our first test function. The Gaussian mixture function is defined as $y = \mathcal{N}_P(\mathbf{x};\boldsymbol{\mu}_1,\Sigma_1) + \frac{1}{2}\mathcal{N}_P(\mathbf{x};\boldsymbol{\mu}_2,\Sigma_2)$, where $\mathcal{N}_P$ is the Gaussian probability density function, $\boldsymbol{\mu}_1 = [2,2,\ldots,2]$, $\boldsymbol{\mu}_2 = [3,3,\ldots,3]$ and $\Sigma_1 = \Sigma_2 = \mathrm{diag}([1,1,\ldots,1])$. The domain of definition is $\mathcal{X} = [1,4]^D$ and the global maximum is located at $\mathbf{x}^* = [2,2,\ldots,2]$. This function has a local maximum and no interacting variables. To demonstrate that our algorithms work effectively for functions with interacting variables, we use the unimodal Schwefel's 1.2 function $f(\mathbf{x}) = \sum_{j=1}^{D}\big(\sum_{i=1}^{j} x_i\big)^2$ as our second test function. It is defined on the domain $\mathcal{X} = [-1,1]^D$ and has its global minimum at $\mathbf{x}^* = [0,0,\ldots,0]$. We compare the algorithms in terms of the best function value reached up to any iteration.
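For reference, the two test functions can be written as follows (a sketch; the means and covariances follow the setting above, and scipy's multivariate normal density stands in for $\mathcal{N}_P$):

    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_mixture(x):
        # Bimodal test function: N(x; mu1, I) + 0.5 * N(x; mu2, I), maximized near mu1.
        D = len(x)
        mu1, mu2 = np.full(D, 2.0), np.full(D, 3.0)
        return (multivariate_normal.pdf(x, mean=mu1, cov=np.eye(D))
                + 0.5 * multivariate_normal.pdf(x, mean=mu2, cov=np.eye(D)))

    def schwefel_1_2(x):
        # Schwefel's 1.2: sum_j (sum_{i<=j} x_i)^2, minimized at the origin.
        return np.sum(np.cumsum(np.asarray(x)) ** 2)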




https://cdn.noedgeai.com/0196ab4f-9eb7-7c61-b614-466a7725f8dc_4.jpg?x=151&y=152&w=1480&h=313&r=0

Figure 3: Optimization of the Gaussian mixture function. Higher is better. Four different dimensions are tested from left to right: (a) $D=5$, (b) $D=10$, (c) $D=20$, (d) $D=30$. The BO for $D=5$ and $D=10$ is terminated once it converges. The graphs are best seen in color.


https://cdn.noedgeai.com/0196ab4f-9eb7-7c61-b614-466a7725f8dc_4.jpg?x=155&y=603&w=1476&h=314&r=0

Figure 4: Optimization of Schwefel's 1.2 function. Lower is better. Four different dimensions are tested from left to right: (a) $D=5$, (b) $D=10$, (c) $D=20$, (d) $D=30$. The graphs are best seen in color.


On the number of chosen dimensions $d$ We investigate how the number of chosen dimensions $d$ affects the dropout algorithms. We experiment with $d = 1, 2, 5, 10$ for $D = 20$ in Dropout-Copy. Within one experiment, we keep the same number of dimensions for all iterations. The results for the two test functions are shown in Figure 1. Since the optimized Gaussian mixture function has non-interacting variables, the variables can be optimized independently, so Dropout-Copy with $d=1$ can still improve the maximal value reached. Dropout-Copy with $d=5$ performs best. However, for Schwefel's 1.2 function, optimizing variables independently is not efficient. Figure 1 (b) shows a faster convergence rate for larger $d$. This can be explained by noting that a large $d$ has a relatively higher probability of optimizing interacting variables together within the stipulated iterations. These graphs show that it is reasonable to compromise: if $d$ is large then global optimization in the $d$-dimensional space may be costly, and if $d$ is small then the convergence rate is slow for functions with interacting variables.

On the probability $p$ To study the influence of the probability $p$ in Dropout-Mix, we set $p = 0, 0.1, 0.5, 0.8, 1$. Dropout-Copy and Dropout-Random are special cases of Dropout-Mix ($p=0$ and $p=1$). In this experiment, for the low-dimensional Gaussian mixture functions ($D=2$ and $D=5$), we set $\boldsymbol{\mu}_1 = [2,2,\ldots,2]$, $\boldsymbol{\mu}_2 = [5,5,\ldots,5]$ and $\mathcal{X} = [0,7]^D$ so that the two modes are far apart. We use $d=1$ for $D=2$ and $d=2$ for $D=5$. For both the high-dimensional Gaussian mixture function and Schwefel's function, we keep the same settings as before and use $d=5$ for $D=20$. The results are shown in Figure 2 (a), (b) and (c).

We see that Dropout-Mix and Dropout-Random work in low dimensions. Dropout-Copy does not work well because it may get stuck in a local optimum for low-dimensional functions. This happens with lower probability in high dimensions, and thus the average performance of Dropout-Copy is slightly better than Dropout-Mix in Figure 2 (c). Dropout-Copy always performs best for the unimodal function, as seen in Figure 2 (d). Therefore, we recommend using Dropout-Copy and Dropout-Mix with a small $p$ (e.g. 0.1, 0.2) in high dimensions.

Comparison with existing approaches Based on the experiments above, we test our algorithms with $d=2$ for $D=5$ and $d=5$ for $D=10, 20, 30$. Dropout-Mix is applied with $p=0.1$. For the Gaussian mixture function, we set $\boldsymbol{\mu}_1 = [2,2,\ldots,2]$ and $\boldsymbol{\mu}_2 = [3,3,\ldots,3]$ for all $D$. Since we do not know the structure of the functions, we use REMBO with a $d$-dimensional projection and Add-GP-UCB with $d$ dimensions in each group, where the value of $d$ is the same as in the dropout algorithms. We run 500 function evaluations for these two functions. The results are shown in Figures 3 and 4, respectively. In low dimensions ($D=5$ and $D=10$), standard BO performs best. In high dimensions ($D=20$ and $D=30$), our Dropout-Mix and Dropout-Copy significantly outperform the other baselines. Not surprisingly, REMBO and Add-GP-UCB do not perform well since the intrinsic structure of the functions does not fit their prior assumptions.

4.2 Training Cascade Classifier


We evaluate the dropout algorithm by training a cascade classifier [Viola and Jones, 2001] on three real datasets from the UCI repository: IJCNN1, German and Ionosphere. A cascade classifier consists of a series of weak classifiers, each of which is a simple decision stump. The instance weights are updated based on the error rate of the previous weak classifier. Therefore, the threshold of each decision stump is very important. Generally, computing the thresholds independently is not an optimal strategy; we seek to find optimal thresholds by maximizing the training accuracy. Features in all datasets are scaled to $[0,1]$. The number of stages is set equal to the number of features in the dataset. We use $d=5$ and $p=0.1$ for all datasets. We ensure that the dimension of each group in Add-GP-UCB is lower than 10 so that DIRECT can work. The experimental results are shown in Figure 5. Dropout-Copy and Dropout-Mix perform similarly and significantly better than the other methods.
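As an illustration of the black-box objective in this experiment, the sketch below (our own, with assumed names; not the authors' code) trains an AdaBoost-style cascade of decision stumps, one stump per feature, using a given vector of per-stage thresholds, and returns the training accuracy that Bayesian optimization maximizes.

    import numpy as np

    def cascade_accuracy(thresholds, X, y):
        # thresholds: one threshold per stage/feature (the BO variables, in [0, 1])
        # X: (n, D) features scaled to [0, 1]; y: labels in {-1, +1}
        n = X.shape[0]
        w = np.full(n, 1.0 / n)                            # instance weights
        alphas, signs = [], []
        for j, thr in enumerate(thresholds):
            pred = np.where(X[:, j] >= thr, 1, -1)         # decision stump on feature j
            err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
            sign = 1.0
            if err > 0.5:                                  # flip the stump if worse than chance
                pred, err, sign = -pred, 1 - err, -1.0
            alpha = 0.5 * np.log((1 - err) / err)          # stage weight
            w = w * np.exp(-alpha * y * pred)              # re-weight instances for the next stage
            w = w / w.sum()
            alphas.append(alpha)
            signs.append(sign)
        score = sum(a * s * np.where(X[:, j] >= thr, 1, -1)
                    for j, (thr, a, s) in enumerate(zip(thresholds, alphas, signs)))
        return np.mean(np.sign(score) == y)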




https://cdn.noedgeai.com/0196ab4f-9eb7-7c61-b614-466a7725f8dc_5.jpg?x=159&y=182&w=1497&h=448&r=0

Figure 5: Maximum classification accuracy on the training data as a function of Bayesian optimization iteration. The number of stages in the cascade classifier is equal to the number of features in the three datasets: (a) IJCNN1 $D=22$, (b) German $D=24$, (c) Ionosphere $D=33$.


4.3 Alloy Design


AA-2050 is a low-density, highly corrosion-resistant alloy used in aerospace applications. The current alloy was designed decades ago and is considered by our metallurgist collaborator a prime candidate for further improvement. To measure the utility of an alloy composition we use a software-based thermodynamic simulator (THERMOCALC). The alloy consists of 9 elements (Al, Cu, Mg, Zn, Cr, Mn, Ti, Zr, and Sc). The utility is defined by a weighted combination of four phases that are produced. The phases relate to the internal structures and influence the alloy properties. In all, we have a 13-dimensional optimization problem (9 elements and 4 operational parameters). We seek the composition that maximizes this utility. The result is given in Figure 6. We started from a utility of about 4.5. After 500 optimization iterations, our Dropout-Mix achieves a utility of 5.1, while standard BO stays around 4.8 after 100 iterations. The results clearly show the effectiveness of our methods for the real-world application of alloy design.

5 Conclusion and Future Work


We propose a new method for high-dimensional Bayesian optimization that uses a dropout strategy. We develop three strategies to fill in the variables from the dropped-out dimensions: random values, the values from the best sample found so far, and a mixture of these two methods. The regret bound for our methods has been derived and discussed. Our experimental results on synthetic and real applications show that our methods work effectively for high-dimensional optimization. It might be promising to apply only local optimization to the acquisition function built over the dropped-out dimensions. We intend to consider more efficient ways to choose dimensions in the future.



https://cdn.noedgeai.com/0196ab4f-9eb7-7c61-b614-466a7725f8dc_5.jpg?x=963&y=762&w=603&h=521&r=0

Figure 6: The utility of the alloy vs the iterations of Bayesian optimization. The number of optimized parameters is 13. We use $d=5$ in this experiment.


Acknowledgments


This work is partially supported by the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning. We thank anonymous reviewers for their valuable comments.


References


[Djolonga et al., 2013] Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional gaussian process bandits. In Advances in Neural Information Processing Systems, pages 1025-1033, 2013.

[Jones et al., 1993] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157-181, 1993.

[Jones et al., 1998] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455-492, 1998.

[Kandasamy et al., 2015] Kirthevasan Kandasamy, Jeff G. Schneider, and Barnabás Póczos. High Dimensional Bayesian Optimisation and Bandits via Additive Models. In ICML, volume 37, pages 295-304, 2015.

[Koch et al., 1999] P. N. Koch, T. W. Simpson, J. K. Allen, and F. Mistree. Statistical approximations for multidisciplinary design optimization: The problem of size. Journal of Aircraft, 36(1):275-286, 1999.

[Li et al., 2016] C. Li, K. Kandasamy, B. Poczos, and J. Schneider. High dimensional bayesian optimization via restricted projection pursuit models. In AISTATS, pages 1-9, 2016.

[Mockus et al., 1978] J. Mockus, V. Tiesis, and A. Zilinskas. The application of bayesian methods for seeking the extremum. Towards Global Optimisation, (2):117-129, 1978.

[Nesterov, 2012] Yurii Nesterov. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2), 2012.

[Nguyen et al., 2016] V. Nguyen, S. Rana, S. K. Gupta, C. Li, and S. Venkatesh. Budgeted batch bayesian optimization. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1107-1112, Dec 2016.

[Qian et al., 2016] Hong Qian, Yi-Qi Hu, and Yang Yu. Derivative-free optimization of high-dimensional non-convex functions by sequential random embeddings. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, 2016.

[Rana et al., 2017] Santu Rana, Cheng Li, Sunil Gupta, Vu Nguyen, and Svetha Venkatesh. High dimensional bayesian optimization with elastic gaussian process. In International Conference on Machine learning, Sydney, 2017.

[Rasmussen and Williams, 2005] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.

[Snoek et al., 2012] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, pages 2951-2959, 2012.

[Srinivas et al., 2010] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.

[Srivastava et al., 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929-1958, January 2014.

[Ulmasov et al., 2016] Doniyor Ulmasov, Caroline Baroukh, Benoit Chachuat, Marc P. Deisenroth, and Ruth Misener. Bayesian optimization with dimension scheduling: Application to biological systems. In Proceedings of the European Symposium on Computer Aided Process Engineering, 2016.

[Viola and Jones, 2001] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, pages 511-518, 2001.

[Wang et al., 2013] Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, and Nando De Freitas. Bayesian optimization in high dimensions via random embeddings. In IJCAI, pages 1778-1784, 2013.

[Xue et al., 2016] Dezhen Xue, Prasanna V. Balachandran, John Hogden, James Theiler, Deqing Xue, and Turab Lookman. Accelerated search for materials with targeted properties by adaptive design. Nature communications, 7, 2016.