
Efficient Robust Bayesian Optimization for Arbitrary Uncertain Inputs

Lin Yang

Huawei Noah's Ark Lab

China

yanglin33@huawei.com

Junlong Lyu

Huawei Noah's Ark Lab

Hong Kong SAR, China

lyujunlong@huawei.com

Wenlong Lyu

Huawei Noah's Ark Lab

China

lvwenlong2@huawei.com

Zhitang Chen

Huawei Noah's Ark Lab

Hong Kong SAR, China

chenzhitang2@huawei.com

Abstract

Bayesian Optimization (BO) is a sample-efficient optimization algorithm widely employed across various applications. In some challenging BO tasks, input uncertainty arises due to the inevitable randomness in the optimization process, such as machining errors, execution noise, or contextual variability. This uncertainty causes the input to deviate from its intended value before evaluation, resulting in significant performance fluctuations in the final result. In this paper, we introduce a novel robust Bayesian Optimization algorithm, AIRBO, which can effectively identify a robust optimum that performs consistently well under arbitrary input uncertainty. Our method directly models uncertain inputs of arbitrary distributions by empowering the Gaussian Process with the Maximum Mean Discrepancy (MMD) and further accelerates the posterior inference via Nyström approximation. A rigorous theoretical regret bound is established under the MMD estimation error, and extensive experiments on synthetic functions and real problems demonstrate that our approach can handle various input uncertainties and achieve state-of-the-art performance.

1 Introduction

Bayesian Optimization (BO) is a powerful sequential decision-making algorithm for high-cost black-box optimization. Owing to its remarkable sample efficiency and capacity to balance exploration and exploitation, BO has been successfully applied in diverse domains, including neural architecture search [32], hyper-parameter tuning [4, 29, 12], and robotic control [18, 5], among others. Nevertheless, in some real-world problems, the stochastic nature of the optimization process, such as machining error during manufacturing, execution noise of control, or variability in contextual factors, inevitably introduces input randomness, causing the design parameter $x$ to deviate to $\hat{x}$ before evaluation. This deviation produces a fluctuation of the function value $y$ and eventually leads to performance instability of the outcome. In general, the input randomness is determined by the application scenario and can be of arbitrary distribution, even a quite complex one. Moreover, in some cases, we cannot observe the exact deviated input $\hat{x}$ but only a rough estimation of the input uncertainty. This is quite common in robotics and process control. For example, consider the robot control task shown in Figure 1a: a drone is sent to a target location $x$ to perform a measurement task. However, due to the execution noise caused by the fuzzy control or a sudden wind, the drone ends up at location $\hat{x} \sim P(x)$ and gets a noisy measurement $y = f(\hat{x}) + \zeta$.

Figure 1: Robust Bayesian optimization problem.

Instead of observing the exact value of $\hat{x}$, we only have a coarse estimation of the input uncertainty $P(x)$. The goal is to identify a robust location that gives the maximal expected measurement under the process randomness.

To find a robust optimum, it is crucial to account for input uncertainty during the optimization process. Existing works [24, 3, 7, 10] along this direction assume that the exact input value, i.e., $\hat{x}$ in Figure 1b, is observable and construct a surrogate model using these exact inputs. Different techniques are then employed to identify the robust optimum: Nogueira et al. utilize the unscented transform to propagate input uncertainty to the acquisition function [24], while Beland and Nair integrate over the exact GP model to obtain the posterior with input uncertainty [3]. Meanwhile, [7] designs a robust confidence-bounded acquisition and applies min-max optimization to identify the robust optimum. Similarly, [10] constructs an adversarial surrogate with samples from the exact surrogate. These methods work quite well but are constrained by their dependence on observable input values, which may not always be practical.

An alternative approach involves directly modeling the uncertain inputs. A pioneering work by Moreno et al. [20] assumes a Gaussian input distribution and employs a symmetric Kullback-Leibler divergence (SKL) to measure the distance between input variables. Dallaire et al. [13] implement a Gaussian process model with an expected kernel and derive a closed-form solution by restricting the kernel to linear, quadratic, or squared exponential kernels and assuming Gaussian inputs. Nonetheless, the applicability of these methods is limited by their restrictive Gaussian input assumption and kernel choices. To surmount these limitations, Oliveira et al. [25] propose a robust Gaussian process model that incorporates the input distribution by computing an integral kernel. Although this kernel can be applied to various distributions and offers a rigorous regret bound, its posterior inference requires extensive sampling and can be time-consuming.

In this work, we propose an Arbitrary Input uncertainty Robust Bayesian Optimization algorithm (AIRBO). This algorithm can directly model uncertain inputs of arbitrary distribution and propagate the input uncertainty into the surrogate posterior, which can then be used to guide the search for a robust optimum. To achieve this, we employ a Gaussian Process (GP) as the surrogate and empower its kernel design with the Maximum Mean Discrepancy (MMD), which allows us to comprehensively compare the uncertain inputs in a Reproducing Kernel Hilbert Space (RKHS) and accurately quantify the target function under various input uncertainties (Sec. 3.1). Moreover, to stabilize the MMD estimation and accelerate the posterior inference, we utilize the Nyström approximation to reduce the space complexity of MMD estimation from $O(m^2)$ to $O(mh)$, where $h \ll m$ (Sec. 3.2). This substantially improves the parallelization of posterior inference, and a rigorous theoretical regret bound is also established under the approximation error (Sec. 4). Comprehensive evaluations on synthetic functions and real problems in Sec. 5 demonstrate that our algorithm can efficiently identify the robust optimum under complex input uncertainty and achieve state-of-the-art performance.

2 Problem Formulation

In this section, we first formalize the robust optimization problem under input uncertainty and then briefly review the intuition behind Bayesian Optimization and Gaussian Processes.

2.1 Optimization with Input Uncertainty

As illustrated in Figure 1b, we consider the optimization of an expensive black-box function $f(x)$, where $x$ is the design parameter to be tuned. At each iteration $n$, we select a new query point $x_n$ according to the optimization heuristics. However, due to the stochastic nature of the process, such as machining error or execution noise, the query point is perturbed to $\hat{x}_n$ before the function evaluation. Moreover, we cannot observe the exact value of $\hat{x}_n$ and only have a vague probability estimation of its value: $P_{x_n}$. After the function evaluation, we get a noisy measurement $y = f(\hat{x}_n) + \zeta_n$, where $\zeta_n$ is homogeneous measurement noise sampled from $\mathcal{N}(0, \sigma^2)$. The goal is to find an optimal design parameter $x^*$ that maximizes the expected function value under input uncertainty:

$$x^* = \arg\max_{x} \int P_x(\hat{x})\, f(\hat{x})\, d\hat{x} = \arg\max_{x} \mathbb{E}_{P_x}[f]. \tag{1}$$

Depending on the specific problem and randomness source, the input distribution $P_x$ can in general be arbitrary and sometimes quite complex. Here we do not place any additional assumption on it, except that we can sample from the input distribution, which can easily be done by approximating it with Bayesian methods and learning a parametric probabilistic model [16]. Additionally, we assume the exact values of $\hat{x}$ are inaccessible, which is quite common in real-world applications, particularly in robotics and process control [25].
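Because sampling access is the only assumption placed on the input distributions, the objective in Eq. 1 can be estimated by plain Monte Carlo. A minimal sketch (the black-box `f` and the Gaussian perturbation model below are hypothetical placeholders, not the paper's benchmarks):

```python
import numpy as np

def expected_objective(f, sample_input, x, m=1000, rng=None):
    """Monte Carlo estimate of E_{P_x}[f] from Eq. 1.

    `sample_input(x, m, rng)` draws m perturbed inputs from P_x --
    sampling access is the only assumption placed on P_x above.
    """
    rng = rng or np.random.default_rng()
    return f(sample_input(x, m, rng)).mean()

# Toy example: Gaussian execution noise around the design parameter x.
f = lambda x: np.exp(-(x - 0.5) ** 2)                 # hypothetical black box
sample_gauss = lambda x, m, rng: x + 0.1 * rng.standard_normal(m)
print(expected_objective(f, sample_gauss, x=0.3))
```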

2.2 Bayesian Optimization

In this paper, we focus on finding the robust optimum with BO. Each iteration of BO involves two key steps: I) fitting a surrogate model and II) maximizing an acquisition function.

Gaussian Process Surrogate: To build a sample-efficient surrogate, we choose the Gaussian Process (GP) as the surrogate model in this paper. Following [34], a GP can be interpreted from a weight-space view: given a set of $n$ observations $\mathcal{D}_n = \{(x_i, y_i) \mid i = 1, \dots, n\}$, denote all the inputs as $X \in \mathbb{R}^{D \times n}$ and the output vector as $y \in \mathbb{R}^{n \times 1}$. We first consider a linear surrogate:

$$f(x) = x^T w, \quad y = f(x) + \zeta, \quad \zeta \sim \mathcal{N}(0, \sigma^2), \tag{2}$$

where $w$ denotes the model parameters and $\zeta$ the observation noise. This model's capacity is limited due to its linear form. To obtain a more powerful surrogate, we can extend it by projecting the input $x$ into a feature space $\phi(x)$. Taking a Bayesian treatment and placing a zero-mean Gaussian prior on the weight vector, $w \sim \mathcal{N}(0, \Sigma_p)$, its predictive distribution can be derived as follows (see Section 2.1 of [34] for the detailed derivation):

$$f_* \mid x, X, y \sim \mathcal{N}\Big(\phi^T(x)\,\Sigma_p\,\phi(X)\,(A + \sigma_n^2 I)^{-1}\, y,\;\; \phi^T(x)\,\Sigma_p\,\phi(x) - \phi^T(x)\,\Sigma_p\,\phi(X)\,(A + \sigma_n^2 I)^{-1}\,\phi^T(X)\,\Sigma_p\,\phi(x)\Big), \tag{3}$$

where $A = \phi^T(X)\,\Sigma_p\,\phi(X)$ and $I$ is an identity matrix. Note that the predictive distribution is also Gaussian and the feature mappings always appear in the form of inner products with respect to $\Sigma_p$. This implies that we are comparing inputs in a feature space and enables us to apply the kernel trick. Therefore, instead of exactly defining a feature mapping $\phi(\cdot)$, we can define a kernel $k(x, x') = \phi(x)^T \Sigma_p\, \phi(x') = \langle \psi(x), \psi(x') \rangle$. Substituting it into Eq. 3 gives the vanilla GP posterior:

$$f_* \mid X, y, X_* \sim \mathcal{N}(\mu, \Sigma), \quad \text{where } \mu = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\, y, \\ \Sigma = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\, K(X, X_*). \tag{4}$$

From this interpretation of the GP, we note that its core idea is to project the input $x$ to a (possibly infinite-dimensional) feature space $\psi(x)$ and compare inputs in the Reproducing Kernel Hilbert Space (RKHS) defined by the kernel.

Acquisition Function Optimization: Given the posterior of the GP surrogate model, the next step is to decide a query point $x_n$. The exploration-exploitation balance is achieved by designing an acquisition function $\alpha(x \mid \mathcal{D}_n)$. Though numerous acquisition functions exist [28], we follow [25, 7] and adopt the Upper Confidence Bound (UCB) acquisition:

$$\alpha(x \mid \mathcal{D}_n) = \mu(x) + \beta\,\sigma(x), \tag{5}$$

where $\beta$ is a hyper-parameter that controls the level of exploration.
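For concreteness, Eq. 5 is a one-liner once the surrogate posterior is available; a minimal sketch with illustrative names (not the paper's code):

```python
import numpy as np

def ucb(mu, sigma, beta=2.0):
    """UCB acquisition of Eq. 5: posterior mean plus beta-scaled
    posterior standard deviation at each candidate point."""
    return mu + beta * sigma

# The next query maximizes the acquisition over a candidate set:
# x_next = candidates[np.argmax(ucb(mu, sigma, beta))]
```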

3 Proposed Method

To cope with randomness during the optimization process, we aim to build a robust surrogate that can directly accept uncertain inputs of arbitrary distributions and propagate the input uncertainty into the posterior. Inspired by the weight-space interpretation of the GP, we empower the GP kernel with MMD to compare the uncertain inputs in an RKHS. In this way, the input randomness is considered during the covariance computation and naturally reflected in the resulting posterior, which can then be used to guide the search for a robust optimum (Sec. 3.1). To further accelerate the posterior inference, we employ the Nyström approximation to stabilize the MMD estimation and reduce its space complexity (Sec. 3.2).

3.1 Modeling the Uncertain Inputs

Assume $P_x \in \mathcal{P}_X \subset \mathcal{P}$ is a set of distribution densities over $\mathbb{R}^d$, representing the distributions of the uncertain inputs. We are interested in building a GP surrogate over the probability space $\mathcal{P}$, which requires measuring the difference between the uncertain inputs.

To do so, we turn to the Integral Probability Metric (IPM) [23]. The basic idea behind the IPM is to define a distance measure between two distributions $P$ and $Q$ as the supremum, over a class of functions $\mathcal{G}$, of the absolute expectation difference:

$$d(P, Q) = \sup_{g \in \mathcal{G}} \left| \mathbb{E}_{u \sim P}\, g(u) - \mathbb{E}_{v \sim Q}\, g(v) \right|, \tag{6}$$

where $\mathcal{G}$ is a class of functions satisfying certain conditions. Different choices of $\mathcal{G}$ lead to various IPMs. For example, if we restrict the function class to be uniformly bounded in an RKHS, we get the MMD [15], while a Lipschitz-continuous $\mathcal{G}$ realizes the Wasserstein distance [14].

In this work, we choose the MMD as the distance measure for the uncertain inputs because of its intrinsic connection with distance measurement in an RKHS. Given a characteristic kernel $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and its associated RKHS $\mathcal{H}_k$, define the mean map $\psi: \mathcal{P} \to \mathcal{H}_k$ such that $\langle \psi(P), g \rangle = \mathbb{E}_P[g]$ for all $g \in \mathcal{H}_k$. The MMD between $P, Q \in \mathcal{P}$ is defined as:

$$\mathrm{MMD}(P, Q) = \sup_{\|g\|_k \le 1} \left[ \mathbb{E}_{u \sim P}\, g(u) - \mathbb{E}_{v \sim Q}\, g(v) \right] = \|\psi_P - \psi_Q\|_{\mathcal{H}_k}. \tag{7}$$

Without any additional assumption on the input distributions, except that we can draw $m$ samples $\{u_i\}_{i=1}^m$ and $\{v_i\}_{i=1}^m$ from $P$ and $Q$ respectively, the MMD can be empirically estimated as follows [21]:

$$\mathrm{MMD}^2(P, Q) \approx \frac{1}{m(m-1)} \sum_{\substack{1 \le i, j \le m \\ i \ne j}} \big(k(u_i, u_j) + k(v_i, v_j)\big) - \frac{2}{m^2} \sum_{1 \le i, j \le m} k(u_i, v_j). \tag{8}$$
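A minimal NumPy sketch of the unbiased estimator in Eq. 8, using an RBF base kernel and 1-D samples for illustration:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Base kernel k; any characteristic kernel works here."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def mmd2_empirical(u, v, kernel=rbf):
    """Unbiased estimate of MMD^2(P, Q) from Eq. 8, given m samples
    u ~ P and v ~ Q."""
    m = len(u)
    Kuu, Kvv, Kuv = kernel(u, u), kernel(v, v), kernel(u, v)
    # Drop diagonal terms for the unbiased within-sample averages.
    off_diag = lambda K: (K.sum() - np.trace(K)) / (m * (m - 1))
    return off_diag(Kuu) + off_diag(Kvv) - 2.0 * Kuv.mean()
```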

To integrate the MMD into the GP surrogate, we design an MMD-based kernel over $\mathcal{P}$ as follows:

$$\hat{k}(P, Q) = \exp\big(-\alpha\, \mathrm{MMD}^2(P, Q)\big), \tag{9}$$

with a learnable scaling parameter $\alpha$. This is a valid kernel, and it is universal w.r.t. $C(\mathcal{P})$ under mild conditions (see Theorem 2.2 of [11]). It is also worth mentioning that, to compute the GP posterior, we only need to sample $m$ points from the input distributions, but do not require their corresponding function values.
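Given an estimated squared MMD (e.g., from the sketch above), the kernel in Eq. 9 is a single exponentiation. A minimal sketch; the clamp at zero is our own guard against the unbiased estimator dipping slightly negative, not a detail specified in the paper:

```python
import numpy as np

def mmd_kernel(mmd2, alpha=1.0):
    """MMD-based kernel of Eq. 9; `alpha` is a learnable scale in the
    full model, fixed here for illustration."""
    return np.exp(-alpha * max(mmd2, 0.0))
```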

With the MMD kernel, our surrogate model places a prior $\mathcal{GP}(0, \hat{k}(P_x, P_{x'}))$ and obtains a dataset $\mathcal{D}_n = \{(\hat{x}_i, y_i) \mid \hat{x}_i \sim P_{x_i},\, i = 1, 2, \dots, n\}$. The posterior is Gaussian with mean and variance:

$$\hat{\mu}_n(P) = \hat{k}_n(P)^T (\hat{K}_n + \sigma^2 I)^{-1} y_n, \tag{10}$$

$$\hat{\sigma}_n^2(P) = \hat{k}(P, P) - \hat{k}_n(P)^T (\hat{K}_n + \sigma^2 I)^{-1} \hat{k}_n(P), \tag{11}$$

where $y_n := [y_1, \dots, y_n]^T$, $\hat{k}_n(P) := [\hat{k}(P, P_{x_1}), \dots, \hat{k}(P, P_{x_n})]^T$, and $[\hat{K}_n]_{ij} = \hat{k}(P_{x_i}, P_{x_j})$.
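Eqs. 10-11 are the standard GP posterior, just with the Gram matrix built from $\hat{k}$. A minimal sketch assuming the MMD-kernel values have already been computed (names are illustrative):

```python
import numpy as np

def mmdgp_posterior(K, k_star, k_ss, y, noise_var=1e-2):
    """Posterior mean/variance of Eqs. 10-11 for one test distribution P.

    K      : n x n matrix with [K]_ij = k_hat(P_{x_i}, P_{x_j})
    k_star : length-n vector with entries k_hat(P, P_{x_i})
    k_ss   : scalar k_hat(P, P)
    """
    A = K + noise_var * np.eye(len(y))
    mean = k_star @ np.linalg.solve(A, y)
    var = k_ss - k_star @ np.linalg.solve(A, k_star)
    return mean, var
```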

3.2 Boosting Posterior Inference with Nyström Approximation

Deriving the posterior distribution of our robust GP surrogate requires estimating the MMD between each pair of inputs. Gretton et al. prove that the empirical estimator in Eq. 8 approximates the MMD in a bounded and asymptotic way [15]. However, the sampling size $m$ used for estimation greatly affects the approximation error, and insufficient sampling leads to high estimation variance (see Figure 3a). Such MMD estimation variance causes numerical instability of the covariance matrix and propagates into the posterior distribution and acquisition function, rendering the search for the optimal query point a challenging task. Figure 3b gives an example of the MMDGP posterior with insufficient samples, which produces a noisy acquisition function and impedes the search for the optimum. Increasing the sampling size can help alleviate this issue. However, the computation and space complexities of the empirical MMD estimator scale quadratically with the sampling size $m$. This leaves us with a dilemma: insufficient sampling results in a highly varied posterior, while a larger sample size occupies significant GPU memory and reduces the capacity for parallel computation.

To reduce the space and computation complexity while retaining a stable MMD estimation, we resort to the Nyström approximation [33]. This method alleviates the computational cost of a kernel matrix by randomly selecting $h$ subsamples from the $m$ samples ($h \ll m$) and computing an approximate matrix via $\tilde{K} = K_{mh} K_h^{+} K_{mh}^T$. Combining this with the MMD definition gives its Nyström estimator:

$$\begin{aligned}
\mathrm{MMD}^2(P, Q) &= \mathbb{E}_{u, u' \sim P \otimes P}[k(u, u')] + \mathbb{E}_{v, v' \sim Q \otimes Q}[k(v, v')] - 2\,\mathbb{E}_{u, v \sim P \otimes Q}[k(u, v)] \\
&\approx \frac{1}{m^2} \mathbf{1}_m^T U \mathbf{1}_m + \frac{1}{m^2} \mathbf{1}_m^T V \mathbf{1}_m - \frac{2}{m^2} \mathbf{1}_m^T W \mathbf{1}_m \\
&\approx \frac{1}{m^2} \mathbf{1}_m^T U_{mh} U_h^{+} U_{mh}^T \mathbf{1}_m + \frac{1}{m^2} \mathbf{1}_m^T V_{mh} V_h^{+} V_{mh}^T \mathbf{1}_m - \frac{2}{m^2} \mathbf{1}_m^T W_{mh} W_h^{+} W_{mh}^T \mathbf{1}_m,
\end{aligned} \tag{12}$$

where $U = K(u, u)$, $V = K(v, v)$, and $W = K(u, v)$ are the kernel matrices, $\mathbf{1}_m$ represents an $m$-by-$1$ vector of ones, $m$ defines the sampling size, and $h$ controls the sub-sampling size. Note that this Nyström estimator reduces the space complexity of posterior inference from $O(MNm^2)$ to $O(MNmh)$, where $M$ and $N$ are the numbers of training and testing samples, $m$ is the sampling size for MMD estimation, and $h \ll m$ is the sub-sampling size. This can significantly boost the posterior inference of the robust GP by allowing more inferences to run in parallel on a GPU.
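To illustrate the memory saving, here is a hedged sketch in the Nyström mean-embedding style of [8]: both sample sets are embedded on $h \ll m$ landmarks, so only $m \times h$ kernel blocks are ever formed. For simplicity it shares one landmark set across the two distributions, a slight simplification of the per-matrix subsampling in Eq. 12:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def mmd2_nystrom(u, v, h=10, kernel=rbf, rng=None):
    """Nystrom MMD^2 estimate with h landmarks: O(mh) space instead of
    the O(m^2) of the empirical estimator."""
    rng = rng or np.random.default_rng(0)
    lm = rng.choice(np.concatenate([u, v]), size=h, replace=False)
    Kh = kernel(lm, lm)                               # h x h landmark Gram
    Kh_inv = np.linalg.pinv(Kh)
    # Nystrom weights of the two empirical mean embeddings.
    au = Kh_inv @ kernel(lm, u).mean(axis=1)          # only h x m blocks
    av = Kh_inv @ kernel(lm, v).mean(axis=1)
    d = au - av
    return float(d @ Kh @ d)                          # ||psi_P - psi_Q||^2
```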

4 Theoretical Analysis

Assume $x \in \mathcal{X} \subset \mathbb{R}^d$, and $P_x \in \mathcal{P}_X \subset \mathcal{P}$ is a set of distribution densities over $\mathbb{R}^d$ representing the distribution of the noisy input. Given a characteristic kernel $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and its associated RKHS $\mathcal{H}_k$, we define the mean map $\psi: \mathcal{P} \to \mathcal{H}_k$ such that $\langle \psi(P), g \rangle = \mathbb{E}_P[g]$ for all $g \in \mathcal{H}_k$.

We consider a more general case: choose any suitable functional $L$ such that $\hat{k}(P, P') := L(\psi_P, \psi_{P'})$ is a positive-definite kernel over $\mathcal{P}$, for example the linear kernel $\langle \psi_P, \psi_{P'} \rangle_k$ or radial kernels $\exp(-\alpha \|\psi_P - \psi_{P'}\|_k^2)$ that use the MMD distance as a metric. Such a kernel $\hat{k}$ is associated with an RKHS $\mathcal{H}_{\hat{k}}$ containing functions over the space of probability measures $\mathcal{P}$.

One important theoretical guarantee for conducting GP modeling is that our objective function can be approximated by functions in $\mathcal{H}_{\hat{k}}$, which relies on the universality of $\hat{k}$. Let $C(\mathcal{P})$ be the class of continuous functions over $\mathcal{P}$ endowed with the topology of weak convergence and the associated Borel $\sigma$-algebra, and define $\hat{f} \in C(\mathcal{P})$ such that

$$\hat{f}(P) := \mathbb{E}_P[f], \quad \forall P \in \mathcal{P},$$

which is just our objective function. For $\hat{k}$ a radial kernel, it has been shown that $\hat{k}$ is universal w.r.t. $C(\mathcal{P})$, given that $\mathcal{X}$ is compact and the mean map $\psi$ is injective [11, 22]. For $\hat{k}$ the linear kernel, which is not universal, it has been shown in Lemma 1 of [26] that $\hat{f} \in \mathcal{H}_{\hat{k}}$ if and only if $f \in \mathcal{H}_k$, and further $\|\hat{f}\|_{\hat{k}} = \|f\|_k$. Thus, in the remainder of this section, we may simply assume $\hat{f} \in \mathcal{H}_{\hat{k}}$.

Suppose we have an approximate kernel function $\tilde{k}(P, Q)$ close to the exact kernel function $\hat{k}(P, Q)$. The mean $\hat{\mu}_n(P)$ and variance $\hat{\sigma}_n^2(P)$ are approximated by

$$\tilde{\mu}_n(P) = \tilde{k}_n(P)^T (\tilde{K}_n + \sigma^2 I)^{-1} y_n, \tag{13}$$

$$\tilde{\sigma}_n^2(P) = \tilde{k}(P, P) - \tilde{k}_n(P)^T (\tilde{K}_n + \sigma^2 I)^{-1} \tilde{k}_n(P), \tag{14}$$

where $y_n := [y_1, \dots, y_n]^T$, $\tilde{k}_n(P) := [\tilde{k}(P, P_1), \dots, \tilde{k}(P, P_n)]^T$, and $[\tilde{K}_n]_{ij} = \tilde{k}(P_i, P_j)$. The maximum information gain corresponding to the kernel $\hat{k}$ is denoted as

$$\hat{\gamma}_n := \sup_{R \subset \mathcal{P}_X;\, |R| = n} \hat{I}(y_n; \hat{f}_n \mid R) = \frac{1}{2} \ln \det(I + \sigma^{-2} \hat{K}_n).$$

Denote by $e(P, Q) = \hat{k}(P, Q) - \tilde{k}(P, Q)$ the error function when estimating the kernel $\hat{k}$. We suppose $e(P, Q)$ has an upper bound with high probability:

Assumption 1. For any $\varepsilon > 0$ and $P, Q \in \mathcal{P}_X$, we may choose an estimate $\tilde{k}(P, Q)$ such that the error function $e(P, Q)$ can be upper-bounded by $e_\varepsilon$ with probability at least $1 - \varepsilon$, that is, $\mathbb{P}(|e(P, Q)| \le e_\varepsilon) > 1 - \varepsilon$.

Remark. Note that this assumption is standard in our case: we may assume $\max_{x \in \mathcal{X}} \|\phi(x)\|_k \le \Phi$, where $\phi$ is the feature map corresponding to $k$. Then, when using the empirical estimator, the error between $\mathrm{MMD}_{\text{empirical}}$ and $\mathrm{MMD}$ is controlled by $4\Phi^2\sqrt{\log(6/\varepsilon)\, m^{-1}}$ with probability at least $1 - \varepsilon$, according to Lemma E.1 of [8]. When using the Nyström estimator, the error has a similar form to the empirical one, and under mild conditions, when $h = O(\sqrt{m}\log(m))$, we get an error of order $O(m^{-1/2}\log(1/\varepsilon))$ with probability at least $1 - \varepsilon$. One can check more details in Lemma 1.

Now we restrict our Gaussian process to the subspace $\mathcal{P}_X = \{P_x, x \in \mathcal{X}\} \subset \mathcal{P}$. We assume the observation $y_i = f(\hat{x}_i) + \zeta_i$ with noise $\zeta_i$. The input-induced noise is defined as $\Delta f_{P_{x_i}} := f(\hat{x}_i) - \mathbb{E}_{P_{x_i}}[f] = f(\hat{x}_i) - \hat{f}(P_{x_i})$. Then the total noise is $y_i - \mathbb{E}_{P_{x_i}}[f] = \zeta_i + \Delta f_{P_{x_i}}$. We can now state our main result, which gives a cumulative regret bound under inexact kernel calculations.

Theorem 1. Let $\delta > 0$, $f \in \mathcal{H}_k$, with the corresponding $\|\hat{f}\|_{\hat{k}} \le b$ and $\max_{x \in \mathcal{X}} |f(x)| = M$. Suppose the observation noise $\zeta_i = y_i - f(\hat{x}_i)$ is $\sigma_\zeta$-sub-Gaussian, and thus with high probability $|\zeta_i| < A$ for some $A > 0$. Assume that both $k$ and $P_x$ satisfy the conditions for $\Delta f_{P_x}$ to be $\sigma_E$-sub-Gaussian, for a given $\sigma_E > 0$. Then, under Assumption 1 with $\varepsilon > 0$ and corresponding $e_\varepsilon$, setting $\sigma^2 = 1 + 2/n$ and running the Gaussian process with acquisition function

$$\tilde{\alpha}(x \mid \mathcal{D}_n) = \tilde{\mu}_n(P_x) + \beta_n\, \tilde{\sigma}_n(P_x), \tag{15}$$

$$\text{where } \beta_n = b + \sqrt{\sigma_E^2 + \sigma_\zeta^2}\, \sqrt{2\big(\hat{\gamma}_n + 1 - \ln\delta\big)},$$

we have that the uncertain-inputs cumulative regret satisfies:

$$\tilde{R}_n \le O\Big(\sqrt{n \hat{\gamma}_n (\hat{\gamma}_n - \ln\delta)} + n^2 \sqrt{(\hat{\gamma}_n - \ln\delta)\, e_\varepsilon} + n^3 e_\varepsilon\Big) \tag{16}$$

with probability at least $1 - \delta - n\varepsilon$. Here $\tilde{R}_n = \sum_{t=1}^n \tilde{r}_t$, and $\tilde{r}_t = \max_{x \in \mathcal{X}} \mathbb{E}_{P_x}[f] - \mathbb{E}_{P_{x_t}}[f]$.

The proof of our main Theorem 1 can be found in Appendix B.3.

The assumption that $\zeta_i$ is $\sigma_\zeta$-sub-Gaussian is standard in the GP literature. The assumption that $\Delta f_{P_x}$ is $\sigma_E$-sub-Gaussian can be met when $P_x$ is uniformly bounded or Gaussian, as stated in Proposition 3 of [26]. Readers may check the definition of sub-Gaussianity in the appendix, Definition 1.

To achieve a regret of order $\tilde{R}_n \le O(\sqrt{n}\,\hat{\gamma}_n)$, the same order as the exact improved GP regret (23), and to ensure this with high probability, we need to take $\varepsilon = O(\delta/n)$ and $e_\varepsilon = O\big(n^{-5/2}\hat{\gamma}_n(\hat{\gamma}_n \wedge 2n^{-1/2})\big)$, which requires a sample size $m$ of order $O\big(n^5 \hat{\gamma}_n^{-2}(\hat{\gamma}_n^{-2} \vee 4n)\log(n)\big)$ for MCMC approximation, or, with the same sample size $m$, a subsample size $h$ of order $O\big(n^{5/2+\nu}\hat{\gamma}_n^{-1-\nu}(\hat{\gamma}_n \wedge 2n^{-1/2})^{-\nu}\big)$ for the Nyström approximation with some $\nu > 0$. Note that (16) only offers an upper bound for the cumulative regret; in real applications the realized regret may be much smaller than this bound, as the approximation error $e_\varepsilon$ can be fairly small even with a few samples when the input noise is relatively weak.

Analyzing the exact order of $\hat{\gamma}_n$ can be difficult, as it is influenced by the specific choices of the embedding kernel $k$ and the input uncertainty distributions $P_{x_i}, x_i \in \mathcal{X}$. Nevertheless, we can deduce the following result for a wide range of cases, showing that the cumulative regret is sub-linear under mild conditions. One can check the proof in Appendix B.4.

Theorem 2 (Bounding the maximum information gain). Suppose $k$ is $r$-times differentiable with bounded derivatives and translation invariant, i.e., $k(x, y) = k(x - y, 0)$. Suppose the input uncertainty is i.i.d., that is, the noised input density satisfies $P_{x_i}(x) = P_0(x - x_i)$ for all $x_i \in \mathcal{X}$. Then, if the space $\mathcal{X}$ is compact in $\mathbb{R}^d$, the maximum information gain $\hat{\gamma}_n$ satisfies

$$\hat{\gamma}_n = O\big(n^{\frac{d(d+1)}{r + d(d+1)}} \log(n)\big).$$

Thus, when $r > d(d+1)$, the cumulative regret is sub-linear with respect to $n$, given a sufficiently small $e_\varepsilon$.

Figure 2: Modeling results under different types of input uncertainties.

5 Evaluation

In this section, we first experimentally demonstrate AIRBO's ability to model uncertain inputs of arbitrary distributions, then validate the Nyström-based inference acceleration for the GP posterior, followed by experiments on robust optimization of synthetic functions and a real-world benchmark.

5.1 Robust Surrogate

Modeling arbitrary uncertain inputs: We demonstrate MMDGP's capabilities by employing an RKHS function as the black-box function and randomly selecting 10 samples from its input domain. Various types of input randomness are introduced into the observations, producing training datasets $\mathcal{D} = \{(x_i, f(x_i + \delta_i)) \mid \delta_i \sim P_{x_i}\}_{i=1}^{10}$ with different $P_x$ configurations. Figures 2a and 2b compare the modeling results of a conventional GP and MMDGP under a Gaussian input uncertainty $P_x = \mathcal{N}(0, 0.01^2)$. We observe that the GP model appears to overfit the observed samples without recognizing the input uncertainty, whereas MMDGP properly incorporates the input randomness into its posterior.

To further examine our model's ability under complex input uncertainty, we design the input distribution to follow a beta distribution with input-dependent variance: $P_x = \mathrm{beta}(\alpha = 0.5, \beta = 0.5, \sigma = 0.9(\sin 4\pi x + 1))$. The MMDGP posterior is shown in Figure 2c. As the input variance $\sigma$ changes along $x$, inputs to the left and right of a given location $x_i$ yield different MMD distances, resulting in an asymmetric posterior (e.g., around $x = 0.05$ and $x = 0.42$). This suggests that MMDGP can precisely model the multimodality and asymmetry of the input uncertainty.

Moreover, we evaluated MMDGP using a step-changing Chi-squared distribution $P_x = \chi^2(g(x), \sigma = 0.01)$, where $g(x) = 0.5$ if $x \in [0, 0.6]$ and $g(x) = 7.0$ otherwise. This abrupt change in $g(x)$ significantly alters the input distribution from a sharply peaked distribution to a flat one with a long tail. Figure 2d illustrates that our model can accurately capture this variation in distribution shape, as evidenced by the sudden posterior change around $x = 0.6$. This demonstrates that our model can thoroughly quantify the characteristics of complex input uncertainties.

Comparison with other surrogate models: We also compare our model with the other surrogate models under the step-changing Chi-squared input distribution. The results are reported in Figure 7 and demonstrate that our model clearly outperforms the others under such complex input uncertainty (see Appendix D.1 for more details).

5.2 Accelerating the Posterior Inference

Estimation variance of MMD: We first examine the variance of the MMD estimation by employing two beta distributions, $P = \mathrm{beta}(\alpha = 0.4, \beta = 0.2, \sigma = 0.1)$ and $Q = \mathrm{beta}(\alpha = 0.4, \beta = 0.2, \sigma = 0.1) + c$, where $c$ is an offset value. Figure 3a shows the empirical MMDs computed via Eq. 8 with varying sampling sizes as $Q$ moves away from $P$. We find that a sampling size of 20 is inadequate, leading to high estimation variance, while increasing the sampling size to 100 stabilizes the estimation.

We further utilize this beta distribution $P$ as the input distribution and derive the MMDGP posterior via the empirical estimator in Figure 3b. Note that the MMD instability caused by inadequate sampling subsequently engenders a fluctuating posterior and culminates in a noisy acquisition function, which prevents the acquisition optimizer (e.g., L-BFGS-B in this experiment) from identifying the optimum.
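A quick simulation in the spirit of Figure 3a, reusing the `mmd2_empirical` sketch from Sec. 3.1 (the offset $c = 0.05$ and the way $\sigma$ scales the beta samples are our own assumptions, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)
for m in (20, 100):
    estimates = [
        # u ~ P, v ~ Q = P + c, with sigma = 0.1 and c = 0.05 (assumed)
        mmd2_empirical(0.1 * rng.beta(0.4, 0.2, m),
                       0.1 * rng.beta(0.4, 0.2, m) + 0.05)
        for _ in range(200)
    ]
    print(m, np.std(estimates))   # the spread shrinks as m grows
```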

Figure 3: The posterior derived from the empirical and Nyström MMD approximators with varying sampling sizes.

Table 1: Performance of posterior inference for 512 samples.

| Method | Sampling Size | Sub-sampling Size | Inference Time (seconds) | Batch Size (samples) |
| --- | --- | --- | --- | --- |
| Empirical | 20 | - | 1.143 ± 0.083 | 512 |
| Empirical | 100 | - | 8.117 ± 0.040 | 128 |
| Empirical | 1000 | - | 840.715 ± 2.182 | 1 |
| Nyström | 100 | 10 | 0.780 ± 0.001 | 512 |
| Nyström | 1000 | 100 | 21.473 ± 0.984 | 128 |

Although Figure 3c shows that this issue can be mitigated by using more samples during empirical MMD estimation, it is crucial to note that a larger sampling size significantly increases GPU memory usage because of the quadratic space complexity of $O(MNm^2)$ ($M$ and $N$ are the numbers of training and testing samples, and $m$ is the sampling size for MMD estimation). This limitation severely hinders parallel inference for multiple samples and slows the overall speed of posterior computation.

Table 1 summarizes the inference time of MMDGP posteriors at 512 samples with different sampling sizes. We find that, for the beta distribution defined in this experiment, the Nyström MMD estimator with a sampling size of 100 and a sub-sampling size of 10 already delivers a result comparable to the empirical estimator with 100 samples (as seen in the acquisition plot of Figure 3d). Also, the inference time is reduced from 8.117 to 0.78 seconds by enabling parallel computation over more samples. For cases that require many more samples for MMD estimation (e.g., when the input distribution is quite complex or high-dimensional), this Nyström-based acceleration can have an even more pronounced impact.

Effect of the Nyström estimator on optimization: To investigate the effect of the Nyström estimator on optimization, we also perform an ablation study in Appendix D.2; the results in Figure 8 suggest that the Nyström estimator slightly degrades the optimization performance but greatly improves the inference efficiency.

5.3 Robust Optimization

Experiment setup: To experimentally validate AIRBO's performance, we implement our algorithm¹ based on BoTorch [2] and employ a linear combination of multiple rational quadratic kernels [6] to compute the MMD in Eq. 9. We compare our algorithm with several baselines: 1) uGP-UCB [26] is a closely related work that employs an integral kernel to model various input distributions; it has a quadratic inference complexity of $O(MNm^2)$, where $M$ and $N$ are the sample numbers of the training and testing sets, and $m$ indicates the sampling size of the integral kernel. 2) GP-UCB is the standard GP with UCB acquisition, which represents a broad range of existing methods that


¹ The code will be available at https://github.com/huawei-noah/HEBO and more implementation details can be found in Appendix C.1.


Figure 4: Robust optimization results on synthetic functions.

focus on non-robust optimization. 3) SKL-UCB employs the symmetric Kullback-Leibler divergence to measure the distance between the uncertain inputs [20]; its closed-form solution only exists if the input distributions are Gaussian. 4) ERBF-UCB is the robust GP with the expected Radial Basis Function kernel proposed in [13]; it computes the expected kernel under the input distribution using Gaussian integrals. Assuming the input distributions are sub-Gaussian, this method can efficiently find the robust optimum. Since all the methods use the UCB acquisition, we simply distinguish them by their surrogate names in the following tests.

At the end of the optimization, each algorithm needs to decide a final outcome $x_n^r$, perceived to be the robust optimum under input uncertainty at step $n$. For a fair comparison, we employ the same outcome policy across all the algorithms: $x_n^r = \arg\max_{x \in \mathcal{D}_n} \hat{\mu}(x)$, where $\hat{\mu}(x)$ is the posterior mean of the robust surrogate at $x$ and $\mathcal{D}_n = \{(x_i, f(x_i + \delta_i)) \mid \delta_i \sim P_{x_i}\}$ are the observations so far. The optimization performance is measured in terms of the robust regret as follows:

$$r(x_n^r) = \mathbb{E}_{\delta \sim P_{x^*}}[f(x^* + \delta)] - \mathbb{E}_{\delta \sim P_{x_n^r}}[f(x_n^r + \delta)], \tag{17}$$

where $x^*$ is the global robust optimum and $x_n^r$ represents the outcome point at step $n$. For each algorithm, we repeat the optimization process 12 times and compare the average robust regret.
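Since Eq. 17 involves expectations under $P_{x^*}$ and $P_{x_n^r}$, it is estimated by Monte Carlo in practice; a minimal sketch with the same sampling conventions as the snippet in Sec. 2.1 (`f` and `sample_input` are placeholders):

```python
import numpy as np

def robust_regret(f, sample_input, x_star, x_out, m=10_000, rng=None):
    """Monte Carlo estimate of Eq. 17: the gap in expected value under
    input uncertainty between the true robust optimum x_star and the
    algorithm's outcome x_out."""
    rng = rng or np.random.default_rng()
    best = f(sample_input(x_star, m, rng)).mean()
    found = f(sample_input(x_out, m, rng)).mean()
    return best - found
```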

1D RKHS function: We begin the optimization evaluation with an RKHS function that is widely used in previous BO works [1, 24, 10]. Figure 4a shows that its exact global optimum resides at $x = 0.892$, while the robust optimum is around $x = 0.08$ when the inputs follow a Gaussian distribution $\mathcal{N}(0, 0.01^2)$. According to Figure 4b, all the robust BO methods work well with Gaussian uncertain inputs and efficiently identify the robust optimum, but GP-UCB stagnates at a local optimum due to its neglect of input uncertainty. We also notice that the regret of our method decreases slightly more slowly than uGP's in this low-dimensional, Gaussian-input case, but the later cases with higher dimensions and more complex distributions show that our method is more stable and efficient.

1D double-peak function: To test with more complex input uncertainty, we design a black-box function with double peaks and set the input distribution to be a multi-modal distribution $P_x = \mathrm{beta}(\alpha = 0.4, \beta = 0.2, \sigma = 0.1)$. Figure 4c shows the black-box function (black solid line) and the corresponding function expectations estimated numerically via sampling from the input distribution (i.e., the colored lines). Note that the true robust optimum is around $x = 0.251$ under the beta distribution, but an erroneous location at $x = 0.352$ may be selected if the input uncertainty is incorrectly presumed to be Gaussian. This explains the results in Figure 4d: the performance of SKL-UCB and ERBF-UCB is sub-optimal due to their misidentification of the inputs as Gaussian variables, while our method accurately quantifies the input uncertainty and outperforms the others.

10D bumped-bowl function: We also extend our evaluation to a 10D bumped-bowl function [27] under a concatenated circular distribution. Figure 9 demonstrates that AIRBO scales efficiently to high dimensions and outperforms the others under complex input uncertainty (see Appendix D.3).

Robust robot pushing: To evaluate AIRBO on a real-world problem, we employ the robust robot pushing benchmark from [31], in which a ball is placed at the origin of a 2D space and a robot learns to push it to a predefined target location $(g_x, g_y)$. This benchmark takes a 3-dimensional input $(r_x, r_y, r_t)$, where $r_x, r_y \in [-5, +5]$ are the 2D coordinates of the initial robot location and $r_t \in [0, 30]$ controls the push duration. We set four targets in separate quadrants, i.e., $g_1 = (3, 3)$, $g_2 = (3, 3)$, $g_3 = (4.3, 4.3)$, and a "twin" target at $g_3' = (5.1, 3.0)$, and describe the input uncertainty via a two-component Gaussian mixture model (defined in Appendix D.4). Following [7, 10], this black-box benchmark outputs the minimum distance to these 4 targets under squared and linear distances: $\mathrm{loss} = \min\big(d^2(g_1, l), d(g_2, l), d(g_3, l), d(g_3', l)\big)$, where $d(g_i, l)$ is the Euclidean distance between the ball's ending location $l$ and the $i$-th target. This produces the loss landscape shown in Figure 5a. Note that $g_2$ is a more robust target than $g_1$ because of its linear-form distance, while pushing the ball to quadrant I is the best choice as the targets $g_3$ and $g_3'$ match the dual-mode pattern of the input uncertainty. According to Figure 5b, our method clearly outperforms the others because it efficiently quantifies the multimodal input uncertainty. This is further evidenced by the push configurations found by different algorithms in Figure 6, in which each dot represents the robot's initial location and its color represents the push duration. We find that AIRBO successfully recognizes the targets in quadrant I as the optimal choice and frequently pushes from quadrant III to quadrant I. Moreover, pushes started close to the origin can easily go far astray under input variation, so our method learns to push the ball from a corner with a long push duration, which is more robust in this case.

Figure 6: The robot's initial locations and push durations found by different algorithms.

6 Discussion and Conclusion

In this work, we generalize robust Bayesian Optimization to an uncertain-input setting. The weight-space interpretation of the GP inspires us to empower the GP kernel with MMD and build a robust surrogate for uncertain inputs of arbitrary distributions. We also employ the Nyström approximation to boost the posterior inference and provide a theoretical regret bound under the approximation error. The experiments on synthetic black-box functions and benchmarks demonstrate that our method can handle various input uncertainties and achieves state-of-the-art optimization performance.

There are several interesting directions worth exploring. Though we arrived at the current MMD-based kernel from the weight-space interpretation of the GP and the RKHS realization of MMD, our kernel design exhibits a deep connection with existing works on kernels over probability measures [22, 11]. Along this direction, as our theoretical regret analysis in Section 4 does not assume any particular form of kernel and the Nyström acceleration can also be extended to other kernel computations, it is possible that AIRBO can be further generalized to a richer family of kernels. Moreover, the MMD used in our kernel is by no means limited to its RKHS realization. In fact, any function class $\mathcal{F}$ that comes with uniform convergence guarantees and is sufficiently rich can be used, which renders different realizations of MMD. With a proper choice of function class $\mathcal{F}$, MMD can be expressed as the Kolmogorov metric or other earth-mover distances [15]. It would also be interesting to extend AIRBO with other IPMs.

7 Acknowledgements

We sincerely thank Yanbin Zhu and Ke Ma for their help in formulating the problem. Also, heartfelt appreciation goes to Lu Kang for her constant encouragement and support throughout this work.

References

[1] John-Alexander M Assael et al. "Heteroscedastic treed Bayesian optimisation". In: arXiv preprint arXiv:1410.7172 (2014).

[2] Maximilian Balandat et al. "BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization". In: Advances in Neural Information Processing Systems 33. 2020. URL: http://arxiv.org/abs/1910.06403.

[3] Justin J Beland and Prasanth B Nair. "Bayesian Optimization Under Uncertainty". In: Advances in Neural Information Processing Systems (2017), p. 5.

[4] James Bergstra et al. "Algorithms for hyper-parameter optimization". In: Advances in Neural Information Processing Systems 24 (2011).

[5] Felix Berkenkamp, Andreas Krause, and Angela P Schoellig. "Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics". In: Machine Learning (2021), pp. 1-35.

[6] Mikołaj Bińkowski et al. Demystifying MMD GANs. 2021. DOI: 10.48550/arXiv.1801.01401. arXiv: 1801.01401.

[7] Ilija Bogunovic et al. "Adversarially Robust Optimization with Gaussian Processes". In: Advances in Neural Information Processing Systems. Vol. 31. Curran Associates, Inc., 2018.

[8] Antoine Chatalic et al. "Nyström Kernel Mean Embeddings". In: International Conference on Machine Learning. 2022.

[9] Sayak Ray Chowdhury and Aditya Gopalan. "On Kernelized Multi-armed Bandits". In: International Conference on Machine Learning. 2017.

[10] Ryan B. Christianson and Robert B. Gramacy. Robust Expected Improvement for Bayesian Optimization. 2023. arXiv: 2302.08612.

[11] Andreas Christmann and Ingo Steinwart. "Universal Kernels on Non-Standard Input Spaces". In: NIPS. 2010. URL: https://api.semanticscholar.org/CorpusID:16968759.

[12] Alexander I Cowen-Rivers et al. "HEBO: pushing the limits of sample-efficient hyper-parameter optimisation". In: Journal of Artificial Intelligence Research 74 (2022), pp. 1269-1349.

[13] Patrick Dallaire, Camille Besse, and Brahim Chaib-draa. "An Approximate Inference with Gaussian Process to Latent Functions from Uncertain Data". In: Neurocomputing. Adaptive Incremental Learning in Neural Networks 74.11 (2011), pp. 1945-1955. DOI: 10.1016/j.neucom.2010.09.024.

[14] Arthur Gretton, Dougal Sutherland, and Wittawat Jitkrittum. "Interpretable comparison of distributions and models". In: Advances in Neural Information Processing Systems [Tutorial] (2019).

[15] Arthur Gretton et al. "A kernel two-sample test". In: The Journal of Machine Learning Research 13.1 (2012), pp. 723-773.

[16] Trevor Hastie et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. Springer, 2009.

[17] Thomas Kühn. "Eigenvalues of integral operators with smooth positive definite kernels". In: Archiv der Mathematik 49 (1987), pp. 525-534. URL: https://api.semanticscholar.org/CorpusID:121372638.

[18] Ruben Martinez-Cantin et al. "Active policy learning for robot planning and exploration under uncertainty". In: Robotics: Science and Systems. Vol. 3. 2007, pp. 321-328.

[19] Seyedali Mirjalili and Andrew Lewis. "Obstacles and difficulties for robust benchmark problems: A novel penalty-based robust optimisation method". In: Information Sciences 328 (2016), pp. 485-509.

[20] Pedro Moreno, Purdy Ho, and Nuno Vasconcelos. "A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications". In: Advances in Neural Information Processing Systems. Vol. 16. MIT Press, 2003.

[21] Krikamol Muandet et al. "Kernel Mean Embedding of Distributions: A Review and Beyond". In: Foundations and Trends in Machine Learning 10.1-2 (2017), pp. 1-141. DOI: 10.1561/2200000060. arXiv: 1605.09522.

[22] Krikamol Muandet et al. "Learning from Distributions via Support Measure Machines". In: NIPS. 2012.

[23] Alfred Müller. "Integral probability metrics and their generating classes of functions". In: Advances in Applied Probability 29.2 (1997), pp. 429-443.

[24] Jose Nogueira et al. "Unscented Bayesian Optimization for Safe Robot Grasping". In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Daejeon, South Korea: IEEE, 2016, pp. 1967-1972. ISBN: 978-1-5090-3762-9. DOI: 10.1109/IROS.2016.7759310.

[25] Rafael Oliveira, Lionel Ott, and Fabio Ramos. "Bayesian Optimisation under Uncertain Inputs". In: Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 1177-1184.

[26] Rafael Oliveira, Lionel Ott, and Fabio Ramos. Bayesian optimisation under uncertain inputs. 2019. arXiv: 1902.07908.

[27] Nicholas D. Sanders et al. Bayesian Search for Robust Optima. 2021. arXiv: 1904.11416.

[28] Bobak Shahriari et al. "Taking the Human Out of the Loop: A Review of Bayesian Optimization". In: Proceedings of the IEEE 104.1 (Jan. 2016), pp. 148-175. DOI: 10.1109/JPROC.2015.2494218.

[29] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. "Practical Bayesian optimization of machine learning algorithms". In: Advances in Neural Information Processing Systems 25 (2012).

[30] Niranjan Srinivas et al. "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design". In: International Conference on Machine Learning. 2009. URL: https://api.semanticscholar.org/CorpusID:59031327.

[31] Zi Wang and Stefanie Jegelka. "Max-value entropy search for efficient Bayesian optimization". In: International Conference on Machine Learning. PMLR. 2017, pp. 3627-3635.

[32] Colin White, Willie Neiswanger, and Yash Savani. "BANANAS: Bayesian optimization with neural architectures for neural architecture search". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 12. 2021, pp. 10293-10301.

[33] Christopher Williams and Matthias Seeger. "Using the Nyström Method to Speed Up Kernel Machines". In: Advances in Neural Information Processing Systems. Vol. 13. MIT Press, 2000.

[34] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. Vol. 2. 3. MIT Press, Cambridge, MA, 2006.

A Nyström Estimator Error Bound

The Nyström estimator can easily approximate the kernel mean embeddings $\psi_{p_1}, \psi_{p_2}$ as well as the MMD distance between two distribution densities $p_1$ and $p_2$. We first need to assume boundedness of the feature map of the kernel $k$:

Assumption 2. There exists a positive constant $K$ such that $\sup_{x \in \mathcal{X}} \|\phi(x)\| \le K$.

The true MMD distance between $p_1$ and $p_2$ is denoted as $\mathrm{MMD}(p_1, p_2)$. The estimated MMD distance when using a Nyström sample size $m_i$ and sub-sample size $h_i$ for each $p_i$ is denoted as $\mathrm{MMD}(p_i, m_i, h_i)$. Then the error is

$$\mathrm{Err}(p_i, m_i, h_i) := \left| \mathrm{MMD}(p_1, p_2) - \mathrm{MMD}(p_i, m_i, h_i) \right|,$$

and we now have the following lemma, from Theorem 5.1 in [8].

Lemma 1. Let Assumption 2 hold. Furthermore, assume that for $i \in \{1, 2\}$, the data points $X_1^i, \dots, X_{m_i}^i$ are drawn i.i.d. from the distribution $\rho_i$, and that $h_i \le m_i$ sub-samples $\tilde{X}_1^i, \dots, \tilde{X}_{h_i}^i$ are drawn uniformly with replacement from the dataset $\{X_1^i, \dots, X_{m_i}^i\}$. Then, for any $\delta \in (0, 1)$, it holds with probability at least $1 - 2\delta$ that

$$\mathrm{Err}(p_i, m_i, h_i) \le \sum_{i=1,2} \left( \frac{c_1}{\sqrt{m_i}} + \frac{c_2}{h_i} + c_4 \sqrt{\frac{\log(h_i/\delta)}{h_i}} \sqrt{\mathcal{N}_{p_i}\!\left(\frac{12 K^2 \log(h_i/\delta)}{h_i}\right)} \right),$$

provided that, for $i \in \{1, 2\}$,

$$h_i \ge \max\left(67,\; 12 K^2 \|C_i\|_{L(\mathcal{H})}^{-1}\right) \log(h_i/\delta),$$

where $c_1 = 2K\sqrt{2\log(6/\delta)}$, $c_2 = 4\sqrt{3}\,K\log(12/\delta)$, and $c_4 = \sqrt{6}\,K\sqrt{\log(12/\delta)}$. The notation $\mathcal{N}_{p_i}$ denotes the effective dimension associated with the distribution $p_i$.

Specifically, suppose the effective dimension $\mathcal{N}_{\rho_i}$ satisfies, for some $c \ge 0$,

- either $\mathcal{N}_{p_i}(\sigma^2)\le c\,\sigma^{-2\gamma}$ for some $\gamma\in(0,1)$,
- or $\mathcal{N}_{p_i}(\sigma^2)\le\log(1+c/\sigma^2)/\beta$ for some $\beta>0$.

Then, choosing the sub-sample size $h_i$ on the order of $\sqrt{m_i}\log(m_i/\delta)$,

we get $\mathrm{Err}(p_i,m_i,h_i)=O(1/\sqrt{m_i})$.
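For concreteness, the following Python sketch (our own illustration, not the paper's exact implementation; the Gaussian kernel, lengthscale, and sample sizes are assumptions) approximates each kernel mean embedding from $h_i$ landmarks sub-sampled with replacement from $m_i$ samples, and evaluates the MMD from the resulting landmark coefficients:

```python
import numpy as np

def gauss_kernel(A, B, ls=1.0):
    """RBF kernel matrix k(a, b) = exp(-||a - b||^2 / (2 * ls^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def nystrom_kme(X, h, rng, ls=1.0, jitter=1e-10):
    """Nystrom approximation of the kernel mean embedding of the sample X:
    mu_hat = sum_j alpha_j * phi(Z_j), with h landmarks Z sub-sampled
    uniformly with replacement, as in Lemma 1."""
    m = X.shape[0]
    Z = X[rng.integers(0, m, size=h)]
    K_hh = gauss_kernel(Z, Z, ls)          # landmark Gram matrix
    K_hm = gauss_kernel(Z, X, ls)          # landmark-vs-sample kernel
    # alpha solves K_hh alpha = (1/m) K_hm 1, i.e. the projection of the
    # empirical mean embedding onto span{phi(Z_1), ..., phi(Z_h)}.
    alpha = np.linalg.solve(K_hh + jitter * np.eye(h), K_hm.mean(axis=1))
    return Z, alpha

def nystrom_mmd(X1, X2, h1, h2, rng, ls=1.0):
    """Nystrom estimate of MMD(p1, p2) from samples X1 ~ p1 and X2 ~ p2."""
    Z1, a1 = nystrom_kme(X1, h1, rng, ls)
    Z2, a2 = nystrom_kme(X2, h2, rng, ls)
    mmd2 = (a1 @ gauss_kernel(Z1, Z1, ls) @ a1
            + a2 @ gauss_kernel(Z2, Z2, ls) @ a2
            - 2.0 * a1 @ gauss_kernel(Z1, Z2, ls) @ a2)
    return np.sqrt(max(mmd2, 0.0))

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(2000, 1))
X2 = rng.normal(1.0, 1.0, size=(2000, 1))
# Sub-sample size h_i on the order of sqrt(m_i) * log(m_i), per the rate above.
print(nystrom_mmd(X1, X2, h1=150, h2=150, rng=rng))
```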

B Proofs of Section 4

B.1 Exact-Kernel Uncertain-Inputs GP Formulation

Following the same notation as Section 4, we can construct a Gaussian process $\mathcal{GP}(0,\hat k)$ over $\mathcal{P}$. This GP model can then be applied to learn $\hat f$ from a given set of observations $\mathcal{D}_n=\{(P_i,y_i)\}_{i=1}^n$. Under the zero-mean condition, the value of $\hat f(P)$ for a given $P\in\mathcal{P}$ follows a Gaussian posterior distribution with

(18) $\hat\mu_n(P)=\hat k_n(P)^T(\hat K_n+\sigma^2 I)^{-1}\mathbf{y}_n,$

(19) $\hat\sigma_n^2(P)=\hat k(P,P)-\hat k_n(P)^T(\hat K_n+\sigma^2 I)^{-1}\hat k_n(P),$

where $\mathbf{y}_n:=[y_1,\dots,y_n]^T$, $\hat k_n(P):=[\hat k(P,P_1),\dots,\hat k(P,P_n)]^T$, and $[\hat K_n]_{ij}=\hat k(P_i,P_j)$. Now we restrict our Gaussian process to the subspace $\mathcal{P}_X=\{P_x,x\in\mathcal{X}\}\subset\mathcal{P}$. We assume the observation $y_i=f(x_i)+\zeta_i$ with noise $\zeta_i$. The input-induced noise is defined as $\Delta f_{P_{x_i}}:=f(x_i)-\mathbb{E}_{P_{x_i}}[f]=f(x_i)-\hat f(P_{x_i})$. Then the total noise is $y_i-\mathbb{E}_{P_{x_i}}[f]=\zeta_i+\Delta f_{P_{x_i}}$. To formulate the regret bounds, we introduce the information gain and estimated information gain given any $\{P_t\}_{t=1}^n\subset\mathcal{P}$:

(20) $\hat I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n):=\frac{1}{2}\ln\det(I+\sigma^{-2}\hat K_n),$

(21) $\tilde I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n):=\frac{1}{2}\ln\det(I+\sigma^{-2}\tilde K_n),$

and the maximum information gain is defined as $\hat\gamma_n:=\sup_{R\subset\mathcal{P}_X;|R|=n}\hat I(\mathbf{y}_n;\hat f_n\mid R)$. Here $\hat f_n:=[\hat f(P_1),\dots,\hat f(P_n)]^T$.

We define the sub-Gaussian condition as follows:

Definition 1. For a given $\sigma_\xi>0$, a real-valued random variable $\xi$ is said to be $\sigma_\xi$-sub-Gaussian if:

$$\forall\lambda\in\mathbb{R},\quad\mathbb{E}[e^{\lambda\xi}]\le e^{\lambda^2\sigma_\xi^2/2}.$$
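For instance, by Hoeffding's lemma, a zero-mean random variable supported on $[-c,c]$ is $c$-sub-Gaussian; this is why the bounded-noise condition $|\zeta_i|<A$ assumed later in Theorem 3 is compatible with the $\sigma_\zeta$-sub-Gaussian noise assumption used here.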

Now we can state the lemma bounding the uncertain-inputs regret under exact kernel evaluations, originally stated as Theorem 5 in [26].

Lemma 2. Let $\delta\in(0,1)$, $f\in\mathcal{H}_k$, and the corresponding $\|\hat f\|_{\hat k}\le b$. Suppose the observation noise $\zeta_i=y_i-f(x_i)$ is conditionally $\sigma_\zeta$-sub-Gaussian. Assume that both $k$ and $P_x$ satisfy the conditions for $\Delta f_{P_x}$ to be $\sigma_E$-sub-Gaussian, for a given $\sigma_E>0$. Then, we have the following results:

(22) $|\hat\mu_n(P_x)-\hat f(P_x)|\le\left(b+\sqrt{\sigma_E^2+\sigma_\zeta^2}\sqrt{2\left(\hat I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n)+1+\ln(1/\delta)\right)}\right)\hat\sigma_n(P_x).$

Furthermore, pick

$$\hat\beta_n=b+\sqrt{\sigma_E^2+\sigma_\zeta^2}\sqrt{2\left(\hat I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n)+1+\ln(1/\delta)\right)},$$

and set $\sigma^2=1+2/n$; then the uncertain-inputs cumulative regret satisfies:

(23) $\hat R_n\le O\!\left(\sqrt{n\hat\gamma_n}\left(b+\sqrt{\hat\gamma_n+\ln(1/\delta)}\right)\right)$

with probability at least $1-\delta$.

Note that although the original theorem is restricted to the case $\hat k(p,q)=\langle\psi_P,\psi_Q\rangle_k$, the result easily generalizes to other kernels over $\mathcal{P}$, as long as the kernel is universal w.r.t. $C(\mathcal{P})$, given that $\mathcal{X}$ is compact and the mean map $\psi$ is injective [11, 22].

B.2 Error Estimates for Inexact Kernel Approximation

Now let us derive the inference under inexact kernel estimations.

Theorem 3. Let Assumption 1 hold for $\varepsilon>0$, let $\tilde\mu_n,\tilde\sigma_n,\tilde I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n)$ be as defined in Eqs. (13), (14), and (21) respectively, and $\hat\mu_n,\hat\sigma_n,\hat I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n)$ as defined in Eqs. (18), (19), and (20). Assume $\max_{x\in\mathcal{X}}f(x)=M$, and assume the observation error $\zeta_i=y_i-f(x_i)$ satisfies $|\zeta_i|<A$ for all $i$. Then the following error bounds hold with probability at least $1-n\varepsilon$:

(24) $|\hat\mu_n(P)-\tilde\mu_n(P)|<(n\sigma^{-2}+n^2\sigma^{-4})(M+A)e_\varepsilon+O(e_\varepsilon^2),$

(25) $|\hat\sigma_n^2(P)-\tilde\sigma_n^2(P)|<(1+n\sigma^{-2})^2e_\varepsilon+O(e_\varepsilon^2),$

(26) $|\tilde I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n)-\hat I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n)|<\frac{n^{3/2}}{2\sigma^2}e_\varepsilon+O(e_\varepsilon^2).$

Proof. Denote $e(P,Q)=\tilde k(P,Q)-\hat k(P,Q)$, $e_n(P)=[e(P,P_1),\dots,e(P,P_n)]^T$, and $[E_n]_{i,j}=e(P_i,P_j)$. Now, according to the matrix inverse perturbation expansion,

$$(X+\delta X)^{-1}=X^{-1}-X^{-1}\delta X X^{-1}+O(\|\delta X\|^2),$$

we have

$$(\hat K_n+\sigma^2 I+E_n)^{-1}=(\hat K_n+\sigma^2 I)^{-1}-(\hat K_n+\sigma^2 I)^{-1}E_n(\hat K_n+\sigma^2 I)^{-1}+O(\|E_n\|^2),$$

thus

$$\begin{aligned}
\tilde\mu_n(P)&=(\hat k_n(P)+e_n(P))^T(\hat K_n+\sigma^2 I+E_n)^{-1}\mathbf{y}_n\\
&=\hat\mu_n(P)+e_n(P)^T(\hat K_n+\sigma^2 I)^{-1}\mathbf{y}_n-\hat k_n(P)^T(\hat K_n+\sigma^2 I)^{-1}E_n(\hat K_n+\sigma^2 I)^{-1}\mathbf{y}_n\\
&\quad+O(\|E_n\|^2)+O(\|e_n(P)\|\|E_n\|),\\[4pt]
\tilde\sigma_n^2(P)&=\hat k(P,P)+e(P,P)-(\hat k_n(P)+e_n(P))^T(\hat K_n+\sigma^2 I+E_n)^{-1}(\hat k_n(P)+e_n(P))\\
&=\hat\sigma_n^2(P)+e(P,P)-2e_n(P)^T(\hat K_n+\sigma^2 I)^{-1}\hat k_n(P)\\
&\quad+\hat k_n(P)^T(\hat K_n+\sigma^2 I)^{-1}E_n(\hat K_n+\sigma^2 I)^{-1}\hat k_n(P)\\
&\quad+O(\|E_n\|^2)+O(\|e_n(P)\|\|E_n\|)+O(\|e_n(P)\|^2).
\end{aligned}$$

Notice that the following holds with probability at least $1-n\varepsilon$, according to Assumption 1:

$$|e_n(P)^T(\hat K_n+\sigma^2 I)^{-1}\mathbf{y}_n|\le\|e_n(P)\|_2\,\|(\hat K_n+\sigma^2 I)^{-1}\|_2\,\|\mathbf{y}_n\|_2\le n\sigma^{-2}(M+A)e_\varepsilon,$$

$$|\hat k_n(P)^T(\hat K_n+\sigma^2 I)^{-1}E_n(\hat K_n+\sigma^2 I)^{-1}\mathbf{y}_n|\le\|\hat k_n(P)\|_2\,\|(\hat K_n+\sigma^2 I)^{-1}\|_2^2\,\|E_n\|_2\,\|\mathbf{y}_n\|_2\le\sqrt{n}\,\sigma^{-4}\cdot n e_\varepsilon\cdot\sqrt{n}(M+A)=n^2\sigma^{-4}(M+A)e_\varepsilon,$$

where we use the facts that $\hat K_n$ is positive semi-definite (which means $\|(\hat K_n+\sigma^2 I)^{-1}\|_2\le\sigma^{-2}$), $\hat k(P,P)\le 1$, and $|y_i|\le M+A$. Combining these results, we have that

$$|\tilde\mu_n(P)-\hat\mu_n(P)|<(n\sigma^{-2}+n^2\sigma^{-4})(M+A)e_\varepsilon+O(e_\varepsilon^2)$$

holds with probability at least $1-n\varepsilon$.

Similarly, we can apply the same estimation to $e_n(P)^T(\hat K_n+\sigma^2 I)^{-1}\hat k_n(P)$ and $\hat k_n(P)^T(\hat K_n+\sigma^2 I)^{-1}E_n(\hat K_n+\sigma^2 I)^{-1}\hat k_n(P)$, obtaining that $|\tilde\sigma_n^2(P)-\hat\sigma_n^2(P)|<(1+n\sigma^{-2})^2e_\varepsilon+O(e_\varepsilon^2)$ holds with probability at least $1-n\varepsilon$.

It remains to estimate the error of the information gain. Notice that, with probability at least $1-n\varepsilon$,

$$\begin{aligned}
|\tilde I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n)-\hat I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n)|&=\left|\frac{1}{2}\log\frac{\det(I+\sigma^{-2}\tilde K_n)}{\det(I+\sigma^{-2}\hat K_n)}\right|\\
&=\left|\frac{1}{2}\log\det\!\left(I+(\sigma^2 I+\hat K_n)^{-1}E_n\right)\right|\\
&=\left|\frac{1}{2}\mathrm{Tr}\!\left(\log\!\left(I+(\sigma^2 I+\hat K_n)^{-1}E_n\right)\right)\right|\\
&=\left|\frac{1}{2}\mathrm{Tr}\!\left((\sigma^2 I+\hat K_n)^{-1}E_n\right)+O(\|E_n\|^2)\right|\\
&\le\frac{n^{3/2}}{2\sigma^2}e_\varepsilon+O(e_\varepsilon^2),
\end{aligned}$$

where the second equality uses the fact that $\det(AB^{-1})=\det(A)\det(B)^{-1}$, and the third and fourth equalities use $\log\det(I+A)=\mathrm{Tr}\log(I+A)=\mathrm{Tr}\left(A-\frac{A^2}{2}+\cdots\right)$. The last inequality follows from the fact that

$$\mathrm{Tr}\!\left((\sigma^2 I+\hat K_n)^{-1}E_n\right)\le\|(\sigma^2 I+\hat K_n)^{-1}\|_F\,\|E_n\|_F\le n^{3/2}\sigma^{-2}e_\varepsilon$$

and $\hat K_n$ is positive semi-definite, which completes the proof.
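Before moving on, the matrix inverse perturbation expansion underpinning this proof can be sanity-checked numerically; a minimal sketch with randomly generated matrices (purely illustrative):

```python
import numpy as np

# Quick check of (X + dX)^{-1} = X^{-1} - X^{-1} dX X^{-1} + O(||dX||^2).
rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
X = B @ B.T + 5.0 * np.eye(5)          # plays the role of K_hat_n + sigma^2 I
dX = 1e-4 * rng.normal(size=(5, 5))    # plays the role of the kernel error E_n

X_inv = np.linalg.inv(X)
exact = np.linalg.inv(X + dX)
first_order = X_inv - X_inv @ dX @ X_inv
print(np.linalg.norm(exact - first_order))  # residual is O(||dX||^2), far below ||dX||
```

With the uncertainty bounds above in hand, let us now show that, under inexact kernel estimations, the posterior mean is concentrated around the unknown reward function $\hat f$.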

Theorem 4. Under the same setting as Theorem 3, let $\sigma_\nu=\sqrt{\sigma_\zeta^2+\sigma_E^2}$ and take $\sigma^2=1+2/n$. Then, with probability at least $1-\delta-n\varepsilon$, the following holds for all $x\in\mathcal{X}$:

(27) $|\tilde\mu_n(P_x)-\hat f(P_x)|\le\beta_n\tilde\sigma_n(P_x)+\beta_n(1+n)e_\varepsilon^{1/2}+(n+n^2)(M+A)e_\varepsilon,$

where $\beta_n=b+\sigma_\nu\sqrt{2(\hat\gamma_n-\ln\delta+1)}$.

Proof. According to Lemma 2, equation (22), we have

$$|\hat\mu_n(P_x)-\hat f(P_x)|\le\hat\beta_n\hat\sigma_n(P_x)$$

with

$$\hat\beta_n=b+\sigma_\nu\sqrt{2\left(\hat I(\mathbf{y}_n;\hat f_n\mid\{P_t\}_{t=1}^n)+1+\ln(1/\delta)\right)}\le\beta_n.$$

Notice that

(28) $|\tilde\mu_n(P_x)-\hat f(P_x)|\le|\tilde\mu_n(P_x)-\hat\mu_n(P_x)|+|\hat\mu_n(P_x)-\hat f(P_x)|.$

We also have Eq. (25), which means

(29) $\hat\sigma_n(P_x)=\sqrt{\hat\sigma_n(P_x)^2}\le\sqrt{\tilde\sigma_n(P_x)^2+(1+n)^2e_\varepsilon}\le\tilde\sigma_n(P_x)+(1+n)e_\varepsilon^{1/2},$

and combining (24), (28), and (29), we finally get the result in (27).

B.3 Proofs for Theorem 1

Now we can prove our main Theorem 1.

Proof of Theorem 1. Let $x^\star$ maximize $\hat f(P_x)$ over $\mathcal{X}$. Observe that at each round $n\ge 1$, by the choice of $x_n$ to maximize the acquisition function $\tilde\alpha(x\mid\mathcal{D}_{n-1})=\tilde\mu_{n-1}(P_x)+\beta_{n-1}\tilde\sigma_{n-1}(P_x)$, we have

$$\begin{aligned}
\tilde r_n&=\hat f(P_{x^\star})-\hat f(P_{x_n})\\
&\le\tilde\mu_{n-1}(P_{x^\star})+\beta_{n-1}\tilde\sigma_{n-1}(P_{x^\star})-\tilde\mu_{n-1}(P_{x_n})+\beta_{n-1}\tilde\sigma_{n-1}(P_{x_n})+2\,\mathrm{Err}(n-1,e_\varepsilon)\\
&\le 2\beta_{n-1}\tilde\sigma_{n-1}(P_{x_n})+2\,\mathrm{Err}(n-1,e_\varepsilon).
\end{aligned}$$

Here we denote $\mathrm{Err}(n,e_\varepsilon):=\left(\beta_n(1+n)+\tilde\sigma_n(P_x)\sigma_\nu n^{3/4}\right)e_\varepsilon^{1/2}+(n+n^2)(M+A)e_\varepsilon$. The second inequality follows from Eq. (27):

$$\hat f(P_{x^\star})-\tilde\mu_{n-1}(P_{x^\star})\le\beta_{n-1}\tilde\sigma_{n-1}(P_{x^\star})+\mathrm{Err}(n-1,e_\varepsilon),$$

$$\tilde\mu_{n-1}(P_{x_n})-\hat f(P_{x_n})\le\beta_{n-1}\tilde\sigma_{n-1}(P_{x_n})+\mathrm{Err}(n-1,e_\varepsilon),$$

and the third inequality follows from the choice of $x_n$:

$$\tilde\mu_{n-1}(P_{x^\star})+\beta_{n-1}\tilde\sigma_{n-1}(P_{x^\star})\le\tilde\mu_{n-1}(P_{x_n})+\beta_{n-1}\tilde\sigma_{n-1}(P_{x_n}).$$

Thus we have

$$\tilde R_n=\sum_{t=1}^n\tilde r_t\le 2\beta_n\sum_{t=1}^n\tilde\sigma_{t-1}(P_{x_t})+2\sum_{t=1}^n\mathrm{Err}(t-1,e_\varepsilon).$$

From Lemma 4 in [9], we have that

$$\sum_{t=1}^n\tilde\sigma_{t-1}(P_{x_t})\le\sqrt{4(n+2)\ln\det(I+\sigma^{-2}\tilde K_n)}\le\sqrt{4(n+2)\left(\hat\gamma_n+\frac{n^{3/2}}{2}e_\varepsilon\right)},$$

and thus

(30) $2\beta_n\sum_{t=1}^n\tilde\sigma_{t-1}(P_{x_t})=O\!\left(\sqrt{n\hat\gamma_n}+\sqrt{n\hat\gamma_n(\hat\gamma_n-\ln\delta)}+\sqrt{n^{5/2}e_\varepsilon}+\sqrt{n^{5/2}(\hat\gamma_n-\ln\delta)e_\varepsilon}\right).$

On the other hand, notice that

(31) $\sum_{t=1}^n\mathrm{Err}(t-1,e_\varepsilon)=O\!\left(n^2\sqrt{(\hat\gamma_n-\ln\delta)e_\varepsilon}+(n^2+n^3)e_\varepsilon\right),$

and we find that the $e_\varepsilon$ terms in (30) can be controlled by those in (31); thus we immediately get the result.

B.4 Proofs for Theorem 2

Proof. Define the square of the MMD distance between $P_{x_1}$ and $P_{x_2}$ as $d_M(x_1,x_2)$; we have

$$\begin{aligned}
d_M(x_1,x_2)&=\int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d}k(x,x')P_{x_1}(x)P_{x_1}(x')\,dx\,dx'+\int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d}k(x,x')P_{x_2}(x)P_{x_2}(x')\,dx\,dx'\\
&\quad-2\int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d}k(x,x')P_{x_1}(x)P_{x_2}(x')\,dx\,dx'\\
&=\int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d}\bigl(k(x-x_1,x'-x_1)+k(x-x_2,x'-x_2)-2k(x-x_1,x'-x_2)\bigr)P_0(x)P_0(x')\,dx\,dx'.
\end{aligned}$$

It is not hard to verify that $d_M$ is shift-invariant, $d_M(x_1,x_2)=d_M(x_1-x_2,0)$, and that $d_M$ has $r$-th bounded derivatives; thus $\hat k(x_1,x_2):=\hat k(P_{x_1},P_{x_2})=\exp(-\alpha d_M(x_1,x_2))$ is shift-invariant with $r$-th bounded derivatives. Then, taking $\mu(x)$ as the Lebesgue measure over $\mathcal{X}$, according to Theorem 4 in [17], the integral operator $T_{k,\mu}: T_{k,\mu}f(x)=\int_{\mathcal{X}}K(x,y)f(y)\,d\mu(y)$ is a symmetric compact operator in $L^2(\mathcal{X},\mu)$, and the spectrum of $T_{k,\mu}$ satisfies

$$\lambda_n(T_{k,\mu})=O(n^{-1-r/d}).$$

Then, according to Theorem 5 in [30], we have $\hat\gamma_n=O\!\left(n^{\frac{d(d+1)}{r+d(d+1)}}\log(n)\right)$, which finishes the proof.

C Evaluation Details

C.1 Implementation

In our implementation of AIRBO, we design the kernel $k$ used for MMD estimation to be a linear combination of multiple Rational Quadratic kernels, as its long-tail behavior circumvents the fast-decay issue of the kernel [6]:

(32) $k(x,x')=\sum_{a_i\in\{0.2,0.5,1,2,5\}}\left(1+\frac{(x-x')^2}{2a_il_i^2}\right)^{-a_i},$

where $l_i$ is a learnable lengthscale and $a_i$ determines the relative weighting of large-scale and small-scale variations.
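A minimal NumPy sketch of Eq. (32) is given below (the 1D inputs and fixed example lengthscales are illustrative assumptions; in practice the lengthscales are learned jointly with the GP):

```python
import numpy as np

ALPHAS = np.array([0.2, 0.5, 1.0, 2.0, 5.0])   # the fixed weights a_i in Eq. (32)

def rq_mixture_kernel(x, x2, lengthscales):
    """Linear combination of Rational Quadratic kernels, one term per a_i.

    x, x2: 1D arrays of shape (n,) and (m,); lengthscales: shape (5,), learnable.
    Returns the (n, m) kernel matrix of Eq. (32)."""
    d2 = (x[:, None] - x2[None, :]) ** 2
    K = np.zeros_like(d2)
    for a, ls in zip(ALPHAS, lengthscales):
        K += (1.0 + d2 / (2.0 * a * ls**2)) ** (-a)
    return K

# Example evaluation on a small 1D grid.
xs = np.linspace(0.0, 1.0, 5)
print(rq_mixture_kernel(xs, xs, lengthscales=np.full(5, 0.3)))
```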

Depending on the form of the input distribution, the sampling and sub-sampling sizes for the Nyström MMD estimator are selected empirically via experiments. Moreover, as the input uncertainty is already modeled in the surrogate, we employ a classic UCB-based acquisition as in Eq. 5 with $\beta=2.0$ and maximize it via an L-BFGS-B optimizer.
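The acquisition step can then be sketched as follows (the `posterior(x)` helper returning the surrogate's mean and standard deviation at $P_x$ is hypothetical, and the $\sqrt{\beta}$ scaling follows the common GP-UCB convention, since Eq. 5 is not reproduced here):

```python
import numpy as np
from scipy.optimize import minimize

BETA = 2.0  # UCB trade-off parameter used in our experiments

def neg_ucb(x, posterior):
    """Negative UCB acquisition; negated because scipy minimizes."""
    mu, sigma = posterior(np.atleast_1d(x))
    return -float(mu + np.sqrt(BETA) * sigma)

def maximize_acquisition(posterior, bounds, n_restarts=10, seed=0):
    """Multi-start L-BFGS-B maximization of the UCB acquisition."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    best = None
    for _ in range(n_restarts):
        x0 = rng.uniform(lo, hi)                     # random restart point
        res = minimize(neg_ucb, x0, args=(posterior,),
                       method="L-BFGS-B", bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best.x
```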

Figure 7: Modeling performance with a step-changing Chi-squared distribution.

D More Experiments

D.1 Comparing with Other Models

To compare the modeling performance with other models, we design the input uncertainty to follow a step-changing Chi-squared distribution: $P_x=\chi^2(g(x),\sigma=0.01)$, where $g(x)=0.5$ if $x\in[0.0,0.6)$ and $g(x)=7.0$ when $x\in[0.6,1.0]$. Due to this sudden parameter change, the uncertainty at point $x=0.6$ is expected to be asymmetric: 1) on its left-hand side, as the Chi-squared distribution becomes quite lean and sharp with a small value of $g(x)=0.5$, the distances from $x=0.6$ to its LHS points $x_{lhs}\in[0.0,0.6)$ are relatively large; thus their covariances are small, resulting in fast-growing uncertainty. 2) Meanwhile, when $x\in[0.6,1.0]$, $g(x)$ suddenly increases to 7.0, rendering the input distribution quite flat with a long tail. Therefore, the distances between $x=0.6$ and its RHS points become relatively small, which leads to large covariances and small uncertainties for points in $[0.6,1.0]$. As a result, we expect to observe an asymmetric posterior uncertainty at $x=0.6$.
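A small sampler for this input distribution (a sketch under our reading that $\sigma=0.01$ scales the $\chi^2$ draws around the design location $x$):

```python
import numpy as np

def sample_inputs(x, n, rng):
    """Draw n executed inputs around the design location x under the
    step-changing chi-squared uncertainty P_x = chi2(g(x), sigma=0.01)."""
    df = 0.5 if x < 0.6 else 7.0        # g(x) jumps at x = 0.6
    return x + 0.01 * rng.chisquare(df, size=n)

rng = np.random.default_rng(0)
left = sample_inputs(0.55, 5, rng)   # lean, sharp distribution (df = 0.5)
right = sample_inputs(0.65, 5, rng)  # flat, long-tailed distribution (df = 7.0)
```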

Several surrogate models are employed in this comparison, including:

Figure 8: Ablation test for the Nyström approximation.

D.2 Ablation Test for Nyström Approximation

In this experiment, we aim to examine the effect of the Nyström approximation on optimization. To this end, we choose to optimize an RKHS function (Figure 4a) under a beta input distribution: $P_x=\mathrm{Beta}(\alpha=0.4,\beta=0.2,\sigma=0.1)$. Several candidates with different computational budgets are compared:

According to Figure 8, we find that: 1) with sufficient computation power, MMDGP-raw-L obtains the best performance by using a large sample size; 2) however, with limited complexity, the performance of MMDGP-raw-S degrades obviously while MMDGP-nystrom performs much better, which suggests that the Nyström approximation can significantly improve efficiency at a mild cost of performance degradation; 3) all the MMDGP-based methods are better than the vanilla GP-UCB.

D.3 Optimization on 10D Bumped-Bowl Problem

To further evaluate AIRBO's optimization performance on high-dimensional problems, we employ a 10-dimensional bumped-bowl function from [27, 19]:

(33) $f(x)=g(x_{1:2})\,h(x_{3:}),\quad\text{where}\quad\begin{cases}g(x)=2\log\!\left(0.8\|x\|^2+e^{-10\|x\|^2}\right)+2.54,\\ h(x)=\sum_{i}5x_i^2+1,\end{cases}$


2 For a fair comparison,all the methods in this test use a UCB acquisition.


Figure 9: Optimization regret on 10D bumped-bowl problem.

Figure 10: The input GMM distribution.

Figure 11: Simulation results of the push configurations found by different algorithms.

Here, $x_i$ is the $i$-th dimension of $x$, $x_{1:2}$ represents the first two dimensions of the variable, and $x_{3:}$ indicates the remaining dimensions. The input uncertainty is designed to follow a concatenated distribution of a 2D circular distribution ($r=0.5$) and a multivariate normal distribution with zero mean and a diagonal covariance of 0.01.
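For reference, a sketch of Eq. (33) and this concatenated input sampler (the uniform-on-circle construction of the circular noise and the reading of the norm in $g$ are our assumptions based on the description above):

```python
import numpy as np

def bumped_bowl(x):
    """10D bumped bowl, Eq. (33): f(x) = g(x_{1:2}) * h(x_{3:})."""
    x = np.asarray(x, dtype=float)
    r2 = np.sum(x[:2] ** 2)                  # squared norm of the first two dims
    g = 2.0 * np.log(0.8 * r2 + np.exp(-10.0 * r2)) + 2.54
    h = np.sum(5.0 * x[2:] ** 2) + 1.0
    return g * h

def sample_input_noise(rng, n):
    """Concatenated input uncertainty: 2D circular noise (r = 0.5) on the
    first two dims, N(0, 0.01 I) on the remaining eight."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    circ = 0.5 * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    gauss = rng.normal(0.0, np.sqrt(0.01), size=(n, 8))
    return np.concatenate([circ, gauss], axis=1)

rng = np.random.default_rng(0)
x = np.zeros(10)
perturbed = x + sample_input_noise(rng, 4)   # executed inputs x' ~ P_x
vals = [bumped_bowl(xi) for xi in perturbed]
```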

Figure 9 shows the mean and std values of the optimization regrets. We note that: 1) when it comes to a high-dimensional problem and a complex input distribution, the misassumption of Gaussian input uncertainty renders SKL and ERBF unable to locate the robust optimum; they get stuck at local optima. 2) Our method outperforms the others and can find the robust optimum efficiently and stably, while uGP, with a similar inference cost, suffers from the instability caused by insufficient sampling and stumbles over iterations, as evidenced by its high std values of optimization regret.

D.4 Robust Robot Pushing

This benchmark is based on a Box2D simulator from [31], where our objective is to identify a robust push configuration that enables a robot to push a ball to predetermined targets under input randomness. In our experiment, we simplify the task by setting the push angle to $r_a=\arctan(r_u/r_v)$, ensuring the robot always faces the ball. Also, we intentionally define the input distribution as a two-component Gaussian Mixture Model as follows:

(34) $(r_x,r_y,r_t)\sim\mathrm{GMM}\!\left(\mu=\begin{bmatrix}0&0&0\\1&1&0\end{bmatrix},\ \Sigma=\begin{bmatrix}0.1^2&0.3^2&10^{-6}\\0.3^2&0.1^2&10^{-6}\\10^{-6}&10^{-6}&1.0^2\end{bmatrix},\ w=\begin{bmatrix}0.5&0.5\end{bmatrix}\right),$

where the covariance matrix $\Sigma$ is shared among the components and $w$ contains the weights of the mixture components. Meanwhile, as the SKL-UCB and ERBF-UCB surrogates can only accept Gaussian input distributions, we choose to approximate the true input distribution with a Gaussian. As shown in Figure 10, the approximation error is obvious, which explains the performance gap among these algorithms in Figure 5b.
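A sketch of sampling from this input GMM, together with the single-Gaussian moment-matched approximation fed to the Gaussian-only baselines (the covariance matrix is passed in explicitly, following Eq. (34)):

```python
import numpy as np

MU = np.array([[0.0, 0.0, 0.0],
               [1.0, 1.0, 0.0]])     # component means, Eq. (34)
W = np.array([0.5, 0.5])             # mixture weights

def sample_gmm(rng, n, cov):
    """Draw n samples of (r_x, r_y, r_t); the covariance is shared by both components."""
    comps = rng.choice(len(W), size=n, p=W)
    return MU[comps] + rng.multivariate_normal(np.zeros(3), cov, size=n)

def moment_matched_gaussian(samples):
    """Single-Gaussian approximation used by SKL-UCB and ERBF-UCB."""
    return samples.mean(axis=0), np.cov(samples.T)
```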

Apart from the statistics of the found pre-images in Figure 6, we also simulate the robot pushes according to the found configurations and visualize the results in Figure 11. In this figure, each black hollow square represents an instance of the robot's initial location, the grey arrow indicates the push direction and duration, and the blue circle marks the ball's ending position after the push. We find that, as GP-UCB ignores the input uncertainty, it pushes to these targets at random and the ball's ending positions fluctuate. Also, due to the incorrect assumption on the input distribution, SKL-UCB and ERBF-UCB fail to control the ball's ending position under input randomness. On the contrary, AIRBO successfully recognizes the twin targets in quadrant I as an optimal choice and frequently pushes to this area. Moreover, all the ball's ending positions are well controlled and centered around the targets under input randomness.