
Kernel Learning for Sample Constrained Black-Box Optimization


Rajalaxmi Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury

Department of Electrical & Computer Engineering

University of Illinois Urbana-Champaign

{rr30,yulinlw2,croy}@illinois.edu

Abstract


Black box optimization (BBO) focuses on optimizing unknown functions in high-dimensional spaces. In many applications, sampling the unknown function is expensive, imposing a tight sample budget. Ongoing work is making progress on reducing the sample budget by learning the shape/structure of the function, known as kernel learning. We propose a new method to learn the kernel of a Gaussian Process. Our idea is to create a continuous kernel space in the latent space of a variational autoencoder, and run an auxiliary optimization to identify the best kernel. Results show that the proposed method, Kernel Optimized Blackbox Optimization (KOBO), outperforms state of the art by estimating the optimal at considerably lower sample budgets. Results hold not only across synthetic benchmark functions but also in real applications. We show that a hearing aid may be personalized with fewer audio queries to the user, or a generative model could converge to desirable images from limited user ratings.

1 Introduction


Many problems involve the optimization of an unknown objective function. Examples include personalizing content x to maximize a user's satisfaction f(x), or training deep learning models with hyperparameters x to maximize their performance f(x). Function f(x) is unknown in these cases because it is embedded inside the human brain (for the case of personalization) or too complex to derive (for hyper-parameter tuning). However, for any chosen sample x_i, the value of f(x_i) can be evaluated. For hearing-aid personalization, say, evaluating the function would entail playing audio with some hearing-compensation filter x_i and obtaining the audio clarity score f(x_i) from the user.

Bayesian methods like Gaussian Process Regression (GPR) are de facto approaches to black-box optimization (BBO). Using a set of function evaluations, conventional GPR (Frazier 2018) learns a probabilistic surrogate model f^(x) for f(x). The optimum is estimated on this surrogate as x^* = argmin f^(x). In most BBO problems, f(x) is expensive to evaluate, hence a strict sample or query budget B is of interest. Techniques that lower this budget have garnered recent attention. One idea is to exploit domain knowledge about the rough shape of f(x), i.e., select a GPR kernel that models this shape. With humans, for example, f(x) may have a staircase structure as they may not perceive differences in certain neighborhoods of x, but their ratings may change just outside that neighborhood. If GPR's surrogate model f^(x) captures this staircase structure in its kernel, sample efficiency can improve. However, in the absence of domain knowledge, can the optimal GPR kernel K* be learnt, using the same sample queries used to estimate x*?

Figure 1: KerGPR in the VAE latent space gives K* to fGPR.

A growing body of research (Grosse et al. 2012)(Kim and Teh 2016)(Teng et al. 2020) is concentrating on kernel learning. One effective approach is the Automatic Statistician (AS) (Duvenaud et al. 2013), where the authors compose complex kernels by combining simple ones through a context-free grammar and design a search method over the countably infinite set of complex kernels (more in Section 5). Subsequent improvements over AS have used the Hellinger distance as a measure of kernel similarity (Malkomes, Schaff, and Garnett 2016). This similarity metric guides an optimization-based search over the space of composite kernels. To reduce search complexity, (Gardner et al. 2017) exploits additive structures in the search space and employs MCMC methods to discover kernels. All these techniques operate on discrete spaces, delivering a categorical composition of simple kernels.

A continuous space of kernels naturally lends itself to optimization methods. Our contribution is in designing such a continuous kernel space and running an auxiliary optimization on it to discover K*. To this end, we first synthesize many discrete kernels by adding or multiplying a set of simple "basis" kernels, and then use a variational autoencoder (VAE) to learn a low-dimensional continuous manifold for the discrete kernels. This manifold lives in the latent space of the VAE, as shown by the orange region in Figure 1. Conventional optimization on this latent kernel space is difficult since we lack an objective function, but given a kernel, we can evaluate its effectiveness using model evidence (i.e., how well a given kernel agrees with available samples from f(x)). Thus, optimizing over the kernel space can also be designed as a black-box optimization problem, hence we run a kernel space GPR (KerGPR) to output an optimal K*. The main GPR module, Function GPR, uses K* to model the surrogate function f^(x), and queries the surrogate at more points. The new model evidence is then passed back to KerGPR (see Fig. 1) to update the optimal kernel. The iteration terminates when the query budget is expended. Function GPR then outputs the minimum of the surrogate f^(x). Results show that KOBO consistently outperforms SOTA methods (namely MCMC (Gardner et al. 2017), BOMS (Malkomes, Schaff, and Garnett 2016), and CKS (Duvenaud et al. 2013)) in terms of the number of queries needed to reach the optimal. The gain from kernel learning is also significant compared to the best static kernels. Experiments are reported across synthetic benchmark functions and from real-world audio experiments with U = 6 users. Volunteers were asked to rate the clarity of audio clips, and using B ≤ 25 ratings, KOBO prescribed a personalized filter that maximized that user's personal satisfaction. The performance gain is robust across extensive experiments, suggesting that KOBO could be deployed in real-world black-box applications where sample budget is of prime concern.




Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.





2 Problem Formulation and Background

Problem Formulation


Consider an unknown real-valued function f : H → R where H ⊆ R^N, N ≥ 500. Let x* be the minimizer of f(x). We aim to estimate x* using a budget of B queries. Thus, the optimization problem is,

(1)  $\underset{\hat{x} \in H}{\arg\min}\; \left\| f(\hat{x}) - f(x^*) \right\|^{2} \quad \text{s.t.} \quad Q \le B$

where Q is the number of times the objective function is evaluated/queried, and the sample budget is B ∈ ℕ. Function f may be non-convex, may not have a closed-form expression, and its gradient is unavailable (hence, a black-box optimization problem). Bayesian approaches like GPR suit such problems but require choosing a kernel to model the function structure; a poor choice incurs more queries for optimization. Since queries can be expensive (e.g., when users need to answer many queries, or a neural network needs re-training for each hyper-parameter configuration), lowering Q is of growing interest. Kernel learning aims to address this problem.

Background: Gaussian Process Regression (GPR)


Bayesian optimization (BO) (Frazier 2018) broadly consists of two modules: (1) Gaussian Process Regression, which learns a Gaussian posterior distribution of the likely values of f(x) at any point of interest x; (2) Acquisition function, a sampling strategy that prescribes the point at which f should be evaluated (or sampled) next. We briefly discuss GPR to motivate the kernel learning problem.




Figure 2: Row 1 shows simple (or base) kernels (Duvenaud et al. 2013). Row 2 shows corresponding surrogates drawn from a GP with the above kernel.


GPR Prior & Posterior: GPR generates a probabilistic surrogate model by defining a Gaussian distribution N(μ, K) over infinite candidate functions. At initialization, i.e., when no function samples are available, the prior distribution over the candidate functions is defined by μ = 0 and a covariance matrix K. This matrix is computed using a kernel function k as K_ij = k(x_i, x_j). The kernel essentially favors candidate functions that are similar to the kernel's own shape/structure; these candidates are assigned a higher likelihood. An expert with domain knowledge about the structure of f(x) can choose the kernel judiciously, resulting in better surrogate functions f^(x). A better f^(x) will ultimately reduce the number of queries needed to optimize the objective f(x).

Once the function f has been observed for a set of samples X = {x_1, x_2, ..., x_K}, i.e., we know F = {f(x_1), f(x_2), ..., f(x_K)}, the prior is updated to form the posterior distribution (Eqn. 2) over the candidate functions. The posterior mean μ is the most likely surrogate of the function f.

(2)  $P(F \mid X) \sim \mathcal{N}(F \mid \mu, K)$

where μ = {μ(x_1), μ(x_2), ..., μ(x_K)}, K_ij = k(x_i, x_j), and k represents a kernel function.

Predictions: To make predictions F^ = f(X^) at new points X^, GPR uses the current posterior (Eqn. 2) to define the conditional distribution of F^ as $P(\hat{F} \mid F, X, \hat{X}) \sim \mathcal{N}\!\left(\hat{K}^{T} K^{-1} F,\; \hat{\hat{K}} - \hat{K}^{T} K^{-1} \hat{K}\right)$, where $\hat{K}$ is the cross-covariance between X and X^, and $\hat{\hat{K}}$ is the covariance among the new points X^. The details and proofs of all the above are clearly explained in (Wang 2020).

Kernel Selection


Kernels model the possible shape of the unknown function based on a set of observations (X, F). Figure 2 illustrates example kernels on the top row; the bottom row shows candidate functions that GPR can derive using the corresponding kernel. In general, a class of kernels 𝒦 produces a family of (GPR) surrogates that fit the function observations (X, F). Of course, each surrogate is associated with a likelihood, which can improve with additional observations.

The goal of kernel selection is to select one kernel K ∈ 𝒦 that best explains the function observations (X, F). Let L : 𝒦 → R denote a model evidence that measures how well a kernel K fits the observations. We assume that evaluating L(K) for all kernels in 𝒦 is too expensive. The kernel selection problem is then,


(3)  $K^* = \underset{K \in \mathcal{K}}{\arg\max}\; L(K)$

This problem is difficult to optimize with Bayesian approaches when the kernel space 𝒦 is discrete (Parker and Rardin 2014). This is because the function L(K) is only defined at the feasible points and cannot be queried arbitrarily. In contrast, it is possible to deduce a continuous function's behavior in a neighborhood of a point; in the discrete case, the behavior of the objective may change significantly as we move from one feasible point to another. This motivates transforming the problem in Eqn. 3 into a problem in continuous space on which optimization can be applied.

In this paper, the "model evidence" L is chosen to be the GPR posterior in Eqn. 2, as it generates the surrogate that best describes the observations informed by the chosen kernel.

(4)  $L(K) = P(F \mid X, K)$

3 Kernel Learning in KOBO


Intuition and Overview: Our prime objective is to create a continuous space Z corresponding to the discrete space 𝒦, thus simplifying the optimization in Eqn. 3. We propose to achieve this using a Variational Autoencoder (VAE), which can take discrete inputs (K ∈ 𝒦) and learn (encoder) a continuous latent space Z from which the inputs can be faithfully reconstructed (decoder). If a large number of discrete kernels are created and represented sufficiently using a scheme that offers some notion of mutual similarity between kernels (i.e., representations defining a kernel space 𝒦), then we expect the VAE to give us a continuous representation Z of such kernels. This approach satisfies our objective since VAEs are expected to ensure continuity and completeness of their latent space, i.e., (i) two close points in the latent space cannot decode to completely different results, and (ii) a point sampled from the latent space must decode to a valid result. When the VAE is trained, we have successfully transformed the discrete optimization in 𝒦 (Eqn. 4) into an easier continuous one in Z.

Building on this intuition, KOBO's kernel learner is composed of 3 modules as shown in Figure 3:

(1) Kernel Combiner creates composite kernels K that form the discrete kernel space 𝒦.

(2) Kernel Space Variational Autoencoder (KerVAE), trained on kernels generated by the Kernel Combiner, transforms the discrete kernel space 𝒦 to the continuous space Z.

(3) Kernel Space GPR (KerGPR): optimizes the model evidence (Eqn. 4) on Z; this gives z*, which decodes to the optimal kernel K*.

Figure 3 connects all the modules to give a complete overview of KOBO. The main objective function (Eqn. 1) is optimized with a Function GPR (fGPR). fGPR uses a simple Square-Exponential (SE) kernel to obtain a batch of observations D_n. The model evidence is then passed to the kernel learning pipeline. The Kernel Combiner takes simple kernels and observations D_n as inputs, and outputs a discrete space of composite kernels k_C ∈ 𝒦. This is guided by a context-free grammar described later. The KerVAE is trained on this discrete space and generates the corresponding continuous latent space Z. KerGPR, running on Z, optimizes the model evidence (from fGPR) to find the optimal kernel K*, which is prescribed to fGPR. fGPR uses this K* to obtain a new batch of observations D_{n+1} and update the model evidence, which is again passed to the kernel learning pipeline. The cycle iterates until fGPR has expended the sample budget B. At this point, fGPR outputs the minimum of its surrogate model. The following discussions expand on the Kernel Combiner, KerVAE, and KerGPR.




Figure 3: System flow: KOBO iterates across fGPR on top and KerGPR below, which runs in the KerVAE latent space. The blue arrow denotes the model evidence input to KerGPR, and the red arrow denotes the optimal kernel K* supplied by KerGPR to fGPR.


Kernel Combiner


Complex kernels k_C can be expressed as operations in a context-free grammar over base kernels B (Hopcroft, Motwani, and Ullman 2001; Duvenaud et al. 2013). Given B = {A, B, C, D, E} and a set of operators O = {add, multiply, end, ...}, the Kernel Combiner generates a composite kernel k_C by drawing kernels from B and operators from O with probabilities p_B, p_O. An example is k_C = AC + BD. In general, to form a kernel space 𝒦, the Kernel Combiner develops a unique representation for each k_C.

Grammar-based Representation: Given a composite kernel k_C, its grammar-based representation is a vector r_c, designed as follows. Let A, B, C, D, E be the five base kernels in B. These are simple kernels like the Square-Exponential, Periodic, Rational Quadratic, etc. Any composite kernel k_C created from the base kernels is expressed in the form of Eqn. 5. The code r_c is then the vector of indices, i.e., r_c = [a_1, b_1, c_1, d_1, e_1, a_2, b_2, c_2, d_2, e_2, a_3, b_3, c_3, d_3, e_3].

(5)  $k_C = A^{a_1} B^{b_1} C^{c_1} D^{d_1} E^{e_1} + A^{a_2} B^{b_2} C^{c_2} D^{d_2} E^{e_2} + A^{a_3} B^{b_3} C^{c_3} D^{d_3} E^{e_3} + \cdots$

If a composite kernel is, say, k_C = A^2 E + CD, then its grammar-based representation would be r_c = [2, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]. Note: elements of the code vectors can also be fractions.


This encoding scheme has two advantages: (1) Each code r_c preserves its composition, i.e., given the code vector r_c, the base kernels and the operators used to construct k_C can be interpreted. (2) The code space is continuous; hence, a code r_c' = [2, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0], which differs from r_c only in a flip of the second element, results in k_C' = A^2 B E + CD. In general, a small change in the code produces a small modification to the kernel composition (which makes the VAE's task, 𝒦 → Z, easier). Past work (Garrido-Merchán and Hernández-Lobato 2020)(Lu et al. 2018) has used one-hot encoding to represent kernel matrices; however, such one-hot schemes suffer from a lack of continuity (as shown in Figure 5(a) and (b) in the Appendix).

However, the context-free grammar code r_c does not encode any information from the objective function that the complex kernels are meant to model. We add this to the representation next.

Data-based Representation: Given the available function observations (X, F), for each k_C we compute the "distances" between its GPR covariance matrix M_C and the covariance matrix M_b of each base kernel b ∈ B (the covariance matrix is computed as K_ij = k(x_i, x_j), where k(·) is the kernel function and x_i, x_j are any two observations). We use the Frobenius norm to compute the matrix distances. This representation is denoted r_d ∈ R^{|B|}.

(6)  $r_d = \big[\, \| M_C - M_b \|_F \,\big]_{b \in B}$

The function information is now encoded in r_d, since the covariance matrix is computed using both the kernel and the function observations/samples. Kernels that model the function's data similarly need to lie in the same neighborhood so that the model evidence is sufficiently smooth. Figure 5(c) and (d) (in the Appendix) show the advantage of encoding the objective function's data in the kernel code.

Thus, the kernel space 𝒦 consisting of complex kernels is a subset of the space S of all positive semi-definite (PSD) matrices, 𝒦 ⊂ S. By restricting our search to 𝒦 (not S), i.e., to the kernels generated through context-free grammar compositions, we can scale the GPR covariance matrix as new function observations arrive (simple kernel functions have closed-form expressions that are expanded to complex kernels for any number of observations). Therefore, our kernel learning problem in Eqn. 3 becomes a kernel selection problem. The final representation of a composite kernel k_C is r = [r_c, r_d]. This is used to train the KerVAE to generate the continuous kernel space Z.

Kernel Space Variational Autoencoder (KerVAE) and GPR (KerGPR)


The KerVAE learns a continuous latent space Z for the discrete kernel space 𝒦 and has two main components: (1) a probabilistic encoder that models q_φ(z|x) ∝ p_θ(x|z) p(z), parameterized by φ, where p(z) is the prior over the latent space, and (2) a decoder that models the likelihood p_θ(x|z), parameterized by θ. The parameters of q_φ(z|x) and p_θ(x|z) are optimized by joint maximization of the ELBO (Kingma and Welling 2013),

(7)  $\mathcal{L}(\phi, \theta, x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right]$


Q     f1(x) = AAB + C     f2(x) = C + D     f3(x) = DB + D
5     AB                  A                 A
10    AB + C              A + D             ABD + D
15    AAB + D             AC + D            AB + D
20    AB + CD             AC + D            BD + D
25    AAB + CD            AC + D            BD + D

Table 1: KOBO learning the ground truth kernel. {A, B, C, D, E} = {SE, PER, RQ, MAT, LIN}


The KerVAE is re-trained after accumulating v function observations, since the data representation r_d is re-computed for every new set of observations D_v = (X_v, F_v). Once KerVAE is trained, KerGPR is used to determine the optimal latent point z* ∈ Z. The KerVAE decoder decodes z* to r_c, and r_c is easily mapped to K ∈ 𝒦 due to the grammar's interpretability.

KerGPR: Optimizing over KerVAE's latent kernel space Z is also a black-box problem (similar to Eqn. 1) because the objective can only be evaluated for a given kernel k (i.e., by first decoding a given z to k, and computing k's model evidence L(K) in Eqn. 4). We use GPR to find z* in the latent space and decode it to the optimal kernel K*. Thus, our optimization objective is:

(8)  $K^* = \mathrm{Dec}\!\left( \underset{z \in Z}{\arg\max}\; P(F \mid X, \mathrm{Dec}(z)) \right)$

where Dec(·) is the KerVAE decoder that maps a point from Z to the 𝒦 space. We use the simple SE kernel in KerGPR as it does not benefit from recursive kernel learning (explained in the Technical Appendix). The optimal kernel K* is then used by Function GPR (fGPR), i.e., the GPR posterior in Eqn. 2, to generate surrogates that closely model the unknown function structure.

4 Evaluation and Results


KOBO Versus Simple Kernels


Metric: We use the metric of Regret, defined as the difference between the predicted and true minimum, (f(x^) − f(x*)). We compare KOBO's regret against 5 popular base kernels B = {SE, PER, RQ, MAT, LIN}, which respectively denote the Square-Exponential (SE), Periodic (PER), Rational Quadratic (RQ), Matérn (MAT), and Linear (LIN) kernels. All KOBO experiments are initialized with the SE kernel.

Synthetic Baseline Functions: We report results from 3 types of synthetic functions f(x) that are popular benchmarks (Kim 2020) for black-box optimization: Staircase functions in N = 2000 dimensions, which exhibit non-smooth structures (Al-Roomi 2015); smooth benchmark functions such as BRANIN, commonly used in Bayesian optimization research (Sonja Surjanovic 2013); and periodic functions such as MICHALEWICZ that exhibit repetitions in their shape (Sonja Surjanovic 2013). The top row of Figure 4 visualizes these functions (more details on evaluation parameters are in the Technical Appendix). All reported results are an average of 10 runs.

Results: Figures 4(bottom row) plots Regret for the Staircase, Smooth (BRANIN), and the MICHALEWICZ functions, respectively. KOBO minimizes Regret at significantly fewer function evaluations (or samples), especially for Staircase and MICHALEWICZ. For a smooth function like BRANIN, KOBO's gain is understandably less since the SE and PER kernels naturally fit the smooth shape. When real world functions exhibit complex (non-smooth) structures and when function evaluations are expensive, KOBO's advantage is desirable.
结果:图 4(底行)分别展示了阶梯函数、平滑函数(BRANIN)和 MICHALEWICZ 函数的遗憾值(Regret)。KOBO 在显著更少的函数评估(或采样)次数下实现了遗憾值最小化,尤其对于阶梯函数和 MICHALEWICZ 函数表现突出。而对于 BRANIN 这类平滑函数,KOBO 的优势相对较小,因为 SEPER 核函数天然契合平滑形态。当现实世界的函数呈现复杂(非平滑)结构且函数评估成本高昂时,KOBO 的优势尤为可贵。





Figure 4: Comparison of KOBO and conventional BO using SE, PER, RQ, and Matérn kernels (Bottom Row) for (a) Staircase, (b) Smooth Branin, and (c) Periodic Michalewicz functions (Top Row).


KOBO Versus SOTA Baselines


Another Metric: Since KOBO learns the kernel in the latent space, we use Model Evidence in addition to Regret. Model Evidence is the normalized log probability of generating the observed data D given a kernel model K, i.e., log(P(f | X, K)) / |D| (Malkomes, Schaff, and Garnett 2016). Computing the exact model evidence is generally intractable in GPs (Rasmussen, Williams et al. 2006)(MacKay et al. 1998). We use the Bayesian Information Criterion (BIC) to approximate the model evidence as $\log P(f \mid X, K) = -\tfrac{1}{2} f^{T} K^{-1} f - \tfrac{1}{2} \log\!\left((2\pi)^{N} |K|\right)$, where N is the dimension of the input space H ⊆ R^N.

We plot Regret against the number of "Function Evaluations" (on the X axis), but Model Evidence against the number of "Latent Evaluations". Recall that Model Evidence is the metric used in the latent space of KerVAE to find the "best" kernel K*. Hence, "Latent Evaluations" denotes the number of latent space samples z, and the corresponding kernels Dec(z) = K, sampled by KerGPR to find K*. This reflects the computation overhead of KOBO.

SOTA Baselines (details in Technical Appendix):


(1) MCMC: The MCMC kernel search (Gardner et al. 2017; Abdessalem et al. 2017) applies the Metropolis-Hastings algorithm (Gardner et al. 2017) to the space of composite kernels k_C, using model evidence as the objective. The proposal distribution is defined as: given a kernel k, it is either added to or multiplied by a kernel drawn from B (chosen with probability p).

(2) CKS: The Automatic Statistician/Compositional Kernel Search (CKS) (Duvenaud et al. 2013) method takes advantage of the fact that complex kernels are generated as context-free grammar compositions of positive semi-definite matrices (closed under addition and multiplication); the kernel selection is then a tree search guided by model evidence. CKS searches over the discrete kernel space 𝒦 using a greedy strategy that, at each iteration, chooses the kernel with the highest model evidence. This kernel is then expanded by composition to a set of new kernels. The search process repeats on this expanded list.

(3) BOMS: Bayesian Optimization for Model Search (BOMS) (Malkomes, Schaff, and Garnett 2016), unlike CKS' greedy strategy, is a meta-learning technique which, conditioned on observations D, establishes similarities among the kernels in 𝒦, i.e., BOMS constructs a kernel between the kernels (a "kernel kernel"). Like KOBO, BOMS performs BO in 𝒦 by defining a Gaussian distribution P(g) = N(g; μ_g, K_g), where g is the model evidence, μ_g is the mean, and K_g is the covariance (defined by the "kernel kernel" function). K_g is constructed by defining a heuristic similarity measure between two kernels: the Hellinger distance.

Results: Figure 5 (Row 1) shows that KOBO lowers Regret faster than all SOTA baselines for the three benchmark functions. For Staircase, KOBO attains the global minimum in about 17 function evaluations, in contrast to MCMC, which incurs 28, BOMS 32, and CKS 43. For Michalewicz, KOBO attains the minimum in about 10 fewer samples than MCMC, while BOMS and CKS do not attain the minimum but get close to it. However, KOBO's performance gain is not as pronounced for Branin due to its smooth structure, as evidenced in Figure 4.

Figure 5 (Row 2) compares Model Evidence for the same benchmarks. For Staircase, KOBO's KerGPR achieves significantly higher Model Evidence in 20 iterations compared to MCMC, i.e., KOBO's optimal kernel K* better explains the observed data. For Branin, KOBO matches the Model Evidence of the baselines, and it performs modestly better for Michalewicz. The results illustrate that KOBO's performance is superior because the continuous search space learned by KerVAE simplifies the KerGPR optimization that determines K*, implying that KOBO presents an efficient search of the discrete kernel space 𝒦, in contrast to sub-optimal search techniques like greedy search (CKS) or heuristic metrics for kernel optimization (BOMS).

Is K* indeed learning the structure of f(x)?


Synthetic Functions: If we knew the objective function f(x), we could verify whether K* has learnt its structure. To test this, we sample a function from a GP with a known kernel K+ and pretend that to be f(x); we check if KOBO's K* converges to K+. Table 1 reports results from 3 N-dimensional synthetic functions, shown in the top row (N = 2000). These synthetic objective functions were sampled from a GP that uses different but known kernels. The subsequent rows show KerVAE's learnt kernel after Q observations/queries. With more Q, KerGPR closely matches the ground truth kernel.





Figure 5: Comparison of KOBO, MCMC, CKS, and BOMS for the Staircase (Col 1), Branin (Col 2), and Michalewicz (Col 3) functions: (Row 1) Regret; (Row 2) Model Evidence.


Learning Real-world CO2 Emission Data: Figure 6's blue curve plots real CO2 emissions data over time (Thoning, Tans, and Komhyr 1989). We test if KOBO's K* can learn the structure of this blue curve from partial data. Figures 6(a, b, c) show results when KOBO has observed the first 20%, 40%, and 60% of the data, respectively. With the first 20%, K* = SE·PER + RQ; hence KOBO's red curve captures the periodic structure of the early data. When the first 40% of the data is observed, KOBO captures the downward linear trend of the CO2 data, resulting in K* = SE·PER + PER + LIN. With 60% of the data, K* = SE·PER·RQ + PER·LIN + LIN models the interplay between the function's periodic structure and linear trends. A conventional Periodic kernel (PER), shown in black in Figure 6(c), is only able to capture the periodic structure, not the upward linear trend, even with 60% of the data.

User Experiments: (1) Audio Personalization, and (2) Image Recommendation


(1) Audio: We apply KOBO to audio personalization for real volunteers. We deliberately corrupt the audio played to the user, with the aim of helping the user pick a filter h that cancels the effect of the corruption (equalization) and recovers the original audio, hence maximizing the user's audio satisfaction. Therefore, a GPR employed in the space of all audio filters H maximizes the user's satisfaction f(h) at h*. At each iteration, the corrupted audio is filtered with a new h (recommended by GPR) and played to the user. The user's rating (0 to 10) of the perceived audio quality serves as the function observations f(h). User feedback is finite, and the frequency-selective nature of human hearing (Antoine Lorenzi 2003) makes optimizing f(h), h ∈ R^4000, well suited for kernel learning methods like KOBO.

Hearing Loss
Q      U1               U2               U3
       SE  KOBO  PER    SE  KOBO  PER    SE  KOBO  PER
5       6    6    6      8    8    8      6    6    6
10      6    8    6      8    8    8      6    7    7
15      6   10    7      8   10    8      7    9    7
20     10   10   10      9   10    9      7   10   10
25     10   10   10     10   10   10      9   10   10

Random Audio Corruption
Q      U1               U2               U3
       SE  KOBO  PER    SE  KOBO  PER    SE  KOBO  PER
5       1    1    1      3    3    3      1    1    1
10      2    3    1      4    4    4      2    2    3
15      2    4    5      4   10    4      2    2    3
20      4   10    5      4   10    9      5    8    4
25      8   10    9      8   10    9     10    8    7

Table 2: Audio personalization results (the U = 3 volunteers (rest in the Appendix) did not know which kernel was in use).

We invited 6 volunteers to two experiment sessions. In the first, the audio was corrupted with a "hearing loss" audiogram (CDC 2011); in the second, a "random" corruption filter was used (more details in the Technical Appendix). By querying the user Q times, each time with a new filter h, KOBO expects to learn K* and, in turn, maximize the user's satisfaction. We report Regret against increasing Q, and compare to conventional GPR optimizers with simple base kernels {SE, PER}. The audio demos at various stages of the optimization are public: https://keroptbo.github.io/.

(2) Images: We also apply KOBO to image recommendation. The motivation is that users are often unable to articulate precisely what images they want; however, given an image, they can express their satisfaction. We formulate this as a black-box optimization problem in the image space by posing the question: if a user rates a few pictures generated by an AI model, can the model find the "best" picture to maximize user satisfaction? To realize this, we use a pre-trained VAE-based image generator (Esser, Rombach, and Ommer 2020) (ImGen). The user issues a crude prompt of what they want, allowing ImGen to display an initial image x_start. The user scores this image as f(x_start) and KOBO is triggered. Table 3 displays x_start and the images recommended after Q = 5, 15, 25 queries. For a subjective measure of optimality, we asked users to describe their ideal image upfront. KOBO's recommendation at Q ≥ 15 seems to match the prompts quite well (more results in the Appendix).





Figure 6: The blue curve is real-world CO2 emissions data from (Thoning, Tans, and Komhyr 1989). The red curve is KOBO's prediction of the blue curve after observing (a) 20%, (b) 40%, (c) 60% of the blue data.



Table 3: Prompt-based image generation user results


5 Related Work


A body of work in BO has explored kernel learning. Closest to KOBO are BOMS (Malkomes, Schaff, and Garnett 2016), the Automatic Statistician (AS) and its variants (Duvenaud et al. 2013; Grosse et al. 2012; Kim and Teh 2016), and MCMC (Gardner et al. 2017; Abdessalem et al. 2017), discussed (and used as baselines) earlier. In other work (Teng et al. 2020), authors treat the kernel as a random variable and learn its belief from the data. (Zhen et al. 2020) introduces kernels with random Fourier features for meta-learning tasks; the kernel features are learned as latent variables of a model to generate adaptive kernels. In contrast, KOBO uses variational inference as an auxiliary module only to learn a continuous latent kernel space; the KerGPR optimization primarily drives the kernel learning.

Authors in (Kandasamy, Schneider, and Póczos 2015; Gardner et al. 2017; Mutny and Krause 2018; Wang et al. 2018) have improved BO performance in high-dimensional spaces by modeling the function via additive kernels. The objective is decomposed into a sum of functions in low-dimensional space. KOBO's comprehensive space of kernels from additive and multiplicative compositions is capable of modeling more complex function structures. Finally, (Kusner, Paige, and Hernández-Lobato 2017), (Gómez-Bombarelli et al. 2018), and (Gonzalez et al. 2015) perform optimization in a continuous latent space learned by VAEs to circumvent categorical data. Authors of (Garrido-Merchán and Hernández-Lobato 2020) use one-hot encoding approximations for BO of categorical variables. KOBO borrows from these ideas but applies them to kernel learning.

6 Limitations and Conclusion


• Trading Computation for Sample Efficiency: KOBO incurs rounds of computation in estimating the model evidence L. However, this does not affect sample efficiency, since KerVAE training is sample-independent. Thus, KOBO's advantage is in reducing the sample evaluations of f(x) (e.g., user burden) and not in total CPU cycles.

• Overfitting to Simple Functions: As iterations progress, KerGPR might learn a kernel more complex than the actual target function f (see f1(x) in Table 1). Choosing a complex kernel expands the surrogate function space, and may need more samples to converge. To avoid kernel overfitting, we can regularize the kernel complexity, i.e., the length and norm of the kernel grammar codes.

• Latent Space Interpretability: The current latent space learned by KerVAE is abstract and lacks interpretability. An interpretable latent space should offer improvements to KerGPR, facilitating the use of simpler optimizers compared to the expensive Bayesian optimization in the latent space.

To conclude, we propose KOBO, a kernel learning method for GPR. We design a continuous latent space of kernels (using a VAE), and optimize over this space via an auxiliary GPR to output an optimal kernel K*. This optimal kernel better models the structure of the objective function, which ensures sample efficiency. We show sample applications and believe the ideas could be extended to others.


Acknowledgements


We thank Foxconn and NSF (grant 2008338, 1909568, 2148583, and MRI-2018966) for funding this research. We are also grateful to the reviewers for their insightful feedback.

References


Abdessalem, A. B.; Dervilis, N.; Wagg, D. J.; and Worden, K. 2017. Automatic kernel selection for gaussian processes regression with approximate bayesian computation and sequential monte carlo. Frontiers in Built Environment, 3: 52.

Al-Roomi, A. R. 2015. Unconstrained Single-Objective Benchmark Functions Repository.

Antoine Lorenzi, B. C. 2003. Human Frequency Discrimination.

CDC. 2011. Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data, Hyattsville, MD.

Duvenaud, D.; Lloyd, J.; Grosse, R.; Tenenbaum, J.; and Zoubin, G. 2013. Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning, 1166-1174. PMLR.

Esser, P.; Rombach, R.; and Ommer, B. 2020. Taming Transformers for High-Resolution Image Synthesis. arXiv:2012.09841.

Frazier, P. I. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.

Gardner, J.; Guo, C.; Weinberger, K.; Garnett, R.; and Grosse, R. 2017. Discovering and exploiting additive structure for Bayesian optimization. In Artificial Intelligence and Statistics, 1311-1319. PMLR.

Garrido-Merchán, E. C.; and Hernández-Lobato, D. 2020. Dealing with categorical and integer-valued variables in bayesian optimization with gaussian processes. Neurocom-puting, 380: 20-35.

Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; She-berla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2): 268-276.

Gonzalez, J.; Longworth, J.; James, D. C.; and Lawrence, N. D. 2015. Bayesian optimization for synthetic gene design. arXiv preprint arXiv:1505.01627.

Grosse, R.; Salakhutdinov, R. R.; Freeman, W. T.; and Tenenbaum, J. B. 2012. Exploiting compositionality to explore a large space of model structures. arXiv preprint arXiv:1210.4856.

Hopcroft, J. E.; Motwani, R.; and Ullman, J. D. 2001. Introduction to automata theory, languages, and computation. Acm Sigact News, 32(1): 60-65.

Kandasamy, K.; Schneider, J.; and Póczos, B. 2015. High dimensional Bayesian optimisation and bandits via additive models. In International conference on machine learning, 295-304. PMLR.

Kim, H.; and Teh, Y. W. 2016. Scalable structure discovery in regression using gaussian processes. In Workshop on Automatic Machine Learning, 31-40. PMLR.

Kim, J. 2020. Benchmark Functions for Bayesian Optimization.

Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

Kusner, M. J.; Paige, B.; and Hernández-Lobato, J. M. 2017. Grammar variational autoencoder. In International conference on machine learning, 1945-1954. PMLR.

Lu, X.; Gonzalez, J.; Dai, Z.; and Lawrence, N. D. 2018. Structured variationally auto-encoded optimization. In International conference on machine learning, 3267-3275. PMLR.

MacKay, D. J.; et al. 1998. Introduction to Gaussian processes. NATO ASI series F computer and systems sciences, 168: 133-166.

Malkomes, G.; Schaff, C.; and Garnett, R. 2016. Bayesian optimization for automated model selection. Advances in neural information processing systems, 29.

Mutny, M.; and Krause, A. 2018. Efficient high dimensional bayesian optimization with additivity and quadrature fourier features. Advances in Neural Information Processing Systems, 31.

Parker, R. G.; and Rardin, R. L. 2014. Discrete optimization. Elsevier.

Rasmussen, C. E.; Williams, C. K.; et al. 2006. Gaussian processes for machine learning, volume 1. Springer.

Sonja Surjanovic, D. B. 2013. Virtual Library of Optimization Functions.

Teng, T.; Chen, J.; Zhang, Y.; and Low, B. K. H. 2020. Scalable variational Bayesian kernel selection for sparse Gaussian process regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 5997-6004.

Thoning, K. W.; Tans, P. P.; and Komhyr, W. D. 1989. Atmospheric carbon dioxide at Mauna Loa Observatory: 2. Analysis of the NOAA GMCC data, 1974-1985. Journal of Geophysical Research: Atmospheres, 94(D6): 8549-8565.

Wang, J. 2020. An intuitive tutorial to Gaussian processes regression. arXiv preprint arXiv:2009.10862.

Wang, Z.; Gehring, C.; Kohli, P.; and Jegelka, S. 2018. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, 745-754. PMLR.

Zhen, X.; Sun, H.; Du, Y.; Xu, J.; Yin, Y.; Shao, L.; and Snoek, C. 2020. Learning to learn kernels with variational random features. In International Conference on Machine Learning, 11409-11419. PMLR.