
β-VAE: LEARNING BASIC VISUAL CONCEPTS WITH A CONSTRAINED VARIATIONAL FRAMEWORK


Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, Alexander Lerchner Google DeepMind

{irinah, lmatthey, arkap, cpburgess, glorotx,

botvinick, shakir, lerchner}@google.com

Abstract


Learning an interpretable factorised representation of the independent data generative factors of the world without supervision is an important precursor for the development of artificial intelligence that is able to learn and reason in the same way that humans do. We introduce β-VAE, a new state-of-the-art framework for automated discovery of interpretable factorised latent representations from raw image data in a completely unsupervised manner. Our approach is a modification of the variational autoencoder (VAE) framework. We introduce an adjustable hyperparameter β that balances latent channel capacity and independence constraints with reconstruction accuracy. We demonstrate that β-VAE with appropriately tuned β > 1 qualitatively outperforms VAE (β = 1), as well as state-of-the-art unsupervised (InfoGAN) and semi-supervised (DC-IGN) approaches to disentangled factor learning on a variety of datasets (celebA, faces and chairs). Furthermore, we devise a protocol to quantitatively compare the degree of disentanglement learnt by different models, and show that our approach also significantly outperforms all baselines quantitatively. Unlike InfoGAN, β-VAE is stable to train, makes few assumptions about the data and relies on tuning a single hyperparameter β, which can be directly optimised through a hyperparameter search using weakly labelled data or through heuristic visual inspection for purely unsupervised data.

1 INTRODUCTION


The difficulty of learning a task for a given machine learning approach can vary significantly depending on the choice of the data representation. Having a representation that is well suited to the particular task and data domain can significantly improve the learning success and robustness of the chosen model (Bengio et al. 2013). It has been suggested that learning a disentangled representation of the generative factors in the data can be useful for a large variety of tasks and domains (Bengio et al. 2013; Ridgeway, 2016). A disentangled representation can be defined as one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors (Bengio et al. 2013). For example, a model trained on a dataset of 3D objects might learn independent latent units sensitive to single independent data generative factors, such as object identity, position, scale, lighting or colour, thus acting as an inverse graphics model (Kulkarni et al. 2015). In a disentangled representation, knowledge about one factor can generalise to novel configurations of other factors. According to Lake et al. (2016), disentangled representations could boost the performance of state-of-the-art AI approaches in situations where they still struggle but where humans excel. Such scenarios include those which require knowledge transfer, where faster learning is achieved by reusing learnt representations for numerous tasks; zero-shot inference, where reasoning about new data is enabled by recombining previously learnt factors; or novelty detection.

Unsupervised learning of a disentangled posterior distribution over the underlying generative factors of sensory data is a major challenge in AI research (Bengio et al., 2013; Lake et al., 2016). Most previous attempts required a priori knowledge of the number and/or nature of the data generative factors (Hinton et al., 2011; Rippel & Adams, 2013; Reed et al., 2014; Zhu et al., 2014; Yang et al. 2015; Goroshin et al., 2015; Kulkarni et al. 2015; Cheung et al. 2015; Whitney et al., 2016; Karaletsos et al. 2016). This is not always feasible in the real world, where the newly initialised learner may be exposed to complex data where no a priori knowledge of the generative factors exists, and little to no supervision for discovering the factors is available. Until recently purely unsupervised approaches to disentangled factor learning have not scaled well (Schmidhuber, 1992; Desjardins et al., 2012; Tang et al., 2013; Cohen & Welling, 2014; 2015).




https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_1.jpg?x=309&y=222&w=1168&h=857&r=0

Figure 1: Manipulating latent variables on celebA: Qualitative results comparing disentangling performance of β-VAE (β = 250), VAE (Kingma & Welling, 2014) (β = 1) and InfoGAN (Chen et al., 2016). In all figures of latent code traversal each block corresponds to the traversal of a single latent variable while keeping others fixed to either their inferred (β-VAE, VAE and DC-IGN where applicable) or sampled (InfoGAN) values. Each row represents a different seed image used to infer the latent values in the VAE-based models, or a random sample of the noise variables in InfoGAN. β-VAE and VAE traversal is over the [−3, 3] range. InfoGAN traversal is over ten dimensional categorical latent variables. Only β-VAE and InfoGAN learnt to disentangle factors like azimuth (a), emotion (b) and hair style (c), whereas VAE learnt an entangled representation (e.g. azimuth is entangled with emotion, presence of glasses and gender). InfoGAN images adapted from Chen et al. (2016). Reprinted with permission.


Recently a scalable unsupervised approach for disentangled factor learning has been developed, called InfoGAN (Chen et al. 2016). InfoGAN extends the generative adversarial network (GAN) (Goodfellow et al. 2014) framework to additionally maximise the mutual information between a subset of the generating noise variables and the output of a recognition network. It has been reported to be capable of discovering at least a subset of data generative factors and of learning a disentangled representation of these factors. The reliance of InfoGAN on the GAN framework, however, comes at the cost of training instability and reduced sample diversity. Furthermore, InfoGAN requires some a priori knowledge of the data, since its performance is sensitive to the choice of the prior distribution and the number of the regularised noise variables. InfoGAN also lacks a principled inference network (although the recognition network can be used as one). The ability to infer the posterior latent distribution from sensory input is important when using the unsupervised model in transfer learning or zero-shot inference scenarios. Hence, while InfoGAN is an important step in the right direction, we believe that further improvements are necessary to achieve a principled way of using unsupervised learning for developing more human-like learning and reasoning in algorithms as described by Lake et al. (2016).

Finally, there is currently no general method for quantifying the degree of learnt disentanglement. Therefore there is no way to quantitatively compare the degree of disentanglement achieved by different models or when optimising the hyperparameters of a single model. In this paper we attempt to address these issues. We propose β-VAE, a deep unsupervised generative approach for disentangled factor learning that can automatically discover the independent latent factors of variation in unsupervised data. Our approach is based on the variational autoencoder (VAE) framework (Kingma & Welling, 2014; Rezende et al., 2014), which brings scalability and training stability. While the original VAE work has been shown to achieve limited disentangling performance on simple datasets, such as FreyFaces or MNIST (Kingma & Welling, 2014), disentangling performance does not scale to more complex datasets (e.g. Aubry et al., 2014; Paysan et al., 2009; Liu et al., 2015), prompting the development of more elaborate semi-supervised VAE-based approaches for learning disentangled factors (e.g. Kulkarni et al., 2015; Karaletsos et al., 2016).




https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_2.jpg?x=308&y=218&w=1176&h=564&r=0

Figure 2: Manipulating latent variables on 3D chairs: Qualitative results comparing disentangling performance of β-VAE (β = 5), VAE (Kingma & Welling, 2014) (β = 1), InfoGAN (Chen et al., 2016) and DC-IGN (Kulkarni et al., 2015). InfoGAN traversal is over the [−1, 1] range. VAE always learns an entangled representation (e.g. chair width is entangled with azimuth and leg style (b)). All models apart from VAE learnt to disentangle the labelled data generative factor, azimuth (a). InfoGAN and β-VAE were also able to discover unlabelled factors in the dataset, such as chair width (b). Only β-VAE, however, learnt about the unlabelled factor of chair leg style (c). InfoGAN and DC-IGN images adapted from Chen et al. (2016) and Kulkarni et al. (2015), respectively. Reprinted with permission.


We propose augmenting the original VAE framework with a single hyperparameter β that modulates the learning constraints applied to the model. These constraints impose a limit on the capacity of the latent information channel and control the emphasis on learning statistically independent latent factors. β-VAE with β = 1 corresponds to the original VAE framework (Kingma & Welling, 2014; Rezende et al., 2014). With β > 1 the model is pushed to learn a more efficient latent representation of the data, which is disentangled if the data contains at least some underlying factors of variation that are independent. We show that this simple modification allows β-VAE to significantly improve the degree of disentanglement in learnt latent representations compared to the unmodified VAE framework (Kingma & Welling, 2014; Rezende et al., 2014). Furthermore, we show that β-VAE achieves state-of-the-art disentangling performance against both the best unsupervised (InfoGAN: Chen et al., 2016) and semi-supervised (DC-IGN: Kulkarni et al., 2015) approaches for disentangled factor learning on a number of benchmark datasets, such as CelebA (Liu et al., 2015), chairs (Aubry et al., 2014) and faces (Paysan et al., 2009), using qualitative evaluation. Finally, to help quantify the differences, we develop a new measure of disentanglement and show that β-VAE significantly outperforms all our baselines on this measure (ICA, PCA, VAE (Kingma & Welling, 2014), DC-IGN (Kulkarni et al., 2015), and InfoGAN (Chen et al., 2016)).

Our main contributions are the following: 1) we propose β-VAE, a new unsupervised approach for learning disentangled representations of independent visual data generative factors; 2) we devise a protocol to quantitatively compare the degree of disentanglement learnt by different models; 3) we demonstrate both qualitatively and quantitatively that our β-VAE approach achieves state-of-the-art disentanglement performance compared to various baselines on a variety of complex datasets.




https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_3.jpg?x=305&y=218&w=1183&h=915&r=0

Figure 3: Manipulating latent variables on 3D faces: Qualitative results comparing disentangling performance of β-VAE (β = 20), VAE (Kingma & Welling, 2014) (β = 1), InfoGAN (Chen et al., 2016) and DC-IGN (Kulkarni et al., 2015). InfoGAN traversal is over the [−1, 1] range. All models learnt to disentangle lighting (b) and elevation (c). DC-IGN and VAE struggled to continuously interpolate between different azimuth angles (a), unlike β-VAE, which additionally learnt to encode a wider range of azimuth angles than other models. InfoGAN and DC-IGN images adapted from Chen et al. (2016) and Kulkarni et al. (2015), respectively. Reprinted with permission.


https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_3.jpg?x=312&y=1428&w=1173&h=316&r=0

Figure 4: Latent factors learnt by β-VAE on celebA: traversal of individual latents demonstrates that β-VAE discovered, in an unsupervised manner, factors that encode skin colour, transition from an elderly male to younger female, and image saturation.


2 β-VAE FRAMEWORK DERIVATION


Let D = {X, V, W} be the set that consists of images x ∈ ℝ^N and two sets of ground truth data generative factors: conditionally independent factors v ∈ ℝ^K, where log p(v|x) = Σ_k log p(v_k|x); and conditionally dependent factors w ∈ ℝ^H. We assume that the images x are generated by the true world simulator using the corresponding ground truth data generative factors: p(x|v, w) = Sim(v, w). We want to develop an unsupervised deep generative model that, using samples from X only, can learn the joint distribution of the data x and a set of generative latent factors z (z ∈ ℝ^M, where M ≥ K) such that z can generate the observed data x; that is, p(x|z) ≈ p(x|v, w) = Sim(v, w). Thus a suitable objective is to maximise the marginal (log-)likelihood of the observed data x in expectation over the whole distribution of latent factors z:
D={X,V,W} 为由图像 xRN 和两组真实数据生成因子组成的集合:条件独立因子 vRK ,其中 logp(vx)=klogp(vkx) ;以及条件依赖因子 wRH 。我们假设图像 x 是由真实世界模拟器使用相应的真实数据生成因子生成的: p(xv,w)= Sim(v,w) 。我们希望开发一个无监督深度生成模型,仅使用来自 X 的样本,能够学习数据 x 与一组生成潜在因子 z(zRM 的联合分布,其中 MK ),使得 z 能够生成观测数据 x ;即 p(xz)p(xv,w)=Sim(v,w) 。因此,一个合适的目标是在潜在因子 z 的整个分布上最大化观测数据 x 的边际(对数)似然期望:


$$\max_\theta \mathbb{E}_{p_\theta(z)}\left[p_\theta(x|z)\right] \tag{1}$$

For a given observation x, we describe the inferred posterior configurations of the latent factors z by a probability distribution q_φ(z|x). Our aim is to ensure that the inferred latent factors q_φ(z|x) capture the generative factors v in a disentangled manner. The conditionally dependent data generative factors w can remain entangled in a separate subset of z that is not used for representing v. In order to encourage this disentangling property in the inferred q_φ(z|x), we introduce a constraint over it by trying to match it to a prior p(z) that can both control the capacity of the latent information bottleneck and embody the desiderata of statistical independence mentioned above. This can be achieved if we set the prior to be an isotropic unit Gaussian (p(z) = N(0, I)), hence arriving at the constrained optimisation problem in Eq. 2, where ε specifies the strength of the applied constraint.

$$\max_{\phi,\theta} \mathbb{E}_{x \sim D}\left[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]\right] \quad \text{subject to} \quad D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) < \epsilon \tag{2}$$

Re-writing Eq. 2 as a Lagrangian under the KKT conditions (Kuhn & Tucker, 1951; Karush, 1939), we obtain:

$$\mathcal{F}(\theta, \phi, \beta; x, z) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\left(D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) - \epsilon\right) \tag{3}$$

where the KKT multiplier β is the regularisation coefficient that constrains the capacity of the latent information channel z and puts implicit independence pressure on the learnt posterior due to the isotropic nature of the Gaussian prior p(z). Since β, ε ≥ 0 according to the complementary slackness KKT condition, Eq. 3 can be re-written to arrive at the β-VAE formulation, which is the familiar variational free energy objective function described by Jordan et al. (1999) with the addition of the β coefficient:

$$\mathcal{F}(\theta, \phi, \beta; x, z) \geq \mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \tag{4}$$
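The inequality in Eq. 4 can be checked directly: expanding Eq. 3 and using β ≥ 0, ε ≥ 0 from complementary slackness, the discarded βε term is non-negative, so the free energy lower-bounds the Lagrangian:

```latex
\mathcal{F}(\theta, \phi, \beta; x, z)
  = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]
      - \beta D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right)}_{\mathcal{L}(\theta, \phi;\, x, z, \beta)}
  + \beta\epsilon
  \;\geq\; \mathcal{L}(\theta, \phi; x, z, \beta)
```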

Varying β changes the degree of applied learning pressure during training, thus encouraging different learnt representations. β-VAE where β = 1 corresponds to the original VAE formulation of Kingma & Welling (2014). We postulate that in order to learn disentangled representations of the conditionally independent data generative factors v, it is important to set β > 1, thus putting a stronger constraint on the latent bottleneck than in the original VAE formulation of Kingma & Welling (2014). These constraints limit the capacity of z, which, combined with the pressure to maximise the log likelihood of the training data x under the model, should encourage the model to learn the most efficient representation of the data. Since the data x is generated using at least some conditionally independent ground truth factors v, and the D_KL term of the β-VAE objective function encourages conditional independence in q_φ(z|x), we hypothesise that higher values of β should encourage learning a disentangled representation of v. The extra pressures coming from high β values, however, may create a trade-off between reconstruction fidelity and the quality of disentanglement within the learnt latent representations. Disentangled representations emerge when the right balance is found between information preservation (reconstruction cost as regularisation) and latent channel capacity restriction (β > 1). The latter can lead to poorer reconstructions due to the loss of high frequency details when passing through a constrained latent bottleneck. Hence, the log likelihood of the data under the learnt model is a poor metric for evaluating disentangling in β-VAEs. Instead we propose a quantitative metric that directly measures the degree of learnt disentanglement in the latent representation.
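For concreteness, the per-sample objective in Eq. 4 can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's implementation: it assumes a diagonal Gaussian posterior q_φ(z|x) = N(μ, diag(exp(log_var))) (so the KL against the unit Gaussian prior has a closed form) and a Bernoulli decoder; the function name and example values are ours.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Negative of the beta-VAE objective (Eq. 4) for one sample.

    Reconstruction term: Bernoulli log likelihood (binary cross-entropy).
    KL term: closed form for N(mu, diag(exp(log_var))) against the unit
    Gaussian prior p(z) = N(0, I), scaled by beta.
    """
    eps = 1e-7  # numerical floor for the logarithms
    recon_nll = -np.sum(x * np.log(x_recon + eps)
                        + (1 - x) * np.log(1 - x_recon + eps))
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon_nll + beta * kl

# beta = 1 recovers the standard VAE free energy; beta > 1 penalises
# posterior capacity more heavily, encouraging disentangled latents.
mu = np.array([0.5, -0.3])
log_var = np.array([-0.2, 0.1])
x = np.array([1.0, 0.0, 1.0])
x_recon = np.array([0.9, 0.2, 0.8])
loss_vae = beta_vae_loss(x, x_recon, mu, log_var, beta=1.0)
loss_b4 = beta_vae_loss(x, x_recon, mu, log_var, beta=4.0)
```

The only change relative to a standard VAE loss is the single scalar multiplying the KL term, which is what makes β directly searchable as a hyperparameter.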

Since our proposed hyperparameter β directly affects the degree of learnt disentanglement, we would like to estimate the optimal β for learning a disentangled latent representation directly. However, it is not possible to do so. This is because the optimal β will depend on the value of ε in Eq. 2. Different datasets and different model architectures will require different optimal values of ε. However, when optimising β in Eq. 4, we are indirectly also optimising ε for the best disentanglement (see Sec. A.7 for details), and while we cannot learn the optimal value of β directly, we can instead estimate it using either our proposed disentanglement metric (see Sec. 3) or through visual inspection heuristics.




https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_5.jpg?x=303&y=200&w=606&h=583&r=0

Figure 5: Schematic of the proposed disentanglement metric: over a batch of L samples, each pair of images has a fixed value for one target generative factor y (here y = scale) and differs on all others. A linear classifier is then trained to identify the target factor using the average pairwise difference z_diff^b in the latent space over L samples.


3 DISENTANGLEMENT METRIC


It is important to be able to quantify the level of disentanglement achieved by different models. Designing a metric for this, however, is not straightforward. We begin by defining the properties that we expect a disentangled representation to have. Then we describe our proposed solution for quantifying the presence of such properties in a learnt representation.

As stated above, we assume that the data is generated by a ground truth simulation process which uses a number of data generative factors, some of which are conditionally independent, and we also assume that they are interpretable. For example, the simulator might sample independent factors corresponding to object shape, colour and size to generate an image of a small green apple. Because of the independence property, the simulator can also generate small red apples or big green apples. A representation of the data that is disentangled with respect to these generative factors, i.e. which encodes them in separate latents, would enable robust classification even using very simple linear classifiers (hence providing interpretability). For example, a classifier that learns a decision boundary that relies on object shape would perform as well when other data generative factors, such as size or colour, are varied.

Note that a representation consisting of independent latents is not necessarily disentangled, according to our desiderata. Independence can readily be achieved by a variety of approaches (such as PCA or ICA) that learn to project the data onto independent bases. Representations learnt by such approaches do not in general align with the data generative factors and hence may lack interpretability. For this reason, a simple cross-correlation calculation between the inferred latents would not suffice as a disentanglement metric.
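This point is easy to illustrate numerically. The following toy example (ours, not from the paper) projects data generated from two independent factors onto its principal components: the resulting latents are mutually uncorrelated, yet each one still mixes both generative factors.

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.uniform(-1, 1, size=(10_000, 2))        # two independent factors
x = v @ np.array([[1.0, 0.5],
                  [0.5, 1.0]])                  # entangled observations

# PCA: rotate onto the eigenbasis of the covariance -> decorrelated latents
eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
z = x @ eigvecs

# The latents are uncorrelated with each other...
cross_corr = np.corrcoef(z, rowvar=False)[0, 1]
# ...yet each latent still correlates strongly with BOTH generative factors,
# so independence alone does not give disentanglement.
factor_mix = np.abs(np.corrcoef(np.c_[z, v], rowvar=False)[:2, 2:])
```

Here `cross_corr` is near zero while every entry of `factor_mix` is large, which is exactly why a cross-correlation check on the latents cannot serve as a disentanglement metric.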

Our proposed disentangling metric, therefore, measures both the independence and interpretability (due to the use of a simple classifier) of the inferred latents. To apply our metric, we run inference on a number of images that are generated by fixing the value of one data generative factor while randomly sampling all others. If the independence and interpretability properties hold for the inferred representations, there will be less variance in the inferred latents that correspond to the fixed generative factor. We use a low capacity linear classifier to identify this factor and report the accuracy value as the final disentanglement metric score. Smaller variance in the latents corresponding to the target factor will make the job of this classifier easier, resulting in a higher score under the metric. See Fig. 5 for a representation of the full process.

More formally, we start from a dataset D = {X, V, W} as described in Sec. 2, assumed to contain a balanced distribution of ground truth factors (v, w), where image data points are obtained using a ground truth simulator process x ∼ Sim(v, w). We also assume we are given labels identifying a subset of the independent data generative factors v ∈ V for at least some instances. We then construct a batch of B vectors z_diff^b, to be fed as inputs to a linear classifier, as follows:

  1. Choose a factor y ∼ Unif[1…K] (e.g. y = scale in Fig. 5).


  2. For a batch of L samples:

(a) Sample two sets of latent representations, v_{1,l} and v_{2,l}, enforcing [v_{1,l}]_k = [v_{2,l}]_k if k = y (so that the value of factor k = y is kept fixed).

(b) Simulate image x_{1,l} ∼ Sim(v_{1,l}), then infer z_{1,l} = μ(x_{1,l}), using the encoder q(z|x) ∼ N(μ(x), σ(x)).

Repeat the process for v_{2,l}.
v2,l 重复此过程。

(c) Compute the difference z_diff^l = |z_{1,l} − z_{2,l}|, the absolute linear difference between the inferred latent representations.

  3. Use the average z_diff^b = (1/L) Σ_{l=1}^{L} z_diff^l to predict p(y|z_diff^b) (again, y = scale in Fig. 5) and report the accuracy of this predictor as the disentanglement metric score.

The classifier’s goal is to predict the index y of the generative factor that was kept fixed for a given z_diff^b. The accuracy of this classifier over multiple batches is used as our disentanglement metric score. We choose a linear classifier with low VC-dimension in order to ensure it has no capacity to perform nonlinear disentangling by itself. We take differences of two inferred latent vectors to reduce the variance in the inputs to the classifier, and to reduce the conditional dependence on the inputs x. This ensures that on average [z_diff^b]_y < [z_diff^b]_{k≠y}. See Equations 5 in Appendix A.4 for more details of the process.
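The whole procedure can be sketched end to end on a toy simulator. Everything below is illustrative rather than the paper's setup: a 3-factor simulator whose "encoder" is applied directly to the factors, and an argmin over latent dimensions standing in for the low-capacity linear classifier of Fig. 5 (it exploits the same property, that [z_diff^b]_y is smallest for the fixed factor).

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3     # number of independent ground-truth factors v
L = 64    # pairs averaged into each z_diff^b vector
B = 200   # batch of classifier inputs per score

def z_diff_vector(y, encode):
    """Steps 2(a)-(c): average |z1 - z2| over L pairs that agree on factor y."""
    v1 = rng.uniform(-1, 1, size=(L, K))
    v2 = rng.uniform(-1, 1, size=(L, K))
    v2[:, y] = v1[:, y]              # keep the target factor fixed
    z1, z2 = encode(v1), encode(v2)  # encode() stands in for mu(Sim(v))
    return np.abs(z1 - z2).mean(axis=0)

def metric_score(encode):
    """Step 3: accuracy of recovering y from z_diff^b (argmin in place of a
    trained low-VC-dimension linear classifier)."""
    ys = rng.integers(0, K, size=B)
    preds = [int(np.argmin(z_diff_vector(y, encode))) for y in ys]
    return float(np.mean(np.array(preds) == ys))

# A perfectly disentangled encoder: each latent is exactly one factor.
disentangled = lambda v: v
# An entangled encoder: latents 0 and 1 each mix factors 0 and 1.
mixing = np.array([[1.0, 1.0, 0.0],
                   [1.0, -1.0, 0.0],
                   [0.0, 0.0, 1.0]])
entangled = lambda v: v @ mixing

score_good = metric_score(disentangled)
score_bad = metric_score(entangled)
```

The disentangled encoder scores perfectly (the fixed factor's latent difference is exactly zero), while the entangled one is penalised because a fixed factor no longer pins down any single latent dimension.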

4 EXPERIMENTS


In this section we first qualitatively demonstrate that our proposed β-VAE framework consistently discovers more latent factors and disentangles them in a cleaner fashion than either the unmodified VAE (Kingma & Welling, 2014) or state-of-the-art unsupervised (InfoGAN: Chen et al., 2016) and semi-supervised (DC-IGN: Kulkarni et al., 2015) solutions for disentangled factor learning on a variety of benchmarks. We then quantify and characterise the differences in disentangled factor learning between our β-VAE framework and a variety of benchmarks using our proposed new disentangling metric.

4.1 QUALITATIVE BENCHMARKS


We trained β-VAE (see Tbl. 1 for architecture details) on a variety of datasets commonly used to evaluate disentangling performance of models: celebA (Liu et al., 2015), chairs (Aubry et al., 2014) and faces (Paysan et al., 2009). Figures 1-3 provide a qualitative comparison of the disentangling performance of β-VAE, VAE (β = 1) (Kingma & Welling, 2014), InfoGAN (Chen et al., 2016) and DC-IGN (Kulkarni et al., 2015) as appropriate.

It can be seen that across all datasets β-VAE is able to automatically discover and learn to disentangle all of the factors learnt by the semi-supervised DC-IGN (Kulkarni et al., 2015): azimuth (Fig. 3a, Fig. 2a), lighting and elevation (Fig. 3b,c). Often it acts as a more convincing inverse graphics network than DC-IGN (e.g. Fig. 3a) or InfoGAN (e.g. Fig. 2a, Fig. 1a-c or Fig. 3a). Furthermore, unlike DC-IGN, β-VAE requires no supervision and hence can learn about extra unlabelled data generative factors that DC-IGN cannot learn by design, such as chair width or leg style (Fig. 2b,c). The unsupervised InfoGAN (Chen et al., 2016) approach shares this quality with β-VAE, and the two frameworks tend to discover overlapping, but not necessarily identical, sets of data generative factors. For example, both β-VAE and InfoGAN (but not DC-IGN) learn about the width of chairs (Fig. 2b). Only β-VAE, however, learns about the chair leg style (Fig. 2c). It is interesting to note how β-VAE is able to generate an armchair with a round office chair base, even though such armchairs do not exist in the dataset (or, perhaps, reality). Furthermore, only β-VAE is able to discover all three factors of variation (chair azimuth, width and leg style) within a single model, while InfoGAN learns to allocate its continuous latent variable to either azimuth or width. InfoGAN sometimes discovers factors that β-VAE does not precisely disentangle, such as the presence of sunglasses in celebA. β-VAE does, however, discover numerous extra factors such as skin colour, image saturation, and age/gender that are not reported in the InfoGAN paper (Chen et al., 2016) (Fig. 4). Furthermore, β-VAE latents tend to learn a smooth continuous transformation over a wider range of factor values than InfoGAN (e.g. rotation over a wider range of angles, as shown in Figs. 1-3a).
Overall β-VAE tends to consistently and robustly discover more latent factors and learn cleaner disentangled representations of them than either InfoGAN or DC-IGN. This holds even on such challenging datasets as celebA. Furthermore, unlike InfoGAN and DC-IGN, β-VAE requires no design decisions or assumptions about the data, and is very stable to train.


When compared to the unmodified VAE baseline (β=1), β-VAE consistently learns significantly more disentangled latent representations. For example, when learning about chairs, VAE entangles chair width with leg style (Fig. 2b). When learning about celebA, VAE entangles azimuth with emotion and gender (Fig. 1a), and emotion with hair style, skin colour and identity (Fig. 1b), while the VAE fringe latent also codes for baldness and head size (Fig. 1c). Although VAE performs relatively well on the faces dataset, it still struggles to learn a clean representation of azimuth (Fig. 3a). This suggests that a continuum of disentanglement quality exists, and that it can be traversed by varying β within the β-VAE framework. While increasing β often leads to better disentanglement, it may come at the cost of blurrier reconstructions and of losing representations for some factors, particularly those that correspond to only minor changes in pixel space.

4.2 QUANTITATIVE BENCHMARKS


In order to quantitatively compare the disentangling performance of β-VAE against various baselines, we created a synthetic dataset of 737,280 binary 2D shapes (heart, oval and square) generated from the Cartesian product of the shape and four independent generative factors vk defined in vector graphics: position X (32 values), position Y (32 values), scale (6 values) and rotation (40 values over the 2π range). To ensure smooth affine object transforms, each two subsequent values for each factor vk were chosen to ensure minimal differences in pixel space given the 64x64 pixel image resolution. This dataset was chosen because it contains no confounding factors apart from its five independent data generative factors (identity, position X, position Y, scale and rotation). This gives us knowledge of the ground truth for comparing the disentangling performance of different models in an objective manner.
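The factor layout described above fully determines the dataset size; a minimal numpy sketch makes this concrete (the factor names are ours, for illustration only, not from the paper's code):

```python
import numpy as np
from itertools import product

# Sizes of the five independent generative factors of the 2D shapes
# dataset described above (factor names are ours, for illustration).
factor_sizes = {
    "shape": 3,       # heart, oval, square
    "pos_x": 32,
    "pos_y": 32,
    "scale": 6,
    "rotation": 40,   # 40 values over the 2*pi range
}

# The dataset is the full Cartesian product of all factor values.
n_images = int(np.prod(list(factor_sizes.values())))  # 737280, as quoted above

# Ground-truth factor vectors v_k, one per image, in enumeration order.
grid = product(*(range(s) for s in factor_sizes.values()))
first = next(grid)   # (0, 0, 0, 0, 0)
```

Enumerating the grid this way also yields the ground-truth labels that the quantitative benchmarks below rely on.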

We used our proposed disentanglement metric (see Sec. 3) to quantitatively compare the ability of β-VAE to automatically discover and learn a disentangled representation of the data generative factors of the synthetic dataset of 2D shapes described above with that of a number of benchmarks (see Tbl. 1 in the Appendix for model architecture details). The table in Fig. 6 (left) reports the classification accuracy of the disentanglement metric for 5,000 test samples. It can be seen that β-VAE (β=4) significantly outperforms all baselines, such as an untrained VAE and the original VAE formulation of Kingma & Welling (2014) (β=1) with the same architecture as β-VAE, the top ten PCA or ICA components of the data (see Sec. A.3 for details), or when using the raw pixels directly. β-VAE also does better than InfoGAN. Remarkably, β-VAE performs on the same level as DC-IGN despite the latter being semi-supervised and the former wholly unsupervised. Furthermore, β-VAE achieved similar classification accuracy to the ground truth vectors used for data generation, thus suggesting that it was able to learn a very good disentangled representation of the data generative factors.

We also examined qualitatively the representations learnt by β-VAE, VAE, InfoGAN and DC-IGN on the synthetic dataset of 2D shapes. Fig. 7A demonstrates that after training, β-VAE with β=4 learnt a good (while not perfect) disentangled representation of the data generative factors, and its decoder learnt to act as a rendering engine. Its performance was comparable to that of DC-IGN (Fig. 7C), with the difference that DC-IGN required a priori knowledge about the quantity of the data generative factors, while β-VAE was able to discover them in an unsupervised manner. The most informative latent units zm of β-VAE have the highest KL divergence from the unit Gaussian prior (p(z) = N(0, I)), while the uninformative latents have KL divergence close to zero. Fig. 7A demonstrates the selectivity of each latent zm to the independent data generative factors: zmμ = f(vk) ∀ vk ∈ {vposition X, vposition Y, vscale, vrotation} (top three rows), where zmμ is the learnt Gaussian mean of latent unit zm. The effect of traversing each latent zm on the resulting reconstructions is shown in the bottom five rows of Fig. 7A. The latents z6 and z2 learnt to encode the X and Y coordinates of the objects respectively; unit z1 learnt to encode scale; and units z5 and z7 learnt to encode rotation. The frequency of oscillations in each rotational latent corresponds to the rotational symmetry of the corresponding object (2π for heart, π for oval and π/2 for square). Furthermore, the two rotational latents seem to encode cos and sin rotational coordinates, while the positional latents align with the Cartesian axes. While such alignment with intuitive factors for humans is not guaranteed, empirically we found it to be very common. Fig. 7B demonstrates that the unmodified VAE baseline (β=1) is not able to disentangle generative factors in the data as well as β-VAE with appropriate learning pressures. Instead each latent z (apart from z9, which learnt rotation) encodes at least two data generative factors. InfoGAN also achieved a degree of disentangling (see Fig. 7D), particularly for positional factors. However, despite our best efforts to train InfoGAN, we were not able to achieve the same degree of disentangling in other factors, such as rotation, scale and shape. We also found its ability to generate the different shapes in the dataset to be inaccurate and unstable during training, possibly due to reported limitations of the GAN framework, which can struggle to learn the full data distribution and instead will often learn a small subset of its modes (Salimans et al., 2016; Zhao et al., 2016).

https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_8.jpg?x=309&y=234&w=1172&h=432&r=0

Figure 6: Disentanglement metric classification accuracy for 2D shapes dataset. Left: Accuracy for different models and training regimes. Right: A positive correlation is present between the size of z and the optimal normalised values of β for disentangled factor learning for a fixed β-VAE architecture. β values are normalised by latent z size m and input x size n. Note that β values are not uniformly sampled. Orange approximately corresponds to unnormalised β=1. Good reconstructions are associated with entangled representations (lower disentanglement scores); disentangled representations (high disentanglement scores) often result in blurry reconstructions.
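The observation above that the most informative latents have the highest KL divergence from the unit Gaussian prior can be checked directly from the encoder's outputs; a minimal numpy sketch, using the standard closed form of the KL for a diagonal Gaussian posterior (function name ours):

```python
import numpy as np

def per_latent_kl(mu, logvar):
    """Mean KL( N(mu, sigma^2) || N(0, I) ) per latent unit, averaged over
    a batch of posterior parameters. Units with KL near zero have collapsed
    to the prior and carry no information; high-KL units are informative."""
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)  # closed form
    return kl.mean(axis=0)  # one value per latent dimension

# Toy batch: latent 0 deviates from the prior, latent 1 matches it exactly.
mu = np.array([[2.0, 0.0], [-2.0, 0.0]])
logvar = np.zeros((2, 2))
kls = per_latent_kl(mu, logvar)  # kls[0] == 2.0, kls[1] == 0.0
```

Ranking latents by this quantity is how the informative units discussed above can be identified without any labels.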

Understanding the effects of β. We hypothesised that constrained optimisation is important for enabling deep unsupervised models to learn disentangled representations of the independent data generative factors (Sec. 2). In the β-VAE framework this corresponds to tuning the β coefficient. One way to view β is as a mixing coefficient (see Sec. A.6 for a derivation) for balancing the magnitudes of gradients from the reconstruction and the prior-matching components of the VAE lower bound formulation in Eq. 4 during training. In this context it makes sense to normalise β by latent z size m and input x size n in order to compare its different values across different latent layer sizes and different datasets (β_norm = βm/n). We found that larger latent z layer sizes m require higher constraint pressures (higher β values), see Fig. 6 (Right). Furthermore, the relationship of β for a given m is characterised by an inverted U curve. When β is too low or too high the model learns an entangled latent representation, due to either too much or too little capacity in the latent z bottleneck. We find that in general β>1 is necessary to achieve good disentanglement. However, if β is too high and the resulting capacity of the latent channel is lower than the number of data generative factors, then the learnt representation necessarily has to be entangled (as a low-rank projection of the true data generative factors will compress them in a non-factorial way to still capture the full data distribution well). We also note that VAE reconstruction quality is a poor indicator of learnt disentanglement. Good disentangled representations often lead to blurry reconstructions due to the restricted capacity of the latent information channel z, while entangled representations often result in the sharpest reconstructions.
We therefore suggest that one should not necessarily strive for perfect reconstructions when using β -VAEs as unsupervised feature learners - though it is often possible to find the right β -VAE architecture and the right value of β to have both well disentangled latent representations and good reconstructions.
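The β-weighted lower bound and the normalisation β_norm = βm/n discussed above can be sketched as follows. This is a minimal numpy illustration of the objective, not the convolutional models of Tbl. 1; function names are ours:

```python
import numpy as np

def beta_vae_loss(recon_nll, mu, logvar, beta):
    """beta-VAE objective: reconstruction negative log likelihood plus
    beta times the KL between the diagonal Gaussian posterior and the
    unit Gaussian prior. beta = 1 recovers the standard VAE bound."""
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    return float(np.mean(recon_nll + beta * kl))

def normalised_beta(beta, m, n):
    """beta normalised by latent size m and input size n (beta_norm = beta*m/n),
    making values comparable across latent layer sizes and datasets."""
    return beta * m / n

# e.g. the 2D shapes model of Tbl. 1: m = 10 latents, n = 64*64 = 4096 pixels.
b_norm = normalised_beta(4.0, 10, 4096)
```

Setting beta above 1 increases the weight of the KL term relative to reconstruction, which is the constraint pressure the paragraph above describes.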

We proposed a principled way of choosing β for datasets with at least weak label information. If label information exists for at least a small subset of the independent data generative factors of variation, one can apply the disentanglement metric described in Sec. 3 to approximate the level of learnt disentanglement for various β choices during a hyperparameter sweep. When such labelled information is not available, the optimal value of β can be found through visual inspection of what effect the traversal of each single latent unit zm has on the generated images (x|z) in pixel space (as shown in Fig. 7 rows 4-8). For the 2D shapes dataset, we have found that the optimal values of β as determined by visual inspection match closely the optimal values as determined by the disentanglement metric.

https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_9.jpg?x=305&y=198&w=698&h=1035&r=0

Figure 7: A: Representations learnt by a β-VAE (β=4). Each column represents a latent zi, ordered according to the learnt Gaussian variance (last row). Row 1 (position) shows the mean activation (red represents high values) of each latent zi as a function of all 32x32 locations, averaged across objects, rotations and scales. Rows 2 and 3 show the mean activation of each unit zi as a function of scale (respectively rotation), averaged across rotations and positions (respectively scales and positions). Square is red, oval is green and heart is blue. Rows 4-8 (second group) show reconstructions resulting from the traversal of each latent zi over three standard deviations around the unit Gaussian prior mean while keeping the remaining 9/10 latent units fixed to the values obtained by running inference on an image from the dataset. B: Similar analysis for VAE (β=1). C: Similar analysis for DC-IGN, clamping a single latent each for scale, positions and orientation, and 5 for shape. D: Similar analysis for InfoGAN, using 5 continuous latents regularised using the mutual information cost, and 5 additional unconstrained noise latents (not shown).

5 CONCLUSION


In this paper we have reformulated the standard VAE framework (Kingma & Welling, 2014; Rezende et al., 2014) as a constrained optimisation problem with strong latent capacity constraint and independence prior pressures. By augmenting the lower bound formulation with the β coefficient that regulates the strength of such pressures and, as a consequence, the qualitative nature of the representations learnt by the model, we have achieved state of the art results for learning disentangled representations of data generative factors. We have shown that our proposed β-VAE framework significantly outperforms both qualitatively and quantitatively the original VAE (Kingma & Welling, 2014), as well as state-of-the-art unsupervised (InfoGAN; Chen et al., 2016) and semi-supervised (DC-IGN; Kulkarni et al., 2015) approaches to disentangled factor learning. Furthermore, we have shown that β-VAE consistently and robustly discovers more factors of variation in the data, and it learns a representation that covers a wider range of factor values and is disentangled more cleanly than other benchmarks, all in a completely unsupervised manner. Unlike InfoGAN and DC-IGN, our approach does not depend on any a priori knowledge about the number or the nature of data generative factors. Our preliminary investigations suggest that the performance of the β-VAE framework may depend on the sampling density of the data generative factors within a training dataset (see Appendix A.8 for more details). It appears that having more densely sampled data generative factors results in better disentangling performance of β-VAE; however, we leave a more principled investigation of this effect to future work. β-VAE is robust with respect to different architectures, optimisation parameters and datasets, hence requiring few design decisions.
Our approach relies on the optimisation of a single hyperparameter β, which can be found directly through a hyperparameter search if weakly labelled data is available to calculate our new proposed disentangling metric. Alternatively, the optimal β can be estimated heuristically in purely unsupervised scenarios. Learning an interpretable factorised representation of the independent data generative factors in a completely unsupervised manner is an important precursor for the development of artificial intelligence that understands the world in the same way that humans do (Lake et al., 2016). We believe that using our approach as an unsupervised pretraining stage for supervised or reinforcement learning will produce significant improvements for scenarios such as transfer or fast learning.


6 ACKNOWLEDGEMENTS


We would like to thank Charles Blundell, Danilo Rezende, Tejas Kulkarni and David Pfau for helpful comments that improved the manuscript.

REFERENCES


M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.

Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. In IEEE Transactions on Pattern Analysis & Machine Intelligence, 2013.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv, 2016.

Brian Cheung, Jesse A. Levezey, Arjun K. Bansal, and Bruno A. Olshausen. Discovering hidden factors of variation in deep networks. In Proceedings of the International Conference on Learning Representations, Workshop Track, 2015.

T. Cohen and M. Welling. Transformation properties of learned visual representations. In ICLR, 2015.

Taco Cohen and Max Welling. Learning the irreducible representations of commutative lie groups. arXiv, 2014.

G. Desjardins, A. Courville, and Y. Bengio. Disentangling factors of variation via generative entangling. arXiv, 2012.

Carl Doersch. Tutorial on variational autoencoders. arXiv, 2016.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS, pp. 2672-2680, 2014.

Ross Goroshin, Michael Mathieu, and Yann LeCun. Learning to linearize under uncertainty. NIPS, 2015.

G. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. International Conference on Artificial Neural Networks, 2011.

Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

Theofanis Karaletsos, Serge Belongie, and Gunnar Rätsch. Bayesian representation learning with oracle constraints. ICLR, 2016.

W. Karush. Minima of Functions of Several Variables with Inequalities as Side Constraints. Master's thesis, Univ. of Chicago, Chicago, Illinois, 1939.

D. P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv, 2014.

D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014.

H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Proceedings of 2nd Berkeley Symposium, pp. 481-492, 1951.

Tejas Kulkarni, William Whitney, Pushmeet Kohli, and Joshua Tenenbaum. Deep convolutional inverse graphics network. NIPS, 2015.

Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. arXiv, 2016.

Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. ICCV, 2015.

P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3d face model for pose and illumination invariant face recognition. AVSS, 2009.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, and David Cournapeau. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv, 2015.

Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. ICML, 2014.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv, 2014.

Karl Ridgeway. A survey of inductive biases for factorial representation-learning. arXiv, 2016. URL http://arxiv.org/abs/1612.05299

Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models. arXiv, 2013.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. arXiv, 2016. URL http://arxiv.org/abs/1606.03498

Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-869, 1992.

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. arXiv, 2016.

Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. Tensor analyzers. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, USA, 2013.

William F. Whitney, Michael Chang, Tejas Kulkarni, and Joshua B. Tenenbaum. Understanding visual concepts with continuation learning. arXiv, 2016. URL http://arxiv.org/pdf/1602.06822.pdf

Jimei Yang, Scott Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. NIPS, 2015.

Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv, 2016. URL http://arxiv.org/abs/1609.03126

Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In Advances in Neural Information Processing Systems 27, 2014.

A APPENDIX


A.1 MODEL ARCHITECTURE DETAILS


A summary of all model architectures used in this paper can be seen in Tbl. 1.

A.2 INFOGAN TRAINING


To train the InfoGAN network described in Tbl. 1 on the 2D shapes dataset (Fig. 7), we followed the training paradigm described in Chen et al. (2016) with the following modifications. For the mutual information regularised latent code, we used 5 continuous variables ci sampled uniformly from (-1, 1). We used 5 noise variables zi, as we found that using a reduced number of noise variables improved the quality of generated samples for this dataset. To help stabilise training, we used the instance noise trick described in Shi et al. (2016), adding Gaussian noise to the discriminator inputs (0.2 standard deviation on images scaled to [-1, 1]). We followed Radford et al. (2015) for the architecture of the convolutional layers, and used batch normalisation in all layers except the last in the generator and the first in the discriminator.
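The instance-noise stabilisation above amounts to adding zero-mean Gaussian noise to every discriminator input; a minimal numpy sketch (the function name is ours, and clipping back into the valid range is our addition, not stated in the text):

```python
import numpy as np

def add_instance_noise(images, std=0.2, rng=None):
    """Add zero-mean Gaussian noise (std 0.2, as in the text) to a batch of
    discriminator inputs already scaled to [-1, 1]. Clipping back into
    [-1, 1] is our addition, not stated in the text."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = images + rng.normal(0.0, std, size=images.shape)
    return np.clip(noisy, -1.0, 1.0)

# Placeholder batch of 64x64x1 images in [-1, 1].
batch = np.zeros((4, 64, 64, 1))
noisy = add_instance_noise(batch, rng=np.random.default_rng(0))
```

The same noise would be applied to both real and generated batches before they reach the discriminator, which is what smooths the two distributions towards each other.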



2D shapes (VAE). Optimiser: adagrad 1e-2.
  Input: 4096 (flattened 64x64x1).
  Encoder: FC 1200, 1200. ReLU activation.
  Latents: 10.
  Decoder: FC 1200, 1200, 1200, 4096. Tanh activation. Bernoulli.

2D shapes (DC-IGN). Optimiser: rmsprop (as in Kulkarni et al., 2015).
  Input: 64x64x1.
  Encoder: Conv 96x3x3, 48x3x3, 48x3x3 (padding 1). ReLU activation and max pooling 2x2.
  Latents: 10.
  Decoder: Unpooling, Conv 48x3x3, 96x3x3, 1x3x3. ReLU activation, Sigmoid.

2D shapes (InfoGAN). Optimiser: Adam 1e-3 (gen), 2e-4 (dis).
  Generator: FC 256, 256, Deconv 128x4x4, 64x4x4 (stride 2). Tanh.
  Discriminator: Conv and FC reverse of generator. Leaky ReLU activation. FC 1. Sigmoid activation.
  Recognition: Conv and FC shared with discriminator. FC 128, 5.
  Latents: Gaussian 10: z1-5 ∼ Unif(-1,1), c1-5 ∼ Unif(-1,1).

Chairs (VAE). Optimiser: Adam 1e-4.
  Input: 64x64x1.
  Encoder: Conv 32x4x4 (stride 2), 32x4x4 (stride 2), 64x4x4 (stride 2), 64x4x4 (stride 2), FC 256. ReLU activation.
  Latents: 32.
  Decoder: Deconv reverse of encoder. ReLU activation. Bernoulli.

CelebA (VAE). Optimiser: Adam 1e-4.
  Input: 64x64x3.
  Encoder: Conv 32x4x4 (stride 2), 32x4x4 (stride 2), 64x4x4 (stride 2), 64x4x4 (stride 2), FC 256. ReLU activation.
  Latents: 32.
  Decoder: Deconv reverse of encoder. ReLU activation. Gaussian.

3DFaces (VAE). Optimiser: Adam 1e-4.
  Input: 64x64x1.
  Encoder: Conv 32x4x4 (stride 2), 32x4x4 (stride 2), 64x4x4 (stride 2), 64x4x4 (stride 2), FC 256. ReLU activation.
  Latents: 32.
  Decoder: Deconv reverse of encoder. ReLU activation. Bernoulli.

Table 1: Details of model architectures used in the paper. The models were trained using either adagrad (Duchi et al., 2011) or adam (Kingma & Ba, 2014) optimisers.


A.3 ICA AND PCA BASELINES


In order to calculate the ICA benchmark, we applied the fastICA (Pedregosa et al., 2011) algorithm to the whitened pixel data. Due to memory limitations we had to apply the algorithm to pairwise combinations of the subsets of the dataset corresponding to the transforms of each of the three 2D object identities. We calculated the disentangling metric for all three ICA models trained on each of the three pairwise combinations of 2D objects, before presenting the average of these scores in Fig. 6.
为计算 ICA 基准值,我们对白化后的像素数据应用了 fastICA 算法(Pedregosa 等人,2011 年)。由于内存限制,我们只能对数据集中对应于三个二维物体各自变换的子集进行两两组合处理。我们针对所有三个基于不同两两物体组合训练的 ICA 模型计算了分离度量指标,最终在图 6 中展示了这些得分的平均值。

We performed PCA on the raw and whitened pixel data. Both approaches resulted in similar disentangling metric scores. Fig. 6 reports the PCA results calculated using whitened pixel data for more direct comparison with the ICA score.
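Both baselines can be reproduced in outline with scikit-learn. The following is a minimal sketch; the data matrix below is a random stand-in for the whitened pixel data (each row one flattened 64x64 image), not the actual 2D shapes dataset:

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

# Random stand-in for the whitened 2D-shapes pixel matrix: each row is a
# 64x64 image flattened to 4096 values (the real data is binary sprites).
rng = np.random.default_rng(0)
images = rng.random((200, 64 * 64))

# PCA keeping the first ten components, matching the ten latent units
# used by the VAE models compared in Fig. 6.
pca = PCA(n_components=10)
pca_codes = pca.fit_transform(images)

# fastICA with the same number of components; on the real data this was run
# per pairwise combination of object identities due to memory limits.
ica = FastICA(n_components=10, random_state=0)
ica_codes = ica.fit_transform(images)

print(pca_codes.shape, ica_codes.shape)  # (200, 10) (200, 10)
```

The resulting codes play the role of the latent representation when computing the disentanglement metric for these baselines.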

A.4 DISENTANGLEMENT METRIC DETAILS


We used a linear classifier to learn the identity of the generative factor that produced $z_{\mathrm{diff}}^{b}$ (see Equation (5) for the process used to obtain samples of $z_{\mathrm{diff}}^{b}$). We used a fully connected linear classifier to predict $p(y \mid z_{\mathrm{diff}}^{b})$, where $y$ is one of four generative factors (position X, position Y, scale and rotation). We used a softmax output nonlinearity and a negative log likelihood loss function. The classifier was trained using the Adagrad (Duchi et al., 2011) optimisation algorithm with a learning rate of 1e-2 until convergence.


$D = \{V \in \mathbb{R}^{K}, W \in \mathbb{R}^{H}, X \in \mathbb{R}^{N}\}, \qquad y \sim \mathrm{Unif}[1 \ldots K]$

Repeat for $b = 1 \ldots B$:

$v_{1,l} \sim p(v), \quad w_{1,l} \sim p(w), \quad w_{2,l} \sim p(w), \quad [v_{2,l}]_k = \begin{cases} [v_{1,l}]_k, & \text{if } k = y \\ \sim p(v_k), & \text{otherwise} \end{cases}$

$x_{1,l} \sim \mathrm{Sim}(v_{1,l}, w_{1,l}), \quad x_{2,l} \sim \mathrm{Sim}(v_{2,l}, w_{2,l}) \tag{5}$

$q(z \mid x) = \mathcal{N}(\mu(x), \sigma(x)), \quad z_{1,l} = \mu(x_{1,l}), \quad z_{2,l} = \mu(x_{2,l})$

$z_{\mathrm{diff}}^{l} = |z_{1,l} - z_{2,l}|, \qquad z_{\mathrm{diff}}^{b} = \frac{1}{L} \sum_{l=1}^{L} z_{\mathrm{diff}}^{l}$
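The sampling loop in Equation (5) can be sketched as follows. The simulator $\mathrm{Sim}$ and the trained encoder mean $\mu(\cdot)$ are stand-ins here (a fixed random linear map) so the example stays runnable; the factor names follow the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, M = 4, 64, 10  # generative factors, samples per batch, latent dims

def sample_v():
    # Stand-in for v ~ p(v): position X, position Y, scale, rotation.
    return rng.uniform(0.0, 1.0, size=K)

# Stand-in for mu(Sim(v)): a fixed linear "encoder" keeps the sketch runnable.
W = rng.normal(size=(M, K))

def z_diff_b(y):
    """One z_diff^b from Eq. 5: average |z1 - z2| over L pairs sharing factor y."""
    diffs = []
    for _ in range(L):
        v1, v2 = sample_v(), sample_v()
        v2[y] = v1[y]                        # clamp the chosen factor k = y
        diffs.append(np.abs(W @ v1 - W @ v2))
    return np.mean(diffs, axis=0)            # shape (M,)

z_b = z_diff_b(y=2)
print(z_b.shape)  # (10,)
```

Samples of $z_{\mathrm{diff}}^{b}$ produced this way are what the linear classifier above is trained on.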

All disentanglement metric score results reported in the paper were calculated in the following manner. Ten replicas of each model with the same hyperparameters were trained using different random seeds to obtain disentangled representations. Each of the ten trained model replicas was evaluated three times using the disentanglement metric score algorithm, each time using a different random seed to initialise the linear classifier. We then discarded the bottom 50% of the thirty resulting scores and reported the remaining results. This was done to control for the outlier results from the few experiments that diverged during training.
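The aggregation protocol above (ten replicas, three classifier seeds each, discard the bottom half) can be sketched as follows; the scores are illustrative values, not results from the paper:

```python
import numpy as np

# 10 model replicas x 3 classifier seeds = 30 metric scores (values illustrative).
rng = np.random.default_rng(1)
scores = rng.uniform(60.0, 100.0, size=30)

# Discard the bottom 50% to control for the few runs that diverged during
# training, then report the remaining scores.
kept = np.sort(scores)[scores.size // 2:]
print(kept.size)  # 15
```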

The results reported in the table in Fig. 6 (left) were calculated using the following data. Ground truth uses the independent data generative factors v (our dataset did not contain any correlated data generative factors w). The PCA and ICA decompositions keep the first ten components (the PCA components explain 60.8% of the variance). β-VAE (β=4), VAE (β=1) and the untrained VAE have the same fully connected architecture with ten latent units z. InfoGAN uses the "inferred" values of the five continuous latents that were regularised with the mutual information objective during training.

A.5 CLASSIFYING THE GROUND TRUTH DATA GENERATIVE FACTORS VALUES
A.5 对真实数据生成因子值的分类


In order to further verify the validity of our proposed disentanglement metric we ran an extra quantitative test: we trained a linear classifier to predict the ground truth value of each of the five data generative factors used to generate the 2D shapes dataset. While this test does not measure disentangling directly (since it does not measure independence of the latent representation), a disentangled representation should make such a classification trivial. It can be seen in Table 2 that the representation learnt by β -VAE is on average the best representation for factor classification across all five factors. It is closely followed by DC-IGN. It is interesting to note that ICA does well only at encoding object identity, while PCA manages to learn a very good representation of object position.
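A linear (softmax) classifier of this kind can be fit with scikit-learn. The sketch below uses synthetic stand-ins for the latent codes and a four-valued factor label (a simple function of two latent dimensions), not the actual dataset or trained models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: 10-dim "latent codes" and a 4-valued factor label
# derived from two latent dimensions (illustrative, not the 2D shapes data).
z = rng.normal(size=(2000, 10))
factor = (z[:, 3] > 0).astype(int) + 2 * (z[:, 7] > 0).astype(int)

# Linear softmax classifier, analogous to the one used for Table 2.
clf = LogisticRegression(max_iter=1000)
clf.fit(z[:1500], factor[:1500])
acc = clf.score(z[1500:], factor[1500:])
```

Because the synthetic label is a linear function of the latents, accuracy is high here; for a real model, higher accuracy indicates the factor is easily decodable from the representation.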


Model          |   id  | scale | rotation | position X | position Y | average
PCA            | 43.38 | 36.08 |   5.96   |   60.66    |   60.15    |  41.25
ICA            | 59.60 | 34.40 |   7.61   |   25.96    |   25.12    |  30.54
DC-IGN         | 44.82 | 45.92 |  15.89   |   47.64    |   45.88    |  40.03
InfoGAN        | 44.47 | 40.91 |   6.39   |   27.51    |   23.73    |  28.60
VAE untrained  | 39.44 | 25.33 |   6.09   |   16.69    |   14.39    |  20.39
VAE            | 41.55 | 24.07 |   8.00   |   16.50    |   18.72    |  21.77
β-VAE          | 50.08 | 43.03 |  20.36   |   52.25    |   49.50    |  43.04

Table 2: Linear classifier classification accuracy for predicting the ground truth values for each data generative factor from different latent representations. Each factor could take a variable number of possible values: 3 for id, 6 for scale, 40 for rotation and 32 for position X or Y. Best performing model results in each column are printed in bold.



A.6 INTERPRETING NORMALISED β


We start with the β -VAE constrained optimisation formulation that we have derived in Sec. 2,

$\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, D_{KL}(q_\phi(z|x)\,\|\,p(z)) \tag{6}$
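As a concrete reference point for the derivation that follows, Eq. 6 can be sketched numerically. This is a minimal NumPy sketch assuming a Bernoulli decoder and a diagonal Gaussian posterior; the array shapes and the helper name are illustrative, not part of the paper's implementation:

```python
import numpy as np

def beta_vae_loss(x, logits, mu, log_var, beta=4.0):
    """Negative of Eq. 6 (the quantity to minimise) for a Bernoulli decoder
    and a diagonal Gaussian posterior q(z|x) = N(mu, diag(exp(log_var)))."""
    # E_q[log p(x|z)]: Bernoulli log-likelihood in numerically stable form.
    log_px = -np.sum(np.maximum(logits, 0) - logits * x
                     + np.log1p(np.exp(-np.abs(logits))), axis=1)
    # Analytic KL(q(z|x) || N(0, I)), summed over the M latents.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return float(np.mean(-(log_px - beta * kl)))

rng = np.random.default_rng(0)
x = (rng.random((8, 4096)) > 0.5).astype(float)   # fake 64x64 binary images
logits = rng.normal(size=(8, 4096))               # fake decoder outputs
mu, log_var = rng.normal(size=(8, 10)), 0.1 * rng.normal(size=(8, 10))
loss = beta_vae_loss(x, logits, mu, log_var, beta=4.0)
```

Setting `beta=1.0` recovers the standard VAE objective.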

We make the assumption that every pixel $n$ in $x \in \mathbb{R}^{N}$ is conditionally independent given $z$ (Doersch, 2016). The first term of Eq. 6 then becomes:

$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] = \mathbb{E}_{q_\phi(z|x)}\Big[\log \prod_n p_\theta(x_n|z)\Big] = \mathbb{E}_{q_\phi(z|x)}\Big[\sum_n \log p_\theta(x_n|z)\Big] \tag{7}$

Dividing both sides of Eq. 6 by N produces:

$\mathcal{L}(\theta, \phi; x, z, \beta) \propto \mathbb{E}_{q_\phi(z|x)}\,\mathbb{E}_n[\log p_\theta(x_n|z)] - \frac{\beta}{N}\, D_{KL}(q_\phi(z|x)\,\|\,p(z)) \tag{8}$

We design β-VAE to learn conditionally independent factors of variation in the data. Hence we assume conditional independence of every latent $z_m$ given $x$ (where $m \in \{1, \ldots, M\}$, and $M$ is the dimensionality of $z$). Since our prior $p(z)$ is an isotropic unit Gaussian, we can re-write the second term of Eq. 6 as:

$D_{KL}(q_\phi(z|x)\,\|\,p(z)) = \int_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z)} = \sum_m \int_{z_m} q_\phi(z_m|x) \log \frac{q_\phi(z_m|x)}{p(z_m)} \tag{9}$
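The factorisation in Eq. 9 means the total KL is simply a sum of per-latent terms. A quick numeric check, assuming a diagonal Gaussian posterior (the parameter values are illustrative):

```python
import numpy as np

# Illustrative posterior parameters for M = 3 latents.
mu = np.array([0.5, -1.2, 0.1])
sigma = np.array([0.8, 1.5, 0.3])

# Analytic KL(N(mu_m, sigma_m^2) || N(0, 1)) for each latent dimension.
kl_per_dim = 0.5 * (sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

# With a factorised q and an isotropic unit-Gaussian prior, the total KL
# of Eq. 9 is just the sum of the per-dimension terms.
total_kl = float(kl_per_dim.sum())
print(kl_per_dim, total_kl)
```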

Multiplying the second term in Eq. 8 by a factor of $\frac{M}{M}$ produces:

$\mathcal{L}(\theta, \phi; x, z, \beta) \propto \mathbb{E}_{q_\phi(z|x)}\,\mathbb{E}_n[\log p_\theta(x_n|z)] - \beta \frac{M}{N}\, \mathbb{E}_m\Big[\int_{z_m} q_\phi(z_m|x) \log \frac{q_\phi(z_m|x)}{p(z_m)}\Big] \tag{10}$

$= \mathbb{E}_{q_\phi(z|x)}\,\mathbb{E}_n[\log p_\theta(x_n|z)] - \beta \frac{M}{N}\, \mathbb{E}_m[D_{KL}(q_\phi(z_m|x)\,\|\,p(z_m))]$

Hence, using

$\beta_{\mathrm{norm}} = \frac{\beta M}{N}$

in Eq. 10 is equivalent to optimising the original β-VAE formulation from Sec. 2, but with the additional independence assumptions that let us calculate the data log likelihood and KL divergence terms in expectation over the individual pixels $x_n$ and individual latents $z_m$.
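As a worked example of the conversion, for the 2D shapes setup used in the paper (64x64x1 images, so $N = 4096$ pixels, and $M = 10$ latent units) with $\beta = 4$:

```python
# beta_norm = beta * M / N for the 2D-shapes setup: 64x64x1 images
# (N = 4096 pixels) and M = 10 latent units, with beta = 4 as in the paper.
N = 64 * 64
M = 10
beta = 4.0
beta_norm = beta * M / N
print(beta_norm)  # 0.009765625
```

This makes explicit why a raw $\beta > 1$ corresponds to a very small per-pixel, per-latent weight once the objective is normalised.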

A.7 RELATIONSHIP BETWEEN β AND ϵ


For a given $\epsilon$ we can solve the constrained optimisation problem in Eq. 3 (find the optimal $(\theta^*, \phi^*, \beta^*)$ such that $\Delta F(\theta^*, \phi^*, \beta^*) = 0$). We can then re-write our optimal solution to the original optimisation problem in Eq. 2 as a function of $\epsilon$:

$G(\theta^*(\epsilon), \phi^*(\epsilon)) = \mathbb{E}_{q_{\phi^*(\epsilon)}(z|x)}[\log p_{\theta^*(\epsilon)}(x|z)] \tag{11}$

Now $\beta$ can be interpreted as the rate of change of the optimal solution $(\theta^*, \phi^*)$ to $G$ when varying the constraint $\epsilon$:

$\frac{\delta G}{\delta \epsilon} = \beta^*(\epsilon) \tag{12}$

A.8 DATA CONTINUITY


We hypothesise that data continuity plays a role in guiding unsupervised models towards learning the correct data manifolds. To test this idea we measure how the degree of learnt disentangling changes with reduced continuity in the 2D shapes dataset. We trained a β-VAE with β=4 (Figure 7A) on subsamples of the original 2D shapes dataset, where we progressively decreased the generative factor sampling density. Reduction in data continuity negatively correlates with the average pixel-wise (Hamming) distance between two consecutive transforms of each object (normalised by the average number of pixels occupied by the two adjacent transforms of an object, to account for object scale). Figure 8 demonstrates that as the continuity in the data reduces, the degree of disentanglement in the learnt representations also drops. This effect holds after additional hyperparameter tuning and cannot solely be explained by the decrease in dataset size, since the same VAE can learn disentangled representations from a data subset that preserves data continuity but is approximately 55% of the original size (results not shown).
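The continuity measure described above can be sketched as follows; the sprite below is a toy 8x8 example, not the actual dataset, and the function name is ours:

```python
import numpy as np

def normalised_hamming(a, b):
    """Pixel-wise Hamming distance between two binary object transforms,
    normalised by their average occupied area to account for object scale."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return np.sum(a != b) / (0.5 * (a.sum() + b.sum()))

# Toy 8x8 sprite shifted right by one pixel (illustrative, not the dataset).
img1 = np.zeros((8, 8), dtype=bool); img1[2:5, 2:5] = True
img2 = np.zeros((8, 8), dtype=bool); img2[2:5, 3:6] = True
d = normalised_hamming(img1, img2)
print(d)  # 6 differing pixels / 9 occupied on average = 0.666...
```

Averaging this quantity over consecutive transforms of every object gives the abscissa of Figure 8.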




https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_15.jpg?x=536&y=475&w=715&h=613&r=0

Figure 8: Negative correlation between data transform continuity and the degree of disentangling achieved by β-VAE. The abscissa is the average normalised Hamming distance between each of the two consecutive transforms of each object. The ordinate is the disentanglement metric score. Disentangling performance is robust to Bernoulli noise added to the data at test time, as shown by the slowly degrading classification accuracy up to a 10% noise level, considering that the 2D objects occupy on average between 2-7% of the image depending on scale. Fluctuations in classification accuracy for similar Hamming distances are due to the different nature of the subsampled generative factors (i.e. symmetries are present in rotation but are lacking in position).


A.9 β-VAE SAMPLES


Samples from β -VAE that learnt disentangled (β=4) and entangled (β=1) representations can be seen in Figure 9.

A.10 EXTRA β -VAE TRAVERSAL PLOTS


We present extra latent traversal plots from β -VAE that learnt disentangled representations of 3D chairs (Figures 10-11) and CelebA (Figures 12-14) datasets. Here we show traversals from all informative latents from a large number of seed images.




https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_16.jpg?x=361&y=749&w=1068&h=702&r=0

Figure 9: Samples from β -VAE trained on the dataset of 2D shapes that learnt either a disentangled (left, β=4 ) or an entangled (right, β=1 ) representation of the data generative factors. It can be seen that sampling from an entangled representation results in some unrealistic looking samples. A disentangled representation that inverts the original data generation process does not suffer from such errors.


[Figure image not captured; legible panel labels: z (width/size), z (azimuth).]
Figure 10: Latent traversal plots from β -VAE that learnt disentangled representations on the 3D chairs dataset.


[Figure image not captured; legible panel labels: z3 (leg style), z4 (back height).]

Figure 11: Latent traversal plots from β -VAE that learnt disentangled representations on the 3D chairs dataset.



https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_19.jpg?x=341&y=364&w=1113&h=1561&r=0

Figure 12: Latent traversal plots from β -VAE that learnt disentangled representations on the CelebA dataset.



https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_20.jpg?x=340&y=362&w=1111&h=1563&r=0

Figure 13: Latent traversal plots from β -VAE that learnt disentangled representations on the CelebA dataset.



https://cdn.noedgeai.com/0196aa96-f22f-748c-a3c4-b117f7921f5a_21.jpg?x=481&y=271&w=833&h=1755&r=0

Figure 14: Latent traversal plots from β -VAE that learnt disentangled representations on the CelebA dataset.