Faster Orthogonal Parameterization with Householder Matrices
Alexander Mathiasen Frederik Hvilshoj Jakob Rødsgaard Jørgensen Anshul Nasery Davide Mottin
Abstract
Orthogonal matrices have been used in several Normalizing Flows (Tomczak & Welling, 2016; van den Berg et al., 2018). Orthogonal matrices are attractive since they are easy to invert and have Jacobian determinant one. Their main downside is the additional computational resources required to perform gradient descent. We identify a computational bottleneck in previous work on Householder matrices, and introduce a novel algorithm, FastH, which circumvents the bottleneck and is substantially faster than a previous method in practice.
1. Introduction
Normalizing Flows (NF) are a type of generative model with several attractive properties like exact likelihood, fast sampling and substantial memory savings (Kingma & Dhariwal, 2018). They are capable of representing complicated distributions by transforming a simple base distribution through an invertible neural network to attain a model distribution. The exact likelihood can then be computed through the change of variables formula due to the invertibility of the network.
This has motivated much research into invertible neural networks. However, the change of variables formula also requires the computation of the Jacobian determinant of the network. Previous research thus attempts to design invertible neural networks which also allow efficient computation of their Jacobian determinant. This makes orthogonal matrices very attractive, since they are easy to invert and have unit Jacobian determinant.
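For reference, the change of variables formula reads as follows (a standard identity; the symbols $f$, $p_Z$ and $p_X$ for the invertible network, base distribution and model distribution are chosen here for illustration):
\[
\log p_X(x) \;=\; \log p_Z\big(f(x)\big) \;+\; \log\Big|\det \frac{\partial f(x)}{\partial x}\Big|.
\]
An orthogonal linear map contributes zero to the log-determinant term, which is what makes it attractive in this setting.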
Perhaps unsurprisingly, much previous work adopts orthogonal matrices in the construction of Normalizing Flows (Tomczak & Welling, 2016; van den Berg et al., 2018; Hoogeboom et al., 2019; Golinski et al., 2019). However, the use of orthogonal matrices introduces one complication.
Figure 1. Comparison of previous approaches, see Section 4.
One needs to perform gradient descent wrt. weight matrices that are constrained to be orthogonal. Previous work solves this issue using different methods, which can roughly be divided into three groups: the matrix exponential, the Cayley transform and Householder matrices.
The Householder method has the best time complexity for computing $WX$, where $W \in \mathbb{R}^{d \times d}$ is an orthogonal matrix and $X \in \mathbb{R}^{d \times m}$ is a mini-batch with $m$ examples. In particular, the Householder method takes $O(d^2 m)$ time, compared to $O(d^3 + d^2 m)$ time for both alternative methods, which first construct $W$ explicitly. For $m < d$ the Householder method is thus roughly $d/m$ times faster. Contrary to this superior time complexity, we find that methods based on Householder matrices are often slower on GPUs in practice, see Figure 1.
We identify the algorithmic issue which causes the discrepancy between theory and practice. It turns out that the previous Householder methods entail the computation of $d$ sequential inner products, which is ill-suited for parallel hardware like GPUs because the GPU cannot utilize all its cores. For example, if a GPU has 4000 cores and computes sequential inner products on 100-dimensional vectors, it can only utilize 100 cores simultaneously, leaving the remaining 3900 cores to run idle.
We introduce a novel algorithm, FastH, which increases core utilization, leaving fewer cores to run idle. FastH retains the desirable $O(d^2 m)$ time complexity while reducing the number of sequential operations. On a mini-batch of size $m$, FastH performs $O(d/m)$ sequential matrix-matrix operations instead of $O(d)$ sequential vector-vector operations. In practice, FastH is faster than all previous methods, see Figure 1. Code: https://github.com/alexandermath/fasth.
Aarhus University; Indian Institute of Technology, Bombay. Correspondence to: Alexander Mathiasen.
Second workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (ICML 2020)
2. Background
In this section, we sketch the method based on Householder matrices and explain its computational bottleneck. A matrix $W \in \mathbb{R}^{d \times d}$ is orthogonal if $W^T W = I$. All orthogonal matrices can be decomposed into a product of $d$ Householder matrices (Uhlig, 2001):
\[
W = H_1 H_2 \cdots H_d, \qquad H_i = I - 2\,\frac{v_i v_i^T}{\|v_i\|_2^2}. \tag{1}
\]
Householder matrices satisfy several useful properties. In particular, the product in Equation (1) remains orthogonal under gradient descent updates of the Householder vectors $v_i$ (Mhammedi et al., 2017). Furthermore, all products of Householder matrices are orthogonal, and any orthogonal matrix can be decomposed as a product of Householder matrices (Uhlig, 2001). Householder matrices thus allow us to perform gradient descent over all orthogonal matrices.
Multiplication. Normal matrix multiplication $WX$ for matrices $W \in \mathbb{R}^{d \times d}$ and $X \in \mathbb{R}^{d \times m}$ takes $O(d^2 m)$ time. This is also true when $W$ is a product of Householder matrices. The product $WX = H_1(H_2(\cdots(H_d X)))$ can be computed by $d$ Householder multiplications. If done sequentially, as indicated by the parentheses, each Householder multiplication can be computed in $O(dm)$ time (Zhang et al., 2018). All $d$ multiplications can then be done in $O(d^2 m)$ time, no more than computing $WX$ normally.
However, the $O(d^2 m)$ time complexity requires us to multiply each Householder matrix sequentially. Each Householder multiplication entails computing an inner product, see Equation (1), which means the multiplication $WX$ requires the computation of $d$ inner products sequentially. Such sequential computation of inner products is slow on parallel hardware like GPUs.
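To make the bottleneck concrete, the following PyTorch sketch (ours, not the paper's implementation) multiplies $W = H_1 \cdots H_d$ onto a mini-batch in the standard sequential way; each of the $d$ steps is cheap but must wait for the previous one:

```python
import torch

def sequential_householder_mult(V, X):
    """Baseline: multiply W = H_1 ... H_d onto a mini-batch X of shape (d, m),
    one Householder matrix at a time.  Each step costs O(dm), but the d steps
    run strictly one after another, which under-utilizes a GPU."""
    A = X
    for v in V.flip(0):                  # H_d is applied to X first
        A = A - 2.0 * torch.outer(v, v @ A) / (v @ v)
    return A
```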
Our main contribution is a novel parallel algorithm, FastH, which resolves the issue with sequential inner products without increasing the time complexity. FastH takes $O(d^2 m)$ time but performs $O(d/m)$ sequential matrix-matrix operations instead of $O(d)$ sequential vector-vector operations (inner products). In practice, FastH is substantially faster than the normal sequential Householder method, see Figure 1 and Section 4.1.
Mathematical Setting. We simply count the number of sequential matrix-matrix and vector-vector operations, counting work only once when it can be done in parallel with other sequential work. For example, if one chain of $t$ inner products is processed sequentially while, in parallel, another chain of $t$ inner products is processed sequentially, we count $t$ sequential vector-vector operations.
Our goal is to create an algorithm with few sequential operations that solves the following problem: given an input $X \in \mathbb{R}^{d \times m}$ with $m \le d$ and $d$ Householder matrices $H_1, \ldots, H_d$, compute the output $A = H_1 H_2 \cdots H_d X$. For simplicity, we assume $m$ divides $d$.
Since each $H_i$ is a $d \times d$ matrix, it would take $O(d^3)$ time just to read the input if the Householder matrices were given explicitly. Therefore, we represent each Householder matrix $H_i$ by its associated Householder vector $v_i$, as in Equation (1).
A simplified version of FastH proceeds as follows: divide the Householder product into $d/m$ smaller products $P_1, \ldots, P_{d/m}$ so each $P_i$ is a product of $m$ Householder matrices:
\[
H_1 H_2 \cdots H_d = P_1 P_2 \cdots P_{d/m}, \qquad P_i = H_{(i-1)m+1} \cdots H_{im}. \tag{2}
\]
All $d/m$ products $P_i$ can be computed in parallel. The output can then be computed as $A = P_1(P_2(\cdots(P_{d/m} X)))$ instead of $A = H_1(H_2(\cdots(H_d X)))$, which reduces the number of sequential matrix multiplications from $d$ to $d/m$.
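A minimal sketch of this simplified scheme follows (explicitly forming each sub-product, which is exactly what causes the two issues discussed next; the helper names are ours):

```python
import torch

def householder_matrix(v):
    """H = I - 2 v v^T / ||v||^2 for a Householder vector v of length d."""
    d = v.shape[0]
    return torch.eye(d) - 2.0 * torch.outer(v, v) / (v @ v)

def simplified_fasth(V, X):
    """V holds d Householder vectors (shape (d, d)); X is a (d, m) mini-batch
    with m dividing d.  The d/m sub-products are independent of each other and
    could be formed in parallel on a GPU; here they are formed in a loop."""
    d, m = X.shape
    subproducts = []
    for i in range(d // m):
        P = torch.eye(d)
        for v in V[i * m:(i + 1) * m]:
            P = P @ householder_matrix(v)
        subproducts.append(P)
    A = X
    for P in reversed(subproducts):      # only d/m sequential matmuls
        A = P @ A
    return A
```

For the same inputs, simplified_fasth(V, X) agrees with the sequential baseline sketched earlier; only the schedule of the work changes.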
This algorithm computes the correct output $A$; however, the time complexity increases due to two issues. First, multiplying each $d \times d$ product $P_i$ with a $d \times m$ matrix takes $O(d^2 m)$ time, a total of $O(d^3)$ time for all $d/m$ products. Second, we need to compute all $d/m$ products in $O(d^2 m)$ time, so each product must be computed in $O(dm^2)$ time. If we only use the Householder structure, it takes $O(d^2 m)$ time to compute each $P_i$, which is not fast enough.
Both issues can be resolved, yielding an $O(d^2 m)$ time algorithm. The key ingredient is a linear algebra result that dates back to 1987. The result is restated in Lemma 1.
Lemma 1. (Bischof & Van Loan, 1987) For any $m$ Householder matrices $H_1, \ldots, H_m$ there exist $W, Y \in \mathbb{R}^{d \times m}$ such that
\[
H_1 H_2 \cdots H_m = I - W Y^T.
\]
Both $W$ and $Y$ can be computed by $m$ sequential Householder multiplications in $O(dm^2)$ time.
Proof. See (Bischof & Van Loan, 1987) Method 2.
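A sketch of the accumulation behind Lemma 1 (the standard WY update; variable names are ours, and the paper's actual implementation is a CUDA kernel):

```python
import torch

def compute_wy(V):
    """Accumulate a WY representation of P = H_1 H_2 ... H_m with
    H_i = I - 2 v_i v_i^T / ||v_i||^2, so that P = I - W @ Y.T.
    V has shape (m, d); W and Y have shape (d, m).  Each of the m steps is
    one Householder multiplication costing O(dm), so the total is O(dm^2)."""
    m, d = V.shape
    W = torch.zeros(d, m)
    Y = torch.zeros(d, m)
    for j in range(m):
        v = V[j]
        beta = 2.0 / (v @ v)
        Pv = v - W[:, :j] @ (Y[:, :j].T @ v)    # P_{j-1} v
        W[:, j] = beta * Pv
        Y[:, j] = v
    return W, Y

def apply_wy(W, Y, X):
    """Multiply the represented product onto X: P X = X - W (Y^T X),
    which costs O(dm^2) for X of shape (d, m)."""
    return X - W @ (Y.T @ X)
```

A quick sanity check is that apply_wy(*compute_wy(V), X) agrees with multiplying $H_1, \ldots, H_m$ onto X one Householder matrix at a time.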
For completeness, we provide pseudo-code in Algorithm 1. Theorem 1 states the properties of Algorithm 1 (correctness, total time, and the number of sequential matrix multiplications), and its proof clarifies how Lemma 1 solves both issues outlined above.
Proof. Correctness. Each iteration of Step 2 multiplies one sub-product onto the running matrix using its WY form, $P_i Z = Z - W_i(Y_i^T Z)$.
Therefore, at termination, the running matrix equals $P_1 P_2 \cdots P_{d/m} X$. In Step 1, we used Lemma 1 to compute the WY forms of the $P_i$'s such that $P_1 \cdots P_{d/m} = H_1 \cdots H_d$, so the result is the desired output $A$.
Time complexity. Consider the for loop in Step 1. By Lemma 1, each iteration takes $O(dm^2)$ time. Therefore, the total time of the $d/m$ iterations is $O(d^2 m)$.
Consider an iteration of the loop in Step 2. The time of the iteration is asymptotically dominated by its two matrix multiplications. Since $X$ and all $W_i, Y_i$ are $d \times m$ matrices, it takes $O(dm^2)$ time to compute both matrix multiplications. There are $d/m$ iterations, so the total time becomes $O(d^2 m)$.
Number of Sequential Operations. Each iteration in Step 2 performs two sequential matrix multiplications. There are $d/m$ sequential iterations, which gives a total of $O(d/m)$ sequential matrix multiplications.
Each iteration in Step 1 performs $m$ sequential Householder multiplications to construct $W_i$ and $Y_i$, see Lemma 1. Since the iterations of Step 1 are run in parallel, the algorithm performs no more than $O(d/m + m)$ sequential matrix multiplications.
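Putting the pieces together, a Python-level sketch of Algorithm 1 follows (reusing compute_wy and apply_wy from above; the real implementation computes the WY decompositions of Step 1 in parallel in CUDA, whereas the list comprehension below is sequential):

```python
def fasth_forward(V, X):
    """V holds d Householder vectors (shape (d, d)); X is a (d, m) mini-batch
    with m dividing d.  Returns H_1 ... H_d X."""
    d, m = X.shape
    # Step 1: WY decomposition of each sub-product of m Householder matrices.
    wys = [compute_wy(V[i * m:(i + 1) * m]) for i in range(d // m)]
    # Step 2: d/m sequential WY multiplications onto X.
    A = X
    for W, Y in reversed(wys):           # the last sub-product hits X first
        A = apply_wy(W, Y, A)
    return A
```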
Remark. Supplementary Material 8.1 extends this section's techniques to compute gradients, see Algorithm 2. For simplicity, this section had Algorithm 1 compute only the output $A$; however, in Algorithm 2 it will be convenient to assume the intermediate results of Step 2 are precomputed. Each intermediate result can be saved during Step 2 of Algorithm 1 without increasing the asymptotic memory consumption.
Trade-off. Both Algorithm 1 and Algorithm 2 can be extended to take a parameter $k$ that controls a trade-off between total time complexity and the amount of parallelism. This can be achieved by changing the number of Householder matrices in each product from the mini-batch size $m$ to an integer $k$. The resulting algorithm performs $O(d/k + k)$ sequential matrix multiplications, with total time and memory consumption depending on $k$. This extension has the practical benefit that one can try different values of $k$ and choose the one that yields superior performance on a particular hardware setup. The number of sequential matrix multiplications is minimized when $k = \Theta(\sqrt{d})$. For a fixed problem size, we can find the best $k$ by trying all candidate values. The search needs to be done only once, and in practice its cost is negligible on the hardware we describe in Section 4.
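For instance, the one-off search could look like the following sketch, where run_step is a hypothetical user-supplied callable performing one gradient-descent step with sub-products of k Householder matrices:

```python
import time
import torch

def pick_subproduct_size(run_step, candidates=(2, 4, 8, 16, 32, 64), reps=10):
    """One-off search for the sub-product size k that is fastest on the
    current hardware.  `run_step(k)` is assumed to perform one forward and
    backward pass using sub-products of k Householder matrices."""
    timings = {}
    for k in candidates:
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(reps):
            run_step(k)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        timings[k] = (time.time() - start) / reps
    return min(timings, key=timings.get)
```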
To simulate a realistic machine learning environment, we performed all experiments on a standard machine learning server using a single NVIDIA RTX 2080 Ti.
This subsection investigates the time it takes for FastH to perform a gradient descent update wrt. an orthogonal matrix. The time of FastH is compared against four alternative algorithms. The first two alternatives are methods based on the matrix exponential and the Cayley map respectively (Golinski et al., 2019). The next two alternatives are the sequential and parallel algorithms from (Zhang et al., 2018), which both rely on Householder matrices like FastH. Both articles open-sourced their implementations which we use in our experiments.
The performance of the sequential algorithm is particularly interesting, because it is the same algorithm most previous work on Normalizing Flows adopt (Tomczak & Welling, 2016; van den Berg et al., 2018; Hoogeboom et al., 2019). The only difference is that (Zhang et al., 2018) implemented the algorithm in CUDA instead of PyTorch (Paszke et al., 2019) or TensorFlow (Abadi et al., 2015).
We measure the time of a gradient descent step with a $d \times d$ weight matrix and a $d \times m$ mini-batch. We ran each algorithm 100 times, and we report the mean time with error bars showing the standard deviation of the running time over the 100 repetitions. Figure 2 depicts the running time on the y-axis as the size of the matrices increases on the x-axis. For most matrix sizes, FastH is faster than all previous approaches; at the remaining sizes it is faster than all previous approaches except the parallel algorithm. FastH at the larger sizes is even faster than the Sequential algorithm at the smaller sizes. To put the matrix sizes into perspective, we mention that previous work employs matrices of comparable size, e.g., in (Kingma & Dhariwal, 2018) or in (Zhang et al., 2018).
The Householder Decomposition. The Householder decomposition of orthogonal matrices has been used in much previous work, for example, (Tomczak & Welling, 2016; Mhammedi et al., 2017; Zhang et al., 2018; van den Berg et al., 2018; Hoogeboom et al., 2019). Previous work typically uses a type of sequential algorithm that performs $d$ sequential inner products. To circumvent the resulting long computation time on GPUs, previous work often suggests limiting the number of Householder matrices, which limits the expressiveness of the orthogonal matrix, introducing a trade-off between computation time and expressiveness.
FastH takes the same asymptotic time as the sequential algorithm; however, it performs fewer sequential operations, making it substantially faster in practice. Since FastH computes the same output as the previous sequential algorithms, it can be used in previous work without degrading the performance of their models. In particular, FastH can be used to either (1) increase expressiveness at no additional computational cost or (2) speed up previous applications at the same level of expressiveness.
Different Orthogonal Parameterizations. Previous work explores different approaches to orthogonal parameterizations, including methods based on Householder matrices (Mhammedi et al., 2017), the matrix exponential (Lezcano-Casado & Martínez-Rubio, 2019) and the Cayley map (Golinski et al., 2019). (Golinski et al., 2019) raised a theoretical concern about the use of Householder matrices. The methods based on the matrix exponential and the Cayley map have desirable provable guarantees which, currently, the Householder decomposition is not known to possess. This might make it desirable to use methods based on the matrix exponential or the Cayley map instead of methods based on Householder matrices. However, the methods based on the matrix exponential and the Cayley map use $O(d^3)$ time to construct the orthogonal matrix, which the Householder method circumvents. This allows Householder methods to perform faster multiplications. In particular, for a mini-batch $X \in \mathbb{R}^{d \times m}$, methods based on Householder matrices can compute the product $WX$ in $O(d^2 m)$ time, roughly $d/m$ times faster when $m < d$.
Determinants and Matrix Decompositions. (Kingma & Dhariwal, 2018) propose to speed up determinant computations by using the PLU decomposition $W = PLU$, where $P$ is a permutation matrix and $L, U$ are lower and upper triangular. This allows the determinant to be computed in $O(d)$ time instead of $O(d^3)$. (Hoogeboom et al., 2019) point out that a fixed permutation matrix limits flexibility. To fix this issue, they suggest using the QR decomposition, where $R$ is a rectangular matrix and $Q$ is orthogonal. They suggest making $Q$ orthogonal by using the Householder decomposition, which FastH can speed up.
To make FastH widely accessible, we wrote a PyTorch implementation 'nn.Orthogonal' which can be used like 'nn.Linear'. Besides implementing the default 'forward' function, it also contains functions for the inverse and the log Jacobian determinant. While implementing FastH, we found that Python did not provide an adequate level of parallelization. We therefore implemented FastH in CUDA to fully utilize the parallel capabilities of GPUs. Our code can be found at https://github.com/alexandermath/fasth.
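For illustration, a minimal, self-contained version of such a layer is sketched below using the naive sequential Householder product rather than the FastH CUDA kernel; the class and method names (Orthogonal, inverse, log_det) are chosen here for illustration and are not necessarily the repository's actual API:

```python
import torch
import torch.nn as nn

class Orthogonal(nn.Module):
    """Minimal sketch of an orthogonal layer parameterized by d Householder
    vectors.  Uses the naive sequential product, not the FastH CUDA kernel."""
    def __init__(self, d):
        super().__init__()
        self.V = nn.Parameter(torch.randn(d, d))

    def _mult(self, X, transpose=False):
        # W = H_1 ... H_d, so W X applies H_d first; W^T X applies H_1 first.
        vs = self.V if transpose else self.V.flip(0)
        A = X
        for v in vs:
            A = A - 2.0 * torch.outer(v, v @ A) / (v @ v)
        return A

    def forward(self, X):       # X has shape (d, batch)
        return self._mult(X)

    def inverse(self, Y):       # W^{-1} = W^T, so no matrix inversion needed
        return self._mult(Y, transpose=True)

    def log_det(self):          # log |det W| = 0 for an orthogonal W
        return torch.zeros(())
```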
We identified an algorithmic issue with the previous use of Householder matrices in Neural Networks. FastH mitigates the issue and is substantially faster in practice, without introducing any loss of quality. In other words, FastH computes the same thing as the previous algorithms, just faster. FastH can thus be used to speed up Neural Networks like (Tomczak & Welling, 2016; van den Berg et al., 2018; Hoogeboom et al., 2019) without any downsides.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
Golinski, A., Lezcano-Casado, M., and Rainforth, T. Improving Normalizing Flows via Better Orthogonal Parameterizations. In ICML Workshop on Invertible Neural Networks and Normalizing Flows, 2019.
Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The Reversible Residual Network: Backpropagation Without Storing Activations. In NIPS, 2017.
Hoogeboom, E., van den Berg, R., and Welling, M. Emerging Convolutions for Generative Normalizing Flows. In ICML, 2019.
Kingma, D. P. and Dhariwal, P. Glow: Generative Flow with Invertible 1x1 Convolutions. In NeurIPS. 2018.
Lezcano-Casado, M. and Martínez-Rubio, D. Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group. In ICML, 2019.
Mhammedi, Z., Hellicar, A., Rahman, A., and Bailey, J. Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections. In ICML, 2017.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS. 2019.
Tomczak, J. M. and Welling, M. Improving Variational Auto-Encoders using Householder Flow. arXiv preprint, 2016.
Uhlig, F. Constructive Ways for Generating (Generalized) Real Orthogonal Matrices as Products of (Generalized) Symmetries. Linear Algebra and its Applications, 2001.
This subsection extends the techniques from Section 3.1 to handle gradient computations. Our goal is to create an algorithm with few sequential operations that solves the following problem: given $X$, $A = H_1 \cdots H_d X$ and $\partial L / \partial A$ for some loss function $L$, compute $\partial L / \partial X$ and $\partial L / \partial v_i$ for every Householder vector $v_i$, where each Householder matrix is $H_i = I - 2 v_i v_i^T / \|v_i\|_2^2$.
Since each $P_i$ is a $d \times d$ matrix, it would take $O(d^3/m)$ time just to read all $d/m$ of them explicitly. Therefore, we represent each $P_i$ by its WY decomposition $P_i = I - W_i Y_i^T$.
On a high level, FastH has two steps.
Step 1. Sequentially compute the gradient of the loss with respect to each intermediate result of Algorithm 1. Backpropagating through the multiplication by a sub-product amounts to one further WY multiplication, since $P_i^T = I - Y_i W_i^T$, so that for any gradient matrix $G$,
\[
P_i^T G = G - Y_i \big(W_i^T G\big). \tag{3}
\]
This also gives the gradient wrt. the input, since the last intermediate gradient computed in this way is exactly $\partial L / \partial X$.
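A sketch of this step in code (reusing the WY representation produced by compute_wy above; the indexing convention is ours):

```python
def backward_through_subproducts(wys, grad_out):
    """wys = [(W_1, Y_1), ..., (W_{d/m}, Y_{d/m})] for sub-products with
    output = P_1 P_2 ... P_{d/m} X.  Given dL/d(output), propagate the gradient
    back through the sub-products via P^T G = G - Y (W^T G).  Returns the
    gradient wrt. every intermediate result; the last entry is dL/dX."""
    grads = [grad_out]
    for W, Y in wys:                     # P_1^T is applied first
        G = grads[-1]
        grads.append(G - Y @ (W.T @ G))
    return grads
```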
Step 2. Use the gradients from Step 1 to compute the gradient $\partial L / \partial v_j$ for all Householder vectors $v_j$. This problem can be split into $d/m$ subproblems, which can be solved in parallel, one subproblem for each sub-product $P_i$.
Details. For completeness, we state pseudo-code in Algorithm 2, which we now explain with the help of Figure 3.
Figure 3a depicts a computational graph of Step 1 in Algorithm 2. In the figure, a WY-represented sub-product and an incoming gradient both have directed edges to a multiplication node. The output of this multiplication is the next gradient, by Equation (3). This can be repeated to obtain the gradient with respect to every intermediate result.
Step 2 computes the gradient of all $d$ Householder vectors. This computation is split into $d/m$ distinct subproblems that can be solved in parallel. Each subproblem concerns one sub-product and the gradients attached to it, see lines 10-12 in Algorithm 2.
To ease notation, we index the Householder matrices of a single sub-product by $H_1, \ldots, H_m$. Furthermore, we let $a_{m+1}$ denote the activation entering the sub-product and $a_j = H_j a_{j+1}$, so that $a_1$ is its output; we write $g_j = \partial L / \partial a_j$. The goal of each subproblem is to compute the gradients wrt. the Householder vectors of its sub-product. To compute the gradient of $v_j$, we need $a_{j+1}$ and $g_j$, which can be computed by
\[
a_{j+1} = H_j a_j, \tag{4}
\]
\[
g_{j+1} = H_j g_j, \tag{5}
\]
using the facts that $H_j^{-1} = H_j$ and $H_j^T = H_j$.
Figure 3b depicts how $a_{j+1}$ and $g_{j+1}$ can be computed given $a_j$ and $g_j$. Given $a_{j+1}$ and $g_j$, we can compute $\partial L / \partial v_j$ as done in (Zhang et al., 2018; Mhammedi et al., 2017). For completeness, we restate the needed equation in our notation, see Equation (6). Let $a^{(k)}$ be the $k$'th column of $a_{j+1}$ and let $g^{(k)}$ be the $k$'th column of $g_j$. The sum of the gradient over a mini-batch of size $m$ is then:
\[
\frac{\partial L}{\partial v_j} = -\frac{2}{\|v_j\|_2^2} \sum_{k=1}^{m} \left[ (v_j^T a^{(k)})\, g^{(k)} + (v_j^T g^{(k)})\, a^{(k)} - \frac{2\,(v_j^T a^{(k)})(v_j^T g^{(k)})}{\|v_j\|_2^2}\, v_j \right]. \tag{6}
\]
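As a sanity check, the per-vector gradient can be written and verified against autograd as follows (a sketch with our own variable names, consistent with the form of Equation (6) above):

```python
import torch

def householder_vector_grad(v, A, G):
    """Gradient of the loss wrt. a Householder vector v, summed over a
    mini-batch.  A holds the activations entering H = I - 2 v v^T / ||v||^2
    (shape (d, m)); G holds dL/d(H A) (shape (d, m))."""
    beta = 2.0 / (v @ v)
    va = v @ A            # entries v^T a^(k), shape (m,)
    vg = v @ G            # entries v^T g^(k), shape (m,)
    return -beta * (G @ va + A @ vg - beta * (va @ vg) * v)

# Quick check against autograd:
d, m = 5, 3
v = torch.randn(d, requires_grad=True)
A = torch.randn(d, m)
H_A = A - 2.0 * torch.outer(v, v @ A) / (v @ v)
loss = H_A.sum()
loss.backward()
G = torch.ones(d, m)      # dL/d(HA) for loss = sum(HA)
assert torch.allclose(v.grad, householder_vector_grad(v.detach(), A, G), atol=1e-4)
```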
Theorem 2. Algorithm 2 computes $\partial L/\partial X$ and $\partial L/\partial v_1, \ldots, \partial L/\partial v_d$ in $O(d^2 m)$ time with $O(d/m + m)$ sequential matrix multiplications.
Proof. Correctness. FastH computes gradients by the same equations as (Zhang et al., 2018), so in most cases, we show correctness by clarifying how FastH computes the same thing, albeit faster.
Figure 3. Computational graph of Step 1 and the $i$'th subproblem in Step 2 from Algorithm 2.
Consider the gradients computed in Step 1: by Equation (3), each is obtained by multiplying the output gradient with the transposes of the sub-products applied so far, which is the same as backpropagating through the corresponding Householder matrices one at a time.
This is the same as that computed in (Zhang et al., 2018).
Consider Step 2. Both $g_j$ and $\partial L/\partial v_j$ are computed as done in (Zhang et al., 2018). The activation $a_{j+1}$ is computed using Equation (4), similar to backpropagation without storing activations (Gomez et al., 2017), but using the fact that a Householder matrix is its own inverse, $H_j^{-1} = H_j$.
Time Complexity. In Step 1, each iteration of the for loop performs two matrix multiplications. Due to the WY decomposition, $P_i^T$ can be multiplied onto a $d \times m$ matrix in $O(dm^2)$ time, since $W_i$ and $Y_i$ are $d \times m$ matrices. The computation is repeated $d/m$ times and takes a total of $O(d^2 m)$ time.
In Step 2, line 14 performs two Householder matrix multiplications, which take $O(dm)$ time, see Equations (4) and (5). In line 15 the gradient is computed by Equation (6); this sum also takes $O(dm)$ time to compute. Both computations on lines 14 and 15 are repeated for all $d$ Householder matrices, see lines 10 and 13. Therefore, the total time is $O(d^2 m)$.
Number of Sequential Operations. Step 1 performs $O(d/m)$ sequential matrix operations. Lines 13-16 of Step 2 perform $O(m)$ sequential matrix multiplications. Since each iteration of lines 10-17 is run in parallel, the algorithm performs no more than $O(d/m + m)$ sequential matrix multiplications.
The parallel algorithm from (Zhang et al., 2018) halted for larger values of $d$. The failing code was not part of the main computation, which allowed us to remove it and still get a good estimate of the running time of the parallel algorithm. We emphasize that removing the corresponding code makes the parallel algorithm faster. The experiments thus demonstrate that FastH is faster than a lower bound on the running time of the parallel algorithm.