
Faster Orthogonal Parameterization with Householder Matrices


Alexander Mathiasen 1 Frederik Hvilshoj 1 Jakob Rødsgaard Jørgensen 1 Anshul Nasery 12 Davide Mottin 1

Abstract


Orthogonal matrices have been used in several Normalizing Flows (Tomczak & Welling, 2016; van den Berg et al., 2018). Orthogonal matrices are attractive since they are easy to invert and have Jacobian determinant one. Their main downside is the additional computational resources required to perform gradient descent. We identify a computational bottleneck in previous work on Householder matrices, and introduce a novel algorithm, FastH, which circumvents the bottleneck and is up to 29× faster than a previous method.

1. Introduction


Normalizing Flows (NF) are a type of generative model with several attractive properties like exact likelihood, fast sampling and substantial memory savings (Kingma & Dhariwal, 2018). They are capable of representing complicated distributions by transforming a simple base distribution $z \sim P_z$ through an invertible neural network $f$ to attain a model distribution $f(z) \sim P_{\text{model}}$. The exact likelihood $p_{\text{model}}(x)$ can then be computed through a change of variables formula due to the invertibility of $f$.

This has motivated much research into invertible neural networks. However, the change of variables formula also requires the computation of the Jacobian determinant of $f$. Previous research thus attempts to design invertible neural networks which also allow efficient computation of their Jacobian determinant. This makes orthogonal matrices very attractive, since they are easy to invert, $U^{-1} = U^T$, and have unit Jacobian determinant, $|\det(\partial Ux/\partial x)| = 1$.

Perhaps unsurprisingly, much previous work adopts orthogonal matrices in the construction of Normalizing Flows (Tomczak & Welling, 2016; van den Berg et al., 2018; Hoogeboom et al., 2019; Golinski et al., 2019). However, the use of orthogonal matrices introduces one complication.




Figure 1. Comparison of previous approaches, see Section 4.


One needs to perform gradient descent wrt. weight matrices that are constrained to be orthogonal. Previous work solves this issue using different methods, which can roughly be divided into three groups: matrix exponential, Cayley transform and Householder matrices.

The Householder method has the best time complexity for multiplying $UX$, where $U \in \mathbb{R}^{d \times d}$ is an orthogonal matrix and $X \in \mathbb{R}^{d \times m}$ is a mini-batch with $m$ examples. In particular, the Householder method takes $O(d^2 m)$ time compared to $O(d^3 + d^2 m)$ time for both alternative methods. For $1 < m < d$ the Householder method is thus $O(d/m)$ times faster. Contrary to the superior time complexity, we find that methods based on Householder matrices are often slower on GPUs in practice, see Figure 1.

We identify the algorithmic issue which causes the discrepancy between theory and practice. It turns out that the previous Householder methods entail the computation of $O(d)$ sequential inner products, which is ill-suited for parallel hardware like a GPU because the GPU cannot utilize all its cores. For example, if a GPU has 4000 cores and computes sequential inner products on 100-dimensional vectors, it can only utilize 100 cores simultaneously, leaving the remaining 3900 cores to run idle.

We introduce a novel algorithm, FastH, which increases core utilization, leaving fewer cores to run idle. FastH retains the desirable $O(d^2 m)$ time complexity while reducing the number of sequential operations. On a mini-batch of size $m > 1$, FastH performs $O(d/m + m)$ sequential matrix-matrix operations instead of $O(d)$ sequential vector-vector operations. In practice, FastH is faster than all previous methods, see Figure 1. Code: https://github.com/alexandermath/fasth.


1 Aarhus University, 2 Indian Institute of Technology, Bombay. Correspondence to: Alexander Mathiasen.

Second workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (ICML 2020)


2. Background


In this section, we sketch the method based on Householder matrices and explain its computational bottleneck. A matrix $U \in \mathbb{R}^{d \times d}$ is orthogonal if $U^T = U^{-1}$. All orthogonal $d \times d$ matrices can be decomposed into a product of $d$ Householder matrices $H_1, \ldots, H_d$ (Uhlig, 2001):

$$U = \prod_{i=1}^{d} H_i \qquad H_i = I - 2\,\frac{v_i v_i^T}{\|v_i\|_2^2} \qquad v_i \in \mathbb{R}^d. \tag{1}$$

Householder matrices satisfy several useful properties. In particular, the matrix $U$ remains orthogonal under gradient descent updates $v_i \leftarrow v_i - \eta \nabla_{v_i} L$ (Mhammedi et al., 2017). Furthermore, all products of Householder matrices are orthogonal, and any $d \times d$ orthogonal matrix can be decomposed as a product of $d$ Householder matrices (Uhlig, 2001). Householder matrices thus allow us to perform gradient descent over all orthogonal matrices.

Multiplication. Normal matrix multiplication $UX$ for matrices $U \in \mathbb{R}^{d \times d}$, $X \in \mathbb{R}^{d \times m}$ takes $O(d^2 m)$ time. This is also true when $U$ is a product of $d$ Householder matrices. The product $UX = H_1(\cdots H_{d-1}(H_d X))$ can be computed by $d$ Householder multiplications. If done sequentially, as indicated by the parentheses, each Householder multiplication can be computed in $O(dm)$ time (Zhang et al., 2018). All $d$ multiplications can then be done in $O(d^2 m)$ time, no more than computing $UX$ normally.

However, the $O(d^2 m)$ time complexity requires us to multiply each Householder matrix sequentially. Each Householder multiplication entails computing an inner product, see Equation (1), which means the multiplication $UX$ requires the computation of $O(d)$ inner products sequentially. Such sequential computation of inner products is slow on parallel hardware like GPUs.

Our main contribution is a novel parallel algorithm, FastH, which resolves the issue with sequential inner products without increasing the time complexity. FastH takes $O(d^2 m)$ time but performs $O(d/m + m)$ sequential matrix-matrix operations instead of $O(d)$ sequential vector-vector operations (inner products). In practice, FastH is up to 29× faster than the normal sequential Householder method, see Figure 1 and Section 4.1.

Mathematical Setting. The number of sequential matrix-matrix and vector-vector operations is simply counted, and operations that can be done in parallel are counted only once. For example, if we process $v_1, \ldots, v_{d/2}$ sequentially while, in parallel, processing $v_{d/2+1}, \ldots, v_d$ sequentially, we count $d/2$ sequential vector-vector operations.

3. A Parallel Algorithm


3.1. Forward Pass


Our goal is to create an $O(d^2 m)$ algorithm with few sequential operations that solves the following problem: given an input $X \in \mathbb{R}^{d \times m}$ with $d > m > 1$ and Householder matrices $H_1, \ldots, H_d$, compute the output $A = H_1 \cdots H_d X$. For simplicity, we assume $m$ divides $d$.

Since each $H_i$ is a $d \times d$ matrix, it would take $O(d^3)$ time to read the input $H_1, \ldots, H_d$. Therefore, we represent each Householder matrix $H_i$ by its associated Householder vector $v_i$ as in Equation (1).

A simplified version of FastH proceeds as follows: divide the Householder product $H_1 \cdots H_d$ into smaller products $P_1 \cdots P_{d/m}$ so that each $P_i$ is a product of $m$ Householder matrices:

$$P_i = H_{(i-1)m+1} \cdots H_{im} \qquad i = 1, \ldots, d/m. \tag{2}$$

All $d/m$ products $P_i$ can be computed in parallel. The output can then be computed as $A = P_1 \cdots P_{d/m} X$ instead of $A = H_1 \cdots H_d X$, which reduces the number of sequential matrix multiplications from $d$ to $d/m$.

This algorithm computes the correct $A$; however, the time complexity increases due to two issues. First, multiplying each product $P_i$ with $X$ takes $O(d^2 m)$ time, a total of $O(d^3)$ time for all $d/m$ products. Second, we need to compute all $d/m$ products $P_1, \ldots, P_{d/m}$ in $O(d^2 m)$ time, so each product $P_i$ must be computed in $O(d^2 m / (d/m)) = O(dm^2)$ time. If we only use the Householder structure, it takes $O(d^2 m)$ time to compute each $P_i$, which is not fast enough.

Both issues can be resolved, yielding an $O(d^2 m)$ algorithm. The key ingredient is a linear algebra result that dates back to 1987. The result is restated in Lemma 1.

Lemma 1. (Bischof & Van Loan, 1987) For any $m$ Householder matrices $H_1, \ldots, H_m$ there exist $W, Y \in \mathbb{R}^{d \times m}$ s.t.

$$I - 2WY^T = H_1 \cdots H_m.$$

Both $W$ and $Y$ can be computed by $m$ sequential Householder multiplications in $O(dm^2)$ time.

Proof. See (Bischof & Van Loan, 1987), Method 2.
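As an illustration of Lemma 1, the following is a minimal PyTorch sketch of one way to build $W$ and $Y$ by $m$ sequential Householder multiplications. It is our own reconstruction of the WY recurrence (append $v_j$ to $Y$ and $P_{j-1} v_j / \|v_j\|_2^2$ to $W$), not code from (Bischof & Van Loan, 1987) or the FastH repository.

```python
import torch

def wy_representation(V):
    """Given V of shape (m, d) whose rows are Householder vectors v_1, ..., v_m,
    return W, Y of shape (d, m) such that H_1 ... H_m = I - 2 W Y^T.

    Each of the m iterations costs O(dm), so the total cost is O(dm^2)."""
    m, d = V.shape
    W = torch.zeros(d, m, dtype=V.dtype, device=V.device)
    Y = torch.zeros(d, m, dtype=V.dtype, device=V.device)
    for j in range(m):
        v = V[j]
        u = v / (v @ v)                              # v_j / ||v_j||^2
        if j > 0:                                    # u <- (I - 2 W Y^T) u = P_{j-1} v_j / ||v_j||^2
            u = u - 2.0 * W[:, :j] @ (Y[:, :j].T @ u)
        W[:, j] = u
        Y[:, j] = v
    return W, Y
```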


For completeness, we provide pseudo-code in Algorithm 1. Theorem 1 states properties of Algorithm 1, and its proof clarifies how Lemma 1 solves both issues outlined above.

Theorem 1. Algorithm 1 computes $H_1 \cdots H_d X$ in $O(d^2 m)$ time with $O(d/m + m)$ sequential matrix multiplications.

Algorithm 1 FastH Forward

Input: $X \in \mathbb{R}^{d \times m}$ and $H_1, \ldots, H_d \in \mathbb{R}^{d \times d}$.
Output: $A_1 = P_1 \cdots P_{d/m} X = H_1 \cdots H_d X$.

// Step 1
for $i = d/m$ to $1$ do in parallel
    Compute $Y_i, W_i \in \mathbb{R}^{d \times m}$ such that $P_i = I - 2 W_i Y_i^T$ by using Lemma 1.   $O(dm^2)$
end for

// Step 2
$A_{d/m+1} = X$.
for $i = d/m$ to $1$ do sequentially
    $A_i = A_{i+1} - 2 W_i (Y_i^T A_{i+1})$.
end for
return $A_1$.
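A minimal PyTorch sketch of Algorithm 1, reusing the illustrative wy_representation helper from the sketch above. In plain Python the Step 1 loop below runs serially; the point of the algorithm is that its $d/m$ iterations are independent and can be parallelized, as in the CUDA implementation referenced in Section 6.

```python
def fasth_forward(V, X):
    """Compute A_1 = H_1 ... H_d X for Householder vectors in V (shape d x d)
    and a mini-batch X (shape d x m). Assumes m divides d."""
    d, m = X.shape
    # Step 1: one WY representation per block of m Householder matrices.
    # The d/m calls are independent and could run in parallel.
    blocks = [wy_representation(V[i:i + m]) for i in range(0, d, m)]
    # Step 2: d/m sequential matrix-matrix products A_i = A_{i+1} - 2 W_i (Y_i^T A_{i+1}).
    A = X
    for W, Y in reversed(blocks):   # apply P_{d/m} first, then ..., P_1
        A = A - 2.0 * W @ (Y.T @ A)
    return A
```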



Proof. Correctness. Each iteration of Step 2 computes

$$A_i = A_{i+1} - 2 W_i (Y_i^T A_{i+1}) = P_i A_{i+1}.$$

Therefore, at termination, $A_1 = P_1 \cdots P_{d/m} X$. In Step 1, we used Lemma 1 to compute the $P_i$'s such that $A_1 = H_1 \cdots H_d X$ as wanted.

Time complexity. Consider the for loop in Step 1. By Lemma 1, each iteration takes $O(dm^2)$ time. Therefore, the total time of the $d/m$ iterations is $O(dm^2 \cdot d/m) = O(d^2 m)$.

Consider iteration $i$ of the loop in Step 2. The time of the iteration is asymptotically dominated by the two matrix multiplications. Since $A_{i+1}$, $W_i$ and $Y_i$ all are $d \times m$ matrices, it takes $O(dm^2)$ time to compute both matrix multiplications. There are $d/m$ iterations, so the total time becomes $O(dm^2 \cdot d/m) = O(d^2 m)$.

Number of Sequential Operations. Each iteration in Step 2 performs two sequential matrix multiplications. There are $d/m$ sequential iterations, which gives a total of $O(d/m)$ sequential matrix multiplications.

Each iteration in Step 1 performs $m$ sequential Householder multiplications to construct $P_i$, see Lemma 1. Since these iterations are run in parallel, the algorithm performs no more than $O(d/m + m)$ sequential matrix multiplications.

Remark. Supplementary Material 8.1 extends this section's techniques to compute gradients, see Algorithm 2. For simplicity, this section had Algorithm 1 compute only $A_1$; however, in Algorithm 2 it will be convenient to assume $A_1, \ldots, A_{d/m}$ are precomputed. Each $A_i = P_i \cdots P_{d/m} X$ can be saved during Step 2 of Algorithm 1 without increasing asymptotic memory consumption.

3.2. Extensions


Trade-off. Both Algorithm 1 and Algorithm 2 can be extended to take a parameter $k$ that controls a trade-off between total time complexity and the amount of parallelism. This can be achieved by changing the number of Householder matrices in each product $P_i$ from the mini-batch size $m$ to an integer $k$. The new algorithm takes $O(d^2 k + d^2 m)$ time, $O(d^2 m / k)$ space and has $O(d/k + k)$ sequential matrix multiplications. This extension has the practical benefit that one can try different values of $k$ and choose the one that yields superior performance on a particular hardware setup. The number of sequential matrix multiplications $O(d/k + k)$ is minimized when $k = O(\sqrt{d})$. For a constant $c > 1$, we can find the best $k \in \{2, 3, \ldots, c\sqrt{d}\}$ by trying all $O(\sqrt{d})$ values. The search needs to be done only once and takes $O(\sqrt{d} \cdot (d^2 k + d^2 m)) = O(d^3 + d^{2.5} m)$ time. In practice, this time is negligible, e.g., on the hardware we describe in Section 4 it took less than 1 s for $d = 784$.
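The search over $k$ can be as simple as timing one gradient step per candidate value and keeping the fastest. A rough sketch, assuming some callable fasth_step(k) that runs a forward/backward pass with products of $k$ Householder matrices (the name and signature are ours, not the repository's API):

```python
import math
import time
import torch

def find_best_k(fasth_step, d, c=4):
    """Time one step for each candidate k in {2, ..., c*sqrt(d)} and return the fastest."""
    best_k, best_t = None, float("inf")
    for k in range(2, int(c * math.sqrt(d)) + 1):
        torch.cuda.synchronize()               # include pending GPU work in the measurement
        start = time.perf_counter()
        fasth_step(k)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        if elapsed < best_t:
            best_k, best_t = k, elapsed
    return best_k
```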

4. Experiments


To simulate a realistic machine learning environment, we performed all experiments on a standard machine learning server using a single NVIDIA RTX 2080 Ti.

4.1. Comparing Running Time


This subsection investigates the time it takes for FastH to perform a gradient descent update wrt. an orthogonal matrix. The time of FastH is compared against four alternative algorithms. The first two alternatives are methods based on the matrix exponential and the Cayley map, respectively (Golinski et al., 2019). The other two alternatives are the sequential and parallel algorithms from (Zhang et al., 2018), which both rely on Householder matrices like FastH. Both articles open-sourced their implementations, which we use in our experiments (see footnotes 1 and 2).

The performance of the sequential algorithm is particularly interesting, because it is the same algorithm that most previous work on Normalizing Flows adopts (Tomczak & Welling, 2016; van den Berg et al., 2018; Hoogeboom et al., 2019). The only difference is that (Zhang et al., 2018) implemented the algorithm in CUDA instead of PyTorch (Paszke et al., 2019) or TensorFlow (Abadi et al., 2015).

We measure the time of a gradient descent step with a weight matrix $W \in \mathbb{R}^{d \times d}$ and a mini-batch $X \in \mathbb{R}^{d \times m}$, where $m = 32$ and $d = 1 \cdot 64, 2 \cdot 64, \ldots, 4864$. We ran each algorithm 100 times, and we report the mean time $\mu$ with error bars $[\mu - \sigma, \mu + \sigma]$, where $\sigma$ is the standard deviation of the running time over the 100 repetitions. Figure 2 depicts the running time on the y-axis as the size of the $d \times d$ matrices increases on the x-axis. For $d > 64$, FastH is faster than all previous approaches. At $d = 64$ FastH is faster than all previous approaches except the parallel algorithm. FastH is even faster at $d = 3000$ than the Sequential algorithm at $d = 400$. To put the matrix sizes into perspective, we mention that previous work employs, e.g., $d = 192$ in (Kingma & Dhariwal, 2018) or $d = 784$ in (Zhang et al., 2018).
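A minimal sketch of the timing protocol described above (100 repetitions, mean and standard deviation, explicit GPU synchronization so that asynchronous CUDA kernels are fully included in each measurement); the step() callable stands for one gradient descent update with any of the compared methods and is our own placeholder:

```python
import statistics
import time
import torch

def time_update(step, repetitions=100):
    """Return (mean, std) of the wall-clock time of `step()` over `repetitions` runs."""
    times = []
    for _ in range(repetitions):
        torch.cuda.synchronize()
        start = time.perf_counter()
        step()
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)
```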


1 https://github.com/zhangjiong724/spectral-RNN

2 https://github.com/Lezcano/expRNN





Figure 2. Running time of different algorithms for $d \times d$ matrices. FastH is fastest when $d > 64$. The sequential algorithm from (Zhang et al., 2018) crashed when $d > 448$.


5. Related Work


The Householder Decomposition. The Householder decomposition of orthogonal matrices has been used in much previous work, for example, (Tomczak & Welling, 2016; Mhammedi et al., 2017; Zhang et al., 2018; van den Berg et al., 2018; Hoogeboom et al., 2019). Previous work typically uses a sequential algorithm that performs $O(d)$ sequential inner products. To circumvent the resulting long computation time on GPUs, previous work often suggests limiting the number of Householder matrices, which limits the expressiveness of the orthogonal matrix, introducing a trade-off between computation time and expressiveness.

FastH takes the same asymptotic time as the sequential algorithm; however, it performs fewer sequential matrix operations, making it up to 29× faster in practice. Since FastH computes the same output as the previous sequential algorithms, it can be used in previous work without degrading the performance of their models. In particular, FastH can be used to either (1) increase expressiveness at no additional computational cost or (2) speed up previous applications at the same level of expressiveness.

Different Orthogonal Parameterizations. Previous work explores different approaches to orthogonal parameterizations, including methods based on Householder matrices (Mhammedi et al., 2017), the matrix exponential (Lezcano-Casado & Martínez-Rubio, 2019) and the Cayley map (Golinski et al., 2019). (Golinski et al., 2019) raised a theoretical concern about the use of Householder matrices. The methods based on the matrix exponential and the Cayley map have desirable provable guarantees which, currently, the Householder decomposition is not known to possess. This might make it desirable to use methods based on the matrix exponential or the Cayley map instead of methods based on Householder matrices. However, the methods based on the matrix exponential and the Cayley map use $O(d^3)$ time to construct the orthogonal matrix, which the Householder method circumvents. This allows Householder methods to perform faster multiplications. In particular, for a mini-batch $X \in \mathbb{R}^{d \times m}$, methods based on Householder matrices can compute the product in $O(d^2 m)$ time, $O(d/m)$ times faster.

Determinants and Matrix Decompositions. (Kingma & Dhariwal, 2018) propose to speed up determinant computations by using the PLU decomposition $W = PLU$, where $P$ is a permutation matrix and $L, U$ are lower and upper triangular. This allows the determinant to be computed in $O(d)$ time instead of $O(d^3)$. (Hoogeboom et al., 2019) point out that a fixed permutation matrix $P$ limits flexibility. To fix this issue, they suggest using the QR decomposition, where $R$ is a rectangular matrix and $Q$ is orthogonal. They suggest making $Q$ orthogonal by using the Householder decomposition, which FastH can speed up.
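For intuition on why the PLU parameterization makes the determinant cheap: the determinant of a triangular matrix is the product of its diagonal, so the log-absolute-determinant reduces to a sum over $d$ diagonal entries. A small sketch, assuming the Glow-style convention that $P$ contributes only a sign and $L$ has unit diagonal:

```python
import torch

def plu_logabsdet(U):
    """log|det W| for W = P L U with P a permutation (|det P| = 1) and L unit
    lower triangular (det L = 1): only the diagonal of U matters, an O(d) sum
    instead of an O(d^3) dense determinant."""
    return torch.log(torch.abs(torch.diagonal(U))).sum()
```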

6. Code


To make FastH widely accessible, we wrote a PyTorch implementation 'nn.Orthogonal' which can be used like 'nn.Linear'. Besides implementing the default 'forward' function, it also contains functions for the inverse and the log Jacobian determinant (see footnote 3). While implementing FastH, we found that Python did not provide an adequate level of parallelization. We therefore implemented FastH in CUDA to fully utilize the parallel capabilities of GPUs. Our code can be found at https://github.com/alexandermath/fasth.
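A hypothetical usage sketch of such a drop-in module, mirroring how 'nn.Linear' is used; the import path, constructor argument and method names below are illustrative assumptions of ours, not the documented interface of the repository:

```python
import torch
from fasth import Orthogonal  # hypothetical import path

d, m = 784, 32
layer = Orthogonal(d)               # assumed: an orthogonal d x d layer used like nn.Linear(d, d)
x = torch.randn(m, d, device="cuda")

y = layer(x)                        # forward: multiply by U
x_back = layer.inverse(y)           # assumed inverse: multiply by U^T = U^{-1}
logdet = layer.log_det_jacobian()   # assumed to return 0, since |det U| = 1
```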

7. Conclusion


We identified an algorithmic issue with the previous use of Householder matrices in Neural Networks. FastH mitigates the issue and is up to 29× faster in practice, without introducing any loss of quality. In other words, FastH computes the same thing as the previous algorithms, just faster. FastH can thus be used to speed up Neural Networks like (Tomczak & Welling, 2016; van den Berg et al., 2018; Hoogeboom et al., 2019) without any downsides.


3 It just needs to return $0 = \lg|\det(\partial Ux/\partial x)|$.


References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Bischof, C. and Van Loan, C. The WY Representation for Products of Householder Matrices. SIAM Journal on Scientific and Statistical Computing, 1987.

Golinski, A., Lezcano-Casado, M., and Rainforth, T. Improving Normalizing Flows via Better Orthogonal Parameterizations. In ICML Workshop on Invertible Neural Networks and Normalizing Flows, 2019.

Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The Reversible Residual Network: Backpropagation Without Storing Activations. In NIPS, 2017.

Hoogeboom, E., van den Berg, R., and Welling, M. Emerging Convolutions for Generative Normalizing Flows. In ICML, 2019.

Kingma, D. P. and Dhariwal, P. Glow: Generative Flow with Invertible 1x1 Convolutions. In NeurIPS, 2018.

Lezcano-Casado, M. and Martínez-Rubio, D. Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group. In ICML, 2019.

Mhammedi, Z., Hellicar, A., Rahman, A., and Bailey, J. Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections. In ICML, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 2019.

Tomczak, J. M. and Welling, M. Improving Variational Auto-Encoders using Householder Flow. arXiv preprint, 2016.

Uhlig, F. Constructive Ways for Generating (Generalized) Real Orthogonal Matrices as Products of (Generalized) Symmetries. Linear Algebra and its Applications, 2001.

van den Berg, R., Hasenclever, L., Tomczak, J., and Welling, M. Sylvester Normalizing Flows for Variational Inference. In UAI, 2018.

Zhang, J., Lei, Q., and Dhillon, I. Stabilizing Gradients for Deep Neural Networks via Efficient SVD Parameterization. In ICML, 2018.

8. Supplementary Material


8.1. Backwards Propagation


This subsection extends the techniques from Section 3.1 to handle gradient computations. Our goal is to create an $O(d^2 m)$ algorithm with few sequential operations that solves the following problem: given $A_1, \ldots, A_{d/m+1}$, $P_1, \ldots, P_{d/m}$ and $\partial L/\partial A_1$ for some loss function $L$, compute $\partial L/\partial X$ and $\partial L/\partial v_1, \ldots, \partial L/\partial v_d$, where $v_j$ is a Householder vector s.t. each Householder matrix is $H_j = I - 2 v_j v_j^T / \|v_j\|_2^2$.

Since each $P_i$ is a $d \times d$ matrix, it would take $O(d^3/m)$ time to read the input $P_1, \ldots, P_{d/m}$. Therefore, we represent each $P_i$ by its WY decomposition $P_i = I - 2 W_i Y_i^T$.

At a high level, FastH has two steps.

Step 1. Sequentially compute $\partial L/\partial A_2, \partial L/\partial A_3, \ldots, \partial L/\partial A_{d/m+1}$ by

$$\frac{\partial L}{\partial A_{i+1}} = \left[\frac{\partial A_i}{\partial A_{i+1}}\right]^T \frac{\partial L}{\partial A_i} = P_i^T \frac{\partial L}{\partial A_i}. \tag{3}$$

This also gives the gradient wrt. $X$ since $X = A_{d/m+1}$.

Step 2. Use $\partial L/\partial A_1, \ldots, \partial L/\partial A_{d/m}$ from Step 1 to compute the gradients $\partial L/\partial v_j$ for all $j$. This problem can be split into $d/m$ subproblems which can be solved in parallel, one subproblem for each $\partial L/\partial A_i$.

Details. For completeness, we state pseudo-code in Algorithm 2, which we now explain with the help of Figure 3.

Figure 3a depicts a computational graph of Step 1 in Algorithm 2. In the figure, consider $\partial L/\partial A_1$ and $P_1^T$, which both have directed edges to a multiplication node. The output of this multiplication is $\partial L/\partial A_2$ by Equation (3). This can be repeated to obtain $\partial L/\partial A_2, \ldots, \partial L/\partial A_{d/m+1}$.

Step 2 computes the gradients of all Householder vectors, $\partial L/\partial v_j$. This computation is split into $d/m$ distinct subproblems that can be solved in parallel. Each subproblem concerns one $\partial L/\partial A_i$ and the corresponding product $P_i$, see Step 2 of Algorithm 2.

To ease notation, we index the Householder matrices of $P_i$ by $P_i = \hat H_1 \cdots \hat H_m$. Furthermore, we let $\hat A_{m+1} := A_{i+1}$ and $\hat A_j := \hat H_j \hat A_{j+1}$. The notation implies that $\hat A_1 = \hat H_1 \cdots \hat H_m \hat A_{m+1} = P_i A_{i+1} = A_i$. The goal of each subproblem is to compute the gradients wrt. the Householder vectors $\hat v_m, \ldots, \hat v_1$ of $\hat H_m, \ldots, \hat H_1$. To compute the gradient of $\hat v_j$, we need $\hat A_{j+1}$ and $\partial L/\partial \hat A_j$, which can be computed by:

$$\hat A_{j+1} = \hat H_j^{-1} \hat A_j = \hat H_j^T \hat A_j \tag{4}$$

$$\frac{\partial L}{\partial \hat A_{j+1}} = \left[\frac{\partial \hat A_j}{\partial \hat A_{j+1}}\right]^T \frac{\partial L}{\partial \hat A_j} = \hat H_j^T \frac{\partial L}{\partial \hat A_j} \tag{5}$$

Figure 3b depicts how $\hat A_2, \ldots, \hat A_{m+1}$ and $\partial L/\partial \hat A_2, \ldots, \partial L/\partial \hat A_{m+1}$ can be computed given $\hat A_1$ and $\partial L/\partial \hat A_1$. Given $\hat A_{j+1}$ and $\partial L/\partial \hat A_j$, we can compute $\partial L/\partial \hat v_j$ as done in (Zhang et al., 2018; Mhammedi et al., 2017). For completeness, we restate the needed equation in our notation, see Equation (6). Let $a^{(l)}$ be the $l$'th column of $\hat A_{j+1}$ and let $g^{(l)}$ be the $l$'th column of $\partial L/\partial \hat A_j$. The sum of the gradient over a mini-batch of size $m$ is then:

$$\frac{\partial L}{\partial \hat v_j} = -\frac{2}{\|\hat v_j\|_2^2} \sum_{l=1}^{m} \left[ (\hat v_j^T a^{(l)})\, g^{(l)} + (\hat v_j^T g^{(l)})\, a^{(l)} - \frac{2}{\|\hat v_j\|_2^2} (\hat v_j^T a^{(l)})(\hat v_j^T g^{(l)})\, \hat v_j \right] \tag{6}$$
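A minimal PyTorch sketch of Equation (6), computing the gradient of a single Householder vector from the matrices $\hat A_{j+1}$ and $\partial L/\partial \hat A_j$; the function name and argument order are our own, and the sign convention follows our reconstruction of Equation (6) above.

```python
def householder_vector_grad(v, A_next, G):
    """Gradient of the loss wrt a Householder vector v (shape d), given
    A_next = \hat A_{j+1} (shape d x m) and G = dL/d\hat A_j (shape d x m).
    Implements the mini-batch sum of Equation (6) in O(dm) time."""
    vTA = v @ A_next                 # shape (m,): v^T a^(l) for each column l
    vTG = v @ G                      # shape (m,): v^T g^(l) for each column l
    sq = v @ v                       # ||v||_2^2
    grad = G @ vTA + A_next @ vTG - (2.0 / sq) * (vTA @ vTG) * v
    return -(2.0 / sq) * grad
```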


Algorithm 2 FastH Backward

Input: $A_1, \ldots, A_{d/m+1}$, $P_1, \ldots, P_{d/m}$ and $\partial L/\partial A_1$.
Output: $\partial L/\partial X$ and $\partial L/\partial v_k$ for all $k$, where $H_k = I - 2 v_k v_k^T / \|v_k\|_2^2$.

// Step 1
for $i = 1$ to $d/m$ do sequentially
    $\partial L/\partial A_{i+1} = P_i^T \, \partial L/\partial A_i$   eq. (3)
end for

// Step 2
for $i = 1$ to $d/m$ do in parallel
    Let $\partial L/\partial \hat A_1 = \partial L/\partial A_i$.
    To ease notation, let $P_i = \hat H_1 \cdots \hat H_m$.
    for $j = 1$ to $m$ do
        Compute $\hat A_{j+1}$ and $\partial L/\partial \hat A_{j+1}$, eqs. (4) and (5).   $O(dm)$
        Compute $\partial L/\partial \hat v_j$ using $\hat A_{j+1}$, $\partial L/\partial \hat A_j$, eq. (6).   $O(dm)$
    end for
end for
return $\partial L/\partial X = \partial L/\partial A_{d/m+1}$ and $\partial L/\partial v_k$ for all $k = 1, \ldots, d$.



Theorem 2 states properties of Algorithm 2.

Theorem 2. Algorithm 2 computes $\partial L/\partial X$ and $\partial L/\partial v_1, \ldots, \partial L/\partial v_d$ in $O(d^2 m)$ time with $O(d/m + m)$ sequential matrix multiplications.

Proof. Correctness. FastH computes gradients by the same equations as (Zhang et al., 2018), so in most cases, we show correctness by clarifying how FastH computes the same thing, albeit faster.



Figure 3. Computational graph of Step 1 and the $i$'th subproblem in Step 2 from Algorithm 2.


Consider $\partial L/\partial X$ computed in Step 1:

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial A_{d/m+1}} = P_{d/m}^T \cdots P_1^T \frac{\partial L}{\partial A_1} \overset{\text{(eq. 2)}}{=} H_d^T \cdots H_1^T \frac{\partial L}{\partial A_1}.$$

This is the same as that computed in (Zhang et al., 2018).

Consider Step 2. Both $\partial L/\partial \hat v_j$ and $\partial L/\partial \hat A_j$ are computed as done in (Zhang et al., 2018). $\hat A_{j+1}$ is computed using Equation (4), similar to backpropagation without storing activations (Gomez et al., 2017), but using the fact that $\hat H_j^T = \hat H_j^{-1}$.

Time Complexity. In Step 1 the for loop performs $d/m$ matrix multiplications. Due to the WY decomposition, $P_i^T = (I - 2WY^T)^T = I - 2YW^T$, which can be multiplied onto $\partial L/\partial A_i \in \mathbb{R}^{d \times m}$ in $O(dm^2)$ time since $W, Y \in \mathbb{R}^{d \times m}$. The computation is repeated $d/m$ times and takes a total of $O(d^2 m)$ time.

In each inner-loop iteration of Step 2, the two Householder matrix multiplications take $O(dm)$ time, see Equations (4) and (5). The gradient is then computed by Equation (6); this sum also takes $O(dm)$ time to compute. Both computations are repeated $d/m \cdot m$ times across the two nested loops of Step 2. Therefore, the total time is $O(d^2 m)$.

Number of Sequential Operations. Step 1 performs $O(d/m)$ sequential matrix operations. The inner loop of Step 2 performs $O(m)$ sequential matrix multiplications. Since each iteration of the outer loop in Step 2 is run in parallel, the algorithm performs no more than $O(d/m + m)$ sequential matrix multiplications.

8.2. Implementation Details


The parallel algorithm from (Zhang et al., 2018) halted for larger values of $d$. The failing code was not part of the main computation. This allowed us to remove the failing code and still get a good estimate of the running time of the parallel algorithm. We emphasize that removing the corresponding code makes the parallel algorithm faster. The experiments thus demonstrated that FastH is faster than a lower bound on the running time of the parallel algorithm.