
HiPPO: Recurrent Memory with Optimal Polynomial Projections

Albert Gu*, Tri Dao*, Stefano Ermon, Atri Rudra, and Chris Ré


Many areas of machine learning require processing sequential data in an online fashion. For example, a time series may be observed in real time where the future needs to be continuously predicted, or an agent in a partially-observed environment must learn how to encode its cumulative experience into a state in order to navigate and make decisions. The fundamental problem in modeling long-term and complex temporal dependencies is memory: storing and incorporating information from previous time steps.

However, popular machine learning models suffer from forgetting: they are either built with fixed-size context windows (e.g. attention), or heuristic mechanisms that empirically suffer from a limited memory horizon (e.g., because of "vanishing gradients").

This post describes our method for addressing the fundamental problem of incrementally maintaining a memory representation of sequences from first principles. We will:

  • find a technical formulation of this problem that we can analyze mathematically, and derive a closed-form solution with the HiPPO framework

  • see that our method can be easily integrated into end-to-end models such as RNNs, where our framework both generalizes previous models, including the popular LSTM and GRU, and improves on them, achieving state of the art on permuted MNIST, a popular benchmark for long-range memory.

  • show how insights from the framework reveal methods with distinct theoretical properties -- we highlight a particular model, HiPPO-LegS, which is computationally efficient, provably alleviates vanishing gradients, and is the first known method to display "timescale robustness"!

Our paper was accepted to NeurIPS 2020 as a Spotlight, and our code is publicly available with PyTorch and TensorFlow implementations.

Online Function Approximation: A Formalism for Incremental Memory Representations

Our first insight is to move from the discrete-time to the continuous-time setting, which is often easier to analyze theoretically. We ask the following very natural question: given a continuous function (in one dimension) $f(t)$, can we maintain a fixed-size representation $c(t) \in \mathbb{R}^N$ at all times $t$ such that $c(t)$ optimally captures the history of $f$ from times $0$ to $t$?

However, this problem is not fully well-defined yet -- we need to specify:

  • Quality of approximation: What is the "optimal approximation" of the function's history? We need to specify a measure (or weight function) that tells us how much we care about every time in the past.

  • Basis: How can we compress a continuous function into a fixed-length vector? We can project the function onto a subspace of dimension $N$ and store the $N$ coefficients of its expansion in any basis. For simplicity, we will assume that we are working with the polynomial basis throughout this post.

Intuitively, we can think of the memory representation $c(t) \in \mathbb{R}^N$ as the coefficient vector of the optimal polynomial approximation to the history of $f(t)$.
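
To make this concrete, here is a minimal sketch (our own illustration, not the paper's code) of the projection that $c(t)$ represents at a single time $t = T$: assuming the uniform measure on $[0, T]$ and the normalized Legendre basis $p_n(s) = \sqrt{2n+1}\, P_n(2s/T - 1)$, which is orthonormal with respect to that measure, the coefficients are just inner products of the history with the basis.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

N, T = 16, 10.0
xs = np.linspace(0, T, 10_000)
f = np.sin(xs)                                  # any history f(s), s in [0, T]

# Normalized Legendre basis on [0, T]: p_n(s) = sqrt(2n+1) * P_n(2s/T - 1),
# orthonormal w.r.t. the uniform probability measure on [0, T].
z = 2 * xs / T - 1
basis = legvander(z, N - 1).T * np.sqrt(2 * np.arange(N) + 1)[:, None]

c = (basis * f).mean(axis=1)                    # c_n = (1/T) * integral of f(s) p_n(s) ds
f_hat = c @ basis                               # optimal degree-(N-1) approximation
print("reconstruction MSE:", np.mean((f - f_hat) ** 2))   # tiny
```

HiPPO's contribution is that these coefficients need not be recomputed from scratch at every $t$: as we'll see next, they evolve according to a simple ODE.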

The HiPPO Framework (High-Order Polynomial Projection Operator)

Notice that given a measure (and assuming the polynomial basis), the online function approximation problem is now fully specified![1] That is, given any input function $f(t)$, the desired coefficient vectors $c(t)$, which are our desired memory representation, are completely defined. The question remains -- how do we calculate them?

The HiPPO framework formalizes this problem and provides machinery to compute the solution. Although the desired coefficients $c(t)$ are rather abstractly defined as the implicit solution to an approximation problem, there is amazingly a closed-form solution that's easy to compute. We'll leave the technical details to the full paper, but we'll note that they leverage classic tools from approximation theory such as orthogonal polynomials.[2] In the end, the solution takes the form of a simple linear differential equation, which is called the HiPPO operator:

$\dot{c}(t) = A(t) c(t) + B(t) f(t)$

In short, the HiPPO framework takes a family of measures, and gives an ODE with closed-form transition matrices $A(t), B(t)$. These matrices depend on the measure, and following these dynamics finds the coefficients $c(t)$ that optimally approximate the history of $f(t)$ according to the measure.

Figure 1: The HiPPO Framework. An input function $f(t)$ (black line) is continually approximated by storing the coefficients of its optimal polynomial projections (colored lines) according to specified measures (colored boxes). These coefficients evolve through time (red, blue) according to a linear dynamical system.

Instantiations of HiPPO

Figure 2 shows some concrete examples of HiPPO. We show two of the simplest families of measures, based on uniform measures. The translated Legendre measure on the left uses a fixed-length sliding window; in other words, it cares about recent history. On the other hand, the scaled Legendre measure uniformly weights the entire history up to the current time. In both cases, the HiPPO framework produces closed-form formulas for the corresponding ODEs, which are shown for completeness (the transition matrices are actually quite simple!).

Figure 2. Examples of simple measures and their corresponding HiPPO operators. The Translated Legendre Measure uniformly weights the past $\theta$ (a hyperparameter) units of time, while the Scaled Legendre Measure uniformly weights all history.
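
For example, here is a sketch of the scaled Legendre (LegS) transition matrices, following the closed forms in the paper; the corresponding ODE is $\dot{c}(t) = -\frac{1}{t} A c(t) + \frac{1}{t} B f(t)$.

```python
import numpy as np

def make_legs(N):
    """HiPPO-LegS transition matrices:
    A[n, k] = sqrt(2n+1) * sqrt(2k+1)  if n > k
            = n + 1                    if n == k
            = 0                        if n < k
    B[n]    = sqrt(2n+1)
    """
    n = np.arange(N)
    s = np.sqrt(2 * n + 1)
    A = np.tril(np.outer(s, s), k=-1) + np.diag(n + 1.0)
    B = s
    return A, B
```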

From continuous-time to discrete-time

There is one more detail called discretization. By using standard techniques for approximating the evolution of dynamical systems, the continuous-time HiPPO ODE can be converted to a discrete-time linear recurrence. Additionally, this step allows extensions of HiPPO to flexibly handle irregularly-sampled or missing data: simply evolve the system according to the given timestamps.
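
As a sketch of what one such step looks like (using the standard bilinear transform as one choice of discretization; the paper discusses several), a single step of $\dot{c} = A c + B f$ over a timestep $\Delta t$ can be written as follows. Note that $\Delta t$ may vary per step, which is exactly how irregularly-sampled data is handled:

```python
import numpy as np

def bilinear_step(c, f, A, B, dt):
    """One bilinear-transform step of the linear ODE c' = A c + B f.
    dt is the time elapsed since the previous sample, so irregular
    timestamps simply become varying dt values."""
    I = np.eye(len(c))
    lhs = I - (dt / 2) * A
    rhs = (I + (dt / 2) * A) @ c + dt * (B * f)
    return np.linalg.solve(lhs, rhs)
```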

For the practitioner: To construct a memory representation $c_t$ of an input sequence $f_t$, HiPPO is implemented as the simple linear recurrence $c_{t+1} = A_t c_t + B_t f_t$ where the transition matrices $A_t, B_t$ have closed-form formulas. That's it!
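
Putting the pieces together, here is a minimal end-to-end sketch for HiPPO-LegS (our own illustration: we apply a bilinear discretization with step size 1 at time $t = k$ to the LegS ODE above; the paper derives the discrete $A_t, B_t$ in closed form):

```python
import numpy as np

def hippo_legs_coeffs(f_seq, N=64):
    """Run the discrete HiPPO-LegS recurrence c_{k+1} = A_k c_k + B_k f_k
    over a 1-D input sequence and return the final memory representation."""
    n = np.arange(N)
    s = np.sqrt(2 * n + 1)
    A = np.tril(np.outer(s, s), k=-1) + np.diag(n + 1.0)
    I, c = np.eye(N), np.zeros(N)
    for k, f in enumerate(f_seq, start=1):
        # bilinear step of dc/dt = -(1/t) A c + (1/t) B f at t = k, dt = 1
        At = A / (2 * k)
        c = np.linalg.solve(I + At, (I - At) @ c + (s * f) / k)
    return c

c = hippo_legs_coeffs(np.sin(np.linspace(0, 10, 1000)))  # 64 numbers summarizing 1000 steps
```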

Hippos in the wild: integration into ML models

At its core, HiPPO is a simple linear recurrence that can be integrated into end-to-end models in many ways. We focus on recurrent neural networks (RNNs) due to their connection to dynamical systems involving a state evolving over time, just as in HiPPO. The HiPPO-RNN is the simplest way to perform this integration:

Figure 3. (Top) HiPPO on discrete sequences has the form of a simple linear recurrence. (Bottom) The HiPPO-RNN cell diagram.

  1. Start with a standard RNN recurrence $h_t = \tau(h_{t-1}, x_t)$ that evolves a hidden state $h_t$ by any nonlinear function $\tau$ given the input $x_t$

  2. Project the state down to a lower-dimensional feature $f_t$

  3. Use the HiPPO recurrence to create a representation $c_t$ of the history of $f_t$, which is also fed back into $\tau$ (see the sketch after this list)
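
Here is a minimal sketch of such a cell in PyTorch, covering the three steps above (the class and parameter names are ours, and the released code differs in details such as the discretization and gating):

```python
import torch
import torch.nn as nn

class HiPPORNNCell(nn.Module):
    """Minimal HiPPO-RNN cell sketch. A and B are fixed (non-learned)
    HiPPO-LegS transition matrices."""
    def __init__(self, input_size, hidden_size, memory_size):
        super().__init__()
        # step 1: any nonlinear update tau(h, x, c)
        self.tau = nn.Sequential(
            nn.Linear(hidden_size + input_size + memory_size, hidden_size),
            nn.Tanh(),
        )
        # step 2: project the hidden state to a scalar feature f_t
        self.proj = nn.Linear(hidden_size, 1)
        n = torch.arange(memory_size, dtype=torch.float32)
        s = torch.sqrt(2 * n + 1)
        self.register_buffer("A", torch.tril(torch.outer(s, s), diagonal=-1) + torch.diag(n + 1))
        self.register_buffer("B", s)
        self.register_buffer("I", torch.eye(memory_size))

    def forward(self, x_t, h, c, k):
        h = self.tau(torch.cat([h, x_t, c], dim=-1))   # step 1: RNN update sees c
        f_t = self.proj(h)                             # step 2: feature f_t
        At = self.A / (2 * k)                          # step 3: bilinear HiPPO-LegS step
        rhs = c @ (self.I - At).T + f_t * self.B / k
        c = torch.linalg.solve(self.I + At, rhs.unsqueeze(-1)).squeeze(-1)
        return h, c
```

Unrolling this cell over a sequence (with $k = 1, 2, \ldots$) and reading out from $h$ gives a drop-in RNN.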

Special cases of the HiPPO-RNN

Those familiar with RNNs may notice this looks very similar to cell diagrams for other models such as LSTMs. In fact, several common models are closely related:

  • The most popular RNN models are the LSTM and GRU, which rely on a gating mechanism. In particular, the cell state of an LSTM performs the recurrence $c_{t+1} = \alpha_t c_t + \beta_t f_t$, where $\alpha_t, \beta_t$ are known as the "forget" and "input" gates. Notice the similarity to the HiPPO recurrence $c_{t+1} = A_t c_t + B_t f_t$. In fact, these gated RNNs can be viewed as a special case of HiPPO with low-order ($N=1$) approximations and input-dependent discretization! So HiPPO sheds light on these popular models and shows how the gating mechanism, which was originally introduced as a heuristic, could have been derived (see the sketch after this list).

  • The HiPPO-LegT model, which is the instantiation of HiPPO for the translated Legendre measure, is exactly equivalent to a recent model called the Legendre Memory Unit.[3] Our proof is also much shorter, and just involves following the steps of the HiPPO framework!
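
To illustrate the $N=1$ connection (a sketch of the idea, not the exact derivation in the paper): with a one-dimensional memory, making the discretization step size input-dependent turns a HiPPO step into the familiar gated update.

```python
import torch

def gated_step(c, f, gate_logits):
    """N = 1 HiPPO-style step with an input-dependent step size.
    dt = sigmoid(gate_logits) plays the role of a discretization step,
    recovering the gated update c_{t+1} = (1 - dt) c_t + dt f_t,
    i.e. a "forget" gate alpha_t = 1 - dt and an "input" gate beta_t = dt."""
    dt = torch.sigmoid(gate_logits)
    return (1 - dt) * c + dt * f
```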

Hippos never forget

Let's take a look at how these models perform on benchmarks. First, we test whether HiPPO solves the problem it was designed for -- online function approximation. Figure 4 shows that it can approximate a sequence of a million time steps with good fidelity. Keep in mind that this works while processing the function online with a limited budget of hidden units; it could have reconstructed the partial function at any point in time.

Figure 4. (Left) a band-limited white noise function, sampled to a sequence of length 1000000. (Right) Reconstruction of the approximate function after processing the sequence with 256 hidden units. HiPPO closely matches the original function (MSE 0.02), while the LSTM produces random noise (MSE 0.25).

Second, we test on the standard Permuted MNIST benchmark, where models must process the input image one pixel at a time and output a classification after consuming the entire sequence. This is a classic benchmark for testing long-term dependencies in sequence models, since they must remember inputs from almost 1000 time steps ago.

Multiple instantiations of our HiPPO framework, including the HiPPO-LegS and HiPPO-LegT models described above, set the state of the art over other recurrent models by a significant margin, achieving 98.3% test accuracy compared to the previous best of 97.15%. In fact, they even outperform non-recurrent sequence models that use global context, such as dilated convolutions and transformers. Full results can be found in Tables 4 and 5 of the paper.

Timescale Robustness of HiPPO-LegS

Lastly, we'll explore some theoretical properties of our most interesting model, corresponding to the Scaled Legendre (LegS) measure. As motivation, the discerning reader may be wondering by this point: What's the difference between different instantiations of HiPPO? How does the measure influence the model? Here are some examples of how intuitive interpretation of the measure translates into theoretical properties of the downstream HiPPO model:

  • Gradient bounds: Since this measure says that we care about the entire past, information should propagate well through time. Indeed, we show that gradient norms of the model decay polynomially in time, instead of exponentially (i.e. the vanishing gradient problem for vanilla RNNs).

  • Computational efficiency: The transition matrices $A_t$ actually have special structure, and the recurrence can be computed in linear instead of quadratic time (see the sketch after this list). We hypothesize that these efficiency properties are true in general (i.e., for all measures), and are related more broadly to the efficiency of orthogonal polynomials and their associated computations (e.g. [link to SODA paper])

  • Timescale robustness: Most interestingly, the scaled measure is agnostic to how fast the input function evolves; Figure 5 illustrates how HiPPO-LegS is intuitively dilation equivariant.
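
As a concrete sketch of the efficiency claim (our own illustration), multiplication by the LegS matrix $A$ takes $O(N)$ rather than $O(N^2)$ time, because the strictly lower-triangular part of $A$ is a rank-one pattern that reduces to a running prefix sum:

```python
import numpy as np

def legs_matvec(c):
    """O(N) multiplication by the HiPPO-LegS matrix A, using
    A[n, k] = sqrt(2n+1) sqrt(2k+1) for n > k, n + 1 for n == k, 0 otherwise,
    so (A c)[n] = sqrt(2n+1) * sum_{k < n} sqrt(2k+1) c[k] + (n + 1) c[n]."""
    N = len(c)
    n = np.arange(N)
    s = np.sqrt(2 * n + 1)
    prefix = np.concatenate(([0.0], np.cumsum(s * c)[:-1]))  # sum over k < n
    return s * prefix + (n + 1) * c
```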

Figure 5. (Top) Since the scaled Legendre measure stretches through time, intuitively, dilating an input function should not change the projections. (Bottom) A commutative diagram illustrating how the HiPPO-LegS operator is equivariant to time dilation.
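
We can check this numerically with a quick sketch (using the same bilinear LegS recurrence as above; the equivariance is exact in continuous time and approximate after discretization):

```python
import numpy as np

def legs_coeffs(f_seq, N=32):
    n = np.arange(N)
    s = np.sqrt(2 * n + 1)
    A = np.tril(np.outer(s, s), k=-1) + np.diag(n + 1.0)
    I, c = np.eye(N), np.zeros(N)
    for k, f in enumerate(f_seq, start=1):
        At = A / (2 * k)
        c = np.linalg.solve(I + At, (I - At) @ c + (s * f) / k)
    return c

t = np.linspace(0, 1, 1000)
f = np.sin(8 * t) * np.exp(-t)
c1 = legs_coeffs(f)                  # original signal
c2 = legs_coeffs(np.repeat(f, 2))    # same signal dilated 2x (twice as many steps)
print(np.abs(c1 - c2).max())         # small: there is no timescale hyperparameter to retune
```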

The table shows results on a trajectory classification dataset with distribution shift between the training and test sequences (i.e., arising from time series being sampled at different rates at deployment); HiPPO is the only method that can generalize to new timescales!

Generalization (test accuracy %) | LSTM | GRU-D | ODE-RNN | NCDE | LMU | HiPPO-LegS
---------------------------------|------|-------|---------|------|------|-----------
100Hz -> 200Hz                   | 25.4 | 23.1  | 41.8    | 44.7 | 6.0  | 88.8
200Hz -> 100Hz                   | 64.6 | 25.5  | 31.5    | 11.3 | 13.1 | 90.1

Conclusion

  • The problem of maintaining memory representations of sequential data can be tackled by posing and solving continuous-time formalisms.

  • The HiPPO framework explains several previous sequence models as well as produces new models with cool properties.

  • This is just the tip of the iceberg - there are many technical extensions of HiPPO, rich connections to other sequence models, and potential applications waiting to be explored!

Try it out

PyTorch and TensorFlow code for HiPPO is available on GitHub, where the HiPPO-RNNs can be used as a drop-in replacement for most RNN-based models. Closed-form formulas and implementations are given for the HiPPO instantiations mentioned here, and several more. For more details, see the full paper.

Footnotes


  1. A measure induces a Hilbert space structure on the space of functions, so that there is a unique optimal approximation: the projection onto the desired subspace.
  2. Examples of famous orthogonal polynomials include the Chebyshev polynomials and Legendre polynomials. The names of our methods, such as LegS (scaled Legendre), are based on the orthogonal polynomial family corresponding to the measure.
  3. The way we integrate the HiPPO recurrence into an RNN is slightly different, so the full RNN versions of the HiPPO-LegT and Legendre Memory Unit (LMU) are slightly different, but the core linear recurrence is the same.