License: CC BY 4.0
arXiv:2403.02600v1 [cs.LG] 05 Mar 2024

TESTAM: A Time-Enhanced Spatio-Temporal Attention Model with Mixture of Experts

Hyunwook Lee & Sungahn Ko
Ulsan National Institute of Science and Technology
{gusdnr0916, sako}@unist.ac.kr
Corresponding author
Abstract

Accurate traffic forecasting is challenging due to the complex interdependencies of large road networks and abrupt speed changes caused by unexpected events. Recent work has focused on spatial modeling with adaptive graph embedding or graph attention but has paid less attention to the temporal characteristics and effectiveness of in-situ modeling. In this paper, we propose the time-enhanced spatio-temporal attention model (TESTAM) to better capture recurring and non-recurring traffic patterns with a mixture-of-experts model comprising three experts: one for temporal modeling, one for spatio-temporal modeling with a static graph, and one for spatio-temporal dependency modeling with a dynamic graph. By introducing different experts and properly routing them, TESTAM better captures traffic patterns under various circumstances, including cases of spatially isolated roads, highly interconnected roads, and recurring and non-recurring events. For proper routing, we reformulate the gating problem as a classification task with pseudo labels. Experimental results on three public traffic network datasets, METR-LA, PEMS-BAY, and EXPY-TKY, demonstrate that TESTAM outperforms 13 existing methods in terms of accuracy due to its better modeling of recurring and non-recurring traffic patterns. The official code is available at https://github.com/HyunWookL/TESTAM

1 Introduction

Spatio-temporal modeling in non-Euclidean space has received considerable attention since it can be widely applied to many real-world problems, such as social networks and human pose estimation. Traffic forecasting is a representative real-world problem, which is particularly challenging due to the difficulty of identifying innate spatio-temporal dependencies between roads. Moreover, such dependencies are often influenced by numerous factors, such as weather, accidents, and holidays (Park et al., 2020; Lee et al., 2020; 2022).

To overcome the challenges related to spatio-temporal modeling, many deep learning models have been proposed, including graph convolutional networks (GCNs), recurrent neural networks (RNNs), and Transformers. Li et al. (2018) have introduced DCRNN, which injects graph convolution into recurrent units, while Yu et al. (2018) have combined graph convolution and convolutional neural networks (CNNs) to model spatial and temporal features, outperforming traditional methods such as ARIMA. Although effective, GCN-based methods require prior knowledge of the topological characteristics of spatial dependencies. In addition, as the pre-defined graph relies heavily on the Euclidean distance and empirical laws (Tobler’s first law of geography), ignoring dynamic changes in traffic (e.g., rush hour and accidents), it is hardly an optimal solution (Jiang et al., 2023). Graph-WaveNet, proposed by Wu et al. (2019), is the first model to address this limitation by using node embedding, building a learnable adjacency matrix for spatial modeling. Motivated by the success of Graph-WaveNet and DCRNN, a line of research has focused on learnable graph structures, such as AGCRN (Bai et al., 2020) and MTGNN (Wu et al., 2020).

Although spatial modeling with learnable static graphs has drastically improved traffic forecasting, researchers have found that it can be further improved by learning network dynamics over time, i.e., a time-varying graph structure. SLCNN (Zhang et al., 2020) and StemGNN (Cao et al., 2020) attempt to learn time-varying graph structures by projecting observational data. Zheng et al. (2020) have adopted multi-head attention for improved dynamic spatial modeling with no spatial restrictions, while Park et al. (2020) have developed ST-GRAT, a modified Transformer for traffic forecasting that utilizes graph attention networks (GAT). However, time-varying graph modeling is noise-sensitive. Attention-based models can be relatively less noise-sensitive, but a recent study reports that they often fail to generate an informative attention map, spreading attention weights over all roads (Jin et al., 2023). MegaCRN (Jiang et al., 2023) utilizes memory networks for graph learning, reducing noise sensitivity and injecting temporal information simultaneously. Although effective, the aforementioned methods each focus on one specific spatial modeling method, paying less attention to the use of multiple spatial modeling methods for in-situ forecasting.

Different spatial modeling methods have certain advantages under different circumstances. For instance, learnable static graph modeling outperforms dynamic graphs in recurring traffic situations (Wu et al., 2020; Jiang et al., 2023). On the other hand, dynamic spatial modeling is advantageous for non-recurring traffic, such as incidents or abrupt speed changes (Park et al., 2020; Zheng et al., 2020). Park et al. (2020) have revealed that preserving the road information itself improves forecasting performance, implying the need for temporal-only modeling. Jin et al. (2023) have shown that a static graph built on temporal similarity can lead to performance improvements when combined with a dynamic graph modeling method. Although many studies have discussed the importance of effective spatial modeling for traffic forecasting, few have focused on the dynamic use of spatial modeling methods in traffic forecasting (i.e., in-situ traffic forecasting).

In this paper, we propose the time-enhanced spatio-temporal attention model (TESTAM), a novel Mixture-of-Experts (MoE) model that enables in-situ traffic forecasting. TESTAM consists of three experts, each with a different spatial modeling method: 1) no spatial modeling, 2) a learnable static graph, and 3) dynamic graph modeling, plus one gating network. Each expert consists of transformer-based blocks with its own spatial modeling method. The gating network takes each expert’s last hidden state and the input traffic conditions, generating candidate routes for in-situ traffic forecasting. To train the gating network effectively, we solve the routing problem as a classification problem with two loss functions, designed to avoid the worst route and to lead to the best route. The contributions of this work can be summarized as follows:

  • We propose a novel Mixture-of-Experts model called TESTAM for traffic forecasting with diverse graph architectures, improving accuracy under different traffic conditions, including recurring and non-recurring situations.

  • We reformulate the gating problem as a classification problem so that the model better contextualizes traffic situations and chooses spatial modeling methods (i.e., experts) during training.

  • Experimental results against state-of-the-art models on three real-world datasets indicate that TESTAM outperforms existing methods quantitatively and qualitatively.

2 Related Work

2.1 Traffic Forecasting

Deep learning models have achieved huge success by effectively capturing spatio-temporal features in traffic forecasting tasks. Previous studies have shown that RNN-based models outperform conventional temporal modeling approaches, such as ARIMA and support vector regression (Vlahogianni et al., 2014; Li & Shahabi, 2018). More recently, substantial research has demonstrated that attention-based models (Zheng et al., 2020; Park et al., 2020) and CNNs (Yu et al., 2018; Wu et al., 2019; 2020) perform better than RNN-based models in long-term prediction tasks. For spatial modeling, Zhang et al. (2016) have proposed a CNN-based spatial modeling method for Euclidean space. Another line of modeling methods, using graph structures to manage complex road networks (e.g., GCNs), has also become popular. However, using GCNs requires building an adjacency matrix, and GCNs depend heavily on the pre-defined graph structure.

To overcome these difficulties, several approaches, such as graph attention models, have been proposed for dynamic edge importance weighting (Park et al., 2020). Graph-WaveNet (Wu et al., 2019) uses a learnable static adjacency matrix to capture hidden spatial dependencies in training. SLCNN (Zhang et al., 2020) and StemGNN (Cao et al., 2020) try to learn a time-varying graph by projecting current traffic conditions. MegaCRN (Jiang et al., 2023) uses memory-based graph learning to construct a noise-robust graph. Despite their effectiveness, forecasting models still suffer from inaccurate predictions due to abruptly changing speeds, instability, and changes in spatial dependency. To address these challenges, we design TESTAM to change its spatial modeling methods based on the traffic context using the Mixture-of-Experts technique.

2.2 Mixture of Experts

Mixture-of-Experts (MoE) is a machine learning technique devised by Shazeer et al. (2017) that has been actively researched as a powerful method for increasing model capacity without additional computational costs. MoEs have been used in various machine learning tasks, such as computer vision (Dryden & Hoefler, 2022) and natural language processing (Zhou et al., 2022; Fedus et al., 2022). Recently, MoEs have gone beyond increasing model capacity and are used to “specialize” each expert in subtasks at specific levels, such as the sample (Eigen et al., 2014; McGill & Perona, 2017; Rosenbaum et al., 2018), token (Shazeer et al., 2017; Fedus et al., 2022), and patch levels (Riquelme et al., 2021). The coarse-grained routing of these MoEs is frequently trained with multiple auxiliary losses focused on load balancing (Fedus et al., 2022; Dryden & Hoefler, 2022), but this often causes the experts to lose their opportunity to specialize. Furthermore, such MoEs assign identical structures to every expert, eventually leading to architectural limitations, such as sharing the same inductive bias, which hardly changes. Dryden & Hoefler (2022) have proposed Spatial Mixture-of-Experts (SMoEs), which induces inductive bias via fine-grained, location-dependent routing for regression problems. SMoEs utilize one routing classification loss based on the final output losses, penalize gating networks with output error signals, and reduce the change caused by inaccurate routing for better routing and expert specialization. However, SMoEs only attempt to avoid incorrect routing and pay less attention to the best routing. TESTAM differs from existing MoEs in two main ways: it utilizes experts with different spatial modeling methods for better generalization, and it can be optimized with two loss functions, one for avoiding the worst route and another for choosing the best route for better specialization.

3 Methods

3.1 Preliminaries

Problem Definition

Let us define a road network as $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{A})$, where $\mathcal{V}$ is the set of all roads in the road network with $|\mathcal{V}|=N$, $\mathcal{E}$ is the set of edges representing the connectivity between roads, and $\mathcal{A}\in\mathbb{R}^{N\times N}$ is a matrix representing the topology of $\mathcal{G}$. Given the road network, we formulate our problem as a special version of multivariate time series forecasting that predicts the future $T$ graph signals based on $T'$ historical input graph signals:

$$\big[X_{\mathcal{G}}^{(t-T'+1)},\dots,X_{\mathcal{G}}^{(t)}\big]\xrightarrow{f(\cdot)}\big[X_{\mathcal{G}}^{(t+1)},\dots,X_{\mathcal{G}}^{(t+T)}\big],$$

where $X_{\mathcal{G}}^{(i)}\in\mathbb{R}^{N\times C}$ and $C$ is the number of input features. We aim to train the mapping function $f(\cdot):\mathbb{R}^{T'\times N\times C}\rightarrow\mathbb{R}^{T\times N\times C}$, which predicts the next $T$ steps based on the given $T'$ observations. For the sake of simplicity, we omit $\mathcal{G}$ from $X_{\mathcal{G}}$ hereinafter.

Spatial Modeling Methods in Traffic Forecasting

To effectively forecast traffic signals, we first discuss spatial modeling, one of the necessities for traffic data modeling. In traffic forecasting, we can classify spatial modeling methods into four categories: 1) with an identity matrix (i.e., multivariate time-series forecasting), 2) with a pre-defined adjacency matrix, 3) with a trainable adjacency matrix, and 4) with attention (i.e., dynamic spatial modeling without prior knowledge). Conventionally, a graph topology $\mathcal{A}$ is constructed via an empirical law, including inverse distance (Li et al., 2018; Yu et al., 2018) and cosine similarity (Geng et al., 2019). However, these empirically built graph structures are not necessarily optimal, often resulting in poor spatial modeling quality. To address this challenge, a line of research (Wu et al., 2019; Bai et al., 2020; Jiang et al., 2023) has been proposed to capture the hidden spatial information. Specifically, a trainable function $g(\cdot,\theta)$ is used to derive the optimal topological representation $\tilde{\mathcal{A}}$ as:

$$\tilde{\mathcal{A}}=\mathrm{softmax}\big(\mathsf{relu}\big(g(X^{(t)},\theta)\,g(X^{(t)},\theta)^{\top}\big)\big),\qquad(1)$$

where $g(X^{(t)},\theta)\in\mathbb{R}^{N\times e}$ and $e$ is the embedding size. Spatial modeling based on Eq. 1 can be classified into two subcategories according to whether $g(\cdot,\theta)$ depends on $X^{(t)}$. Wu et al. (2019) define $g(\cdot,\theta)=E\in\mathbb{R}^{N\times e}$, which is time-independent and less noise-sensitive, but less suited to in-situ modeling. Cao et al. (2020) and Zhang et al. (2020) propose time-varying graph structure modeling with $g(H^{(t)},\theta)=H^{(t)}W$, where $W\in\mathbb{R}^{d\times e}$, projecting hidden states onto another embedding space. Ideally, this method models dynamic changes in graph topology, but it is noise-sensitive.
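
For illustration, here is a hedged PyTorch sketch of both subcategories of Eq. 1: a time-independent node embedding $E$ (as in Graph-WaveNet) and an input-conditioned projection $H^{(t)}W$. Class and variable names are ours, not from any official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticAdaptiveAdjacency(nn.Module):
    """Eq. 1 with g(., theta) = E: a learnable, time-independent node embedding."""
    def __init__(self, num_nodes: int, emb_dim: int):
        super().__init__()
        self.E = nn.Parameter(torch.randn(num_nodes, emb_dim))

    def forward(self) -> torch.Tensor:
        # A_tilde = softmax(relu(E E^T)): row-normalized learned adjacency
        return F.softmax(F.relu(self.E @ self.E.t()), dim=-1)

class TimeVaryingAdjacency(nn.Module):
    """Eq. 1 with g(H^(t), theta) = H^(t) W: conditioned on current hidden states."""
    def __init__(self, hidden_dim: int, emb_dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(hidden_dim, emb_dim))

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: (num_nodes, hidden_dim) hidden states at time step t
        e = h_t @ self.W                               # (num_nodes, emb_dim)
        return F.softmax(F.relu(e @ e.t()), dim=-1)    # time-varying adjacency
```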

To reduce noise sensitivity while obtaining a time-varying graph structure, Zheng et al. (2020) adopt a spatial attention mechanism for traffic forecasting. Given the input $H_i$ of node $i$ and its spatial neighbors $\mathcal{N}_i$, they compute spatial attention using multi-head attention as follows:

$$H^{*}_{i}=\mathsf{Concat}(o_{i}^{(1)},\dots,o_{i}^{(K)})W^{O};\qquad o^{(k)}_{i}=\sum_{s\in\mathcal{N}_{i}}\alpha_{i,s}\cdot f_{v}^{(k)}(H_{s})\qquad(2)$$
$$\alpha_{i,j}=\frac{\exp(e_{i,j})}{\sum_{s\in\mathcal{N}_{i}}\exp(e_{i,s})};\qquad e_{i,j}=\frac{\big(f_{q}^{(k)}(H_{i})\big)\big(f_{k}^{(k)}(H_{j})\big)^{\top}}{\sqrt{d_{k}}},\qquad(3)$$

where $W^{O}$ is a projection layer, $d_{k}$ is the dimension of the key vector, and $f^{(k)}_{q}(\cdot)$, $f^{(k)}_{k}(\cdot)$, and $f^{(k)}_{v}(\cdot)$ are the query, key, and value projections of the $k$-th head, respectively. Although effective, these attention-based approaches still suffer from irregular spatial modeling, such as less accurate self-attention (i.e., from node $i$ to $i$) (Park et al., 2020) and uniformly distributed, uninformative attention regardless of spatial relationships (Jin et al., 2023).
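
A minimal PyTorch sketch of Eqs. 2-3, with $\mathcal{N}_i=\mathcal{V}$ (no spatial restriction), as TESTAM uses later; module and tensor names are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Multi-head spatial attention over nodes (Eqs. 2-3); a sketch, not official code."""
    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        assert hidden_dim % num_heads == 0
        self.num_heads, self.d_k = num_heads, hidden_dim // num_heads
        self.f_q = nn.Linear(hidden_dim, hidden_dim)
        self.f_k = nn.Linear(hidden_dim, hidden_dim)
        self.f_v = nn.Linear(hidden_dim, hidden_dim)
        self.W_o = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, hidden_dim); N_i = V, i.e., every node attends to every node
        n = h.size(0)
        q = self.f_q(h).view(n, self.num_heads, self.d_k).transpose(0, 1)  # (K, N, d_k)
        k = self.f_k(h).view(n, self.num_heads, self.d_k).transpose(0, 1)
        v = self.f_v(h).view(n, self.num_heads, self.d_k).transpose(0, 1)
        e = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (K, N, N) scores e_ij (Eq. 3)
        alpha = F.softmax(e, dim=-1)                    # attention over all nodes
        o = (alpha @ v).transpose(0, 1).reshape(n, -1)  # concat heads (Eq. 2)
        return self.W_o(o)
```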

3.2 Model Architecture

Figure 1: Overview of TESTAM. Left: the architecture of each expert. Middle: the workflow and routing mechanism of TESTAM; solid lines indicate forward paths, and dashed lines represent backward paths. Right: the three spatial modeling methods of TESTAM; black lines indicate spatial connectivity, and red lines represent the information flow corresponding to spatial connectivity. The identity, adaptive, and attention experts are responsible for temporal modeling, spatial modeling with a learnable static graph, and spatial modeling with a dynamic graph (i.e., attention), respectively.

Although transformers are well-established structures for time-series forecasting, they have several problems when used for spatio-temporal modeling: they do not consider spatial modeling, consume considerable memory, and have bottleneck problems caused by the autoregressive decoding process. Park et al. (2020) have introduced an improved transformer model with graph attention (GAT), but the model still has autoregressive properties. To eliminate the autoregressive characteristics while preserving the advantages of the encoder–decoder architecture, TESTAM transfers the attention domain through time-enhanced attention and temporal information embedding. As shown in Fig. 1 (left), in addition to temporal information embedding, each expert layer consists of four sublayers: temporal attention, spatial modeling, time-enhanced attention, and a point-wise feed-forward neural network. Each sublayer is connected to a bypass through skip connections. To improve generalization, we apply layer normalization after each sublayer. All experts have the same hidden size and number of layers and differ only in their spatial modeling methods.
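
The sublayer wiring can be sketched as follows; this is a structural sketch under our assumptions (sublayers are abstracted as shape-preserving modules), not the official implementation.

```python
import torch
import torch.nn as nn

class ExpertLayer(nn.Module):
    """Structural sketch of one TESTAM expert layer: four sublayers (temporal
    attention, expert-specific spatial modeling, time-enhanced attention,
    point-wise feed-forward), each followed by a skip connection and layer
    normalization. Sublayer internals are abstracted as passed-in modules."""
    def __init__(self, d: int, temporal_attn: nn.Module, spatial: nn.Module,
                 time_enhanced_attn: nn.Module, ff_dim: int = 128):
        super().__init__()
        ffn = nn.Sequential(nn.Linear(d, ff_dim), nn.ReLU(), nn.Linear(ff_dim, d))
        self.sublayers = nn.ModuleList([temporal_attn, spatial, time_enhanced_attn, ffn])
        self.norms = nn.ModuleList([nn.LayerNorm(d) for _ in range(4)])

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for sublayer, norm in zip(self.sublayers, self.norms):
            h = norm(h + sublayer(h))  # bypass via skip connection, then post-norm
        return h
```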

Temporal Information Embedding

Since temporal features (e.g., time of day) act as a global position with a specific periodicity, we omit the position embedding of the original transformer architecture. Furthermore, instead of normalized temporal features, we utilize Time2Vec embedding (Kazemi et al., 2019) for periodicity and linearity modeling. Specifically, for a temporal feature $\tau\in\mathbb{N}$, we represent $\tau$ with an $h$-dimensional embedding vector $v(\tau)$ and learnable parameters $w_i,\phi_i$ for each embedding dimension $i$ as below:

$$TIM(\tau)[i]=\begin{cases}w_{i}v(\tau)[i]+\phi_{i},&\text{if }i=0\\\mathcal{F}(w_{i}v(\tau)[i]+\phi_{i}),&\text{if }1\leq i\leq h-1,\end{cases}\qquad(4)$$

where $\mathcal{F}$ is a periodic activation function. Using Time2Vec embedding, we enable the model to utilize the temporal information of labels. Here, the temporal information embedding of an input sequence is concatenated with the other input features and then projected onto the hidden size $h$.
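
A minimal sketch of Eq. 4, assuming $v(\tau)[i]=\tau$ (the scalar time index, as in the original Time2Vec) and sine as the periodic activation $\mathcal{F}$; both choices are our assumptions.

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Time2Vec embedding (Eq. 4): one linear component (i = 0) and h-1 periodic
    components. We assume v(tau)[i] = tau and F = sin, common choices in
    Kazemi et al. (2019); TESTAM's exact choices may differ."""
    def __init__(self, h: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(h))
        self.phi = nn.Parameter(torch.randn(h))

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: (...,) temporal feature, e.g., a time-of-day index
        x = tau.unsqueeze(-1).float() * self.w + self.phi  # (..., h)
        return torch.cat([x[..., :1], torch.sin(x[..., 1:])], dim=-1)
```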

Temporal Attention

As temporal attention in TESTAM is the same as that of transformers, we describe the benefits of temporal attention. Recent studies (Li et al., 2018; Bai et al., 2020) have shown that attention is an appealing solution for temporal modeling because, unlike recurrent unit-based or convolution-based temporal modeling, it can be used to directly attend to features across time steps with no restrictions. Temporal attention allows parallel computation and is beneficial for long-term sequence modeling. Moreover, it has less inductive bias in terms of locality and sequentiality. Although strong inductive bias can help the training, less inductive bias enables better generalization. Furthermore, for the traffic forecasting problem, causality among roads is an unavoidable factor (Jin et al., 2023) that cannot be easily modeled in the presence of strong inductive bias, such as sequentiality or locality.

Spatial Modeling Layer

In this work, we leverage three spatial modeling layers, one for each expert, as shown in the middle of Fig. 1: spatial modeling with an identity matrix (i.e., no spatial modeling), spatial modeling with a learnable adjacency matrix (Eq. 1), and spatial modeling with attention (Eqs. 2 and 3). For the attention expert, we compute attention with $\forall i\in\mathcal{V},\ \mathcal{N}_{i}=\mathcal{V}$, i.e., attention with no spatial restrictions. This setting enables similarity-based attention, resulting in better generalization.

Inspired by the success of memory-augmented graph structure learning (Jiang et al., 2023; Lee et al., 2022), we propose a modified meta-graph learner that learns prototypes for both spatial graph modeling and gating networks. Our meta-graph learner consists of two individual neural networks built around a meta-node bank $\mathbf{M}\in\mathbb{R}^{m\times e}$, where $m$ and $e$ denote the number of memory items and the dimension of each memory, respectively: a hyper-network (Ha et al., 2017) for generating node embeddings conditioned on $\mathbf{M}$, and gating networks that calculate the similarities between experts’ hidden states and queried memory items. In this section, we mainly focus on the hyper-network. We construct a graph structure with the meta-node bank $\mathbf{M}$ and a projection $W_{E}\in\mathbb{R}^{e\times d}$ as follows:

$$E=\mathbf{M}W_{E};\qquad\tilde{A}=\mathrm{softmax}(\mathsf{relu}(EE^{\top}))$$

By constructing a memory-augmented graph, the model achieves better context-aware spatial modeling than what is achievable with other learnable static graphs (e.g., graph modeling with $E\in\mathbb{R}^{N\times d}$). Detailed explanations of end-to-end training and meta-node bank queries are provided in Sec. 3.3.
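
The following sketch shows how such a memory-conditioned graph could be constructed. The mapping from the $m$ memory items to $N$ node embeddings via the hyper-network is not fully specified in this section, so the node-to-memory attention below is a hypothetical stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaGraphLearner(nn.Module):
    """Memory-augmented graph construction sketch: node embeddings are generated
    from a meta-node bank M, projected by W_E, and the adjacency is
    softmax(relu(E E^T)). The node-to-memory attention is our assumption."""
    def __init__(self, num_nodes: int, m: int, e: int, d: int):
        super().__init__()
        self.M = nn.Parameter(torch.randn(m, e))                    # meta-node bank
        self.node_query = nn.Parameter(torch.randn(num_nodes, e))   # hypothetical
        self.W_E = nn.Parameter(torch.randn(e, d))

    def forward(self) -> torch.Tensor:
        # condition node embeddings on the memory bank (hypothetical hyper-network)
        attn = F.softmax(self.node_query @ self.M.t(), dim=-1)      # (N, m)
        E = (attn @ self.M) @ self.W_E                              # (N, d)
        return F.softmax(F.relu(E @ E.t()), dim=-1)                 # (N, N)
```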

Time-Enhanced Attention

To eliminate the error propagation effects caused by auto-regressive characteristics, we propose a time-enhanced attention layer that helps the model transfer its domain from the historical $T'$ time steps (i.e., the source domain) to the next $T$ time steps (i.e., the target domain). Let $\tau^{(t)}_{label}=[\tau^{(t+1)},\dots,\tau^{(t+T)}]$ be the temporal feature vector of the label. We calculate the attention score from the source time step $i$ to the target time step $j$ as:

$$\alpha_{i,j}=\frac{\exp(e_{i,j})}{\sum_{k=t+1}^{T}\exp(e_{i,k})},\qquad e_{i,j}=\frac{(H^{(i)}W_{q}^{(k)})(\mathrm{TIM}(\tau^{(j)})W_{k}^{(k)})^{\top}}{\sqrt{d_{k}}},\qquad(5)$$

where $d_{k}=d/K$, $K$ is the number of heads, and $W_{q}^{(k)},W_{k}^{(k)}$ are linear transformation matrices. We can calculate the attention output using the same process as in Eq. 2, except that time-enhanced attention attends to the time steps of each node, whereas Eq. 2 attends to the important nodes at each time step.
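
A single-head sketch of Eq. 5 for one node, for clarity. Applying Eq. 2's aggregation with value-projected label-time embeddings is our reading of the text; the multi-head extension and the mapping back to the $T$ output steps are omitted.

```python
import torch

def time_enhanced_attention(h_src, tim_label, W_q, W_k, W_v):
    """h_src: (T_in, d) hidden states of one node; tim_label: (T_out, d)
    Time2Vec embeddings TIM(tau_label) of the label time steps."""
    d_k = W_k.shape[1]
    e = (h_src @ W_q) @ (tim_label @ W_k).T / d_k ** 0.5  # e_{i,j}: source i -> target j
    alpha = torch.softmax(e, dim=-1)                      # normalize over target steps
    return alpha @ (tim_label @ W_v)                      # o_i = sum_j alpha_{i,j} f_v(TIM)

# hypothetical shapes: T_in = T_out = 12, d = 32
h = torch.randn(12, 32)
tim = torch.randn(12, 32)
Wq, Wk, Wv = (torch.randn(32, 32) for _ in range(3))
out = time_enhanced_attention(h, tim, Wq, Wk, Wv)         # (12, 32)
```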

3.3 Gating Networks

In this section, we describe the gating networks used for in-situ routing. Conventional MoE models have multiple experts with the same architecture and conduct coarse-grained routing, focusing on increasing model capacity without additional computational costs (Shazeer et al., 2017). However, coarse-grained routing provides experts with limited opportunities for specialization. Furthermore, in the case of the regression problem, existing MoEs hardly change their routing decisions after initialization because the gate is not guided by the gradients of regression tasks, as Dryden & Hoefler (2022) have revealed. Consequently, gating networks cause “mismatches,” resulting in uninformative and unchanging routing. Moreover, using the same architecture for all experts is less beneficial in terms of generalization since they also share the same inductive bias.

To resolve this issue, we propose novel memory-based gating networks and two classification losses with regression-error-based pseudo labels. Existing memory-based traffic forecasting approaches (Lee et al., 2022; Jiang et al., 2023) reconstruct the encoder’s hidden state with memory items, allowing the memory to store typical features of seen samples for pattern matching. In contrast, we aim to learn the direct relationship between input signals and output representations. For node $i$ at time step $t$, we define the memory-querying process as follows:

$$Q_{i}^{(t)}=X_{i}^{(t)}W_{q}+b_{q};\qquad a_{j}=\frac{\exp(Q_{i}^{(t)}M[j]^{\top})}{\sum_{j'=1}^{m}\exp(Q_{i}^{(t)}M[j']^{\top})};\qquad O_{i}^{(t)}=\sum_{j=1}^{m}a_{j}M[j],$$

where $M[j]$ is the $j$-th memory item, and $W_{q}$ and $b_{q}$ are learnable parameters for the input projection. Let $z_{e}$ be the output representation of expert $e$. Given the queried memory $O_{i}^{(t)}\in\mathbb{R}^{e}$, we calculate the routing probability $p_{e}$ as shown below:

$$r_{e}=g(z_{e},O_{i}^{(t)});\qquad p_{e}=\frac{r_{e}}{\sum_{e'\in[e_{1},\dots,e_{E}]}r_{e'}},$$
where $E$ is the number of experts. Since we use the similarity between output states and queried memory as the routing probability, solving the routing problem induces the memory to learn typical output representations and the input-output relationship. We select the top-1 expert’s output as the final output.
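
A hedged sketch of the gating path. The similarity function $g$ is not spelled out above, so we assume a dot product, and we normalize scores with a softmax for numerical safety instead of the plain sum normalization of $r_e$; both are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryGate(nn.Module):
    """Memory-based gating sketch: query the memory bank with the raw input and
    score each expert by the similarity between its output representation z_e
    and the queried memory O."""
    def __init__(self, in_dim: int, mem_dim: int, M: nn.Parameter):
        super().__init__()
        self.M = M                               # shared meta-node bank, (m, mem_dim)
        self.proj = nn.Linear(in_dim, mem_dim)   # Q = X W_q + b_q

    def forward(self, x: torch.Tensor, expert_outputs: torch.Tensor):
        # x: (B, in_dim) input signals; expert_outputs: (E, B, mem_dim) states z_e
        q = self.proj(x)                                    # (B, mem_dim)
        a = F.softmax(q @ self.M.t(), dim=-1)               # attention over memory items
        o = a @ self.M                                      # queried memory O, (B, mem_dim)
        r = torch.einsum("ebd,bd->be", expert_outputs, o)   # r_e = g(z_e, O), dot product
        p = F.softmax(r, dim=-1)                            # routing probability p_e
        return p, p.argmax(dim=-1)                          # top-1 expert per sample
```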

Routing Classification Losses

To enable fine-grained routing that fits the regression problem, we adopt two classification losses: one to avoid the worst routing and another to find the best routing. Inspired by SMoEs, we define the worst-route avoidance loss as the cross-entropy loss with pseudo labels $l_{e}$, as shown below:

$$L_{worst}(\mathbf{p})=-\frac{1}{E}\sum_{e}l_{e}\log(p_{e})\qquad(6)$$
$$l_{e}=\begin{cases}1&\text{if }L(y,\hat{y})\text{ is smaller than the }q\text{-th quantile and }p_{e}=\arg\max(\mathbf{p})\\1/(E-1)&\text{if }L(y,\hat{y})\text{ is greater than the }q\text{-th quantile and }p_{e}\neq\arg\max(\mathbf{p})\\0&\text{otherwise,}\end{cases}$$

where $\hat{y}$ is the output of the selected expert, and $q$ is an error quantile. If an expert is incorrectly selected, its label becomes zero and each unselected expert receives the pseudo label $1/(E-1)$, meaning that the unselected experts are equally likely to be chosen.

We also propose a best-route selection loss for more precise routing. However, as traffic data are noisy and contain many nonstationary characteristics, selecting the best route is not an easy task. Therefore, instead of choosing the best route for every time step and every node, we calculate node-wise routing. Our best-route selection loss is similar to Eq. 6, except that it calculates node-wise pseudo labels and routing probabilities, and the condition for the pseudo labels is changed from “$L(y,\hat{y})$ is greater/smaller than the $q$-th quantile” to “$L(y,\hat{y})$ is greater/smaller than the $(1-q)$-th quantile.” Detailed explanations are provided in Appendix A.
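
A sketch of the worst-route avoidance loss (Eq. 6) with error-quantile pseudo labels; computing the quantile over the batch and the exact reduction are our reading, not confirmed implementation details.

```python
import torch

def worst_route_avoidance_loss(p: torch.Tensor, sample_err: torch.Tensor, q: float = 0.7):
    """Eq. 6: cross-entropy against pseudo labels l_e built from regression error.
    p: (B, E) routing probabilities; sample_err: (B,) error L(y, y_hat) of the
    selected expert per sample."""
    B, E = p.shape
    thresh = torch.quantile(sample_err, q)      # q-th quantile of errors
    chosen = p.argmax(dim=-1)                   # currently selected expert
    labels = torch.zeros_like(p)
    good = sample_err < thresh                  # routing considered acceptable
    labels[good, chosen[good]] = 1.0            # reinforce the chosen expert
    bad = ~good                                 # among the worst routings
    labels[bad] = 1.0 / (E - 1)                 # equal mass on unselected experts
    labels[bad, chosen[bad]] = 0.0              # zero out the bad choice
    return -(labels * torch.log(p + 1e-9)).sum(dim=-1).mean() / E

# hypothetical usage: 8 samples routed over E = 3 experts
p = torch.softmax(torch.randn(8, 3), dim=-1)
loss = worst_route_avoidance_loss(p, torch.rand(8))
```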

Table 1: Experimental results on three real-world datasets with 13 baseline models and TESTAM. The values in bold indicate the best, and underlined values indicate the second-best performance.
METR-LA (MAE / RMSE / MAPE):

| Model | 15 min | 30 min | 60 min |
|---|---|---|---|
| HA (Li et al., 2018) | 4.16 / 7.80 / 13.00% | 4.16 / 7.80 / 13.00% | 4.16 / 7.80 / 13.00% |
| STGCN (Yu et al., 2018) | 2.88 / 5.74 / 7.62% | 3.47 / 7.24 / 9.57% | 4.59 / 9.40 / 12.70% |
| DCRNN (Li et al., 2018) | 2.77 / 5.38 / 7.30% | 3.15 / 6.45 / 8.80% | 3.60 / 7.59 / 10.50% |
| Graph-WaveNet (Wu et al., 2019) | 2.69 / 5.15 / 6.90% | 3.07 / 6.22 / 8.37% | 3.53 / 7.37 / 10.01% |
| STTN (Xu et al., 2020) | 2.79 / 5.48 / 7.19% | 3.16 / 6.50 / 8.53% | 3.60 / 7.60 / 10.16% |
| GMAN (Zheng et al., 2020) | 2.80 / 5.55 / 7.41% | 3.12 / 6.49 / 8.73% | 3.44 / 7.35 / 10.07% |
| MTGNN (Wu et al., 2020) | 2.69 / 5.18 / 6.86% | 3.05 / 6.17 / 8.19% | 3.49 / 7.23 / 9.87% |
| StemGNN (Cao et al., 2020) | 2.56 / 5.06 / 6.46% | 3.01 / 6.03 / 8.23% | 3.43 / 7.23 / 9.85% |
| AGCRN (Bai et al., 2020) | 2.86 / 5.55 / 7.55% | 3.25 / 6.57 / 8.99% | 3.68 / 7.56 / 10.46% |
| CCRNN (Ye et al., 2021) | 2.85 / 5.54 / 7.50% | 3.24 / 6.54 / 8.90% | 3.73 / 7.65 / 10.59% |
| GTS (Shang et al., 2021) | 2.65 / 5.20 / 6.80% | 3.05 / 6.22 / 8.28% | 3.47 / 7.29 / 9.83% |
| PM-MemNet (Lee et al., 2022) | 2.65 / 5.29 / 7.01% | 3.03 / 6.29 / 8.42% | 3.46 / 7.29 / 9.97% |
| MegaCRN (Jiang et al., 2023) | 2.52 / 4.94 / 6.44% | 2.93 / 6.06 / 7.96% | 3.38 / 7.23 / 9.72% |
| TESTAM | 2.54 / 4.93 / 6.42% | 2.96 / 6.04 / 7.92% | 3.36 / 7.09 / 9.67% |

PEMS-BAY (MAE / RMSE / MAPE):

| Model | 15 min | 30 min | 60 min |
|---|---|---|---|
| HA (Li et al., 2018) | 2.88 / 5.59 / 6.80% | 2.88 / 5.59 / 6.80% | 2.88 / 5.59 / 6.80% |
| STGCN (Yu et al., 2018) | 1.36 / 2.96 / 2.90% | 1.81 / 4.27 / 4.17% | 2.49 / 5.69 / 5.79% |
| DCRNN (Li et al., 2018) | 1.38 / 2.95 / 2.90% | 1.74 / 3.97 / 3.90% | 2.07 / 4.74 / 4.90% |
| Graph-WaveNet (Wu et al., 2019) | 1.30 / 2.74 / 2.73% | 1.63 / 3.70 / 3.67% | 1.95 / 4.52 / 4.63% |
| STTN (Xu et al., 2020) | 1.36 / 2.87 / 2.89% | 1.67 / 3.79 / 3.78% | 1.95 / 4.50 / 4.58% |
| GMAN (Zheng et al., 2020) | 1.35 / 2.90 / 2.87% | 1.65 / 3.82 / 3.74% | 1.92 / 4.49 / 4.52% |
| MTGNN (Wu et al., 2020) | 1.32 / 2.79 / 2.77% | 1.65 / 3.74 / 3.69% | 1.94 / 4.49 / 4.53% |
| StemGNN (Cao et al., 2020) | 1.23 / 2.48 / 2.63% | N/A from (Cao et al., 2020) | N/A from (Cao et al., 2020) |
| AGCRN (Bai et al., 2020) | 1.36 / 2.88 / 2.93% | 1.69 / 3.87 / 3.86% | 1.98 / 4.59 / 4.63% |
| CCRNN (Ye et al., 2021) | 1.38 / 2.90 / 2.90% | 1.74 / 3.87 / 3.90% | 2.07 / 4.65 / 4.87% |
| GTS (Shang et al., 2021) | 1.34 / 2.84 / 2.83% | 1.67 / 3.83 / 3.79% | 1.98 / 4.56 / 4.59% |
| PM-MemNet (Lee et al., 2022) | 1.34 / 2.82 / 2.81% | 1.65 / 3.76 / 3.71% | 1.95 / 4.49 / 4.54% |
| MegaCRN (Jiang et al., 2023) | 1.28 / 2.72 / 2.67% | 1.60 / 3.68 / 3.57% | 1.88 / 4.42 / 4.41% |
| TESTAM | 1.29 / 2.77 / 2.61% | 1.59 / 3.65 / 3.56% | 1.85 / 4.33 / 4.31% |

EXPY-TKY (MAE / RMSE / MAPE):

| Model | 10 min | 30 min | 60 min |
|---|---|---|---|
| HA (Li et al., 2018) | 7.63 / 11.96 / 31.26% | 7.63 / 11.96 / 31.25% | 7.63 / 11.96 / 31.24% |
| STGCN (Yu et al., 2018) | 6.09 / 9.60 / 24.84% | 6.91 / 10.99 / 30.24% | 8.41 / 12.70 / 32.90% |
| DCRNN (Li et al., 2018) | 6.04 / 9.44 / 25.54% | 6.85 / 10.87 / 31.02% | 7.45 / 11.86 / 34.61% |
| Graph-WaveNet (Wu et al., 2019) | 5.91 / 9.30 / 25.22% | 6.59 / 10.54 / 29.78% | 6.89 / 11.07 / 31.71% |
| STTN (Xu et al., 2020) | 5.90 / 9.27 / 25.67% | 6.53 / 10.40 / 29.82% | 6.99 / 11.23 / 32.52% |
| GMAN (Zheng et al., 2020) | 6.09 / 9.49 / 26.52% | 6.64 / 10.55 / 30.19% | 7.05 / 11.28 / 32.91% |
| MTGNN (Wu et al., 2020) | 5.86 / 9.26 / 24.80% | 6.49 / 10.44 / 29.23% | 6.81 / 11.01 / 31.39% |
| StemGNN (Cao et al., 2020) | 6.08 / 9.46 / 25.87% | 6.85 / 10.80 / 31.25% | 7.46 / 11.88 / 35.31% |
| AGCRN (Bai et al., 2020) | 5.99 / 9.38 / 25.71% | 6.64 / 10.63 / 29.81% | 6.99 / 11.29 / 32.13% |
| CCRNN (Ye et al., 2021) | 5.90 / 9.29 / 24.53% | 6.68 / 10.77 / 29.93% | 7.11 / 11.56 / 32.56% |
| GTS (Shang et al., 2021) | - / - / - | - / - / - | - / - / - |
| PM-MemNet (Lee et al., 2022) | 5.94 / 9.25 / 25.10% | 6.52 / 10.42 / 29.00% | 6.87 / 11.14 / 31.22% |
| MegaCRN (Jiang et al., 2023) | 5.81 / 9.20 / 24.49% | 6.44 / 10.33 / 28.92% | 6.83 / 11.04 / 31.02% |
| TESTAM | 5.84 / 9.23 / 25.36% | 6.42 / 10.24 / 28.90% | 6.75 / 11.01 / 31.01% |

4 Experiments

In this section, we describe experiments and compare the accuracy of TESTAM with that of existing models. We use three benchmark datasets for the experiments: METR-LA, PEMS-BAY, and EXPY-TKY. METR-LA and PEMS-BAY contain four-month speed data recorded by 207 sensors on Los Angeles highways and 325 sensors in the Bay Area, respectively (Li et al., 2018). EXPY-TKY consists of three-month speed data collected from 1843 links in Tokyo, Japan. As EXPY-TKY covers a larger number of roads in a smaller area, its spatial dependencies, with many abruptly changing speed patterns, are more difficult to model than those in METR-LA or PEMS-BAY. METR-LA and PEMS-BAY have 5-minute interval speeds and timestamps, whereas EXPY-TKY has 10-minute interval speeds and timestamps. Before training TESTAM, we perform z-score normalization. In the cases of METR-LA and PEMS-BAY, we use 70% of the data for training, 10% for validation, and 20% for evaluation. For EXPY-TKY, we utilize the first two months for training and validation and the last month for testing, as in the MegaCRN paper (Jiang et al., 2023).
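As an illustration of this preprocessing, the sketch below shows a chronological 70/10/20 split with z-score statistics fitted on the training split only (a standard practice; the helper names are ours, not the official TESTAM code):

```python
import numpy as np

def split_and_normalize(speeds, train_ratio=0.7, val_ratio=0.1):
    """speeds: (T, N) array of readings from N sensors over T timestamps."""
    t = len(speeds)
    n_train, n_val = int(t * train_ratio), int(t * val_ratio)
    train = speeds[:n_train]
    val = speeds[n_train:n_train + n_val]
    test = speeds[n_train + n_val:]
    # z-score statistics come from the training split only,
    # so that validation/test information does not leak into training
    mean, std = train.mean(), train.std()
    z = lambda a: (a - mean) / std
    return z(train), z(val), z(test), (mean, std)
```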

4.1 Experimental Settings

For all three datasets, we initialize the parameters and embedding using Xavier initialization. After performing a greedy search for hyperparameters, we set the hidden size $d=e=32$, the memory size $m=20$, the number of layers $l=3$, the number of heads $K=4$, the hidden size for the feed-forward networks $h_{ff}=128$, and the error quantile $q=0.7$. We use the Adam optimizer with $\beta_1=0.9$, $\beta_2=0.98$, and $\epsilon=10^{-9}$, as in Vaswani et al. (2017). We vary the learning rate during training using the cosine annealing warmup restart scheduler (Loshchilov & Hutter, 2017) according to the formula below:

$$lrate=\begin{cases}lr_{min}+(lr_{max}-lr_{min})\cdot\frac{T_{cur}}{T_{warm}}&\text{for the first }T_{warm}\text{ steps}\\[4pt] lr_{min}+\frac{1}{2}(lr_{max}-lr_{min})\left(1+\cos\left(\frac{T_{cur}}{T_{freq}}\pi\right)\right)&\text{otherwise,}\end{cases}\qquad(7)$$

where $T_{cur}$ is the number of steps since the last restart. We use $T_{warm}=T_{freq}=4000$ and $lr_{min}=10^{-7}$ for all datasets and set $lr_{max}=3\times10^{-3}$ for METR-LA and PEMS-BAY and $lr_{max}=3\times10^{-4}$ for EXPY-TKY. We follow the traditional 12-sequence (1 hour) input and 12-sequence output forecasting setting for METR-LA and PEMS-BAY and the 6-sequence (1 hour) input and 6-sequence output setting for EXPY-TKY, as in Jiang et al. (2023). We utilize mean absolute error (MAE) as the loss function and root mean squared error (RMSE) and mean absolute percentage error (MAPE) as evaluation metrics. All experiments are conducted on an RTX 3090 GPU.
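For concreteness, a minimal sketch of the schedule in Eq. 7 follows; the function and argument names are ours, and the defaults reflect the METR-LA/PEMS-BAY settings above:

```python
import math

def lr_schedule(t_cur, lr_min=1e-7, lr_max=3e-3, t_warm=4000, t_freq=4000):
    """Cosine annealing with warmup restarts (Eq. 7).

    t_cur counts the steps since the last restart.
    """
    if t_cur < t_warm:
        # linear warmup for the first T_warm steps
        return lr_min + (lr_max - lr_min) * t_cur / t_warm
    # cosine annealing afterwards
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(t_cur / t_freq * math.pi))
```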

We compare TESTAM with 13 baseline models: (1) historical average; (2) STGCN (Yu et al., 2018), a model with GCNs and CNNs; (3) DCRNN (Li et al., 2018), a model with graph convolutional recurrent units; (4) Graph-WaveNet (Wu et al., 2019) with a parameterized adjacency matrix; (5) STTN (Xu et al., 2020) and (6) GMAN (Zheng et al., 2020), state-of-the-art attention-based models; (7) MTGNN (Wu et al., 2020), (8) StemGNN (Cao et al., 2020), and (9) AGCRN (Bai et al., 2020), advanced models with an adaptive matrix; (10) CCRNN (Ye et al., 2021), a model with multiple adaptive matrices; (11) GTS (Shang et al., 2021), a model with a graph constructed with long-term historical data; and (12) PM-MemNet (Lee et al., 2022) and (13) MegaCRN (Jiang et al., 2023), state-of-the-art models with memory units.

4.2 Experimental Results

The experimental results are shown in Table 1. TESTAM outperforms all other models, especially in long-term predictions, which are usually more difficult. Note that we use the results reported in the respective papers after comparing them with results reproduced using the official code provided by the authors. The models with learnable static graphs (Graph-WaveNet, MTGNN, and CCRNN) and dynamic graphs (STTN and GMAN) show competitive performance, indicating that they have certain advantages. In terms of temporal modeling, RNN-based temporal models (DCRNN and AGCRN) perform worse than the other methods in long-term forecasting due to the error accumulation of RNNs. Conversely, MegaCRN and PM-MemNet maintain their advantages even in long-term forecasting by injecting a memory-augmented representation vector into the decoder. GMAN and StemGNN perform worse on EXPY-TKY, revealing disadvantages of attention methods, such as long-tail problems and uniformly distributed attention (Jin et al., 2023).

As EXPY-TKY has 6–9 times more roads than the other two datasets, the experimental results on EXPY-TKY highlight the importance of spatial modeling. For example, attention-based spatial modeling methods show disadvantages, and the results of modeling with time-varying networks (e.g., StemGNN) suggest that they cannot properly capture spatial dependencies. In contrast, our model, TESTAM, shows its superiority over all other models, including those with learnable matrices. The results demonstrate that in-situ spatial modeling is crucial for traffic forecasting.

4.3 Ablation Study

Table 2: Ablation study results across all prediction windows (i.e., average performance)
Ablation METR-LA PEMS-BAY EXPY-TKY
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
w/o gating 3.00 6.12 8.29% 1.58 3.57 3.53% 6.74 10.97 29.48%
Ensemble 2.98 6.08 8.12% 1.56 3.53 3.50% 6.66 10.68 29.43%
worst-route avoidance only 2.96 6.06 8.11% 1.55 3.52 3.48% 6.45 10.50 28.70%
Replaced 2.97 6.04 8.05% 1.56 3.54 3.47% 6.56 10.62 29.20%
w/o TIM 2.96 5.98 8.07% 1.54 3.45 3.46% 6.44 10.40 28.94%
w/o time-enhanced attention 2.99 6.03 8.15% 1.58 3.59 3.52% 6.64 10.75 29.85%
TESTAM 2.93 5.95 7.99% 1.53 3.47 3.41% 6.40 10.40 28.67%

The ablation study has two goals: to evaluate actual improvements achieved by each method, and to test two hypotheses: (1) in-situ modeling with diverse graph structures is advantageous for traffic forecasting and (2) having two loss functions for avoiding the worst route and leading to the best route is effective. To achieve these aims, we have designed a set of TESTAM variants, which are described below:

w/o gating

It uses only the output of the attention experts without ensembles or any other gating mechanism. Memory items are not trained because there are no gradient flows for the adaptive expert or gating networks. This setting results in an architecture similar to that of GMAN.

Ensemble

Instead of using MoE routing, the final output is calculated as the weighted sum of each expert’s output, with weights produced by the gating network. This setting allows the use of all spatial modeling methods but no in-situ modeling.

worst-route avoidance only

It excludes the loss for guiding best-route selection. The exclusion of this relatively coarse-grained loss function is based on the fact that coarse-grained routing tends not to change its decisions after initialization (Dryden & Hoefler, 2022).

Replaced

It does not exclude any components. Instead, it replaces the identity expert with a GCN-based adaptive expert, reducing spatial modeling diversity. The purpose of this setting is to test the hypothesis that in-situ modeling with diverse graph structures is helpful for traffic forecasting.

w/o TIM

It replaces temporal information embedding (TIM) with simple embedding vectors without periodic activation functions.

w/o time-enhanced attention
带有时间增强的注意力

It replaces time-enhanced attention with the basic temporal attention described in Sec. 3.2.

The experimental results shown in Table 2 indicate that our hypotheses are supported and that TESTAM is a complete and indivisible set. The results of “w/o gating” and “ensemble” suggest that in-situ modeling greatly improves traffic forecasting quality. The “w/o gating” results indicate that the performance improvement is not due to our model architecture alone but due to in-situ modeling itself, since this setting leads to performance comparable to that of GMAN (Zheng et al., 2020). The “worst-route avoidance only” results show that our hypothesis that both routing classification losses are crucial for proper routing is valid. Finally, the results of “replaced,” which are significantly worse even than those of “worst-route avoidance only,” confirm the hypothesis that diverse graph structures are helpful for in-situ modeling. Additional qualitative results with examples are provided in Appendix C.
2 中的实验结果表明我们的假设得到了支持,并且 TESTAM 是一个完整且不可分割的集合。“w/o gating”和“ensemble”的结果表明现场建模极大地提高了流量预测质量。“w/o gating”的结果表明性能的提升不是由于我们的模型,而是由于现场建模本身,因为这种设置导致的性能与 GMAN (Zheng et al., 2020 ) 相当。“仅避免最差路线”的结果表明我们的两种路线分类损失对于正确路线都至关重要的假设是正确的。最后,“replaced”的结果甚至比“仅避免最差路线”的结果性能差得多,证实了多样化的图结构有助于现场建模的假设。附录 C 中提供了其他定性结果和示例。

5 Conclusion

In this paper, we propose the time-enhanced spatio-temporal attention model (TESTAM), a novel Mixture-of-Experts model with attention that enables effective in-situ spatial modeling in both recurring and non-recurring situations. By transforming a routing problem into a classification task, TESTAM can contextualize various traffic conditions and choose the most appropriate spatial modeling method. TESTAM achieves superior performance to existing traffic forecasting models on three real-world datasets: METR-LA, PEMS-BAY, and EXPY-TKY. The results obtained using the EXPY-TKY dataset indicate that TESTAM is highly advantageous for large-scale graph structures, which are more applicable to real-world problems. We have also obtained qualitative results visualizing when and where TESTAM chooses specific graph structures. In future work, we plan to further improve and generalize TESTAM for other spatio-temporal and multivariate time series forecasting tasks.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00218913, No. 2021R1A2C1004542), by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants (No. 2020-0-01336, Artificial Intelligence Graduate School Program, UNIST), and by the Green Venture R&D Program (No. S3236472), funded by the Ministry of SMEs and Startups (MSS, Korea).

References

  • Bai et al. (2020) Lei Bai, Lina Yao, Can Li, Xinazhi Wang, and Can Wang. Adaptive graph convolutional recurrent network for traffic forecasting. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • Cao et al. (2020) Defu Cao, Yujing Wang, Juanyong Duan, Ce Zhang, Xia Zhu, Congrui Huang, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. Spectral temporal graph neural network for multivariate time-series forecasting. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • Dryden & Hoefler (2022) Nikoli Dryden and Torsten Hoefler. Spatial mixture-of-experts. In Advances in Neural Information Processing Systems, volume 35, 2022.
  • Eigen et al. (2014) David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. In International Conference on Learning Representations, 2014.
  • Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23:120:1–120:39, 2022.
  • Geng et al. (2019) Xu Geng, Yaguang Li, Leye Wang, Lingyu Zhang, Qiang Yang, Jieping Ye, and Yan Liu. Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3656–3663, 2019.
  • Ha et al. (2017) David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2017.
  • Jiang et al. (2023) Renhe Jiang, Zhaonan Wang, Jiawei Yong, Puneet Jeph, Quanjun Chen, Yasumasa Kobayashi, Xuan Song, Shintaro Fukushima, and Toyotaro Suzumura. Spatio-temporal meta-graph learning for traffic forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.  8078–8086, 2023.
  • Jin et al. (2023) Seungmin Jin, Hyunwook Lee, Cheonbok Park, Hyeshin Chu, Yunwon Tae, Jaegul Choo, and Sungahn Ko. A visual analytics system for improving attention-based traffic forecasting models. IEEE Transactions on Visualization and Computer Graphics, 29(1):1102–1112, 2023.
  • Kazemi et al. (2019) Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, and Marcus A. Brubaker. Time2vec: Learning a vector representation of time. CoRR, abs/1907.05321, 2019. URL http://arxiv.org/abs/1907.05321.
  • Lee et al. (2020) C. Lee, Y. Kim, S. Jin, D. Kim, R. Maciejewski, D. Ebert, and S. Ko. A visual analytics system for exploring, monitoring, and forecasting road traffic congestion. IEEE Transactions on Visualization and Computer Graphics, 26(11):3133–3146, 2020.
  • Lee et al. (2022) Hyunwook Lee, Seungmin Jin, Hyeshin Chu, Hongkyu Lim, and Sungahn Ko. Learning to remember patterns: Pattern matching memory networks for traffic forecasting. In International Conference on Learning Representations, 2022.
  • Li & Shahabi (2018) Yaguang Li and Cyrus Shahabi. A brief overview of machine learning methods for short-term traffic forecasting and future directions. SIGSPATIAL Special, 10(1):3–9, 2018.
  • Li et al. (2018) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations, 2018.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  • McGill & Perona (2017) Mason McGill and Pietro Perona. Deciding how to decide: Dynamic routing in artificial neural networks. In Proceedings of the International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.  2363–2372, 2017.
  • Park et al. (2020) Cheonbok Park, Chunggi Lee, Hyojin Bahng, Yunwon Tae, Seungmin Jin, Kihwan Kim, Sungahn Ko, and Jaegul Choo. ST-GRAT: A novel spatio-temporal graph attention networks for accurately forecasting dynamically changing road speed. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pp.  1215–1224. ACM, 2020.
  • Riquelme et al. (2021) Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems, pp.  8583–8595, 2021.
  • Rosenbaum et al. (2018) Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. In International Conference on Learning Representations, 2018.
  • Ryan et al. (2019) G. Ryan, A. Mosca, R. Chang, and E. Wu. At a glance: Pixel approximate entropy as a measure of line chart complexity. IEEE Transactions on Visualization and Computer Graphics, 25(01):872–881, 2019. ISSN 1941-0506. doi: 10.1109/TVCG.2018.2865264 .
  • Shang et al. (2021) Chao Shang, Jie Chen, and Jinbo Bi. Discrete graph structure learning for forecasting multiple time series. In International Conference on Learning Representations, 2021.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • Vlahogianni et al. (2014) Eleni I. Vlahogianni, Matthew G. Karlaftis, and John C. Golias. Short-term traffic forecasting: Where we are and where we’re going. Transportation Research Part C: Emerging Technologies, 43:3–19, 2014. Special Issue on Short-term Traffic Flow Forecasting.
  • Wu et al. (2019) Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph wavenet for deep spatial-temporal graph modeling. In Proceedings of the International Joint Conference on Artificial Intelligence, pp.  1907–1913, 2019.
  • Wu et al. (2020) Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, Xiaojun Chang, and Chengqi Zhang. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
  • Xu et al. (2020) Mingxing Xu, Wenrui Dai, Chunmiao Liu, Xing Gao, Weiyao Lin, Guo-Jun Qi, and Hongkai Xiong. Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908, 2020.
  • Ye et al. (2021) Junchen Ye, Leilei Sun, Bowen Du, Yanjie Fu, and Hui Xiong. Coupled layer-wise graph convolution for transportation demand prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 35(5):4617–4625, 2021.
  • Yu et al. (2018) Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the International Joint Conference on Artificial Intelligence, pp.  3634–3640, 2018.
  • Zhang et al. (2016) Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, and Xiuwen Yi. Dnn-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPACIAL ’16, 2016.
  • Zhang et al. (2020) Qi Zhang, Jianlong Chang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Spatio-temporal graph structure learning for traffic forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01):1177–1185, 2020.
  • Zheng et al. (2020) Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. GMAN: A graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  1234–1241, 2020.
  • Zhou et al. (2022) Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, and James Laudon. Mixture-of-experts with expert choice routing. In Advances in Neural Information Processing Systems, volume 35, 2022.

Appendix A Routing Classification Loss Function

In this section, we provide detailed information on the routing classification loss function. Both the worst-route avoidance and best-route selection functions are cross-entropy loss functions with different pseudo labels and routing levels. For worst-route avoidance, we compute fine-grained routing for each point of each road, as Dryden & Hoefler (2022) do. However, utilizing worst-route avoidance alone is suboptimal because experts have fewer opportunities to specialize for the best routing. Therefore, we adopt the best-route selection loss function for the routing problem. While designing the best-route selection loss function, we have two main concerns: 1) traffic data often show severe fluctuations, which prevent a model from consistently choosing the best-fit experts, and 2) best-route selection itself is a more complex task than worst-route avoidance, making the model hard to train with fine-grained routing. To overcome these challenges, we construct a node-wise best-route selection loss.

A.1 Worst-Route Avoidance Loss

For the worst-route avoidance loss function, we build our pseudo label $l_e$ as in Eq. 6. In this section, we describe how those labels are chosen. Given a prediction $\hat{Y}\in\mathbb{R}^{N\times T}$ and the ground truth $Y\in\mathbb{R}^{N\times T}$, we have a point-wise distance $L(Y,\hat{Y})\in\mathbb{R}^{N\times T}$ between prediction and ground truth. Given the point-wise distances and the error quantile $q$, we say that the routing of road $n$ at time $t$ is incorrect (i.e., the worst routing) if $L(y_{n,t},\hat{y}_{n,t})$ is greater than the $q$-th quantile. Therefore, if $L(y_{n,t},\hat{y}_{n,t})$ is greater than the $q$-th quantile, the pseudo label of the selected expert is zero, and the labels of the other, unselected experts are $1/(E-1)$, where $E$ is the total number of experts. Conversely, if $L(y_{n,t},\hat{y}_{n,t})$ is smaller than the $q$-th quantile, which means the routing is correct and the worst route is avoided, the selected expert has a pseudo label of one and the other experts have pseudo labels of zero. Formally, we define the pseudo label of expert $e$ as follows:

$$l_e=\begin{cases}1&\text{if }L(y,\hat{y})\text{ is smaller than the }q\text{-th quantile and }p_e=\operatorname{argmax}(\mathbf{p})\\ 1/(E-1)&\text{if }L(y,\hat{y})\text{ is greater than the }q\text{-th quantile and }p_e\neq\operatorname{argmax}(\mathbf{p})\\ 0&\text{otherwise}\end{cases}$$
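To make the labeling concrete, here is a minimal numpy sketch; the array shapes, function names, and the use of absolute error as the distance $L$ are our assumptions rather than the official implementation:

```python
import numpy as np

def worst_route_labels(y, y_hat, gate_probs, q=0.7):
    """Pseudo labels l_e for worst-route avoidance (a sketch, not the official code).

    y, y_hat   : (N, T) ground truth and prediction
    gate_probs : (N, T, E) routing probabilities p over E experts
    q          : error quantile (the paper uses q = 0.7)
    """
    err = np.abs(y - y_hat)                      # point-wise distance L(y, y_hat)
    thresh = np.quantile(err, q)                 # q-th error quantile
    E = gate_probs.shape[-1]
    one_hot = np.eye(E)[gate_probs.argmax(-1)]   # (N, T, E); 1 at the selected expert
    bad = (err > thresh)[..., None]              # worst-routed points
    # correct routing: label 1 for the selected expert, 0 for the rest;
    # worst routing:   label 0 for the selected expert, 1/(E-1) for the rest
    return np.where(bad, (1.0 - one_hot) / (E - 1), one_hot)
```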

A.2 Best-Route Selection Loss

For the best-route selection loss function, we define node-wise pseudo labels by inverting each condition of the pseudo labeling for worst-route avoidance. For worst-route avoidance, we assume that the routing is incorrect (i.e., the worst) if $L(y_{n,t},\hat{y}_{n,t})$ is greater than the $q$-th quantile. In best-route selection, we define the routing as correct (i.e., the best) if $L(y_n,\hat{y}_n)$ is smaller than the $(1-q)$-th quantile; otherwise, it is incorrectly routed, as shown below:

le={1if L(y,y^) is smaller than 1q-th quantile and pe=argmax(𝐩)1/(E1)if L(y,y^) is greater than 1q-th quantile and peargmax(𝐩)0otherwisel_{e}=\begin{cases}1&\text{if $L(y,\hat{y})$ is smaller than $1-q$-th quantile% and $p_{e}=argmax(\mathbf{p})$}\\ 1/(E-1)&\text{if $L(y,\hat{y})$ is greater than $1-q$-th quantile and $p_{e}% \neq argmax(\mathbf{p})$}\\ 0&otherwise\end{cases}italic_l start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT = { start_ROW start_CELL 1 end_CELL start_CELL if italic_L ( italic_y , over^ start_ARG italic_y end_ARG ) is smaller than 1 - italic_q -th quantile and italic_p start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT = italic_a italic_r italic_g italic_m italic_a italic_x ( bold_p ) end_CELL end_ROW start_ROW start_CELL 1 / ( italic_E - 1 ) end_CELL start_CELL if italic_L ( italic_y , over^ start_ARG italic_y end_ARG ) is greater than 1 - italic_q -th quantile and italic_p start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ≠ italic_a italic_r italic_g italic_m italic_a italic_x ( bold_p ) end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_o italic_t italic_h italic_e italic_r italic_w italic_i italic_s italic_e end_CELL end_ROW
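Continuing the sketch above, the node-wise variant changes only the aggregation axis and the quantile threshold; the mean-over-time aggregation of errors and gate probabilities is our assumption:

```python
import numpy as np

def best_route_labels(y, y_hat, gate_probs, q=0.7):
    """Node-wise pseudo labels for best-route selection (same shapes as above)."""
    err = np.abs(y - y_hat).mean(axis=1)          # (N,) node-wise distance L(y_n, y_hat_n)
    thresh = np.quantile(err, 1.0 - q)            # (1-q)-th quantile
    E = gate_probs.shape[-1]
    node_probs = gate_probs.mean(axis=1)          # (N, E) node-level routing probabilities
    one_hot = np.eye(E)[node_probs.argmax(-1)]    # 1 at each node's selected expert
    bad = (err > thresh)[:, None]                 # incorrectly routed nodes
    return np.where(bad, (1.0 - one_hot) / (E - 1), one_hot)
```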
Table 3: Computation time of the models with the METR-LA dataset
Training time/epoch Inference time # of params
STGCN 14.8 secs 16.70 secs 320k
DCRNN 122.22 secs 13.44 secs 372k
Graph-WaveNet 48.07 secs 3.69 secs 309k
GMAN 312.1 secs 33.7 secs 901k
MegaCRN 84.7 secs 11.76 secs 339k
TESTAM 150 secs 7.96 secs 224k

Appendix B Computational Cost Analysis

For the computational cost analysis, we use five models as baselines: 1) STGCN (Yu et al., 2018), the lightest model, which utilizes GCNs and CNNs to forecast the 1-step future traffic condition; 2) DCRNN (Li et al., 2018), a well-known traffic forecasting model with graph-convolutional recurrent units; 3) Graph-WaveNet (Wu et al., 2019), a model that forecasts values by parallel computation with GCNs and CNNs; 4) GMAN (Zheng et al., 2020), a spatio-temporal attention model for traffic forecasting; and 5) MegaCRN (Jiang et al., 2023), one of the state-of-the-art models using GCRNN and memory network concepts.

We have investigated other models for comparison but decided to exclude them after careful consideration. For example, we have excluded MTGNN and StemGNN since they are improved versions of Graph-WaveNet with similar computational costs. Similarly, AGCRN, CCRNN, and GTS are excluded from the baselines because they are variants of DCRNN with few changes in computational costs. PM-MemNet and MegaCRN utilize sequence-to-sequence modeling with shared memory units; however, PM-MemNet suffers a computational bottleneck from its stacked memory units, which require $L$ times larger computational costs than those of MegaCRN.

Even though TESTAM utilizes three individual experts for prediction, we emphasize that it has a smaller number of parameters than the other models due to its small number of layers per expert, which strongly affects the computational costs. Furthermore, TESTAM only uses the encoder architecture of the Transformer with a time-enhanced attention module that enables parallel computation, eliminating the computational bottleneck caused by the decoding process. As a result, in terms of computational costs, TESTAM is two times cheaper than the attention-based model (i.e., GMAN), with a training time similar to that of DCRNN. Furthermore, in the inference phase, TESTAM shows the second fastest computation with the smallest number of parameters.

Table 4: Case-specific experimental results on three real-world datasets. The numbers in bold indicate the best performance, and the underlined numbers indicate the second-best performance. (I) denotes isolated roads. (H) denotes hard-to-predict roads, including intersections and roads with high traffic fluctuations. (E) denotes non-recurring circumstances, such as holidays or accidents.
METR-LA (I) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 3.58 5.93 8.43% 3.90 6.74 9.59% 4.29 7.50 10.98%
GMAN (Zheng et al., 2020) 3.81 6.99 9.15% 4.03 7.48 9.97% 4.32 8.13 11.13%
MegaCRN (Jiang et al., 2023) 3.54 5.88 8.46% 3.88 6.69 9.74% 4.35 7.67 11.39%
TESTAM 3.52 5.89 8.37% 3.80 6.59 9.43% 4.13 7.31 10.72%
METR-LA (H) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 4.07 6.74 11.75% 4.73 8.03 14.38% 5.48 9.34 17.20%
GMAN (Zheng et al., 2020) 4.37 7.79 12.83% 4.86 8.75 14.77% 5.37 9.64 16.77%
MegaCRN (Jiang et al., 2023) 4.02 6.68 11.61% 4.73 8.13 14.46% 5.55 9.72 17.67%
TESTAM 3.96 6.62 11.42% 4.51 7.75 13.57% 5.19 9.03 16.04%
METR-LA (E) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 4.22 7.14 13.08% 5.12 8.84 16.81% 6.17 10.62 21.19%
GMAN (Zheng et al., 2020) 4.45 7.67 14.49% 5.16 9.13 17.53% 5.95 10.58 21.18%
MegaCRN (Jiang et al., 2023) 4.03 6.91 12.37% 4.96 8.75 16.10% 6.01 10.69 20.58%
TESTAM 4.11 7.09 12.54% 4.92 8.71 15.71% 5.89 10.46 19.69%
PEMS-BAY (I) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 2.28 4.06 4.89% 2.87 5.29 6.65% 3.48 6.47 8.86%
GMAN (Zheng et al., 2020) 2.51 5.12 5.65% 3.06 6.20 7.34% 3.55 7.10 8.90%
MegaCRN (Jiang et al., 2023) 2.28 4.10 5.06% 2.92 5.58 7.27% 3.49 6.76 9.22%
TESTAM 2.26 4.03 4.67% 2.86 5.24 6.45% 3.36 6.45 8.55%
PEMS-BAY (H) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 2.46 4.42 5.44% 3.14 5.89 7.75% 3.81 7.23 10.50%
GMAN (Zheng et al., 2020) 2.72 5.50 6.28% 3.34 6.76 8.42% 3.88 7.76 10.37%
MegaCRN (Jiang et al., 2023) 2.47 4.44 5.62% 3.19 6.17 8.34% 3.82 7.48 10.76%
TESTAM 2.46 4.52 5.48% 3.10 5.75 7.62% 3.69 7.16 9.96%
PEMS-BAY (E) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 2.64 4.73 6.22% 3.39 6.21 8.91% 4.20 7.78 12.69%
GMAN (Zheng et al., 2020) 2.82 5.11 6.99% 3.55 6.68 9.72% 4.19 7.83 12.23%
MegaCRN (Jiang et al., 2023) 2.61 4.65 6.25% 3.51 6.70 10.01% 4.24 8.14 13.11%
TESTAM 2.59 4.58 5.98% 3.39 6.15 8.82% 4.03 7.57 11.45%
EXPY-TKY (I) 10 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 8.28 12.56 28.65% 9.40 14.23 33.69% 10.24 15.35 36.99%
GMAN (Zheng et al., 2020) 8.14 12.45 28.81% 8.75 13.43 31.11% 9.26 14.20 32.62%
MegaCRN (Jiang et al., 2023) 8.06 12.30 27.94% 8.98 13.71 32.22% 9.64 14.63 35.07%
TESTAM 7.87 12.26 26.95% 8.60 13.44 29.69% 9.03 14.06 31.83%
EXPY-TKY (H) 10 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 8.45 12.79 30.97% 9.63 14.50 36.07% 10.52 15.64 39.75%
GMAN (Zheng et al., 2020) 8.30 12.63 31.08% 8.90 13.57 33.58% 9.44 14.33 35.19%
MegaCRN (Jiang et al., 2023) 8.24 12.49 30.47% 9.21 13.92 34.60% 9.90 14.87 38.00%
TESTAM 8.06 12.48 29.08% 8.81 13.67 31.88% 9.26 14.31 34.23%
EXPY-TKY (E) 10 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 11.18 16.23 65.97% 11.93 17.21 71.71% 12.32 17.69 74.75%
GMAN (Zheng et al., 2020) 11.02 16.06 64.93% 11.30 16.47 66.84% 11.49 16.74 67.10%
MegaCRN (Jiang et al., 2023) 11.04 16.32 62.13% 11.67 17.07 67.93% 11.94 17.37 71.43%
TESTAM 10.82 15.94 63.30% 11.28 16.01 65.51% 11.49 16.46 66.94%

Appendix C Detailed Experimental Results

Table 4 presents experimental results under various environment settings. In the table, we pose three scenarios: (I), (H), and (E). Each scenario represents conditions under which accurate forecasting is difficult. (I) is a set of isolated roads, chosen by considering spatial locations and quantitative analysis results on the adjacency matrix. (H) is a set of hard-to-predict roads, including intersections and roads with high traffic fluctuations. We have determined the roads for (H) by visually exploring the roads (for intersections) and by selecting the roads with the top 10% entropy-based time-series complexity. (E) contains roads and time intervals with sudden events, including accidents, traffic controls, or holidays (e.g., Christmas). In the experiments, we compare TESTAM with three baselines, each selected as a representative model of one spatial modeling method.

As shown in Table 4, TESTAM outperforms the other models in non-recurring situations (i.e., (E)) and on roads with unique spatial and topological features ((I) and (H)). In particular, TESTAM consistently proves its superiority on roads with spatially unique features, outperforming existing models by 4% to 7% in general. Among the three baselines, we observe that the attention-based modeling method encodes spatial information better in most hard-to-predict scenarios. However, there are cases where the attention-based model fails, such as PEMS-BAY (I) and PEMS-BAY (H). From the perspective of temporal modeling, the attention-based method shows better long-term forecasting performance, while CNN- and RNN-based methods are advantageous in short-term forecasting. TESTAM, in contrast, outperforms all the baselines in both short-term and long-term forecasting. The results indicate that temporal information embedding and time-enhanced attention help the model effectively transfer information from the input domain to the output domain.

C.1 Qualitative Evaluation

We perform a qualitative evaluation of TESTAM by visualizing the impact of our context-aware spatial modeling in four types of cases: 1) hard-to-predict roads with recurring patterns; 2) isolated roads (I); 3) roads with unique traffic patterns; and 4) roads with non-recurring patterns, for evaluating the event awareness of TESTAM. We use the EXPY-TKY dataset, which contains complex urban road networks with various traffic patterns.

Figure 2: Visualization of a recurring pattern in a hard-to-predict road, Road 1349, from Dec 14th to Dec 17th. Road 1349 is a highway entrance located near Tokyo station. The locations of the roads are indicated in Fig. 7.
Recurring Patterns on Hard-to-Predict Roads

With the hard-to-predict roads, we observe that previous models often fail to effectively encode the spatial and temporal correlations of the roads, as Fig. 2 shows. For example, Road 1349 is a highway entrance located near Tokyo station that has one of the largest traffic volumes in Tokyo and accordingly shows severe fluctuations in the data. Because of the fluctuations and complex spatio-temporal dependencies of the roads, prior models show their limitations in spatial modeling. In particular, Graph-WaveNet (the green line in Fig. 2), which relies on a learnable static graph, fails to catch either the speed drop or the speed rise in time in the red box of Fig. 2. GMAN and MegaCRN (the red and violet lines) properly model the end of rush hour but fail to predict its start. Furthermore, MegaCRN exhibits noise-sensitive behavior on Dec. 14th (top-left) and 17th (bottom-right) in Fig. 2.

Figure 3: Qualitative forecasting result analysis for spatially isolated roads (I). The locations of the roads are indicated in Fig. 6.
Spatially Isolated Roads (I)

When forecasting spatially isolated roads, the model should focus on the road itself instead of referring to the other roads, which are less informative for prediction. However, since existing models give little consideration to the importance of self-referencing, they fail to properly model rapid speed changes (e.g., noon in Fig. 3) or become confused by information from the other roads (MegaCRN at 15:00 on Dec 3rd in Fig. 3), resulting in poor forecasting. In contrast, TESTAM accurately forecasts the rapid speed changes that occur at 3:00 and 12:00, as it enhances temporal modeling with the identity expert, temporal information embedding, and a time-enhanced attention layer.

Figure 4: Qualitative forecasting result analysis for Road 1111, a highway ramp located in Shibuya, with unique traffic patterns. The locations of the roads are indicated in Fig. 7.
Roads with Unique Traffic Patterns: The Case of a Highway Ramp

In EXPY-TKY, many roads show unique patterns due to complex urban road networks and various traffic behaviors (e.g., commuting and traveling). Road 1111 is a highway ramp located in Shibuya with unique patterns. One such pattern is that the road tends to stay below 30 km/h all day, except around 3:00 (the red box in Fig. 4). GMAN and Graph-WaveNet cannot handle such unique patterns properly, failing to model the speed increase at 3:00. MegaCRN, on the other hand, predicts a high-speed situation owing to the pattern-awareness of its memory units, but it still fails to forecast it in time. In contrast, TESTAM forecasts the traffic changes in a timely manner, revealing its superiority in modeling the unique behaviors of roads, as shown in Fig. 4.

Figure 5: Qualitative analysis results for Road 1196 (a metropolitan expressway) on Dec. 14th, when traffic control may have occurred because of heavy snow (red boxes).
Event-Aware Forecasting Case

We qualitatively evaluate and show the importance of context-aware spatial modeling in improving forecasting performance under various traffic conditions, such as sudden traffic control. Fig. 5 visualizes recurring and non-recurring traffic conditions caused by traffic control for heavy snow. Because of the unexpected traffic controls, there are sudden speed drops from morning to noon. In such non-recurring traffic conditions, TESTAM shows better forecasting results due to its context-awareness. GMAN and MegaCRN partially capture the sudden changes but cannot make timely predictions.

Appendix D Detailed Selection Procedures and Locations of the Roads for Case Study

In this section, we describe how we selected the roads and time intervals for each scenario: (I), (H), and (E). In the cases of (I) and (H), we extracted the roads regardless of time.

Figure 6: Location visualization for spatially isolated roads. The red circles are selected roads and the blue circles are unselected ones. (Left) the 262 roads before filtering, (middle) newly selected roads from visual investigation, and (right) the finalized list of 162 roads. The pink circle is the location of Roads 1165 and 1166.
Spatially Isolated Roads (I)

We have selected spatially isolated roads with two procedures: 1) investigating network topology and 2) filtering the candidates by visually investigating their locations. From the investigation of network topology, we found a total of 262 roads without any connection. However, the random sampling process used to build the traffic dataset makes the network topology sparse, so it cannot fully represent real-world connectivity. Therefore, instead of directly utilizing all 262 roads, we refine the list by visual investigation, filtering out 150 roads and adding 50 roads. Finally, we have a total of 162 roads for (I), as shown in Fig. 6.

Figure 7: Location visualization for hard-to-predict roads (left) and the specific locations of Road 1111 (middle) and Road 1349 (right). The green arrows indicate the main traffic flows and directions for each road.
Hard-to-Predict Roads (H)

For the hard-to-predict roads (H), we conduct two selection processes: visual exploration and entropy-based time-series complexity. Inspired by Ryan et al. (2019), we use entropy-based time-series complexity, which measures the noise level and the unpredictability of changes in a series of points by estimating the probability that similar patterns will be repeated. From the whole set of 1843 roads, we extract the 184 roads with the top 10% largest entropy, which are the most unpredictable roads. Furthermore, we additionally insert 34 hard-to-predict roads, such as intersections, ramps, and highway entrances, found by visually investigating the roads. We finalize our list as shown in Fig. 7 (left).
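For illustration, below is a minimal sketch of sample entropy, one standard entropy-based complexity measure of this kind; the paper follows the measure of Ryan et al. (2019), and the parameter choices here are common defaults rather than the paper's settings:

```python
import numpy as np

def sample_entropy(x, m=2, r=None):
    """Sample entropy of a 1-D speed series: low values mean repeated,
    predictable patterns; high values mean noisy, hard-to-predict behavior."""
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * x.std()  # common tolerance choice (an assumption)

    def matching_pairs(length):
        # embed the series into overlapping windows of the given length
        w = np.lib.stride_tricks.sliding_window_view(x, length)
        # Chebyshev distance between every pair of windows
        d = np.abs(w[:, None, :] - w[None, :, :]).max(axis=-1)
        return ((d <= r).sum() - len(w)) / 2  # matching pairs, excluding self-matches

    b, a = matching_pairs(m), matching_pairs(m + 1)
    return -np.log(a / b) if a > 0 and b > 0 else np.inf
```

Roads whose series fall in the top 10% of such an entropy score would then be flagged as hard-to-predict.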

Roads and Time with Sudden Events (E)

The circumstances in (E) are especially important, since they are location- and time-specific events, and finding such samples requires tremendous effort. In this paper, we find the events with two strategies: 1) finding specific time intervals (e.g., Christmas) for the hard-to-predict roads (Fig. 7, left) and 2) finding traffic controls and constructions, which are easier to find than accidents and have credible sources, including official announcements from the Metropolitan Expressway Co., Ltd. (https://www.shutoko.co.jp/). As a result, for (E), we choose data at holiday intervals on hard-to-predict roads, as well as roads and intervals under construction or sudden traffic control.