License: CC BY 4.0
arXiv:2403.02600v1 [cs.LG] 05 Mar 2024

TESTAM: A Time-Enhanced Spatio-Temporal Attention Model with Mixture of Experts

Hyunwook Lee & Sungahn Ko
Ulsan National Institute of Science and Technology
{gusdnr0916, sako}@unist.ac.kr
Corresponding author
Abstract

Accurate traffic forecasting is challenging due to the complex interdependencies of large road networks and abrupt speed changes caused by unexpected events. Recent work has focused on spatial modeling with adaptive graph embedding or graph attention but has paid less attention to the temporal characteristics and effectiveness of in-situ modeling. In this paper, we propose the time-enhanced spatio-temporal attention model (TESTAM) to better capture recurring and non-recurring traffic patterns with a mixture-of-experts model comprising three experts: one for temporal modeling, one for spatio-temporal modeling with a static graph, and one for spatio-temporal dependency modeling with a dynamic graph. By introducing different experts and properly routing them, TESTAM better captures traffic patterns under various circumstances, including cases of spatially isolated roads, highly interconnected roads, and recurring and non-recurring events. For proper routing, we reformulate the gating problem as a classification task with pseudo labels. Experimental results on three public traffic network datasets, METR-LA, PEMS-BAY, and EXPY-TKY, demonstrate that TESTAM outperforms 13 existing methods in terms of accuracy due to its better modeling of recurring and non-recurring traffic patterns. The official code is available at https://github.com/HyunWookL/TESTAM

1 Introduction

Spatio-temporal modeling in non-Euclidean space has received considerable attention since it can be widely applied to many real-world problems, such as social networks and human pose estimation. Traffic forecasting is a representative real-world problem, which is particularly challenging due to the difficulty of identifying innate spatio-temporal dependencies between roads. Moreover, such dependencies are often influenced by numerous factors, such as weather, accidents, and holidays (Park et al., 2020; Lee et al., 2020; 2022).

To overcome the challenges related to spatio-temporal modeling, many deep learning models have been proposed, including graph convolutional networks (GCNs), recurrent neural networks (RNNs), and Transformers. Li et al. (2018) have introduced DCRNN, which injects graph convolution into recurrent units, while Yu et al. (2018) have combined graph convolution and convolutional neural networks (CNNs) to model spatial and temporal features, outperforming traditional methods such as ARIMA. Although effective, GCN-based methods require prior knowledge of the topological characteristics of spatial dependencies. In addition, as the pre-defined graph relies heavily on the Euclidean distance and empirical laws (Tobler’s first law of geography), ignoring dynamic changes in traffic (e.g., rush hour and accidents), it is hardly an optimal solution (Jiang et al., 2023). Graph-WaveNet, proposed by Wu et al. (2019), is the first model to address this limitation by using node embedding, building a learnable adjacency matrix for spatial modeling. Motivated by the success of Graph-WaveNet and DCRNN, a line of research has focused on learnable graph structures, such as AGCRN (Bai et al., 2020) and MTGNN (Wu et al., 2020).

Although spatial modeling with learnable static graphs has drastically improved traffic forecasting, researchers have found that it can be further improved by learning network dynamics over time, i.e., a time-varying graph structure. SLCNN (Zhang et al., 2020) and StemGNN (Cao et al., 2020) attempt to learn time-varying graph structures by projecting observational data. Zheng et al. (2020) have adopted multi-head attention for improved dynamic spatial modeling with no spatial restrictions, while Park et al. (2020) have developed ST-GRAT, a modified Transformer for traffic forecasting that utilizes graph attention networks (GAT). However, time-varying graph modeling is noise-sensitive. Attention-based models can be relatively less noise-sensitive, but a recent study reports that they often fail to generate an informative attention map, spreading attention weights over all roads (Jin et al., 2023). MegaCRN (Jiang et al., 2023) utilizes memory networks for graph learning, reducing noise sensitivity and injecting temporal information simultaneously. Although effective, the aforementioned methods each focus on one specific spatial modeling method, paying less attention to the use of multiple spatial modeling methods for in-situ forecasting.

Different spatial modeling methods have certain advantages under different circumstances. For instance, learnable static graph modeling outperforms dynamic graphs in recurring traffic situations (Wu et al., 2020; Jiang et al., 2023). On the other hand, dynamic spatial modeling is advantageous for non-recurring traffic, such as incidents or abrupt speed changes (Park et al., 2020; Zheng et al., 2020). Park et al. (2020) have revealed that preserving the road information itself improves forecasting performance, implying the need for temporal-only modeling. Jin et al. (2023) have shown that a static graph built on temporal similarity can lead to performance improvements when combined with a dynamic graph modeling method. Although many studies have discussed the importance of effective spatial modeling for traffic forecasting, few have focused on the dynamic use of spatial modeling methods in traffic forecasting (i.e., in-situ traffic forecasting).

In this paper, we propose the time-enhanced spatio-temporal attention model (TESTAM), a novel Mixture-of-Experts (MoE) model that enables in-situ traffic forecasting. TESTAM consists of three experts, each with a different spatial modeling method: 1) no spatial modeling, 2) a learnable static graph, and 3) dynamic graph modeling, plus one gating network. Each expert consists of transformer-based blocks with its own spatial modeling method. The gating network takes each expert’s last hidden state and the input traffic conditions, generating candidate routes for in-situ traffic forecasting. To train the gating network effectively, we solve the routing problem as a classification problem with two loss functions, designed to avoid the worst route and to lead to the best route. The contributions of this work can be summarized as follows:

  • We propose a novel Mixture-of-Experts model called TESTAM for traffic forecasting with diverse graph architectures, improving accuracy under different traffic conditions, including recurring and non-recurring situations.

  • We reformulate the gating problem as a classification problem so that the model better contextualizes traffic situations and chooses spatial modeling methods (i.e., experts) during training.

  • Experimental results against state-of-the-art models on three real-world datasets indicate that TESTAM outperforms existing methods quantitatively and qualitatively.

2 Related Work

2.1 Traffic Forecasting

Deep learning models have achieved huge success by effectively capturing spatio-temporal features in traffic forecasting tasks. Previous studies have shown that RNN-based models outperform conventional temporal modeling approaches, such as ARIMA and support vector regression (Vlahogianni et al., 2014; Li & Shahabi, 2018). More recently, substantial research has demonstrated that attention-based models (Zheng et al., 2020; Park et al., 2020) and CNNs (Yu et al., 2018; Wu et al., 2019; 2020) perform better than RNN-based models in long-term prediction tasks. For spatial modeling, Zhang et al. (2016) have proposed a CNN-based spatial modeling method for Euclidean space. Another line of modeling methods, using graph structures to manage complex road networks (e.g., GCNs), has also become popular. However, using GCNs requires building an adjacency matrix, and GCNs depend heavily on the pre-defined graph structure.

To overcome these difficulties, several approaches, such as graph attention models, have been proposed for dynamic edge importance weighting (Park et al., 2020). Graph-WaveNet (Wu et al., 2019) uses a learnable static adjacency matrix to capture hidden spatial dependencies in training. SLCNN (Zhang et al., 2020) and StemGNN (Cao et al., 2020) try to learn a time-varying graph by projecting current traffic conditions. MegaCRN (Jiang et al., 2023) uses memory-based graph learning to construct a noise-robust graph. Despite their effectiveness, forecasting models still suffer from inaccurate predictions due to abruptly changing speeds, instability, and changes in spatial dependency. To address these challenges, we design TESTAM to change its spatial modeling methods based on the traffic context using the Mixture-of-Experts technique.

2.2 Mixture of Experts

Mixture-of-Experts (MoE) is a machine learning technique devised by Shazeer et al. (2017) that has been actively researched as a powerful method for increasing model capacity without additional computational costs. MoEs have been used in various machine learning tasks, such as computer vision (Dryden & Hoefler, 2022) and natural language processing (Zhou et al., 2022; Fedus et al., 2022). Recently, MoEs have gone beyond increasing model capacity and are used to “specialize” each expert in subtasks at specific levels, such as the sample (Eigen et al., 2014; McGill & Perona, 2017; Rosenbaum et al., 2018), token (Shazeer et al., 2017; Fedus et al., 2022), and patch levels (Riquelme et al., 2021). The coarse-grained routing of these MoEs is frequently trained with multiple auxiliary losses focused on load balancing (Fedus et al., 2022; Dryden & Hoefler, 2022), but this often causes the experts to lose their opportunity to specialize. Furthermore, such MoEs assign identical structures to every expert, eventually leading to architectural limitations, such as sharing the same inductive bias, which hardly changes. Dryden & Hoefler (2022) have proposed Spatial Mixture-of-Experts (SMoEs), which induces inductive bias via fine-grained, location-dependent routing for regression problems. SMoEs utilize one routing classification loss based on the final output losses, penalize gating networks with output error signals, and reduce the change caused by inaccurate routing for better routing and expert specialization. However, SMoEs only attempt to avoid incorrect routing and pay less attention to the best routing. TESTAM differs from existing MoEs in two main ways: it utilizes experts with different spatial modeling methods for better generalization, and it can be optimized with two loss functions, one for avoiding the worst route and another for choosing the best route for better specialization.

3 Methods

3.1 Preliminaries

Problem Definition

Let us define a road network as $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{A})$, where $\mathcal{V}$ is the set of all roads in the road network with $|\mathcal{V}|=N$, $\mathcal{E}$ is the set of edges representing the connectivity between roads, and $\mathcal{A}\in\mathbb{R}^{N\times N}$ is a matrix representing the topology of $\mathcal{G}$. Given the road network, we formulate our problem as a special version of multivariate time series forecasting that predicts the future $T$ graph signals based on $T'$ historical input graph signals:

$$\big[X_{\mathcal{G}}^{(t-T'+1)},\dots,X_{\mathcal{G}}^{(t)}\big]\xrightarrow{f(\cdot)}\big[X_{\mathcal{G}}^{(t+1)},\dots,X_{\mathcal{G}}^{(t+T)}\big],$$

where $X_{\mathcal{G}}^{(i)}\in\mathbb{R}^{N\times C}$ and $C$ is the number of input features. We aim to train the mapping function $f(\cdot):\mathbb{R}^{T'\times N\times C}\rightarrow\mathbb{R}^{T\times N\times C}$, which predicts the next $T$ steps based on the given $T'$ observations. For the sake of simplicity, we omit $\mathcal{G}$ from $X_{\mathcal{G}}$ hereinafter.

Spatial Modeling Methods in Traffic Forecasting

To effectively forecast traffic signals, we first discuss spatial modeling, one of the necessities for traffic data modeling. In traffic forecasting, we can classify spatial modeling methods into four categories: 1) with an identity matrix (i.e., multivariate time-series forecasting), 2) with a pre-defined adjacency matrix, 3) with a trainable adjacency matrix, and 4) with attention (i.e., dynamic spatial modeling without prior knowledge). Conventionally, a graph topology $\mathcal{A}$ is constructed via an empirical law, including inverse distance (Li et al., 2018; Yu et al., 2018) and cosine similarity (Geng et al., 2019). However, these empirically built graph structures are not necessarily optimal, often resulting in poor spatial modeling quality. To address this challenge, a line of research (Wu et al., 2019; Bai et al., 2020; Jiang et al., 2023) has been proposed to capture the hidden spatial information. Specifically, a trainable function $g(\cdot,\theta)$ is used to derive the optimal topological representation $\tilde{\mathcal{A}}$ as:

$$\tilde{\mathcal{A}}=\mathrm{softmax}\big(\mathsf{relu}\big(g(X^{(t)},\theta)\,g(X^{(t)},\theta)^{\top}\big)\big),\qquad(1)$$

where $g(X^{(t)},\theta)\in\mathbb{R}^{N\times e}$ and $e$ is the embedding size. Spatial modeling based on Eq. 1 can be classified into two subcategories according to whether $g(\cdot,\theta)$ depends on $X^{(t)}$. Wu et al. (2019) define $g(\cdot,\theta)=E\in\mathbb{R}^{N\times e}$, which is time-independent and less noise-sensitive, but less suited to in-situ modeling. Cao et al. (2020) and Zhang et al. (2020) propose time-varying graph structure modeling with $g(H^{(t)},\theta)=H^{(t)}W$, where $W\in\mathbb{R}^{d\times e}$, projecting hidden states onto another embedding space. Ideally, this method models dynamic changes in graph topology, but it is noise-sensitive.
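
For illustration, here is a hedged PyTorch sketch of both subcategories of Eq. 1: a time-independent node embedding $E$ (as in Graph-WaveNet) and an input-conditioned projection $H^{(t)}W$. Class and variable names are ours, not from any official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticAdaptiveAdjacency(nn.Module):
    """Eq. 1 with g(., theta) = E: a learnable, time-independent node embedding."""
    def __init__(self, num_nodes: int, emb_dim: int):
        super().__init__()
        self.E = nn.Parameter(torch.randn(num_nodes, emb_dim))

    def forward(self) -> torch.Tensor:
        # A_tilde = softmax(relu(E E^T)): row-normalized learned adjacency
        return F.softmax(F.relu(self.E @ self.E.t()), dim=-1)

class TimeVaryingAdjacency(nn.Module):
    """Eq. 1 with g(H^(t), theta) = H^(t) W: conditioned on current hidden states."""
    def __init__(self, hidden_dim: int, emb_dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(hidden_dim, emb_dim))

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: (num_nodes, hidden_dim) hidden states at time step t
        e = h_t @ self.W                               # (num_nodes, emb_dim)
        return F.softmax(F.relu(e @ e.t()), dim=-1)    # time-varying adjacency
```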

To reduce noise sensitivity while obtaining a time-varying graph structure, Zheng et al. (2020) adopt a spatial attention mechanism for traffic forecasting. Given the input $H_i$ of node $i$ and its spatial neighbors $\mathcal{N}_i$, they compute spatial attention using multi-head attention as follows:

$$H^{*}_{i}=\mathsf{Concat}(o_{i}^{(1)},\dots,o_{i}^{(K)})W^{O};\qquad o^{(k)}_{i}=\sum_{s\in\mathcal{N}_{i}}\alpha_{i,s}\cdot f_{v}^{(k)}(H_{s})\qquad(2)$$
$$\alpha_{i,j}=\frac{\exp(e_{i,j})}{\sum_{s\in\mathcal{N}_{i}}\exp(e_{i,s})};\qquad e_{i,j}=\frac{\big(f_{q}^{(k)}(H_{i})\big)\big(f_{k}^{(k)}(H_{j})\big)^{\top}}{\sqrt{d_{k}}},\qquad(3)$$

where $W^{O}$ is a projection layer, $d_{k}$ is the dimension of the key vector, and $f^{(k)}_{q}(\cdot)$, $f^{(k)}_{k}(\cdot)$, and $f^{(k)}_{v}(\cdot)$ are the query, key, and value projections of the $k$-th head, respectively. Although effective, these attention-based approaches still suffer from irregular spatial modeling, such as less accurate self-attention (i.e., from node $i$ to $i$) (Park et al., 2020) and uniformly distributed, uninformative attention regardless of spatial relationships (Jin et al., 2023).
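
A minimal PyTorch sketch of Eqs. 2-3, with $\mathcal{N}_i=\mathcal{V}$ (no spatial restriction), as TESTAM uses later; module and tensor names are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Multi-head spatial attention over nodes (Eqs. 2-3); a sketch, not official code."""
    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        assert hidden_dim % num_heads == 0
        self.num_heads, self.d_k = num_heads, hidden_dim // num_heads
        self.f_q = nn.Linear(hidden_dim, hidden_dim)
        self.f_k = nn.Linear(hidden_dim, hidden_dim)
        self.f_v = nn.Linear(hidden_dim, hidden_dim)
        self.W_o = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, hidden_dim); N_i = V, i.e., every node attends to every node
        n = h.size(0)
        q = self.f_q(h).view(n, self.num_heads, self.d_k).transpose(0, 1)  # (K, N, d_k)
        k = self.f_k(h).view(n, self.num_heads, self.d_k).transpose(0, 1)
        v = self.f_v(h).view(n, self.num_heads, self.d_k).transpose(0, 1)
        e = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (K, N, N) scores e_ij (Eq. 3)
        alpha = F.softmax(e, dim=-1)                    # attention over all nodes
        o = (alpha @ v).transpose(0, 1).reshape(n, -1)  # concat heads (Eq. 2)
        return self.W_o(o)
```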

3.2 Model Architecture

Figure 1: Overview of TESTAM. Left: the architecture of each expert. Middle: the workflow and routing mechanism of TESTAM; solid lines indicate forward paths, and dashed lines represent backward paths. Right: the three spatial modeling methods of TESTAM; black lines indicate spatial connectivity, and red lines represent the information flow corresponding to spatial connectivity. The identity, adaptive, and attention experts are responsible for temporal modeling, spatial modeling with a learnable static graph, and spatial modeling with a dynamic graph (i.e., attention), respectively.

Although transformers are well-established structures for time-series forecasting, they have several problems when used for spatio-temporal modeling: they do not consider spatial modeling, consume considerable memory, and have bottleneck problems caused by the autoregressive decoding process. Park et al. (2020) have introduced an improved transformer model with graph attention (GAT), but the model still has autoregressive properties. To eliminate the autoregressive characteristics while preserving the advantages of the encoder–decoder architecture, TESTAM transfers the attention domain through time-enhanced attention and temporal information embedding. As shown in Fig. 1 (left), in addition to temporal information embedding, each expert layer consists of four sublayers: temporal attention, spatial modeling, time-enhanced attention, and a point-wise feed-forward neural network. Each sublayer is connected to a bypass through skip connections. To improve generalization, we apply layer normalization after each sublayer. All experts have the same hidden size and number of layers and differ only in their spatial modeling methods.
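
The sublayer wiring can be sketched as follows; this is a structural sketch under our assumptions (sublayers are abstracted as shape-preserving modules), not the official implementation.

```python
import torch
import torch.nn as nn

class ExpertLayer(nn.Module):
    """Structural sketch of one TESTAM expert layer: four sublayers (temporal
    attention, expert-specific spatial modeling, time-enhanced attention,
    point-wise feed-forward), each followed by a skip connection and layer
    normalization. Sublayer internals are abstracted as passed-in modules."""
    def __init__(self, d: int, temporal_attn: nn.Module, spatial: nn.Module,
                 time_enhanced_attn: nn.Module, ff_dim: int = 128):
        super().__init__()
        ffn = nn.Sequential(nn.Linear(d, ff_dim), nn.ReLU(), nn.Linear(ff_dim, d))
        self.sublayers = nn.ModuleList([temporal_attn, spatial, time_enhanced_attn, ffn])
        self.norms = nn.ModuleList([nn.LayerNorm(d) for _ in range(4)])

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for sublayer, norm in zip(self.sublayers, self.norms):
            h = norm(h + sublayer(h))  # bypass via skip connection, then post-norm
        return h
```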

Temporal Information Embedding

Since temporal features (e.g., time of day) act as a global position with a specific periodicity, we omit the position embedding of the original transformer architecture. Furthermore, instead of normalized temporal features, we utilize Time2Vec embedding (Kazemi et al., 2019) for periodicity and linearity modeling. Specifically, for a temporal feature $\tau\in\mathbb{N}$, we represent $\tau$ with an $h$-dimensional embedding vector $v(\tau)$ and learnable parameters $w_i,\phi_i$ for each embedding dimension $i$ as below:

$$TIM(\tau)[i]=\begin{cases}w_{i}v(\tau)[i]+\phi_{i},&\text{if }i=0\\\mathcal{F}(w_{i}v(\tau)[i]+\phi_{i}),&\text{if }1\leq i\leq h-1,\end{cases}\qquad(4)$$

where $\mathcal{F}$ is a periodic activation function. Using Time2Vec embedding, we enable the model to utilize the temporal information of labels. Here, the temporal information embedding of an input sequence is concatenated with the other input features and then projected onto the hidden size $h$.
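
A minimal sketch of Eq. 4, assuming $v(\tau)[i]=\tau$ (the scalar time index, as in the original Time2Vec) and sine as the periodic activation $\mathcal{F}$; both choices are our assumptions.

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Time2Vec embedding (Eq. 4): one linear component (i = 0) and h-1 periodic
    components. We assume v(tau)[i] = tau and F = sin, common choices in
    Kazemi et al. (2019); TESTAM's exact choices may differ."""
    def __init__(self, h: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(h))
        self.phi = nn.Parameter(torch.randn(h))

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: (...,) temporal feature, e.g., a time-of-day index
        x = tau.unsqueeze(-1).float() * self.w + self.phi  # (..., h)
        return torch.cat([x[..., :1], torch.sin(x[..., 1:])], dim=-1)
```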

Temporal Attention

As temporal attention in TESTAM is the same as that of transformers, we describe the benefits of temporal attention. Recent studies (Li et al., 2018; Bai et al., 2020) have shown that attention is an appealing solution for temporal modeling because, unlike recurrent unit-based or convolution-based temporal modeling, it can be used to directly attend to features across time steps with no restrictions. Temporal attention allows parallel computation and is beneficial for long-term sequence modeling. Moreover, it has less inductive bias in terms of locality and sequentiality. Although strong inductive bias can help the training, less inductive bias enables better generalization. Furthermore, for the traffic forecasting problem, causality among roads is an unavoidable factor (Jin et al., 2023) that cannot be easily modeled in the presence of strong inductive bias, such as sequentiality or locality.

Spatial Modeling Layer

In this work, we leverage three spatial modeling layers, one for each expert, as shown in the middle of Fig. 1: spatial modeling with an identity matrix (i.e., no spatial modeling), spatial modeling with a learnable adjacency matrix (Eq. 1), and spatial modeling with attention (Eqs. 2 and 3). For the attention expert, we compute attention with $\forall i\in\mathcal{V},\ \mathcal{N}_{i}=\mathcal{V}$, i.e., attention with no spatial restrictions. This setting enables similarity-based attention, resulting in better generalization.

Inspired by the success of memory-augmented graph structure learning (Jiang et al., 2023; Lee et al., 2022), we propose a modified meta-graph learner that learns prototypes for both spatial graph modeling and gating networks. Our meta-graph learner consists of two individual neural networks built around a meta-node bank $\mathbf{M}\in\mathbb{R}^{m\times e}$, where $m$ and $e$ denote the number of memory items and the dimension of each memory, respectively: a hyper-network (Ha et al., 2017) for generating node embeddings conditioned on $\mathbf{M}$, and gating networks that calculate the similarities between experts’ hidden states and queried memory items. In this section, we mainly focus on the hyper-network. We construct a graph structure with the meta-node bank $\mathbf{M}$ and a projection $W_{E}\in\mathbb{R}^{e\times d}$ as follows:

$$E=\mathbf{M}W_{E};\qquad\tilde{A}=\mathrm{softmax}(\mathsf{relu}(EE^{\top}))$$

By constructing a memory-augmented graph, the model achieves better context-aware spatial modeling than what is achievable with other learnable static graphs (e.g., graph modeling with $E\in\mathbb{R}^{N\times d}$). Detailed explanations of end-to-end training and meta-node bank queries are provided in Sec. 3.3.
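
The following sketch shows how such a memory-conditioned graph could be constructed. The mapping from the $m$ memory items to $N$ node embeddings via the hyper-network is not fully specified in this section, so the node-to-memory attention below is a hypothetical stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaGraphLearner(nn.Module):
    """Memory-augmented graph construction sketch: node embeddings are generated
    from a meta-node bank M, projected by W_E, and the adjacency is
    softmax(relu(E E^T)). The node-to-memory attention is our assumption."""
    def __init__(self, num_nodes: int, m: int, e: int, d: int):
        super().__init__()
        self.M = nn.Parameter(torch.randn(m, e))                    # meta-node bank
        self.node_query = nn.Parameter(torch.randn(num_nodes, e))   # hypothetical
        self.W_E = nn.Parameter(torch.randn(e, d))

    def forward(self) -> torch.Tensor:
        # condition node embeddings on the memory bank (hypothetical hyper-network)
        attn = F.softmax(self.node_query @ self.M.t(), dim=-1)      # (N, m)
        E = (attn @ self.M) @ self.W_E                              # (N, d)
        return F.softmax(F.relu(E @ E.t()), dim=-1)                 # (N, N)
```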

Time-Enhanced Attention

To eliminate the error propagation effects caused by auto-regressive characteristics, we propose a time-enhanced attention layer that helps the model transfer its domain from the historical $T'$ time steps (i.e., the source domain) to the next $T$ time steps (i.e., the target domain). Let $\tau^{(t)}_{label}=[\tau^{(t+1)},\dots,\tau^{(t+T)}]$ be the temporal feature vector of the label. We calculate the attention score from the source time step $i$ to the target time step $j$ as:

$$\alpha_{i,j}=\frac{\exp(e_{i,j})}{\sum_{k=t+1}^{T}\exp(e_{i,k})},\qquad e_{i,j}=\frac{(H^{(i)}W_{q}^{(k)})(\mathrm{TIM}(\tau^{(j)})W_{k}^{(k)})^{\top}}{\sqrt{d_{k}}},\qquad(5)$$

where $d_{k}=d/K$, $K$ is the number of heads, and $W_{q}^{(k)},W_{k}^{(k)}$ are linear transformation matrices. We can calculate the attention output using the same process as in Eq. 2, except that time-enhanced attention attends to the time steps of each node, whereas Eq. 2 attends to the important nodes at each time step.
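
A single-head sketch of Eq. 5 for one node, for clarity. Applying Eq. 2's aggregation with value-projected label-time embeddings is our reading of the text; the multi-head extension and the mapping back to the $T$ output steps are omitted.

```python
import torch

def time_enhanced_attention(h_src, tim_label, W_q, W_k, W_v):
    """h_src: (T_in, d) hidden states of one node; tim_label: (T_out, d)
    Time2Vec embeddings TIM(tau_label) of the label time steps."""
    d_k = W_k.shape[1]
    e = (h_src @ W_q) @ (tim_label @ W_k).T / d_k ** 0.5  # e_{i,j}: source i -> target j
    alpha = torch.softmax(e, dim=-1)                      # normalize over target steps
    return alpha @ (tim_label @ W_v)                      # o_i = sum_j alpha_{i,j} f_v(TIM)

# hypothetical shapes: T_in = T_out = 12, d = 32
h = torch.randn(12, 32)
tim = torch.randn(12, 32)
Wq, Wk, Wv = (torch.randn(32, 32) for _ in range(3))
out = time_enhanced_attention(h, tim, Wq, Wk, Wv)         # (12, 32)
```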

3.3 Gating Networks

In this section, we describe the gating networks used for in-situ routing. Conventional MoE models have multiple experts with the same architecture and conduct coarse-grained routing, focusing on increasing model capacity without additional computational costs (Shazeer et al., 2017). However, coarse-grained routing provides experts with limited opportunities for specialization. Furthermore, in the case of the regression problem, existing MoEs hardly change their routing decisions after initialization because the gate is not guided by the gradients of regression tasks, as Dryden & Hoefler (2022) have revealed. Consequently, gating networks cause “mismatches,” resulting in uninformative and unchanging routing. Moreover, using the same architecture for all experts is less beneficial in terms of generalization since they also share the same inductive bias.

To resolve this issue, we propose novel memory-based gating networks and two classification losses with regression-error-based pseudo labels. Existing memory-based traffic forecasting approaches (Lee et al., 2022; Jiang et al., 2023) reconstruct the encoder’s hidden state with memory items, allowing the memory to store typical features of seen samples for pattern matching. In contrast, we aim to learn the direct relationship between input signals and output representations. For node $i$ at time step $t$, we define the memory-querying process as follows:

$$Q_{i}^{(t)}=X_{i}^{(t)}W_{q}+b_{q};\qquad a_{j}=\frac{\exp(Q_{i}^{(t)}M[j]^{\top})}{\sum_{j'=1}^{m}\exp(Q_{i}^{(t)}M[j']^{\top})};\qquad O_{i}^{(t)}=\sum_{j=1}^{m}a_{j}M[j],$$

where $M[j]$ is the $j$-th memory item, and $W_{q}$ and $b_{q}$ are learnable parameters for the input projection. Let $z_{e}$ be the output representation of expert $e$. Given the queried memory $O_{i}^{(t)}\in\mathbb{R}^{e}$, we calculate the routing probability $p_{e}$ as shown below:

$$r_{e}=g(z_{e},O_{i}^{(t)});\qquad p_{e}=\frac{r_{e}}{\sum_{e'\in[e_{1},\dots,e_{E}]}r_{e'}},$$
where $E$ is the number of experts. Since we use the similarity between output states and queried memory as the routing probability, solving the routing problem induces the memory to learn typical output representations and the input-output relationship. We select the top-1 expert’s output as the final output.
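
A hedged sketch of the gating path. The similarity function $g$ is not spelled out above, so we assume a dot product, and we normalize scores with a softmax for numerical safety instead of the plain sum normalization of $r_e$; both are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryGate(nn.Module):
    """Memory-based gating sketch: query the memory bank with the raw input and
    score each expert by the similarity between its output representation z_e
    and the queried memory O."""
    def __init__(self, in_dim: int, mem_dim: int, M: nn.Parameter):
        super().__init__()
        self.M = M                               # shared meta-node bank, (m, mem_dim)
        self.proj = nn.Linear(in_dim, mem_dim)   # Q = X W_q + b_q

    def forward(self, x: torch.Tensor, expert_outputs: torch.Tensor):
        # x: (B, in_dim) input signals; expert_outputs: (E, B, mem_dim) states z_e
        q = self.proj(x)                                    # (B, mem_dim)
        a = F.softmax(q @ self.M.t(), dim=-1)               # attention over memory items
        o = a @ self.M                                      # queried memory O, (B, mem_dim)
        r = torch.einsum("ebd,bd->be", expert_outputs, o)   # r_e = g(z_e, O), dot product
        p = F.softmax(r, dim=-1)                            # routing probability p_e
        return p, p.argmax(dim=-1)                          # top-1 expert per sample
```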

Routing Classification Losses

To enable fine-grained routing that fits the regression problem, we adopt two classification losses: one to avoid the worst routing and another to find the best routing. Inspired by SMoEs, we define the worst-route avoidance loss as the cross-entropy loss with pseudo labels $l_{e}$, as shown below:

$$L_{worst}(\mathbf{p})=-\frac{1}{E}\sum_{e}l_{e}\log(p_{e})\qquad(6)$$
$$l_{e}=\begin{cases}1&\text{if }L(y,\hat{y})\text{ is smaller than the }q\text{-th quantile and }p_{e}=\arg\max(\mathbf{p})\\1/(E-1)&\text{if }L(y,\hat{y})\text{ is greater than the }q\text{-th quantile and }p_{e}\neq\arg\max(\mathbf{p})\\0&\text{otherwise,}\end{cases}$$

where $\hat{y}$ is the output of the selected expert, and $q$ is an error quantile. If an expert is incorrectly selected, its label becomes zero and each unselected expert receives the pseudo label $1/(E-1)$, meaning that the unselected experts are equally likely to be chosen.

We also propose a best-route selection loss for more precise routing. However, as traffic data are noisy and contain many nonstationary characteristics, selecting the best route is not an easy task. Therefore, instead of choosing the best route for every time step and every node, we calculate node-wise routing. Our best-route selection loss is similar to Eq. 6, except that it calculates node-wise pseudo labels and routing probabilities, and the condition for the pseudo labels is changed from “$L(y,\hat{y})$ is greater/smaller than the $q$-th quantile” to “$L(y,\hat{y})$ is greater/smaller than the $(1-q)$-th quantile.” Detailed explanations are provided in Appendix A.
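
A sketch of the worst-route avoidance loss (Eq. 6) with error-quantile pseudo labels; computing the quantile over the batch and the exact reduction are our reading, not confirmed implementation details.

```python
import torch

def worst_route_avoidance_loss(p: torch.Tensor, sample_err: torch.Tensor, q: float = 0.7):
    """Eq. 6: cross-entropy against pseudo labels l_e built from regression error.
    p: (B, E) routing probabilities; sample_err: (B,) error L(y, y_hat) of the
    selected expert per sample."""
    B, E = p.shape
    thresh = torch.quantile(sample_err, q)      # q-th quantile of errors
    chosen = p.argmax(dim=-1)                   # currently selected expert
    labels = torch.zeros_like(p)
    good = sample_err < thresh                  # routing considered acceptable
    labels[good, chosen[good]] = 1.0            # reinforce the chosen expert
    bad = ~good                                 # among the worst routings
    labels[bad] = 1.0 / (E - 1)                 # equal mass on unselected experts
    labels[bad, chosen[bad]] = 0.0              # zero out the bad choice
    return -(labels * torch.log(p + 1e-9)).sum(dim=-1).mean() / E

# hypothetical usage: 8 samples routed over E = 3 experts
p = torch.softmax(torch.randn(8, 3), dim=-1)
loss = worst_route_avoidance_loss(p, torch.rand(8))
```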

Table 1: Experimental results on three real-world datasets with 13 baseline models and TESTAM. The values in bold indicate the best, and underlined values indicate the second-best performance.
METR-LA (MAE / RMSE / MAPE):

| Model | 15 min | 30 min | 60 min |
|---|---|---|---|
| HA (Li et al., 2018) | 4.16 / 7.80 / 13.00% | 4.16 / 7.80 / 13.00% | 4.16 / 7.80 / 13.00% |
| STGCN (Yu et al., 2018) | 2.88 / 5.74 / 7.62% | 3.47 / 7.24 / 9.57% | 4.59 / 9.40 / 12.70% |
| DCRNN (Li et al., 2018) | 2.77 / 5.38 / 7.30% | 3.15 / 6.45 / 8.80% | 3.60 / 7.59 / 10.50% |
| Graph-WaveNet (Wu et al., 2019) | 2.69 / 5.15 / 6.90% | 3.07 / 6.22 / 8.37% | 3.53 / 7.37 / 10.01% |
| STTN (Xu et al., 2020) | 2.79 / 5.48 / 7.19% | 3.16 / 6.50 / 8.53% | 3.60 / 7.60 / 10.16% |
| GMAN (Zheng et al., 2020) | 2.80 / 5.55 / 7.41% | 3.12 / 6.49 / 8.73% | 3.44 / 7.35 / 10.07% |
| MTGNN (Wu et al., 2020) | 2.69 / 5.18 / 6.86% | 3.05 / 6.17 / 8.19% | 3.49 / 7.23 / 9.87% |
| StemGNN (Cao et al., 2020) | 2.56 / 5.06 / 6.46% | 3.01 / 6.03 / 8.23% | 3.43 / 7.23 / 9.85% |
| AGCRN (Bai et al., 2020) | 2.86 / 5.55 / 7.55% | 3.25 / 6.57 / 8.99% | 3.68 / 7.56 / 10.46% |
| CCRNN (Ye et al., 2021) | 2.85 / 5.54 / 7.50% | 3.24 / 6.54 / 8.90% | 3.73 / 7.65 / 10.59% |
| GTS (Shang et al., 2021) | 2.65 / 5.20 / 6.80% | 3.05 / 6.22 / 8.28% | 3.47 / 7.29 / 9.83% |
| PM-MemNet (Lee et al., 2022) | 2.65 / 5.29 / 7.01% | 3.03 / 6.29 / 8.42% | 3.46 / 7.29 / 9.97% |
| MegaCRN (Jiang et al., 2023) | 2.52 / 4.94 / 6.44% | 2.93 / 6.06 / 7.96% | 3.38 / 7.23 / 9.72% |
| TESTAM | 2.54 / 4.93 / 6.42% | 2.96 / 6.04 / 7.92% | 3.36 / 7.09 / 9.67% |

PEMS-BAY (MAE / RMSE / MAPE):

| Model | 15 min | 30 min | 60 min |
|---|---|---|---|
| HA (Li et al., 2018) | 2.88 / 5.59 / 6.80% | 2.88 / 5.59 / 6.80% | 2.88 / 5.59 / 6.80% |
| STGCN (Yu et al., 2018) | 1.36 / 2.96 / 2.90% | 1.81 / 4.27 / 4.17% | 2.49 / 5.69 / 5.79% |
| DCRNN (Li et al., 2018) | 1.38 / 2.95 / 2.90% | 1.74 / 3.97 / 3.90% | 2.07 / 4.74 / 4.90% |
| Graph-WaveNet (Wu et al., 2019) | 1.30 / 2.74 / 2.73% | 1.63 / 3.70 / 3.67% | 1.95 / 4.52 / 4.63% |
| STTN (Xu et al., 2020) | 1.36 / 2.87 / 2.89% | 1.67 / 3.79 / 3.78% | 1.95 / 4.50 / 4.58% |
| GMAN (Zheng et al., 2020) | 1.35 / 2.90 / 2.87% | 1.65 / 3.82 / 3.74% | 1.92 / 4.49 / 4.52% |
| MTGNN (Wu et al., 2020) | 1.32 / 2.79 / 2.77% | 1.65 / 3.74 / 3.69% | 1.94 / 4.49 / 4.53% |
| StemGNN (Cao et al., 2020) | 1.23 / 2.48 / 2.63% | N/A from (Cao et al., 2020) | N/A from (Cao et al., 2020) |
| AGCRN (Bai et al., 2020) | 1.36 / 2.88 / 2.93% | 1.69 / 3.87 / 3.86% | 1.98 / 4.59 / 4.63% |
| CCRNN (Ye et al., 2021) | 1.38 / 2.90 / 2.90% | 1.74 / 3.87 / 3.90% | 2.07 / 4.65 / 4.87% |
| GTS (Shang et al., 2021) | 1.34 / 2.84 / 2.83% | 1.67 / 3.83 / 3.79% | 1.98 / 4.56 / 4.59% |
| PM-MemNet (Lee et al., 2022) | 1.34 / 2.82 / 2.81% | 1.65 / 3.76 / 3.71% | 1.95 / 4.49 / 4.54% |
| MegaCRN (Jiang et al., 2023) | 1.28 / 2.72 / 2.67% | 1.60 / 3.68 / 3.57% | 1.88 / 4.42 / 4.41% |
| TESTAM | 1.29 / 2.77 / 2.61% | 1.59 / 3.65 / 3.56% | 1.85 / 4.33 / 4.31% |

EXPY-TKY (MAE / RMSE / MAPE):

| Model | 10 min | 30 min | 60 min |
|---|---|---|---|
| HA (Li et al., 2018) | 7.63 / 11.96 / 31.26% | 7.63 / 11.96 / 31.25% | 7.63 / 11.96 / 31.24% |
| STGCN (Yu et al., 2018) | 6.09 / 9.60 / 24.84% | 6.91 / 10.99 / 30.24% | 8.41 / 12.70 / 32.90% |
| DCRNN (Li et al., 2018) | 6.04 / 9.44 / 25.54% | 6.85 / 10.87 / 31.02% | 7.45 / 11.86 / 34.61% |
| Graph-WaveNet (Wu et al., 2019) | 5.91 / 9.30 / 25.22% | 6.59 / 10.54 / 29.78% | 6.89 / 11.07 / 31.71% |
| STTN (Xu et al., 2020) | 5.90 / 9.27 / 25.67% | 6.53 / 10.40 / 29.82% | 6.99 / 11.23 / 32.52% |
| GMAN (Zheng et al., 2020) | 6.09 / 9.49 / 26.52% | 6.64 / 10.55 / 30.19% | 7.05 / 11.28 / 32.91% |
| MTGNN (Wu et al., 2020) | 5.86 / 9.26 / 24.80% | 6.49 / 10.44 / 29.23% | 6.81 / 11.01 / 31.39% |
| StemGNN (Cao et al., 2020) | 6.08 / 9.46 / 25.87% | 6.85 / 10.80 / 31.25% | 7.46 / 11.88 / 35.31% |
| AGCRN (Bai et al., 2020) | 5.99 / 9.38 / 25.71% | 6.64 / 10.63 / 29.81% | 6.99 / 11.29 / 32.13% |
| CCRNN (Ye et al., 2021) | 5.90 / 9.29 / 24.53% | 6.68 / 10.77 / 29.93% | 7.11 / 11.56 / 32.56% |
| GTS (Shang et al., 2021) | - / - / - | - / - / - | - / - / - |
| PM-MemNet (Lee et al., 2022) | 5.94 / 9.25 / 25.10% | 6.52 / 10.42 / 29.00% | 6.87 / 11.14 / 31.22% |
| MegaCRN (Jiang et al., 2023) | 5.81 / 9.20 / 24.49% | 6.44 / 10.33 / 28.92% | 6.83 / 11.04 / 31.02% |
| TESTAM | 5.84 / 9.23 / 25.36% | 6.42 / 10.24 / 28.90% | 6.75 / 11.01 / 31.01% |

4 Experiments

In this section, we describe experiments and compare the accuracy of TESTAM with that of existing models. We use three benchmark datasets for the experiments: METR-LA, PEMS-BAY, and EXPY-TKY. METR-LA and PEMS-BAY contain four-month speed data recorded by 207 sensors on Los Angeles highways and 325 sensors in the Bay Area, respectively (Li et al., 2018). EXPY-TKY consists of three-month speed data collected from 1843 links in Tokyo, Japan. As EXPY-TKY covers a larger number of roads in a smaller area, its spatial dependencies, with many abruptly changing speed patterns, are more difficult to model than those in METR-LA or PEMS-BAY. METR-LA and PEMS-BAY have 5-minute interval speeds and timestamps, whereas EXPY-TKY has 10-minute interval speeds and timestamps. Before training TESTAM, we perform z-score normalization. In the cases of METR-LA and PEMS-BAY, we use 70% of the data for training, 10% for validation, and 20% for evaluation. For EXPY-TKY, we utilize the first two months for training and validation and the last month for testing, as in the MegaCRN paper (Jiang et al., 2023).
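As an illustration of this preprocessing, the sketch below shows a chronological 70/10/20 split with z-score statistics fitted on the training split only (a standard practice; the helper names are ours, not the official TESTAM code):

```python
import numpy as np

def split_and_normalize(speeds, train_ratio=0.7, val_ratio=0.1):
    """speeds: (T, N) array of readings from N sensors over T timestamps."""
    t = len(speeds)
    n_train, n_val = int(t * train_ratio), int(t * val_ratio)
    train = speeds[:n_train]
    val = speeds[n_train:n_train + n_val]
    test = speeds[n_train + n_val:]
    # z-score statistics come from the training split only,
    # so that validation/test information does not leak into training
    mean, std = train.mean(), train.std()
    z = lambda a: (a - mean) / std
    return z(train), z(val), z(test), (mean, std)
```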

4.1 Experimental Settings

For all three datasets, we initialize the parameters and embedding using Xavier initialization. After performing a greedy search for hyperparameters, we set the hidden size $d=e=32$, the memory size $m=20$, the number of layers $l=3$, the number of heads $K=4$, the hidden size for the feed-forward networks $h_{ff}=128$, and the error quantile $q=0.7$. We use the Adam optimizer with $\beta_1=0.9$, $\beta_2=0.98$, and $\epsilon=10^{-9}$, as in Vaswani et al. (2017). We vary the learning rate during training using the cosine annealing warmup restart scheduler (Loshchilov & Hutter, 2017) according to the formula below:

$$lrate=\begin{cases}lr_{min}+(lr_{max}-lr_{min})\cdot\frac{T_{cur}}{T_{warm}}&\text{for the first }T_{warm}\text{ steps}\\[4pt] lr_{min}+\frac{1}{2}(lr_{max}-lr_{min})\left(1+\cos\left(\frac{T_{cur}}{T_{freq}}\pi\right)\right)&\text{otherwise,}\end{cases}\qquad(7)$$

where $T_{cur}$ is the number of steps since the last restart. We use $T_{warm}=T_{freq}=4000$ and $lr_{min}=10^{-7}$ for all datasets and set $lr_{max}=3\times10^{-3}$ for METR-LA and PEMS-BAY and $lr_{max}=3\times10^{-4}$ for EXPY-TKY. We follow the traditional 12-sequence (1 hour) input and 12-sequence output forecasting setting for METR-LA and PEMS-BAY and the 6-sequence (1 hour) input and 6-sequence output setting for EXPY-TKY, as in Jiang et al. (2023). We utilize mean absolute error (MAE) as the loss function and root mean squared error (RMSE) and mean absolute percentage error (MAPE) as evaluation metrics. All experiments are conducted on an RTX 3090 GPU.
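For concreteness, a minimal sketch of the schedule in Eq. 7 follows; the function and argument names are ours, and the defaults reflect the METR-LA/PEMS-BAY settings above:

```python
import math

def lr_schedule(t_cur, lr_min=1e-7, lr_max=3e-3, t_warm=4000, t_freq=4000):
    """Cosine annealing with warmup restarts (Eq. 7).

    t_cur counts the steps since the last restart.
    """
    if t_cur < t_warm:
        # linear warmup for the first T_warm steps
        return lr_min + (lr_max - lr_min) * t_cur / t_warm
    # cosine annealing afterwards
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(t_cur / t_freq * math.pi))
```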

We compare TESTAM with 13 baseline models: (1) historical average; (2) STGCN (Yu et al., 2018), a model with GCNs and CNNs; (3) DCRNN (Li et al., 2018), a model with graph convolutional recurrent units; (4) Graph-WaveNet (Wu et al., 2019) with a parameterized adjacency matrix; (5) STTN (Xu et al., 2020) and (6) GMAN (Zheng et al., 2020), state-of-the-art attention-based models; (7) MTGNN (Wu et al., 2020), (8) StemGNN (Cao et al., 2020), and (9) AGCRN (Bai et al., 2020), advanced models with an adaptive matrix; (10) CCRNN (Ye et al., 2021), a model with multiple adaptive matrices; (11) GTS (Shang et al., 2021), a model with a graph constructed with long-term historical data; and (12) PM-MemNet (Lee et al., 2022) and (13) MegaCRN (Jiang et al., 2023), state-of-the-art models with memory units.

4.2 Experimental Results

The experimental results are shown in Table 1. TESTAM outperforms all other models, especially in long-term predictions, which are usually more difficult. Note that we use the results reported in the respective papers after comparing them with results reproduced using the official code provided by the authors. The models with learnable static graphs (Graph-WaveNet, MTGNN, and CCRNN) and dynamic graphs (STTN and GMAN) show competitive performance, indicating that they have certain advantages. In terms of temporal modeling, RNN-based temporal models (DCRNN and AGCRN) perform worse than the other methods in long-term forecasting due to the error accumulation of RNNs. Conversely, MegaCRN and PM-MemNet maintain their advantages even in long-term forecasting by injecting a memory-augmented representation vector into the decoder. GMAN and StemGNN perform worse on EXPY-TKY, revealing disadvantages of attention methods, such as long-tail problems and uniformly distributed attention (Jin et al., 2023).

As EXPY-TKY has 6–9 times more roads than the other two datasets, the experimental results on EXPY-TKY highlight the importance of spatial modeling. For example, attention-based spatial modeling methods show disadvantages, and the results of modeling with time-varying networks (e.g., StemGNN) suggest that they cannot properly capture spatial dependencies. In contrast, our model, TESTAM, shows its superiority over all other models, including those with learnable matrices. The results demonstrate that in-situ spatial modeling is crucial for traffic forecasting.

4.3 Ablation Study

Table 2: Ablation study results across all prediction windows (i.e., average performance)
Ablation METR-LA PEMS-BAY EXPY-TKY
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
w/o gating 3.00 6.12 8.29% 1.58 3.57 3.53% 6.74 10.97 29.48%
Ensemble 2.98 6.08 8.12% 1.56 3.53 3.50% 6.66 10.68 29.43%
worst-route avoidance only 2.96 6.06 8.11% 1.55 3.52 3.48% 6.45 10.50 28.70%
Replaced 2.97 6.04 8.05% 1.56 3.54 3.47% 6.56 10.62 29.20%
w/o TIM 2.96 5.98 8.07% 1.54 3.45 3.46% 6.44 10.40 28.94%
w/o time-enhanced attention 2.99 6.03 8.15% 1.58 3.59 3.52% 6.64 10.75 29.85%
TESTAM 2.93 5.95 7.99% 1.53 3.47 3.41% 6.40 10.40 28.67%

The ablation study has two goals: to evaluate actual improvements achieved by each method, and to test two hypotheses: (1) in-situ modeling with diverse graph structures is advantageous for traffic forecasting and (2) having two loss functions for avoiding the worst route and leading to the best route is effective. To achieve these aims, we have designed a set of TESTAM variants, which are described below:

w/o gating

It uses only the output of the attention experts without ensembles or any other gating mechanism. Memory items are not trained because there are no gradient flows for the adaptive expert or gating networks. This setting results in an architecture similar to that of GMAN.

Ensemble

Instead of using MoE routing, the final output is calculated as the weighted sum of each expert’s output, with weights produced by the gating network. This setting allows the use of all spatial modeling methods but no in-situ modeling.

worst-route avoidance only

It excludes the loss for guiding best-route selection. The exclusion of this relatively coarse-grained loss function is based on the fact that coarse-grained routing tends not to change its decisions after initialization (Dryden & Hoefler, 2022).

Replaced

It does not exclude any components. Instead, it replaces the identity expert with a GCN-based adaptive expert, reducing spatial modeling diversity. The purpose of this setting is to test the hypothesis that in-situ modeling with diverse graph structures is helpful for traffic forecasting.

w/o TIM

It replaces temporal information embedding (TIM) with simple embedding vectors without periodic activation functions.

w/o time-enhanced attention
带有时间增强的注意力

It replaces time-enhanced attention with the basic temporal attention described in Sec. 3.2.

The experimental results shown in Table 2 indicate that our hypotheses are supported and that TESTAM is a complete and indivisible set. The results of “w/o gating” and “ensemble” suggest that in-situ modeling greatly improves traffic forecasting quality. The “w/o gating” results indicate that the performance improvement is not due to our model architecture alone but due to in-situ modeling itself, since this setting leads to performance comparable to that of GMAN (Zheng et al., 2020). The “worst-route avoidance only” results show that our hypothesis that both routing classification losses are crucial for proper routing is valid. Finally, the results of “replaced,” which are significantly worse even than those of “worst-route avoidance only,” confirm the hypothesis that diverse graph structures are helpful for in-situ modeling. Additional qualitative results with examples are provided in Appendix C.
2 中的实验结果表明我们的假设得到了支持,并且 TESTAM 是一个完整且不可分割的集合。“w/o gating”和“ensemble”的结果表明现场建模极大地提高了流量预测质量。“w/o gating”的结果表明性能的提升不是由于我们的模型,而是由于现场建模本身,因为这种设置导致的性能与 GMAN (Zheng et al., 2020 ) 相当。“仅避免最差路线”的结果表明我们的两种路线分类损失对于正确路线都至关重要的假设是正确的。最后,“replaced”的结果甚至比“仅避免最差路线”的结果性能差得多,证实了多样化的图结构有助于现场建模的假设。附录 C 中提供了其他定性结果和示例。

5 Conclusion

In this paper, we propose the time-enhanced spatio-temporal attention model (TESTAM), a novel Mixture-of-Experts model with attention that enables effective in-situ spatial modeling in both recurring and non-recurring situations. By transforming a routing problem into a classification task, TESTAM can contextualize various traffic conditions and choose the most appropriate spatial modeling method. TESTAM achieves superior performance to existing traffic forecasting models on three real-world datasets: METR-LA, PEMS-BAY, and EXPY-TKY. The results obtained using the EXPY-TKY dataset indicate that TESTAM is highly advantageous for large-scale graph structures, which are more applicable to real-world problems. We have also obtained qualitative results visualizing when and where TESTAM chooses specific graph structures. In future work, we plan to further improve and generalize TESTAM for other spatio-temporal and multivariate time series forecasting tasks.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00218913, No. 2021R1A2C1004542), by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants (No. 2020-0-01336, Artificial Intelligence Graduate School Program, UNIST), and by the Green Venture R&D Program (No. S3236472), funded by the Ministry of SMEs and Startups (MSS, Korea).

References

  • Bai et al. (2020) Lei Bai, Lina Yao, Can Li, Xinazhi Wang, and Can Wang. Adaptive graph convolutional recurrent network for traffic forecasting. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • Cao et al. (2020) Defu Cao, Yujing Wang, Juanyong Duan, Ce Zhang, Xia Zhu, Congrui Huang, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. Spectral temporal graph neural network for multivariate time-series forecasting. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • Dryden & Hoefler (2022) Nikoli Dryden and Torsten Hoefler. Spatial mixture-of-experts. In Advances in Neural Information Processing Systems, volume 35, 2022.
  • Eigen et al. (2014) David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. In International Conference on Learning Representations, 2014.
  • Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23:120:1–120:39, 2022.
  • Geng et al. (2019) Xu Geng, Yaguang Li, Leye Wang, Lingyu Zhang, Qiang Yang, Jieping Ye, and Yan Liu. Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3656–3663, 2019.
  • Ha et al. (2017) David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2017.
  • Jiang et al. (2023) Renhe Jiang, Zhaonan Wang, Jiawei Yong, Puneet Jeph, Quanjun Chen, Yasumasa Kobayashi, Xuan Song, Shintaro Fukushima, and Toyotaro Suzumura. Spatio-temporal meta-graph learning for traffic forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.  8078–8086, 2023.
  • Jin et al. (2023) Seungmin Jin, Hyunwook Lee, Cheonbok Park, Hyeshin Chu, Yunwon Tae, Jaegul Choo, and Sungahn Ko. A visual analytics system for improving attention-based traffic forecasting models. IEEE Transactions on Visualization and Computer Graphics, 29(1):1102–1112, 2023.
  • Kazemi et al. (2019) Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, and Marcus A. Brubaker. Time2vec: Learning a vector representation of time. CoRR, abs/1907.05321, 2019. URL http://arxiv.org/abs/1907.05321.
  • Lee et al. (2020) C. Lee, Y. Kim, S. Jin, D. Kim, R. Maciejewski, D. Ebert, and S. Ko. A visual analytics system for exploring, monitoring, and forecasting road traffic congestion. IEEE Transactions on Visualization and Computer Graphics, 26(11):3133–3146, 2020.
  • Lee et al. (2022) Hyunwook Lee, Seungmin Jin, Hyeshin Chu, Hongkyu Lim, and Sungahn Ko. Learning to remember patterns: Pattern matching memory networks for traffic forecasting. In International Conference on Learning Representations, 2022.
  • Li & Shahabi (2018) Yaguang Li and Cyrus Shahabi. A brief overview of machine learning methods for short-term traffic forecasting and future directions. SIGSPATIAL Special, 10(1):3–9, 2018.
  • Li et al. (2018) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations, 2018.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  • McGill & Perona (2017) Mason McGill and Pietro Perona. Deciding how to decide: Dynamic routing in artificial neural networks. In Proceedings of the International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.  2363–2372, 2017.
  • Park et al. (2020) Cheonbok Park, Chunggi Lee, Hyojin Bahng, Yunwon Tae, Seungmin Jin, Kihwan Kim, Sungahn Ko, and Jaegul Choo. ST-GRAT: A novel spatio-temporal graph attention networks for accurately forecasting dynamically changing road speed. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pp.  1215–1224. ACM, 2020.
  • Riquelme et al. (2021) Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems, pp.  8583–8595, 2021.
  • Rosenbaum et al. (2018) Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. In International Conference on Learning Representations, 2018.
  • Ryan et al. (2019) G. Ryan, A. Mosca, R. Chang, and E. Wu. At a glance: Pixel approximate entropy as a measure of line chart complexity. IEEE Transactions on Visualization and Computer Graphics, 25(01):872–881, 2019. ISSN 1941-0506. doi: 10.1109/TVCG.2018.2865264 .
  • Shang et al. (2021) Chao Shang, Jie Chen, and Jinbo Bi. Discrete graph structure learning for forecasting multiple time series. In International Conference on Learning Representations, 2021.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • Vlahogianni et al. (2014) Eleni I. Vlahogianni, Matthew G. Karlaftis, and John C. Golias. Short-term traffic forecasting: Where we are and where we’re going. Transportation Research Part C: Emerging Technologies, 43:3–19, 2014. Special Issue on Short-term Traffic Flow Forecasting.
  • Wu et al. (2019) Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph wavenet for deep spatial-temporal graph modeling. In Proceedings of the International Joint Conference on Artificial Intelligence, pp.  1907–1913, 2019.
  • Wu et al. (2020) Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, Xiaojun Chang, and Chengqi Zhang. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
  • Xu et al. (2020) Mingxing Xu, Wenrui Dai, Chunmiao Liu, Xing Gao, Weiyao Lin, Guo-Jun Qi, and Hongkai Xiong. Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908, 2020.
  • Ye et al. (2021) Junchen Ye, Leilei Sun, Bowen Du, Yanjie Fu, and Hui Xiong. Coupled layer-wise graph convolution for transportation demand prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 35(5):4617–4625, 2021.
  • Yu et al. (2018) Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the International Joint Conference on Artificial Intelligence, pp.  3634–3640, 2018.
  • Zhang et al. (2016) Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, and Xiuwen Yi. Dnn-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPACIAL ’16, 2016.
  • Zhang et al. (2020) Qi Zhang, Jianlong Chang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Spatio-temporal graph structure learning for traffic forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01):1177–1185, 2020.
  • Zheng et al. (2020) Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. GMAN: A graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  1234–1241, 2020.
  • Zhou et al. (2022) Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, and James Laudon. Mixture-of-experts with expert choice routing. In Advances in Neural Information Processing Systems, volume 35, 2022.

Appendix A Routing Classification Loss Function

In this section, we provide detailed information on the routing classification loss function. Both the worst-route avoidance and best-route selection functions are cross-entropy loss functions with different pseudo labels and routing levels. For worst-route avoidance, we compute fine-grained routing for each point of each road, as Dryden & Hoefler (2022) do. However, utilizing worst-route avoidance alone is suboptimal because experts have fewer opportunities to specialize for the best routing. Therefore, we adopt the best-route selection loss function for the routing problem. While designing the best-route selection loss function, we have two main concerns: 1) traffic data often show severe fluctuations, which prevent a model from consistently choosing the best-fit experts, and 2) best-route selection itself is a more complex task than worst-route avoidance, making the model hard to train with fine-grained routing. To overcome these challenges, we construct a node-wise best-route selection loss.

A.1 Worst-Route Avoidance Loss

For the worst-route avoidance loss function, we build our pseudo label $l_e$ as in Eq. 6. In this section, we describe how those labels are chosen. Given a prediction $\hat{Y}\in\mathbb{R}^{N\times T}$ and the ground truth $Y\in\mathbb{R}^{N\times T}$, we have a point-wise distance $L(Y,\hat{Y})\in\mathbb{R}^{N\times T}$ between prediction and ground truth. Given the point-wise distances and the error quantile $q$, we say that the routing of road $n$ at time $t$ is incorrect (i.e., the worst routing) if $L(y_{n,t},\hat{y}_{n,t})$ is greater than the $q$-th quantile. Therefore, if $L(y_{n,t},\hat{y}_{n,t})$ is greater than the $q$-th quantile, the pseudo label of the selected expert is zero, and the labels of the other, unselected experts are $1/(E-1)$, where $E$ is the total number of experts. Conversely, if $L(y_{n,t},\hat{y}_{n,t})$ is smaller than the $q$-th quantile, which means the routing is correct and the worst route is avoided, the selected expert has a pseudo label of one and the other experts have pseudo labels of zero. Formally, we define the pseudo label of expert $e$ as follows:

$$l_e=\begin{cases}1&\text{if }L(y,\hat{y})\text{ is smaller than the }q\text{-th quantile and }p_e=\operatorname{argmax}(\mathbf{p})\\ 1/(E-1)&\text{if }L(y,\hat{y})\text{ is greater than the }q\text{-th quantile and }p_e\neq\operatorname{argmax}(\mathbf{p})\\ 0&\text{otherwise}\end{cases}$$
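To make the labeling concrete, here is a minimal numpy sketch; the array shapes, function names, and the use of absolute error as the distance $L$ are our assumptions rather than the official implementation:

```python
import numpy as np

def worst_route_labels(y, y_hat, gate_probs, q=0.7):
    """Pseudo labels l_e for worst-route avoidance (a sketch, not the official code).

    y, y_hat   : (N, T) ground truth and prediction
    gate_probs : (N, T, E) routing probabilities p over E experts
    q          : error quantile (the paper uses q = 0.7)
    """
    err = np.abs(y - y_hat)                      # point-wise distance L(y, y_hat)
    thresh = np.quantile(err, q)                 # q-th error quantile
    E = gate_probs.shape[-1]
    one_hot = np.eye(E)[gate_probs.argmax(-1)]   # (N, T, E); 1 at the selected expert
    bad = (err > thresh)[..., None]              # worst-routed points
    # correct routing: label 1 for the selected expert, 0 for the rest;
    # worst routing:   label 0 for the selected expert, 1/(E-1) for the rest
    return np.where(bad, (1.0 - one_hot) / (E - 1), one_hot)
```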

A.2 Best-Route Selection Loss

For the best-route selection loss function, we define node-wise pseudo labels by inverting each condition of the pseudo labeling for worst-route avoidance. For worst-route avoidance, we assume that the routing is incorrect (i.e., the worst) if $L(y_{n,t},\hat{y}_{n,t})$ is greater than the $q$-th quantile. In best-route selection, we define the routing as correct (i.e., the best) if $L(y_n,\hat{y}_n)$ is smaller than the $(1-q)$-th quantile; otherwise, it is incorrectly routed, as shown below:

le={1if L(y,y^) is smaller than 1q-th quantile and pe=argmax(𝐩)1/(E1)if L(y,y^) is greater than 1q-th quantile and peargmax(𝐩)0otherwisel_{e}=\begin{cases}1&\text{if $L(y,\hat{y})$ is smaller than $1-q$-th quantile% and $p_{e}=argmax(\mathbf{p})$}\\ 1/(E-1)&\text{if $L(y,\hat{y})$ is greater than $1-q$-th quantile and $p_{e}% \neq argmax(\mathbf{p})$}\\ 0&otherwise\end{cases}italic_l start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT = { start_ROW start_CELL 1 end_CELL start_CELL if italic_L ( italic_y , over^ start_ARG italic_y end_ARG ) is smaller than 1 - italic_q -th quantile and italic_p start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT = italic_a italic_r italic_g italic_m italic_a italic_x ( bold_p ) end_CELL end_ROW start_ROW start_CELL 1 / ( italic_E - 1 ) end_CELL start_CELL if italic_L ( italic_y , over^ start_ARG italic_y end_ARG ) is greater than 1 - italic_q -th quantile and italic_p start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ≠ italic_a italic_r italic_g italic_m italic_a italic_x ( bold_p ) end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_o italic_t italic_h italic_e italic_r italic_w italic_i italic_s italic_e end_CELL end_ROW
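Continuing the sketch above, the node-wise variant changes only the aggregation axis and the quantile threshold; the mean-over-time aggregation of errors and gate probabilities is our assumption:

```python
import numpy as np

def best_route_labels(y, y_hat, gate_probs, q=0.7):
    """Node-wise pseudo labels for best-route selection (same shapes as above)."""
    err = np.abs(y - y_hat).mean(axis=1)          # (N,) node-wise distance L(y_n, y_hat_n)
    thresh = np.quantile(err, 1.0 - q)            # (1-q)-th quantile
    E = gate_probs.shape[-1]
    node_probs = gate_probs.mean(axis=1)          # (N, E) node-level routing probabilities
    one_hot = np.eye(E)[node_probs.argmax(-1)]    # 1 at each node's selected expert
    bad = (err > thresh)[:, None]                 # incorrectly routed nodes
    return np.where(bad, (1.0 - one_hot) / (E - 1), one_hot)
```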
Table 3: Computation time of the models with the METR-LA dataset
Training time/epoch Inference time # of params
STGCN 14.8 secs 16.70 secs 320k
DCRNN 122.22 secs 13.44 secs 372k
Graph-WaveNet 48.07 secs 3.69 secs 309k
GMAN 312.1 secs 33.7 secs 901k
MegaCRN 84.7 secs 11.76 secs 339k
TESTAM 150 secs 7.96 secs 224k

Appendix B Computational Cost Analysis

For the computational cost analysis, we use five models as baselines: 1) STGCN (Yu et al., 2018), the lightest model, which utilizes GCNs and CNNs to forecast the 1-step future traffic condition; 2) DCRNN (Li et al., 2018), a well-known traffic forecasting model with graph-convolutional recurrent units; 3) Graph-WaveNet (Wu et al., 2019), a model that forecasts values by parallel computation with GCNs and CNNs; 4) GMAN (Zheng et al., 2020), a spatio-temporal attention model for traffic forecasting; and 5) MegaCRN (Jiang et al., 2023), one of the state-of-the-art models using GCRNN and memory network concepts.

We have investigated other models for comparison but decided to exclude them after careful consideration. For example, we have excluded MTGNN and StemGNN since they are improved versions of Graph-WaveNet with similar computational costs. Similarly, AGCRN, CCRNN, and GTS are excluded from the baselines because they are variants of DCRNN with few changes in computational costs. PM-MemNet and MegaCRN utilize sequence-to-sequence modeling with shared memory units; however, PM-MemNet suffers a computational bottleneck from its stacked memory units, which require $L$ times larger computational costs than those of MegaCRN.

Even though TESTAM utilizes three individual experts for prediction, we emphasize that it has a smaller number of parameters than the other models due to its small number of layers per expert, which strongly affects the computational costs. Furthermore, TESTAM only uses the encoder architecture of the Transformer with a time-enhanced attention module that enables parallel computation, eliminating the computational bottleneck caused by the decoding process. As a result, in terms of computational costs, TESTAM is two times cheaper than the attention-based model (i.e., GMAN), with a training time similar to that of DCRNN. Furthermore, in the inference phase, TESTAM shows the second fastest computation with the smallest number of parameters.

Table 4: Case-specific experimental results on three real-world datasets. The numbers in bold indicate the best performance, and the underlined numbers indicate the second-best performance. (I) denotes isolated roads. (H) denotes hard-to-predict roads, including intersections and roads with high traffic fluctuations. (E) denotes non-recurring circumstances, such as holidays or accidents.
METR-LA (I) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 3.58 5.93 8.43% 3.90 6.74 9.59% 4.29 7.50 10.98%
GMAN (Zheng et al., 2020) 3.81 6.99 9.15% 4.03 7.48 9.97% 4.32 8.13 11.13%
MegaCRN (Jiang et al., 2023) 3.54 5.88 8.46% 3.88 6.69 9.74% 4.35 7.67 11.39%
TESTAM 3.52 5.89 8.37% 3.80 6.59 9.43% 4.13 7.31 10.72%
METR-LA (H) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 4.07 6.74 11.75% 4.73 8.03 14.38% 5.48 9.34 17.20%
GMAN (Zheng et al., 2020) 4.37 7.79 12.83% 4.86 8.75 14.77% 5.37 9.64 16.77%
MegaCRN (Jiang et al., 2023) 4.02 6.68 11.61% 4.73 8.13 14.46% 5.55 9.72 17.67%
TESTAM 3.96 6.62 11.42% 4.51 7.75 13.57% 5.19 9.03 16.04%
METR-LA (E) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 4.22 7.14 13.08% 5.12 8.84 16.81% 6.17 10.62 21.19%
GMAN (Zheng et al., 2020) 4.45 7.67 14.49% 5.16 9.13 17.53% 5.95 10.58 21.18%
MegaCRN (Jiang et al., 2023) 4.03 6.91 12.37% 4.96 8.75 16.10% 6.01 10.69 20.58%
TESTAM 4.11 7.09 12.54% 4.92 8.71 15.71% 5.89 10.46 19.69%
PEMS-BAY (I) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 2.28 4.06 4.89% 2.87 5.29 6.65% 3.48 6.47 8.86%
GMAN (Zheng et al., 2020) 2.51 5.12 5.65% 3.06 6.20 7.34% 3.55 7.10 8.90%
MegaCRN (Jiang et al., 2023) 2.28 4.10 5.06% 2.92 5.58 7.27% 3.49 6.76 9.22%
TESTAM 2.26 4.03 4.67% 2.86 5.24 6.45% 3.36 6.45 8.55%
PEMS-BAY (H) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 2.46 4.42 5.44% 3.14 5.89 7.75% 3.81 7.23 10.50%
GMAN (Zheng et al., 2020) 2.72 5.50 6.28% 3.34 6.76 8.42% 3.88 7.76 10.37%
MegaCRN (Jiang et al., 2023) 2.47 4.44 5.62% 3.19 6.17 8.34% 3.82 7.48 10.76%
TESTAM 2.46 4.52 5.48% 3.10 5.75 7.62% 3.69 7.16 9.96%
PEMS-BAY (E) 15 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 2.64 4.73 6.22% 3.39 6.21 8.91% 4.20 7.78 12.69%
GMAN (Zheng et al., 2020) 2.82 5.11 6.99% 3.55 6.68 9.72% 4.19 7.83 12.23%
MegaCRN (Jiang et al., 2023) 2.61 4.65 6.25% 3.51 6.70 10.01% 4.24 8.14 13.11%
TESTAM 2.59 4.58 5.98% 3.39 6.15 8.82% 4.03 7.57 11.45%
EXPY-TKY (I) 10 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 8.28 12.56 28.65% 9.40 14.23 33.69% 10.24 15.35 36.99%
GMAN (Zheng et al., 2020) 8.14 12.45 28.81% 8.75 13.43 31.11% 9.26 14.20 32.62%
MegaCRN (Jiang et al., 2023) 8.06 12.30 27.94% 8.98 13.71 32.22% 9.64 14.63 35.07%
TESTAM 7.87 12.26 26.95% 8.60 13.44 29.69% 9.03 14.06 31.83%
EXPY-TKY (H) 10 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 8.45 12.79 30.97% 9.63 14.50 36.07% 10.52 15.64 39.75%
GMAN (Zheng et al., 2020) 8.30 12.63 31.08% 8.90 13.57 33.58% 9.44 14.33 35.19%
MegaCRN (Jiang et al., 2023) 8.24 12.49 30.47% 9.21 13.92 34.60% 9.90 14.87 38.00%
TESTAM 8.06 12.48 29.08% 8.81 13.67 31.88% 9.26 14.31 34.23%
EXPY-TKY (E) 10 min 30 min 60 min
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
Graph-WaveNet (Wu et al., 2019) 11.18 16.23 65.97% 11.93 17.21 71.71% 12.32 17.69 74.75%
GMAN (Zheng et al., 2020) 11.02 16.06 64.93% 11.30 16.47 66.84% 11.49 16.74 67.10%
MegaCRN (Jiang et al., 2023) 11.04 16.32 62.13% 11.67 17.07 67.93% 11.94 17.37 71.43%
TESTAM 10.82 15.94 63.30% 11.28 16.01 65.51% 11.49 16.46 66.94%

Appendix C Detailed Experimental Results

Table 4 presents experimental results under various environment settings. In the table, we pose three scenarios: (I), (H), and (E). Each scenario represents conditions under which accurate forecasting is difficult. (I) is a set of isolated roads, chosen by considering spatial locations and quantitative analysis results on the adjacency matrix. (H) is a set of hard-to-predict roads, including intersections and roads with high traffic fluctuations. We have determined the roads for (H) by visually exploring the roads (for intersections) and by selecting the roads with the top 10% entropy-based time-series complexity. (E) contains roads and time intervals with sudden events, including accidents, traffic controls, or holidays (e.g., Christmas). In the experiments, we compare TESTAM with three baselines, each selected as a representative model of one spatial modeling method.

As shown in Table 4, TESTAM outperforms the other models in non-recurring situations (i.e., (E)) and on roads with unique spatial and topological features ((I) and (H)). In particular, TESTAM consistently proves its superiority on roads with spatially unique features, outperforming existing models by 4% to 7% in general. Among the three baselines, we observe that the attention-based modeling method encodes spatial information better in most hard-to-predict scenarios. However, there are cases where the attention-based model fails, such as PEMS-BAY (I) and PEMS-BAY (H). From the perspective of temporal modeling, the attention-based method shows better long-term forecasting performance, while CNN- and RNN-based methods are advantageous in short-term forecasting. TESTAM, in contrast, outperforms all the baselines in both short-term and long-term forecasting. The results indicate that temporal information embedding and time-enhanced attention help the model effectively transfer information from the input domain to the output domain.

C.1 Qualitative Evaluation

We perform a qualitative evaluation of TESTAM by visualizing the impact of our context-aware spatial modeling in four types of cases: 1) hard-to-predict roads with recurring patterns; 2) isolated roads (I); 3) roads with unique traffic patterns; and 4) roads with non-recurring patterns, for evaluating the event awareness of TESTAM. We use the EXPY-TKY dataset, which contains complex urban road networks with various traffic patterns.

Figure 2: Visualization of a recurring pattern in a hard-to-predict road, Road 1349, from Dec 14th to Dec 17th. Road 1349 is a highway entrance located near Tokyo station. The locations of the roads are indicated in Fig. 7.
Recurring Patterns on Hard-to-Predict Roads

With the hard-to-predict roads, we observe that previous models often fail to effectively encode the spatial and temporal correlations of the roads, as Fig. 2 shows. For example, Road 1349 is a highway entrance located near Tokyo station that has one of the largest traffic volumes in Tokyo and accordingly shows severe fluctuations in the data. Because of the fluctuations and complex spatio-temporal dependencies of the roads, prior models show their limitations in spatial modeling. In particular, Graph-WaveNet (the green line in Fig. 2), which relies on a learnable static graph, fails to catch either the speed drop or the speed rise in time in the red box of Fig. 2. GMAN and MegaCRN (the red and violet lines) properly model the end of rush hour but fail to predict its start. Furthermore, MegaCRN exhibits noise-sensitive behavior on Dec. 14th (top-left) and 17th (bottom-right) in Fig. 2.

Figure 3: Qualitative forecasting result analysis for spatially isolated roads (I). The locations of the roads are indicated in Fig. 6.
Spatially Isolated Roads (I)

When forecasting spatially isolated roads, the model should focus on the road itself instead of referring to the other roads, which are less informative for prediction. However, since existing models give little consideration to the importance of self-referencing, they fail to properly model rapid speed changes (e.g., noon in Fig. 3) or become confused by information from the other roads (MegaCRN at 15:00 on Dec 3rd in Fig. 3), resulting in poor forecasting. In contrast, TESTAM accurately forecasts the rapid speed changes that occur at 3:00 and 12:00, as it enhances temporal modeling with the identity expert, temporal information embedding, and a time-enhanced attention layer.

Figure 4: Qualitative forecasting result analysis for Road 1111, a highway ramp located in Shibuya, with unique traffic patterns. The locations of the roads are indicated in Fig. 7.
Roads with Unique Traffic Patterns: The Case of a Highway Ramp

In EXPY-TKY, many roads show unique patterns due to complex urban road networks and various traffic behaviors (e.g., commuting and traveling). Road 1111 is a highway ramp located in Shibuya with unique patterns. One such pattern is that the road tends to stay below 30 km/h all day, except around 3:00 (the red box in Fig. 4). GMAN and Graph-WaveNet cannot handle such unique patterns properly, failing to model the speed increase at 3:00. MegaCRN, on the other hand, predicts a high-speed situation owing to the pattern-awareness of its memory units, but it still fails to forecast it in time. In contrast, TESTAM forecasts the traffic changes in a timely manner, revealing its superiority in modeling the unique behaviors of roads, as shown in Fig. 4.

Figure 5: Qualitative analysis results for Road 1196 (a metropolitan expressway) on Dec. 14th, when traffic control may have occurred because of heavy snow (red boxes).
Event-Aware Forecasting Case

We qualitatively evaluate and show the importance of context-aware spatial modeling in improving forecasting performance under various traffic conditions, such as sudden traffic control. Fig. 5 visualizes recurring and non-recurring traffic conditions caused by traffic control for heavy snow. Because of the unexpected traffic controls, there are sudden speed drops from morning to noon. In such non-recurring traffic conditions, TESTAM shows better forecasting results due to its context-awareness. GMAN and MegaCRN partially capture the sudden changes but cannot make timely predictions.

Appendix D Detailed Selection Procedures and Locations of the Roads for Case Study

In this section, we describe how we selected the roads and time intervals for each scenario: (I), (H), and (E). In the cases of (I) and (H), we extracted the roads regardless of time.

Figure 6: Location visualization for spatially isolated roads. The red circles are selected roads and the blue circles are unselected ones. (Left) the 262 roads before filtering, (middle) newly selected roads from visual investigation, and (right) the finalized list of 162 roads. The pink circle is the location of Roads 1165 and 1166.
Spatially Isolated Roads (I)

We have selected spatially isolated roads with two procedures: 1) investigating network topology and 2) filtering the candidates by visually investigating their locations. From the investigation of network topology, we found a total of 262 roads without any connection. However, the random sampling process used to build the traffic dataset makes the network topology sparse, so it cannot fully represent real-world connectivity. Therefore, instead of directly utilizing all 262 roads, we refine the list by visual investigation, filtering out 150 roads and adding 50 roads. Finally, we have a total of 162 roads for (I), as shown in Fig. 6.

Figure 7: Location visualization for hard-to-predict roads (left) and the specific locations of Road 1111 (middle) and Road 1349 (right). The green arrows indicate the main traffic flows and directions for each road.
Hard-to-Predict Roads (H)

For the hard-to-predict roads (H), we conduct two selection processes: visual exploration and entropy-based time-series complexity. Inspired by Ryan et al. (2019), we use entropy-based time-series complexity, which measures the noise level and the unpredictability of changes in a series of points by estimating the probability that similar patterns will be repeated. From the whole set of 1843 roads, we extract the 184 roads with the top 10% largest entropy, which are the most unpredictable roads. Furthermore, we additionally insert 34 hard-to-predict roads, such as intersections, ramps, and highway entrances, found by visually investigating the roads. We finalize our list as shown in Fig. 7 (left).
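For illustration, below is a minimal sketch of sample entropy, one standard entropy-based complexity measure of this kind; the paper follows the measure of Ryan et al. (2019), and the parameter choices here are common defaults rather than the paper's settings:

```python
import numpy as np

def sample_entropy(x, m=2, r=None):
    """Sample entropy of a 1-D speed series: low values mean repeated,
    predictable patterns; high values mean noisy, hard-to-predict behavior."""
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * x.std()  # common tolerance choice (an assumption)

    def matching_pairs(length):
        # embed the series into overlapping windows of the given length
        w = np.lib.stride_tricks.sliding_window_view(x, length)
        # Chebyshev distance between every pair of windows
        d = np.abs(w[:, None, :] - w[None, :, :]).max(axis=-1)
        return ((d <= r).sum() - len(w)) / 2  # matching pairs, excluding self-matches

    b, a = matching_pairs(m), matching_pairs(m + 1)
    return -np.log(a / b) if a > 0 and b > 0 else np.inf
```

Roads whose series fall in the top 10% of such an entropy score would then be flagged as hard-to-predict.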

Roads and Time with Sudden Events (E)

The circumstances in (E) are especially important, since they are location- and time-specific events, and finding such samples requires tremendous effort. In this paper, we find the events with two strategies: 1) finding specific time intervals (e.g., Christmas) for the hard-to-predict roads (Fig. 7, left) and 2) finding traffic controls and constructions, which are easier to find than accidents and have credible sources, including official announcements from the Metropolitan Expressway Co., Ltd. (https://www.shutoko.co.jp/). As a result, for (E), we choose data at holiday intervals on hard-to-predict roads, as well as roads and intervals under construction or sudden traffic control.