











 
 
Review

Time Series Analysis Based on Informer Algorithms: A Survey

by Qingbo Zhu 1,2, Jialin Han 2, Kai Cai 1 and Cunsheng Zhao 1,*
1 College of Naval Architecture and Ocean, Naval University of Engineering, Wuhan 430033, China
2 School of Mechanical Engineering, Hubei University of Technology, Wuhan 430068, China
* Author to whom correspondence should be addressed.
Symmetry 2023, 15(4), 951; https://doi.org/10.3390/sym15040951
Submission received: 20 March 2023 / Revised: 17 April 2023 / Accepted: 17 April 2023 / Published: 21 April 2023
(This article belongs to the Special Issue Machine Learning and Data Analysis)

Abstract

Long series time forecasting has become a popular research direction in recent years, due to its ability to predict weather changes, traffic conditions, and so on. This paper provides a comprehensive discussion of long series time forecasting techniques and their applications, using the Informer algorithm model as a framework. Specifically, we examine sequential time prediction models published in the last two years, including the tightly coupled convolutional transformer (TCCT) algorithm, Autoformer algorithm, FEDformer algorithm, Pyraformer algorithm, and Triformer algorithm. Researchers have made significant improvements to the attention mechanism and the Informer algorithm model architecture in these different neural network models, resulting in recent approaches such as the wavelet enhancement structure, auto-correlation mechanism, and deep decomposition architecture. In addition, attention algorithms and many of these models show potential for mechanical vibration prediction. In recent state-of-the-art studies, researchers have used the Informer algorithm model as an experimental baseline, which shows that the algorithm model itself has research value. The Informer algorithm model performs relatively well on various data sets and has become a fairly typical algorithm model for time series forecasting, and its model value is worthy of in-depth exploration and research. This paper discusses the structures and innovations of five representative models, including Informer, and reviews the performance of different neural network structures. The advantages and disadvantages of each model are discussed and compared, and finally, the future research direction of long series time forecasting is discussed.

1. Introduction

In recent years, long series time forecasting (LSTF) has become a widely used technique in various fields, including weather forecasting [1], stock market analysis [2], medical diagnosis [3], traffic prediction [4], defect detection [5], vibration analysis [6], action recognition [7], and anomaly detection [8]. The Transformer algorithm model was introduced by the Google team in 2017 [9] and has since replaced the Long Short-Term Memory (LSTM) algorithm model [10] as one of the most popular neural network prediction models. The Transformer algorithm model has been shown to outperform the LSTM algorithm model in terms of both accuracy and computational efficiency. However, researchers have found that the Transformer algorithm model still faces several challenges that limit the direct application to long series time forecasting problems, such as secondary spatio-temporal complexity, high memory usage, and inherent limitations of the encoder–decoder architecture.
To address these challenges, researchers have developed an efficient algorithmic model called Informer for long series time forecasting, which is based on the Transformer model [11]. The Informer model employs the ProbSparse self-attention mechanism, which achieves $O(L \log L)$ time complexity and enhances sequence dependency alignment, resulting in more reliable performance. Several new algorithmic models have been developed to improve the Transformer model and enhance the attention and encoder–decoder architecture of the Informer model. As the Transformer model still suffers from loose coupling, Shen et al. proposed the TCCT algorithmic model in 2021 [12]. They introduced the Cross Stage Partial Attention (CSPAttention) mechanism, which combines Cross Stage Partial Network (CSPNet) with the self-attention mechanism, resulting in a 30% reduction in computational cost and a 50% reduction in memory usage. While this improvement solves the issue of loose coupling, some limitations of the Transformer model remain, such as ignoring the potential correlation between sequences and the limited scalability of the encoder–decoder structure during optimization. To address these limitations, Su et al. proposed the Adaptive Graph Convolutional Network for Transformer-based Long Sequence Time-Series Forecasting (AGCNT) algorithmic model in 2021 [13]. The AGCNT model captures the correlations between sequences in multivariate long series time forecasting problems without causing memory bottlenecks.
The application of methods based on the Transformer algorithm model has led to significant improvements in long series time forecasting, but these models are still plagued by issues such as high computational cost and the inability to capture a global view of the time series. To overcome these challenges, several new algorithmic models have been proposed recently. For instance, Zhou et al. introduced the FEDformer algorithm model in 2022 [14], which leverages trend decomposition to capture a global view of the time series by incorporating seasonal trends. The Autoformer algorithm model proposed by Wu et al. [15] takes a different approach, with a deep decomposition architecture that extracts more predictable components from complex time series and enables targeted attention to reliable temporal dependencies that were previously undetected. Additionally, Liu et al. presented the Pyraformer algorithm model in 2022 [16], which employs a pyramidal self-attention mechanism to capture temporal features of different scales and reduces computation time and memory consumption while performing high-precision single-step and long time multi-step forecasting tasks. Finally, in 2022, Razvan-Gabriel Cirstea et al. introduced the Triformer algorithm model [17], which uses the patch attention algorithm to replace the original attention algorithm, proposes a triangular shrinkage module as a new pooling method, and employs a modeling approach for lightweight variables to enable the model to obtain features between different variables.
Various methods have been developed to predict mechanical vibration faults using models associated with the Informer algorithm [6]. Predicting vibration faults allows for the early detection of equipment failures, the identification of the failure location, and, in the case of serious failures, the determination of the failure type. The time series prediction of equipment vibration is a valuable research direction for predicting failures and the remaining service life of electromechanical equipment. Applying fault prediction based on vibration data to the manufacturing, operation, and maintenance of various products helps avoid equipment losses, saving costs and reducing damage. The time series prediction of motor-bearing vibration involves analyzing the historical data of the bearing components to assess the possibility of future failure. Yang et al. [6] applied the Informer algorithm model to motor-bearing vibration prediction and proposed a random search-based time series prediction method for optimizing the Informer algorithm model. By optimizing the Informer model and using random search to tune the model parameters, the authors achieved better algorithmic performance in the time series prediction of motor-bearing vibration.
During the process of developing new models, it is evident that the Informer algorithm model is built upon the Transformer algorithm model with innovative improvements. Nonetheless, it holds significant research value and innovative significance, making it a typical algorithmic model with architecture and core algorithmic principles worth exploring and studying in depth. The rapid emergence of subsequent models is influenced to some extent by the Informer model.
In view of this, this paper presents an overview of related algorithmic models such as Informer. The contributions of this paper are summarized as follows:
  • In this paper, the principle of the Informer algorithm model, its related structure, and its attention algorithm are restated in detail, and the advantages and shortcomings of the Informer algorithm model are analyzed.
  • In this paper, we discuss in detail the innovations and improvements in the model structure of several other advanced algorithmic models (including TCCT, Autoformer, FEDformer, Pyraformer, and Triformer).
  • We provide an overview of the attention algorithm structure and innovations of each model, together with a critical analysis of the studied models and attention mechanisms, and summarize the advantages and disadvantages of each model.
  • In this paper, we compare and analyze each algorithm model against the Informer algorithm model, showing the feasibility of the attention mechanism and related models such as the Informer algorithm model, and we offer predictions and outlooks on future research directions.

2. Background of Informer Algorithm Model and Architecture

This section introduces the origin and fundamental structure of the Informer algorithm model. First, the problem of long series time forecasting is defined; then, the Informer algorithm model is analyzed and explored, and a review of its framework is provided.

2.1. Basic Forecasting Problem Definition

LSTF utilizes the long-term dependencies between spatial and temporal domains, contextual information, and inherent patterns in data to improve predictive performance. Recent research has shown that the Informer algorithm model and its variants have the potential to further enhance this performance. In this section, we begin by introducing the input and output representations, as well as the structural representation, in the LSTF problem.
In the rolling prediction setting with a fixed-size window, the input at time $t$ is represented by $\mathcal{X}^t = \{x_1^t, \ldots, x_{L_x}^t \mid x_i^t \in \mathbb{R}^{d_x}\}$, while the output is the corresponding predicted sequence $\mathcal{Y}^t = \{y_1^t, \ldots, y_{L_y}^t \mid y_i^t \in \mathbb{R}^{d_y}\}$. The LSTF problem is capable of handling longer output lengths than previous works, and its feature dimensionality is not limited to the univariate case ($d_y \geq 1$).
The encoder–decoder operation is mainly carried out through a step-by-step process. Many popular models are designed to encode the input $\mathcal{X}^t$ into a hidden state $\mathcal{H}^t$ and decode the output representation $\mathcal{Y}^t$ from $\mathcal{H}^t = \{h_1^t, \ldots, h_{L_h}^t\}$. The inference involves a stepwise process called "dynamic decoding", in which the decoder computes a new hidden state $h_{k+1}^t$ from the previous state $h_k^t$ and other necessary outputs at step $k$, and then predicts the sequence $y_{k+1}^t$ at step $k+1$.
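As a concrete illustration of this rolling-window setting, the short Python sketch below builds the input/output pairs $\mathcal{X}^t$ and $\mathcal{Y}^t$ from a raw series; the function name and the window lengths are illustrative choices, not part of the original formulation.

```python
import numpy as np

def make_rolling_windows(series, input_len, output_len):
    """Build (input, target) pairs for the rolling forecasting setting.

    series: array of shape (T, d), a multivariate time series.
    Returns inputs of shape (N, input_len, d) and targets of shape
    (N, output_len, d), matching the X^t / Y^t notation above.
    """
    inputs, targets = [], []
    for t in range(len(series) - input_len - output_len + 1):
        inputs.append(series[t : t + input_len])
        targets.append(series[t + input_len : t + input_len + output_len])
    return np.stack(inputs), np.stack(targets)

# Example: a 7-dimensional series of 1000 steps, predicting 168 steps from 96.
data = np.random.randn(1000, 7)
X, Y = make_rolling_windows(data, input_len=96, output_len=168)
print(X.shape, Y.shape)  # (737, 96, 7) (737, 168, 7)
```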

2.2. Informer Architecture

This section presents an overview of the refined Informer algorithm model architecture designed by Zhou et al. [11]. The Informer algorithm model is an improvement on the Transformer, and its structure is similar in that it is a multi-layer structure built by stacking Informer blocks. Informer modules are characterized by a ProbSparse multi-head self-attention mechanism and an encoder–decoder structure.
The following figure shows the basic architecture of the Informer algorithm model. The schematic diagram in the left part of the figure shows the encoder receiving a large number of long sequence inputs with the proposed ProbSparse self-attention instead of the canonical self-attention. The green trapezoid is a self-attention distillation operation that extracts the dominant attention and greatly reduces the network size. The layer-stacked copies increase the robustness. The decoder in the right part of the figure receives the long sequence input, fills the target element with zero, measures the weighted attentional composition of the feature map, and immediately predicts the output element in a generative manner.
The model successfully improves the predictive power for the LSTF problem and validates the potential value of the Transformer family of models in capturing individual long-range dependencies between the output and the input of long time series. The ProbSparse self-attention mechanism is successfully proposed to effectively replace the canonical self-attention, achieving $O(L \log L)$ time complexity and $O(L \log L)$ memory footprint for dependency alignment. The research article presents the self-attention distilling operation to privilege the dominant attention scores in the stacked layers and significantly reduce the total space complexity to $O((2-\epsilon)L \log L)$, which helps the model receive long-range inputs. Meanwhile, the researchers proposed a generative decoder to obtain long sequence outputs with only one forward step while avoiding cumulative error expansion in the inference stage.
In the original Transformer model, the canonical self-attention is defined based on tuple inputs, namely queries, keys, and values, and it performs the scaled dot product $\mathcal{A}(Q,K,V) = \mathrm{Softmax}\!\left(QK^{\top}/\sqrt{d}\right)V$, where $Q \in \mathbb{R}^{L_Q \times d}$, $K \in \mathbb{R}^{L_K \times d}$, $V \in \mathbb{R}^{L_V \times d}$, and $d$ is the input dimension. To further discuss the self-attention mechanism, let $q_i, k_i, v_i$ denote the $i$-th row of $Q$, $K$, and $V$, respectively. Following the formulas in references [18,19], the attention of the $i$-th query is defined as a kernel smoother in a probability form:
$$\mathcal{A}(q_i, K, V) = \sum_{j} \frac{k(q_i, k_j)}{\sum_{l} k(q_i, k_l)}\, v_j = \mathbb{E}_{p(k_j \mid q_i)}[v_j]$$
where $p(k_j \mid q_i) = k(q_i, k_j) / \sum_{l} k(q_i, k_l)$ and $k(q_i, k_j)$ is chosen as the asymmetric exponential kernel $\exp(q_i k_j^{\top}/\sqrt{d})$. The output of self-attention is obtained by combining the values according to the computed probabilities $p(k_j \mid q_i)$. This requires quadratic dot-product computation and $O(L_Q L_K)$ memory, which is the main drawback of self-attention in the Transformer.
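For reference, the following minimal PyTorch sketch implements this canonical scaled dot-product attention for a single head; it is an illustration of the formula above rather than the Informer implementation, and the tensor sizes in the example are arbitrary.

```python
import torch
import torch.nn.functional as F

def canonical_attention(Q, K, V):
    """Canonical scaled dot-product attention, A(Q,K,V) = Softmax(QK^T/sqrt(d)) V.

    Q: (L_Q, d), K: (L_K, d), V: (L_K, d). Computing the full L_Q x L_K score
    matrix is what gives the quadratic time and memory cost discussed above.
    """
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (L_Q, L_K)
    p = F.softmax(scores, dim=-1)                 # p(k_j | q_i) for every i, j
    return p @ V                                  # kernel-smoother combination of V

# Example with L_Q = L_K = 512 and d = 64.
Q, K, V = (torch.randn(512, 64) for _ in range(3))
out = canonical_attention(Q, K, V)  # (512, 64)
```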
The self-attention mechanism is an improved attention model, which can be summarized simply as a self-learning representation process. However, the computational overhead of self-attention grows quadratically with sequence length, which leads to low computational efficiency, slow computation, and high training costs for the Transformer model; the excessive computational overhead also makes the model difficult to apply, so the Transformer's ability to process long sequence data is limited.
Therefore, studying variants of the self-attention mechanism to achieve an efficient Transformer has become an important research direction. The Informer algorithm model was preceded by many proposals to reduce memory usage and computation and to increase efficiency.
Sparse Transformer [20] combines row output and column input, where sparsity comes from separated spatial correlations. LogSparse Transformer [21] notes the circular pattern of self-attention and makes each cell focus on its previous cell in exponential steps. Longformer [22] extends the first two works to more complex sparse configurations.
However, they use the same strategy to deal with each multi-headed self-attention, a mode of thinking that makes it difficult to develop novel and efficient models. In order to reduce memory usage and improve computational efficiency, researchers first qualitatively evaluated the learned attention patterns of typical self-attention in the Informer algorithm model, i.e., a few dot product pairs contribute major attention, while others produce very little attention.
The following discussion focuses on the differences between the differentiated dot products. From the above equation, the attention of the $i$-th query to all keys is defined as the probability $p(k_j \mid q_i)$, and the output is its combination with the values $V$. The dominant dot-product pairs encourage the attention probability distribution of the corresponding query to be far from a uniform distribution. If $p(k_j \mid q_i)$ is close to a uniform distribution, that is, $p(k_j \mid q_i) = 1/L_K$, then the result of the self-attention calculation degenerates into a trivial sum of the values $V$, which is redundant for the input. Therefore, the "similarity" between the distributions $p$ and $q$ can be used to distinguish the "important" queries. This "similarity" can be measured by the Kullback–Leibler divergence:
$$KL(q\,\|\,p) = \ln \sum_{l=1}^{L_K} e^{q_i k_l^{\top}/\sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} q_i k_j^{\top}/\sqrt{d} - \ln L_K$$
Dropping the constant, the sparsity measure of the $i$-th query is defined as the following equation:
$$M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{q_i k_j^{\top}/\sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} q_i k_j^{\top}/\sqrt{d}$$
where the first term is the Log-Sum-Exp (LSE) of $q_i$ over all the keys and the second term is their arithmetic mean. If the $i$-th query obtains a larger $M(q_i, K)$, its attention probability $p$ is more diverse and is likely to contain the dominant dot-product pairs in the header field of the long-tailed self-attention distribution. ProbSparse self-attention is then achieved by allowing each key to attend only to the $u$ dominant queries:
$$\mathcal{A}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{\bar{Q} K^{\top}}{\sqrt{d}}\right) V$$
where $\bar{Q}$ is a sparse matrix of the same size as $Q$, which contains only the Top-$u$ queries under the sparsity measure $M(q, K)$. Controlled by a constant sampling factor $c$, setting $u = c \ln L_Q$ means that ProbSparse self-attention only needs to compute $O(\ln L_Q)$ dot products for each query–key lookup, and the memory usage per layer is kept at $O(L_K \ln L_Q)$.
In the multi-head perspective, this attention generates different sparse query–key pairs for each head, thus avoiding severe information loss. Additionally, to address the quadratic complexity of traversing all queries and the potential numerical instability of the LSE operation, the researchers proposed an empirical approximation for efficiently obtaining the query sparsity measure.
Lemma 1. 
For each $q_i \in \mathbb{R}^d$ and each $k_j \in \mathbb{R}^d$ in the set $K$, the bound is the following equation:
$$\ln L_K \leq M(q_i, K) \leq \max_{j}\left\{q_i k_j^{\top}/\sqrt{d}\right\} - \frac{1}{L_K} \sum_{j=1}^{L_K} \left\{q_i k_j^{\top}/\sqrt{d}\right\} + \ln L_K$$
This bound also holds when $q_i \in K$. Starting from Lemma 1, the max-mean measurement is proposed as the following equation [11]:
$$\bar{M}(q_i, K) = \max_{j}\left\{\frac{q_i k_j^{\top}}{\sqrt{d}}\right\} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}$$
Randomly selected $U = L_K \ln L_Q$ dot-product pairs are used to compute $\bar{M}(q_i, K)$, meaning that the other pairs are filled with zeros, and the sparse Top-$u$ queries are then selected to form $\bar{Q}$. The max operator in $\bar{M}(q_i, K)$ is less sensitive to zero values and is numerically stable. In practice, the input lengths of queries and keys are usually the same in the self-attention computation, that is, $L_Q = L_K = L$, which makes the total time complexity and space complexity of ProbSparse self-attention $O(L \ln L)$.
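The following simplified, single-head PyTorch sketch follows the procedure described above (sample keys, score queries with the max-mean measure $\bar{M}$, and run full attention only for the Top-$u$ queries). It omits the multi-head split, masking, and the exact handling of the non-selected queries in the published Informer implementation; the shared key sample, the constant $c$, and the mean-of-$V$ fallback are illustrative simplifications.

```python
import math
import torch
import torch.nn.functional as F

def probsparse_attention(Q, K, V, c=5):
    """Simplified single-head ProbSparse self-attention sketch.

    Q, K, V: (L, d). Only the Top-u queries under the max-mean sparsity
    measure attend over all keys; the remaining ("lazy") queries fall back
    to the mean of V here, as a simple placeholder context.
    """
    L, d = Q.shape
    n_sample = max(1, int(math.log(L)))        # ~ln(L) sampled keys per query (U = L ln L pairs in total)
    u = min(L, max(1, int(c * math.log(L))))   # number of active queries, u = c ln L

    # 1) Score every query on a random key sample (shared across queries for simplicity):
    #    M_bar(q_i, K) = max_j(q_i k_j^T / sqrt(d)) - mean_j(q_i k_j^T / sqrt(d)).
    idx = torch.randint(0, L, (n_sample,))
    sample_scores = Q @ K[idx].T / math.sqrt(d)               # (L, n_sample)
    m_bar = sample_scores.max(dim=-1).values - sample_scores.mean(dim=-1)

    # 2) Keep only the Top-u queries and run full attention for them.
    top = m_bar.topk(u).indices                               # (u,)
    scores = Q[top] @ K.T / math.sqrt(d)                      # (u, L)
    active = F.softmax(scores, dim=-1) @ V                    # (u, d)

    # 3) Lazy queries receive a trivial context; active ones receive real attention.
    out = V.mean(dim=0, keepdim=True).expand(L, d).clone()
    out[top] = active
    return out

out = probsparse_attention(torch.randn(512, 64), torch.randn(512, 64), torch.randn(512, 64))
```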

2.3. Encoder

As shown in Figure 1, the encoder receives a large number of long sequence inputs, and the model uses ProbSparse self-attention instead of the canonical self-attention in the Transformer. The green trapezoid is the self-attention distilling operation, which extracts the dominant attention and drastically reduces the amount of computation and memory occupied. Additionally, the layer-stacked copies increase robustness.
Figure 1. Architecture of the Informer algorithm.
The encoder is designed to extract robust long-range dependencies from long sequence inputs. After the input representation, the $t$-th sequential input $\mathcal{X}^t$ has been shaped into a matrix $\mathcal{X}_{en}^t \in \mathbb{R}^{L_x \times d_{model}}$, and the following diagram shows the structure of a single stack of the encoder:
In the above figure, the horizontal stack represents one of the copies of a single encoder, and the one presented in Figure 2 is the main stack that receives the entire input sequence. The second stack then receives half of the slices of the input, and the subsequent stacks repeat this halving. The red layers are dot-product matrices, which are reduced in cascade by performing self-attention distilling on each layer. The output of the encoder is obtained by concatenating the feature maps of all stacks.
Figure 2. Individual stacks in the Informer encoder.
As a natural consequence of the ProbSparse self-attention mechanism, the feature maps of the encoder have redundant combinations of the values $V$. The distilling operation is used to assign weights to the values with dominant features and to produce a new self-attention feature map at the next level. This operation substantially prunes the temporal dimension of the input, and the $n$-head weight matrix of the attention block can be seen in Figure 2. Inspired by dilated convolution [23], the "distilling" process advances from the $j$-th layer to the $(j+1)$-th layer as follows:
$$\mathcal{X}_{j+1}^{t} = \mathrm{MaxPool}\left(\mathrm{ELU}\left(\mathrm{Conv1d}\left([\mathcal{X}_{j}^{t}]_{AB}\right)\right)\right)$$
where $[\cdot]_{AB}$ represents the attention block, which contains the multi-head ProbSparse self-attention and the basic operations, and $\mathrm{Conv1d}(\cdot)$ performs a one-dimensional convolutional filter in the time dimension with the $\mathrm{ELU}(\cdot)$ activation function [24].
The researchers add a max-pooling layer with stride 2 and down-sample $\mathcal{X}^t$ to half of its length after stacking one layer, which reduces the overall memory usage to $O((2-\epsilon)L \log L)$, where $\epsilon$ is a small number. To enhance the robustness of the distilling operation, copies of the main stack are also constructed with halved inputs, and the number of self-attention distilling layers is reduced by one layer at a time, so that their output dimensions are aligned and the outputs of all stacks can be concatenated to obtain the final hidden representation of the encoder.
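A minimal PyTorch sketch of this distilling step between two attention blocks is shown below; it assumes kernel size 3 and stride 2 for the pooling, as described above, and the class name and hyperparameters are illustrative rather than taken from the published implementation.

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Sketch of the self-attention distilling step between encoder blocks:
    Conv1d over the time dimension + ELU + max-pooling with stride 2,
    which halves the sequence length passed to the next attention block."""

    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):          # x: (batch, L, d_model), the attention block output [X]_AB
        y = x.transpose(1, 2)      # (batch, d_model, L) for Conv1d/MaxPool1d
        y = self.pool(self.act(self.conv(y)))
        return y.transpose(1, 2)   # (batch, L/2, d_model)

x = torch.randn(8, 96, 512)
print(DistillingLayer(512)(x).shape)   # torch.Size([8, 48, 512])
```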

2.4. Decoder

A standard decoder structure is used in the Informer algorithm model, which consists of two identical multi-headed attention layers. However, the generative inference is used to mitigate the speed plunge in long series time forecasting. The decoder takes the long sequence input, fills the target elements with zeros, measures the weighted attention composition of the feature map, and immediately predicts the output elements in a generative manner.
As illustrated in Figure 1, regarding the implementation principle of the decoder, the following vectors are first provided to the decoder:
$$\mathcal{X}_{de}^{t} = \mathrm{Concat}\left(\mathcal{X}_{token}^{t}, \mathcal{X}_{0}^{t}\right) \in \mathbb{R}^{(L_{token}+L_{y}) \times d_{model}}$$
where $\mathcal{X}_{token}^{t} \in \mathbb{R}^{L_{token} \times d_{model}}$ is the start token and $\mathcal{X}_{0}^{t} \in \mathbb{R}^{L_{y} \times d_{model}}$ is a placeholder for the target sequence (with its scalar values set to 0). By setting the masked dot products to negative infinity, a multi-head masked attention mechanism is applied in the ProbSparse self-attention calculation. It prevents each position from attending to upcoming positions, thus avoiding autoregression. A fully connected layer obtains the final output, whose size $d_y$ depends on whether univariate or multivariate prediction is performed.
Regarding the problem of long output lengths, the original Transformer has no way to solve it: like RNN-style models, it decodes dynamically in a cascade, step by step, which cannot handle long sequences. The start token is a useful technique in the dynamic decoding of NLP [25], especially in the pre-training stage, and the Informer algorithm model extends this concept to the long series time forecasting problem by proposing a generative decoder for the long sequence output problem, that is, generating all the predicted data at once. The Q, K, and V values in the first masked attention layer are obtained by multiplying the embedded decoder input by the weight matrices, while the Q values in the second attention layer are obtained by multiplying the output of the previous attention layer by the weight matrix, and the K and V values are obtained by multiplying the output of the encoder by the weight matrices.
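The sketch below illustrates how such a generative decoder input can be assembled, concatenating a start token taken from the end of the encoder input with a zero-filled placeholder for the prediction horizon; the function name and the window lengths are illustrative assumptions.

```python
import torch

def build_decoder_input(x_enc, label_len, pred_len):
    """Sketch of the generative decoder input X_de = Concat(X_token, X_0).

    x_enc: (batch, L_x, d) encoder input. The last `label_len` steps serve as
    the start token X_token, and `pred_len` zero-filled steps act as the
    placeholder X_0 for the target sequence, so the decoder can emit all
    predictions in a single forward pass.
    """
    x_token = x_enc[:, -label_len:, :]                             # (batch, L_token, d)
    x_zero = torch.zeros(x_enc.size(0), pred_len, x_enc.size(-1))  # (batch, L_y, d)
    return torch.cat([x_token, x_zero], dim=1)                     # (batch, L_token + L_y, d)

x_enc = torch.randn(8, 96, 7)
x_dec = build_decoder_input(x_enc, label_len=48, pred_len=24)      # (8, 72, 7)
```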

2.5. Informer Algorithm Model Values

The ProbSparse self-attention mechanism in the Informer algorithm model achieves $O(L \log L)$ time complexity and memory usage, a large improvement over the Transformer, while the self-attention distilling operation highlights the dominant attention by halving the cascading layer input and efficiently handles excessively long input sequences. The proposed generative decoder performs one forward operation on the long sequence instead of predicting step by step, which greatly improves the inference speed of long sequence forecasting.
Nevertheless, the Informer model still has shortcomings. First, the improved attention mechanism reduces complexity and memory, but there is still room to reduce them further. Second, although efficiency is improved, there are still problems with backward and forward time dependence, which leads to some error between the predicted and actual results, so accuracy still needs to be improved. Finally, the structure of the generative decoder is simple, with few innovations, and similar to that of the Transformer, so there is still room for structural adjustment.

3. Relevant Model Development

3.1. TCCT Algorithmic Model

Time series forecasting is crucial for a wide range of practical applications. Models such as Informer are superior in handling such problems, especially long series time input (LSTI) and long series time forecasting (LSTF) problems. To improve the efficiency and enhance the locality of this class of models, some researchers have combined them with CNNs [26,27,28] to varying degrees. However, these combinations are loosely coupled and do not take full advantage of CNNs, whereas the three TCCT architectures provide a good solution to this problem. This section focuses on the three TCCT algorithm model architectures. They not only enhance the local performance of the Transformer but also improve its learning capability while reducing computational cost and memory usage. They can also be applied to other Transformer-like time series forecasting models.
This section also elaborates the principle of the passthrough mechanism. Its main purpose is to obtain finer-grained information by connecting the feature maps of self-attention blocks at different scales, similar to the feature pyramids commonly used with CNNs in image processing. As it expands the feature map, it improves the prediction performance of the Transformer model.

3.1.1. Dilated Causal Convolutions

Stacking multiple self-attention blocks facilitates the extraction of deeper feature maps but introduces more time and space complexity. To further reduce memory usage, the Informer algorithm model uses a self-attention distilling operation: a convolutional layer and a max-pooling layer trim the input length between every two self-attention blocks. A convolutional layer with kernel size 3 and stride 1 follows the previous self-attention block to make the features more aware of the local context. A max-pooling layer with kernel size 3 and stride 2 is then used to privilege locally dominant features and provide a smaller but more focused feature map for the next self-attention block. However, canonical convolutional layers have two main drawbacks when applied to time series prediction.
First, the receptive field of canonical convolution grows only linearly with network depth, which is not powerful enough for very long sequences even though the Informer algorithm model mainly targets the long series time forecasting problem. As the computational cost grows, the advantages of stacking self-attention blocks with canonical convolutional layers are gradually overshadowed by the disadvantages, which may lead to repetitive and meaningless computation due to the limited receptive field. Second, the canonical convolution layer in the Informer algorithm model does not consider the temporal perspective, which inevitably leads to future information leakage in time series prediction and reduces the temporal receptive field.
To address the above problems, the solution is inspired by TCN and replaces the canonical convolution with a dilated causal convolution to obtain exponential growth of the receptive field. More precisely, for the $i$-th convolutional layer after the $i$-th self-attention block, the dilated causal convolution operation $C$ with kernel size $k$ on element $x_n$ of the sequence $X \in \mathbb{R}^{L \times d}$ is defined as:
$$C(x_n) = \left[\, x_n \;\; x_{n-i} \;\; \cdots \;\; x_{n-(k-1)\times i} \,\right] W, \quad W \in \mathbb{R}^{d \times d}$$
where $d$ is the output dimension and the number $i$ is also used as the dilation factor. The filter of the $i$-th dilated causal convolution layer skips $(2^{i-1}-1)$ elements between two adjacent filter taps.
Due to the nature of causality, each element x at time t of the sequence is only convolved with the element at or before t, ensuring that there is no information leakage in the future. When i = 1, the dilated causal convolution degenerates to a normal causal convolution.
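A minimal PyTorch sketch of such a dilated causal convolution is given below; padding is applied only on the left (past) side of the time axis so that no future information leaks, and the kernel size and dilation in the example are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """Sketch of a dilated causal convolution: pad only on the left (earlier
    time steps) so x_n is convolved only with elements at or before n, and
    skip (dilation - 1) positions between adjacent filter taps."""

    def __init__(self, d_model, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, dilation=dilation)

    def forward(self, x):                     # x: (batch, L, d_model)
        y = x.transpose(1, 2)                 # (batch, d_model, L)
        y = F.pad(y, (self.left_pad, 0))      # causal: pad the front of the time axis only
        return self.conv(y).transpose(1, 2)   # (batch, L, d_model), no future leakage

x = torch.randn(8, 96, 64)
print(DilatedCausalConv1d(64, kernel_size=3, dilation=2)(x).shape)  # torch.Size([8, 96, 64])
```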
An illustration of a self-attention network stacked with three self-attention blocks and using dilated causal convolutional layers with kernel size 3 is provided in Figure 3. The researchers stacked such self-attention networks with three self-attention blocks connected by the dilated causal convolution layer and the max-pooling layer. It can be seen that applying dilated causal convolution layers brings a wider receptive field and avoids future information leakage.
Figure 3. Visualization of a self-attention network.
Dilated causal convolution uses padding only at the front of the time axis to prevent the leakage of future information. Even with only two convolutional layers, the output receptive field of the network in this figure is significantly larger than that of the canonical one above; therefore, as more self-attention blocks are stacked, the gap grows larger and the network performs better. In addition, applying dilated causal convolution incurs only a negligible amount of extra computational cost and memory usage.

3.1.2. TCCT Architecture and Passthrough Mechanism

Feature pyramids are commonly used to extract features in computer vision and CNNs, and a similar concept can be applied to Transformer-based networks. Following studies on the YOLO series of object detection CNNs [29,30], a passthrough mechanism is proposed to obtain feature maps from the earlier layers of the network and merge them with the final feature maps to obtain finer-grained information.
In Transformer-based networks, a passthrough mechanism is used to merge feature maps of different scales. Assuming that the encoder stacks $n$ self-attention blocks, each self-attention block produces a feature map. The $k$-th ($k = 1, 2, \ldots, n$) feature map has length $L/2^{k-1}$ and dimensionality $d$. Assume that CSPAttention and dilated causal convolution have been applied to this encoder. In order to connect all the feature maps of different scales, the $k$-th feature map is equally partitioned by length into $2^{n-k}$ feature maps of length $L/2^{n-1}$. In this way, all the feature maps can be concatenated along the feature dimension.
However, the concatenated feature map then has dimension $(2^{n}-1) \times d$. Therefore, a transition layer needs to be employed to ensure that the entire network outputs a feature map of the appropriate dimension. The above problem can be solved by stacking three self-attention blocks and using the TCCT algorithm model architecture, as shown in Figure 4.
Figure 4. Network of a stack of three CSPAttention blocks. Dilated causal convolution and a pass-through mechanism are used. Final output has the same dimensions as the input.
The passthrough mechanism works similarly to the full distillation operation in the Informer algorithm model. However, the Informer algorithm model with full distillation requires as many encoders as the number of self-attention blocks of the main encoder, whereas the Informer algorithm model with the passthrough mechanism requires only one encoder.
This is despite the fact that the Informer algorithm model with full distillation has encoders with decreasing input lengths, which draws more of the model's attention to the later time steps. For example, assuming that Informer stacks $k$ encoders, the first half of the input sequence exists only in the main encoder, whereas the last $1/2^{k-1}$ of the input sequence exists in every individual encoder. The passthrough mechanism has no such deficiency. More importantly, the passthrough mechanism imposes almost no additional computational cost on the Informer algorithm model, whereas the Informer algorithm model with the full distillation operation incurs considerable additional computational cost due to its multi-encoder architecture.
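The sketch below illustrates one way such a passthrough merge could be written in PyTorch for three feature maps of lengths L, L/2, and L/4: each map is split into slices of the shortest length, the slices are concatenated along the feature dimension, and a 1x1 transition convolution restores the model width. The function and the transition layer are illustrative assumptions, not the TCCT implementation.

```python
import torch
import torch.nn as nn

def passthrough(feature_maps, d_model):
    """Sketch of the passthrough mechanism for n feature maps of lengths
    L, L/2, ..., L/2^(n-1) and width d_model. Each map is split along the
    time axis into slices of the shortest length and the slices are
    concatenated along the feature dimension, giving (2^n - 1) * d_model
    channels; a 1x1 transition convolution restores the width to d_model."""
    n = len(feature_maps)
    shortest = feature_maps[-1].size(1)                    # L / 2^(n-1)
    slices = []
    for fmap in feature_maps:                              # fmap: (batch, L_k, d_model)
        slices.extend(torch.split(fmap, shortest, dim=1))  # 2^(n-k) slices each
    merged = torch.cat(slices, dim=-1)                     # (batch, shortest, (2^n - 1) * d_model)
    # In a real model this transition layer would be created once, e.g., in __init__.
    transition = nn.Conv1d((2 ** n - 1) * d_model, d_model, kernel_size=1)
    return transition(merged.transpose(1, 2)).transpose(1, 2)   # (batch, shortest, d_model)

maps = [torch.randn(8, 96, 64), torch.randn(8, 48, 64), torch.randn(8, 24, 64)]
print(passthrough(maps, 64).shape)   # torch.Size([8, 24, 64])
```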

3.1.3. Transformer with TCCT Architectures

CSPAttention, the passthrough mechanism, and dilated causal convolution [12] all work seamlessly with the canonical Transformer, LogTrans, Informer, and other Transformer-like time series prediction models. A simple example of their combination with the Informer algorithm model is shown in Figure 5, and a detailed encoder example can be seen in the final figure of this section. Note that the Informer algorithm model in the figure below has only one encoder, which means that it does not use the full distillation operation but replaces it with the passthrough mechanism.
Figure 5. Architecture of Informer combined with the TCCT algorithmic model architecture.
In the above figure, the blue trapezoid contains an encoder stacked with three ProbSparse CSPAttention blocks, which replace the previous ProbSparse self-attention blocks; a dilated causal convolutional layer is employed instead of the canonical convolutional layer, and together with a max-pooling layer in the green trapezoid it connects every two self-attention blocks. Without adding additional encoders, the three feature maps output by the three self-attention blocks are fused and then transitioned to a final output of the appropriate size. The right panel makes no significant changes compared to Informer's decoder, except that the masked ProbSparse self-attention blocks are replaced by masked ProbSparse CSPAttention blocks.
Based on the Informer architecture, other similar architectures can easily work with the TCCT algorithm model architecture. For example, to combine the TCCT architecture with the LogTrans algorithm model, the masked ProbSparse self-attention block in the above figure would be replaced by the masked LogSparse self-attention block, while the other components remain unchanged.
In Figure 6, each CSPAttention block is combined with ProbSparse self-attention, which is a typical Informer configuration. Between every two CSPAttention blocks, a connection is made using a dilated causal convolutional layer and a max-pooling layer. The output feature map of the previous self-attention block is propagated through these two layers and reduced to half its length, mirroring the original Informer while expanding the receptive field. The three feature maps output by the three self-attention blocks are also fused by the passthrough mechanism to obtain more fine-grained information. Finally, a transition layer is added to export feature maps of the appropriate dimension to the decoder.
Figure 6. A single Informer encoder stacks three self-attention blocks that work in concert with all TCCT model architectures.

3.1.4. TCCT Algorithm Model Value

The concept of the tightly coupled convolutional Transformer and the three TCCT algorithm model architectures proposed by Shen et al. [12] improve the predictive power of models such as Informer for time series prediction. In particular, CSPAttention is designed to reduce the computational cost and memory usage of the self-attention mechanism without degrading the prediction accuracy. In addition, the application of dilated causal convolution enables models such as Informer to obtain exponential receptive field growth, and the passthrough mechanism helps the models obtain more fine-grained information. Extensive experiments on real data sets show that all three TCCT algorithm model architectures can improve model performance in time series forecasting in different ways.

3.2. Autoformer

This section introduces the structure and principles of the Autoformer algorithm model and disassembles and analyzes the deep decomposition architecture of Autoformer as well as the encoder–decoder architecture.
Research based on Transformer prediction models captures inter-moment dependencies through the self-attention mechanism and has thus made some progress on temporal sequence prediction. However, in long series time forecasting, the complex temporal patterns in long sequences make it difficult for the attention mechanism to discover reliable temporal dependencies, and Transformer-based models have to use sparse forms of the attention mechanism to cope with the quadratic complexity, which creates a bottleneck in information utilization. To break through these problems, researchers proposed an algorithmic model called Autoformer based on the basic Transformer architecture, which moves sequence decomposition, traditionally a preprocessing step, into the model itself and proposes a deep decomposition architecture, enabling the model to decompose more predictable components from complex temporal patterns.

3.2.1. Deep Decomposition Architecture

The encoder eliminates the long-term trend cycle component through the series decomposition block and focuses on seasonal pattern modeling. The decoder progressively accumulates the trend component extracted from the hidden variables. The past seasonal information of the encoder is utilized through encoder–decoder auto-correlation, and the architecture of the AutoFormer algorithm model is shown in Figure 7.
Figure 7. AutoFormer algorithm model architecture.
Time series decomposition splits a time series into several components, each representing a class of underlying temporal patterns, such as the seasonal and trend-cyclical terms. Because the future is unknown in forecasting problems, the usual practice is to decompose the past series first and then forecast each component separately. However, this limits the prediction results to the quality of the decomposition and ignores the interactions between the future components. The deep decomposition architecture instead embeds sequence decomposition as an internal unit of the Autoformer algorithm model in the encoder–decoder. During prediction, the model alternates between optimizing the prediction results and decomposing the sequence, meaning that the trend term and the periodic term are gradually separated from the hidden variables to achieve progressive decomposition.

3.2.2. Encoder

The model input uses the second half of the encoder input, $\mathcal{X}_{en}^{I/2:I}$, to provide the most recent information. In Figure 7, the series decomposition block is based on the idea of a moving average, smoothing the periodic term and highlighting the trend term:
$$\mathcal{X}_t = \mathrm{AvgPool}(\mathrm{Padding}(\mathcal{X})), \qquad \mathcal{X}_s = \mathcal{X} - \mathcal{X}_t$$
where $\mathcal{X}$ is the hidden variable to be decomposed, and $\mathcal{X}_t$ and $\mathcal{X}_s$ are the trend term and the periodic term, respectively. The above operation is abbreviated as:
$$\mathcal{X}_s, \mathcal{X}_t = \mathrm{SeriesDecomp}(\mathcal{X})$$
This series decomposition unit is embedded between the Autoformer layers. The input to the encoder is the past $I$ time steps, $\mathcal{X}_{en} \in \mathbb{R}^{I \times d}$. As shown by the decomposition structure in Figure 7, the decoder input contains both a seasonal part $\mathcal{X}_{des} \in \mathbb{R}^{(I/2+O) \times d}$ and a trend-cyclical part $\mathcal{X}_{det} \in \mathbb{R}^{(I/2+O) \times d}$. Each initialization consists of two parts: a component decomposed from the second half of the encoder input $\mathcal{X}_{en}$, of length $I/2$, to provide the most recent information, and a placeholder of length $O$ filled with a scalar. The format is as follows:
$$\mathcal{X}_{ens}, \mathcal{X}_{ent} = \mathrm{SeriesDecomp}\left(\mathcal{X}_{en}^{I/2:I}\right), \qquad \mathcal{X}_{des} = \mathrm{Concat}(\mathcal{X}_{ens}, \mathcal{X}_{0}), \qquad \mathcal{X}_{det} = \mathrm{Concat}(\mathcal{X}_{ent}, \mathcal{X}_{Mean}),$$
where $\mathcal{X}_{ens}, \mathcal{X}_{ent} \in \mathbb{R}^{I/2 \times d}$ denote the seasonal and trend-cyclical components of $\mathcal{X}_{en}$, respectively, and $\mathcal{X}_{0}, \mathcal{X}_{Mean} \in \mathbb{R}^{O \times d}$ denote the placeholders filled with zero and with the mean of $\mathcal{X}_{en}$, respectively.
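A minimal PyTorch sketch of this series decomposition block is shown below; the moving-average kernel size and the boundary padding that keeps the output length equal to the input length are illustrative choices.

```python
import torch
import torch.nn as nn

class SeriesDecomp(nn.Module):
    """Sketch of the series decomposition block: a moving average (average
    pooling over a padded input) extracts the trend term X_t, and the
    seasonal term is the residual X_s = X - X_t."""

    def __init__(self, kernel_size=25):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size, stride=1)

    def forward(self, x):                                   # x: (batch, L, d)
        # Pad by repeating the boundary values so the moving average keeps length L.
        front = x[:, :1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        end = x[:, -1:, :].repeat(1, self.kernel_size // 2, 1)
        padded = torch.cat([front, x, end], dim=1)
        trend = self.avg(padded.transpose(1, 2)).transpose(1, 2)   # X_t
        seasonal = x - trend                                       # X_s
        return seasonal, trend

x = torch.randn(8, 96, 7)
x_s, x_t = SeriesDecomp(kernel_size=25)(x)   # both (8, 96, 7)
```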
As shown in Figure 7, the encoder focuses on seasonal component modeling. The output of the encoder contains past seasonal information and uses it as cross information to help the decoder refine the prediction results. Assume that there are $N$ encoder layers. The overall equation for the $l$-th encoder layer is summarized as $\mathcal{X}_{en}^{l} = \mathrm{Encoder}(\mathcal{X}_{en}^{l-1})$. Details are as follows: