Abstract
This paper introduces the Directional Time Attention Transformer (DTAformer) for long-term time series forecasting, addressing the inherent limitation of traditional Transformer-based models in capturing sequential order. By establishing a causal graph, we identify the confounding relationships that lead time series models to capture spurious temporal direction information. Directional Time Attention, the key component of the model, leverages the front-door adjustment to remove the confounder from the causal relationship, ensuring accurate modeling of temporal direction in time series. Additionally, we analyze the impact of different patching methods and loss functions on prediction performance. The model's performance is evaluated on nine benchmark datasets, and the results demonstrate its superiority over state-of-the-art methods.
References
Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W.: Informer: beyond efficient transformer for long sequence time-series forecasting. In: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, vol. 35, no. 12, pp. 11106–11115. AAAI Press (2021)
Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., Jin, R.: FEDformer: frequency enhanced decomposed transformer for long-term series forecasting. In: Proceedings of 39th International Conference on Machine Learning (ICML 2022), Baltimore, Maryland (2022)
Nie, Y., Nguyen, N. H., Sinthong, P., Kalagnanam, J.: A time series is worth 64 words: long-term forecasting with transformers. In: International Conference on Learning Representations (2023)
Ariyo, A. A., Adewumi, A. O., Ayo, C. K.: Stock price prediction using the ARIMA model. In: 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, pp. 106–112. IEEE (2014)
Lai, G., Chang, W.-C., Yang, Y., Liu, H.: Modeling long-and short-term temporal patterns with deep neural networks. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104 (2018)
Pearl, J.: Causal inference in statistics: an overview. Stat. Surv. 3, 96–146 (2009)
Pearl, J., Mackenzie, D.: The Book of Why: The New Science of Cause and Effect. Basic Books (2018)
Pearl, J.: Causal diagrams for empirical research. Biometrika 82(4), 669–688. Oxford University Press (1995)
Box, G.E.P., Jenkins, G.M.: Some recent advances in forecasting and control. J. Roy. Stat. Soc. Ser. C (Appl. Stat.) 17(2), 91–109. JSTOR (1968)
Hatemi-J, A.: Multivariate tests for autocorrelation in the stable and unstable VAR models. Econ. Modell. 21(4), 661–683. Elsevier (2004)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780. MIT Press (1997)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Liu, S., Yu, H., Liao, C., Li, J., Lin, W., Liu, A. X., Dustdar, S.: Pyraformer: low-complexity pyramidal attention for long-range time series modeling and forecasting. In: International Conference on Learning Representations (2022)
Wu, H., Xu, J., Wang, J., Long, M.: Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. (2021)
Zeng, A., Chen, M., Zhang, L., Xu, Q.: Are transformers effective for time series forecasting? In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, pp. 11121–11128 (2023)
Chu, Z.: Causal Triple Attention Time Series Forecasting (2021)
Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge (2000)
Rubin, D.B.: Causal inference using potential outcomes: Design, modeling, decisions. J. Am. Stat. Assoc. 100(469), 322–331. Taylor & Francis (2005)
Pearl, J., Glymour, M., Jewell, N.P.: Causal Inference in Statistics: A primer. John Wiley & Sons (2016)
Yang, X., Zhang, H., Qi, G., Cai, J.: Causal attention for vision-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9847–9857 (2021)
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the International Conference on Machine Learning, pp. 2048–2057 (2015)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958. JMLR. org (2014)
Yue, Z., Zhang, H., Sun, Q., Hua, X.-S.: Interventional few-shot learning. Adv. Neural. Inf. Process. Syst. 33, 2734–2746 (2020)
Hu, X., Tang, K., Miao, C., Hua, X.-S., Zhang, H.: Distilling causal effect of data in class-incremental learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3957–3966 (2021)
Kim, T., Kim, J., Tae, Y., Park, C., Choi, J.-H., Choo, J.: Reversible instance normalization for accurate time-series forecasting against distribution shift. In: International Conference on Learning Representations (2021)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). CoRR, abs/1810.04805
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020). CoRR, abs/2010.11929
Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., Long, M.: iTransformer: Inverted Transformers Are Effective for Time Series Forecasting (2023). arXiv preprint arXiv:2310.06625
Chai, T., Draxler, R.R.: Root mean square error (RMSE) or mean absolute error (MAE)?—arguments against avoiding RMSE in the literature. Geosci. Model Dev. 7(3), 1247–1250. Copernicus Publications Göttingen, Germany (2014)
Hyndman, R.J., Koehler, A.B.: Another look at measures of forecast accuracy. Int. J. Forecast. 22(4), 679–688. Elsevier (2006)
Appendix
In this Appendix, we provide related work, causality analysis and other experimental details.
1.1 Related Work
Loss Function: MSE The Mean Squared Error (MSE) has been a cornerstone in the domain of time series prediction and is commonly used as a loss function in various forecasting models. The MSE is defined as \(\text {MSE} = \frac{1}{n}\sum _{i=1}^{n}(Y_i - \hat{Y}_i)^2\), where \(Y_i\) represents the true value, \(\hat{Y}_i\) denotes the predicted value, and \(n\) is the number of observations.
The popularity of MSE in time series analysis can be attributed to its ability to emphasize larger errors due to its quadratic nature. This characteristic makes it particularly suitable for applications where large errors are more undesirable than smaller ones. Studies such as those by Hyndman and Koehler [30] have highlighted the effectiveness of MSE in capturing the variance of forecasting errors and providing a clear measure of prediction accuracy.
However, MSE is sensitive to outliers, as noted by Chai and Draxler [29]. In datasets with significant anomalies or noise, MSE might result in biased estimations, emphasizing the need for robust preprocessing steps.
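As a concrete illustration, the sketch below computes the MSE loss over a batch of multivariate forecasts. It assumes a PyTorch implementation and illustrative tensor shapes; it is a minimal example rather than the loss code used in DTAformer.

```python
import torch

def mse_loss(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    """Mean squared error averaged over all observations.

    Both tensors are assumed to have shape (batch, horizon, num_variates).
    """
    return torch.mean((y_true - y_pred) ** 2)

# Example: random forecasts for a batch of 32 series, a 96-step horizon, and 7 variates.
y_true = torch.randn(32, 96, 7)
y_pred = torch.randn(32, 96, 7)
loss = mse_loss(y_true, y_pred)  # scalar; quadratic term emphasizes large errors
```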
1.2 Causality Analysis
Under the intervention, \( Y \) is no longer affected by the confounder \( U \), but only directly by \( X \) and the mediator \( Z \). In the causal graph, it is not only the legitimate causal pathway extending from the input variable \( X \) through the mediator \( Z \) to the outcome \( Y \) that warrants consideration. Concomitantly, the “backdoor” path, delineated as \( X \leftarrow U \rightarrow Y \), exerts an influence on \( Y \) through the confounder \( U \). This introduces a spurious correlation between \( X \) and \( Y \), thus confounding the relationship. Consequently, if one relies solely on the correlation \( P(Y \mid X) \) for model training, without addressing the confounding effects, the true causal effect from \( X \) to \( Y \) remains obscured, irrespective of the quantity and quality of the training data [17, 18]. To disentangle the confounded relationship between \( X \) and \( Y \), it is imperative to block the path \( X \leftarrow U \rightarrow Y \), thereby isolating the causal effect of \( X \) on \( Y \). However, in the context of time series analysis, the exact nature of the confounder is difficult to quantify. As an alternative, the front-door adjustment is employed, which does not require specific information about the confounder. Additionally, this approach offers a more intelligible means of understanding the mediator.
Therefore, instead of employing the likelihood \( P(Y \mid X) \), we utilize the causal intervention \( P(Y \mid \text {do}(X)) \), as proposed by Pearl, for time series forecasting [8]. This approach aims to elucidate the genuine causal relationship between \( X \) and \( Y \). The front-door adjustment is applied to compute \( P(Y \mid \text {do}(X)) \) through the front-door path \( X \rightarrow Z \rightarrow Y \), which is composed of two partial causal effects: \( P(Z \mid \text {do}(X)) \) and \( P(Y \mid \text {do}(Z)) \). Therefore, it follows that:
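In Pearl's standard notation, this decomposition over the mediator \( Z \) reads
\[
P(Y \mid \text{do}(X)) = \sum_{z} P(Z = z \mid \text{do}(X)) \, P(Y \mid \text{do}(Z = z)).
\]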
Similarly, to determine \( P(Z = z \mid \text {do}(X)) \), it is necessary to obstruct the backdoor path \( X \leftarrow U \rightarrow Y \leftarrow Z \) between \( X \) and \( Z \). Notably, this backdoor path includes a collider (\( U \rightarrow Y \leftarrow Z \)). According to Pearl [8], the presence of a collider within a path implies that it obstructs the association between the influencing variables. Thus, this path is inherently blocked, leading to the conclusion that:
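Because the collider blocks the only backdoor path between \( X \) and \( Z \), intervening on \( X \) is equivalent to conditioning on it:
\[
P(Z = z \mid \text{do}(X)) = P(Z = z \mid X).
\]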
For \( P(Y \mid \text {do}(Z)) \), it is necessary to block the backdoor path \( Z \leftarrow X \leftarrow U \rightarrow Y \) between \( Z \) and \( Y \). Given the unknown specifics regarding the confounder \( U \), this path must be blocked by adjusting for \( X \). Hence,
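Adjusting for (i.e., marginalizing over) \( X \) gives the standard backdoor expression
\[
P(Y \mid \text{do}(Z = z)) = \sum_{x} P(X = x) \, P(Y \mid Z = z, X = x).
\]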
Ultimately, this leads to the following formulation:
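Substituting the two partial causal effects into the decomposition yields the front-door adjustment formula
\[
P(Y \mid \text{do}(X)) = \sum_{z} P(Z = z \mid X) \sum_{x'} P(X = x') \, P(Y \mid Z = z, X = x').
\]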
1.3 Experimental Details
Data Descriptions The datasets employed in this study are summarized in Table 4, which delineates their inherent characteristics. These datasets have been widely utilized in the domain of time series analysis, providing a robust benchmark for evaluating the performance of various models.
The datasets employed encompass a diverse array of domains, each contributing unique characteristics and challenges to time series analysis:
- ETT (Electricity Transformer Temperature): Consisting of both hourly-level (ETTh) and quarter-hourly-level (ETTm) datasets, it captures key parameters such as oil and load features of electricity transformers from July 2016 to July 2018.
- Traffic: This dataset includes hourly road occupancy rates from sensors on San Francisco freeways, recorded from 2015 to 2016.
- Electricity: Details the hourly electricity consumption patterns of 321 clients from 2012 to 2014.
- Exchange-Rate: Offers daily exchange rates for eight countries, spanning from 1990 to 2016.
- Weather: Comprises 21 diverse indicators, including air temperature and humidity, recorded every 10 minutes in Germany during 2020.
- ILI: Reflects weekly data on the proportion of patients with flu-like symptoms, reported by the U.S. Centers for Disease Control and Prevention from 2002 to 2021.
The comprehensive nature of these datasets, covering different intervals, features, and domains, provides a rigorous testing ground for time series analysis methodologies.
Univariate Long-term Forecasting Results Table 5 shows the univariate long-term time series forecasting results. Compared with the other baseline methods, our DTAformer achieves the best results in most cases.
Attention Score Heatmap In this study, we employed both Self-attention and Directional Time Attention within the DTAformer framework to generate corresponding Attention Score Heatmaps, as depicted in Fig. 4. Compared to the Self-attention, the Directional Time Attention significantly reduces feature noise, concurrently enhancing the model’s ability to capture pronounced sequence direction features.
Analysis reveals that owing to the inherent mechanism of Directional Time Attention, all query \(Q\) elements are constrained to capture relationships only with key \(K\) elements at identical or preceding positions. Consequently, the attention scores de-emphasize the influence of potential future directional events within the input sequence. This contrasts with the Self-attention, which encompasses a broader range of interfering factors. As a result, Directional Time Attention more effectively discerns the authentic sequential and temporal relationships inherent in the input sequence.
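For intuition, the masking behaviour described above can be sketched as follows. This is a minimal illustration of a lower-triangular (causal) attention mask in PyTorch, not the authors' DTAformer implementation; the function name, tensor names, and dimensions are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def directional_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention where each query position may only attend
    to key positions at the same or earlier time steps.

    q, k, v: (batch, seq_len, d_model) -- shapes are illustrative assumptions.
    """
    d_model = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_model ** 0.5   # (batch, seq_len, seq_len)
    seq_len = scores.size(-1)
    # Lower-triangular mask: query position i may only see key positions j <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device))
    scores = scores.masked_fill(~mask, float("-inf"))   # suppress "future" directions
    weights = F.softmax(scores, dim=-1)                 # rows sum to 1 over visible keys
    return weights @ v

# Example usage with random inputs.
q = k = v = torch.randn(2, 16, 64)
out = directional_attention(q, k, v)  # (2, 16, 64)
```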
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Chang, J., Yue, L., Liu, Q.: DTAformer: Directional Time Attention Transformer for Long-Term Series Forecasting. In: Lin, Z., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol. 15034. Springer, Singapore (2025). https://doi.org/10.1007/978-981-97-8505-6_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8504-9
Online ISBN: 978-981-97-8505-6