Abstract
This paper introduces the Directional Time Attention Transformer (DTAformer) for long-term time series forecasting, addressing the inherent limitation of traditional Transformer-based models in capturing sequential order. By establishing a causal graph, we identify the confounding relationships that lead time series models to capture spurious temporal direction information. Directional Time Attention, the key component of the model, leverages the front-door adjustment to remove the confounder from the causal relationship, ensuring accurate modeling of temporal direction in time series. Additionally, we analyze the impact of different patching methods and loss functions on prediction performance. The model's performance is evaluated on nine benchmark datasets, and the results demonstrate its superiority over state-of-the-art methods.
References
Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W.: Informer: beyond efficient transformer for long sequence time-series forecasting. In: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, vol. 35, no. 12, pp. 11106–11115. AAAI Press (2021)
Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., Jin, R.: FEDformer: frequency enhanced decomposed transformer for long-term series forecasting. In: Proceedings of 39th International Conference on Machine Learning (ICML 2022), Baltimore, Maryland (2022)
Nie, Y., Nguyen, N. H., Sinthong, P., Kalagnanam, J.: A time series is worth 64 words: long-term forecasting with transformers. In: International Conference on Learning Representations (2023)
Ariyo, A. A., Adewumi, A. O., Ayo, C. K.: Stock price prediction using the ARIMA model. In: 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, pp. 106–112. IEEE (2014)
Lai, G., Chang, W.-C., Yang, Y., Liu, H.: Modeling long-and short-term temporal patterns with deep neural networks. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104 (2018)
Pearl, J.: Causal inference in statistics: an overview. Stat. Surv. 3, 96–146 (2009)
Pearl, J., Mackenzie, D.: The Book of Why: The New Science of Cause and Effect. Basic Books (2018)
Pearl, J.: Causal diagrams for empirical research. Biometrika 82(4), 669–688. Oxford University Press (1995)
Box, G.E.P., Jenkins, G.M.: Some recent advances in forecasting and control. J. Roy. Stat. Soc. Ser. C (Appl. Stat.) 17(2), 91–109. JSTOR (1968)
Hatemi-J, A.: Multivariate tests for autocorrelation in the stable and unstable VAR models. Econ. Modell. 21(4), 661–683. Elsevier (2004)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780. MIT Press (1997)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Liu, S., Yu, H., Liao, C., Li, J., Lin, W., Liu, A. X., Dustdar, S.: Pyraformer: low-complexity pyramidal attention for long-range time series modeling and forecasting. In: International Conference on Learning Representations (2022)
Wu, H., Xu, J., Wang, J., Long, M.: Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. (2021)
Zeng, A., Chen, M., Zhang, L., Xu, Q.: Are transformers effective for time series forecasting? In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, pp. 11121–11128 (2023)
Chu, Z.: Causal Triple Attention Time Series Forecasting (2021)
Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge (2000)
Rubin, D.B.: Causal inference using potential outcomes: Design, modeling, decisions. J. Am. Stat. Assoc. 100(469), 322–331. Taylor & Francis (2005)
Pearl, J., Glymour, M., Jewell, N.P.: Causal Inference in Statistics: A primer. John Wiley & Sons (2016)
Yang, X., Zhang, H., Qi, G., Cai, J.: Causal attention for vision-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9847–9857 (2021)
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the International Conference on Machine Learning, pp. 2048–2057 (2015)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958. JMLR. org (2014)
Yue, Z., Zhang, H., Sun, Q., Hua, X.-S.: Interventional few-shot learning. Adv. Neural. Inf. Process. Syst. 33, 2734–2746 (2020)
Hu, X., Tang, K., Miao, C., Hua, X.-S., Zhang, H.: Distilling causal effect of data in class-incremental learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3957–3966 (2021)
Kim, T., Kim, J., Tae, Y., Park, C., Choi, J.-H., Choo, J.: Reversible instance normalization for accurate time-series forecasting against distribution shift. In: International Conference on Learning Representations (2021)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). CoRR, abs/1810.04805
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020). CoRR, abs/2010.11929
Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., Long, M.: iTransformer: Inverted Transformers Are Effective for Time Series Forecasting (2023). arXiv preprint arXiv:2310.06625
Chai, T., Draxler, R.R.: Root mean square error (RMSE) or mean absolute error (MAE)?—arguments against avoiding RMSE in the literature. Geosci. Model Dev. 7(3), 1247–1250. Copernicus Publications Göttingen, Germany (2014)
Hyndman, R.J., Koehler, A.B.: Another look at measures of forecast accuracy. Int. J. Forecast. 22(4), 679–688. Elsevier (2006)
Appendix
In this Appendix, we provide related work, causality analysis and other experimental details.
1.1 Related Work
Loss Function: MSE The Mean Squared Error (MSE) has been a cornerstone in the domain of time series prediction and is commonly used as a loss function in various forecasting models. The MSE is defined as \(\text {MSE} = \frac{1}{n}\sum _{i=1}^{n}(Y_i - \hat{Y}_i)^2\), where \(Y_i\) represents the true value, \(\hat{Y}_i\) denotes the predicted value, and \(n\) is the number of observations.
The popularity of MSE in time series analysis can be attributed to its ability to emphasize larger errors due to its quadratic nature. This characteristic makes it particularly suitable for applications where large errors are more undesirable than smaller ones. Studies such as those by Hyndman and Koehler [30] have highlighted the effectiveness of MSE in capturing the variance of forecasting errors and providing a clear measure of prediction accuracy.
However, MSE is sensitive to outliers, as noted by Chai and Draxler [29]. In datasets with significant anomalies or noise, MSE might result in biased estimations, emphasizing the need for robust preprocessing steps.
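As a concrete illustration, the sketch below computes the MSE loss over a batch of multivariate forecasts. It assumes a PyTorch implementation and illustrative tensor shapes; it is a minimal example rather than the loss code used in DTAformer.

```python
import torch

def mse_loss(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    """Mean squared error averaged over all observations.

    Both tensors are assumed to have shape (batch, horizon, num_variates).
    """
    return torch.mean((y_true - y_pred) ** 2)

# Example: random forecasts for a batch of 32 series, a 96-step horizon, and 7 variates.
y_true = torch.randn(32, 96, 7)
y_pred = torch.randn(32, 96, 7)
loss = mse_loss(y_true, y_pred)  # scalar; quadratic term emphasizes large errors
```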
1.2 Causality Analysis
Under the intervention, \( Y \) is no longer affected by the confounder \( U \), but only directly by \( X \) and the mediator \( Z \). In the causal graph, it is not only the legitimate causal pathway extending from the input variable \( X \) through the mediator \( Z \) to the outcome \( Y \) that warrants consideration. Concomitantly, the “backdoor” path, delineated as \( X \leftarrow U \rightarrow Y \), exerts an influence on \( Y \) through the confounder \( U \). This introduces a spurious correlation between \( X \) and \( Y \), thus confounding the relationship. Consequently, if one relies solely on the correlation \( P(Y \mid X) \) for model training, without addressing the confounding effects, the true causal effect from \( X \) to \( Y \) remains obscured, irrespective of the quantity and quality of the training data [17, 18]. To disentangle the confounded relationship between \( X \) and \( Y \), it is imperative to block the path \( X \leftarrow U \rightarrow Y \), thereby isolating the causal effect of \( X \) on \( Y \). However, in the context of time series analysis, the exact nature of the confounder is difficult to quantify. As an alternative, the front-door adjustment is employed, which does not require specific information about the confounder. Additionally, this approach offers a more intelligible means of understanding the mediator.
Therefore, instead of employing the likelihood \( P(Y \mid X) \), we utilize the causal intervention \( P(Y \mid \text {do}(X)) \), as proposed by Pearl, for time series forecasting [8]. This approach aims to elucidate the genuine causal relationship between \( X \) and \( Y \). The front-door adjustment is applied to compute \( P(Y \mid \text {do}(X)) \) through the front-door path \( X \rightarrow Z \rightarrow Y \), which is composed of two partial causal effects: \( P(Z \mid \text {do}(X)) \) and \( P(Y \mid \text {do}(Z)) \). Therefore, it follows that:
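In Pearl's standard notation, this decomposition over the mediator \( Z \) reads
\[
P(Y \mid \text{do}(X)) = \sum_{z} P(Z = z \mid \text{do}(X)) \, P(Y \mid \text{do}(Z = z)).
\]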
Similarly, to determine \( P(Z = z \mid \text {do}(X)) \), it is necessary to obstruct the backdoor path \( X \leftarrow U \rightarrow Y \leftarrow Z \) between \( X \) and \( Z \). Notably, this backdoor path includes a collider (\( U \rightarrow Y \leftarrow Z \)). According to Pearl [8], the presence of a collider within a path implies that it obstructs the association between the influencing variables. Thus, this path is inherently blocked, leading to the conclusion that:
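Because the collider blocks the only backdoor path between \( X \) and \( Z \), intervening on \( X \) is equivalent to conditioning on it:
\[
P(Z = z \mid \text{do}(X)) = P(Z = z \mid X).
\]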
For \( P(Y \mid \text {do}(Z)) \), it is necessary to block the backdoor path \( Z \leftarrow X \leftarrow U \rightarrow Y \) between \( Z \) and \( Y \). Given the unknown specifics regarding the confounder \( U \), this path must be blocked by adjusting for \( X \). Hence,
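Adjusting for (i.e., marginalizing over) \( X \) gives the standard backdoor expression
\[
P(Y \mid \text{do}(Z = z)) = \sum_{x} P(X = x) \, P(Y \mid Z = z, X = x).
\]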
Ultimately, this leads to the following formulation:
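Substituting the two partial causal effects into the decomposition yields the front-door adjustment formula
\[
P(Y \mid \text{do}(X)) = \sum_{z} P(Z = z \mid X) \sum_{x'} P(X = x') \, P(Y \mid Z = z, X = x').
\]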
1.3 Experimental Details
Data Descriptions The datasets employed in this study are summarized in Table 4, which delineates their inherent characteristics. These datasets have been widely utilized in the domain of time series analysis, providing a robust benchmark for evaluating the performance of various models.
The datasets employed encompass a diverse array of domains, each contributing unique characteristics and challenges to time series analysis:
- ETT (Electricity Transformer Temperature): Consisting of both hourly-level (ETTh) and quarter-hourly-level (ETTm) datasets, it captures key parameters such as oil and load features of electricity transformers from July 2016 to July 2018.
- Traffic: This dataset includes hourly road occupancy rates from sensors on San Francisco freeways, recorded from 2015 to 2016.
- Electricity: Details the hourly electricity consumption patterns of 321 clients from 2012 to 2014.
- Exchange-Rate: Offers daily exchange rates for eight countries, spanning from 1990 to 2016.
- Weather: Comprises 21 diverse indicators, including air temperature and humidity, recorded every 10 minutes in Germany during 2020.
- ILI: Reflects weekly data on the proportion of patients with flu-like symptoms, reported by the U.S. Centers for Disease Control and Prevention from 2002 to 2021.
The comprehensive nature of these datasets, covering different intervals, features, and domains, provides a rigorous testing ground for time series analysis methodologies.
Univariate Long-term Forecasting Results Table 5 shows the univariate long-term time series forecasting results. Compared with the other baseline methods, our DTAformer achieves the best results in most cases.
Attention Score Heatmap In this study, we employed both Self-attention and Directional Time Attention within the DTAformer framework to generate corresponding Attention Score Heatmaps, as depicted in Fig. 4. Compared to the Self-attention, the Directional Time Attention significantly reduces feature noise, concurrently enhancing the model’s ability to capture pronounced sequence direction features.
Analysis reveals that owing to the inherent mechanism of Directional Time Attention, all query \(Q\) elements are constrained to capture relationships only with key \(K\) elements at identical or preceding positions. Consequently, the attention scores de-emphasize the influence of potential future directional events within the input sequence. This contrasts with the Self-attention, which encompasses a broader range of interfering factors. As a result, Directional Time Attention more effectively discerns the authentic sequential and temporal relationships inherent in the input sequence.
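For intuition, the masking behaviour described above can be sketched as follows. This is a minimal illustration of a lower-triangular (causal) attention mask in PyTorch, not the authors' DTAformer implementation; the function name, tensor names, and dimensions are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def directional_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention where each query position may only attend
    to key positions at the same or earlier time steps.

    q, k, v: (batch, seq_len, d_model) -- shapes are illustrative assumptions.
    """
    d_model = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_model ** 0.5   # (batch, seq_len, seq_len)
    seq_len = scores.size(-1)
    # Lower-triangular mask: query position i may only see key positions j <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device))
    scores = scores.masked_fill(~mask, float("-inf"))   # suppress "future" directions
    weights = F.softmax(scores, dim=-1)                 # rows sum to 1 over visible keys
    return weights @ v

# Example usage with random inputs.
q = k = v = torch.randn(2, 16, 64)
out = directional_attention(q, k, v)  # (2, 16, 64)
```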
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Chang, J., Yue, L., Liu, Q.: DTAformer: Directional Time Attention Transformer for Long-Term Series Forecasting. In: Lin, Z., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol. 15034. Springer, Singapore (2025). https://doi.org/10.1007/978-981-97-8505-6_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8504-9
Online ISBN: 978-981-97-8505-6