Replacing self-attentions with convolutional layers in multivariate long sequence time-series forecasting

Abstract

Transformers have attracted increasing interest in time-series forecasting. However, Multi-Head Self-Attention (MHSA) layers pose two issues in Multivariate Long Sequence Time-series Forecasting (MLSTF): massive computational resource consumption and a lack of inductive bias for learning the seasonal and trend patterns of time-series sequences. To address these issues, a systematic method is proposed to replace part of the MHSA layers in Transformers with convolutional layers. Specifically, the self-attention patterns are categorized into four types, i.e., diagonal, vertical, block and heterogeneous patterns, and the relationships between convolutional layers and MHSA layers exhibiting these patterns are explored. Based on these relationships, evaluation metrics are proposed to decide whether an MHSA layer should be replaced with a convolutional layer. Experimental results on two representative Transformer-based forecasting models show that our method achieves results competitive with the original Transformer-based forecasting models while greatly reducing their numbers of parameters and FLOPs. Performance on small datasets also improves considerably owing to the introduction of convolutional operations. Furthermore, the method can be adapted to Transformer-based models for other time-series tasks, with similar results.
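To make the core replacement concrete, below is a minimal PyTorch-style sketch of a depthwise separable 1-D convolution used as a drop-in substitute for an MHSA sublayer; the class name, tensor layout and default kernel size are illustrative assumptions, not the authors' published implementation.

# Minimal sketch (PyTorch, assumed): a depthwise separable 1-D convolution
# that can stand in for a multi-head self-attention (MHSA) sublayer. The
# class name, tensor layout and kernel size are illustrative only.
import torch
import torch.nn as nn


class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, d_model: int, kernel_size: int = 5):
        super().__init__()
        padding = kernel_size // 2  # "same" padding keeps the sequence length
        # Depthwise convolution: one filter per channel (groups = d_model)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=padding, groups=d_model)
        # Pointwise convolution: mixes information across channels
        self.pointwise = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); Conv1d expects (batch, d_model, seq_len)
        y = x.transpose(1, 2)
        y = self.pointwise(self.depthwise(y))
        return y.transpose(1, 2)


# Same input/output shape as an MHSA sublayer, so it can replace one in place.
x = torch.randn(2, 96, 64)  # (batch, seq_len, d_model)
assert DepthwiseSeparableConv1d(64)(x).shape == x.shape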

Data Availability

The datasets generated or analyzed during this study are available via the links in Section 4.

Notes

  1. Available at https://github.com/zhouhaoyi/ETDataset.

  2. Available at https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014.

  3. Available at https://www.bgc-jena.mpg.de/wetter/.

  4. Available at https://www.ncei.noaa.gov/data/local-climatological-data/.

  5. Available at https://github.com/laiguokun/multivariate-time-series-data.

  6. Available at https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html.

  7. Available at http://pems.dot.ca.gov.

  8. \(\star \): the replaced model

  9. Available at https://github.com/ZZUFaceBookDL/GTN.

  10. \(\star \): the replaced model

References

  1. Chen K, Du X, Zhu B et al (2022) HTS-AT: a hierarchical token-semantic audio transformer for sound classification and detection. In: IEEE International conference on acoustics, speech and signal processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022. IEEE, pp 646–650

  2. Chimmula VKR, Zhang L (2020) Time series forecasting of covid-19 transmission in Canada using lstm networks. Chaos, Solitons Fractals 135:109864

  3. Cordonnier J, Loukas A, Jaggi M (2020) On the relationship between self-attention and convolutional layers. In: 8th International conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020

  4. Dai Z, Liu H, Le QV et al (2021) Coatnet: Marrying convolution and attention for all data sizes. In: Advances in neural information processing systems 34: annual conference on neural information processing systems 2021, NeurIPS 2021, 6-14 December 2021, virtual, pp 3965–3977

  5. Devlin J, Chang M, Lee K et al (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). Association for Computational Linguistics, pp 4171–4186

  6. Di Piazza A, Di Piazza M, La Tona G et al (2021) An artificial neural network-based forecasting model of energy-related time series for electrical grid management. Math Comput Simul 184:294–305

  7. Graham B, El-Nouby A, Touvron H et al (2021) Levit: a vision transformer in convnet’s clothing for faster inference. In: 2021 IEEE/CVF International conference on computer vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, pp 12239–12249

  8. Guo S, Lin Y, Feng N et al (2019) Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In: Proceedings of the thirty-third aaai conference on artificial intelligence and thirty-first innovative applications of artificial intelligence conference and ninth AAAI symposium on educational advances in artificial intelligence. AAAI Press, AAAI’19/IAAI’19/EAAI’19

  9. Haralick RM, Shanmugam K, Dinstein I (1973) Textural features for image classification. IEEE Trans Syst Man Cybern SMC-3(6):610–621

  10. He H, Gao S, Jin T et al (2021) A seasonal-trend decomposition-based dendritic neuron model for financial time series prediction. Appl Soft Comput 108:107488

  11. Himeur Y, Ghanem K, Alsalemi A et al (2021) Artificial intelligence based anomaly detection of energy consumption in buildings: a review, current trends and new perspectives. Appl Energy 287:116601

  12. Kitaev N, Kaiser L, Levskaya A (2020) Reformer: the efficient transformer. In: 8th International conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020

  13. Li S, Jin X, Xuan Y et al (2019) Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In: Advances in neural information processing systems 32: Annual conference on neural information processing systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp 5244–5254

  14. Lim B, Arık SÖ, Loeff N et al (2021) Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int J Forecast 37(4):1748–1764

  15. Liu M, Ren S, Ma S et al (2021a) Gated transformer networks for multivariate time series classification. CoRR abs/2103.14438. arXiv:2103.14438

  16. Liu S, Yu H, Liao C et al (2022) Pyraformer: low-complexity pyramidal attention for long-range time series modeling and forecasting. In: The tenth international conference on learning representations, ICLR 2022, Virtual Event, April 25-29, 2022

  17. Liu Z, Lin Y, Cao Y et al (2021b) Swin transformer: hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10–17, 2021. IEEE, pp 9992–10002

  18. Pan X, Ge C, Lu R et al (2022) On the integration of self-attention and convolution. In: IEEE/CVF Conference on computer vision and pattern recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022. IEEE, pp 805–815

  19. Peng Z, Huang W, Gu S et al (2021) Conformer: Local features coupling global representations for visual recognition. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10–17, 2021. IEEE, pp 357–366

  20. Tuli S, Casale G, Jennings NR (2022) Tranad: Deep transformer networks for anomaly detection in multivariate time series data. Proc VLDB Endow 15(6):1201–1214

  21. Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Advances in neural information processing systems 30: annual conference on neural information processing systems 2017, December 4–9, 2017, Long Beach, CA, USA, pp 5998–6008

  22. Vaswani A, Ramachandran P, Srinivas A et al (2021) Scaling local self-attention for parameter efficient visual backbones. In: IEEE Conference on computer vision and pattern recognition, CVPR 2021, virtual, June 19-25, 2021. Computer Vision Foundation / IEEE, pp 12894–12904

  23. Wang H, Zhang R, Cheng X et al (2022) Hierarchical traffic flow prediction based on spatial-temporal graph convolutional network. IEEE Trans Intell Transp Syst 23(9):16137–16147

  24. Wang X, Ma Y, Wang Y et al (2020) Traffic flow prediction via spatial temporal graph neural network. In: WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20–24, 2020. ACM / IW3C2, pp 1082–1092

  25. Wu H, Xiao B, Codella N et al (2021a) Cvt: Introducing convolutions to vision transformers. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10–17, 2021. IEEE, pp 22–31

  26. Wu H, Xu J, Wang J et al (2021b) Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In: Advances in neural information processing systems 34: annual conference on neural information processing systems 2021, NeurIPS 2021, December 6–14, 2021, virtual, pp 22419–22430

  27. Yuan L, Chen Y, Wang T et al (2021) Tokens-to-token vit: Training vision transformers from scratch on imagenet. In: 2021 IEEE/CVF International conference on computer vision, ICCV 2021, Montreal, QC, Canada, October 10–17, 2021. IEEE, pp 538–547

  28. Zeng W, Jin S, Liu W et al (2022) Not all tokens are equal: human-centric visual analysis via token clustering transformer. In: IEEE/CVF Conference on computer vision and pattern recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022. IEEE, pp 11091–11101

  29. Zerveas G, Jayaraman S, Patel D et al (2021) A transformer-based framework for multivariate time series representation learning. In: KDD ’21: The 27th ACM SIGKDD conference on knowledge discovery and data mining, virtual event, Singapore, August 14–18, 2021. ACM, pp 2114–2124

  30. Zhang D, Lou S (2021) The application research of neural network and bp algorithm in stock price pattern classification and prediction. Futur Gener Comput Syst 115:872–879

  31. Zhou H, Zhang S, Peng J et al (2021) Informer: beyond efficient transformer for long sequence time-series forecasting. In: Thirty-Fifth AAAI conference on artificial intelligence, AAAI 2021, thirty-third conference on innovative applications of artificial intelligence, IAAI 2021, the eleventh symposium on educational advances in artificial intelligence, EAAI 2021, Virtual Event, February 2-9, 2021. AAAI Press, pp 11106–11115

  32. Zhou T, Ma Z, Wen Q et al (2022) Fedformer: frequency enhanced decomposed transformer for long-term series forecasting. In: International conference on machine learning, ICML 2022, 17–23 July 2022, Baltimore, Maryland, USA, Proceedings of Machine Learning Research, vol 162. PMLR, pp 27268–27286

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection, experiments and analysis were performed by Yong Wang, Jianjian Peng, Xiaohu Wang, Zhicheng Zhang and Junting Duan. The first draft of the manuscript was written by Yong Wang, Jianjian Peng and Xiaohu Wang. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yong Wang.

Ethics declarations

Ethical and informed consent for data used

Not applicable.

Competing Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Details of the models

A.1 Autoformer

The details of Autoformer used in the experiments are summarized in Table 7. Autoformer contains a 2-layer encoder and a 1-layer decoder. For the Auto-Correlation mechanism, let \(d = 64\) and \(h = 8\); as in the original model, residual connections, a series decomposition layer (moving-average window 25), a position-wise feed-forward network layer (inner-layer dimension 2048) and a dropout layer (\(p = 0.05\)) are added. Note that the MHSA layers are replaced with depthwise separable convolutional layers in the first encoder layer and the decoder. For the depthwise separable convolutional layers, the kernel size is \(m = 5\).
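The series decomposition layer mentioned above can be sketched as a simple moving-average split. Below is a PyTorch-style sketch under assumptions: Autoformer's own endpoint-replication padding is simplified to the pooling layer's built-in padding, and the window of 25 is taken from the settings above.

# Sketch (PyTorch, assumed) of a moving-average series decomposition layer
# with the window of 25 used above.
import torch
import torch.nn as nn


class SeriesDecomposition(nn.Module):
    """Split a series into a trend part (moving average) and a seasonal part."""

    def __init__(self, window: int = 25):
        super().__init__()
        # An odd window with padding window // 2 preserves the sequence length.
        self.avg = nn.AvgPool1d(window, stride=1, padding=window // 2,
                                count_include_pad=False)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model)
        trend = self.avg(x.transpose(1, 2)).transpose(1, 2)
        seasonal = x - trend
        return seasonal, trend


# The decomposition follows the residual connection around the replaced sublayer.
x = torch.randn(2, 96, 64)
seasonal, trend = SeriesDecomposition(25)(x)
assert seasonal.shape == trend.shape == x.shape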

Table 7 The Autoformer network components in detail

A.2 Informer

The schematic diagram of introducing convolution into Informer is shown in Fig. 11. The details of Informer used in the experiments are summarized in Table 8. Informer contains a 2-layer encoder and a 1-layer decoder. For the ProbSparse self-attention mechanism, let \(d = 64\) and \(h = 8\); as in the original model, residual connections, a position-wise feed-forward network layer (inner-layer dimension 2048) and a dropout layer (\(p = 0.05\)) are added. Note that the MHSA layers are replaced with depthwise separable convolutional layers in the first encoder layer and the decoder. For the depthwise separable convolutional layers, the kernel size is \(m = 7\).
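As a rough illustration of the wiring described above, the sketch below combines the depthwise separable convolution (kernel \(m = 7\)) with the residual connections, dropout (\(p = 0.05\)) and the position-wise feed-forward network (inner dimension 2048) into one encoder layer; the layer normalizations and ReLU activation are generic assumptions rather than Informer's exact choices.

# Sketch (PyTorch, assumed) of one encoder layer in which the ProbSparse
# self-attention is replaced by a depthwise separable convolution. Layer norms
# and the ReLU activation are generic choices, not Informer's exact ones.
import torch
import torch.nn as nn


class ConvEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 64, kernel_size: int = 7,
                 d_ff: int = 2048, dropout: float = 0.05):
        super().__init__()
        padding = kernel_size // 2
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=padding, groups=d_model)
        self.pointwise = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        y = self.pointwise(self.depthwise(x.transpose(1, 2))).transpose(1, 2)
        x = self.norm1(x + self.dropout(y))               # conv sublayer + residual
        return self.norm2(x + self.dropout(self.ffn(x)))  # FFN sublayer + residual


x = torch.randn(2, 96, 64)
assert ConvEncoderLayer()(x).shape == x.shape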

Fig. 11 The schematic diagram of introducing convolution into Informer. The wheat-colored blocks denote the MHSA layers that are replaced by convolutional layers

Table 8 The Informer network components in detail
Fig. 12 The schematic diagram of introducing convolution into TranAD. The wheat-colored blocks denote the MHSA layers that are replaced by convolutional layers

Table 9 The TranAD network components in detail

A.3 TranAD

The schematic diagram of introducing convolution into TranAD is shown in Fig. 12. The details of TranAD used in the experiments are summarized in Table 9. TranAD contains a 1-layer encoder, a 1-layer window encoder, and two 1-layer decoders. For the self-attention mechanism, let \(d = 2 \times \textit{feats}\) and \(h = \textit{feats}\), where feats denotes the dimensionality of the data labels; as in the original model, residual connections, a position-wise feed-forward network layer (inner-layer dimension 16) and a dropout layer (\(p = 0.1\)) are added. Note that the MHSA layers are replaced with depthwise separable convolutional layers in the first window encoder layer and the encoder. For the depthwise separable convolutional layers, the kernel size is \(m = 5\). Following the training settings of TranAD, the model is trained with the AdamW optimizer with an initial learning rate of 0.01 (meta learning rate 0.02) and a step learning-rate scheduler with a decay factor of 0.5.
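A minimal sketch of this training setup, assuming PyTorch, is shown below. The model is a placeholder module, the StepLR interval (step_size) is an assumed value, the 0.5 from the settings is read as the decay factor, and TranAD's meta learning rate and two-phase training are not reproduced.

# Sketch (PyTorch, assumed) of the optimizer/scheduler setup described above.
# `model` is only a placeholder; step_size is an assumed value and gamma = 0.5
# corresponds to the 0.5 mentioned in the settings.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the replaced TranAD network
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

x = torch.randn(16, 8)
for epoch in range(3):             # illustrative loop only
    loss = model(x).pow(2).mean()  # stand-in for the reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()               # scales the learning rate by 0.5 every step_size epochs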

A.4 GTN

The schematic diagram of introducing convolution into GTN is shown in Fig. 13. The details of GTN used in the experiments are summarized in Table 10. GTN contains an 8-layer encoder1 and an 8-layer encoder2. For the self-attention mechanism, let \(d = 64\) and \(h = 8\); as in the original model, residual connections, a position-wise feed-forward network layer (inner-layer dimension 1024) and a dropout layer (\(p = 0.2\)) are added. Note that the MHSA layers are replaced with depthwise separable convolutional layers in encoder1 and encoder2. For the depthwise separable convolutional layers, the kernel size is \(m = 9\). Following the training settings of GTN, the model is trained with Adagrad with a learning rate of 0.0001.
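To make the parameter saving concrete for these settings (\(d = 64\), \(h = 8\), \(m = 9\)), the sketch below compares the parameter count of a generic PyTorch MHSA layer with that of the depthwise separable replacement; nn.MultiheadAttention is only a stand-in, so exact counts in the paper's models may differ.

# Rough parameter-count comparison (PyTorch, assumed) between a generic MHSA
# layer and the depthwise separable convolution replacing it, using d = 64,
# h = 8 and kernel m = 9 from the GTN settings above.
import torch.nn as nn

d_model, n_heads, kernel = 64, 8, 9

mhsa = nn.MultiheadAttention(d_model, n_heads)
conv = nn.Sequential(
    nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2, groups=d_model),  # depthwise
    nn.Conv1d(d_model, d_model, kernel_size=1),                                # pointwise
)


def count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


print("MHSA parameters:", count(mhsa))  # roughly 4 * d_model^2, plus biases
print("conv parameters:", count(conv))  # d_model * m (depthwise) + d_model^2 (pointwise), plus biases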

Fig. 13 The schematic diagram of introducing convolution into GTN. The wheat-colored blocks denote the MHSA layers that are replaced by convolutional layers

Table 10 The GTN network components in detail

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Y., Peng, J., Wang, X. et al. Replacing self-attentions with convolutional layers in multivariate long sequence time-series forecasting. Appl Intell 54, 522–543 (2024). https://doi.org/10.1007/s10489-023-05205-z
