InTrans: Fast Incremental Transformer for Time Series Data Prediction

  • Conference paper
Database and Expert Systems Applications (DEXA 2022)

Abstract

Predicting time-series data is useful in many applications, such as natural disaster prevention, weather forecasting, and traffic control systems. Time-series forecasting has been extensively studied. Many existing forecasting models perform well when predicting short sequence time-series, but their performance degrades greatly on long sequences. Recently, more dedicated research has been devoted to this direction, and Informer is currently the most efficient prediction model. The main drawback of Informer is its inability to learn incrementally. This paper proposes an incremental Transformer, called InTrans, to address this bottleneck by reducing the training/predicting time of Informer. The time complexities of InTrans compared to Informer are: (1) O(S) vs. O(L) for positional and temporal embedding, (2) \(O((S+k-1)*k)\) vs. \(O(L*k)\) for value embedding, and (3) \(O((S+k-1)*d_{dim})\) vs. \(O(L*d_{dim})\) for the computation of Query/Key/Value, where L is the length of the input; k is the kernel size; \(d_{dim}\) is the number of dimensions; and S is the length of the non-overlapping part of the input, which is usually significantly smaller than L. Therefore, InTrans can greatly improve both training and predicting speed over the state-of-the-art model, Informer. Extensive experiments have shown that InTrans is about 26% faster than Informer for both short sequence and long sequence time-series prediction.

S. Bou—Due to name change, Savong Bou is now known as Takehiko Hashimoto.

Notes

  1. https://github.com/zhouhaoyi/ETDataset.

  2. https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014.

  3. https://www.ncei.noaa.gov/data/local-climatological-data/.

References

  1. Ariyo, A.A., Adewumi, A.O., Ayo, C.K.: Stock price prediction using the ARIMA model. In: 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, pp. 106–112 (2014)

  2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate (2016)

  3. Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv:2004.05150 (2020)

  4. Bou, S., Kitagawa, H., Amagasa, T.: L-Bix: incremental sliding-window aggregation over data streams using linear bidirectional aggregating indexes. Knowl. Inf. Syst. 62(8), 3107–3131 (2020)

  5. Bou, S., Kitagawa, H., Amagasa, T.: Cpix: real-time analytics over out-of-order data streams by incremental sliding-window aggregation. IEEE Trans. Knowl. Data Eng. (2021). https://doi.org/10.1109/TKDE.2021.3054898

  6. Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: the efficient transformer. In: International Conference on Learning Representations (2020)

  7. Lai, G., Chang, W.C., Yang, Y., Liu, H.: Modeling long- and short-term temporal patterns with deep neural networks (2018)

  8. Park, H.J., Kim, Y., Kim, H.Y.: Stock market forecasting using a multi-task approach integrating long short-term memory and the random forest framework. Appl. Soft Comput. 114, 108106 (2022)

  9. Salinas, D., Flunkert, V., Gasthaus, J., Januschowski, T.: DeepAR: probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 36(3), 1181–1191 (2020)

  10. Su, T., Pan, T., Chang, Y., Lin, S., Hao, M.: A hybrid fuzzy and k-nearest neighbor approach for debris flow disaster prevention. IEEE Access 10, 21787–21797 (2022). https://doi.org/10.1109/ACCESS.2022.3152906

  11. Taylor, S., Letham, B.: Forecasting at scale. Am. Stat. 72, 37–45 (2018)

  12. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017)

  13. Zhou, H., et al.: Informer: beyond efficient transformer for long sequence time-series forecasting. In: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, vol. 35, pp. 11106–11115. AAAI Press (2021)

Acknowledgements

This work was supported by the University of Tsukuba Basic Research Support Program Type A, Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers JP19H04114 and JP22H03694, the New Energy and Industrial Technology Development Organization (NEDO) Grant Number JPNP20006, and the Japan Agency for Medical Research and Development (AMED) Grant Number JP21zf0127005.

Author information

Corresponding author

Correspondence to Savong Bou.


Appendices

A Proof of Theorem 1

The inputs to Eq. 2 are positional and dimensional information. The dimensional information is the same for all records in the training data. Assume that we have two consecutive inputs i and \(i+1\), each consisting of L records, where the gap between the beginnings of inputs i and \(i+1\) is S records. The positional information of input i is \(POS_i=[pos_{(i-1)S+1}, pos_{(i-1)S+2},..., pos_{(i-1)S+S}, pos_{(i-1)S+S+1}, ..., pos_{(i-1)S+L}]\), and the positional information of input \(i+1\) is \(POS_{i+1}=[pos_{(i-1)S+S+1}, pos_{(i-1)S+S+2}, ..., pos_{(i-1)S+L}, pos_{(i-1)S+L+1}, ..., pos_{(i-1)S+L+S}]\). We have

  • \(POS_i=POSn_i \oplus POSo_i\), and

  • \(POS_{i+1}=POSo_{i+1} \oplus POSn_{i+1}\), where

    • \(POSn_i=[pos_{(i-1)S+1}, pos_{(i-1)S+2},..., pos_{(i-1)S+S}]\), and

    • \(POSo_i=POSo_{i+1}=[pos_{(i-1)S+S+1}, ..., pos_{(i-1)S+L}]\)

    • \(POSn_{i+1}=[pos_{(i-1)S+L+1}, ..., pos_{(i-1)S+L+S}]\)

PosEncoding(POS) represents encoding the POS by Eq. 2. We have:

  • \(Pn_{i}=PosEncoding(POSn_i)\), and

  • \(Po_{i}=PosEncoding(POSo_i)\), so

  • \(P_{i}=Pn_{i} \oplus Po_{i}\).

Because \(POSo_i=POSo_{i+1}\), we have \(Po_{i+1}=PosEncoding(POSo_{i+1})=PosEncoding(POSo_i)=Po_{i}\). Furthermore:

  • \(Pn_{i+1}=PosEncoding(POSn_{i+1})\), therefore

  • \(P_{i+1}= Po_{i+1} \oplus Pn_{i+1}= Po_{i} \oplus Pn_{i+1} \).

Theorem 1 is proven.
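
To make the reuse concrete, here is a minimal sketch, assuming Eq. 2 is the standard sinusoidal positional encoding of Vaswani et al. [12]; the function and variable names (pos_encoding, L, S, d_model) and the sizes are illustrative, not taken from the paper. It verifies that the encoding of input \(i+1\) can be assembled from the cached block \(Po_i\) plus the encodings of only the S new positions, which is the basis of the O(S) cost in Theorem 7.

```python
# Minimal sketch (NumPy): incremental reuse of sinusoidal positional encodings
# across two overlapping windows. Assumes Eq. 2 is the standard sinusoidal
# encoding of Vaswani et al.; all names and sizes here are illustrative.
import numpy as np

def pos_encoding(positions, d_model):
    """Sinusoidal encoding of a 1-D array of absolute positions."""
    pe = np.zeros((len(positions), d_model))
    div = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(positions[:, None] * div)
    pe[:, 1::2] = np.cos(positions[:, None] * div)
    return pe

L, S, d_model = 96, 24, 8            # window length, shift, embedding size
pos_i  = np.arange(0, L)             # positions covered by input i
pos_i1 = np.arange(S, L + S)         # positions covered by input i+1

P_i = pos_encoding(pos_i, d_model)                    # Pn_i followed by Po_i
Pn_i1 = pos_encoding(pos_i1[-S:], d_model)            # only the S new positions
P_i1_incremental = np.concatenate([P_i[S:], Pn_i1])   # Po_i followed by Pn_{i+1}

# Matches the non-incremental encoding of input i+1.
assert np.allclose(P_i1_incremental, pos_encoding(pos_i1, d_model))
```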

B Proof of Theorem 2

The notations in Appendix A are also used in this section. For relative position, the positional information of all inputs/outputs is the same, so \(POS_1= POS_2=...= POS_i=POS_{i+1}=[pos_{1}, pos_{2},..., pos_{L}]\). Therefore, we have \(P_1= P_2=...=P_i=P_{i+1}=PosEncoding(POS_1)\). Theorem 2 is proven.

C Proof of Theorem 3

The temporal embedding takes temporal information of the records, such as week, month, and holiday, as the basis for embedding. The temporal information of a record does not change across different training samples. Therefore, the temporal embeddings of all records belonging to the overlapping part between inputs i and \(i+1\) are the same. The proof is similar to that of the absolute positional embedding in Appendix A. Thus, \(To_{i+1}=To_{i}\), so Theorem 3 is proven.

D Proof of Theorem 4

The notations in Appendix A are also used in this section. We need to prove that \(Vr_{i} = Vr_{i+1}\). When computing the value embedding of input i, the input is convolved by a kernel of width k. Since the default stride is one, the resulting embedding values of records \(x_{(i-1)S+1}\) to \(x_{(i-1)S+L-(k-1)}\) are fully convolved without including any padding values. Similarly, within input \(i+1\), the embedding values of records \(x_{(i-1)S+S+1}\) to \(x_{(i-1)S+L-(k-1)}\) are fully convolved. Therefore, the embedding values of records \(x_{(i-1)S+S+1}\) to \(x_{(i-1)S+L-(k-1)}\) are the same for both inputs i and \(i+1\), i.e., \(Vr_{i} = Vr_{i+1}\), which proves Theorem 4.
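
The argument can be illustrated with a minimal PyTorch sketch, under the simplifying assumption of no padding and a shared Conv1d with stride one; the sizes and variable names below are illustrative, not taken from the paper. It checks that the outputs that are fully convolved within the overlap of inputs i and \(i+1\) coincide, so per window only the outputs depending on new (or, in the padded setting of the paper, padded) records need to be recomputed, roughly \(S+k-1\) positions.

```python
# Minimal sketch (PyTorch): the fully convolved value embeddings over the
# overlap of two consecutive windows coincide when the same Conv1d (stride 1)
# is applied. Padding is omitted for clarity; all names/sizes are illustrative.
import torch
import torch.nn as nn

L, S, k, d_in, d_model = 96, 24, 3, 7, 16
conv = nn.Conv1d(d_in, d_model, kernel_size=k, stride=1, bias=False)

stream = torch.randn(1, d_in, L + S)     # (batch, channels, time)
x_i  = stream[:, :, :L]                  # input i
x_i1 = stream[:, :, S:]                  # input i+1, shifted by S records

with torch.no_grad():
    v_i, v_i1 = conv(x_i), conv(x_i1)

# Embeddings of the records that are fully convolved inside the overlap
# (i.e., excluding the last k-1 positions of the overlap) are identical.
reusable = L - S - (k - 1)
assert torch.allclose(v_i[:, :, S:S + reusable], v_i1[:, :, :reusable], atol=1e-6)
```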

E Proof of Theorem 5

Query, Key, and Value are computed as in Eq. 7. Such multiplication does not change the value distribution of the original input embedding. Therefore, Query, Key, and Value can be incrementally computed in a manner similar to the input embedding, as proved in Theorems 1, 3, and 4. Thus, Theorem 5 is proven.
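
As a small illustration (not the paper's implementation), the sketch below applies a position-wise linear map, standing in for the Query projection of Eq. 7 whose exact form is not reproduced here, to an embedding whose overlapping rows are reused; the names (W_q, emb_i) and sizes are illustrative. Identical embedding rows give identical Query rows, so only the S new rows need to be projected, and the same holds for Key and Value.

```python
# Minimal sketch (PyTorch): Query/Key/Value are position-wise linear maps of
# the input embedding, so reused embedding rows yield identical Q/K/V rows.
# Names and sizes are illustrative.
import torch
import torch.nn as nn

L, S, d_model = 96, 24, 16
W_q = nn.Linear(d_model, d_model, bias=False)   # stand-in for the Query projection

emb_i  = torch.randn(L, d_model)                              # embedding of input i
emb_i1 = torch.cat([emb_i[S:], torch.randn(S, d_model)], 0)   # reuse overlap + S new rows

with torch.no_grad():
    Q_i, Q_i1 = W_q(emb_i), W_q(emb_i1)

# Rows corresponding to the overlap are identical across the two windows.
assert torch.allclose(Q_i[S:], Q_i1[:L - S], atol=1e-6)
```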

F Proof of Theorem 6

InTrans is implemented on top of Informer by adopting the incremental computation of the temporal/positional/value embeddings and of the Query/Key/Value of each training sample. To prove that InTrans has the same predicting accuracy as Informer, we have to prove that the temporal/positional/value embeddings and the Query/Key/Value incrementally computed by InTrans are the same as those computed non-incrementally by Informer. Theorems 1–5 have shown exactly this. Theorem 6 is proven.

G Proof of Theorem 7

Equations 4 and 5 suggest that the positional and temporal embeddings can be incrementally computed, which is proved in Theorems 1 and 3. For each input \(i+1\), the positional embedding (\(Po_i\)) and the temporal embedding (\(To_i\)) corresponding to the overlapping part between inputs i and \(i+1\), computed when embedding input i, can be reused for input \(i+1\). Therefore, only the positional embedding (\(Pn_{i+1}\)) and the temporal embedding (\(Tn_{i+1}\)) corresponding to the non-overlapping part between inputs i and \(i+1\) need to be computed when embedding input \(i+1\). The size of the non-overlapping part is S, so the time complexity to compute the positional and temporal embedding of each input is O(S).

Similarly, the value embedding is done using Conv1d. Equation 6 and Theorem 4 suggest that, after computing the value embedding of input i, the value embedding of the overlapping part between inputs i and \(i+1\) (excluding the value embedding of the last \(k-1\) records) can be reused when embedding input \(i+1\). Therefore, the time complexity to compute the value embedding of each input is \(O((S+k-1)*k)\).

Similar to the value embedding, Eq. 6 and Theorem 4 suggest that the time complexity to compute Query, Key, or Value is \(O((S+k-1)*d_{dim})\). Theorem 7 is proven.
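
As a rough, purely illustrative comparison (the concrete values of L, S, k, and \(d_{dim}\) below are hypothetical, not taken from the paper's experiments), the per-window operation counts implied by Theorem 7 scale as follows:

```python
# Hypothetical back-of-the-envelope comparison of per-window work,
# following the complexity terms of Theorem 7. Values are illustrative only.
L, S, k, d_dim = 96, 24, 3, 512

costs = {
    "positional/temporal embedding": (L,          S),
    "value embedding (Conv1d)":      (L * k,      (S + k - 1) * k),
    "Query/Key/Value projection":    (L * d_dim,  (S + k - 1) * d_dim),
}
for name, (informer, intrans) in costs.items():
    print(f"{name}: Informer ~{informer} vs. InTrans ~{intrans}")
```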

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bou, S., Amagasa, T., Kitagawa, H. (2022). InTrans: Fast Incremental Transformer for Time Series Data Prediction. In: Strauss, C., Cuzzocrea, A., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2022. Lecture Notes in Computer Science, vol 13427. Springer, Cham. https://doi.org/10.1007/978-3-031-12426-6_4

  • DOI: https://doi.org/10.1007/978-3-031-12426-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12425-9

  • Online ISBN: 978-3-031-12426-6
