Abstract
Predicting time-series data is useful in many applications, such as natural disaster prevention systems, weather forecasting, and traffic control systems. Time-series forecasting has been extensively studied. Many existing forecasting models perform well when predicting short sequence time-series; however, their performance degrades greatly on long ones. Recently, more dedicated research has been done in this direction, and Informer is currently the most efficient predicting model. The main drawback of Informer is its inability to learn incrementally. This paper proposes an incremental Transformer, called InTrans, to address this bottleneck by reducing the training/predicting time of Informer. The time complexities of InTrans compared to Informer are: (1) O(S) vs. O(L) for positional and temporal embedding; (2) \(O((S+k-1)*k)\) vs. \(O(L*k)\) for value embedding; and (3) \(O((S+k-1)*d_{dim})\) vs. \(O(L*d_{dim})\) for the computation of Query/Key/Value, where L is the length of the input; k is the kernel size; \(d_{dim}\) is the number of dimensions; and S is the length of the non-overlapping part of the input, which is usually significantly smaller than L. Therefore, InTrans can greatly improve both training and predicting speed over the state-of-the-art model, Informer. Extensive experiments have shown that InTrans is about 26% faster than Informer for both short sequence and long sequence time-series prediction.
S. Bou—Due to name change, Savong Bou is now known as Takehiko Hashimoto.
References
Ariyo, A.A., Adewumi, A.O., Ayo, C.K.: Stock price prediction using the ARIMA model. In: 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, pp. 106–112 (2014)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate (2016)
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv:2004.05150 (2020)
Bou, S., Kitagawa, H., Amagasa, T.: L-BiX: incremental sliding-window aggregation over data streams using linear bidirectional aggregating indexes. Knowl. Inf. Syst. 62(8), 3107–3131 (2020)
Bou, S., Kitagawa, H., Amagasa, T.: CPiX: real-time analytics over out-of-order data streams by incremental sliding-window aggregation. IEEE Trans. Knowl. Data Eng. (2021). https://doi.org/10.1109/TKDE.2021.3054898
Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: the efficient transformer. In: International Conference on Learning Representations (2020)
Lai, G., Chang, W.C., Yang, Y., Liu, H.: Modeling long- and short-term temporal patterns with deep neural networks (2018)
Park, H.J., Kim, Y., Kim, H.Y.: Stock market forecasting using a multi-task approach integrating long short-term memory and the random forest framework. Appl. Soft Comput. 114, 108106 (2022)
Salinas, D., Flunkert, V., Gasthaus, J., Januschowski, T.: DeepAR: probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 36(3), 1181–1191 (2020)
Su, T., Pan, T., Chang, Y., Lin, S., Hao, M.: A hybrid fuzzy and k-nearest neighbor approach for debris flow disaster prevention. IEEE Access 10, 21787–21797 (2022). https://doi.org/10.1109/ACCESS.2022.3152906
Taylor, S., Letham, B.: Forecasting at scale. Am. Stat. 72, 37–45 (2018)
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017)
Zhou, H., et al.: Informer: beyond efficient transformer for long sequence time-series forecasting. In: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, vol. 35, pp. 11106–11115. AAAI Press (2021)
Acknowledgements
This work was supported by University of Tsukuba Basic Research Support Program Type A, Japan Society for the Promotion of Science (JSPS) KAKENHI under Grant Number JP19H04114 and JP22H03694, the New Energy and Industrial Technology Development Organization (NEDO) Grant Number JPNP20006, and Japan Agency for Medical Research and Development (AMED) Grant Number JP21zf0127005.
Appendices
A Proof of Theorem 1
The inputs to Eq. 2 are positional and dimensional information. The dimensional information is the same for all records in the training data. Assume that we have two consecutive inputs i and \(i+1\) of L records each, and that the gap between the beginnings of inputs i and \(i+1\) is S records. The positional information of input i is \(POS_i=[pos_{(i-1)S+1}, pos_{(i-1)S+2},..., pos_{(i-1)S+S}, pos_{(i-1)S+S+1}, ..., pos_{(i-1)S+L}]\), and the positional information of input \(i+1\) is \(POS_{i+1}=[pos_{(i-1)S+S+1}, pos_{(i-1)S+S+2}, ..., pos_{(i-1)S+L}, pos_{(i-1)S+L+1}, ..., pos_{(i-1)S+L+S}]\). We have
- \(POS_i=POSn_i \oplus POSo_i\), and
- \(POS_{i+1}=POSo_{i+1} \oplus POSn_{i+1}\), where
- \(POSn_i=[pos_{(i-1)S+1}, pos_{(i-1)S+2},..., pos_{(i-1)S+S}]\),
- \(POSo_i=POSo_{i+1}=[pos_{(i-1)S+S+1}, ..., pos_{(i-1)S+L}]\), and
- \(POSn_{i+1}=[pos_{(i-1)S+L+1}, ..., pos_{(i-1)S+L+S}]\).

PosEncoding(POS) represents encoding POS by Eq. 2. We have:

- \(Pn_{i}=PosEncoding(POSn_i)\), and
- \(Po_{i}=PosEncoding(POSo_i)\), so
- \(P_{i}=Pn_{i} \oplus Po_{i}\).

Because \(POSo_i=POSo_{i+1}\), we have:

- \(Pn_{i+1}=PosEncoding(POSn_{i+1})\), therefore
- \(P_{i+1}= Po_{i+1} \oplus Pn_{i+1}= Po_{i} \oplus Pn_{i+1}\).
Theorem 1 is proven.
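The reuse argument above can be sketched numerically. The following is a minimal NumPy illustration with toy sizes, assuming the standard sinusoidal encoding for Eq. 2; it is not the authors' implementation:

```python
import numpy as np

def pos_encoding(positions, d_model):
    """Sinusoidal positional encoding over absolute positions (assumed form of Eq. 2)."""
    pe = np.zeros((len(positions), d_model))
    div = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(positions[:, None] * div)
    pe[:, 1::2] = np.cos(positions[:, None] * div)
    return pe

L, S, d_model = 8, 3, 16  # toy window length, stride, and model dimension

# Input i covers absolute positions [0, L); input i+1 covers [S, S+L).
P_i  = pos_encoding(np.arange(0, L), d_model)
P_i1 = pos_encoding(np.arange(S, S + L), d_model)

# Overlap Po_i = Po_{i+1}: rows S..L-1 of input i equal rows 0..L-S-1 of
# input i+1, so only the S rows of Pn_{i+1} must be computed anew.
assert np.allclose(P_i[S:], P_i1[:L - S])
```

Because the encoding depends only on the absolute position of each record, the L − S overlapping rows carry over unchanged between consecutive inputs.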
B Proof of Theorem 2
The notations in Appendix A are also used in this section. For relative positions, the positional information of every input/output is the same, so \(POS_1= POS_2=...= POS_i=POS_{i+1}=[pos_{1}, pos_{2},..., pos_{L}]\). Therefore, we have \(P_1= P_2=...=P_i=P_{i+1}=PosEncoding(POS_1)\). Theorem 2 is proven.
C Proof of Theorem 3
The temporal embedding takes the temporal information of the records, such as week, month, and holiday, as the basis for embedding. The temporal information of a record does not change across training samples. Therefore, the temporal embeddings of all records belonging to the overlapping part between inputs i and \(i+1\) are the same. The proof is similar to that of the absolute positional embedding in Appendix A. Thus, \(To_{i+1}=To_{i}\), and Theorem 3 is proven.
D Proof of Theorem 4
The notations in Appendix A are also used in this section. We need to prove that \(Vr_{i} = Vr_{i+1}\). When computing the value embedding of input i, the input is convolved by a kernel of width k. Since the default stride is one, the resulting embedding values of records \(x_{(i-1)S+1}\) to \(x_{(i-1)S+L-(k-1)}\) are fully convolved without including any padding values. Similarly, for input \(i+1\), the embedding values of records \(x_{(i-1)S+S+1}\) to \(x_{(i-1)S+L-(k-1)}\) are fully convolved. Therefore, the embedding values of records \(x_{(i-1)S+S+1}\) to \(x_{(i-1)S+L-(k-1)}\) are the same for both inputs i and \(i+1\), i.e., \(Vr_{i} = Vr_{i+1}\), which proves Theorem 4.
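The overlap of fully convolved outputs can be checked with a small sketch. The stride-1, no-padding ("valid") convolution below stands in for the value-embedding Conv1d; sizes and data are illustrative only:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """1-D convolution with stride 1 and no padding, one output per full window."""
    k = len(kernel)
    return np.array([np.dot(x[j:j + k], kernel) for j in range(len(x) - k + 1)])

rng = np.random.default_rng(0)
L, S, k = 10, 3, 3                      # toy window length, stride, kernel width
stream = rng.standard_normal(L + S)     # enough records for two overlapping inputs
kernel = rng.standard_normal(k)

V_i  = conv1d_valid(stream[:L], kernel)        # value embedding of input i
V_i1 = conv1d_valid(stream[S:S + L], kernel)   # value embedding of input i+1

# Outputs depending only on overlapping records coincide, so input i+1 only
# needs the last S + k - 1 records convolved anew (k - 1 border outputs redone).
assert np.allclose(V_i[S:], V_i1[:L - k + 1 - S])
```

Only the outputs near the window borders, which mix overlapping and new records, differ between the two inputs; everything else is reusable.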
E Proof of Theorem 5
Query, Key, and Value are computed as in Eq. 7. Such multiplication does not change the value distribution of the original input embedding. Therefore, Query, Key, and Value can be incrementally computed in a similar manner to the input embedding, whose incremental computation is proved in Theorems 1, 3, and 4. Therefore, Theorem 5 is proven.
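The key property is that the projection in Eq. 7 acts row-wise on the embedding, so rows belonging to the overlap project to identical rows. A minimal sketch with an illustrative (hypothetical) query projection matrix `W_q`:

```python
import numpy as np

rng = np.random.default_rng(1)
L, S, d = 6, 2, 8                            # toy window length, stride, embedding dim

E_overlap = rng.standard_normal((L - S, d))  # embeddings shared by inputs i and i+1
E_new     = rng.standard_normal((S, d))      # embeddings of the S new records of input i+1
W_q       = rng.standard_normal((d, d))      # hypothetical query projection matrix

# Projecting the full embedding equals stacking the projections of its parts,
# so the overlap's Query rows computed for input i can be reused for input i+1.
Q_full        = np.vstack([E_overlap, E_new]) @ W_q
Q_incremental = np.vstack([E_overlap @ W_q, E_new @ W_q])
assert np.allclose(Q_full, Q_incremental)
```

The same argument applies verbatim to the Key and Value projections, since each is also a single matrix multiplication applied independently to every row.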
F Proof of Theorem 6
InTrans is implemented on top of Informer by adopting the incremental computation of the temporal/positional/value embeddings and the Query/Key/Value of the training samples. To prove that InTrans has the same predicting accuracy as Informer, we have to prove that the temporal/positional/value embeddings and Query/Key/Value incrementally computed by InTrans are the same as those computed by Informer. Theorems 1, 2, 3, 4, and 5 have proved that the embeddings of the input and the Query/Key/Value incrementally computed by InTrans are the same as those non-incrementally computed by Informer. Theorem 6 is proven.
G Proof of Theorem 7
Equations 4 and 5 suggest that the positional and temporal embeddings can be incrementally computed, which is proved in Theorems 1 and 3. For each input \(i+1\), the positional embedding (\(Po_i\)) and the temporal embedding (\(To_i\)) corresponding to the overlapping part between inputs i and \(i+1\), computed when embedding input i, can be reused for input \(i+1\). Therefore, only the positional embedding (\(Pn_{i+1}\)) and the temporal embedding (\(Tn_{i+1}\)) corresponding to the non-overlapping part between inputs i and \(i+1\) need to be computed when embedding input \(i+1\). The size of the non-overlapping part between inputs i and \(i+1\) is S, so the time complexity of computing the positional and temporal embedding of each input is O(S).
Similarly, value embedding is done using Conv1d. Equation 6 and Theorem 4 suggest that the value embedding of the overlapping part between inputs i and \(i+1\), excluding the value embedding of the last \(k-1\) records, computed for input i can be reused when embedding input \(i+1\). Therefore, the time complexity of computing the value embedding of each input is \(O((S+k-1)*k)\).

Similarly to value embedding, Eq. 6 and Theorem 4 suggest that the time complexity of computing Query, Key, or Value is \(O((S+k-1)*d_{dim})\). Theorem 7 is proven.
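For intuition, the per-input operation counts implied by these complexities can be tabulated for illustrative sizes (the constants below are examples, not measurements from the paper):

```python
# Per-input embedding cost in unit operations, following the complexities in
# Theorem 7, for illustrative sizes: L = input length, S = stride (non-overlap),
# k = kernel width, d = number of dimensions.
L, S, k, d = 96, 1, 3, 512

informer_pos = L                   # O(L) positional/temporal embedding
intrans_pos  = S                   # O(S)
informer_val = L * k               # O(L*k) value embedding
intrans_val  = (S + k - 1) * k     # O((S+k-1)*k)
informer_qkv = L * d               # O(L*d_dim) per Query/Key/Value
intrans_qkv  = (S + k - 1) * d     # O((S+k-1)*d_dim)

for name, a, b in [("pos/temporal", informer_pos, intrans_pos),
                   ("value", informer_val, intrans_val),
                   ("Q/K/V", informer_qkv, intrans_qkv)]:
    print(f"{name}: Informer {a} vs InTrans {b} ops")
```

With a stride of one record, the incremental cost per input depends only on S and k, not on L, which is where the reported speedup comes from.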
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bou, S., Amagasa, T., Kitagawa, H. (2022). InTrans: Fast Incremental Transformer for Time Series Data Prediction. In: Strauss, C., Cuzzocrea, A., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2022. Lecture Notes in Computer Science, vol 13427. Springer, Cham. https://doi.org/10.1007/978-3-031-12426-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12425-9
Online ISBN: 978-3-031-12426-6