1 Introduction

Multivariate time series, which record the variations of multiple variables over time, are crucial for understanding and predicting the behaviors of complex systems (Du et al. 2020). Such data find extensive applications across various domains, including financial market analysis, climate monitoring, and health monitoring. For instance, in financial markets, analysts can predict market trends and devise strategies by analyzing time series of stock prices, trading volumes, and macroeconomic indicators (Moews et al. 2019). In the field of climate science, time series data such as temperature, precipitation, and wind speed assist scientists in predicting weather changes, thereby facilitating early warnings for natural disasters (Zamora-Martinez et al. 2014). In the health sector, continuous monitoring of physiological indicators like heart rate and blood pressure plays a vital role in the early diagnosis and treatment of diseases (Fernandez-Fraga et al. 2018).

However, in practical applications, multivariate time series frequently exhibit missing values due to various reasons such as equipment failure, data transmission errors, or human oversight during data collection (Bashir and Wei 2018). Missing data not only reduces the amount of usable information but can also introduce biases that affect the accuracy and reliability of data analysis (Pan et al. 2022). For example, in financial analysis, missing transaction data can lead to misinterpretations of market dynamics; in climate monitoring, the absence of data at crucial time points can impair the prediction of extreme weather events; in health monitoring, missing physiological data can obscure the true health status of a patient. Therefore, effectively imputing these missing values is a critical task for enhancing the accuracy of multivariate time series analysis and the quality of decision support.

Despite the immense potential of multivariate time series in analysis and application, the incompleteness of data frequently poses a significant challenge in practical applications. Traditional data imputation methods, such as linear interpolation, nearest neighbor imputation, or simple statistical methods (e.g., mean or median substitution), often fall short when dealing with time series data characterized by complex temporal dependencies and multivariate features. These methods tend to overlook the inherent dynamic properties of time series data, such as autocorrelation, as well as the interactions between different variables, which are indispensable attributes of multivariate time series.

Autocorrelation in time series means that the value of a data point is strongly connected to its historical values. For example, in financial time series, the price of a stock on any given day is typically closely related to its price on the previous day. Moreover, variables within a multivariate time series may interact complexly, such as the interplay among economic indicators or the correlation between different physiological parameters in health monitoring. If these characteristics are ignored during the imputation process, the imputed data may not accurately reflect the original dynamics of the time series, thereby affecting the accuracy of subsequent data analysis and decision-making processes.

Given the complexity of multivariate time series, the limitations of traditional imputation techniques are apparent, underscoring the need for more refined strategies to effectively address these challenges. In response to this, recurrent neural networks (RNNs) have been widely employed in the task of time series imputation. RNNs are particularly suited to this purpose due to their ability to model temporal dependencies, which is critical in understanding and reconstructing missing patterns in time series data.

However, most RNN-based imputation methods use a predictive approach to impute missing values. As shown in Fig. 1, these methods train a model to learn the patterns before time step t, predict the value at time step t, and use the predicted value to impute the missing values. They do not take into account the observed values at and after time step t when estimating the missing values at time step t. Inspired by Kalman smoothing, which leverages both past observations and future data to provide a more accurate estimation of the current state, we extract a feature representation of the time series and use this representation to reconstruct the entire series. In this way, when reconstructing the value at each time step, the model incorporates the global data distribution information of the entire input time series.

Fig. 1

RNN-based imputation methods with a predictive approach

Therefore, in this paper, we propose a Long Short-Term Memory Network based Recurrent Autoencoder with Imputation Units and Temporal Attention Imputation Model (RATAI) that imputes incomplete time series through a two-stage strategy of prediction and reconstruction. In the first stage, also called the prediction stage, the imputation units in the encoder predict and impute missing values using the temporal history information and attribute correlations of the input time series. At the end of the first stage, the encoder extracts the feature representation of the imputed time series, which captures the data distribution characteristics of the entire imputed series. In the second stage, also called the reconstruction stage, the decoder uses this feature representation to reconstruct the input time series and imputes the missing values with the reconstructed values. In contrast to the prediction-based imputation of the first stage, the reconstruction-based imputation of the second stage resembles Kalman smoothing, where the system's states during a certain period are estimated after acquiring all the information pertaining to the system within that period. Furthermore, the temporal attention mechanism introduced in RATAI allows the decoder to attend to the most relevant inputs when reconstructing each time step. Additionally, owing to the imputation units, RATAI can be trained directly on incomplete data without setting initial values for the missing entries, avoiding the impact of unreasonable initial values on model training.

The main contributions of RATAI are as follows:

  1. 1.

    RATAI addresses the key shortcomings of existing RNN-based imputation models which often neglect the contextual impact of data around the missing values, enhancing the accuracy of the imputation by integrating past and future data insights similar to Kalman smoothing techniques.

  2. 2.

    The RATAI model introduces a dual-stage imputation process with an encoder-decoder structure. The encoder efficiently utilizes temporal and attribute correlations to predict missing values, while the decoder reconstructs a complete and coherent time series, ensuring the preservation of original data dynamics.

  3. 3.

    Our experiments demonstrate that RATAI outperforms benchmark models in terms of accuracy, confirming its effectiveness in real-world multivariate time series imputation scenarios. Meanwhile, ablation studies confirm the effectiveness of each component within the model.

2 Related works

Existing methods for processing missing values are primarily divided into deletion-based methods and imputation-based methods (Liu et al. 2018). Deletion-based methods remove samples or attributes that contain missing values from the original dataset, yielding a complete dataset of reduced size (Ribeiro and Freitas 2021). Although deletion-based methods have low time and space overhead, they are only suitable for incomplete datasets with a low missing rate whose values are missing completely at random. When applied to datasets with high missing rates, deletion-based methods can discard a large amount of valuable information, which may affect the accuracy of analysis results. In contrast, imputation-based methods estimate a reasonable value for each missing value based on the observed data, resulting in a complete dataset of the same size as the original (Zhou et al. 2021). Common imputation-based methods include statistical imputation methods and machine learning-based imputation methods.

Statistical-based imputation methods utilize statistical theories to model time series and estimate missing values. One of the simplest is mean imputation, which fills missing values in an attribute with the mean of the observed values of that attribute. Although mean imputation is simple, fast, and produces values within a reasonable range, the imputed results are overly concentrated, reducing the dispersion of the attribute value distribution. Another representative statistical-based method is the Autoregressive (AR) model (Liu et al. 2018). For example, Choong et al. (2009) used AR models to deal with missing values in DNA microarray time series data and achieved good results when a particular time point of the data is missing entirely. Bashir and Wei (2018) proposed a vector autoregressive model-based imputation method to impute missing values in multidimensional time series.

Machine learning-based methods use modeling to extract valid information from incomplete data and make reasonable inferences about the missing values. These methods include K-Nearest Neighbor (KNN) (Lee and Styczynski 2018), Dynamic Time Warping (DTW) (Pan et al. 2023), Expectation Maximization (EM) (Zhou et al. 2021), Matrix Completion/Matrix Factorization (Dean and Varshney 2021), and Principal Component Analysis (PCA) (Josse et al. 2011). For instance, Sun et al. (2017) proposed a KNN-based imputation method for traffic data; they introduced a gap-sensitive window mechanism in KNN, which improved its performance when imputing time series data. Oehmcke et al. (2016) employed DTW as the distance metric in KNN instead of pointwise distance measurements, and their method showed better results with consecutively missing values than state-of-the-art algorithms.

In machine learning-based imputation methods, a large portion of research is based on neural networks, including Multilayer Perceptron (MLP) (Silva-Ramírez et al. 2015; Jung et al. 2020; Pan et al. 2023), Autoencoder (AE) (Zamanzadeh et al. 2021), Generative Adversarial Networks (GAN) (Zhang et al. 2021; Miao et al. 2021), Recurrent Neural Networks (RNN) (Ma et al. 2022; Ho et al. 2022; Weerakody et al. 2023), Self-attention-based networks (Suo et al. 2020; Yang et al. 2021; Ma et al. 2022; Du et al. 2023; Zhou et al. 2021; Zeng et al. 2023), and others. Among these, RNNs have shown superiority in processing time series, and many RNN-based imputation methods have been proposed in recent years. For example, Yoon et al. (2019) proposed a bi-directional RNN-based imputation method (M-RNN), in which the missing values are initially set to 0 and then estimated in two stages: the first stage uses the temporal correlation within each attribute of the time series to estimate the missing values, and the second stage uses the correlations between attributes. Che et al. (2018) proposed a Gated Recurrent Unit (GRU) based imputation model named GRU-D, which introduced a decay mechanism to generate input decay rates based on time intervals and dynamically combine previous observations with the average value to estimate missing values. Ma et al. (2020) proposed the linear memory vector recurrent neural network (LIME-RNN) to handle missing values. They introduced a linear memory vector that integrates over previous hidden states of the RNN to impute missing values, and they tested different recurrent units, including RNN, LSTM, and GRU. Cao et al. (2018) proposed a bi-directional recurrent imputation model for time series called BRITS, which uses a bi-directional LSTM to impute multivariate time series; the imputed values in BRITS are treated as variables in the RNN graph and can be updated during backpropagation. Suo et al. (2020) proposed a global and local time series imputation model with multi-directional attention learning named GLIMA, which uses a two-layer GRU structure to learn both global and local dependencies of multivariate time series and introduces an attention mechanism to capture distant correlations across both time and features. Cui et al. (2020) proposed a stacked bidirectional LSTM-based imputation method (SBU-LSTM) for imputing missing values in traffic data, and the results showed the advantages of the stacked bidirectional LSTM model in traffic data prediction.

3 Preliminaries

3.1 Long short-term memory network

Long Short-Term Memory Network (LSTM) is a type of Recurrent Neural Network (RNN) (Greff et al. 2017). The key idea behind LSTM is the introduction of memory cells, which allow the network to selectively learn and forget information over time. Therefore, in our proposed model, we use LSTM units as the recurrent units. Figure 2 shows the structure of an LSTM unit. An LSTM unit contains three gates: the input gate, the forget gate, and the output gate. These gates control the flow of information into and out of the unit, while the memory cell \({C}_{t}\) and hidden state \({h}_{t}\) store long-term and short-term information, respectively.

Fig. 2

The structure of the LSTM unit at time step t

The forget gate controls the amount of information from the previous time step that is retained in the unit. The output of the forget gate at time step t is

$$\begin{aligned} {f}_{t}=\sigma \left( {W}_{f}{x}_{{t}}+{U}_{f}{h}_{{t}-{1}}+{b}_{f}\right) \end{aligned},$$
(1)

where \({h}_{{t}-{1}}\) is the previous hidden state, \({x_t}\) is the current input, \({W}_{f}\), \({U}_{f}\) and \({b}_{f}\) are learnable weights and bias, \(\sigma (\cdot )\) is the sigmoid activation function.

The input gate controls the amount of new information that is added to the memory cell. The output of the input gate at time step t is

$$\begin{aligned} {i}_{t}=\sigma \left( {W}_{i}x_t+{U}_{i}{h}_{{t}-{1}}+{b}_{i}\right) \end{aligned},$$
(2)

where \({W}_{i}\), \({U}_{i}\) and \({b}_{i}\) are learnable weights and bias.

The output gate controls the amount of information that is output from the unit and it is calculated by

$$\begin{aligned} {o}_{t}=\sigma \left( {W}_{o}x_t+{U}_{o}{h}_{{t}-{1}}+{b}_{o}\right) \end{aligned},$$
(3)

where \({W}_{o}\), \({U}_{o}\) and \({b}_{o}\) are learnable weights and bias.

The candidate memory cell \({\hat{{C}}}_{t}\) represents the new candidate values to be added to the memory cell and it is calculated by

$$\begin{aligned} {\hat{{C}}}_{t}=tanh\left( {W}_{c}x_t+{U}_{c}{h}_{{t}-{1}}+{b}_{c}\right) \end{aligned},$$
(4)

where \({W}_{c}\), \({U}_{c}\) and \({b}_{c}\) are learnable weights and bias, \(tanh(\cdot )\) is the tanh activation function.

The memory cell \({C}_{t}\) stores long-term information and is updated by

$$\begin{aligned} {C}_{t}={f}_{t}\circ {C}_{{t}-{1}}+{i}_{t}\circ {\hat{{C}}}_{t} \end{aligned},$$
(5)

where \(\circ\) denotes element-wise multiplication.

The hidden state \({h}_{t}\) is the output of LSTM unit, and it is calculated by

$$\begin{aligned} {h}_{t}={o}_{t}\circ tanh\left( {C}_{t}\right) \end{aligned}$$
(6)

These equations allow the LSTM unit to selectively learn and forget information over time, making it suitable for tasks involving sequential data.
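
For illustration, the following NumPy sketch implements one forward step of an LSTM unit following Equations (1) to (6); the dimensions and randomly initialized weights are assumptions for the example, not the configuration used in RATAI.

```python
# Minimal sketch of one LSTM forward step, Eqs. (1)-(6). Sizes and random
# weights are illustrative assumptions only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gates (Eqs. 1-3), candidate cell (Eq. 4),
    cell update (Eq. 5), and hidden state (Eq. 6)."""
    W_f, U_f, b_f = params["f"]
    W_i, U_i, b_i = params["i"]
    W_o, U_o, b_o = params["o"]
    W_c, U_c, b_c = params["c"]

    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)      # forget gate, Eq. (1)
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)      # input gate, Eq. (2)
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)      # output gate, Eq. (3)
    c_hat = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)    # candidate cell, Eq. (4)
    c_t = f_t * c_prev + i_t * c_hat                   # memory cell update, Eq. (5)
    h_t = o_t * np.tanh(c_t)                           # hidden state, Eq. (6)
    return h_t, c_t

# Toy usage with D = 3 input features and H = 4 hidden units.
rng = np.random.default_rng(0)
D, H = 3, 4
params = {k: (rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H))
          for k in ("f", "i", "o", "c")}
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, params)
```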

3.2 Encoder-decoder framework

Autoencoder (AE) is a type of neural network used for self-supervised learning. Its purpose is to reconstruct input data, typically used for dimensionality reduction, feature extraction, and anomaly detection. AE is composed of an encoder and a decoder. The encoder maps the input data into a latent representation, while the decoder maps the latent representation back to the original input space (Rumelhart et al. 1986).

As shown in Fig. 3, the basic principle of AE is to minimize the difference between the input and the reconstructed output. The reconstruction error is typically measured by a loss function, such as mean squared error (MSE) or binary cross-entropy. The objective is to minimize this loss function using an optimization algorithm, such as stochastic gradient descent (SGD) or Adam.

Fig. 3

The structure of AE with one hidden layer

The encoding phases of AE can be formulated as

$$\begin{aligned} {h}=\sigma \left( {W}_{{enc}}x+{b}_{{enc}}\right) \end{aligned},$$
(7)

where x is the input data, h is the latent representation, \(\sigma (\cdot )\) is the activation function, \({W}_{{enc}}\) and \({b}_{{enc}}\) are learnable weight and bias.

The decoding phases of AE can be formulated as

$$\begin{aligned} y=\sigma \left( {W}_{{dec}}{h}+{b}_{{dec}}\right) \end{aligned},$$
(8)

where y is the output data, \({W}_{{dec}}\) and \({b}_{{dec}}\) are learnable weight and bias.

The advantage of AE lies in its ability to capture the underlying structure and patterns in the input data. By learning a compact and meaningful representation of the input data, AE can be used for feature extraction, where the latent representation serves as input to other supervised learning algorithms, such as classification or regression. Since the AE does not contain units that can store temporal information, it cannot learn the dynamic features of multivariate time series and therefore cannot be directly used to model time series data.
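
As a concrete illustration of Equations (7) and (8), the following sketch performs one encode-decode pass of a single-hidden-layer AE; the dimensions and random weights are placeholders, since in practice they would be learned by minimizing the reconstruction loss.

```python
# Minimal sketch of the single-hidden-layer AE, Eqs. (7)-(8): encode x into h,
# then decode h back to y. Weights are random placeholders for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, latent_dim = 8, 3
W_enc, b_enc = rng.normal(size=(latent_dim, input_dim)), np.zeros(latent_dim)
W_dec, b_dec = rng.normal(size=(input_dim, latent_dim)), np.zeros(input_dim)

x = rng.normal(size=input_dim)
h = sigmoid(W_enc @ x + b_enc)                 # encoding, Eq. (7)
y = sigmoid(W_dec @ h + b_dec)                 # decoding, Eq. (8)
reconstruction_error = np.mean((x - y) ** 2)   # MSE loss to be minimized
```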

Recurrent Autoencoder (RAE) is a variant of autoencoder that is designed to handle sequential data, such as time series or natural language text. The key difference between RAE and traditional AE is that RAE uses RNNs as the encoder and decoder, allowing it to capture the temporal dependencies in the input data. Figure 4 illustrates the structure of the LSTM-based Recurrent Autoencoder which is capable of extracting dynamic features from univariate or multivariate time series.

Fig. 4

The structure of the LSTM-based RAE

Let \({X}=\left[ x_1,{\ x}_2,\ \ \ldots ,\ \ x_T\right]\), where \(x_t\in \mathbb {R}^D\), be a time series of length T. At time step t, the hidden state \({h}_{t}^{e}\) in encoder is updated by

$$\begin{aligned} {h}^{e}_{t}=LSTM\left( {h}^{e}_{t-1},{x}_{t}\right)\end{aligned},$$
(9)

where \({x_t}\) is the input at time step t, \({h}^{e}_{t}\) and \({h}^{e}_{{t}-{1}}\) are the hidden states of the encoder at time steps t and \(t-1\), respectively, and \(LSTM(\cdot )\) denotes the forward computation of the LSTM unit described by Equations (1) to (6). The hidden state \({h}^{e}_{T}\) at the last time step T is then used as the compressed representation of X.

The reconstructed time series \(\left[ y_1,{\ y}_2,\ \ \ldots ,\ \ y_T\right]\) can be decoded from the compressed representation \({h}^{e}_{T}\). The reconstructed value \(y_{t^\prime }\) at decoding time step \(t^\prime\) is generated by

$$\begin{aligned} y_{t^\prime }=\sigma \left( {W}_{o}{h}^{d}_{{t}^\prime }+{b}_{o}\right) \end{aligned}$$
(10)

while for \(t^\prime =1\), the hidden state is

$$\begin{aligned} {h}^{d}_{{t}^\prime }=LSTM\left( {h}^{e}_{T},{zero}\right) ,\ \ t^\prime =1 \end{aligned}$$
(11)

while for \(t^\prime \ge 2\), the hidden state is

$$\begin{aligned} {h}^{d}_{{t}^\prime }=LSTM\left( {h}^{d}_{{t}^\prime -{1}},{y}_{{t}^\prime-1}\right) ,\ \ t^\prime \ge 2 \end{aligned},$$
(12)

where \(y_{t^\prime }\) is the output at time step \(t^\prime\), \({h}^{d}_{{t}^\prime }\) is the hidden state of the decoder at decoding time step \(t^\prime\), zero is a zero vector of the same dimension as \(y_{t^\prime }\), \(\sigma (\cdot )\) is the activation function, and \({W}_{o}\) and \({b}_{o}\) are learnable weight and bias. The decoder's role is to take the compressed representation produced by the encoder and use it to reconstruct the original input sequence.

By taking RNNs as the encoder and decoder, RAE can capture the temporal dependencies in the input data, which can be challenging for traditional feedforward neural networks.
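
To make the encode-decode unrolling concrete, the following PyTorch sketch assembles an LSTM-based RAE along the lines of Equations (9) to (12). The class name, layer sizes, and the use of torch.nn.LSTMCell are illustrative assumptions rather than the exact implementation of the model proposed later.

```python
# Minimal sketch of an LSTM-based RAE, Eqs. (9)-(12): the encoder compresses
# the series into its final state, and the decoder unrolls from that state,
# feeding each reconstruction back as the next input. Illustrative only.
import torch
import torch.nn as nn

class RecurrentAutoencoder(nn.Module):
    def __init__(self, n_features: int, hidden_size: int):
        super().__init__()
        self.encoder_cell = nn.LSTMCell(n_features, hidden_size)
        self.decoder_cell = nn.LSTMCell(n_features, hidden_size)
        # y_{t'} = activation(W_o h^d_{t'} + b_o), Eq. (10); a linear map is
        # used here, assuming standardized real-valued data.
        self.out = nn.Linear(hidden_size, n_features)
        self.hidden_size = hidden_size
        self.n_features = n_features

    def forward(self, x):                      # x: (batch, T, n_features)
        batch, T, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)
        for t in range(T):                     # encoding, Eq. (9)
            h, c = self.encoder_cell(x[:, t], (h, c))

        ys = []
        y_prev = x.new_zeros(batch, self.n_features)   # zero-vector input, Eq. (11)
        for _ in range(T):                     # decoding, Eqs. (11)-(12)
            h, c = self.decoder_cell(y_prev, (h, c))
            y_prev = self.out(h)
            ys.append(y_prev)
        return torch.stack(ys, dim=1)          # reconstruction, (batch, T, n_features)

# Toy usage: reconstruct a random batch of series of length T = 7.
model = RecurrentAutoencoder(n_features=5, hidden_size=16)
recon = model(torch.randn(2, 7, 5))
```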

4 Proposed methods

In this section, we propose an LSTM-based Recurrent Autoencoder with Imputation Units and Temporal Attention Imputation Model (RATAI) to impute missing values in multivariate time series. We define a multivariate time series of length T as \({X}=\left[ x_1,{\ x}_2,\ \ \ldots ,\ \ x_T\right]\), where \(x_t\in \mathbb {R}^D\). \(x_{td}\) denotes the value of the dth attribute at time step t. Since X may contain missing values, we introduce the corresponding missing indicator matrix \({M}=\left[ m_1,{\ m}_2,\ \ \ldots ,\ \ m_T\right]\), where \(m_t\in \mathbb {R}^D\). If \(m_{td}=1\), it means \(x_{td}\) is observed; conversely, if \(m_{td}=0\), it means \(x_{td}\) is missing.

The architecture of the proposed model is illustrated in Fig. 5. RATAI consists of an encoder and a decoder. The encoder is composed of a series of LSTM units and imputation units. The imputation units take the time series, the corresponding missing indicator matrix, and the temporal information from previous LSTM units as input, and use a predictive approach based on temporal and attribute correlation information to perform the first-stage imputation of the missing values in the input time series. The last state of the encoder is used as the feature representation of the imputed time series. The decoder utilizes this feature representation to reconstruct the input sequence and performs the second-stage imputation of the missing values using the reconstructed values. Additionally, the temporal attention mechanism enables the decoder to focus on the most relevant inputs when reconstructing the time series. The output of RATAI is the reconstruction of the imputed time series, and the two-stage imputation approach based on prediction and reconstruction allows RATAI to incorporate not only temporal information and attribute correlation but also global data distribution information of the input time series, overcoming the insufficient use of information caused by relying solely on predicted values for imputation.

Fig. 5

The architecture of RATAI

4.1 LSTM-based encoder with imputation unit

Due to the use of incomplete data in the training of RATAI, the input data may contain missing values that cannot be directly handled by LSTM units. Simply replacing missing values in the input with zeros or other inappropriate initial values can significantly alter the data distribution of the time series, affecting the LSTM's ability to learn effectively from it. Reasonable imputation of missing values not only provides complete training data for the LSTM but also alleviates the impact of missing values on feature representation extraction. Therefore, we design an imputation unit that utilizes the temporal and attribute correlation information of the time series to impute the missing values in the first stage. The structure of the imputation unit is illustrated in Fig. 6.

Fig. 6

The structure of the imputation unit

The imputation unit at time step t utilizes the output \({h}^{{e}}_{{t}-{1}}\) of the previous LSTM unit at time step \(t-1\) and the observed values \(x_t\) at time step t, which carry temporal history information and attribute correlation information, respectively, to generate the estimated values \(z_t^\prime\) at time step t. The estimated values \(z_t^\prime\) and the observed values \(x_t\) are combined to form \({\widetilde{x}}_t\), which is the final input of the current LSTM unit and then participates in the forward propagation process described by Equations (1) to (6). The output of the imputation unit is the combined value \({\widetilde{x}}_t\), calculated by

$$\begin{aligned} {\widetilde{x}}_t=x_t\circ m_t+z_t^\prime \circ \left( 1-m_t\right) \end{aligned},$$
(13)

where \(x_t\) and \(m_t\) represent the input time series and missing matrix at time step t, respectively, \(z_t^\prime\) represents the estimated values at time step t.

The estimated values \(z_t^\prime\) is calculated by

$$\begin{aligned} z_t^\prime =a_t+{W}_{{z}^\prime }{h}^{{e}}_{{t}-{1}}+{b}_{{z}^\prime } \end{aligned},$$
(14)

where \({h}^{{e}}_{{t}-{1}}\) represents the output of the LSTM unit in the encoder at time step \(t-1\). \({W}_{{z}^\prime }\) and \({b}_{{z}^\prime }\) represent the weight matrix and the bias, respectively. \({W}_{{z}^\prime }{h}^{{e}}_{{t}-{1}}\) represents the predicted values calculated from the temporal history information. \(a_t\) represents the estimated values calculated from the correlation information among the attributes at time step t, and \(a_t\) is generated by

$$\begin{aligned} a_t=tanh\left( {W}_{a}x_t^\prime \right) \end{aligned},$$
(15)

where tanh is the activation function, \({W}_{a}\) represents the weight matrix with zero diagonal elements, as follows

$$\begin{aligned} {W}_{a}=\left[ \begin{array}{cccc} 0 & w_{12} & \cdots & w_{1D}\\ w_{21} & 0 & \cdots & w_{2D}\\ \vdots & \vdots & \ddots & \vdots \\ w_{D1} & w_{D2} & \cdots & 0\\ \end{array} \right] \end{aligned}$$
(16)

To eliminate the strong effect of self-correlation, we restrict the diagonal elements of the weight matrix \({W}_{a}\) to be all zeros. Thus, the dth element \(a_{td}\) of \(a_t\) is calculated from \(x^\prime _{t1},\dots ,x^\prime _{t(d-1)}\) and \(x^\prime _{t(d+1)},\dots ,x^\prime _{tD}\), without using \(x^\prime _{td}\) itself. \(x_t^\prime\) represents the result of the first round of imputing the missing values in \(x_t\) using only temporal information and is calculated by

$$\begin{aligned} x_t^\prime =x_t\circ m_t+z_t\circ \left( 1-m_t\right) \end{aligned},$$
(17)

where \(x_t\) and \(m_t\) represent the input time series and the missing indicator vector at time step t, respectively, and \(z_t\) represents the predicted values obtained solely from temporal information, computed by

$$\begin{aligned} z_t={W}_{z}{h}^{{e}}_{{t}-{1}}+{b}_{z} \end{aligned},$$
(18)

where \({h}^{e}_{{t}-{1}}\) represents the output of the LSTM unit in the encoder at time step \(t-1\), and \({W}_{z}\) and \({b}_{z}\) represent the weight matrix and the bias, respectively. This is the first prediction of the missing values, aimed at imputing the missing values in \(x_t\) to facilitate the subsequent extraction of correlations between attributes.
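
For illustration, the following NumPy sketch traces one imputation-unit step through Equations (13) to (18); the dimensions, the example missing mask, and the random weights are assumptions for the example (in RATAI these weights are learned jointly with the encoder).

```python
# Minimal sketch of one imputation-unit step, Eqs. (13)-(18).
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                                    # attributes, hidden size
x_t = rng.normal(size=D)                       # observed values (missing entries arbitrary)
m_t = np.array([1.0, 0.0, 1.0, 1.0])           # missing indicator: 0 marks a missing attribute
h_prev = rng.normal(size=H)                    # encoder hidden state h^e_{t-1}

W_z, b_z = rng.normal(size=(D, H)), np.zeros(D)
W_zp, b_zp = rng.normal(size=(D, H)), np.zeros(D)
W_a = rng.normal(size=(D, D))
np.fill_diagonal(W_a, 0.0)                     # zero-diagonal constraint, Eq. (16)

z_t = W_z @ h_prev + b_z                       # temporal prediction, Eq. (18)
x_t_prime = x_t * m_t + z_t * (1 - m_t)        # first-round imputation, Eq. (17)
a_t = np.tanh(W_a @ x_t_prime)                 # attribute-correlation estimate, Eq. (15)
z_t_prime = a_t + W_zp @ h_prev + b_zp         # combined estimate, Eq. (14)
x_t_tilde = x_t * m_t + z_t_prime * (1 - m_t)  # final imputed LSTM input, Eq. (13)
```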

4.2 LSTM-based decoder with temporal attention

4.2.1 Feature representation extraction

As described in the previous section, the LSTM units in the encoder produce a hidden state at each time step, resulting in a sequence of hidden states \([{h}^{{e}}_{1},{h}^{{e}}_{2},\ldots ,{h}^{{e}}_{T}]\) at the end of encoding. A common approach to computing the feature representation of the imputed time series is to fuse the hidden states from all time steps of the encoder, which is calculated by

$$\begin{aligned} repr=Linear(concat[{h}^{{e}}_{1},{h}^{{e}}_{2},\ldots ,{h}^{{e}}_{T}]) \end{aligned},$$
(19)

where repr is the feature representation of input time series and shares the same dimension as each \({h}^{{e}}_{i}\), Linear is a fully-connected layer, repr is used as a part of the input of the decoder for further computations. Considering that the LSTM’s memory cell and hidden state store long-term and short-term memories of the time series, respectively, we utilize the memory cell \({C}^{{e}}_{T}\) and hidden state \({h}^{{e}}_{T}\) from the LSTM unit in the encoder at the last time step as the feature representation of imputed time series. Then, we set these two states as the initial states for the LSTM unit in the decoder at the starting time step, which reduces computational complexity and accelerates the model’s convergence.

4.2.2 Decoding for missing value imputation

Regardless of whether fusing the hidden states of various time steps or using the last time step’s memory cell state and hidden state as the feature representation, the expressive power of the feature representation for the sequence gradually decreases as the length of the input time series increases. To address this issue, this paper introduces a temporal attention mechanism for the multivariate time series imputation model, in which a temporal attention layer serves as an intermediate transition state between the encoder and decoder. Figure 7 depicts how the attention mechanism operates during the decoding process at time step 3. When reconstructing the value at time step 3, the process involves initial computation of similarity scores between \({h}^{{d}}_{2}\) and all hidden states \([{h}^{{e}}_{1},{h}^{{e}}_{2},\ldots ,{h}^{{e}}_{T}]\) obtained from the encoder. Subsequently, these hidden states are weighted and summed using the similarity scores, yielding the background vector \({s}_{3}\) at time step 3. This background vector serves as a partial input for the decoder during the reconstruction procedure.

Fig. 7

An example shows how the attention layer works

The memory cell state \({C}^{{e}}_{T}\) and hidden state \({h}^{{e}}_{T}\) of the last LSTM unit in the encoder are used as the initial states of the LSTM unit in the decoder. In the decoding phase, a zero vector is used as the initial input. The reconstructed value \(y_1\) at time step 1 is calculated based on the initial states, the initial input zero vector, and the context vector \({s}_{1}\). Subsequently, \(y_1\) is used as the input of the next time step, and this iteration continues until reconstructed values are obtained for all time steps. The value \(y_1\) is calculated as follows

$$\begin{aligned} y_1={W}_{y}{h}^{{d}}_{1}+{b}_{y} \end{aligned},$$
(20)

where \({W}_{y}\) and \({b}_{y}\) are the weight matrix and bias, respectively, and \({h}^{d}_{1}\) represents the hidden state of the decoder at time step 1, calculated by the following equation.

$$\begin{aligned} {h}^{{d}}_{1}=LSTM(h^{{e}}_T,[zero,s_1]) \end{aligned},$$
(21)

where \({h}^{{e}}_{T}\) is the last hidden state of the LSTM unit in the encoder, zero is the initial zero-vector input, \({s}_{1}\) is the context vector at decoding time step 1, and \([\cdot ]\) is the concatenation of vectors along the last dimension. For subsequent time steps,

$$\begin{aligned} & y_{t^\prime }={W}_{y}{h}^{{d}}_{{t}^\prime }+{b}_{y},t^\prime \ge 2 \end{aligned},$$
(22)
$$\begin{aligned} & {h}^{{d}}_{{t}^\prime }= LSTM(h^{d}_{t'-1},[y_{t'-1},s_{t'}]),t'\ge 2 \end{aligned},$$
(23)

where \({h}^{{d}}_{{t}^\prime }\) and \({h}^{{d}}_{{t}^\prime -{1}}\) are the hidden states of the LSTM unit in the decoder at time steps \(t^\prime\) and \(t^\prime -1\), respectively, \(y_{t^\prime -1}\) is the reconstructed value at decoding time step \(t^\prime -1\), and \({s}_{{t}^\prime }\) is the context vector at decoding time step \(t^\prime\). Let the hidden state of the encoder at time step t be \({h}^{e}_{t}\); the context vector \({s}_{{t}^\prime }\) of the decoder at time step \(t^\prime\) is calculated by

$$\begin{aligned} {s}_{{t}^\prime }=\sum _{t=1}^{T}a_{t^\prime t}{h}^{{e}}_{t} \end{aligned},$$
(24)

where \(a_{t^\prime t}\) is the attention weight, representing the contribution of encoder time step t to the reconstruction at decoding time step \(t^\prime\), and is calculated by

$$\begin{aligned} a_{t^\prime t}=\frac{\exp (e_{t^\prime t})}{\sum _{k=1}^{T}{\exp (e_{t^\prime k})}} \end{aligned},$$
(25)

where \(e_{t^\prime t}\) represents the similarity score between the hidden state of decoder at time step \(t^\prime\) and the hidden state of encoder at time step t. It is calculated by

$$\begin{aligned} e_{t^\prime t} = v^\top \tanh (W_{h^{d}} h^{d}_{t^\prime -1} + W_{h^e} h^e_t) \end{aligned},$$
(26)

where \({h}^d_{{t}^\prime -{1}}\) represents the hidden state of the decoder at decoding time step \(t^\prime -1\), \({h}^{e}_{t}\) represents the hidden state of the encoder at time step t, and \({W}_{h^d}\), \({W}_{h^e}\), and v are learnable weight matrices.
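
A minimal NumPy sketch of this attention step, following Equations (24) to (26), is given below; the sequence length, hidden size, and random parameters are illustrative assumptions.

```python
# Minimal sketch of temporal attention, Eqs. (24)-(26): score each encoder
# hidden state against the previous decoder state, normalize with softmax,
# and form the context vector.
import numpy as np

rng = np.random.default_rng(0)
T, H = 7, 8
H_enc = rng.normal(size=(T, H))                # encoder hidden states h^e_1..h^e_T
h_dec_prev = rng.normal(size=H)                # previous decoder state h^d_{t'-1}
W_hd, W_he = rng.normal(size=(H, H)), rng.normal(size=(H, H))
v = rng.normal(size=H)

# Similarity scores e_{t't}, Eq. (26)
e = np.array([v @ np.tanh(W_hd @ h_dec_prev + W_he @ H_enc[t]) for t in range(T)])
# Attention weights a_{t't}, Eq. (25): softmax over encoder time steps
a = np.exp(e - e.max())
a /= a.sum()
# Context vector s_{t'}, Eq. (24): weighted sum of encoder hidden states
s = (a[:, None] * H_enc).sum(axis=0)
```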

If the attention mechanism is not used, then the context vector \({s}_{{t}^\prime }\) is the same at different decoding time steps and is calculated by the following equation

$$\begin{aligned} {s}_{{t}^\prime }={W}_{s}[{h}^e_{1},{h}^e_{2},\ldots ,{h}^e_{T}] \end{aligned}$$
(27)

By introducing the attention mechanism, the decoder is able to pay attention to the input selectively when decoding, thereby mitigating the loss of information that can occur when generating background vectors for longer input sequences.

In the model training process, we use the Mean Square Error (MSE) to calculate the estimation loss. Since the data used for training may contain missing values whose true values are unavailable, we compute the error only between the model's output values and the corresponding observed values. At time step t, the output error of the model is

$$\begin{aligned} OutputError=(y_{t^{\prime }}-x_t)\circ m_t \end{aligned}$$
(28)

In addition, according to Equation (13) and Equation (17), we know that there are also two estimates of the missing values during the training process, so the total loss of the model should not contain only OutputError. Finally, at decoding time step \(t^{\prime }\), the TotalLoss of the model is

$$\begin{aligned} TotalLoss=(y_{t^{\prime }}-x_t)\circ m_t+(z_t-x_t)\circ m_t+(z^{\prime }_t-x_t)\circ m_t \end{aligned},$$
(29)

where \((z_t-x_t)\circ m_t\) and \((z^{\prime }_{t}-x_t)\circ m_t\) represent the errors of imputation using temporal information and attribute correlation information in the first stage, respectively.
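
The following NumPy sketch shows one way to turn the masked error terms of Equations (28) and (29) into a scalar training objective; averaging the squared masked differences is an assumption for illustration, not necessarily the exact reduction used in practice.

```python
# Minimal sketch of the masked loss, Eqs. (28)-(29): every error term is
# multiplied by the missing indicator so only observed entries contribute.
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error over observed entries only (mask == 1)."""
    diff = (pred - target) * mask
    return (diff ** 2).sum() / np.maximum(mask.sum(), 1.0)

def total_loss(y, z, z_prime, x, m):
    """Decoder reconstruction error plus the two first-stage estimation errors."""
    return (masked_mse(y, x, m)            # output vs. observations, Eq. (28)
            + masked_mse(z, x, m)          # temporal-only estimate z_t
            + masked_mse(z_prime, x, m))   # combined estimate z'_t
```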

During the training process, we adopt a many-to-many strategy. As illustrated in the left part of Fig. 8, the outputs of RATAI at all time steps participate in the optimization procedure. In testing, we adopt a many-to-one strategy; as illustrated in the right part of Fig. 8, only the output of the decoder at the last time step is used for imputation.

Fig. 8

The architecture of training model and estimation model

5 Experimental results and discussions

In this section, we conduct a series of experiments on six publicly available datasets to verify the effectiveness of the proposed RATAI and to exhibit the superiority of the proposed method by comparing its imputation performance with that of nine baseline models. We begin by introducing the datasets and describing the data-preprocessing methods, and then present the experimental evaluation of RATAI and the benchmark models on six multivariate time series datasets with different missing rates.

5.1 Datasets

The experimental datasets are CHB-MIT, PTB, DART, MPI_Roof, GT_2011, and Energy. The length of time series and number of attributes of selected datasets are shown in Table 1.

Table 1 Experimental datasets

CHB-MIT, collected at the Children's Hospital Boston, consists of electroencephalography (EEG) recordings of pediatric patients with intractable seizures. The database contains 916 h of EEG records, comprising 23 cases recorded from 22 patients aged 1.5 to 22 years.

PTB is a publicly available electrocardiogram (ECG) database collected from the German Heart Center. Each record includes 15 simultaneously measured signals: the conventional 12 leads together with the 3 Frank lead ECGs.

DART is a weather dataset containing 22 attributes with a sequence length of 4415 and a time interval of 30 min. The attributes include rainfall, temperature, humidity, UV radiation, etc.

MPI_Roof records weather data between January 1 and January 30, 2018, and contains 21 attributes with a sequence length of 4463 and a sampling interval of 10 min. The attributes include barometric pressure, temperature, humidity, wind speed, precipitation, radiation, etc.

GT_2011 is a gas turbine carbon monoxide and nitrogen oxide emissions dataset containing 11 attributes with a sequence length of 7411; the attributes include ambient temperature, ambient pressure, turbine temperature, carbon monoxide concentration, nitrogen oxide concentration, etc.

Energy is a residential appliance energy use dataset containing 28 attributes with a sequence length of 4506 and a sampling interval of 10 min, recording the energy consumption of appliances in a low-energy home between January 11 and February 11, 2016.

5.2 Experimental settings

In our experiments, we compare the imputation performance of proposed RATAI with several benchmark models for multivariate time series, including M-RNN, SBU-LSTM, LIME-LSTM, GRU-D, BRITS, SAITS, USGAN, Dlinear, and Informer.

To facilitate the evaluation of imputation performance, we partition each dataset into a train set, a validation set, and a test set. Specifically, the first 60% of the data is used as the train set for training, the next 10% is used as the validation set, and the remaining 30% is used as the test set for evaluating imputation accuracy. To construct incomplete datasets, we randomly remove values from each dataset with missing rates of 10%, 20%, 30%, 40%, and 50%, respectively. As shown in Fig. 9, we employ a sliding window on the pre-segmented dataset to acquire training data, with a window size of 7 and a stride of 1. The batch size is 64, and the numbers of parameters of the different methods are kept roughly equal. Early stopping is used during training to decide when to stop, and the maximum number of epochs is 600.

Fig. 9

Training samples are obtained using sliding window on the selected datasets
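
For illustration, the following NumPy sketch shows how the sliding windows and artificial missingness described above could be generated (window size 7, stride 1, random removal at a chosen missing rate); the exact sampling procedure used in the experiments may differ.

```python
# Minimal sketch of sliding-window sample construction and random masking.
import numpy as np

def sliding_windows(series, window=7, stride=1):
    """series: (T, D) -> (num_windows, window, D)."""
    T = series.shape[0]
    starts = range(0, T - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])

def random_mask(shape, missing_rate, seed=0):
    """Missing indicator matrix M: 1 = observed, 0 = artificially removed."""
    rng = np.random.default_rng(seed)
    return (rng.random(shape) >= missing_rate).astype(float)

data = np.random.default_rng(1).normal(size=(100, 5))   # toy series, T=100, D=5
windows = sliding_windows(data)                          # shape (94, 7, 5)
mask = random_mask(windows.shape, missing_rate=0.3)
```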

We adopt Root Mean Squared Error (RMSE) as the common evaluation metric to evaluate the imputation performance, which is calculated by

$$\begin{aligned} RMSE=\sqrt{\frac{\sum _{i\in \mathrm {\Omega }}\left( y_i-{\hat{y}}_i\right) ^2}{\left| \mathrm {\Omega }\right| }} \end{aligned},$$
(30)

where \(y_i\) and \({\hat{y}}_i\) denote the imputed value and ground truth of the ith missing value, respectively, \(\mathrm {\Omega }\) denotes the index set of missing values, and \(\left| \mathrm {\Omega }\right|\) denotes the size of the index set.
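
A small NumPy sketch of Equation (30), computing the RMSE only over the artificially removed entries, is shown below for clarity.

```python
# Minimal sketch of the evaluation metric in Eq. (30): RMSE over the index
# set Omega of artificially removed entries.
import numpy as np

def rmse_on_missing(imputed, ground_truth, mask):
    """mask: 1 = observed, 0 = artificially removed; RMSE over removed entries."""
    omega = (mask == 0)
    return np.sqrt(((imputed[omega] - ground_truth[omega]) ** 2).mean())
```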

5.3 Results and analysis

Tables 2, 3, 4, 5, 6 and 7 show the experimental results of the above ten methods on the six datasets, with the best results bolded and underlined and the second-best results bolded only. The experimental results demonstrate the superiority of RATAI over the other methods across different missing rates on the six datasets.

Table 2 RMSE values of RATAI and benchmark models on CHB-MIT dataset
Table 3 RMSE values of RATAI and benchmark models on PTB dataset
Table 4 RMSE values of RATAI and benchmark models on DART dataset
Table 5 RMSE values of RATAI and benchmark models on MPI_Roof dataset
Table 6 RMSE values of RATAI and benchmark models on GT_2011 dataset
Table 7 RMSE values of RATAI and benchmark models on Energy dataset

On the CHB-MIT dataset, RATAI is the best model for imputation under all missing rates, with USGAN, SAITS, and LIME-LSTM as the second-best model under specific missing rates.

On the PTB dataset, RATAI is consistently the best model for imputation under all missing rates, with GRU-D and LIME-LSTM as the second-best model under specific missing rates.

On the DART dataset, RATAI is consistently the best model for imputation under all missing rates, with BRITS and SBU-LSTM as the second-best models under specific missing rates.

On the MPI_Roof dataset, BRITS is the best model under missing rates of 10% to 30%, with RATAI as the second-best model, and RATAI is the best model at missing rate of 40% and 50%, with M-RNN and BRITS as the second-best model, respectively.

On the GT_2011 dataset, RATAI is the best model and BRITS is the second-best model under missing rate of 10%. Under missing rates of 20% to 50%, RATAI is the second-best model and SAITS is the best model.

On the Energy dataset, RATAI is the best model for imputation under all missing rates, with USGAN, M-RNN, and GRU-D as the second-best models under specific missing rates.

The experimental results show that across all missing rates on the six selected real datasets, RATAI consistently achieves the best or second-best imputation results compared to the baseline methods, demonstrating the effectiveness of the proposed approach.

5.4 Ablation studies

In this section, we use two ablation experiments to verify the contribution of introducing the imputation unit and the temporal attention mechanism, respectively. In the ablation experiments, the unmodified model is called RATAI (full model), while the model without imputation units is called RATAI (w/o imputation unit) and the model without temporal attention layer is called RATAI (w/o temporal attention). The experimental results are illustrated in Table 8, with the best results bolded and underlined and the second-best results bolded only.

Table 8 Ablation results of RATAI

5.4.1 Effect of imputation unit

In this experiment, we exclude the imputation unit from RATAI. As LSTM cannot directly handle missing values, prior to the experiment we pre-impute the experimental dataset using the mean imputation method. As can be seen from Table 8, compared with RATAI (full model), the imputation RMSE of RATAI (w/o imputation unit) increases significantly, indicating that the imputation unit plays an important role in RATAI.

Taking the PTB dataset as an example, we train RATAI (full model) and RATAI (w/o imputation unit) with a missing rate of 0.1. After training, we test both models on the test set. Figure 10 shows the imputation results of the two models on the 12th attribute of the PTB dataset.

Fig. 10

Imputed values of RATAI (full model) and RATAI (w/o imputation unit)

From Fig. 10, it can be observed that the imputation results of RATAI (w/o imputation unit) deviate more from the true values than those of RATAI (full model). The results demonstrate that mean imputation distorts the distribution of the time series, thereby degrading the model's feature extraction on the input sequence. In comparison, the imputation unit reduces the disruption caused by imputed values to the time series distribution, ultimately enhancing the model's imputation capability.

5.4.2 Effect of temporal attention

In this experiment, we remove the temporal attention layer from RATAI, resulting in a fixed context vector during decoding at different time steps. From Table 8, it can be observed that the introduction of the temporal attention layer improves the model's imputation capability in the vast majority of cases.

Furthermore, we analyze the attention scores assigned to different time steps in the temporal attention layer. Taking the PTB dataset as an example, we train RATAI with a missing rate of 10%; after training, we take a segment of data with a sequence length of 20 from the test set and extract the attention scores assigned by the decoder to different time steps when estimating the missing values at a specific time step. Figure 11 shows the heatmap of the attention scores for that time step.

From Fig. 11, it can be observed that the elements around the diagonal have larger values, while the remaining elements have smaller values. This indicates that when reconstructing the values at time step t, the decoder pays the most attention to the values of the time series around time step t. Together with the previous experimental results, this further confirms that the introduction of temporal attention is effective.

Fig. 11

Heat map of the attention score

6 Conclusion

In this paper, we propose a Long Short-Term Memory Network based Recurrent Autoencoder with Imputation Units and Temporal Attention Imputation Model (RATAI) for imputing missing values in multivariate time series. To fully exploit the historical information, attribute correlations, and dynamic features of the multivariate time series, a two-stage imputation model based on LSTM is designed. The imputation units in the encoder perform the first stage of imputation based on historical information and attribute correlations, after which the encoder extracts the feature representation of the imputed time series. In the second stage, the decoder utilizes this feature representation, which encapsulates the dynamic features, to reconstruct the input time series. Additionally, the introduced temporal attention layer enables the decoder to focus on the most relevant inputs when reconstructing a specific time step. Compared to prediction-based imputation methods, RATAI's two-stage imputation approach extracts information from the time series more comprehensively, achieving more precise imputation outcomes. Moreover, RATAI can be trained directly on incomplete time series data without requiring initial values for the missing entries, which avoids the effect of inappropriate initial values on model training and enhances the robustness of the proposed model. Comparative experiments and ablation studies demonstrate the superiority of RATAI in imputation performance and the effectiveness of the model design.