Abstract
Multivariate time series are ubiquitous in real-world applications, yet they often suffer from missing values that impede downstream analytical tasks. In this paper, we introduce the Long Short-Term Memory Network based Recurrent Autoencoder with Imputation Units and Temporal Attention Imputation Model (RATAI), tailored for multivariate time series. RATAI is designed to address certain limitations of traditional RNN-based imputation methods, which often focus on predictive modeling to estimate missing values and thereby neglect the contextual impact of observed data at and beyond the target time step. Drawing inspiration from Kalman smoothing, which effectively integrates past and future information to refine state estimates, RATAI extracts a feature representation from the time series and uses it to reconstruct a complete series, thus overcoming the shortcomings of existing approaches. It employs a dual-stage imputation process: the encoder uses temporal information and attribute correlations to predict and impute missing values and to extract a feature representation of the imputed time series. The decoder then reconstructs the series from this feature representation, and the reconstructed values serve as the final imputation values. Additionally, RATAI incorporates a temporal attention mechanism that allows the decoder to focus on highly relevant inputs during reconstruction. The model can be trained directly on data containing missing values, avoiding the misleading effects on training that can arise from assigning initial values to missing entries. Our experiments demonstrate that RATAI outperforms benchmark models in multivariate time series imputation.
1 Introduction
Multivariate time series, which record the variations of multiple variables over time, are crucial for understanding and predicting the behaviors of complex systems (Du et al. 2020). Such data find extensive applications across various domains, including financial market analysis, climate monitoring, and health monitoring. For instance, in financial markets, analysts can predict market trends and devise strategies by analyzing time series of stock prices, trading volumes, and macroeconomic indicators (Moews et al. 2019). In the field of climate science, time series data such as temperature, precipitation, and wind speed assist scientists in predicting weather changes, thereby facilitating early warnings for natural disasters (Zamora-Martinez et al. 2014). In the health sector, continuous monitoring of physiological indicators like heart rate and blood pressure plays a vital role in the early diagnosis and treatment of diseases (Fernandez-Fraga et al. 2018).
However, in practical applications, multivariate time series frequently exhibit missing values due to various reasons such as equipment failure, data transmission errors, or human oversight during data collection (Bashir and Wei 2018). Missing data not only reduces the amount of usable information but can also introduce biases that affect the accuracy and reliability of data analysis (Pan et al. 2022). For example, in financial analysis, missing transaction data can lead to misinterpretations of market dynamics; in climate monitoring, the absence of data at crucial time points can impair the prediction of extreme weather events; in health monitoring, missing physiological data can obscure the true health status of a patient. Therefore, effectively imputing these missing values is a critical task for improving the accuracy of multivariate time series analysis and the quality of decision support.
Despite the immense potential of multivariate time series in analysis and application, the incompleteness of data frequently poses a significant challenge in practical applications. Traditional data imputation methods, such as linear interpolation, nearest neighbor imputation, or simple statistical methods (e.g., mean or median substitution), often fall short when dealing with time series data characterized by complex temporal dependencies and multivariate features. These methods tend to overlook the inherent dynamic properties of time series data, such as autocorrelation, as well as the interactions between different variables, which are indispensable attributes of multivariate time series.
Autocorrelation in time series means that the value of a data point is strongly connected to its historical values. For example, in financial time series, the price of a stock on any given day is typically closely related to its price on the previous day. Moreover, variables within a multivariate time series may interact complexly, such as the interplay among economic indicators or the correlation between different physiological parameters in health monitoring. If these characteristics are ignored during the imputation process, the imputed data may not accurately reflect the original dynamics of the time series, thereby affecting the accuracy of subsequent data analysis and decision-making processes.
Given the complexity of multivariate time series, the limitations of traditional imputation techniques are apparent, underscoring the need for more refined strategies to effectively address these challenges. In response to this, recurrent neural networks (RNNs) have been widely employed in the task of time series imputation. RNNs are particularly suited to this purpose due to their ability to model temporal dependencies, which is critical in understanding and reconstructing missing patterns in time series data.
However, most RNN-based imputation methods use a predictive approach to impute missing values. As shown in Fig. 1, these methods train a model to learn the patterns before time step t, predict the value at time step t, and use the predicted value to impute the missing values. They do not take into account the influence of observed values at and after time step t when estimating the missing values at time step t. Inspired by Kalman smoothing, which leverages both past and future observations to provide a more accurate estimation of the current state, we attempt to extract the feature representation of a time series and use this representation to reconstruct the entire time series. In this way, when reconstructing the value at each time step, the model incorporates the global data distribution information of the entire input time series.
Therefore, in this paper, we propose a Long Short-Term Memory Network based Recurrent Autoencoder with Imputation Units and Temporal Attention Imputation Model (RATAI) that imputes incomplete time series through a two-stage strategy of prediction and reconstruction. In the first stage, also called the prediction stage, the imputation units in the encoder predict and impute missing values using the temporal historical information and attribute correlations of the input time series. At the end of the first stage, the encoder extracts the feature representation of the imputed time series, which captures the data distribution characteristics of the entire imputed series. In the second stage, also called the reconstruction stage, the decoder uses this feature representation to reconstruct the input time series and imputes the missing values with the reconstructed values. In contrast to the prediction-based imputation of the first stage, the reconstruction-based imputation of the second stage resembles Kalman smoothing, where the system's states over a certain period are estimated after all the information pertaining to the system within that period has been acquired. Furthermore, the temporal attention mechanism introduced in RATAI allows the decoder to pay more attention to highly relevant inputs when reconstructing each time step. Additionally, owing to the imputation units, RATAI can be trained directly on incomplete data without setting initial values for missing values, avoiding the impact of unreasonable initial values on model training.
The main contributions of RATAI are as follows:
1. RATAI addresses the key shortcoming of existing RNN-based imputation models, which often neglect the contextual impact of data around the missing values, enhancing imputation accuracy by integrating past and future data insights in a manner similar to Kalman smoothing.
2. The RATAI model introduces a dual-stage imputation process with an encoder-decoder structure. The encoder efficiently utilizes temporal and attribute correlations to predict missing values, while the decoder reconstructs a complete and coherent time series, ensuring the preservation of the original data dynamics.
3. Our experiments demonstrate that RATAI outperforms benchmark models in terms of accuracy, confirming its effectiveness in real-world multivariate time series imputation scenarios. Meanwhile, ablation studies confirm the effectiveness of each component within the model.
2 Related works
Existing methods for processing missing values are primarily divided into deletion-based methods and imputation-based methods (Liu et al. 2018). Deletion-based methods remove samples or attributes that contain missing values from the original dataset, yielding a complete dataset of reduced size (Ribeiro and Freitas 2021). Although deletion-based methods have low time and space overhead, they are only suitable for incomplete datasets with a low missing rate in which values are missing completely at random. When applied to datasets with high missing rates, deletion-based methods can discard a large amount of valuable information, which may affect the accuracy of analysis results. In contrast, imputation-based methods estimate a reasonable value for each missing value based on the observed data, resulting in a complete dataset with the same size as the original dataset (Zhou et al. 2021). Common imputation-based methods include statistical-based imputation methods and machine learning-based imputation methods.
Statistical-based imputation methods utilize statistical theories to model time series and estimate missing values. One of the simplest is mean imputation, which fills missing values in an attribute with the mean of the observed values of that attribute. Although mean imputation is simple, fast, and produces values within a reasonable range, the imputed results are overly concentrated, which reduces the dispersion of the attribute value distribution. Another representative statistical-based method is the autoregressive (AR) model (Liu et al. 2018). For example, Choong et al. (2009) used AR models to handle missing values in DNA microarray time series data and achieved good results when an entire time point of the data is missing. Bashir and Wei (2018) proposed a vector autoregressive model-based imputation method to impute missing values in multidimensional time series.
Machine learning-based methods use modeling to extract valid information from incomplete data and make reasonable inferences about the missing values. These methods include K-Nearest Neighbor (KNN) (Lee and Styczynski 2018), Dynamic Time Warping (DTW) (Pan et al. 2023), Expectation Maximization (EM) (Zhou et al. 2021), Matrix Completion/Matrix Factorization (Dean and Varshney 2021), and Principal Component Analysis (PCA) (Josse et al. 2011). For instance, Sun et al. (2017) proposed a KNN-based imputation method for traffic data; they introduced a gap-sensitive window mechanism in KNN, which improved its performance when imputing time series data. Oehmcke et al. (2016) employed DTW as the distance metric instead of pointwise distance measurements in KNN, and their method handled consecutively missing values better than state-of-the-art algorithms.
In machine learning-based imputation methods, a large portion of research is based on neural networks, including Multilayer Perceptrons (MLP) (Silva-Ramírez et al. 2015; Jung et al. 2020; Pan et al. 2023), Autoencoders (AE) (Zamanzadeh et al. 2021), Generative Adversarial Networks (GAN) (Zhang et al. 2021; Miao et al. 2021), Recurrent Neural Networks (RNN) (Ma et al. 2022; Ho et al. 2022; Weerakody et al. 2023), self-attention-based networks (Suo et al. 2020; Yang et al. 2021; Ma et al. 2022; Du et al. 2023; Zhou et al. 2021; Zeng et al. 2023), and others. Among these, RNNs have shown superiority in processing time series, and many RNN-based imputation methods have been proposed in recent years. For example, Yoon et al. (2019) proposed a bi-directional RNN-based imputation method (M-RNN), in which the missing values are initially set to 0 and then estimated in two stages: the first stage uses the temporal correlation within each attribute of the time series to estimate the missing values, and the second stage uses the correlation between attributes. Che et al. (2018) proposed a Gated Recurrent Unit (GRU) based imputation model named GRU-D, which introduced a decay mechanism to generate input decay rates based on time intervals and dynamically combine previous observations and the average value to estimate missing values. Ma et al. (2020) proposed the linear memory vector recurrent neural network (LIME-RNN) to handle missing values; they introduced a linear memory vector that integrates over previous hidden states of the RNN to impute missing values, and they tested different recurrent units, including RNN, LSTM, and GRU. Cao et al. (2018) proposed a bi-directional recurrent imputation model for time series called BRITS, which consists of a bi-directional LSTM to impute multivariate time series; the imputed values in BRITS are treated as variables of the RNN graph and can be updated during backpropagation. Suo et al. (2020) proposed a global and local time series imputation model with multi-directional attention learning named GLIMA, which uses a two-layer GRU structure to learn both global and local dependencies of multivariate time series and introduces an attention mechanism to capture distant correlations across both time and features. Cui et al. (2020) proposed a stacked bidirectional LSTM-based imputation method (SBU-LSTM) for imputing missing values in traffic data, and the results showed the advantages of the stacked bidirectional LSTM model in traffic data prediction.
3 Preliminaries
3.1 Long short-term memory network
Long Short-Term Memory Network (LSTM) is a type of Recurrent Neural Network (RNN) (Greff et al. 2017). The key idea behind LSTM is the introduction of memory cells, which allow the network to selectively learn and forget information over time. Therefore, in our proposed model, we utilize LSTM units as the recurrent units. Figure 2 shows the structure of an LSTM unit. There are three gates in an LSTM unit: the input gate, the forget gate, and the output gate. These gates control the flow of information into and out of the unit, while the memory cell \({C}_{t}\) and hidden state \({h}_{t}\) store long-term and short-term information, respectively.
The forget gate controls the amount of information from the previous time step that is retained in the unit. The output of the forget gate at time step t is
\(f_t=\sigma \left( W_f x_t + U_f h_{t-1} + b_f\right) \qquad (1)\)
where \({h}_{{t}-{1}}\) is the previous hidden state, \({x_t}\) is the current input, \({W}_{f}\), \({U}_{f}\) and \({b}_{f}\) are learnable weights and bias, \(\sigma (\cdot )\) is the sigmoid activation function.
The input gate controls the amount of new information that is added to the memory cell. The output of the input gate at time step t is
\(i_t=\sigma \left( W_i x_t + U_i h_{t-1} + b_i\right) \qquad (2)\)
where \({W}_{i}\), \({U}_{i}\) and \({b}_{i}\) are learnable weights and bias.
The output gate controls the amount of information that is output from the unit, and it is calculated by
\(o_t=\sigma \left( W_o x_t + U_o h_{t-1} + b_o\right) \qquad (3)\)
where \({W}_{o}\), \({U}_{o}\) and \({b}_{o}\) are learnable weights and bias.
The candidate memory cell \({\hat{{C}}}_{t}\) represents the new candidate values to be added to the memory cell and is calculated by
\(\hat{C}_t=\tanh \left( W_c x_t + U_c h_{t-1} + b_c\right) \qquad (4)\)
where \({W}_{c}\), \({U}_{c}\) and \({b}_{c}\) are learnable weights and bias, \(tanh(\cdot )\) is the tanh activation function.
The memory cell \({C}_{t}\) stores long-term information and is updated by
\(C_t=f_t\circ C_{t-1}+i_t\circ \hat{C}_t \qquad (5)\)
where \(\circ\) denotes element-wise multiplication.
The hidden state \({h}_{t}\) is the output of the LSTM unit, and it is calculated by
\(h_t=o_t\circ \tanh \left( C_t\right) \qquad (6)\)
These equations allow the LSTM unit to selectively learn and forget information over time, making it suitable for tasks involving sequential data.
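To make the gate equations concrete, the following sketch runs a single LSTM forward step in Python/NumPy following Eqs. (1)-(6); the weight shapes, random initialization, and toy dimensions are illustrative assumptions rather than the configuration used in RATAI.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM forward step following Eqs. (1)-(6)."""
    W_f, U_f, b_f, W_i, U_i, b_i, W_o, U_o, b_o, W_c, U_c, b_c = params
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)    # forget gate, Eq. (1)
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)    # input gate, Eq. (2)
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)    # output gate, Eq. (3)
    c_hat = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)  # candidate memory cell, Eq. (4)
    c_t = f_t * c_prev + i_t * c_hat                 # memory cell update, Eq. (5)
    h_t = o_t * np.tanh(c_t)                         # hidden state, Eq. (6)
    return h_t, c_t

D, H = 4, 8                                          # toy input / hidden dimensions
rng = np.random.default_rng(0)
params = [rng.standard_normal(s) * 0.1 for s in [(H, D), (H, H), (H,)] * 4]
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.standard_normal(D), h, c, params)
```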
3.2 Encoder-decoder framework
Autoencoder (AE) is a type of neural network used for self-supervised learning. Its purpose is to reconstruct input data, typically used for dimensionality reduction, feature extraction, and anomaly detection. AE is composed of an encoder and a decoder. The encoder maps the input data into a latent representation, while the decoder maps the latent representation back to the original input space (Rumelhart et al. 1986).
As shown in Fig. 3, the basic principle of AE is to minimize the difference between the input and the reconstructed output. The reconstruction error is typically measured by a loss function, such as mean squared error (MSE) or binary cross-entropy. The objective is to minimize this loss function using an optimization algorithm, such as stochastic gradient descent (SGD) or Adam.
The encoding phase of AE can be formulated as
\(h=\sigma \left( W_{enc}x+b_{enc}\right)\)
where x is the input data, h is the latent representation, \(\sigma (\cdot )\) is the activation function, \({W}_{{enc}}\) and \({b}_{{enc}}\) are learnable weight and bias.
The decoding phase of AE can be formulated as
\(y=\sigma \left( W_{dec}h+b_{dec}\right)\)
where y is the output data, \({W}_{{dec}}\) and \({b}_{{dec}}\) are learnable weight and bias.
The advantage of AE lies in its ability to capture the underlying structure and patterns in the input data. By learning a compact and meaningful representation of the input data, AE can be used for feature extraction, where the latent representation serves as input to other supervised learning algorithms, such as classification or regression. However, since the AE does not contain units that can store temporal information, it cannot learn the dynamic features of multivariate time series and therefore cannot be directly used to model time series data.
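As a minimal illustration of the encode-decode mapping above, the sketch below builds a one-layer fully-connected autoencoder in PyTorch and performs a single reconstruction-loss update; the layer sizes, sigmoid activations, and Adam optimizer are illustrative choices, not prescriptions from the paper.

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, input_dim=28, latent_dim=8):
        super().__init__()
        # encoder: h = sigma(W_enc x + b_enc)
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.Sigmoid())
        # decoder: y = sigma(W_dec h + b_dec)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 28)                       # a toy batch of inputs
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
loss.backward()
optimizer.step()
```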
Recurrent Autoencoder (RAE) is a variant of autoencoder that is designed to handle sequential data, such as time series or natural language text. The key difference between RAE and traditional AE is that RAE uses RNNs as the encoder and decoder, allowing it to capture the temporal dependencies in the input data. Figure 4 illustrates the structure of the LSTM-based Recurrent Autoencoder which is capable of extracting dynamic features from univariate or multivariate time series.
Let \({X}=\left[ x_1,{\ x}_2,\ \ \ldots ,\ \ x_T\right]\), where \(x_t\in \mathbb {R}^D\), be a time series of length T. At time step t, the hidden state \({h}_{t}^{e}\) in the encoder is updated by
\(h^{e}_{t}=\sigma \left( W_h h^{e}_{t-1}+W_x x_t+b_h\right)\)
where \({x_t}\) is the input at time step t, \({h}^{e}_{t}\) and \({h}^{e}_{{t}-{1}}\) are the hidden states in the encoder at time steps t and \(t-1\), respectively, \(\sigma (\cdot )\) is the activation function, and \({W}_{h}\), \({W}_{x}\) and \({b}_{h}\) are learnable weights and bias. The hidden state \({h}^{e}_{T}\) at the last time step T is then used as the compressed representation of X.
The reconstructed time series \(\left[ y_1,{\ y}_2,\ \ \ldots ,\ \ y_T\right]\) can be decoded from the compressed representation \({h}^{e}_{T}\). The reconstructed value \(y_{t^\prime }\) at decoding time step \(t^\prime\) is generated by
\(y_{t^\prime }=W_o h^{d}_{t^\prime }+b_o\)
For \(t^\prime =1\), the hidden state is
\(h^{d}_{1}=\sigma \left( W_h h^{e}_{T}+W_x\,zero+b_h\right)\)
while for \(t^\prime \ge 2\), the hidden state is
\(h^{d}_{t^\prime }=\sigma \left( W_h h^{d}_{t^\prime -1}+W_x y_{t^\prime -1}+b_h\right)\)
where \(y_{t^\prime }\) is the output at time step \(t^\prime\), \({h}^{d}_{{t}^\prime }\) is the hidden state in the decoder at decoding time step \(t^\prime\), zero is a zero vector of the same dimension as \(y_{t^\prime }\), \(\sigma (\cdot )\) is the activation function, and \({W}_{o}\) and \({b}_{o}\) are learnable weight and bias. The decoder's role is to take in the compressed representation produced by the encoder and use it to reconstruct the original input sequence.
By taking RNNs as the encoder and decoder, RAE can capture the temporal dependencies in the input data, which can be challenging for traditional feedforward neural networks.
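The following sketch shows a minimal LSTM-based recurrent autoencoder of the kind described above: the encoder's final states serve as the compressed representation, and the decoder reconstructs the sequence step by step, starting from a zero vector and feeding each reconstructed value back as the next input. The layer sizes and the use of PyTorch's nn.LSTM and nn.LSTMCell are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentAE(nn.Module):
    """Minimal LSTM recurrent autoencoder: encode X, then reconstruct it step by step."""
    def __init__(self, n_features, hidden_size):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.decoder_cell = nn.LSTMCell(n_features, hidden_size)
        self.out = nn.Linear(hidden_size, n_features)

    def forward(self, x):                             # x: (batch, T, D)
        _, (h_T, c_T) = self.encoder(x)               # final encoder states
        h, c = h_T.squeeze(0), c_T.squeeze(0)         # compressed representation
        y_prev = torch.zeros(x.size(0), x.size(2))    # zero vector as the first decoder input
        outputs = []
        for _ in range(x.size(1)):
            h, c = self.decoder_cell(y_prev, (h, c))
            y_prev = self.out(h)                      # reconstructed value at this step
            outputs.append(y_prev)
        return torch.stack(outputs, dim=1)            # (batch, T, D)

rae = RecurrentAE(n_features=5, hidden_size=32)
x = torch.randn(16, 7, 5)                             # toy batch of length-7 series
loss = nn.functional.mse_loss(rae(x), x)              # reconstruction objective
```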
4 Proposed methods
In this section, we propose an LSTM-based Recurrent Autoencoder with Imputation Units and Temporal Attention Imputation Model (RATAI) to impute missing values in multivariate time series. We define a multivariate time series of length T as \({X}=\left[ x_1,{\ x}_2,\ \ \ldots ,\ \ x_T\right]\), where \(x_t\in \mathbb {R}^D\), and \(x_{td}\) denotes the value of the dth attribute at time step t. Since X may contain missing values, we introduce the corresponding missing indicator matrix \({M}=\left[ m_1,{\ m}_2,\ \ \ldots ,\ \ m_T\right]\), where \(m_t\in \mathbb {R}^D\). If \(m_{td}=1\), then \(x_{td}\) is observed; conversely, if \(m_{td}=0\), then \(x_{td}\) is missing.
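For illustration, the missing indicator matrix M can be derived directly from a series whose missing entries are stored as NaN; the toy array below is only an example of this bookkeeping, not data from the experiments.

```python
import numpy as np

# a toy multivariate time series of length T=4 with D=3 attributes; NaN marks missing values
X = np.array([[1.0, 2.0, np.nan],
              [0.5, np.nan, 3.0],
              [np.nan, 1.5, 2.5],
              [0.7, 1.2, 2.8]])

M = (~np.isnan(X)).astype(float)   # m_td = 1 if x_td is observed, 0 if it is missing
print(M)
```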
The architecture of the proposed model is illustrated in Fig. 5. RATAI consists of an encoder and a decoder. The encoder is composed of a series of LSTM units and imputation units. The imputation units take the time series, the corresponding missing indicator matrix, and the temporal information from the previous LSTM units as input, and use a predictive approach based on temporal and attribute correlation information to perform the first-stage imputation of the missing values in the input time series. The last state of the encoder is used as the feature representation of the imputed time series. The decoder utilizes this feature representation to reconstruct the input sequence and performs the second-stage imputation of the missing values using the reconstructed values. Additionally, the temporal attention mechanism enables the decoder to focus on highly relevant inputs when reconstructing the time series. The output of RATAI is the reconstruction of the imputed time series, and the two-stage imputation approach based on prediction and reconstruction allows RATAI to incorporate not only temporal information and attribute correlations but also the global data distribution information of the input time series, overcoming the insufficient use of information caused by imputing solely with predicted values.
4.1 LSTM-based encoder with imputation unit
Because RATAI is trained on incomplete data, the input may contain missing values that cannot be directly handled by LSTM units. Simply replacing missing values in the input with zeros or other inappropriate initial values can significantly alter the data distribution of the time series, affecting the LSTM's ability to learn from it effectively. Reasonable imputation of missing values not only provides complete training data for the LSTM but also alleviates the impact of missing values on the extraction of the feature representation. Therefore, we design an imputation unit that utilizes the temporal and attribute correlation information of the time series to impute the missing values in the first stage. The structure of the imputation unit is illustrated in Fig. 6.
The imputation unit at time step t utilizes the output \({h}^{{e}}_{{t}-{1}}\) of the previous LSTM unit at time step \(t-1\) and the observed values \(x_t\) at time step t, which carry temporal history information and attribute correlation information, respectively, to generate the estimated values \(z_t^\prime\) at time step t. The estimated values \(z_t^\prime\) and the observed values \(x_t\) are combined to form \({\widetilde{x}}_t\), which is the final input of the current LSTM unit and then participates in the forward propagation process described by Equations (1) to (6). The output of the imputation unit is the combined values \({\widetilde{x}}_t\), calculated by
\({\widetilde{x}}_t=m_t\circ x_t+\left( 1-m_t\right) \circ z_t^\prime\)
where \(x_t\) and \(m_t\) represent the input values and the missing indicator vector at time step t, respectively, and \(z_t^\prime\) represents the estimated values at time step t.
The estimated values \(z_t^\prime\) are calculated by
where \({h}^{{e}}_{{t}-{1}}\) represents the output of the LSTM unit in the encoder at time step \(t-1\), and \({W}_{{z}^\prime }\) and \({b}_{{z}^\prime }\) represent the weight matrix and the bias, respectively. \({W}_{{z}^\prime }{h}^{{e}}_{{t}-{1}}\) represents the predicted values calculated from temporal history information, and \(a_t\) represents the estimated values calculated from the correlation information among the attributes at time step t. \(a_t\) is generated by
where tanh is the activation function, \({W}_{a}\) represents the weight matrix with zero diagonal elements, as follows
To eliminate the strong effect of self-correlation, we restrict the diagonal elements of the weight matrix \({W}_{a}\) to be all zeros. Thus, the dth element \(a_{td}\) of \(a_t\) is calculated from the other attributes \(x^\prime _{t1},\dots ,x^\prime _{t(d-1)}\) and \(x^\prime _{t(d+1)},\dots ,x^\prime _{tD}\), without using \(x^\prime _{td}\) itself. \(x_t^\prime\) represents the result of the first round of imputing the missing values in \(x_t\) using only temporal information and is calculated by
\(x_t^\prime =m_t\circ x_t+\left( 1-m_t\right) \circ z_t\)
where \(x_t\) and \(m_t\) represent the input values and the missing indicator vector at time step t, respectively, and \(z_t\) represents the predicted values obtained solely from temporal information, computed by
\(z_t={W}_{z}{h}^{e}_{{t}-{1}}+{b}_{z}\)
where \({h}^{e}_{{t}-{1}}\) represents the output of the LSTM unit in the encoder at time step \(t-1\), and \({W}_{z}\) and \({b}_{z}\) represent the weight matrix and the bias, respectively. This is the first prediction for missing values, aiming to impute the missing values in \(x_t\) so as to facilitate the subsequent extraction of correlations between attributes.
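Putting the pieces together, the sketch below gives one plausible reading of the imputation unit's forward pass: a temporal prediction z_t from the previous hidden state, a first-round fill x'_t, an attribute-correlation estimate a_t through a zero-diagonal weight matrix, and a combined estimate z'_t used to complete the LSTM input. The class name, the weight names W_z, W_a, and W_zp, and in particular the way the temporal term and a_t are merged into z'_t are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImputationUnit(nn.Module):
    """Sketch of the first-stage imputation at one time step; the exact merge into z'_t is assumed."""
    def __init__(self, n_features, hidden_size):
        super().__init__()
        self.W_z = nn.Linear(hidden_size, n_features)             # temporal prediction z_t
        self.W_a = nn.Linear(n_features, n_features, bias=False)  # attribute-correlation weights
        self.W_zp = nn.Linear(hidden_size, n_features)            # temporal part of z'_t

    def forward(self, x_t, m_t, h_prev):
        z_t = self.W_z(h_prev)                                     # prediction from history only
        x_first = m_t * x_t + (1 - m_t) * z_t                      # first-round fill of x_t
        # enforce a zero diagonal so attribute d never sees its own value
        W_a = self.W_a.weight - torch.diag(torch.diag(self.W_a.weight))
        a_t = torch.tanh(x_first @ W_a.T)                          # estimate from the other attributes
        z_prime = self.W_zp(h_prev) + a_t                          # assumed combination of both sources
        x_tilde = m_t * x_t + (1 - m_t) * z_prime                  # input handed to the LSTM unit
        return x_tilde, z_t, z_prime
```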
4.2 LSTM-based decoder with temporal attention
4.2.1 Feature representation extraction
As described in the previous section, the LSTM units in the encoder produce a hidden state at each time step, resulting in a sequence of hidden states \([{h}^{{e}}_{1},{h}^{{e}}_{2},\ldots ,{h}^{{e}}_{T}]\) at the end of encoding. A common approach to computing the feature representation of the imputed time series is to fuse the hidden states from all time steps of the encoder, which is calculated by
where repr is the feature representation of the input time series and shares the same dimension as each \({h}^{{e}}_{i}\), Linear is a fully-connected layer, and repr is used as part of the input to the decoder for further computations. Considering that the LSTM's memory cell and hidden state store long-term and short-term memories of the time series, respectively, we instead use the memory cell \({C}^{{e}}_{T}\) and hidden state \({h}^{{e}}_{T}\) of the LSTM unit in the encoder at the last time step as the feature representation of the imputed time series. We then set these two states as the initial states of the LSTM unit in the decoder at the starting time step, which reduces computational complexity and accelerates the model's convergence.
4.2.2 Decoding for missing value imputation
Regardless of whether the hidden states of the various time steps are fused or the last time step's memory cell state and hidden state are used as the feature representation, the expressive power of the feature representation gradually decreases as the length of the input time series increases. To address this issue, this paper introduces a temporal attention mechanism into the multivariate time series imputation model, in which a temporal attention layer serves as an intermediate transition between the encoder and decoder. Figure 7 depicts how the attention mechanism operates during the decoding process at time step 3. When reconstructing the value at time step 3, similarity scores are first computed between \({h}^{{d}}_{2}\) and all hidden states \([{h}^{{e}}_{1},{h}^{{e}}_{2},\ldots ,{h}^{{e}}_{T}]\) obtained from the encoder. These hidden states are then weighted and summed using the similarity scores, yielding the context vector \({s}_{3}\) at time step 3. This context vector serves as a partial input to the decoder during the reconstruction procedure.
The memory cell state \({C}^{{e}}_{T}\) and hidden state \({h}^{{e}}_{T}\) of the last LSTM unit in the encoder are used as the initial states of the LSTM unit in the decoder. In the decoding phase, a zero vector is used as the initial input. The reconstructed value \(y_1\) at time step 1 is calculated based on the initial states, the initial zero-vector input, and the context vector \({s}_{1}\). Subsequently, \(y_1\) is used as the input of the next time step, and this iteration continues until reconstructed values are obtained for all time steps. \(y_1\) is calculated as follows
\(y_1={W}_{y}{h}^{d}_{1}+{b}_{y}\)
where \({W}_{y}\) and \({b}_{y}\) are the weight matrix and bias, respectively, and \({h}^{d}_{1}\) represents the hidden state of the decoder at time step 1, which is calculated by the following equation.
where \({h}^{{e}}_{T}\) is the last hidden state of the LSTM unit in the encoder, zero is the zero-vector input, \({s}_{1}\) is the context vector at decoding time step 1, and \([\cdot ]\) is the concatenation operation of vectors in the last dimension. For subsequent time steps,
where \({h}^{{d}}_{{t}^\prime }\) and \({h}^{{d}}_{{t}^\prime -{1}}\) are the hidden states of the LSTM unit in the decoder at time steps \(t^\prime\) and \(t^\prime -1\), respectively, \(y_{t^\prime -1}\) is the reconstructed value at decoding time step \(t^\prime -1\), and \({s}_{{t}^\prime }\) is the context vector at decoding time step \(t^\prime\). Let the hidden state of the encoder at time step t be \({h}^{e}_{t}\); the context vector \({s}_{{t}^\prime }\) of the decoder at time step \(t^\prime\) is calculated by
\(s_{t^\prime }=\sum _{t=1}^{T}a_{t^\prime t}{h}^{e}_{t}\)
where \(a_{t^\prime t}\) is the attention weight, which represents the contribution of each encoder time step to the reconstruction at decoding time step \(t^\prime\), and is calculated by
\(a_{t^\prime t}=\frac{\exp \left( e_{t^\prime t}\right) }{\sum _{k=1}^{T}\exp \left( e_{t^\prime k}\right) }\)
where \(e_{t^\prime t}\) represents the similarity score between the hidden state of the decoder at time step \(t^\prime -1\) and the hidden state of the encoder at time step t. It is calculated by
\(e_{t^\prime t}=v^{\top }\tanh \left( {W}_{h^d}{h}^{d}_{{t}^\prime -{1}}+{W}_{h^e}{h}^{e}_{t}\right)\)
where \({h}^d_{{t}^\prime -{1}}\) represents the hidden state of the decoder at decoding time step \(t^\prime -1\), \({h}^{e}_{t}\) represents the hidden state of the encoder at time step t, and \({W}_{h^d}\), \({W}_{h^e}\), and v are learnable weight matrices.
If the attention mechanism is not used, then the context vector \({s}_{{t}^\prime }\) is the same at different decoding time steps and is calculated by the following equation
By introducing the attention mechanism, the decoder is able to attend to the input selectively when decoding, thereby mitigating the loss of information that can occur when generating context vectors for longer input sequences.
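The attention computation described above amounts to additive attention over the encoder hidden states, scored against the previous decoder hidden state. A compact sketch is given below; the bias-free linear layers, class name, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Additive attention over the encoder hidden states, producing the context vector s_{t'}."""
    def __init__(self, hidden_size):
        super().__init__()
        self.W_hd = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_he = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, h_dec_prev, enc_states):
        # h_dec_prev: (batch, hidden), previous decoder hidden state h^d_{t'-1}
        # enc_states: (batch, T, hidden), encoder hidden states h^e_1 ... h^e_T
        scores = self.v(torch.tanh(self.W_hd(h_dec_prev).unsqueeze(1)
                                   + self.W_he(enc_states))).squeeze(-1)   # e_{t't}
        weights = torch.softmax(scores, dim=1)                             # a_{t't}
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # s_{t'}
        return context, weights
```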
In the model training process, we use the Mean Square Error (MSE) to calculate the estimation loss. Since the data used for training may contain missing values and we cannot obtain the true values of these entries, we compute the MSE only between the model's output values and the corresponding observed values. At time step t, the output error of the model is
\(OutputError_t=\left\| \left( y_t-x_t\right) \circ m_t\right\| _2^2\)
In addition, according to Equation (13) and Equation (17), two further estimates of the missing values are produced during the training process, so the total loss of the model should not contain only OutputError. Finally, the total loss of the model over all time steps is
\(TotalLoss=\sum _{t=1}^{T}\left( \left\| \left( y_t-x_t\right) \circ m_t\right\| _2^2+\left\| \left( z_t-x_t\right) \circ m_t\right\| _2^2+\left\| \left( z^{\prime }_{t}-x_t\right) \circ m_t\right\| _2^2\right)\)
where \((z_t-x_t)\circ m_t\) and \((z^{\prime }_{t}-x_t)\circ m_t\) represent the errors of imputation using temporal information and attribute correlation information in the first stage, respectively.
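In code, a masked loss of this kind can be written by zeroing out the error at missing positions before averaging. The sketch below assumes the per-estimate errors are combined by simple summation, which is an illustrative choice.

```python
import torch

def masked_mse(estimate, x, m):
    """MSE computed only over observed entries (m == 1)."""
    diff = (estimate - x) * m
    return (diff ** 2).sum() / m.sum().clamp(min=1.0)

def total_loss(y, z, z_prime, x, m):
    # output error plus the two first-stage estimation errors, each masked by m
    return masked_mse(y, x, m) + masked_mse(z, x, m) + masked_mse(z_prime, x, m)
```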
During the training process, we adopt a many-to-many strategy. As illustrated in the left part of Fig. 8, the outputs of RATAI at all time steps participate in the optimization procedure. In testing, we adopt a many-to-one strategy, as illustrated in the right part of Fig. 8, where only the output of the decoder at the last time step is used for imputation.
5 Experimental results and discussions
In this section, we conduct a series of experiments on six publicly available datasets to verify the effectiveness of the proposed RATAI and to exhibit the superiority of the proposed method by comparing its imputation performance with nine baseline models. We begin by introducing the datasets and describing the data-preprocessing methods, and then present the experimental evaluation of the proposed RATAI and the benchmark models on six multivariate time series datasets with different missing rates.
5.1 Datasets
The experimental datasets are CHB-MIT, PTB, DART, MPI_Roof, GT_2011, and Energy. The length of time series and number of attributes of selected datasets are shown in Table 1.
CHB-MIT, collected at the Children's Hospital Boston, consists of electroencephalography (EEG) recordings of 24 pediatric patients with intractable seizures. The database comprises 916 h of EEG records, grouped into 23 cases from 22 patients aged 1.5 to 22 years.
PTB is a publicly available electrocardiogram (ECG) database collected from the German Heart Center. Each record includes 15 simultaneously measured signals: the conventional 12 leads together with the 3 Frank lead ECGs.
DART is a weather dataset containing 22 attributes, with a sequence length of 4415 and a time interval of 30 min. The attributes include rainfall, temperature, humidity, UV radiation, etc.
MPI_Roof records weather data between January 1 and January 30, 2018, and contains 21 attributes with a sequence length of 4463 and a sampling interval of 10 min. The attributes include barometric pressure, temperature, humidity, wind speed, precipitation, radiation, etc.
GT_2011 is a gas turbine carbon monoxide and nitrogen oxide emissions dataset containing 11 attributes with a sequence length of 7411, including ambient temperature, ambient pressure, turbine temperature, carbon monoxide concentration, nitrogen oxide concentration, etc.
Energy is a dataset of appliance energy use in a low-energy residential building; it contains 28 attributes with a sequence length of 4506 and a sampling interval of 10 min, recording the energy consumption of appliances between January 11 and February 11, 2016.
5.2 Experimental settings
In our experiments, we compare the imputation performance of the proposed RATAI with several benchmark models for multivariate time series imputation, including M-RNN, SBU-LSTM, LIME-LSTM, GRU-D, BRITS, SAITS, USGAN, DLinear, and Informer.
To facilitate the evaluation of imputation performance, we partition each dataset into a train set, a validation set, and a test set. Specifically, the first 60% of the data is used as the train set for training, the next 10% is used as the validation set, and the remaining 30% is used as the test set for measuring imputation accuracy. To construct incomplete datasets, we randomly remove values from the dataset with missing rates of 10%, 20%, 30%, 40%, and 50%, respectively. As shown in Fig. 9, we employ the sliding window technique on the pre-segmented dataset to acquire training data, with a window size of 7 and a stride of 1. The batch size is 64, and the numbers of parameters of the different methods are kept roughly equal. Early stopping is used to decide when to stop training, and the maximum number of epochs is 600.
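The data preparation just described can be sketched as follows: a 60/10/30 split, random removal of values at a chosen missing rate, and a sliding window of size 7 with stride 1. The NaN-based corruption, helper names, and toy dataset are illustrative assumptions.

```python
import numpy as np

def sliding_windows(series, window=7, stride=1):
    """Cut a (T, D) series into overlapping windows of shape (n_windows, window, D)."""
    return np.stack([series[i:i + window]
                     for i in range(0, len(series) - window + 1, stride)])

def add_missing(series, missing_rate, seed=0):
    """Randomly remove values: return the corrupted series (NaN at removed entries) and the mask."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(series.shape) >= missing_rate).astype(float)
    corrupted = np.where(mask == 1, series, np.nan)
    return corrupted, mask

data = np.random.rand(1000, 5)                       # toy (T, D) dataset
n = len(data)
train = data[:int(0.6 * n)]
val = data[int(0.6 * n):int(0.7 * n)]
test = data[int(0.7 * n):]
train_corrupted, train_mask = add_missing(train, missing_rate=0.2)
train_windows = sliding_windows(train_corrupted, window=7, stride=1)
```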
We adopt the Root Mean Squared Error (RMSE) as the evaluation metric for imputation performance, which is calculated by
\(RMSE=\sqrt{\frac{1}{\left| \mathrm {\Omega }\right| }\sum _{i\in \mathrm {\Omega }}\left( y_i-{\hat{y}}_i\right) ^2}\)
where \(y_i\) and \({\hat{y}}_i\) denote the imputed value and ground truth of the ith missing value, respectively, \(\mathrm {\Omega }\) denotes the index set of missing values, and \(\left| \mathrm {\Omega }\right|\) denotes the size of the index set.
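A small helper corresponding to this metric, evaluating the error only on the artificially removed entries, might look like the following illustrative sketch.

```python
import numpy as np

def rmse_on_missing(imputed, ground_truth, mask):
    """RMSE over the artificially removed entries (mask == 0), matching the metric above."""
    missing = (mask == 0)
    diff = imputed[missing] - ground_truth[missing]
    return np.sqrt(np.mean(diff ** 2))
```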
5.3 Results and analysis
Tables 2, 3, 4, 5, 6, 7 show the experimental results of the above ten methods on the six datasets, with the best results bolded and underlined and the second-best results bolded only. The experimental results demonstrate the superiority of RATAI over the other methods across different missing rates on the six datasets.
On the CHB-MIT dataset, RATAI is the best model for imputation under all missing rates, with USGAN, SAITS, and LIME-LSTM as the second-best model under specific missing rates.
On the PTB dataset, RATAI is consistently the best model for imputation under all missing rates, with GRU-D and LIME-LSTM as the second-best model under specific missing rates.
On the DART dataset, RATAI is consistently the best model for imputation under all missing rates, with BRITS and SBU-LSTM as the second-best models under specific missing rates.
On the MPI_Roof dataset, BRITS is the best model under missing rates of 10% to 30%, with RATAI as the second-best model, and RATAI is the best model at missing rate of 40% and 50%, with M-RNN and BRITS as the second-best model, respectively.
On the GT_2011 dataset, RATAI is the best model and BRITS is the second-best model under missing rate of 10%. Under missing rates of 20% to 50%, RATAI is the second-best model and SAITS is the best model.
On the Energy dataset, RATAI is the best model for imputation under all missing rates, with USGAN, M-RNN and GRU-D as the second-best models under specific missing rates.
The experimental results show that across all missing rates on the six selected real datasets, RATAI consistently achieves the best or second-best imputation results compared to the baseline methods, demonstrating the effectiveness of the proposed approach.
5.4 Ablation studies
In this section, we use two ablation experiments to verify the contribution of introducing the imputation unit and the temporal attention mechanism, respectively. In the ablation experiments, the unmodified model is called RATAI (full model), while the model without imputation units is called RATAI (w/o imputation unit) and the model without temporal attention layer is called RATAI (w/o temporal attention). The experimental results are illustrated in Table 8, with the best results bolded and underlined and the second-best results bolded only.
5.4.1 Effect of imputation unit
In this experiment, we exclude the imputation unit from RATAI. As LSTM cannot directly handle missing values, prior to the experiment, we perform pre-imputation on the experimental dataset using mean imputation method. As can be seen from Table 8, compared with RATAI (full model), the imputation RMSE of RATAI (w/o imputation unit) has significantly increased, indicating that the imputation unit plays an important role in RATAI.
Taking the PTB dataset as an example, we train RATAI (full model) and RATAI (w/o imputation unit) with a missing rate of 0.1. After training, we test both models using the test set. Figure 10 shows the imputation results of these two models on the 12th attribute of the PTB dataset.
From Fig. 10, it can be observed that the imputation results of RATAI (w/o imputation unit) have a larger gap with the true values compared to RATAI (full model). The results demonstrate that using mean imputation affects the distribution of the time series, thereby influencing the model’s feature extraction performance on the input sequence. In comparison, the introduction of imputation unit reduces the disruption caused by imputed values on the time series distribution, ultimately enhancing the model’s imputation capability.
5.4.2 Effect of temporal attention
In this experiment, we remove the temporal attention layer from RATAI, resulting in a fixed background vector during decoding at different time steps. From the Table 8, it can be observed that the introduction of the temporal attention layer improves the model’s imputation capability in the vast majority of cases.
Furthermore, we analyze the attention scores assigned to different time steps in the temporal attention layer. Taking the PTB dataset as an example, we train RATAI with a missing rate of 10%; after training, we take a segment of data with a sequence length of 20 from the test set and extract the attention scores assigned by the decoder to different time steps when estimating the missing values at a specific time step. Figure 11 shows the heatmap of the attention scores for that time step.
From Fig. 11, it can be observed that the values of the elements around the diagonal are larger, while the values of the remaining elements are smaller. This indicates that, when reconstructing the values at time step t, the decoder pays the most attention to the values of the time series around time step t. Together with the previous experimental results, this further confirms that the introduction of the temporal attention is effective.
6 Conclusion
In this paper, we propose a Long Short-Term Memory Network based Recurrent Autoencoder with Imputation Units and Temporal Attention Imputation Model (RATAI) for imputing missing values in multivariate time series. To fully exploit the historical information, attribute correlations, and dynamic features of the multivariate time series, a two-stage imputation model based on LSTM is designed. The imputation units in the encoder perform the first stage of imputation based on historical information and attribute correlations, after which the encoder extracts the feature representation of the imputed time series. In the second stage, the decoder utilizes this feature representation, which encapsulates the dynamic features, to reconstruct the input time series. Additionally, the introduced temporal attention layer enables the decoder to focus on highly relevant inputs when reconstructing a specific time step. Compared to prediction-based imputation methods, RATAI's two-stage imputation approach can more comprehensively extract information from time series, achieving more precise imputation outcomes. Moreover, RATAI can be trained directly on incomplete time series data without requiring initial values for missing entries, which avoids the effect of inappropriate initial values on model training and enhances the robustness of the proposed model. Comparative experiments and ablation studies demonstrate the superiority of RATAI in imputation performance and the effectiveness of the model design.
References
Bashir F, Wei HL (2018) Handling missing data in multivariate time series using a vector autoregressive model-imputation (VAR-IM) algorithm. Neurocomputing 276:23–30. https://doi.org/10.1016/j.neucom.2017.03.097
Cao W, Wang D, Li J, Zhou H, Li L, Li Y (2018) BRITS: bidirectional recurrent imputation for time series. Adv Neural Inf Process Syst 31
Che Z, Purushotham S, Cho K, Sontag D, Liu Y (2018) Recurrent neural networks for multivariate time series with missing values. Sci Rep 8(1):6085. https://doi.org/10.1038/s41598-018-24271-9
Choong MK, Charbit M, Yan H (2009) Autoregressive-model-based missing value estimation for DNA microarray time series data. IEEE Trans Inf Technol Biomed 13(1):131–137. https://doi.org/10.1109/TITB.2008.2007421
Cui Z, Ke R, Pu Z, Wang Y (2020) Stacked bidirectional and unidirectional LSTM recurrent neural network for forecasting network-wide traffic state with missing values. Transp Res Pt C-Emerg Technol 118:102674. https://doi.org/10.1016/j.trc.2020.102674
Dean RC, Varshney LR (2021) Optimal recovery of missing values for non-negative matrix factorization. IEEE Open J Signal Process 2:207–216. https://doi.org/10.1109/OJSP.2021.3069373
Du S, Li T, Yang Y, Horng SJ (2020) Multivariate time series forecasting via attention-based encoder-decoder framework. Neurocomputing 388:269–279. https://doi.org/10.1016/j.neucom.2019.12.118
Du W, Cote D, Liu Y (2023) Saits: self-attention-based imputation for time series. Expert Syst Appl. https://doi.org/10.1016/j.eswa.2023.119619
Fernandez-Fraga S, Aceves-Fernandez M, Pedraza-Ortega J, Ramos-Arreguin J (2018) Screen task experiments for eeg signals based on ssvep brain computer interface. Int J Adv Res 6(2):1718–1732
Greff K, Srivastava RK, Koutnik J, Steunebrink BR, Schmidhuber J (2017) LSTM: a search space odyssey. IEEE Trans Neural Netw Learn Syst 28(10):2222–2232. https://doi.org/10.1109/TNNLS.2016.2582924
Ho NH, Yang HJ, Kim J, Dao DP, Park HR, Pant S (2022) Predicting progression of Alzheimer’s disease using forward-to-backward bi-directional network with integrative imputation. Neural Netw 150:422–439. https://doi.org/10.1016/j.neunet.2022.03.016
Josse J, Pagés J, Husson F (2011) Multiple imputation in principal component analysis. Adv Data Anal Classif 5(3):231–246. https://doi.org/10.1007/s11634-011-0086-7
Jung S, Moon J, Park S, Rho S, Baik SW, Hwang E (2020) Bagging ensemble of multilayer perceptrons for missing electricity consumption data imputation. Sensors 20(6):1772. https://doi.org/10.3390/s20061772
Liu T, Wei H, Zhang K (2018) Wind power prediction with missing data using Gaussian process regression and multiple imputation. Appl Soft Comput 71:905–916
Lee JY, Styczynski MP (2018) NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics 14(12):153. https://doi.org/10.1007/s11306-018-1451-8
Ma Q, Li S, Shen L, Wang J, Wei J, Yu Z, Cottrell GW (2020) End-to-end incomplete time-series modeling from linear memory of latent variables. IEEE T Cybern 50(12):4908–4920. https://doi.org/10.1109/TCYB.2019.2906426
Ma Q, Li S, Cottrell GW (2022) Adversarial joint-learning recurrent neural network for incomplete time series classification. IEEE Trans Pattern Anal Mach Intell 44(4):1765–1776. https://doi.org/10.1109/TPAMI.2020.3027975
Miao X, Wu Y, Wang J, Gao Y, Mao X, Yin J (2021) Generative semi-supervised learning for multivariate time series imputation. In: Proceedings of the AAAI conference on artificial intelligence, pp 8983–8991
Moews B, Herrmann JM, Ibikunle G (2019) Lagged correlation-based deep learning for directional trend change prediction in financial time series. Expert Syst Appl 120:197–206. https://doi.org/10.1016/j.eswa.2018.11.027
Oehmcke S, Zielinski O, Kramer O (2016) kNN ensembles with penalized DTW for multivariate time series imputation. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp 2774–2781, https://doi.org/10.1109/IJCNN.2016.7727549
Pan H, Ye Z, He Q, Yan C, Yuan J, Lai X, Su J, Li R (2022) Discrete missing data imputation using multilayer perceptron and momentum gradient descent. Sensors 22(15):5645. https://doi.org/10.3390/s22155645
Pan Z, Wang Y, Wang K, Chen H, Yang C, Gui W (2023) Imputation of missing values in time series using an adaptive-learned median-filled deep autoencoder. IEEE T Cybern 53(2):695–706. https://doi.org/10.1109/TCYB.2022.3167995
Ribeiro C, Freitas AA (2021) A data-driven missing value imputation approach for longitudinal datasets. Artif Intell Rev 54(8):6277–6307. https://doi.org/10.1007/s10462-021-09963-5
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Parallel distributed processing: explorations in the microstructure of cognition, vol 1. MIT Press, Cambridge, pp 318–362
Silva-Ramírez EL, Pino-Mejías R, López-Coello M (2015) Single imputation with multilayer perceptron and multiple imputation combining multilayer perceptron and k-nearest neighbours for monotone patterns. Appl Soft Comput 29:65–74. https://doi.org/10.1016/j.asoc.2014.09.052
Suo Q, Zhong W, Xun G, Sun J, Chen C, Zhang A (2020) GLIMA: Global and Local Time Series Imputation with Multi-directional Attention Learning. In: 2020 IEEE International Conference on Big Data (Big Data), pp 798–807, https://doi.org/10.1109/BigData50022.2020.9378408
Sun B, Ma L, Cheng W, Wen W, Goswami P, Bai G (2017) An improved k-nearest neighbours method for traffic time series imputation. In: 2017 Chinese Automation Congress (CAC), pp 7346–7351
Weerakody PB, Wong KW, Wang G (2023) Cyclic gate recurrent neural networks for time series data with missing values. Neural Process Lett 55(2):1527–1554. https://doi.org/10.1007/s11063-022-10950-2
Yang B, Kang Y, Yuan Y, Huang X, Li H (2021) ST-LBAGAN: Spatio-temporal learnable bidirectional attention generative adversarial networks for missing traffic data imputation. Knowledge-Based Syst 215:106705. https://doi.org/10.1016/j.knosys.2020.106705
Yoon J, Zame WR, van der Schaar M (2019) Estimating missing data in temporal data streams using multi-directional recurrent neural networks. IEEE Trans Biomed Eng 66(5):1477–1490. https://doi.org/10.1109/TBME.2018.2874712
Zamanzadeh DJ, Petousis P, Davis TA, Nicholas SB, Norris KC, Tuttle KR, Bui AAT, Sarrafzadeh M (2021) Autopopulus: A Novel Framework for Autoencoder Imputation on Large Clinical Datasets. In: 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp 2303–2309, https://doi.org/10.1109/EMBC46164.2021.9630135
Zamora-Martinez F, Romeu P, Botella-Rocamora P, Pardo J (2014) On-line learning of indoor temperature forecasting models towards energy efficiency. Energy Build 83(SI):162–172. https://doi.org/10.1016/j.enbuild.2014.04.034
Zeng A, Chen M, Zhang L, Xu Q (2023) Are transformers effective for time series forecasting? In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 11121–11128
Zhang Y, Zhou B, Cai X, Guo W, Ding X, Yuan X (2021) Missing value imputation in multivariate time series with end-to-end generative adversarial networks. Inf Sci 551:67–82. https://doi.org/10.1016/j.ins.2020.11.035
Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, Zhang W (2021) Informer: Beyond efficient transformer for long sequence time-series forecasting. In: Proceedings of the AAAI conference on artificial intelligence, pp 11106–11115
Zhou Y, Jiang J, Yang SH, He L, Ding Y (2021) MuSDRI: multi-seasonal decomposition based recurrent imputation for time series. IEEE Sens J 21(20):23213–23223
Acknowledgements
This work is supported by the National Natural Science Foundation of China (62076050, 62073056), the National Key R&D Program of China (2022YFF0610900) and the Fundamental Research Funds for the Central Universities (DUT22LAB129).
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.