Elsevier

Knowledge-Based Systems

Volume 215, 5 March 2021, 106743
Nonlinear compensation algorithm for multidimensional temporal data: A missing value imputation for the power grid applications

https://doi.org/10.1016/j.knosys.2021.106743

Abstract

In the smart grid, missing values impair real-time grid monitoring and bias the conclusions drawn from grid data mining. Analysis of smart grid data shows that every variable exhibits both a global variation and a local variation. Based on these characteristics, a novel imputation method combining statistical and machine learning is proposed. It captures the global trend by one-dimensional interpolation of the variable of interest and the local variation by linear compensation using multidimensional variables. Using KPCA, the multidimensional nonlinear variables are mapped into a feature space, and the resulting new variables are linearly coupled with the variable of interest. These new variables, together with the multidimensional linear variables, are then used for the linear compensation. The comparative experiment indicates that the proposed method outperforms commonly used methods, reducing the RMSE by 29.19% and the MAE by 44.73% on average and achieving the R² closest to 1. A test on a public dataset shows that the proposed method still performs well. Finally, a sensitivity analysis on the missing rate shows that the imputation error of the proposed method remains steady for all variables as the missing rate increases from 5% to 10%.

Introduction

The smart grid is becoming more intelligent through the adoption of artificial intelligence technologies. Real-time analytics of the high volumes of streaming data from a variety of sources in the power grid is crucial for safe, reliable and efficient system operation [1]. Many sensors, such as electricity meters and thermometers, have been installed for continuous and cooperative monitoring of the power grid state. These sensors generate massive time series data that help operators and controllers understand grid conditions better. However, due to many natural and human factors, such as power outages, malfunctioning devices, sensor failures and data transfer errors, some sensor observations are lost at unexpected moments. Missing values are therefore common in many time series datasets of smart grid systems, and they affect both real-time grid monitoring and the conclusions drawn from grid data mining [2].

Contingencies such as equipment failures or natural disasters may cause values to be missing for a certain period [3]. These missing values are difficult to recover and have become a persistent problem for the smart grid [4]. Some instantaneous missing values can be handled by deletion or imputation approaches [5]. Deletion is easy to implement: it simply excludes missing values from the dataset and performs an available-case analysis. If many variables have unknown values, however, deletion may lose important information, which is especially detrimental when extracting the correlation between the data and the time dimension. It also reduces the sample size and leaves few complete cases. In subsequent grid data mining, this leads to unpredictable bias and estimates with larger standard errors [6].

Apart from discarding incomplete data, imputation is an alternative approach for handling missing values: it replaces missing values with appropriate values derived from the available information [7]. Simple and frequently used imputation methods include the statistical learning-based mean imputation and median imputation and the machine learning-based KNN imputation. These methods are suitable for variables whose missing values depend on the variable itself rather than on other variables [8]. For example, mean imputation replaces missing values with the mean of the existing observations of the relevant variable [9]. Although it is convenient to implement, simple imputation leads to inefficient analyses and commonly produces severely biased estimates of the association investigated [10]. More sophisticated imputation techniques exist for cases where the variable of interest depends on other variables, so that several independent or dependent variables work together to achieve better prediction. Interpolation and regression are two classical examples.
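As a rough illustration of the simple methods named above, the following sketch fills gaps in a single series with the mean, the median, or the average of the k nearest observed values. The function names, the toy series, and the use of the time index as the KNN distance are assumptions made for this example; they are not taken from the cited works.

```python
from statistics import mean, median

def impute_constant(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def impute_knn_time(values, k=2):
    """Fill each None with the average of the k observed values closest in time.

    Assumption: distance is the absolute difference of positions in the series.
    """
    observed = [(i, v) for i, v in enumerate(values) if v is not None]
    filled = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
            continue
        nearest = sorted(observed, key=lambda iv: abs(iv[0] - i))[:k]
        filled.append(mean(val for _, val in nearest))
    return filled

# Hypothetical toy series with two gaps and one large value at the end.
load = [10.0, 12.0, None, 11.0, None, 30.0]
mean_filled = impute_constant(load, "mean")
median_filled = impute_constant(load, "median")
knn_filled = impute_knn_time(load, k=2)
```

On this toy series the mean fill is pulled upward by the single large observation, while the median and the nearest-neighbour fills stay closer to the local level, which hints at why a single global statistic can bias later analysis.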

Interpolation is a statistical learning-based approach that replaces the missing value with a weighted mean of one-dimensional interpolations [11], which are obtained by interpolating the variable of interest and other relevant variables at the moments adjacent to the missing value [12]. The advantage of this method is its computational efficiency in both time and space. However, it depends heavily on the relevance between variables, and because of its linear strategy it ignores nonlinear coupling effects [13]. Thus, it misses important information, which reduces the imputation performance.
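A minimal sketch of the one-dimensional building block is given here, assuming plain linear interpolation along the time axis of a single variable. The weighted combination across several variables described in [11], [12] is not reproduced, and the function name and data are hypothetical.

```python
def interpolate_linear(times, values):
    """Fill None entries by linear interpolation between the nearest observed neighbours.

    Assumes `times` is sorted in ascending order. Gaps at the start or end
    of the series are filled with the nearest observed value.
    """
    known = [(t, v) for t, v in zip(times, values) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")
    filled = []
    for t, v in zip(times, values):
        if v is not None:
            filled.append(v)
            continue
        left = [(tk, vk) for tk, vk in known if tk < t]
        right = [(tk, vk) for tk, vk in known if tk > t]
        if left and right:
            (t0, v0), (t1, v1) = left[-1], right[0]
            weight = (t - t0) / (t1 - t0)
            filled.append(v0 + weight * (v1 - v0))
        else:
            filled.append(left[-1][1] if left else right[0][1])
    return filled

# Hypothetical daily values with gaps on days 1, 2 and 5.
days = [0, 1, 2, 3, 4, 5]
line_loss = [2.0, None, None, 5.0, 4.0, None]
filled = interpolate_linear(days, line_loss)
```

The estimate uses only the two observed neighbours of each gap, which is what makes the method cheap and also why it cannot reflect nonlinear coupling with other variables.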

Regression is another group of statistical learning-based approaches [14]. It replaces the missing value with the prediction from a linear regression model, local regression models, quadratic regression models, or a linear regression model solved by Bayesian theory, etc. [15], [16]. For example, linear regression makes the prediction from the linear relationship between the dependent and independent variables in historical data. Because more variables work together than in linear interpolation, better performance is expected [17]. These regression approaches are easy to build and apply, but the form of the function must be determined beforehand and its parameters need to be estimated by model training. If the operating condition varies, the selected model or its parameters cannot capture the data trend well, and the accuracy has proved unreliable under different operating conditions [18].
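The regression idea can be sketched as follows, assuming a single fully observed predictor and ordinary least squares. The variable names and numbers are hypothetical, and the fixed form y = a + b * x is exactly the kind of choice that has to be made before training.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for the fixed model form y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def impute_regression(x, y):
    """Train on the complete pairs, then predict y wherever it is None."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]

# Hypothetical predictor (fully observed) and target with one gap.
current = [1.0, 2.0, 3.0, 4.0, 5.0]
loss = [2.1, 3.9, None, 8.1, 9.9]
filled = impute_regression(current, loss)
```

If the relationship between the two variables changes with the operating condition, the fitted intercept and slope no longer describe the data, which is the weakness noted above.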

Generally, most statistical learning-based approaches are able to take advantage of the statistical features of the grid data. Nevertheless, various factors limit the accuracy of imputation. In contrast, machine learning-based approaches try to take full advantage of the data, and the form of the function is usually determined by the data. They are suitable for imputing variables that depend on themselves or on other variables.

[19] proposed a novel tensor-based algorithm, specifically an iterative tensor decomposition approach, which utilized the multidimensional inherent correlation of traffic data to detect and impute missing data. Based on the work of [20], [21] used a statistical learning method, biclustering, to simultaneously divide the rows and columns of the rectangular data array into similar subsets in order to fill in the missing values. An extended tensor factorization model was proposed, with a fully Bayesian framework via variational Bayes used to automatically learn the model parameters [22]. Important issues of Bayesian inference, including prior construction, posterior computation, model comparison, and sensitivity analysis, were discussed in [23].

The performance of machine learning-based approaches stabilizes once enough data is available. No matter how much more data is added, the models remain generally stable or change only slightly, and so does the accuracy. In contrast, deep learning approaches can achieve better accuracy by training on large amounts of data.

[24] used an RNN model to fill in the missing values of multivariate time series data. Training autoencoders to model incomplete data effectively reduced the complexity of data modeling [25]. [26] proposed the Dynamic-LRNN neural network, which divides the original data into two categories for training and predicting the missing values. [27] applied a high-speed linear neural-like structure using the successive geometric transformation model. A new sequence-to-sequence imputation model (SSIM) using a long short-term memory network was proposed to recover missing data; it can utilize both past and future information for a given time [28]. Inspired by image-to-image translation, [29] applied an end-to-end U-Net convolutional network to missing data reconstruction.

Compared with the statistical learning-based methods, the machine learning and deep learning-based methods consider more multidimensional historical data and achieve higher prediction accuracy for missing values [30]. However, it is difficult to collect missing-data cases that cover all dimensions of the data. In addition, training a large number of complex interpolation models and then testing them takes a long time, so these methods are not suitable for real-time data analysis [31].

In the power system, there are many kinds of field test data, each with its own special characteristics, and no public training and testing datasets are available. Deep learning-based approaches therefore cannot handle the imputation of missing values, owing to the lack of sufficient samples [32].

In this paper, we aim to analyze the data characteristics of the grid system and the mechanism by which observations go missing, and to customize an autofit imputation approach for this domain. The rest of this article is arranged as follows. Section 2 analyzes the data and the missing data of the grid system. Based on that, Section 3 describes in detail an imputation method based on statistical and machine learning for these data. Experimental results and analysis are presented in Section 4. Section 5 gives the summary and directions for future improvement.

Analysis on the data and the data missing of grid system

This paper uses data from the Gule-I-transmission-line in one province of China for experiments and analysis. Our dataset consists of information relevant to line loss, recording 16 temporal variables over 914 days, and can be divided into two parts. The first part is the power data of the transmission line, which comes from the energy measurement device at the gateway and includes the active and reactive power of the transmission line. The second part is the meteorological data of the transmission

A statistical and machine learning fusion imputation method for the grid data

Motivated by the aforementioned global and local variations of temporal views, we propose a new imputation method that fuses statistical and machine learning, taking advantage of one-dimensional interpolation and a machine learning-based approach. This method can capture both the global variation of the temporal variable of interest itself and the local variation with high accuracy. The proposed algorithm is as follows.

After a median filter removes the salt-and-pepper noise from the whole dataset, for
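As a rough illustration of the median-filter denoising step, the sketch below applies a sliding-window median to one series. The window length, the edge handling (the window shrinks at the boundaries), and the toy signal are assumptions for this example and are not specified in the text above.

```python
from statistics import median

def median_filter(values, window=3):
    """Sliding-window median; the window shrinks at the edges of the series."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]

# Hypothetical signal with one high spike (9.0) and one low spike (0.0).
signal = [1.0, 1.1, 9.0, 1.2, 1.3, 0.0, 1.4]
smoothed = median_filter(signal, window=3)
```

Because the median ignores a single extreme value inside the window, isolated spikes are removed while the slowly varying level of the series is largely preserved.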

Making datasets

In our experiment, two datasets are used. One is the dataset of information relevant to line loss, which records 16 variables over 914 days. Of these, one variable is the line loss rate, and all the others are influencing factors. A common public dataset on graduate admissions is also used, with 8 features and 500 instances. To make the MCAR datasets, we arbitrarily choose a variable in the original data as the variable Y to be imputed, then randomly and discontinuously discard 7%
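The general idea of building an MCAR test set and scoring an imputation against the hidden ground truth can be sketched as follows. This is only an assumption-laden illustration: the helper names, the synthetic series, the fixed seed, and the mean-fill baseline are hypothetical, and the sketch does not enforce that discarded positions are non-adjacent.

```python
import math
import random

def make_mcar(values, rate, seed=0):
    """Hide a fraction `rate` of the entries uniformly at random (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, sorted(hidden)

def rmse(truth, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def mae(truth, pred):
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def r2(truth, pred):
    """Coefficient of determination; undefined if all truth values are equal."""
    t_bar = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - t_bar) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot

# Hypothetical variable Y: a synthetic series of 100 points with 7% hidden.
series = [math.sin(i / 5.0) for i in range(100)]
masked, hidden = make_mcar(series, rate=0.07)

# Baseline imputation for the demo: fill every gap with the observed mean.
observed = [v for v in masked if v is not None]
fill = sum(observed) / len(observed)
truth = [series[i] for i in hidden]
pred = [fill] * len(hidden)
scores = (rmse(truth, pred), mae(truth, pred), r2(truth, pred))
```

Scoring only the hidden positions keeps the comparison fair, since every method sees the same observed entries and is judged on the same discarded ones.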

Conclusion

In this paper, a new imputation method for the temporal monitoring data from the grid system has been proposed. Analysis of these data shows that the average missing rate is around 7%, that the observed missing pattern is MCAR, and that every variable contains a global variation and a local variation. After filtering impulse noise with a median filter, we use one-dimensional interpolation to capture the global variation of the temporal variable of interest itself and use a machine learning-based approach to

CRediT authorship contribution statement

Tao Su: Conceptualization, Methodology, Software, Writing - original draft, Visualization. Ying Shi: Conceptualization, Validation, Formal analysis, Writing - review & editing, Supervision. Jicheng Yu: Data curation, Writing - review & editing, Funding acquisition. Changxi Yue: Data curation, Writing - review & editing, Project administration, Funding acquisition. Feng Zhou: Data curation, Writing - review & editing, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported by State Grid Corporation of China Headquarter Science and Technology Project (JL71-18-002).

Tao Su received the B.E. degree in Automation from the Wuhan University of Technology, Hubei, China, where he is currently studying for the M.A. degree in control science and engineering.

His research interests include big data systems, object detection and semantic segmentation.

References (43)

  • Chen X. et al. Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model. Transp. Res. C (2019)
  • Bianchi F.M. et al. Learning representations of multivariate time series with missing data. Pattern Recognit. (2019)
  • Lai X. et al. Imputations of missing values using a tracking-removed autoencoder trained with incomplete data. Neurocomputing (2019)
  • Turabieh H. et al. Dynamic L-RNN recovery of missing data in IoMT applications. Future Gener. Comput. Syst. (2018)
  • Tkachenko R. et al. A non-iterative neural-like framework for missing data imputation. Procedia Comput. Sci. (2019)
  • Choudhury S.J. et al. Imputation of missing data with neural networks for classification. Knowl.-Based Syst. (2019)
  • Miglani A. et al. Deep learning models for traffic flow prediction in autonomous vehicles: A review, solutions, and challenges. Veh. Commun. (2019)
  • Bennett D.A. How can I deal with missing data in my study? Aust. New Zealand J. Public Health (2001)
  • Kurdija A.S. et al. Efficient global correlation measures for a collaborative filtering dataset. Knowl.-Based Syst. (2018)
  • H. Xiao, W. Xinying, F. Hao, Requirements analysis and application research of big data in power network dispatching...
  • Yi H. et al. Real-time detection of false data injection in smart grid networks: An adaptive CUSUM method and analysis. IEEE Syst. J. (2016)

Ying Shi received the Ph.D. degree in marine engineering from Wuhan University of Technology, Hubei, China.

She is a Professor of artificial intelligence with Wuhan University of Technology, Hubei, China. Her current research interests include big data systems, grid security, machine learning, and deep learning.

Jicheng Yu received his B.S. degree from Huazhong University of Science and Technology, Wuhan, China, in 2010, and his M.S. and Ph.D. degrees from Arizona State University, Tempe, USA, in 2013 and 2017.

Currently, he is a research engineer at China Electric Power Research Institute, Wuhan, China. His research interests include sensors, smart meters, and big data analytics in the power system.

Changxi Yue received his B.S. and M.S. degrees in electrical engineering from Xi’an Jiaotong University, Xi’an, China, in 2004 and 2006.

Currently, he leads the High-voltage and High-current Technique Group of the Department of Metrology at China Electric Power Research Institute, Wuhan, China. He is a senior research engineer with expertise in high-voltage and high-current test and measurement.

Feng Zhou received his B.S. and M.S. degrees in automation from Hefei University of Technology, Hefei, China, in 2002 and 2006, and his Ph.D. degree in electrical and electronics engineering from Huazhong University of Science and Technology, Wuhan, China, in 2019.

Currently, he is the Site Director of the Department of Metrology at China Electric Power Research Institute, Wuhan, China. Dr. Zhou is a professor-level senior engineer. His expertise is in high-voltage insulation, high-voltage metrology, and energy internet.

This document is the result of the research project funded by State Grid Corporation of China Headquarter Science and Technology Project.
