An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment

Shi, Weiwei; Zhu, Yongxin; Huang, Tian; Sheng, Gehao; Lian, Yong; Wang, Guoxing; Chen, Yufeng

doi:10.1007/s11265-016-1119-4

Published: 02 March 2016

Volume 86, pages 221–236 (2017)
Journal of Signal Processing Systems

Weiwei Shi¹, Yongxin Zhu¹ (ORCID: orcid.org/0000-0002-1813-1792), Tian Huang¹, Gehao Sheng¹, Yong Lian¹, Guoxing Wang¹ & Yufeng Chen²

Abstract

Big data techniques have been applied to power grids to predict and evaluate grid conditions. However, raw data quality rarely meets the requirements of precise data analytics, because raw data sets usually contain samples with missing values, to which common data mining models are sensitive. In addition, raw training data from a single monitoring system, e.g. dissolved gas analysis (DGA), rarely provide enough valid instances for training, because raw data sets usually also contain noisy samples. Although classic methods such as neural networks can be used to fill the gaps left by missing data and to classify fault types, their models often fail to fit the rules of power grid conditions. This paper presents an integrated data preprocessing framework (DPF) based on Apache Spark that improves prediction accuracy for data sets with missing data points and classification accuracy for noisy data, while meeting big data requirements. The framework combines missing data prediction, data fusion, data cleansing and fault type classification. First, a prediction model is trained using linear regression (LinR). We then propose an optimized linear method (OLR) to improve the prediction accuracy. Next, to better exploit the strong correlation among different data sources, new data features extracted using the Pearson correlation coefficient (PCC) are fused into the training data set. Principal component analysis (PCA) is then applied to reduce the side effects introduced by the new features while retaining the information that is significant for classification. Finally, classification models based on logistic regression (LogR) and support vector machines (SVM) are trained to classify the fault type of electric equipment. We test the DPF on missing data prediction and fault type classification for power transformers in a power grid system. The experimental results show that predictors based on the proposed framework achieve lower mean square error, and classifiers obtain higher accuracy, than traditional ones. In addition, the training time required for large-scale data shows a decreasing trend. The DPF is therefore a good candidate for predicting missing data and classifying fault types in power grid systems.
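
Two of the building blocks named in the abstract, the Pearson correlation coefficient and linear-regression prediction of missing values, can be illustrated with a minimal single-machine sketch. This is not the authors' DPF or OLR method and does not use Apache Spark: it assumes a single predictor variable and ordinary least squares, and the function names (`pearson`, `fit_line`, `fill_missing`), the gas labels and the readings are all hypothetical, chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def fill_missing(target, predictor):
    """Replace None entries in target with predictions from predictor.

    The line is fitted only on rows where the target value is known.
    """
    known = [(p, t) for p, t in zip(predictor, target) if t is not None]
    a, b = fit_line([p for p, _ in known], [t for _, t in known])
    return [t if t is not None else a * p + b
            for p, t in zip(predictor, target)]


# Hypothetical gas readings (illustrative values, not taken from the paper).
h2 = [10.0, 12.0, 14.0, 16.0, 18.0]
ch4 = [21.0, 25.0, None, 33.0, 37.0]   # one missing sample

filled = fill_missing(ch4, h2)
correlation = pearson(h2, filled)
```

In a Spark setting the same steps would presumably run over distributed data with library estimators rather than hand-written loops; the sketch only shows the arithmetic behind the two ideas.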

Acknowledgments

This paper is sponsored in part by the National High Technology Research and Development Program of China (863 Program, 2015AA050204), the State Grid Science and Technology Project (520626140020, 14H100000552, SGCQDK00PJJS1400020) of the State Grid Corporation of China, the National Research Foundation Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) program, and the National Natural Science Foundation of China (No. 61373032).

Author information

Authors and Affiliations

School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
Weiwei Shi, Yongxin Zhu, Tian Huang, Gehao Sheng, Yong Lian & Guoxing Wang
Electric Power Research Institute of Shandong Power Supply Company of State Grid, Shandong, China
Yufeng Chen

Corresponding author

Correspondence to Yongxin Zhu.

About this article

Cite this article

Shi, W., Zhu, Y., Huang, T. et al. An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment. J Sign Process Syst 86, 221–236 (2017). https://doi.org/10.1007/s11265-016-1119-4

Received: 13 October 2015
Revised: 28 December 2015
Accepted: 17 February 2016
Published: 02 March 2016
Issue Date: March 2017
DOI: https://doi.org/10.1007/s11265-016-1119-4
