An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment

Journal of Signal Processing Systems

Abstract

Big data techniques have been applied to the power grid for the prediction and evaluation of grid conditions. However, raw data quality rarely meets the requirements of precise data analytics, because raw data sets usually contain samples with missing data, to which common data mining models are sensitive. In addition, the raw training data from a single monitoring system, e.g. dissolved gas analysis (DGA), rarely provide enough valid instances for training, because raw data sets usually also contain noisy samples. Although classic methods such as neural networks can be used to fill the gaps left by missing data and to classify the fault type, their models often fail to fit the rules of power grid conditions. This paper presents an integrated data preprocessing framework (DPF) based on Apache Spark that improves prediction accuracy for data sets with missing data points and classification accuracy for noisy data, while meeting big data requirements. The framework combines missing data prediction, data fusion, data cleansing and fault type classification. First, the prediction model is trained using linear regression (LinR). We then propose an optimized linear method (OLR) to improve the prediction accuracy. Next, to better exploit the strong correlation among different data sources, new data features extracted using the Pearson correlation coefficient (PCC) are fused into the training data set. Principal component analysis (PCA) is then applied to reduce the side effects introduced by the new features while retaining the information that is significant for classification. Finally, classification models based on logistic regression (LogR) and support vector machine (SVM) are trained to classify the fault type of electric equipment. We test the DPF on missing data prediction and fault type classification of power transformers in a power grid system.
The experimental results show that predictors based on the proposed framework achieve lower mean square error, and classifiers obtain higher accuracy, than traditional ones. In addition, the training time required for large-scale data shows a decreasing trend. Therefore, the DPF would be a good candidate for predicting missing data and classifying fault types in power grid systems.
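
To make two of the steps named in the abstract concrete, the following is a minimal sketch in plain Python (not Spark, and not the authors' implementation) of filling a missing value with a linear regression fitted on complete samples, and of selecting features from a second data source by their Pearson correlation coefficient with a target series. The feature names (`h2`, `ch4`, `oil_temp`, `load`), the toy values, and the 0.8 selection threshold are all hypothetical and chosen only for illustration; the abstract does not specify how the PCC is turned into a selection rule.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def fill_missing(target, predictor):
    """Replace None entries of `target` with predictions from `predictor`.

    The line is fitted only on samples where the target is present.
    """
    complete = [(p, t) for p, t in zip(predictor, target) if t is not None]
    slope, intercept = fit_line([p for p, _ in complete],
                                [t for _, t in complete])
    return [t if t is not None else slope * p + intercept
            for p, t in zip(predictor, target)]


def select_correlated(target, candidates, threshold=0.8):
    """Names of candidate features with |PCC| >= threshold (assumed rule)."""
    return [name for name, values in candidates.items()
            if abs(pearson(target, values)) >= threshold]


# Hypothetical toy readings; one ch4 sample is missing.
h2 = [10, 20, 30, 40, 50]
ch4 = [21, 41, None, 81, 101]
filled = fill_missing(ch4, h2)

# Hypothetical features from a second monitoring source.
other_source = {
    "oil_temp": [30, 35, 41, 44, 50],
    "load": [5, 1, 4, 2, 3],
}
selected = select_correlated(h2, other_source)

print(filled)
print(selected)
```

Because the toy `ch4` values lie exactly on a line in `h2`, the missing sample is filled with 61.0, and only `oil_temp` passes the assumed threshold. In a Spark setting the same two steps would typically be expressed with distributed regression and correlation routines rather than Python lists.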

Acknowledgments

This paper is sponsored in part by the National High Technology and Research Development Program of China (863 Program, 2015AA050204), State Grid Science and Technology Project (520626140020, 14H100000552, SGCQDK00PJJS1400020), State Grid Corporation of China, the National Research Foundation Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) program, and the National Natural Science Foundation of China (No. 61373032).

Author information

Corresponding author

Correspondence to Yongxin Zhu.

About this article

Cite this article

Shi, W., Zhu, Y., Huang, T. et al. An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment. J Sign Process Syst 86, 221–236 (2017). https://doi.org/10.1007/s11265-016-1119-4

  • DOI: https://doi.org/10.1007/s11265-016-1119-4
