Abstract
Most research on disk failure prediction is based on Self-Monitoring, Analysis and Reporting Technology (SMART) data. Because failed disks account for only a very small share of the samples, the resulting data sets are highly imbalanced, which hinders classification, especially on tabular data. To address this imbalance, this paper proposes using the Conditional Tabular GAN (CTGAN) to augment disk failure data. CTGAN generates synthetic samples that match the feature distribution of the real data. Classic models (BP-ANN, SVM, decision tree, random forest) are used as reference classifiers to verify the effectiveness of CTGAN. The experimental results show that CTGAN performs well in augmenting imbalanced datasets.
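The obstacle that imbalance creates for classification can be illustrated with a small worked example. The sketch below is not the paper's method or data: the disk counts and the helper names `confusion_metrics` and `synthetic_rows_needed` are hypothetical and chosen only for illustration. It shows that a classifier labelling every disk as healthy scores high accuracy but zero recall on failures, and it computes how many synthetic failure rows a generator such as CTGAN would need to supply to reach a chosen class ratio.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return accuracy, recall and precision for the positive (failed) class."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision}


def synthetic_rows_needed(n_majority, n_minority, target_ratio=1.0):
    """Synthetic minority rows needed so that minority / majority >= target_ratio."""
    needed = math.ceil(target_ratio * n_majority) - n_minority
    return max(needed, 0)


# Hypothetical counts (assumption): 10,000 healthy disks and 50 failed disks.
n_healthy, n_failed = 10_000, 50

# A trivial classifier that predicts "healthy" for every disk.
baseline = confusion_metrics(tp=0, fp=0, tn=n_healthy, fn=n_failed)

# Rows a generator would have to produce for a 1:1 class balance.
to_generate = synthetic_rows_needed(n_healthy, n_failed)

print(f"accuracy={baseline['accuracy']:.4f}, recall={baseline['recall']:.1f}")
print(f"synthetic failure rows needed for 1:1 balance: {to_generate}")
```

Under these assumed counts, accuracy alone is a misleading measure, so recall and precision on the failed class are more informative when comparing classifiers trained with and without augmentation.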
Acknowledgements
The research work in this paper was supported by the Shandong Provincial Natural Science Foundation of China (Grant No. ZR2019LZH003). Peng Wu is the author to whom all correspondence should be addressed.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jia, J., Wu, P., Zhang, K., Zhong, J. (2022). Imbalanced Disk Failure Data Processing Method Based on CTGAN. In: Huang, DS., Jo, KH., Jing, J., Premaratne, P., Bevilacqua, V., Hussain, A. (eds) Intelligent Computing Theories and Application. ICIC 2022. Lecture Notes in Computer Science, vol 13394. Springer, Cham. https://doi.org/10.1007/978-3-031-13829-4_55
DOI: https://doi.org/10.1007/978-3-031-13829-4_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13828-7
Online ISBN: 978-3-031-13829-4
eBook Packages: Computer Science, Computer Science (R0)