Imbalanced Disk Failure Data Processing Method Based on CTGAN

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13394)

Abstract

Most research on disk failure prediction is based on Self-Monitoring, Analysis and Reporting Technology (SMART) data. Because failed disks account for an extremely small share of that data, the resulting class imbalance hinders classification, especially on tabular data. To address this imbalance, this paper proposes using the Conditional Tabular GAN (CTGAN) to augment disk failure data. CTGAN generates synthetic samples that match the feature distribution of the real data. Classic models (BP-ANN, SVM, decision tree, random forest) serve as reference classifiers to verify the effectiveness of CTGAN. The experimental results show that CTGAN is effective for augmenting imbalanced datasets.
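
As a rough illustration of what augmenting the minority class involves (this is not the authors' code), the sketch below computes how many synthetic failed-disk rows a generator such as CTGAN would have to produce to reach a chosen failed-to-healthy ratio. The function name, the disk counts, and the target ratios are hypothetical and are not taken from the paper.

```python
import math


def synthetic_rows_needed(n_healthy: int, n_failed: int,
                          target_ratio: float = 1.0) -> int:
    """Return how many synthetic failed-disk rows to add so that
    (failed + synthetic) / healthy reaches at least target_ratio.

    All inputs are hypothetical; the paper's dataset sizes are not
    given in the abstract.
    """
    if n_healthy < 0 or n_failed < 0:
        raise ValueError("counts must be non-negative")
    if target_ratio <= 0:
        raise ValueError("target_ratio must be positive")
    # Rows required in the minority class to hit the target ratio.
    required_failed = math.ceil(target_ratio * n_healthy)
    # Never return a negative number if the class is already large enough.
    return max(0, required_failed - n_failed)


# Hypothetical example: 9,900 healthy disks and 100 failed disks.
to_balance = synthetic_rows_needed(9900, 100)          # fully balanced classes
to_half = synthetic_rows_needed(9900, 100, 0.5)        # one failed per two healthy
```

The result would then be the number of rows requested from the generator, sampled conditionally on the failed class, before the reference classifiers are trained on the combined data.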

Acknowledgements

This research was supported by the Shandong Provincial Natural Science Foundation of China (Grant No. ZR2019LZH003). Peng Wu is the author to whom all correspondence should be addressed.

Author information

Corresponding author

Correspondence to Peng Wu.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jia, J., Wu, P., Zhang, K., Zhong, J. (2022). Imbalanced Disk Failure Data Processing Method Based on CTGAN. In: Huang, DS., Jo, KH., Jing, J., Premaratne, P., Bevilacqua, V., Hussain, A. (eds) Intelligent Computing Theories and Application. ICIC 2022. Lecture Notes in Computer Science, vol 13394. Springer, Cham. https://doi.org/10.1007/978-3-031-13829-4_55

  • DOI: https://doi.org/10.1007/978-3-031-13829-4_55

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13828-7

  • Online ISBN: 978-3-031-13829-4

  • eBook Packages: Computer Science (R0)
