Informative Software Defect Data Generation and Prediction: INF-SMOTE

  • Conference paper
Advances in Computing and Data Sciences (ICACDS 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1613)

Abstract

Highly imbalanced data typically make accurate prediction difficult. Unfortunately, software defect datasets tend to have far fewer defective modules than non-defective ones. Synthetic oversampling approaches such as SMOTE address this concern by creating new minority (defective) modules to balance the class distribution before a model is trained. Despite their success, these approaches have shortcomings: 1) they over-generalize and produce near-duplicate (less diverse) instances because noisy samples are oversampled, and 2) they increase the overlap between classes around the class boundaries. This paper introduces INF-SMOTE (Informative Synthetic Minority Oversampling Technique), a novel and efficient synthetic oversampling approach for software defect datasets that targets all of these shortcomings simultaneously. INF-SMOTE identifies the informative minority samples that are appropriate for oversampling. The process has two steps: 1) it identifies and removes noisy and overlapping samples from the borderline minority instances based on the sampling seeds, and 2) it generates synthetic samples from the informative minority samples. Experiments were conducted on 12 releases of SDP (Software Defect Prediction) datasets from the NASA repository. Compared with state-of-the-art techniques, INF-SMOTE improves defect prediction performance.
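The two-step idea in the abstract (filter, then interpolate) can be illustrated with a minimal sketch. This is not the authors' INF-SMOTE algorithm, whose exact rules are not given in the abstract. It assumes a Borderline-SMOTE-style filter (a minority sample is treated as noisy when all of its k nearest neighbours belong to the majority class) followed by plain SMOTE interpolation between the surviving seeds. The function names, the parameter `k`, and the toy data are hypothetical.

```python
import numpy as np

def knn_indices(points, query_idx, k):
    """Indices of the k rows of `points` closest (Euclidean) to row `query_idx`, excluding itself."""
    d = np.linalg.norm(points - points[query_idx], axis=1)
    d[query_idx] = np.inf  # never return the query point as its own neighbour
    return np.argsort(d)[:k]

def informative_minority(X, y, k=5, minority=1):
    """Assumed filter: keep minority samples whose k nearest neighbours are not all majority."""
    keep = []
    for i in np.flatnonzero(y == minority):
        neighbours = knn_indices(X, i, k)
        if np.sum(y[neighbours] != minority) < k:
            keep.append(i)
    return np.array(keep, dtype=int)

def oversample(X_seed, n_new, k=5, rng=None):
    """SMOTE-style interpolation between each seed and one of its nearest seed neighbours."""
    if len(X_seed) < 2:
        raise ValueError("need at least two seed samples to interpolate")
    rng = np.random.default_rng(rng)
    k = min(k, len(X_seed) - 1)
    synthetic = np.empty((n_new, X_seed.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_seed))               # pick a random seed
        nb = rng.choice(knn_indices(X_seed, i, k))  # pick one of its neighbours
        gap = rng.random()                          # interpolation factor in [0, 1)
        synthetic[j] = X_seed[i] + gap * (X_seed[nb] - X_seed[i])
    return synthetic

# Toy data: five majority points, a minority cluster of four, and one
# minority point lying inside the majority region.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
              [5, 5], [5, 6], [6, 5], [6, 6], [0.4, 0.4]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

keep = informative_minority(X, y, k=3)
synthetic = oversample(X[keep], n_new=6, k=3, rng=0)
```

In this toy setup the minority point inside the majority region is discarded before oversampling, so every synthetic sample is a convex combination of points from the clean minority cluster.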

Author information

Corresponding author

Correspondence to G. Rekha.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rekha, G., Shailaja, K., Jatoth, C. (2022). Informative Software Defect Data Generation and Prediction: INF-SMOTE. In: Singh, M., Tyagi, V., Gupta, P.K., Flusser, J., Ören, T. (eds) Advances in Computing and Data Sciences. ICACDS 2022. Communications in Computer and Information Science, vol 1613. Springer, Cham. https://doi.org/10.1007/978-3-031-12638-3_16

  • DOI: https://doi.org/10.1007/978-3-031-12638-3_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12637-6

  • Online ISBN: 978-3-031-12638-3

  • eBook Packages: Computer Science (R0)
