Empirical Analysis of Data Sampling-Based Ensemble Methods in Software Defect Prediction

  • Conference paper
Computational Science and Its Applications – ICCSA 2022 Workshops (ICCSA 2022)

Abstract

This study investigates the use of data sampling and ensemble techniques to alleviate the class imbalance problem in software defect prediction (SDP). Specifically, it examines the effect of data sampling techniques on the performance of ensemble methods. The experiments were conducted using software defect datasets from the NASA software archives. Five data sampling methods, comprising three over-sampling techniques (SMOTE, ADASYN, and ROS) and two under-sampling techniques (RUS and NearMiss), were combined with bagging and boosting ensemble methods based on Naïve Bayes (NB) and Decision Tree (DT) classifiers. The predictive performance of the developed models was assessed using the area under the curve (AUC) and the Matthews correlation coefficient (MCC). The experimental findings show that applying data sampling methods further enhanced the predictive performance of the studied ensemble methods. In particular, BoostedDT on the ROS-balanced datasets recorded the highest average AUC (0.995) and MCC (0.918) values. Apart from the NearMiss method, which worked best with the bagging ensemble method, the studied data sampling methods worked well with the boosting ensemble technique. In addition, some of the developed models, particularly BoostedDT, outperformed existing SDP models. Consequently, combining data sampling techniques with ensemble methods may not only improve the prediction performance of SDP models but also provide a plausible solution to the latent class imbalance issue in SDP processes.
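To make two of the terms in the abstract concrete, the sketch below shows random over-sampling (ROS) and the Matthews correlation coefficient (MCC) in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `random_oversample` and `mcc`, the fixed seed, and the toy data are all hypothetical, and the abstract does not say which tooling the study used. In a typical setup the over-sampling would be applied to the training split only, so that duplicated rows do not leak into the evaluation data.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Random over-sampling (ROS): duplicate randomly chosen rows of each
    smaller class until every class is as large as the largest one.
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def mcc(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for binary labels.
    Returns 0.0 when the denominator is zero (a common convention)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy imbalanced data: 8 non-defective modules (0), 2 defective ones (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)  # both classes now have 8 rows
```

MCC uses all four cells of the confusion matrix, which is why it is often preferred over plain accuracy on imbalanced data: a model that always predicts the majority class scores 0 here rather than a misleadingly high value.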

Corresponding author

Correspondence to Abdullateef O. Balogun.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Balogun, A.O. et al. (2022). Empirical Analysis of Data Sampling-Based Ensemble Methods in Software Defect Prediction. In: Gervasi, O., Murgante, B., Misra, S., Rocha, A.M.A.C., Garau, C. (eds) Computational Science and Its Applications – ICCSA 2022 Workshops. ICCSA 2022. Lecture Notes in Computer Science, vol 13381. Springer, Cham. https://doi.org/10.1007/978-3-031-10548-7_27

  • DOI: https://doi.org/10.1007/978-3-031-10548-7_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-10547-0

  • Online ISBN: 978-3-031-10548-7

  • eBook Packages: Computer Science (R0)
