Empirical Analysis of Data Sampling-Based Ensemble Methods in Software Defect Prediction

Balogun, Abdullateef O.; Odejide, Babajide J.; Bajeh, Amos O.; Alanamu, Zubair O.; Usman-Hamza, Fatima E.; Adeleke, Hammid O.; Mabayoje, Modinat A.; Yusuff, Shakirat R.

doi:10.1007/978-3-031-10548-7_27

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13381)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

This research investigates the use of data sampling and ensemble techniques to alleviate the class imbalance problem in software defect prediction (SDP). Specifically, it examines the effect of data sampling techniques on the performance of ensemble methods. The experiments were conducted using software defect datasets from the NASA software archives. Five data sampling methods, comprising three over-sampling techniques (SMOTE, ADASYN, and ROS) and two under-sampling techniques (RUS and NearMiss), were combined with bagging and boosting ensemble methods based on Naïve Bayes (NB) and Decision Tree (DT) classifiers. The predictive performance of the developed models was assessed using the area under the curve (AUC) and the Matthews correlation coefficient (MCC). The experimental findings show that data sampling further enhanced the predictive performance of the studied ensemble methods. In particular, BoostedDT on the ROS-balanced datasets recorded the highest average AUC (0.995) and MCC (0.918) values. Apart from NearMiss, which worked best with the bagging ensemble method, the studied data sampling methods worked well with the boosting ensemble technique. Some of the developed models, particularly BoostedDT, also showed better prediction performance than existing SDP models. Consequently, combining data sampling techniques with ensemble methods may not only improve SDP model prediction performance but also provide a plausible solution to the latent class imbalance issue in SDP processes.
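To make two of the terms in the abstract concrete, the following is a minimal sketch of random over-sampling (ROS) and of the MCC computation for a binary defect label. It is not the authors' implementation: the function names `random_oversample` and `mcc`, the fixed seed, and the toy data are assumptions made for illustration, and the sketch uses only the Python standard library rather than any particular machine learning toolkit.

```python
import math
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Random over-sampling (ROS): duplicate randomly chosen rows of each
    smaller class until every class matches the size of the largest class.
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of this class in the original data are the sampling pool.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def mcc(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for binary labels.
    Returns 0.0 when the denominator is zero (a common convention)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy imbalanced data: 2 defective (1) and 8 non-defective (0) modules.
    rows = [[i] for i in range(10)]
    labels = [1, 1] + [0] * 8
    _, balanced_labels = random_oversample(rows, labels)
    print(Counter(balanced_labels))
    print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

In an actual experiment the over-sampling step would normally be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.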

Author information

Authors and Affiliations

Department of Computer Science, University of Ilorin, Ilorin, PMB 1515, Nigeria
Abdullateef O. Balogun, Babajide J. Odejide, Amos O. Bajeh, Fatima E. Usman-Hamza, Hammid O. Adeleke & Modinat A. Mabayoje
Department of Computer and Information Sciences, Universiti Teknologi PETRONAS, Bandar Seri Iskandar, 32610, Perak, Malaysia
Abdullateef O. Balogun
Computer Services and Information Technology (COMSIT), University of Ilorin, Ilorin, PMB 1515, Nigeria
Zubair O. Alanamu
Department of Computer Science, Kwara State University, Malete, Kwara State, Nigeria
Shakirat R. Yusuff

Corresponding author

Correspondence to Abdullateef O. Balogun.

Editor information

Editors and Affiliations

University of Perugia, Perugia, Italy
Osvaldo Gervasi
University of Basilicata, Potenza, Potenza, Italy
Beniamino Murgante
Østfold University College, Halden, Norway
Sanjay Misra
University of Minho, Braga, Portugal
Ana Maria A. C. Rocha
University of Cagliari, Cagliari, Italy
Chiara Garau

About this paper

Cite this paper

Balogun, A.O. et al. (2022). Empirical Analysis of Data Sampling-Based Ensemble Methods in Software Defect Prediction. In: Gervasi, O., Murgante, B., Misra, S., Rocha, A.M.A.C., Garau, C. (eds) Computational Science and Its Applications – ICCSA 2022 Workshops. ICCSA 2022. Lecture Notes in Computer Science, vol 13381. Springer, Cham. https://doi.org/10.1007/978-3-031-10548-7_27

DOI: https://doi.org/10.1007/978-3-031-10548-7_27
Published: 26 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10547-0
Online ISBN: 978-3-031-10548-7
eBook Packages: Computer Science, Computer Science (R0)
