Abstract
This study investigates the use of data sampling and ensemble techniques to alleviate the class imbalance problem in software defect prediction (SDP). Specifically, it examines the effect of data sampling techniques on the performance of ensemble methods. The experiments were conducted using software defect datasets from the NASA software archives. Five data sampling methods, comprising three over-sampling techniques (SMOTE, ADASYN, and ROS) and two under-sampling techniques (RUS and NearMiss), were combined with bagging and boosting ensemble methods based on Naïve Bayes (NB) and Decision Tree (DT) classifiers. The predictive performance of the developed models was assessed using the area under the curve (AUC) and the Matthews correlation coefficient (MCC). The experimental findings show that applying data sampling methods further enhanced the predictive performance of the ensemble methods studied. Specifically, BoostedDT on the ROS-balanced datasets recorded the highest average AUC (0.995) and MCC (0.918) values. Apart from the NearMiss method, which worked best with the Bagging ensemble method, the studied data sampling methods worked well with the Boosting ensemble technique. In addition, some of the developed models, particularly BoostedDT, showed better prediction performance than existing SDP models. As a result, combining data sampling techniques with ensemble methods may not only improve SDP model prediction performance but also provide a plausible solution to the latent class imbalance issue in SDP processes.
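To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of random over-sampling (ROS) and of the MCC computed from a binary confusion matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names `random_over_sample` and `mcc`, the `seed` parameter, and the toy data are all hypothetical, and the abstract does not say which toolkit was used for the experiments.

```python
import random
from collections import Counter
from math import sqrt


def random_over_sample(X, y, seed=0):
    """Hypothetical helper: duplicate randomly chosen rows of each smaller
    class until every class is as large as the largest one (ROS)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Rows of this class in the original (unsampled) data.
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def mcc(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for binary labels.
    Returns 0.0 when the denominator is zero (a common convention)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data (assumed): 8 non-defective modules (0) and 2 defective ones (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_over_sample(X, y)
print(sorted(Counter(y_bal).items()))
```

In an actual evaluation, a sampler like this would normally be applied to the training split only, so that duplicated rows cannot leak into the test data; the balanced training set would then be passed to a bagging or boosting ensemble and the predictions scored with AUC and MCC.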
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Balogun, A.O. et al. (2022). Empirical Analysis of Data Sampling-Based Ensemble Methods in Software Defect Prediction. In: Gervasi, O., Murgante, B., Misra, S., Rocha, A.M.A.C., Garau, C. (eds) Computational Science and Its Applications – ICCSA 2022 Workshops. ICCSA 2022. Lecture Notes in Computer Science, vol 13381. Springer, Cham. https://doi.org/10.1007/978-3-031-10548-7_27
DOI: https://doi.org/10.1007/978-3-031-10548-7_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10547-0
Online ISBN: 978-3-031-10548-7
eBook Packages: Computer Science (R0)