Class Balancing Approaches to Improve for Software Defect Prediction Estimations: A Comparative Study

Sánchez-García, Ángel J.; Limón, Xavier; Domínguez-Isidro, Saúl; Olvera-Villeda, Dan Javier; Pérez-Arriaga, Juan Carlos

doi:10.1134/S036176882470066X

Published: 12 January 2025

Volume 50, pages 621–647, (2024)

Programming and Computer Software

Abstract

Addressing software defects is an ongoing challenge in software development, and effectively managing and resolving defects is vital for ensuring software reliability, which is in turn a crucial quality attribute of any software system. Software defect prediction supported by machine learning (ML) methods offers a promising approach to address the problem of software defects. However, one common challenge in ML-based software defect prediction is the issue of data imbalance. In this paper, we present an empirical study aimed at assessing the impact of various class balancing methods on the issue of class imbalance in software defect prediction. We conducted a set of experiments that involved nine distinct class balancing methods across seven different classifiers. We used datasets from the PROMISE repository, provided by the NASA software project. We also employed various metrics including AUC, Accuracy, Precision, Recall, and the F1 measure to gauge the effectiveness of the different class balancing methods. Furthermore, we applied hypothesis testing to determine any significant differences in metric results between datasets with balanced and unbalanced classes. Based on our findings, we conclude that balancing the classes in software defect prediction yields significant improvements in overall performance. Therefore, we strongly advocate for the inclusion of class balancing as a preprocessing step in this domain.
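
The imbalance problem described in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental code: random oversampling is used here only as a simple stand-in for a class balancing method (the abstract does not name the nine methods studied), and the toy data, `random_oversample`, and `binary_metrics` are hypothetical names introduced for illustration. The sketch shows why Accuracy alone can mislead on imbalanced data and what a balanced training set looks like.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class is as large as the majority class (illustrative stand-in only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, Precision, Recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy data: 95 non-defective modules (0) and 5 defective (1).
labels = [0] * 95 + [1] * 5
rows = [[float(i)] for i in range(100)]

# A predictor that always answers "non-defective" scores high Accuracy
# but detects no defects at all, so Recall and F1 are zero.
baseline = binary_metrics(labels, [0] * len(labels))

# After oversampling, both classes have the same number of instances.
balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
```

In practice, balancing is normally applied to the training split only, so that evaluation metrics are still computed on data with the original class distribution.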

Funding

This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.

Author information

Authors and Affiliations

Facultad de Estadística e Informática, Universidad Veracruzana, Veracruz, Mexico
Ángel J. Sánchez-García, Xavier Limón, Saúl Domínguez-Isidro, Dan Javier Olvera-Villeda & Juan Carlos Pérez-Arriaga

Corresponding authors

Correspondence to Ángel J. Sánchez-García, Xavier Limón, Saúl Domínguez-Isidro, Dan Javier Olvera-Villeda or Juan Carlos Pérez-Arriaga.

Ethics declarations

The authors of this work declare that they have no conflicts of interest.

Additional information

Publisher’s Note.

Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

AI tools may have been used in the translation or editing of this article.

About this article

Cite this article

Sánchez-García, Á.J., Limón, X., Domínguez-Isidro, S. et al. Class Balancing Approaches to Improve for Software Defect Prediction Estimations: A Comparative Study. Program Comput Soft 50, 621–647 (2024). https://doi.org/10.1134/S036176882470066X

Received: 08 May 2024
Revised: 17 August 2024
Accepted: 12 September 2024
Published: 12 January 2025
Issue Date: December 2024
DOI: https://doi.org/10.1134/S036176882470066X
