Abstract
Improving software quality by predicting faults in the early stages of software development is the primary goal of software fault prediction (SFP). Various machine learning models can predict software faults; however, the imbalanced class distribution in SFP datasets challenges traditional learning approaches, which are biased toward the majority class. Class overlap further complicates prediction because the minority class is learned inaccurately, and high data dimensionality makes the classification process complex and time-consuming. Handling these data quality issues is therefore essential for improving classifier performance. This paper proposes a hybrid density-based method, DBOS_US, to address class imbalance, noise, and class overlap in SFP. First, a density-based overlap removal (DBO) clustering algorithm is proposed to filter noisy and overlapped instances. Then, a graph-based algorithm, ShapeGraph, is adapted to handle imbalanced classes. The objective of DBOS_US is to improve the performance of traditional SFP classifiers. Experiments are conducted on 11 benchmark datasets from the PROMISE repository using six machine learning models (SVM, DT, KNN, NB, RF, and boosting). The experimental findings and statistical analysis reveal that the proposed method outperforms seven state-of-the-art techniques in terms of Area Under the Curve (AUC), G-mean, Recall (PD), and Probability of False alarm (PF), improving the average G-mean, Recall, and AUC by at least 2.5%, 8.8%, and 1.2%, respectively.
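The DBO and ShapeGraph algorithms themselves are not reproduced in this excerpt. As a rough illustration of the two-stage idea only (density-based filtering of noisy and overlapped instances, followed by under-sampling of the majority class), the sketch below substitutes DBSCAN for DBO and plain random under-sampling for ShapeGraph; the function name, its parameters, and the cluster-purity threshold are all illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def density_overlap_undersample(X, y, eps=0.5, min_samples=5, rng=None):
    """Illustrative two-stage cleaning (NOT the paper's DBOS_US):
    (1) drop majority instances that DBSCAN labels as noise or that fall
    in clusters dominated by the minority class (overlap regions), then
    (2) randomly under-sample the majority class to the minority size.
    Assumes binary labels in {0, 1}."""
    rng = np.random.default_rng(rng)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    maj, mino = (0, 1) if (y == 0).sum() >= (y == 1).sum() else (1, 0)

    keep = np.ones(len(y), dtype=bool)
    for c in np.unique(labels):
        idx = labels == c
        if c == -1:
            # DBSCAN noise: discard noisy majority points
            keep &= ~(idx & (y == maj))
        elif (y[idx] == mino).mean() > 0.5:
            # minority-dominated cluster: treat majority points as overlap
            keep &= ~(idx & (y == maj))

    X2, y2 = X[keep], y[keep]
    maj_idx = np.flatnonzero(y2 == maj)
    min_idx = np.flatnonzero(y2 == mino)
    sel = rng.choice(maj_idx, size=min(len(maj_idx), len(min_idx)),
                     replace=False)
    final = np.sort(np.concatenate([sel, min_idx]))
    return X2[final], y2[final]
```

After such cleaning, the balanced sample can be fed to any of the six classifiers used in the experiments.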












Data availability
Datasets can be downloaded from the online public PROMISE repository (http://promise.site.uottawa.ca/SERepository/datasets-page.html), and the Python code will be made available on request.
Funding
Author information
Authors and Affiliations
Contributions
K.B. proposed and implemented the technique; K.K. downloaded the datasets and helped resolve coding errors. K.B. wrote the main text, and A.L.S. drew the figures and proofread the manuscript. The other two authors also reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
This section is divided into three parts. A1 describes FLDA. A2 presents the classification results of the baseline, DBOS_US, and its constituents in terms of AUC, G-mean, PF, and Recall (Tables 4, 5, 6, and 7, respectively). A3 compares the average performance of the proposed method on each dataset against other state-of-the-art methods in terms of AUC, G-mean, PF, and Recall (Tables 8, 9, 10, and 11).
Appendix A1-FLDA
Consider n training sample vectors \(\left\{ {t_{i} } \right\}_{i = 1}^{n}\) drawn from m classes C1, C2, …, Cm, with nj samples in the jth class, i.e., \(n = \sum\nolimits_{j = 1}^{m} {n_{j} }\). Let µ be the mean of all training samples, i.e., \(\mu = (1/n)\sum\nolimits_{i = 1}^{n} {t_{i} }\), and µj be the mean of the jth class, i.e., \(\mu_{j} = \left( {1/n_{j} } \right)\sum\nolimits_{{t_{i} \in C_{j} }} {t_{i} }\). Then, the within-class scatter matrix Sw and the between-class scatter matrix SB are defined, respectively, as \(S_{w} = \sum\nolimits_{j = 1}^{m} {\sum\nolimits_{{t_{i} \in C_{j} }} {\left( {t_{i} - \mu_{j} } \right)\left( {t_{i} - \mu_{j} } \right)^{T} } }\) and \(S_{B} = \sum\nolimits_{j = 1}^{m} {n_{j} \left( {\mu_{j} - \mu } \right)\left( {\mu_{j} - \mu } \right)^{T} }\).
The objective is to determine a transformation vector v that maximizes the Rayleigh quotient \(q = \frac{{v^{T} S_{B} v}}{{v^{T} S_{w} v}}\), where v is obtained by solving the generalized eigenvalue problem \(S_{B} v = \lambda S_{w} v\), with λ a generalized eigenvalue. Since the rank of SB is m−1, there are m−1 eigenvectors associated with nonzero eigenvalues. The m classes are expected to be clearly separated in this low-dimensional space.
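The construction above can be sketched in a few lines of Python. This is a minimal illustration of standard FLDA, not the paper's implementation: it builds Sw and SB as defined above and solves the generalized eigenproblem with SciPy; the small ridge added to Sw (to keep it invertible) is an assumption of this sketch.

```python
import numpy as np
from scipy.linalg import eigh


def flda(X, y, k=None):
    """Fisher Linear Discriminant Analysis: project X onto the top
    eigenvectors of the generalized problem S_B v = lambda S_w v."""
    classes = np.unique(y)
    m = len(classes)
    n, d = X.shape
    mu = X.mean(axis=0)

    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_w += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mu).reshape(-1, 1)
        S_b += len(Xc) * (diff @ diff.T)

    # small ridge so S_w is positive definite (sketch assumption)
    S_w += 1e-6 * np.eye(d)

    # symmetric-definite generalized eigenproblem S_b v = lambda S_w v
    vals, vecs = eigh(S_b, S_w)
    order = np.argsort(vals)[::-1]
    k = k or (m - 1)  # at most m-1 nonzero eigenvalues, as noted above
    return X @ vecs[:, order[:k]]
```

With m = 2 classes the projection is one-dimensional, matching the m−1 bound on the rank of SB.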
Appendix A2
Appendix A3
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bhandari, K., Kumar, K. & Sangal, A.L. DBOS_US: a density-based graph under-sampling method to handle class imbalance and class overlap issues in software fault prediction. J Supercomput 80, 22682–22725 (2024). https://doi.org/10.1007/s11227-024-06312-5