
DBOS_US: a density-based graph under-sampling method to handle class imbalance and class overlap issues in software fault prediction

Published in: The Journal of Supercomputing

Abstract

Improving software quality by predicting faults during the early stages of software development is a primary goal of software fault prediction (SFP). Various machine learning models help to predict software faults. However, the imbalanced class distribution in the datasets may challenge some traditional learning approaches as they are more biased toward the majority class. The existence of class overlapping makes the prediction difficult owing to learning the minority class inaccurately. In addition to that, high data dimensionality makes the classification process complex and time-consuming. To enhance the performance of the classifier, handling these data quality issues is a big concern. This paper proposes a hybrid density-based method DBOS_US to address the class imbalance, noise, and class overlap in SFP. Initially, the density-based overlap removal (DBO) clustering algorithm is proposed to filter noisy and overlapped instances. Then, a graph-based algorithm, ShapeGraph, is adapted to handle imbalanced classes. The objective of the proposed method DBOS_US is to improve the performance of the traditional SFP classifiers. The experiments are conducted on 11 benchmark datasets from the PROMISE repository using six machine learning models (SVM, DT, KNN, NB, RF, and boosting). The experimental findings and statistical analysis revealed that the proposed method outperforms seven state-of-the-art techniques in terms of Area Under the Curve (AUC), G-mean, Recall (PD), and Probability of False alarms (PF). The proposed method improves the average values of G-mean, Recall, and AUC by at least 2.5%, 8.8%, and 1.2%, respectively.
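For reference, the evaluation measures reported above (Recall/PD, PF, and G-mean) follow standard confusion-matrix definitions; the following is a minimal Python sketch of those definitions, not the authors' implementation:

```python
import math

def fault_prediction_metrics(y_true, y_pred):
    """Recall (PD), probability of false alarm (PF) and G-mean for
    binary labels (1 = faulty module, 0 = non-faulty module)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pd = tp / (tp + fn)                # probability of detection (Recall)
    pf = fp / (fp + tn)                # probability of false alarm
    g_mean = math.sqrt(pd * (1 - pf))  # balances detection vs. false alarms
    return pd, pf, g_mean
```

A good sampling method raises PD and G-mean while keeping PF low, which is why all three are reported alongside AUC.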


(Figures 1–9 and Algorithms 1–3 appear in the full text of the article.)


Data availability

Datasets can be downloaded from the online public PROMISE repository (http://promise.site.uottawa.ca/SERepository/datasets-page.html), and the Python code will be made available on request.
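The PROMISE datasets are distributed as ARFF files. A minimal, hypothetical loader is sketched below; the column layout and the "true"/"false" fault label in the last column are assumptions based on the repository's usual format:

```python
import csv

def load_promise_arff(text):
    """Parse a PROMISE-style ARFF file: skip the header, read the @data
    section as CSV, and treat the last column as the fault label."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.strip().lower() == "@data") + 1
    rows = [r for r in csv.reader(lines[start:]) if r]
    X = [[float(v) for v in r[:-1]] for r in rows]
    y = [1 if r[-1].strip().lower() in ("true", "yes") else 0 for r in rows]
    return X, y

# Tiny inline sample with two software metrics per module.
sample = """@relation sample
@attribute loc numeric
@attribute complexity numeric
@attribute defects {false,true}
@data
10,2,false
120,14,true
"""
X, y = load_promise_arff(sample)
```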


Funding

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests.

Author information


Contributions

K.B. proposed and implemented the technique; K.K. downloaded the datasets and helped resolve coding errors. K.B. wrote the main text; A.L.S. drew the figures and proofread the manuscript. Finally, the other two authors also reviewed the manuscript.

Corresponding author

Correspondence to Kirti Bhandari.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

This section is divided into three parts. A1 describes FLDA. A2 reports the classification results of the baseline, DBOS_US, and its constituents in terms of AUC, G-mean, PF, and Recall (Tables 4, 5, 6 and 7, respectively). A3 compares the average performance of the proposed method on each dataset with other state-of-the-art methods in terms of AUC, G-mean, PF, and Recall (Tables 8, 9, 10 and 11).

Appendix A1: FLDA

Consider n training sample vectors \(\left\{ t_{i} \right\}_{i = 1}^{n}\) drawn from m classes \(C_{1}, C_{2}, \ldots, C_{m}\), with \(n_{j}\) samples belonging to the jth class, i.e., \(n = \sum\nolimits_{j = 1}^{m} n_{j}\). Let µ be the mean of all training samples, i.e., \(\mu = (1/n)\sum\nolimits_{i = 1}^{n} t_{i}\), and let µj be the mean of the jth class, i.e., \(\mu_{j} = \left( 1/n_{j} \right)\sum\nolimits_{t_{i} \in C_{j}} t_{i}\). Then, the within-class scatter matrix Sw and the between-class scatter matrix SB are stated, respectively, as

$$S_{{\text{w}}} = \mathop \sum \limits_{j = 1}^{m} \mathop \sum \limits_{{t_{i} \in C_{j} }} \left( {t_{i} - \mu_{j} } \right) \left( {t_{i} - \mu_{j} } \right)^{T}$$
(10)
$$S_{{\text{B}}} = \mathop \sum \limits_{j = 1}^{m} n_{j} \left( {\mu_{j} - \mu } \right) \left( {\mu_{j} - \mu } \right)^{T}$$
(11)

The objective is to determine a transform vector v that maximizes the Rayleigh quotient \(q = \frac{{v^{T} S_{{\text{B}}} v}}{{v^{T} S_{{\text{w}}} v}}\), where v can be evaluated by solving the generalized eigenvalue problem \(S_{{\text{B}}} v = \lambda S_{{\text{w}}} v\), with λ a generalized eigenvalue. Since the rank of SB is at most m − 1, there are at most m − 1 eigenvectors associated with nonzero eigenvalues. The m classes are expected to be clearly distinguished in this low-dimensional space.
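Under these definitions, the FLDA transform can be sketched in a few lines of NumPy. This is an illustrative reconstruction of Eqs. (10)–(11), not the authors' code; the pseudo-inverse is used so that a singular Sw does not break the eigendecomposition:

```python
import numpy as np

def flda(X, y, n_components=None):
    """Project samples onto the directions maximizing the Rayleigh
    quotient v^T S_B v / v^T S_w v (Fisher linear discriminant)."""
    y = np.asarray(y)
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)               # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    # Generalized eigenproblem S_B v = lambda S_w v; at most m - 1
    # nonzero eigenvalues, so keep the leading m - 1 eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    k = n_components or len(classes) - 1
    W = eigvecs.real[:, order[:k]]
    return X @ W
```

For a binary fault/non-fault problem (m = 2), the data are reduced to a single discriminant direction, matching the dimensionality-reduction stage described above.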

Algorithm 4: FLDA (dimensionality reduction stage)

Appendix A2

See Tables 4, 5, 6 and 7.

Table 4 DBOS_US and its constituents' classification results in terms of AUC
Table 5 DBOS_US and its constituents' classification results in terms of G-mean
Table 6 DBOS_US and its constituents' classification results in terms of PF
Table 7 DBOS_US and its constituents' classification results in terms of Recall

Appendix A3

See Tables 8, 9, 10 and 11.

Table 8 Comparison of the proposed method's average performance on each dataset with other state-of-the-art methods in terms of AUC
Table 9 Comparison of the proposed method's average performance on each dataset with other state-of-the-art methods in terms of G-mean
Table 10 Comparison of the proposed method's average performance on each dataset with other state-of-the-art methods in terms of PF
Table 11 Comparison of the proposed method's average performance on each dataset with other state-of-the-art methods in terms of Recall

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bhandari, K., Kumar, K. & Sangal, A.L. DBOS_US: a density-based graph under-sampling method to handle class imbalance and class overlap issues in software fault prediction. J Supercomput 80, 22682–22725 (2024). https://doi.org/10.1007/s11227-024-06312-5
