Abstract
Improving software quality by predicting faults in the early stages of software development is the primary goal of software fault prediction (SFP). Various machine learning models can predict software faults; however, the imbalanced class distribution in SFP datasets challenges traditional learning approaches, which are biased toward the majority class. Class overlap further complicates prediction because the minority class is learned inaccurately, and high data dimensionality makes the classification process complex and time-consuming. Handling these data quality issues is therefore essential for improving classifier performance. This paper proposes a hybrid density-based method, DBOS_US, to address class imbalance, noise, and class overlap in SFP. First, a density-based overlap removal (DBO) clustering algorithm is proposed to filter noisy and overlapped instances. Then, a graph-based algorithm, ShapeGraph, is adapted to handle imbalanced classes. The objective of DBOS_US is to improve the performance of traditional SFP classifiers. Experiments are conducted on 11 benchmark datasets from the PROMISE repository using six machine learning models (SVM, DT, KNN, NB, RF, and boosting). The experimental findings and statistical analysis reveal that the proposed method outperforms seven state-of-the-art techniques in terms of Area Under the Curve (AUC), G-mean, Recall (PD), and Probability of False alarm (PF), improving the average G-mean, Recall, and AUC by at least 2.5%, 8.8%, and 1.2%, respectively.
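The DBO and ShapeGraph algorithms themselves are not reproduced in this excerpt. As a rough illustration of the two-stage idea only (density-based filtering of noisy and overlapped instances, followed by under-sampling of the majority class), the sketch below substitutes DBSCAN for DBO and plain random under-sampling for ShapeGraph; the function name, its parameters, and the cluster-purity threshold are all illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def density_overlap_undersample(X, y, eps=0.5, min_samples=5, rng=None):
    """Illustrative two-stage cleaning (NOT the paper's DBOS_US):
    (1) drop majority instances that DBSCAN labels as noise or that fall
    in clusters dominated by the minority class (overlap regions), then
    (2) randomly under-sample the majority class to the minority size.
    Assumes binary labels in {0, 1}."""
    rng = np.random.default_rng(rng)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    maj, mino = (0, 1) if (y == 0).sum() >= (y == 1).sum() else (1, 0)

    keep = np.ones(len(y), dtype=bool)
    for c in np.unique(labels):
        idx = labels == c
        if c == -1:
            # DBSCAN noise: discard noisy majority points
            keep &= ~(idx & (y == maj))
        elif (y[idx] == mino).mean() > 0.5:
            # minority-dominated cluster: treat majority points as overlap
            keep &= ~(idx & (y == maj))

    X2, y2 = X[keep], y[keep]
    maj_idx = np.flatnonzero(y2 == maj)
    min_idx = np.flatnonzero(y2 == mino)
    sel = rng.choice(maj_idx, size=min(len(maj_idx), len(min_idx)),
                     replace=False)
    final = np.sort(np.concatenate([sel, min_idx]))
    return X2[final], y2[final]
```

After such cleaning, the balanced sample can be fed to any of the six classifiers used in the experiments.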












Data availability
Datasets can be downloaded from the online public PROMISE repository (http://promise.site.uottawa.ca/SERepository/datasets-page.html), and the Python code will be made available on request.
Funding
Author information
Authors and Affiliations
Contributions
K.B. proposed and implemented the technique; K.K. downloaded the datasets and helped resolve coding errors. K.B. wrote the main text, and A.L.S. drew the figures and proofread the manuscript. The other two authors also reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
This section is divided into three parts. A1 describes FLDA. A2 presents the classification results of the baseline, DBOS_US, and its constituents in terms of AUC, G-mean, PF, and Recall (Tables 4, 5, 6, and 7, respectively). A3 compares the average performance of the proposed method on each dataset against other state-of-the-art methods in terms of AUC, G-mean, PF, and Recall (Tables 8, 9, 10, and 11).
Appendix A1-FLDA
Consider n training sample vectors \(\left\{ {t_{i} } \right\}_{i = 1}^{n}\) drawn from m classes C1, C2, …, Cm, with nj samples in the jth class, i.e., \(n = \sum\nolimits_{j = 1}^{m} {n_{j} }\). Let µ be the mean of all training samples, i.e., \(\mu = (1/n)\sum\nolimits_{i = 1}^{n} {t_{i} }\), and µj be the mean of the jth class, i.e., \(\mu_{j} = \left( {1/n_{j} } \right)\sum\nolimits_{{t_{i} \in C_{j} }} {t_{i} }\). Then, the within-class scatter matrix Sw and the between-class scatter matrix SB are defined, respectively, as \(S_{w} = \sum\nolimits_{j = 1}^{m} {\sum\nolimits_{{t_{i} \in C_{j} }} {\left( {t_{i} - \mu_{j} } \right)\left( {t_{i} - \mu_{j} } \right)^{T} } }\) and \(S_{B} = \sum\nolimits_{j = 1}^{m} {n_{j} \left( {\mu_{j} - \mu } \right)\left( {\mu_{j} - \mu } \right)^{T} }\).
The objective is to determine a transformation vector v that maximizes the Rayleigh quotient \(q = \frac{{v^{T} S_{B} v}}{{v^{T} S_{w} v}}\), where v is obtained by solving the generalized eigenvalue problem \(S_{B} v = \lambda S_{w} v\), with λ a generalized eigenvalue. Since the rank of SB is m−1, there are m−1 eigenvectors associated with nonzero eigenvalues. The m classes are expected to be clearly separated in this low-dimensional space.
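The construction above can be sketched in a few lines of Python. This is a minimal illustration of standard FLDA, not the paper's implementation: it builds Sw and SB as defined above and solves the generalized eigenproblem with SciPy; the small ridge added to Sw (to keep it invertible) is an assumption of this sketch.

```python
import numpy as np
from scipy.linalg import eigh


def flda(X, y, k=None):
    """Fisher Linear Discriminant Analysis: project X onto the top
    eigenvectors of the generalized problem S_B v = lambda S_w v."""
    classes = np.unique(y)
    m = len(classes)
    n, d = X.shape
    mu = X.mean(axis=0)

    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_w += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mu).reshape(-1, 1)
        S_b += len(Xc) * (diff @ diff.T)

    # small ridge so S_w is positive definite (sketch assumption)
    S_w += 1e-6 * np.eye(d)

    # symmetric-definite generalized eigenproblem S_b v = lambda S_w v
    vals, vecs = eigh(S_b, S_w)
    order = np.argsort(vals)[::-1]
    k = k or (m - 1)  # at most m-1 nonzero eigenvalues, as noted above
    return X @ vecs[:, order[:k]]
```

With m = 2 classes the projection is one-dimensional, matching the m−1 bound on the rank of SB.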
Appendix A2
Appendix A3
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bhandari, K., Kumar, K. & Sangal, A.L. DBOS_US: a density-based graph under-sampling method to handle class imbalance and class overlap issues in software fault prediction. J Supercomput 80, 22682–22725 (2024). https://doi.org/10.1007/s11227-024-06312-5