
Evaluation of Sampling-Based Ensembles of Classifiers on Imbalanced Data for Software Defect Prediction Problems

  • Original Research
  • Published in: SN Computer Science

Abstract

Defect prediction in software projects plays a crucial role in reducing quality-related risk and increasing the ability to detect faulty program modules. Hence, classification approaches that anticipate software defect proneness from static code attributes have attracted a great deal of attention in recent years. While several studies show that a single classifier can become a performance bottleneck, ensembles of classifiers can effectively enhance classification performance over a single classifier. However, the class imbalance typical of software defect data severely hinders the classification effectiveness of ensemble learning. To cope with this problem, resampling methods are usually incorporated into ensemble models. This paper empirically assesses the importance of sampling for ensembles of various classifiers on imbalanced data in software defect prediction. Extensive experiments combining seven classification algorithms, three sampling methods, and two balanced-data learning schemes were conducted over ten datasets. The empirical results indicate that combining sampling techniques with ensemble learning improves defect prediction performance on datasets with imbalanced class distributions.
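To make the evaluated setup concrete, the sketch below shows one way to combine an oversampling method with a heterogeneous voting ensemble and score it with an imbalance-aware metric. It is a minimal illustration, not the authors' exact experimental environment: the use of scikit-learn and imbalanced-learn, the synthetic dataset, and the particular choice of classifiers and of SMOTE are all assumptions made for the example.

```python
# Illustrative sketch (assumed libraries: scikit-learn, imbalanced-learn):
# oversampling combined with a heterogeneous voting ensemble, evaluated
# with AUC under cross-validation on an imbalanced binary dataset.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real defect dataset: roughly 10% of the
# modules are labelled defective, mimicking the class imbalance
# typical of software defect data.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# Heterogeneous ensemble: a soft (probability-averaged) vote over
# learners from different classifier families.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=42))],
    voting="soft")

# SMOTE sits inside the pipeline, so oversampling is fitted on each
# training fold only and never leaks synthetic minority samples into
# the evaluation fold.
model = Pipeline([("smote", SMOTE(random_state=42)),
                  ("ensemble", ensemble)])

# AUC is threshold-independent and far more informative than plain
# accuracy when the class distribution is skewed.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Mean AUC over 5 folds: {scores.mean():.3f}")
```

Placing the resampler inside the pipeline, rather than resampling the whole dataset up front, is what keeps the cross-validation estimate honest: each test fold retains the original imbalanced distribution that a deployed defect predictor would face.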



Acknowledgements

This research was supported by the Ministry of Education and Training of Vietnam under Grant Number B2019-DNA-03.

Author information


Corresponding author

Correspondence to Thanh Tung Khuat.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Khuat, T.T., Le, M.H. Evaluation of Sampling-Based Ensembles of Classifiers on Imbalanced Data for Software Defect Prediction Problems. SN COMPUT. SCI. 1, 108 (2020). https://doi.org/10.1007/s42979-020-0119-4

