Abstract
Regression methods such as linear regression and decision tree regression have previously been used to build software fault prediction (SFP) models. However, these methods showed limited SFP performance with high misclassification errors. Earlier works did not adequately address issues such as multicollinearity, feature scaling, and the imbalanced distribution of faulty and non-faulty modules in the dataset, which may be a potential cause of the poor prediction performance of these regression methods. Motivated by this, in this paper we investigate the impact of 15 different regression methods on fault count prediction in software systems and report their interpretation for fault models. We consider different fault data quality issues and present a comprehensive assessment of how the regression methods handle them. We believe that many of the used regression methods have not been explored before for SFP under different data quality issues. In the presented study, 44 fault datasets and their versions, collected from the PROMISE software data repository, are used to validate the performance of the regression methods, and absolute relative error (ARE), root mean square error (RMSE), and fault-percentile-average (FPA) are used as the performance measures. For model building, five different scenarios are considered: (1) original dataset without preprocessing; (2) standardized processed dataset; (3) balanced dataset; (4) non-multicollinearity processed dataset; (5) balanced + non-multicollinearity processed dataset. Experimental results showed that, overall, kernel-based regression methods, KernelRidge and SVR (support vector regression, with both linear and nonlinear kernels), yielded the best performance for predicting fault counts compared to the other methods.
Other regression methods, in particular NNR (nearest neighbor regression), RFR (random forest regression), and GBR (gradient boosting regression), also performed with significant accuracy. Further, the results showed that applying standardization and handling multicollinearity in the fault dataset helped improve the regression methods' performance. It is concluded that regression methods are promising for building software fault prediction models.
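The paper's own implementation is not reproduced here; the following is a minimal sketch, assuming scikit-learn, of the kind of evaluation the abstract describes: fitting KernelRidge, SVR (linear and RBF kernels), and random forest regressors on standardized module metrics (scenario 2) and scoring predicted fault counts by RMSE. The synthetic data merely stands in for a PROMISE fault dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a fault dataset: rows are software modules,
# columns are code metrics, and the target is the module's fault count.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
y = np.rint(np.abs(y) / 50.0)  # coerce targets into non-negative counts

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "KernelRidge": KernelRidge(kernel="rbf"),
    "SVR-linear": SVR(kernel="linear"),
    "SVR-rbf": SVR(kernel="rbf"),
    "RFR": RandomForestRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    # Scenario (2): standardize the metrics before fitting each regressor.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_tr, y_tr)
    pred = pipe.predict(X_te)
    results[name] = mean_squared_error(y_te, pred) ** 0.5  # RMSE

for name, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {rmse:.3f}")
```

The other scenarios from the abstract would slot into the same loop, e.g. oversampling the training split for scenario (3) or dropping highly correlated metrics for scenario (4).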
Notes
A software module can be a file in the case of traditional software or a class in the case of object-oriented software.
Fault count and the number of faults in a software module are used interchangeably.
SK-ESD R-package https://github.com/klainfo/ScottKnottESD.
Acknowledgements
We are thankful to the editor and the anonymous reviewers for their valuable comments, which helped improve the paper.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
This article does not contain any studies with human participants.
Funding
The authors declare that all funding sources have been mentioned in the article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
The detailed results of all used regression methods in terms of the ARE, RMSE, and FPA measures are contained in the online Appendix (https://sites.google.com/site/santoshiiitmdj/projects/empirical-analysis-of-regression-methods-for-the-software-fault-prediction?authuser=0). It contains the results for all five model-building scenarios considered in the presented work.
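Of the three measures, FPA is the least standard. For reference, a sketch of its computation following the usual definition (the average, over all cutoffs m, of the proportion of actual faults contained in the m modules with the highest predicted fault counts); the function name and NumPy formulation below are illustrative, not the paper's code:

```python
import numpy as np

def fpa(y_true, y_pred):
    """Fault-percentile-average of a predicted fault-count ranking.

    Modules are sorted in ascending order of predicted fault count; for
    each m = 1..K, the proportion of actual faults in the top-m modules
    is taken, and FPA is the mean of these proportions. Higher is better.
    """
    order = np.argsort(y_pred)                  # ascending by prediction
    n = np.asarray(y_true, dtype=float)[order]  # actual faults, reordered
    top_m_faults = np.cumsum(n[::-1])           # faults in top 1, 2, ..., K
    return top_m_faults.sum() / (len(n) * n.sum())

actual = np.array([0, 1, 2, 3])
print(fpa(actual, actual))   # perfect ranking: highest FPA for this data
print(fpa(actual, -actual))  # inverted ranking scores strictly lower
```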
About this article
Cite this article
Rathore, S.S. An exploratory analysis of regression methods for predicting faults in software systems. Soft Comput 25, 14841–14872 (2021). https://doi.org/10.1007/s00500-021-06048-x