An exploratory analysis of regression methods for predicting faults in software systems

  • Application of soft computing
  • Soft Computing

Abstract

Regression methods such as linear regression and decision tree regression have been used earlier to build software fault prediction (SFP) models. However, these methods showed limited SFP performance with high misclassification errors. Previous works have not adequately considered issues such as multicollinearity, feature scaling, and the imbalanced distribution of faulty and non-faulty modules in the dataset, which might be a cause of the poor prediction performance of these regression methods. Motivated by this, in this paper we investigate 15 different regression methods for fault count prediction in software systems and report their interpretation for fault models. We consider different fault data quality issues and present a comprehensive assessment of how the regression methods handle them. We believe that many of the regression methods used have not previously been explored for SFP while considering different data quality issues. In the presented study, 44 fault datasets and their versions, collected from the PROMISE software data repository, are used to validate the performance of the regression methods, and absolute relative error (ARE), root mean square error (RMSE), and fault-percentile-average (FPA) are used as the performance measures. For model building, five scenarios are considered: (1) original dataset without preprocessing; (2) standardized processed dataset; (3) balanced dataset; (4) non-multicollinearity processed dataset; (5) balanced+non-multicollinearity processed dataset. Experimental results showed that, overall, the kernel-based regression methods KernelRidge and SVR (support vector regression, with both linear and nonlinear kernels) yielded the best performance for predicting fault counts compared to the other methods. Other regression methods, in particular NNR (nearest neighbor regression), RFR (random forest regression), and GBR (gradient boosting regression), also performed with significant accuracy. Further, the results showed that applying standardization and handling multicollinearity in the fault dataset helped improve the regression methods’ performance. It is concluded that regression methods are promising for building software fault prediction models.
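
The abstract names the three performance measures without defining them. The following is a minimal sketch of how they are commonly computed in the fault count prediction literature, not the paper's own implementation. The function names are hypothetical, the "+1" in the ARE denominator (to guard against modules with zero faults) is an assumption, and the FPA formulation follows the usual ranking-based definition, in which modules are ordered by predicted fault count and the share of actual faults captured by the top m modules is averaged over all m.

```python
import math


def average_relative_error(actual, predicted):
    """Mean of |predicted - actual| / (actual + 1).

    The +1 in the denominator is an assumption that avoids division
    by zero for modules with no faults.
    """
    return sum(abs(p - a) / (a + 1) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """Square root of the mean squared difference between prediction and truth."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fault_percentile_average(actual, predicted):
    """Average, over m = 1..K, of the share of actual faults in the top-m predicted modules."""
    total_faults = sum(actual)
    if total_faults == 0:
        raise ValueError("FPA is undefined when the dataset contains no faults")
    k = len(actual)
    # Rank modules by increasing predicted fault count.
    order = sorted(range(k), key=lambda i: predicted[i])
    ranked_actual = [actual[i] for i in order]
    running = 0
    score = 0.0
    # Walk from the module predicted most faulty downwards, accumulating
    # the actual faults captured by the top-m modules.
    for faults in reversed(ranked_actual):
        running += faults
        score += running / total_faults
    return score / k


if __name__ == "__main__":
    actual = [0, 1, 3]           # true fault counts per module (toy data)
    predicted = [0.5, 1.0, 2.0]  # model output per module (toy data)
    print(average_relative_error(actual, predicted))
    print(root_mean_square_error(actual, predicted))
    print(fault_percentile_average(actual, predicted))
```

Lower ARE and RMSE values indicate smaller prediction errors, whereas a higher FPA indicates that the modules ranked as most fault-prone do contain most of the actual faults.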

Notes

  1. A software module can be a file in case of traditional software or a class in case of object-oriented software.

  2. The terms “fault count” and “number of faults in a software module” are used interchangeably.

  3. SK-ESD R-package https://github.com/klainfo/ScottKnottESD.

  4. https://sites.google.com/site/santoshiiitmdj/projects/empirical-analysis-of-regression-methods-for-the-software-fault-prediction?authuser=0

Acknowledgements

We are thankful to the editor and the anonymous reviewers for their valuable comments, which helped improve the paper.

Author information

Corresponding author

Correspondence to Santosh S. Rathore.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

This article does not contain any studies with human participants.

Funding

The authors declare that all funding sources have been mentioned in the article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The detailed results of all the regression methods used, in terms of the ARE, RMSE, and FPA measures, are available in the online Appendix (https://sites.google.com/site/santoshiiitmdj/projects/empirical-analysis-of-regression-methods-for-the-software-fault-prediction?authuser=0). It contains results for all five model-building scenarios considered in the presented work.

About this article

Cite this article

Rathore, S.S. An exploratory analysis of regression methods for predicting faults in software systems. Soft Comput 25, 14841–14872 (2021). https://doi.org/10.1007/s00500-021-06048-x

  • DOI: https://doi.org/10.1007/s00500-021-06048-x
