Abstract
Regression methods such as linear regression and decision tree regression have previously been used to build software fault prediction (SFP) models. However, these methods showed limited SFP performance with high misclassification errors. Earlier works did not adequately address issues such as multicollinearity, feature scaling, and the imbalanced distribution of faulty and non-faulty modules in the dataset, which may be a potential cause of the poor prediction performance of these regression methods. Motivated by this, in this paper we investigate the impact of 15 different regression methods on fault count prediction in software systems and report their interpretation for fault models. We consider different fault data quality issues and present a comprehensive assessment of how the regression methods handle them. We believe that many of the used regression methods have not been explored before for SFP under different data quality issues. In the presented study, 44 fault datasets and their versions, collected from the PROMISE software data repository, are used to validate the performance of the regression methods, and absolute relative error (ARE), root mean square error (RMSE), and fault-percentile-average (FPA) are used as the performance measures. For model building, five different scenarios are considered: (1) original dataset without preprocessing; (2) standardized processed dataset; (3) balanced dataset; (4) non-multicollinearity processed dataset; (5) balanced + non-multicollinearity processed dataset. Experimental results showed that, overall, kernel-based regression methods, KernelRidge and SVR (support vector regression, with both linear and nonlinear kernels), yielded the best performance for predicting fault counts compared to the other methods.
Other regression methods, in particular NNR (nearest neighbor regression), RFR (random forest regression), and GBR (gradient boosting regression), also performed with significant accuracy. Further, the results showed that applying standardization and handling multicollinearity in the fault dataset helped improve the regression methods' performance. It is concluded that regression methods are promising for building software fault prediction models.
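The paper's own implementation is not reproduced here; the following is a minimal sketch, assuming scikit-learn, of the kind of evaluation the abstract describes: fitting KernelRidge, SVR (linear and RBF kernels), and random forest regressors on standardized module metrics (scenario 2) and scoring predicted fault counts by RMSE. The synthetic data merely stands in for a PROMISE fault dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a fault dataset: rows are software modules,
# columns are code metrics, and the target is the module's fault count.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
y = np.rint(np.abs(y) / 50.0)  # coerce targets into non-negative counts

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "KernelRidge": KernelRidge(kernel="rbf"),
    "SVR-linear": SVR(kernel="linear"),
    "SVR-rbf": SVR(kernel="rbf"),
    "RFR": RandomForestRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    # Scenario (2): standardize the metrics before fitting each regressor.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_tr, y_tr)
    pred = pipe.predict(X_te)
    results[name] = mean_squared_error(y_te, pred) ** 0.5  # RMSE

for name, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {rmse:.3f}")
```

The other scenarios from the abstract would slot into the same loop, e.g. oversampling the training split for scenario (3) or dropping highly correlated metrics for scenario (4).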
Notes
A software module can be a file in the case of traditional software or a class in the case of object-oriented software.
Fault count and the number of faults in a software module are used interchangeably.
SK-ESD R-package https://github.com/klainfo/ScottKnottESD.
Acknowledgements
We are thankful to the editor and the anonymous reviewers for their valuable comments, which helped improve the paper.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
This article does not contain any studies with human participants.
Funding
The authors declare that all funding sources have been mentioned in the article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
The detailed results of all used regression methods in terms of the ARE, RMSE, and FPA measures are contained in the online Appendix (https://sites.google.com/site/santoshiiitmdj/projects/empirical-analysis-of-regression-methods-for-the-software-fault-prediction?authuser=0). It contains the results for all five model-building scenarios considered in the presented work.
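Of the three measures, FPA is the least standard. For reference, a sketch of its computation following the usual definition (the average, over all cutoffs m, of the proportion of actual faults contained in the m modules with the highest predicted fault counts); the function name and NumPy formulation below are illustrative, not the paper's code:

```python
import numpy as np

def fpa(y_true, y_pred):
    """Fault-percentile-average of a predicted fault-count ranking.

    Modules are sorted in ascending order of predicted fault count; for
    each m = 1..K, the proportion of actual faults in the top-m modules
    is taken, and FPA is the mean of these proportions. Higher is better.
    """
    order = np.argsort(y_pred)                  # ascending by prediction
    n = np.asarray(y_true, dtype=float)[order]  # actual faults, reordered
    top_m_faults = np.cumsum(n[::-1])           # faults in top 1, 2, ..., K
    return top_m_faults.sum() / (len(n) * n.sum())

actual = np.array([0, 1, 2, 3])
print(fpa(actual, actual))   # perfect ranking: highest FPA for this data
print(fpa(actual, -actual))  # inverted ranking scores strictly lower
```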
About this article
Cite this article
Rathore, S.S. An exploratory analysis of regression methods for predicting faults in software systems. Soft Comput 25, 14841–14872 (2021). https://doi.org/10.1007/s00500-021-06048-x