A systematic review of software fault prediction studies

doi:10.1016/j.eswa.2008.10.027

Expert Systems with Applications

Volume 36, Issue 4, May 2009, Pages 7346-7354

https://doi.org/10.1016/j.eswa.2008.10.027

Abstract

This paper provides a systematic review of previous software fault prediction studies with a specific focus on metrics, methods, and datasets. The review covers 74 software fault prediction papers from 11 journals and several conference proceedings. According to the review results, the usage percentage of public datasets has increased significantly and the usage percentage of machine learning algorithms has increased slightly since 2005. In addition, method-level metrics are still the most dominant metrics in the fault prediction research area, and machine learning algorithms are still the most popular methods for fault prediction. Researchers working in the software fault prediction area should continue to use public datasets and machine learning algorithms to build better fault predictors. The usage percentage of class-level metrics is below acceptable levels; they should be used much more than they are now in order to predict faults earlier, in the design phase of the software life cycle.

Introduction

This paper reviews several journal articles and conference papers on software fault prediction to evaluate the progress and direct future research on this software engineering problem. Many researchers have used different approaches such as genetic programming (Evett, Khoshgoftaar, Chien, & Allen, 1998), neural networks (Thwin & Quah, 2003), case-based reasoning (El Emam, Benlarbi, Goel, & Rai, 2001), fuzzy logic (Yuan, Khoshgoftaar, Allen, & Ganesan, 2000), Dempster–Shafer networks (Guo, Cukic, & Singh, 2003), decision trees (Khoshgoftaar & Seliya, 2002), Naïve Bayes (Menzies, Greenwald, & Frank, 2007), and logistic regression (Denaro et al., 2003; Schneidewind, 2001) to predict software faults before the testing process. We applied the Artificial Immune Systems paradigm to fault prediction during our Fault Prediction Research Program (Catal & Diri, 2007a, 2007b, 2008).

This review does not describe all these prediction models for practitioners in detail. Our aim is to classify studies with respect to the metrics, methods, and datasets that have been used in these prediction papers. We evaluated papers published before and after 2005 with respect to metrics, methods, and datasets because the PROMISE repository was created in 2005. The PROMISE repository includes a collection of public datasets for building repeatable, refutable and verifiable models of software engineering. It was inspired by the UCI Machine Learning Repository, which is widely used by researchers in the machine learning area (Sayyad & Menzies, 2005).
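The before/after comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch: the records, the field names (`year`, `dataset`, `method`), the function `usage_percentages`, and the choice to count papers from 2005 itself in the "after" group are all assumptions made for the example, not data or procedures taken from the review.

```python
from collections import defaultdict

# Hypothetical records for illustration only; NOT data from the review.
papers = [
    {"year": 2001, "dataset": "private", "method": "statistical"},
    {"year": 2003, "dataset": "private", "method": "machine learning"},
    {"year": 2004, "dataset": "public", "method": "machine learning"},
    {"year": 2006, "dataset": "public", "method": "machine learning"},
    {"year": 2007, "dataset": "public", "method": "statistical"},
    {"year": 2007, "dataset": "private", "method": "machine learning"},
]


def usage_percentages(records, attribute, split_year=2005):
    """Return the usage percentage of each value of `attribute`,
    separately for papers before and from `split_year` onward.

    Assumption: papers published in `split_year` count as "after".
    """
    counts = {"before": defaultdict(int), "after": defaultdict(int)}
    for record in records:
        period = "before" if record["year"] < split_year else "after"
        counts[period][record[attribute]] += 1

    result = {}
    for period, per_value in counts.items():
        total = sum(per_value.values())
        # Avoid dividing by zero when a period has no papers.
        result[period] = (
            {value: 100.0 * n / total for value, n in per_value.items()}
            if total
            else {}
        )
    return result


if __name__ == "__main__":
    for attribute in ("dataset", "method"):
        print(attribute, usage_percentages(papers, attribute))
```

The same function can be applied to any categorical attribute recorded per paper, such as the metric level, by passing a different `attribute` name.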

Jorgensen and Shepperd (2007) provided a systematic review of software development cost estimation studies, and our review methodology is similar to theirs. To our knowledge, this is the first study to provide a systematic review of software fault prediction studies from different perspectives. We posed the eight research questions shown in Table 1, and these questions helped us to collect the necessary information from the papers in our review process.

This paper is organized as follows: Section 2 describes the review process. Section 3 reports the results. Section 4 suggests issues for future research on software fault prediction.

Inclusion criteria

We included papers in our review if the paper describes research on software fault prediction and software quality prediction. We excluded position proceedings and papers which do not include experimental results. Papers were examined with respect to their years, datasets, metrics, techniques, evaluation criteria and results. The inclusion of papers was based on the degree of similarity of the study to the fault prediction research topic. The exclusion did not take into account the publication

Results

Twenty-seven journal papers and 47 conference proceedings were systematically evaluated in this review. The papers were published between 1990 and 2007. Fig. 1 plots the publication year on the x-axis and the number of reviewed papers published in that year on the y-axis.

Sixty-one percent of the papers are conference proceedings, 36% are journal papers, and 3% are book chapters.

Each subsection of this section will address each research

Conclusion

This paper reviewed software fault prediction papers published in conference proceedings and journals to evaluate the progress and direct future research on software fault prediction. We evaluated papers with a specific focus on types of metrics, methods, and datasets and did not describe all the prediction models in detail. The aim was to classify studies with respect to metrics, methods, and datasets that have been used in fault prediction papers. We evaluated papers published before and

Acknowledgements

This study is supported by The Scientific and Technological Research Council of Turkey (TUBITAK) under Grant 107E213. The findings and opinions in this study belong solely to the authors, and are not necessarily those of the sponsor.

References (36)

El Emam, K., et al. (2001). Comparing case-based reasoning classifiers for predicting high risk software components. Journal of Systems and Software.
Mahmood, S., et al. (2005). A survey of component based system quality assurance and assessment. Information and Software Technology.
Abreu, F. B. e., & Carapuca, R. (1994). Object-oriented software engineering: Measuring and controlling the development...
Abreu, F. B. e., & Melo, W. (1996). Evaluating the impact of object-oriented design on software quality. In Proceedings...
Bansiya, J., et al. (2002). A hierarchical model for object-oriented design quality assessment. IEEE Transactions on Software Engineering.
Bibi, S., Tsoumakas, G., Stamelos, I., & Vlahvas, I. (2006). Software defect prediction using regression via...
Catal, C., & Diri, B. (2007a). Software defect prediction using artificial immune recognition system. In Proceedings of...
Catal, C., et al. Software fault prediction with object-oriented metrics based artificial immune recognition system.
Catal, C., et al. A fault prediction model with limited fault data to improve test process. Product focused software process improvement.
Chidamber, S. R., et al. (1994). A metrics suite for object-oriented design. IEEE Transactions on Software Engineering.
De Almeida, M. A., & Matwin, S. (1999). Machine learning method for software quality model building. In Eleventh...
Denaro, G., et al. (2003). Towards industrially relevant fault-proneness models. International Journal of Software Engineering and Knowledge Engineering.
Evett, M., Khoshgoftaar, T., Chien, P., & Allen, E. (1998). GP-based software quality prediction. In Proceedings of the...
Gill, N. S., et al. (2003). Component based measurement: Few useful guidelines. SIGSOFT Software Engineering Notes.
Gill, N. S., et al. (2004). Few important considerations for deriving interface complexity metric for component-based systems. SIGSOFT Software Engineering Notes.
Guo, L., et al. Predicting fault prone modules by the Dempster–Shafer belief networks.
Halstead, M. (1977). Elements of software science.
Jiang, Y., et al. Fault prediction using early lifecycle data.
