Techniques for evaluating fault prediction models

Abstract

Many statistical techniques have been proposed to predict the fault-proneness of program modules in software engineering. Choosing the “best” candidate among many available models involves performance assessment and detailed comparison, but these comparisons are not straightforward because many different performance measures can be applied. Classifying a software module as fault-prone implies the application of some verification activities, thus adding to the development cost. Misclassifying a module as fault-free carries the risk of system failure, also associated with cost implications. Methodologies for precise evaluation of fault prediction models should be at the core of empirical software engineering research, but have attracted sporadic attention. In this paper, we overview model evaluation techniques. In addition to many techniques that have been used in software engineering studies before, we introduce and discuss the merits of cost curves. Using data from a public repository, our study demonstrates the strengths and weaknesses of performance evaluation techniques and points to the conclusion that the selection of the “best” model cannot be made without considering project cost characteristics, which are specific to each development environment.

References

  • Adams NM, Hand DJ (1999) Comparing classifiers when the misallocation costs are uncertain. Pattern Recognit 32:1139–1147. doi:10.1016/S0031-3203(98)00154-X

  • Arisholm E, Briand LC (2006) Predicting fault-prone components in a java legacy system. Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering (ISESE’06)

  • Azar D, Precup D, Bouktif S, Kegl B, Sahraoui H (2002) Combining and adapting software quality predictive models by genetic algorithms. 17th IEEE International Conference on Automated Software Engineering. IEEE Computer Society

  • Basili VR, Briand LC, Melo WL (1996) A validation of object-oriented design metrics as quality indicators. IEEE Trans Softw Eng 22(10):751–761. doi:10.1109/32.544352

  • Boetticher GD (2005) Nearest neighbor sampling for better defect prediction. ACM SIGSOFT Software Engineering Notes, 30(4). ACM, New York, NY, pp 1–6

  • Braga AC, Costa L, Oliveira P (2006) A nonparametric method for the comparison of areas under two ROC curves. International Conference on Robust Statistics (ICORS06). Technical University of Lisbon, 16–21 July 2006, Lisbon, Portugal

  • Breiman L (2001) Random forests. Mach Learn 45:5–32. doi:10.1023/A:1010933404324

  • Challagulla VUB, Bastani FB, Yen I-L, Paul RA (2005) Empirical assessment of machine learning based software defect prediction techniques. Proceedings of the 10th IEEE International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS’05), pp 263–270

  • Conover WJ (1999) Practical nonparametric statistics. Wiley, New York

  • Davis J, Goadrich M (2006) The relationship between precision-recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, PA, pp 233–240

  • Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  • Drummond C, Holte RC (2006) Cost curves: an improved method for visualizing classifier performance. Mach Learn 65(1):95–130. doi:10.1007/s10994-006-8199-5

  • El-Emam K, Benlarbi S, Goel N, Rai SN (2001) Comparing case-based reasoning classifiers for predicting high-risk software components. J Syst Softw 55(3):301–320. doi:10.1016/S0164-1212(00)00079-0

  • Fenton N, Neil M (1999) Software metrics and risk. The 2nd European Software Measurement Conference (FESMA 99), TI-KVIV, Amsterdam, pp 39–55

  • Gokhale SS, Lyu MR (1997) Regression tree modeling for the prediction of software quality. In: Pham H (ed) The third ISSAT International Conference on Reliability and Quality in Design. Anaheim, CA, pp 31–36

  • Guo L, Ma Y, Cukic B, Singh H (2004) Robust prediction of fault-proneness by random forests. Proceedings of the 15th IEEE International Symposium on Software Reliability Engineering (ISSRE 2004), IEEE Press

  • Khoshgoftaar TM, Lanning DL (1995) A neural network approach for early detection of program modules having high risk in the maintenance phase. J Syst Softw 29(1):85–91. doi:10.1016/0164-1212(94)00130-F

  • Khoshgoftaar TM, Allen EB, Ross FD, Munikoti R, Goel N, Nandi A (1997) Predicting fault-prone modules with case-based reasoning. The Eighth International Symposium on Software Reliability Engineering (ISSRE '97). IEEE Computer Society, pp 27–35

  • Khoshgoftaar TM, Seliya N (2002) Tree-based software quality estimation models for fault prediction. The 8th IEEE Symposium on Software Metrics (METRICS’02), IEEE Computer Society, pp 203–214

  • Khoshgoftaar TM, Cukic B, Seliya N (2007) An empirical assessment on program module-order models. Qual Technol Quant Manag 4(2):171–190

  • Koru AG, Liu H (2005) Building effective defect-prediction models in practice. IEEE Softw 22(6):23–29. doi:10.1109/MS.2005.149

  • Kubat M, Holte RC, Matwin S (1998) Machine learning for the detection of oil spills in satellite radar images. Mach Learn 30(2–3):195–215. doi:10.1023/A:1007452223027

  • Lewis D, Gale W (1994) A sequential algorithm for training text classifiers. Annual ACM Conference on Research and Development in Information Retrieval, the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, New York, NY, pp 3–12

  • Ling CX, Li C (1998) Data mining for direct marketing: problems and solutions. Proc. of the 4th Intern. Conf. on Knowledge Discovery and Data Mining, New York, pp 73–79

  • Ma Y (2007) An empirical investigation of tree ensembles in biometrics and bioinformatics. West Virginia University, PhD thesis, January 2007

  • Macskassy S, Provost F, Rosset S (2005a) Pointwise ROC confidence bounds: an empirical evaluation. Proceedings of the Workshop on ROC Analysis in Machine Learning (ROCML-2005)

  • Macskassy S, Provost F, Rosset S (2005b) ROC confidence bands: an empirical evaluation. Proceedings of the 22nd International Conference on Machine Learning (ICML). Bonn, Germany

  • Menzies T, Stefano JD, Ammar K, Chapman RM, McGill K, Callis P et al (2003) When can we test less? Proceedings of the Ninth International Software Metrics Symposium (METRICS’03), IEEE Computer Society

  • Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13. doi:10.1109/TSE.2007.256941

  • Ostrand TJ, Weyuker EJ, Bell RM (2005) Predicting the location and number of faults in large software systems. IEEE Trans Softw Eng 31(4):340–355. doi:10.1109/TSE.2005.49

  • Ohlsson N, Alberg H (1996) Predicting fault-prone software modules in telephone switches. IEEE Trans Softw Eng 22(12):886–894. doi:10.1109/32.553637

  • Ohlsson N, Eriksson AC, Helander ME (1997) Early risk-management by identification of fault-prone modules. Empir Softw Eng 2(2):166–173. doi:10.1023/A:1009757419320

  • Selby RW, Porter AA (1988) Learning from examples: generation and evaluation of decision trees for software resource analysis. IEEE Trans Softw Eng 14(12):1743–1757. doi:10.1109/32.9061

  • Siegel S (1956) Nonparametric statistics. McGraw-Hill, New York

  • Vuk M, Curk T (2006) ROC curve, lift chart and calibration plot. Metodoloski zvezki 3:89–108

  • Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann

  • Youden W (1950) Index for rating diagnostic tests. Cancer 3:32–35. doi:10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3

  • Yousef WA, Wagner RF, Loew MH (2004) Comparison of non-parametric methods for assessing classifier performance in terms of ROC parameters. In Proceedings of Applied Imagery Pattern Recognition Workshop, vol. 33, issue 13–15, pp 190–195

  • Zhang H, Zhang X (2007) Comments on ‘data mining static code attributes to learn defect predictors’. IEEE Trans Softw Eng 33(9):635–637

Author information

Correspondence to Bojan Cukic.

Additional information

Editor: Tim Menzies

Appendices

Appendix A: Characteristics of NASA MDP dataset

Accurate prediction of fault-proneness in the software development process enables effective identification of modules which are likely to hide faults. Corrective and remedial steps can be adopted early in the development lifecycle, before the project encounters costly redevelopment efforts in later phases. Software projects vary in size and complexity, programming languages, development processes, etc. When reporting a fault prediction modeling experiment, it is therefore important to share the characteristics of the datasets. Here, we outline the characteristics of the NASA Metrics Data Program (MDP) datasets, which inevitably have a significant impact on fault prediction. Some of our observations appear to be general, i.e., valid across many reported modeling results, while others are specific to these datasets.

First, only a small proportion of software modules are faulty (Menzies et al. 2007)

This is one of the most consistent characteristics of software defect databases: faulty modules constitute only a small portion of the software product base. Table 1 (Section 1.1) provides a sample of projects released as part of the NASA Metrics Data Program (Metrics Data Program NASA IV&V facility, http://mdp.ivv.nasa.gov/). Module faults were detected either during development or after deployment. For project PC1, only 7% of the modules are faulty and 93% are fault-free; KC4 has the largest percentage of faulty modules among all five projects considered here, 48%, but it is also the smallest one (only 125 modules) and the only one that uses a scripting language (Perl). In all other datasets, there are at least three times as many fault-free modules as faulty ones. Such a significantly skewed distribution of faulty and non-faulty modules presents a problem for the supervised learning algorithms typically utilized in software engineering studies, because most supervised learning techniques aim to maximize overall classification accuracy and ignore the class distribution. For example, even if we declared all modules in PC1 fault-free, thereby misclassifying every faulty module, the overall accuracy would reach 93%. Judged by accuracy alone, this would appear to be an excellent result for a fault-proneness prediction model. However, such a model would be useless, as the sole purpose of predictive software quality studies is predicting where faults hide (Menzies et al. 2007).
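
To make this accuracy paradox concrete, here is a minimal Python sketch (not part of the original study) that scores a trivial “always fault-free” classifier on a synthetic class distribution matching the 7%/93% split quoted above; the module count is illustrative only.

```python
# Minimal sketch: the accuracy paradox on a PC1-like class distribution
# (roughly 7% faulty, 93% fault-free). A "classifier" that labels every
# module fault-free still scores about 93% accuracy while finding no faults.
import numpy as np

rng = np.random.default_rng(0)
n_modules = 1000                                      # illustrative count, not the real PC1 size
y_true = (rng.random(n_modules) < 0.07).astype(int)   # 1 = faulty, roughly 7% of modules

y_pred = np.zeros(n_modules, dtype=int)               # trivial model: predict "fault-free" everywhere

accuracy = (y_pred == y_true).mean()
detection = (y_pred[y_true == 1] == 1).mean()         # fraction of faulty modules actually caught

print(f"overall accuracy: {accuracy:.1%}")            # around 93%
print(f"faults detected : {detection:.1%}")           # 0% -- the model is useless
```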

Second, software metrics used to build predictive models exhibit high correlation (Menzies et al. 2007)

Typically, models use software metrics as prediction variables, and these metrics tend to correlate with each other. Take the projects listed in Table 1, for example. Each data set contains twenty-one software metrics, which describe product size, complexity, and some structural properties (Metrics Data Program NASA IV&V facility, http://mdp.ivv.nasa.gov/). Tables 8 and 9 display the Pearson correlation coefficients among the predictive variables for projects PC1 and KC2, respectively. Due to space constraints, we tabulate only the correlations between lines of code (LOC) and five other randomly chosen variables for each project. In Table 8, the five randomly selected variables and their metric types are: total operands (TOpnd)—Basic Halstead, volume (V)—Derived Halstead, effort estimate (B)—Derived Halstead, lines of code and comment (LCC)—Line Count, and length (N)—Derived Halstead. The variables selected for KC2 (Table 9) are: unique operators (UOp)—Basic Halstead, volume (V)—Derived Halstead, design complexity (IV.G)—McCabe, total operators (TOp)—Basic Halstead, and blank lines (LOB)—Line Count. For both projects, the listed variables are moderately to strongly correlated; the majority of the correlation coefficients are above 0.90.

Table 8 Correlation coefficients among six predictive variables in project PC1
Table 9 Correlation coefficients among six predictive variables in project KC2
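
A correlation table of this kind can be produced along the following lines; this is a sketch assuming the MDP project data has been exported to a CSV file, and the file name and column names are hypothetical stand-ins for the metrics discussed above.

```python
# Sketch of the pairwise Pearson correlation analysis behind Tables 8 and 9.
import pandas as pd

pc1 = pd.read_csv("pc1.csv")   # hypothetical export of the PC1 metrics

# Hypothetical column names standing in for LOC and five other metrics
metrics = ["loc", "total_operands", "volume", "b_estimate", "loc_code_and_comment", "length"]
corr = pc1[metrics].corr(method="pearson")   # pairwise Pearson coefficients

# Coefficients above roughly 0.9 flag strongly correlated, near-redundant metrics
print(corr.round(2))
```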

The phenomenon of high correlation among predictive variables is termed multicollinearity in regression analysis. As Fenton and Neil (1999) state, multicollinearity produces unacceptable uncertainty in regression coefficient estimates. Specifically, the coefficients can change drastically depending on which terms (that is, metrics) are present in the model, and also depending on the order in which they are placed in the model.
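
The coefficient instability caused by multicollinearity can be illustrated with a small synthetic sketch (synthetic data, not the MDP metrics themselves): two nearly collinear predictors are generated, and the coefficient estimated for the same predictor changes markedly depending on whether its near-duplicate is included.

```python
# Sketch: with two near-duplicate predictors, regression coefficients swing
# depending on which terms are included in the model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
loc = rng.gamma(shape=2.0, scale=30.0, size=500)       # a "size" metric
operands = 2.5 * loc + rng.normal(0, 5, size=500)      # almost collinear with loc
faults = 0.02 * loc + rng.normal(0, 1, size=500)       # response variable

both = LinearRegression().fit(np.column_stack([loc, operands]), faults)
only_loc = LinearRegression().fit(loc.reshape(-1, 1), faults)

print("coef(loc), operands included:", both.coef_[0])
print("coef(loc), loc alone:        ", only_loc.coef_[0])
# The two estimates for the same metric can differ substantially.
```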

Third, many fault-free modules tend to be small in size

The lines of code (LOC) metric is commonly used to measure the size of a module. We calculated the 90th percentile of LOC for faulty and fault-free modules in five MDP projects and summarized the findings in Table 10. The 90th percentile is a score such that 90% of the scores fall below it. For example, in project PC1, about 90% of the fault-prone modules have at most 114 lines of code, while 90% of the fault-free modules have at most 47 lines of code. Except in KC4, fault-free modules tend to be shorter than faulty modules. However, many software metrics depend on size, and the measurements of relatively small modules tend to be “close” to each other. This may confuse machine learning algorithms and is likely to cause poor prediction results. Koru and Liu (2005) made the same observation and stated that “small modules show little variation, which would make it difficult for a machine-learning algorithm to distinguish between small—defective and small—nondefective modules”. Dealing with such small components appears to be specific to studies of the MDP datasets, limiting the generality of conclusions reached by studying the projects included in this repository.

Table 10 The 90th percentile of lines of code (LOC) for the collection of faulty modules and fault-free modules in each project
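
A sketch of the percentile computation behind Table 10, again assuming a hypothetical CSV export with columns "loc" and "defective" (1 = faulty).

```python
# Sketch: 90th percentile of LOC for faulty vs. fault-free modules.
import numpy as np
import pandas as pd

pc1 = pd.read_csv("pc1.csv")   # hypothetical export; column names are assumptions

loc_faulty = pc1.loc[pc1["defective"] == 1, "loc"]
loc_clean  = pc1.loc[pc1["defective"] == 0, "loc"]

print("90th percentile LOC, faulty:    ", np.percentile(loc_faulty, 90))
print("90th percentile LOC, fault-free:", np.percentile(loc_clean, 90))
```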

Fourth, a significant portion of the minority class instances (i.e., fault-prone modules) are “close neighbors” of majority class instances

Table 11 provides the evidence. For each module in a training set, we find the nearest training set vector in Euclidean distance (the module with the most similar metrics) and count the proportion of faulty modules whose nearest neighbor belongs to the majority class (i.e., fault-free modules). In addition, we find the three nearest neighbors of each faulty module and compute the percentage of faulty modules that have at least two of the three neighbors from the majority class. Table 11 shows that a significant number of minority class cases are close to majority class instances in the feature space. This phenomenon is a consequence of the characteristics mentioned above. With fault-free modules making up the majority of the learning set, faulty modules are likely to have measurements similar to many fault-free ones. Given the second and third properties, small modules, faulty or not, are likely to be similar to each other. This poses a problem for all machine learning techniques (Boetticher 2005), especially for instance-based learning methods. Project KC4 is an exception, which may partly explain why, in some studies, classification algorithms achieve better overall classification results on this dataset (Menzies et al. 2007) than on the others analyzed here.

Table 11 Class membership of the neighborhood instance for the faulty modules
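
The neighborhood analysis can be reproduced along the following lines; this is a sketch assuming the same hypothetical CSV export, with metrics standardized before computing Euclidean distances (the original study's exact preprocessing may differ).

```python
# Sketch: for each faulty module, inspect its three nearest neighbours (by
# Euclidean distance over the metric vectors, excluding the module itself)
# and count how often they belong to the fault-free majority class.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

pc1 = pd.read_csv("pc1.csv")                       # hypothetical export; "defective" column assumed
X = StandardScaler().fit_transform(pc1.drop(columns=["defective"]))
y = pc1["defective"].to_numpy()

nn = NearestNeighbors(n_neighbors=4).fit(X)        # self + 3 neighbours
_, idx = nn.kneighbors(X[y == 1])
neighbour_labels = y[idx[:, 1:]]                   # drop the module itself

nearest_is_clean = (neighbour_labels[:, 0] == 0).mean()
majority_clean = ((neighbour_labels == 0).sum(axis=1) >= 2).mean()

print(f"faulty modules whose nearest neighbour is fault-free: {nearest_is_clean:.1%}")
print(f"faulty modules with >= 2 of 3 fault-free neighbours:  {majority_clean:.1%}")
```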

These four observations are not necessarily the only interesting or unique aspects of the MDP datasets. However, a good understanding of the data is necessary for selecting adequate classification techniques and plays an important role in model selection and comparison.

Appendix B: A Brief Description of the Six Classification Algorithms

Six machine learning algorithms are used in illustrative examples throughout the paper: Random Forest, Naïve Bayes, Bagging, J48, Logistic Regression and IBk.

Random forest (rf) is a decision tree-based classifier demonstrated to have good performance in software engineering studies by Guo et al. (2004). As its name implies, it builds a “forest” of decision trees. The trees are constructed using the following strategy: the root node of each tree contains a bootstrap sample of the same size as the original data, and each tree uses a different bootstrap sample. At each node, a subset of variables is randomly selected from all the input variables to split the node, and the best split is adopted. Each tree is grown to the largest extent possible, without pruning. Once all trees in the forest are built, each new instance is passed down all the trees and a vote takes place: the forest selects the class with the most votes as its prediction for the new instance.
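
The construction strategy just described can be mirrored in a compact sketch: bootstrap samples, a random feature subset at each split (delegated here to scikit-learn's decision tree via max_features="sqrt"), unpruned trees, and majority voting. In practice one would use a library implementation such as sklearn.ensemble.RandomForestClassifier; this sketch only illustrates the description above and is not the implementation used in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    """X, y: NumPy arrays (metrics matrix and 0/1 fault labels)."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))           # bootstrap sample, same size as the data
        tree = DecisionTreeClassifier(max_features="sqrt")   # random feature subset at each node
        forest.append(tree.fit(X[idx], y[idx]))              # grown to full depth, no pruning
    return forest

def predict_forest(forest, X):
    votes = np.stack([tree.predict(X) for tree in forest])   # one row of votes per tree
    return (votes.mean(axis=0) >= 0.5).astype(int)           # majority vote over binary labels
```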

Naïve Bayes (nb) “naively” assumes that the predictive attributes are independent of one another. This assumption may be considered overly simplistic in many application scenarios; however, its performance on software engineering data sets is surprisingly good. The Naïve Bayes classifier has been used extensively in fault-proneness prediction by Menzies et al. (2007).

Bagging (bag) stands for bootstrap aggregating. It relies on an ensemble of models, each trained on data resampled (with replacement) from the original data set. According to Witten and Frank (2005), bagging typically performs better than single-method models and almost never significantly worse.

J48 is the Weka (Witten and Frank 2005) implementation of Quinlan’s C4.5 decision tree algorithm (Guo et al. 2004). A decision tree is a tree structure in which non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes. It is a popular, classic machine learning algorithm (Witten and Frank 2005).

Logistic regression (log) is a classification scheme that models class membership probabilities using the logistic function; it is among the most popular generalized linear models.

IBk is the Weka (Witten and Frank 2005) implementation of the k-nearest-neighbor classifier. With the default value k = 1, IBk is in fact IB1, the basic instance-based nearest-neighbor learner that searches for the training instance closest in Euclidean distance to a given test instance and uses the result of the search for classification (Witten and Frank 2005).
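
For readers who want to experiment, the six learners have rough scikit-learn analogues. The paper's experiments used Weka itself, so these are approximations (for instance, DecisionTreeClassifier stands in for J48/C4.5), and the "pc1.csv" file and "defective" column are hypothetical.

```python
# Approximate scikit-learn counterparts of the six Weka learners, evaluated
# with 10-fold cross-validation and AUC on a hypothetical data export.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

learners = {
    "rf":  RandomForestClassifier(),
    "nb":  GaussianNB(),
    "bag": BaggingClassifier(),
    "j48": DecisionTreeClassifier(),             # C4.5-style tree stand-in
    "log": LogisticRegression(max_iter=1000),
    "ibk": KNeighborsClassifier(n_neighbors=1),  # IBk with the default k = 1
}

pc1 = pd.read_csv("pc1.csv")                     # hypothetical export; "defective" column assumed
X, y = pc1.drop(columns=["defective"]), pc1["defective"]

for name, clf in learners.items():
    auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
    print(f"{name:>3}: mean AUC = {auc.mean():.3f}")
```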

Cite this article

Jiang, Y., Cukic, B. & Ma, Y. Techniques for evaluating fault prediction models. Empir Software Eng 13, 561–595 (2008). https://doi.org/10.1007/s10664-008-9079-3
