Improving Recall of software defect prediction models using association mining
Introduction
Early identification of defect-prone modules improves software process control and reduces defect correction effort, hence lowering cost and increasing software reliability [1], [2], [3], [4]. Managing resources during testing is a non-trivial task [5], and identifying defect-prone modules helps in planning testing resources [5], [6], [7]. This identification is done using defect prediction techniques, which also help in controlling software projects and developing resource and test plans [6], [8].
Software defect prediction techniques either classify a software module as Defect-prone (D) or Not Defect-prone (ND), or predict the number of defects in a module. Both types of models take software metrics as input. A significant number of software defect prediction models are based on product metrics [9], [10]. The use of product metrics for defect prediction has been criticized for failing to capture a causal relationship between the metrics and software defects [2], [11]. Despite this critique, metrics collected from static analysis have been made publicly available [12] and have encouraged the development of numerous prediction models. These metrics are available as datasets that are considered benchmarks in the domain of software quality; alternate benchmarks, if any, are not well known.
Although the available datasets have been used to develop models with good Recall (the proportion of D modules correctly predicted out of all D modules) and Area Under the Curve (AUC, which can be interpreted as the probability that a model assigns a higher score to a randomly chosen D module than to a randomly chosen ND module) [13], [14], [15], [16], [17], these datasets have two limitations: (1) they are imbalanced, and (2) the static code attributes they contain have limited information content [13]. Most of the datasets have a significantly larger number of ND modules than D modules. The small number of D modules (as training examples) limits the ability of classification models to learn the D class accurately. Class imbalance is considered a problem in the domain of software defect prediction [18], [19], [20] as well as in other domains [21], [22], [23], [24]. The limited information content suggests that simple learners (like Naive Bayes) can perform as well as complex learners (such as J48) [13].
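The two measures above can be made concrete with a short sketch. The labels, predictions, and scores below are invented for illustration; the `recall` and `auc` functions implement the standard definitions used in the paper.

```python
# Illustrative sketch: computing Recall and AUC for a defect predictor.
# Labels are 1 for Defect-prone (D) and 0 for Not Defect-prone (ND);
# scores are the model's predicted defect-proneness.

def recall(y_true, y_pred):
    """Proportion of D modules that were correctly predicted as D."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def auc(y_true, scores):
    """Probability that a random D module scores higher than a random
    ND module (ties count as 0.5) -- the probabilistic reading of AUC."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0, 0]           # an imbalanced toy sample: 2 D, 3 ND
y_pred = [1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.3, 0.2, 0.6]
print(recall(y_true, y_pred))       # 0.5: one of the two D modules found
print(auc(y_true, scores))
```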
Classifying a D module correctly is very important in software defect prediction. Software development organizations cannot afford to ship defective modules to customers and therefore strive to detect as many defective modules as possible before release. The significance of correctly identifying defect-prone modules can be seen through the Pareto principle as applied in software engineering: 80% of the defects are located in 20% of the code [25]. This also gives an insight into the high cost of misclassifying D modules. In the domain of software defect prediction, mostly standard machine learning algorithms are used, which do not directly address the issue of class imbalance [20]. Standard learning algorithms are biased towards the dominant class and may not perform at their best on imbalanced datasets [21]; they also tend to discard the scarce class by treating it as noise [21]. Therefore, researchers oversample the scarce class or undersample the dominant class before applying the learning algorithm [21]. This yields a balanced class distribution, so that algorithms designed for balanced training sets can learn both classes equally well.
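The resampling step mentioned above can be sketched as follows. This is a minimal random-oversampling example, not the paper's method; the module features (e.g. LOC and complexity) and the tiny dataset are invented for illustration.

```python
import random

# Illustrative sketch: randomly oversample the scarce (D) class so a
# standard learner sees a balanced training set.

def oversample_minority(rows, labels, minority=1, seed=0):
    """Duplicate minority-class rows (with replacement) until both
    classes have equal counts."""
    rng = random.Random(seed)
    majority_n = sum(1 for l in labels if l != minority)
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    needed = majority_n - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(needed)]
    return list(rows) + extra, list(labels) + [minority] * needed

rows = [[10, 2], [12, 3], [50, 9], [8, 1], [9, 2]]   # e.g. LOC, complexity
labels = [0, 0, 1, 0, 0]                              # one D module in five
X, y = oversample_minority(rows, labels)
print(y.count(1), y.count(0))  # 4 4: balanced after oversampling
```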
The challenge of class imbalance is also faced in the domain of direct marketing, where profit needs to be maximized even though the customers unlikely to respond greatly outnumber those likely to respond [22]. Similarly, charity organizations need to contact potential donors from a list of people who have pledged to donate, and only a small proportion of pledges is fulfilled [22]. In this domain the problem of imbalance has been addressed through association mining (AM). Association Rule Mining (ARM) is an important data mining technique employed for discovering interesting relationships between variables in large databases [26]; it is used to find interesting correlations, frequent patterns, associations, or causal structures among items of large datasets [27]. Fuzzy logic has also been applied in other domains to address the issue of class imbalance [23].
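A minimal sketch of association rule mining follows: count frequent itemsets in a small transaction database and derive rules that meet minimum support and confidence. The transactions, item names, and thresholds are invented for illustration, not taken from the paper.

```python
from itertools import combinations
from collections import Counter

# Illustrative ARM sketch: each transaction is the set of (discretized)
# properties observed for one module; "defect" marks a D module.
transactions = [
    {"high_LOC", "high_complexity", "defect"},
    {"high_LOC", "defect"},
    {"low_LOC"},
    {"high_LOC", "high_complexity", "defect"},
    {"low_LOC", "high_complexity"},
]
min_support, min_conf = 0.4, 0.7

# Support counts for all 1- and 2-itemsets.
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

n = len(transactions)
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}

# Rules X -> Y from frequent 2-itemsets, kept if confidence is high enough.
rules = []
for itemset, supp in frequent.items():
    if len(itemset) != 2:
        continue
    for lhs in itemset:
        rhs = next(i for i in itemset if i != lhs)
        conf = supp / frequent[(lhs,)]
        if conf >= min_conf:
            rules.append((lhs, rhs, round(conf, 2)))

print(sorted(rules))  # e.g. high_LOC -> defect with high confidence
```

On this toy data the rule `high_LOC -> defect` survives the thresholds, which is the kind of metric-to-defect association the approach exploits.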
Owing to limited information content, defect prediction models have reached a performance ceiling, which can be crossed only if the information content is improved. Information content can be improved by collecting more insightful data or by accessing and combining relevant features available at the time of model development [13]. For most of the public datasets, where the additional data required to improve the information content is not available, using the existing data in an insightful and different way can improve the quality of defect prediction.
This paper uses the information available in the public datasets in an effective manner: it applies association mining (AM) to find associations between software metrics and software defects, and improves the performance of a classification model on imbalanced datasets. The datasets are preprocessed using the proposed approach, a defect prediction model is developed using the preprocessed data, and the performance of the model is analyzed in terms of Recall. The preprocessing step partitions the data and finds important itemsets; these itemsets are relabeled in one partition of the data, and the prediction of D modules improves as a result. The preprocessed datasets are then used for model development, and the performance of the model is analyzed. The Naive Bayes (NB) classifier (one of the best techniques, along with Random Forests, in the field of defect prediction [14], [17]) has been used as a test case to evaluate the proposed preprocessing. The significance of Recall for performance analysis is highlighted through a questionnaire distributed in the software industry. The results show that the proposed approach improves the Recall of the NB classifier by up to 40%. The stability of the approach has been tested by experimenting with different numbers of bins.
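The evaluation end of this pipeline can be sketched as a categorical Naive Bayes classifier over binned attributes. The binning, the three-bin assumption, and the tiny dataset below are invented for illustration; the paper's focused-itemset relabeling step (Algorithm 1) is not reproduced here.

```python
import math
from collections import Counter

# Hedged sketch: Laplace-smoothed Naive Bayes over attributes that have
# already been discretized into bins "low"/"mid"/"high".

def train_nb(X, y, alpha=1.0):
    """Fit class priors and per-attribute bin frequencies."""
    classes = sorted(set(y))
    model = {"priors": {c: y.count(c) / len(y) for c in classes},
             "counts": {c: [Counter() for _ in X[0]] for c in classes},
             "totals": {c: 0 for c in classes},
             "alpha": alpha}
    for row, c in zip(X, y):
        model["totals"][c] += 1
        for j, v in enumerate(row):
            model["counts"][c][j][v] += 1
    return model

def predict(model, row):
    """Pick the class with the highest log-posterior."""
    best, best_lp = None, -math.inf
    for c, prior in model["priors"].items():
        lp = math.log(prior)
        for j, v in enumerate(row):
            num = model["counts"][c][j][v] + model["alpha"]
            den = model["totals"][c] + model["alpha"] * 3  # 3 bins assumed
            lp += math.log(num / den)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

X = [["low", "low"], ["low", "mid"], ["high", "high"],
     ["mid", "low"], ["high", "mid"], ["high", "high"]]
y = [0, 0, 1, 0, 1, 1]   # 1 = Defect-prone (D)
model = train_nb(X, y)
print(predict(model, ["high", "high"]))  # 1: predicted D
```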
When the literature reports a lack of detailed information content, it becomes a non-trivial task to use the available information effectively. The preprocessing suggested in this paper is one such attempt: it provides a method to use the available data in a new and insightful way. The average Recall reported for the datasets used has been below 75% [15], [21], [28], whereas Recall values with the proposed approach vary from 78.6% to 85% with different numbers of bins. It is pertinent to mention that, unlike other studies [29], no additional information has been used beyond the publicly available static code attributes.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 presents our research methodology and our approach to finding focused itemsets, Section 4 presents the results of the experiments, Section 5 discusses the results, and Section 6 concludes the paper and presents future directions.
Related work
Numerous techniques for defect prediction have been reported in the literature, and comparative studies to evaluate their performance have also been conducted [9], [10], [13], [14], [15], [16], [17], [29]. These techniques include models like neural networks, decision trees, Naive Bayes, case-based reasoning, fuzzy inference systems, regression trees, and association rule mining based models, as well as ensemble models like RIPPER, WHICH, DTNB, and FURIA. Most of these techniques are based on data mining
Proposed approach to improve Recall
This study uses public datasets to improve the Recall of the Naive Bayes (NB) classifier for defect prediction. As shown in Fig. 1, the selected datasets are preprocessed in the first step, the model is developed and evaluated in the second step, and then the stability of the model is evaluated using different numbers of bins. In the first step, association mining is applied to obtain the focused itemsets [57]. Focused itemsets are the intervals of attributes that co-occur with defect prone modules more than they
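A hedged sketch of what such a focused itemset might look like: after equal-width binning, keep the attribute intervals (1-itemsets) that co-occur with D modules more often than with ND modules. The bin count, the data, and the simple selection rule are illustrative assumptions, not the paper's exact algorithm.

```python
from collections import Counter

def equal_width_bins(values, n_bins):
    """Discretize a numeric attribute into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def focused_bins(values, labels, n_bins=3):
    """Bins (1-itemsets) that co-occur with D more often than with ND."""
    bins = equal_width_bins(values, n_bins)
    d = Counter(b for b, l in zip(bins, labels) if l == 1)
    nd = Counter(b for b, l in zip(bins, labels) if l == 0)
    return sorted(b for b in set(bins) if d[b] > nd[b])

loc = [12, 15, 90, 85, 40, 95, 10, 88]   # e.g. lines of code per module
lab = [0,  0,  1,  1,  0,  1,  0,  1]    # 1 = Defect-prone
print(focused_bins(loc, lab))  # [2]: the highest-LOC interval is focused
```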
Results
The results of applying the proposed approach are presented in three steps: first the experimental setup is described, then the application of Algorithm 1 is presented, and afterwards the data collected from the software industry is provided.
Analysis and discussion
In this section, an analysis of the results of applying the algorithm is presented first, followed by an analysis of the feedback from the software industry and of the usefulness of the results for the industry.
Conclusion and future work
Software metrics have been investigated over the years for software defect prediction. We have studied the association relationship between software product metrics and software defects to improve the performance of the Naive Bayes classifier. We have proposed a preprocessing approach that discretizes data to study associations between software metrics and defects. We partitioned the data class-wise (into Pt and Pf) and generated frequent itemsets in each partition. We identified the 1-itemsets strongly associated
Acknowledgment
The authors would like to thank Lahore University of Management Sciences (LUMS) and Higher Education Commission (HEC) of Pakistan for supporting the research. The authors would also like to thank the companies and their personnel who participated in the survey. In addition, Dr. Adnan Abid (UMT, Lahore Pakistan) is thanked for providing valuable insight to improve the paper.
References (63)
- et al., An empirical study based on semi-supervised hybrid self-organizing map for software fault prediction, Knowledge-Based Syst. (2015)
- et al., Software defect prediction in imbalanced data sets using unbiased support vector machine
- et al., Mining customer value: from association rules to direct marketing, Data Min. Knowl. Discov. (2005)
- et al., A bias correction function for classification performance assessment in two-class imbalanced problems, Knowledge-Based Syst. (2014)
- Software Engineering: A Practitioner's Approach (2010)
- et al., The limited impact of individual developer data on software defect prediction, Empir. Softw. Eng. (2011)
- et al., Software defect prediction using Bayesian networks, Empir. Softw. Eng. (2014)
- et al., A Bayesian Belief Network for assessing the likelihood of fault content, Proceedings of the 14th International Symposium on Software Reliability Engineering, ISSRE (2003)
- I.H. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, S.J. Cunningham, The Waikato Environment for Knowledge Analysis...
- et al., Early software fault prediction using real time defect data, Proceedings of the 2009 Second International Conference on Machine Vision, ICMV'09 (2009)