Knowledge-Based Systems

Volume 90, December 2015, Pages 1-13
Improving Recall of software defect prediction models using association mining

https://doi.org/10.1016/j.knosys.2015.10.009

Abstract

The use of software product metrics in defect prediction studies highlights the utility of these metrics. The public availability of software defect data based on product metrics has led to the development of defect prediction models. These models are limited in their ability to learn Defect-prone (D) modules because the available datasets are imbalanced: most datasets are dominated by Not Defect-prone (ND) modules, which reduces the ability of classification models to learn the D modules accurately. This paper presents an association mining based approach that allows defect prediction models to learn D modules in imbalanced datasets. The proposed algorithm preprocesses data by setting specific metric values as missing and improves the prediction of D modules. The algorithm has been evaluated using 5 public datasets. A Naive Bayes (NB) classifier has been developed before and after the proposed preprocessing, and the Recall of the classifier improves after the preprocessing. The stability of the approach has been tested by running the algorithm with different numbers of bins. The results show a performance gain of up to 40%.

Introduction

Early identification of defect-prone modules helps improve software process control and reduce defect correction effort, which in turn lowers cost and increases software reliability [1], [2], [3], [4]. Managing resources during testing is considered a non-trivial task [5], and the identification of defect-prone modules helps in planning resources during testing [5], [6], [7]. This identification is done using defect prediction techniques, which also help in controlling software projects and developing resource and test plans [6], [8].

Software defect prediction techniques either classify a software module as Defect-prone (D) or Not Defect-prone (ND), or predict the number of defects in a software system. Both types of models take software metrics as input. A significant number of software defect prediction models are based on product metrics [9], [10]. The use of product metrics for defect prediction has been criticized for being unable to establish a causal relationship between the metrics and software defects [2], [11]. Despite this critique, metrics collected from static analysis have been made publicly available [12] and have encouraged the development of numerous prediction models. These metrics are available in the form of datasets that are considered benchmarks in the domain of software quality. The availability of alternative benchmarks is not well known.

Although the available datasets have been useful for developing models with good Recall (the proportion of correctly predicted D modules among all D modules) and Area Under the Curve (AUC, which can be interpreted as the probability that a model gives a higher score to a randomly chosen D module than to a randomly chosen ND module) [13], [14], [15], [16], [17], these datasets have two limitations: (1) they are imbalanced, and (2) the static code attributes available in the datasets have limited information content [13]. Most of the datasets have a significantly larger number of ND modules than D modules. The smaller number of D modules (as training examples) reduces the ability of classification models to learn the D modules accurately. Class imbalance is considered a problem in the domain of software defect prediction [18], [19], [20] as well as in other domains [21], [22], [23], [24]. The limited information content suggests that simple learners (like Naive Bayes) can perform as well as more complex learners (such as J48) [13].
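The two measures defined above can be made concrete with a minimal sketch. It assumes class labels are the strings "D" and "ND" and that a model outputs a numeric score per module; the function names `recall` and `auc` are illustrative and do not come from the paper. AUC is computed directly from its pairwise interpretation, with ties counted as one half.

```python
from itertools import product


def recall(y_true, y_pred, positive="D"):
    """Proportion of actual D modules that were predicted as D."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def auc(y_true, scores, positive="D"):
    """Probability that a random D module scores higher than a random ND module.

    Ties between a D score and an ND score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): two D modules, three ND modules.
y_true = ["D", "ND", "D", "ND", "ND"]
y_pred = ["D", "ND", "ND", "ND", "D"]
scores = [0.9, 0.2, 0.4, 0.4, 0.7]

print(recall(y_true, y_pred))  # one of the two D modules is found: 0.5
print(auc(y_true, scores))     # 4.5 of 6 D/ND pairs are ranked correctly: 0.75
```

The pairwise AUC is quadratic in the number of modules, which is acceptable for a small illustration; rank-based formulations are normally used on larger datasets.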

Classifying a D module correctly is very important in software defect prediction. Software development organizations cannot afford to ship defective modules to customers; therefore, they strive to detect as many defective modules as possible before the software is released. The significance of correctly identifying defect-prone modules can be seen through the Pareto principle as applied in software engineering, which says that 80% of the defects are located in 20% of the code [25]. The principle also gives an insight into the high cost of misclassifying D modules. In the domain of software defect prediction, mostly standard machine learning algorithms are used, which do not directly address the issue of class imbalance [20]. Standard learning algorithms are biased towards the dominant class and may not perform at their best on imbalanced datasets [21]. They also tend to discard the scarce class by treating it as noise [21]. Therefore, researchers oversample the scarce class or undersample the dominant class before applying the learning algorithm [21]. This is done to obtain a balanced class distribution, so that standard algorithms designed to learn from a balanced training set work in the same manner for both classes.
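As a rough illustration of the sampling strategies mentioned above, the following sketch balances a dataset by random oversampling of the scarce class or random undersampling of the dominant class. It is a generic baseline, not the approach proposed in this paper; the function names, the "D"/"ND" labels, and the fixed seed are assumptions made for the example.

```python
import random


def _split_indices(labels, minority):
    """Return (minority indices, majority indices) for a label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_oversample(rows, labels, minority="D", seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    if not minority_idx:
        raise ValueError("no minority examples to oversample")
    deficit = max(len(majority_idx) - len(minority_idx), 0)
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, minority="D", seed=0):
    """Drop randomly chosen majority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    n_keep = min(len(minority_idx), len(majority_idx))
    kept_majority = rng.sample(majority_idx, n_keep)
    keep = sorted(minority_idx + kept_majority)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data (made up): four ND modules and one D module, one metric each.
rows = [[1], [2], [3], [4], [5]]
labels = ["ND", "ND", "ND", "ND", "D"]

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(over_labels.count("D"), over_labels.count("ND"))    # 4 4
print(under_labels.count("D"), under_labels.count("ND"))  # 1 1
```

Oversampling keeps all original rows but repeats D examples, whereas undersampling discards ND examples and therefore some information.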

The challenge of class imbalance is also faced in the domain of direct marketing, where profit needs to be maximized even though the number of customers who are not likely to respond is significantly larger than the number of customers who are likely to respond [22]. Similarly, charity organizations need to contact potential donors from a list of people who have pledged to donate, and only a small proportion of pledges is fulfilled [22]. In this domain the problem of imbalance has been addressed through association mining (AM). Association Rule Mining (ARM) is an important data mining technique employed for discovering interesting relationships between variables in large databases [26]. ARM is used to find interesting correlations, frequent patterns, associations or causal structures among items of large datasets [27]. Fuzzy logic has also been applied in other domains to address the issue of class imbalance [23].
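Association rule mining is usually described in terms of the support of an itemset and the confidence of a rule. The sketch below computes both standard measures over a toy transaction list; the item names (such as "loc=high") and the helper names are hypothetical and serve only to show how a metric interval could be related to the D label.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Defined as support(antecedent and consequent) / support(antecedent).
    """
    sup_a = support(transactions, antecedent)
    if sup_a == 0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / sup_a


# Each transaction is one module: a discretized metric value plus its class label.
transactions = [
    {"loc=high", "D"},
    {"loc=high", "D"},
    {"loc=high", "ND"},
    {"loc=low", "ND"},
]

print(support(transactions, {"loc=high"}))            # 3 of 4 modules: 0.75
print(support(transactions, {"loc=high", "D"}))       # 2 of 4 modules: 0.5
print(confidence(transactions, {"loc=high"}, {"D"}))  # 0.5 / 0.75, about 0.667
```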

Because of the limited information content, defect prediction models have reached a performance ceiling, which can be crossed if the information content is improved. The information content can be improved by collecting more insightful data or by accessing and combining relevant features available at the time of model development [13]. For most of the public datasets, the additional data required to improve the information content is not available, so using the existing data in an insightful and different way can help improve the quality of defect prediction.

This paper uses the information available in the public datasets in an effective manner, applies association mining (AM) to find associations between software metrics and software defects, and improves the performance of a classification model on imbalanced datasets. The datasets are preprocessed using the proposed approach, a defect prediction model is developed using the preprocessed data, and the performance of the model is analyzed in terms of Recall. The preprocessing step partitions the data and finds important itemsets. The important itemsets are relabeled in one partition of the data, and the prediction of D modules improves as a result. Afterwards, the preprocessed datasets are used for model development, and the performance of the model is analyzed. The Naive Bayes (NB) classifier (one of the best techniques, along with Random Forests, in the field of defect prediction [14], [17]) has been used as a test case to evaluate the proposed preprocessing. The significance of Recall for performance analysis is highlighted through a questionnaire distributed in the software industry. The results show that the proposed approach has improved the Recall of the NB classifier by up to 40%. The stability of the approach has been tested by experimenting with different numbers of bins.
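The preprocessing idea described above can be sketched as follows. This is not the paper's Algorithm 1, which is not reproduced in this excerpt. It is a minimal illustration under several assumptions: equal-width binning, a hypothetical `min_support` threshold, a selection rule that keeps a 1-itemset when its support in the D partition exceeds its support in the ND partition, and relabeling (setting to missing, here `None`) in the ND partition. The excerpt does not state which partition is relabeled, so that choice in particular is an assumption.

```python
from collections import Counter


def equal_width_bin(value, lo, hi, n_bins):
    """Map a numeric value to a bin index in [0, n_bins - 1] (assumed binning scheme)."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)


def discretize(rows, n_bins):
    """Discretize every metric column independently into equal-width bins."""
    bounds = [(min(col), max(col)) for col in zip(*rows)]
    return [[equal_width_bin(v, lo, hi, n_bins) for v, (lo, hi) in zip(row, bounds)]
            for row in rows]


def item_support(partition):
    """Support of each 1-itemset (attribute index, bin) within one partition."""
    counts = Counter((j, b) for row in partition for j, b in enumerate(row))
    return {item: c / len(partition) for item, c in counts.items()}


def focused_itemsets(binned, labels, min_support):
    """1-itemsets that are frequent in the D partition and rarer in the ND partition."""
    sup_d = item_support([r for r, y in zip(binned, labels) if y == "D"])
    sup_nd = item_support([r for r, y in zip(binned, labels) if y == "ND"])
    return {item for item, s in sup_d.items()
            if s >= min_support and s > sup_nd.get(item, 0.0)}


def preprocess(rows, labels, n_bins=2, min_support=0.5):
    """Set focused metric values to missing (None) in ND modules only (assumption)."""
    binned = discretize(rows, n_bins)
    focused = focused_itemsets(binned, labels, min_support)
    out = []
    for row, y in zip(binned, labels):
        if y == "ND":
            row = [None if (j, b) in focused else b for j, b in enumerate(row)]
        out.append(row)
    return out, focused


# Toy data (made up): two metrics per module, two D modules and four ND modules.
rows = [[10, 1], [9, 2], [1, 1], [2, 2], [8, 1], [1, 2]]
labels = ["D", "D", "ND", "ND", "ND", "ND"]

processed, focused = preprocess(rows, labels)
print(focused)       # {(0, 1)}: high values of metric 0 co-occur mostly with D
print(processed[4])  # [None, 0]: the ND module with a high metric 0 loses that value
```

Under these assumptions, a classifier that skips missing values when estimating likelihoods would see the focused interval only in D modules, which is one plausible reading of how the relabeling could raise Recall.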

When a lack of detailed information content is reported in the literature, using the available information effectively becomes a non-trivial task. The preprocessing suggested in this paper is one such attempt, giving a method to use the available data in a new and insightful way. The average Recall reported for the datasets used has been below 75% [15], [21], [28], whereas Recall values with the proposed approach vary from 78.6% to 85% with different numbers of bins. It is pertinent to mention that, unlike other studies [29], no additional information has been used with the publicly available static code attributes.

The rest of the paper is organized as follows: Section 2 discusses the related work, and Section 3 presents our research methodology and our approach to finding focused itemsets. Section 4 presents the results of the experiments, whereas Section 5 discusses the results. Section 6 concludes the paper and presents future directions.

Section snippets

Related work

Numerous techniques for defect prediction have been reported in literature and comparative studies to evaluate their performance have also been conducted [9], [10], [13], [14], [15], [16], [17], [29]. These techniques include models like neural networks, decision trees, Naive Bayes, case based reasoning, fuzzy inference systems, regression trees, association rule mining based models as well as ensemble models like RIPPER, WHICH, DTNB, and FURIA. Most of these techniques are based on data mining

Proposed approach to improve Recall

This study uses public datasets to improve the Recall of the Naive Bayes (NB) classifier for defect prediction. As shown in Fig. 1, the selected datasets are preprocessed in the first step, the model is developed and evaluated in the second step, and then the stability of the model is evaluated using different numbers of bins. In the first step, association mining is applied to get the focused itemsets [57]. Focused itemsets are the intervals of attributes that co-occur with defect prone modules more than they

Results

The results of applying the proposed approach are presented in three steps: first the experimental setup is described, then the application of Algorithm 1 is presented, and finally the data collected from the software industry is provided.

Analysis and discussion

This section first presents an analysis of the results of applying the algorithm, followed by an analysis of the feedback from the software industry and the usefulness of the results for the industry.

Conclusion and future work

Software metrics have been investigated over the years for software defect prediction. We have studied association relationship between software product metrics and software defects to improve performance of Naive Bayes classifier. We have proposed a preprocessing approach that discretizes data to study associations of software metrics and defects. We partitioned data class-wise (into Pt and Pf) and generated frequent itemsets in each partition. We identified the 1-Itemsets strongly associated

Acknowledgment

The authors would like to thank Lahore University of Management Sciences (LUMS) and Higher Education Commission (HEC) of Pakistan for supporting the research. The authors would also like to thank the companies and their personnel who participated in the survey. In addition, Dr. Adnan Abid (UMT, Lahore Pakistan) is thanked for providing valuable insight to improve the paper.

References (63)

  • N. Fenton et al.

    On the effectiveness of early life cycle defect prediction with Bayesian Nets

    Empir. Softw. Eng.

    (2008)
  • Z.A. Rana et al.

    An FIS for early detection of defect prone modules

  • P.S. Sandhu et al.

    A model for early prediction of faults in software systems

    Proceedings of the 2nd International Conference on Computer and Automation Engineering (ICCAE)

    (2010)
  • R. Sagarna et al.

    Dynamic search space transformations for software test data generation

    Comput. Intell.

    (2008)
  • G. Abaei et al.

    Increasing the accuracy of software fault prediction using majority ranking fuzzy clustering

  • Z.A. Rana et al.

    Towards a generic model for software quality prediction

    Proceedings of the 6th International Workshop on Software Quality, WoSQ’08

    (2008)
  • T.M. Khoshgoftaar et al.

    Fault prediction modeling for software quality estimation: comparing commonly used techniques

    Empir. Softw. Eng.

    (2003)
  • V.U.B. Challagulla et al.

    Empirical assessment of machine learning based software defect prediction techniques

    Proceedings of 10th Workshop on Object-Oriented Real-Time Dependable Systems (WORDS’05)

    (2005)
  • N.E. Fenton et al.

    A critique of software defect prediction models

    IEEE Trans. Softw. Eng.

    (1999)
  • T. Menzies et al.

    The PROMISE Repository of Empirical Software Engineering Data

    (2015)
  • T. Menzies et al.

    Implications of ceiling effects in defect predictors

    Proceedings of International Workshop on Predictor Models in Software Engineering, PROMISE’08

    (2008)
  • S. Lessmann et al.

    Benchmarking classification models for software defect prediction: a proposed framework and novel findings

    IEEE Trans. Softw. Eng.

    (2008)
  • T. Menzies et al.

    Data mining static code attributes to learn defect predictors

    IEEE Trans. Softw. Eng.

    (2007)
  • Y. Jiang et al.

    Techniques for evaluating fault prediction models

    Empir. Softw. Eng.

    (2008)
  • T. Menzies et al.

    Defect prediction from static code features: current results, limitations, new approaches

    Autom. Softw. Eng.

    (2010)
  • C. Seiffert et al.

    Building useful models from imbalanced data with sampling and boosting

    Proceedings of the Twenty-First International Florida Artificial Intelligence Research Society Conference, May 15–17, 2008, Coconut Grove, Florida, USA

    (2008)
  • C. Seiffert et al.

    An empirical study of the classification performance of learners on imbalanced and noisy software quality data

    Inf. Sci.

    (2014)
  • V. López et al.

    An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics

    Inf. Sci.

    (2013)
  • S. Alshomrani et al.

    A proposal for evolutionary fuzzy systems using feature weighting: dealing with overlapping in imbalanced datasets

    Knowledge-Based Syst.

    (2015)
  • Z. Sha et al.

    Mining association rules from dataset containing predetermined decision itemset and rare transactions

    Proceedings of the Seventh International Conference on Natural Computation (ICNC)

    (2011)
  • K. Sotiris et al.

    Association rules mining: a recent overview

    GESTS Int. Trans. Comput. Sci. Eng.

    (2006)