Classification with reject option for software defect prediction

doi:10.1016/j.asoc.2016.06.023

Applied Soft Computing

Volume 49, December 2016, Pages 1085-1093

https://doi.org/10.1016/j.asoc.2016.06.023

Highlights

• We propose the use of classification with reject option for software defect prediction (SDP) as a way to incorporate additional knowledge into the SDP process.
• We propose two variants of the extreme learning machine (ELM) with reject option.
• An ELM with reject option for imbalanced datasets is proposed.
• The proposed method is tested on five real-world software datasets.
• An example illustrates how the rejected software modules can be further analyzed to improve the final SDP accuracy.

Abstract

Context

Software defect prediction (SDP) is an important task in software engineering. Along with estimating the number of defects remaining in software systems and discovering defect associations, classifying the defect-proneness of software modules plays an important role in software defect prediction. Several machine-learning methods have been applied to handle the defect-proneness of software modules as a classification problem. Such a binary “yes” or “no” decision is an important drawback in the decision-making process because, when the classifier is not precise, it may lead to misclassifications. To the best of our knowledge, existing approaches rely on fully automated module classification and do not provide a way to incorporate extra knowledge during the classification process. This knowledge can be helpful in avoiding misclassifications in cases where system modules cannot be classified reliably.

Objective

We seek to develop an SDP method that (i) incorporates a reject option in the classifier to improve the reliability of the decision-making process; and (ii) makes it possible to postpone the final decision on rejected modules, deferring it to an expert analysis or to another classifier that uses extra domain knowledge.

Method

We develop an SDP method called rejoELM and its variant, IrejoELM. Both methods are built upon the weighted extreme learning machine (ELM) with a reject option, which makes it possible to postpone the final decision on non-classified modules (the rejected ones) to a later moment. While rejoELM aims to maximize the accuracy for a given rejection rate, IrejoELM maximizes the F-measure. Hence, IrejoELM becomes an alternative for classification with reject option on imbalanced datasets.

Results

rejoELM and IrejoELM are tested on five datasets of source code metrics extracted from real-world open-source software projects. Results indicate that, for several rejection rates, rejoELM achieves an accuracy comparable to that of some state-of-the-art classifiers with reject option. Although IrejoELM shows lower accuracies for several rejection rates, it clearly outperforms all other methods when the F-measure is used as the performance metric.

Conclusion

It is concluded that rejoELM is a valid alternative for classification with reject option problems when classes are nearly equally represented. On the other hand, IrejoELM is shown to be the best alternative for classification with reject option on imbalanced datasets. Since SDP problems are usually characterized as imbalanced learning problems, the use of IrejoELM is recommended.

Introduction

Software defect prediction (SDP) remains an important research topic in the software engineering field after more than 30 years of research [1]. SDP approaches focus on [1]: (i) estimating the number of defects remaining in software systems; (ii) discovering defect associations; and (iii) classifying the defect-proneness of software modules into defect-prone and not defect-prone. Building a successful SDP system may provide means to allocate test resources more efficiently, thus reducing software development costs [2]. The interest in SDP has grown in recent years, as reported in [3], a recent review paper that covered 208 related works. Among the commonly used techniques, machine learning methods have achieved the most significant results.

Traditionally, machine learning for SDP is modeled as a classification problem. Several software modules are categorized as defect-prone or non-defect-prone and represented by a set of metrics [4]. An algorithm is trained with this dataset so that it can distinguish between the two categories given a vector of metrics from a given software module.

Although previous works can handle important issues, all approaches designed so far are based on fully automated methods. Given a vector of metrics from a module, the system automatically assigns it to the defect-prone or non-defect-prone class. The automated procedure of these algorithms does not provide a way to incorporate any human expert knowledge. This knowledge is especially useful when facing situations that are significantly different from the ones available in the training set. Human expert (e.g., software developer, maintainer, and tester) knowledge can also be useful in critical applications where a classification error (usually related to failing to detect a defective module) may have serious consequences.

In such situations, a possible solution is to incorporate a reject option in the classifier. In doing so, the classifier may either choose between the two classes or decline to classify (reject) the sample. The rejected sample may be further analyzed by a specialist, who will make the final decision. The decision to reject a sample is based on the degree of certainty that the classifier has. When both classes are almost equally probable, the classifier chooses to reject the sample.
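The rejection rule described above can be sketched as a simple threshold on an estimated class probability. This is only an illustration of the general idea, not the formulation used by rejoELM or IrejoELM: the function name classify_with_reject, the threshold parameter, and the use of -1 as the "rejected" label are all assumptions made for this example.

```python
import numpy as np

def classify_with_reject(p_defect, threshold=0.25):
    """Return 1 (defect-prone), 0 (not defect-prone), or -1 (rejected).

    p_defect  : estimated probability that each module is defect-prone.
    threshold : hypothetical half-width of the rejection band around 0.5.
    """
    p_defect = np.asarray(p_defect, dtype=float)
    # Ordinary two-class decision.
    labels = np.where(p_defect >= 0.5, 1, 0)
    # Reject when both classes are almost equally probable.
    uncertain = np.abs(p_defect - 0.5) < threshold
    labels[uncertain] = -1
    return labels

# Four modules: two confident predictions and two uncertain ones.
decisions = classify_with_reject([0.05, 0.45, 0.60, 0.95], threshold=0.25)
print(decisions)  # modules labeled -1 would be forwarded to a specialist
```

A wider band rejects more modules and leaves fewer, but more confident, automatic decisions; choosing the band is the trade-off between rejection rate and error rate.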

Classification with reject option is a paradigm that has been successfully applied in many areas, most extensively in medical applications. Examples can be seen for vertebral column diseases [5], tumor detection [6], and breast cancer diagnostics [7]. In these medical applications, the proposed classifiers with reject option aim to reduce the workload of the medical doctor instead of being an automatic diagnostic system. The workload is reduced since most of the cases can be classified correctly and only the most difficult ones are analyzed by the specialist.

Similarly to what is done in medical applications, classification with reject option can provide significant improvements in SDP problems. In complex software systems comprising a large number of modules, testing expenditures may represent a significant share of the total cost. By using classification with reject option, the classifier may detect most of the defect-prone modules. The rejected ones (those about which the classifier is not certain) may be sent to a reduced team of specialists (e.g., software maintainers and testers).

This paper proposes the application of classifiers with reject option to software defect prediction problems. Additionally, the authors propose two classifiers with reject option based on the extreme learning machine (ELM). ELM is a supervised learning method that has presented good results in many problems, as can be seen in [8], [9], [10]. Its main advantages are its fast training procedure and its simple formulation. The proposed methods (rejoELM and IrejoELM) are built upon the weighted ELM [11]. While rejoELM aims to maximize the accuracy for a given rejection rate, IrejoELM maximizes the F-measure. Hence, IrejoELM presents an alternative for classification with reject option on imbalanced datasets. The methods are tested on five datasets extracted from real-world software projects, and results are compared to several classification with reject option algorithms available in the literature.
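To make the "fast training procedure and simple formulation" concrete, the following is a minimal sketch of a weighted ELM with a rejection band on its output. It is not the rejoELM or IrejoELM algorithm, whose formulation is not given in this excerpt. The sketch assumes a random sigmoid hidden layer, output weights obtained from a regularized weighted least-squares problem, per-sample weights equal to the inverse class frequency, and a hypothetical band parameter that rejects samples whose output score is close to zero. The class name WeightedELM and all parameter names are illustrative.

```python
import numpy as np

class WeightedELM:
    """Illustrative weighted ELM with a reject band (not the paper's method)."""

    def __init__(self, n_hidden=50, C=1.0, seed=0):
        self.n_hidden = n_hidden
        self.C = C  # regularization constant (assumed hyper-parameter)
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activations of the fixed random hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W_in + self.b)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        t = np.where(y == 1, 1.0, -1.0)  # targets in {-1, +1}
        # Input weights and biases are drawn at random and never trained.
        self.W_in = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Assumed weighting: each sample weighs 1 / (size of its class).
        n_pos = max(int((y == 1).sum()), 1)
        n_neg = max(int((y != 1).sum()), 1)
        w = np.where(y == 1, 1.0 / n_pos, 1.0 / n_neg)
        HtW = H.T * w  # same as H^T @ diag(w)
        A = np.eye(self.n_hidden) / self.C + HtW @ H
        # Output weights: a single linear solve, no iterative training.
        self.beta = np.linalg.solve(A, HtW @ t)
        return self

    def decision_function(self, X):
        return self._hidden(np.asarray(X, dtype=float)) @ self.beta

    def predict_with_reject(self, X, band=0.2):
        s = self.decision_function(X)
        out = np.where(s >= 0, 1, 0)
        out[np.abs(s) < band] = -1  # -1 marks a rejected module
        return out

# Synthetic imbalanced data: 90 "clean" modules, 10 "defective" ones.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (90, 4)), rng.normal(2, 1, (10, 4))])
y = np.array([0] * 90 + [1] * 10)
model = WeightedELM(n_hidden=20, C=10.0).fit(X, y)
pred = model.predict_with_reject(X, band=0.2)
```

Because only the output weights are learned, training reduces to one linear system of size n_hidden, which is where the speed advantage mentioned in the text comes from. The class weighting keeps the minority (defect-prone) class from being ignored during that fit.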

The remainder of this paper is organized as follows. Section 2 presents a brief literature review of recent works on machine learning for SDP. Section 3 presents some important concepts related to software defect prediction, classification with reject option, and extreme learning machines. The proposed methods are presented in Section 4. Experiments and a discussion of the results are given in Sections 5 and 6, respectively. Threats to validity are discussed in Section 7, and conclusions are drawn in Section 8.

Section snippets

Related work

Different machine learning methods have been used to solve the software defect prediction problem. Neural networks [2], random forests [12], logistic regression [13] and support vector machines [14] are some of the methods available in the literature.

Apart from the applications of standard classification methods, several works have addressed important issues related to SDP. In [2] the authors point out that non-defect-prone modules occur more frequently than defect-prone ones. This fact may lead to an

Software defect prediction

In software engineering, a software defect (also known as a software bug) is an error, or fault, in a software system that manifests during its execution, leading it to behave erroneously or improperly (i.e., differently from what is expected). In the software life-cycle, most software defects arise from mistakes and errors made by software engineers in either the source code or its design. Software debugging is the process of locating and fixing defects [22]. However, according to the

Defect prediction metrics and dataset

As previously mentioned in Section 3.1, most SDP approaches are based on diverse information, such as source code metrics (e.g., lines of code and complexity) and process metrics (e.g., number of changes and recent activity). The source code metrics are related to the software source code itself and can be extracted fully automatically from it using a proper extraction tool (e.g., inFusion¹, STAN4J², and Metrics³

Experiments and results

The performance of rejoELM and IrejoELM was assessed for five SDP datasets comprising CK+OO metrics from real world open-source software projects. Two thirds of the data were used for training and the remaining third was used for testing. All hyper-parameters were chosen using grid search and 5-fold cross validation. Table 2 displays information regarding the size of each data set, the number of positive examples (faulty modules) and the imbalance ratio (IR). IR quantifies the imbalance degree
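As an illustration of how accuracy and F-measure can be reported together with a rejection rate on a held-out test set, the sketch below computes all three quantities from true labels and predictions. It assumes that rejected modules are marked with -1 and that accuracy and F-measure are computed only over the accepted (non-rejected) modules; this convention is common in the reject-option literature but is an assumption here, since the excerpt does not state how the paper computes its metrics. The function name reject_option_scores is hypothetical.

```python
import numpy as np

def reject_option_scores(y_true, y_pred, reject_label=-1):
    """Return (rejection_rate, accuracy, f_measure) over accepted samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accepted = y_pred != reject_label
    rejection_rate = float(1.0 - accepted.mean())
    yt, yp = y_true[accepted], y_pred[accepted]
    if yt.size == 0:
        # Everything was rejected: the other metrics are undefined.
        return rejection_rate, float("nan"), float("nan")
    accuracy = float((yt == yp).mean())
    # The positive class (1) is the faulty module.
    tp = int(((yp == 1) & (yt == 1)).sum())
    fp = int(((yp == 1) & (yt == 0)).sum())
    fn = int(((yp == 0) & (yt == 1)).sum())
    denom = 2 * tp + fp + fn
    f_measure = 2 * tp / denom if denom else 0.0
    return rejection_rate, accuracy, f_measure

# Toy test set of eight modules, two of which are rejected.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, -1, 0, 0, -1, 1, 0]
rate, acc, f1 = reject_option_scores(y_true, y_pred)
print(rate, acc, f1)
```

On imbalanced data the two metrics can disagree sharply: a classifier that labels every accepted module as non-faulty can reach high accuracy while its F-measure on the faulty class is zero, which is why the F-measure is the more informative criterion in that setting.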

Discussion

On the basis of our experiments, we can state that the performance of rejoELM is quite similar to that of rejoRBF. Thus, computational complexity plays an important role in selecting the most appropriate method. In this regard, ELM is known to be less complex than RBF. Hence, rejoELM can be considered a valid alternative for classification with reject option.

Even though rejoELM results are comparable to other state-of-the-art classification with reject option methods, it

Construct validity

In this work, the proposed methods were tested on five different datasets. These datasets were extracted following a well-defined process. Any mistake made during the construction of the datasets can be seen as a possible threat to the validity of our work. However, in the specialized literature on software defect prediction, several authors have used the same datasets to validate their works. This fact increases our confidence in the consistency and reliability of the datasets.

Internal validity

Conclusions

In this paper, we propose an ELM classifier with reject option and apply it to software defect prediction. The proposed method, named rejoELM, was tested on five real-world datasets and outperformed other commonly used machine learning methods with reject option. For all datasets, source code metrics were used in the experiments.

The use of the reject option paradigm aims to change the standard fully automated classification procedure used in previous works to a semi-automated defect

References (34)

O.F. Arar et al., Software defect prediction using cost-sensitive neural network, Appl. Soft Comput. (2015)
R. Ahila et al., An integrated PSO for parameter determination and feature selection of ELM and its application in classification of power system disturbances, Appl. Soft Comput. (2015)
Y. Kaya et al., A hybrid decision support system based on rough set and extreme learning machine for diagnosis of hepatitis disease, Appl. Soft Comput. (2013)
M. Han et al., Endpoint prediction model for basic oxygen furnace steel-making based on membrane algorithm evolving extreme learning machine, Appl. Soft Comput. (2014)
W. Zong et al., Weighted extreme learning machine for imbalance learning, Neurocomputing (2013)
K.O. Elish et al., Predicting defect-prone software modules using support vector machines, J. Syst. Softw. (2008)
H.-N. Qu et al., An asymmetric classifier based on partial least squares, Pattern Recognit. (2010)
Y. Ma et al., Transfer learning for cross-company software defect prediction, Inf. Softw. Technol. (2012)
D. Radjenović et al., Software fault prediction metrics: a systematic literature review, Inf. Softw. Technol. (2013)
H. Ishibuchi et al., Neural networks for soft decision making, Fuzzy Sets Syst. (2000)
A. Bounsiar et al., General solution and learning method for binary classification with performance constraints, Pattern Recognit. Lett. (2008)
G.B. Huang et al., Extreme learning machine: theory and applications, Neurocomputing (2006)
Q. Song et al., A general software defect-proneness prediction framework, IEEE Trans. Softw. Eng. (2011)
T. Hall et al., A systematic literature review on fault prediction performance in software engineering, IEEE Trans. Softw. Eng. (2012)
S. Lessmann et al., Benchmarking classification models for software defect prediction: a proposed framework and novel findings, IEEE Trans. Softw. Eng. (2008)
A.R. da Rocha Neto et al., Diagnostic of pathology on the vertebral column with embedded reject option
F. Condessa et al., Classification with reject option using contextual information
