Interpretability application of the Just-in-Time software defect prediction model

https://doi.org/10.1016/j.jss.2022.111245

Abstract

Software defect prediction is one of the most active research areas in software engineering. Recently, researchers have proposed Just-in-Time defect prediction, which has become a hot topic in defect prediction because of its directness and fine granularity: it predicts whether each code change submitted by a developer introduces a defect. The method also has the advantages of speed and easy traceability. However, its biggest challenge is that prediction accuracy is affected by class imbalance in the data set. Defects in software engineering tend to be concentrated in a small fraction of modules, so code changes that do not introduce defects account for a large proportion of the data. This imbalance between the minority and majority classes degrades the classification performance of the model. Furthermore, because most features are unrelated to defect-inducing code changes, it is difficult to achieve the desired results in practice even when the model is highly predictive. In addition, the data set contains many irrelevant and redundant features, which increase the complexity of the prediction model and reduce prediction efficiency. To improve the efficiency of Just-in-Time defect prediction, we trained a Just-in-Time defect prediction model based on random forest classification using six open source projects from different domains, and used the LIME interpretability technique to explain the model. By applying interpretable methods to extract meaningful, relevant features, the model retains 96% of its original predictive capability while requiring only 45% of the original effort. Therefore, applying interpretability techniques can significantly reduce developers' workload and improve work efficiency.

Introduction

Software defect prediction technology has been one of the most dynamic areas of software engineering since the 1970s. Software has become an essential factor affecting the national economy, the military, politics, and even social life, and large, complex systems depend heavily on the reliability of the software they employ. Software defects are a potential source of system errors, failures, and even crashes (Wong et al., 2017). Therefore, defect repair is a critical activity in software maintenance, but it also consumes considerable time and resources (Marks et al., 2011).

Defects have an essential impact on software quality and even on software economics. For example, the National Institute of Standards and Technology (NIST) estimates that software defects cost the United States as much as $60 billion a year, and that identifying and fixing these flaws earlier could save the United States $22 billion (Newman, 2002).

Statistics show that fixing defects accounts for 50% to 75% of the total cost of software development (LaToza et al., 2006). This reflects both the complexity and variability of defect distributions and the deficiencies of existing defect prediction techniques in solving practical problems. Existing software defect prediction methods can be divided into static and dynamic defect prediction methods (Zubrow, 2009). Static defect prediction methods use defect-related metric data to predict the defect proneness, defect density, or defect count of program modules. Dynamic defect prediction methods predict the distribution of system defects over time based on when defects or failures occur, in order to discover how defects are distributed over the software life cycle or some of its stages. As software is developed, increases in workload and changes in staffing, requirements, and coding work introduce more defects, while reviews and testing reduce them; in general, assuming that process factors such as technical capability remain stable, the number of defects is roughly proportional to the size of the software. Traditional defect prediction techniques can no longer discover software defects in a timely manner and suffer from obvious inefficiency (Eyolfson et al., 2011).

To address these challenges, researchers in software engineering proposed Just-in-Time defect prediction technology. Just-in-Time defect prediction refers to predicting defects in every code change submitted by developers; the predicted software entity is a code change. Its immediacy lies in the fact that a code change can be analyzed for defects as soon as a developer submits it, predicting the likelihood that the change is defective. This technology can therefore effectively cope with the challenges faced by traditional defect prediction, mainly in the following three aspects:

(1) Fine-grained. Code change level prediction focuses more on fine-grained software entities than module or file level defect prediction. As a result, developers can spend less time and effort reviewing code changes predicted to be defective.

(2) Just-in-Time. Just-in-Time defect prediction technology can be used to predict defects when a code change is submitted. At this point, developers still have a deep memory of the changed code and do not need to spend time re-understanding their submitted code changes, which helps to fix defects in a more timely manner.

(3) Easy to trace. The developer’s information is saved in the code changes the developer submits. As a result, the project manager can more easily find the developer who introduced the defect, which facilitates timely analysis of the cause of the defect and helps complete defect allocation (Kamei et al., 2012).

Although machine learning models perform outstandingly in many fields, such as face recognition, image classification, and natural language processing, this performance depends heavily on highly nonlinear models and parameter-tuning techniques. There is no way to fathom what machine learning models learn from the data or how they make their final decisions. This “end-to-end” decision-making mode makes machine learning models exceptionally hard to explain. From a human perspective, the decision-making process of such a model is incomprehensible; that is, the model is unexplainable. The inexplicability of machine learning models carries many potential dangers, and it is difficult to establish trust between humans and machines. Since an unexplainable model cannot provide reliable supporting information, its actual deployment in many fields will be minimal. For example, an automatic medical diagnosis model that lacks interpretability may lead to wrong treatment plans and even seriously threaten patients' lives. Therefore, the lack of interpretability has become one of the main obstacles to developing and applying machine learning in real-world tasks.

Machine learning model interpretability has a wide range of potential applications, including model validation, model diagnosis, auxiliary analysis, and knowledge discovery. Interpretability means that we have enough understandable information to solve a problem. Specifically, in artificial intelligence, an explainable deep model can provide the decision basis for each prediction result. For example, Fig. 1 describes how a model used for assisted medical diagnosis proves its credibility to doctors: the model not only gives its prediction result (flu) but also provides the basis for that conclusion, namely sneezing and headache as supporting evidence and the absence of fatigue as a counter-example. Only by doing so can doctors have reason to believe that the diagnosis is justified and well-founded, and tragic misdiagnoses can be avoided.
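To make this idea concrete, the following is a minimal, self-contained sketch of such a local explanation using the LIME tabular explainer. The symptom features, labels, and decision rule are synthetic inventions that merely mirror the Fig. 1 illustration; the lime and scikit-learn packages are assumed.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from lime.lime_tabular import LimeTabularExplainer

    rng = np.random.default_rng(0)
    feature_names = ["sneeze", "headache", "fatigue"]      # hypothetical symptom indicators
    X = rng.integers(0, 2, size=(500, 3)).astype(float)
    # Synthetic rule: "flu" when sneeze and headache are present and fatigue is absent.
    y = ((X[:, 0] == 1) & (X[:, 1] == 1) & (X[:, 2] == 0)).astype(int)

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                     class_names=["healthy", "flu"],
                                     discretize_continuous=False)
    patient = np.array([1.0, 1.0, 0.0])                    # sneeze, headache, no fatigue
    explanation = explainer.explain_instance(patient, clf.predict_proba, num_features=3)
    print(explanation.as_list())  # per-feature weights for and against the "flu" prediction

The printed list pairs each symptom with a signed weight, which is exactly the kind of for-and-against evidence a doctor would need in order to judge the prediction.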

We illustrate our research in the form of three research questions:

  • RQ1: How efficient is our prediction model? In previous studies, Mockus and Weiss evaluated their prediction model on only a single large-scale telecommunications system project (Lessmann et al., 2008), which may lead to unreliable results. To better evaluate our prediction model and strengthen the experimental evidence, we used data sets from six open source projects published by Lessmann et al. (2008). Furthermore, to better identify defects caused by code changes, a new defect prediction model was established based on Kamei's previous work. In the new prediction model, the prediction accuracy for software defects caused by code changes is 68%, and the recall rate is 64%.

  • RQ2: Which features do interpretability techniques identify as playing a significant role in the prediction? Up to now, Just-in-Time defect prediction has only predicted the likelihood that a change is defective. What types of defects are being predicted? Where are the defects located? At present, there is no research on these questions. The defect type describes the cause and characteristics of a defect, and the defect location refers to the module, file, function, or even line of code where the defect lies; having this information has great potential to help developers fix defects quickly. Although many researchers have proposed defect classification and localization techniques, there is no related research on predicting defect type and location at change time. In this experiment, interpretable models identified the number of modified files (NF), the relative churn measures (LA/LF and LT/NF), and the time interval between the last change and the current change as influential features, and found that whether a change repairs a defect (PD) was the most crucial feature (a sketch of this explanation-driven screening follows the list below).

  • RQ3: After removing unimportant features, what is the performance of the defect model?

At present, Just-in-Time software defect prediction is still inefficient for large software projects because of the heavy workload it entails. We hope that the most influential features can be screened out through the explanatory model so that the model retains high predictive power with as little effort as possible. Our results show that we can achieve 96% of the prediction model's original capacity at 45% of the original effort.
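As a rough illustration of this screening idea, the sketch below ranks features by their aggregated LIME weights and retrains a reduced random forest on the top-ranked subset. The data are synthetic stand-ins for change-level metrics, the Kamei-style feature names are only labels, and the cut-off of four features is arbitrary; this is a sketch of the procedure, not the paper's exact pipeline.

    import numpy as np
    from collections import defaultdict
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split
    from lime.lime_tabular import LimeTabularExplainer

    # Synthetic, imbalanced stand-in for change-level data; names loosely follow Kamei-style metrics.
    feature_names = ["NF", "LA", "LD", "LT", "AGE", "NDEV", "EXP", "FIX"]
    X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                               weights=[0.8, 0.2], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)

    full = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    # Rank features by average absolute LIME weight over a sample of training instances.
    explainer = LimeTabularExplainer(X_tr, feature_names=feature_names,
                                     discretize_continuous=False)
    totals = defaultdict(float)
    for i in np.random.default_rng(0).choice(len(X_tr), size=50, replace=False):
        exp = explainer.explain_instance(X_tr[i], full.predict_proba,
                                         num_features=len(feature_names),
                                         num_samples=1000)
        for name, weight in exp.as_list():
            totals[name] += abs(weight)   # labels are bare feature names when not discretizing

    ranked = sorted(feature_names, key=lambda f: totals[f], reverse=True)
    keep = ranked[:4]                      # arbitrary cut-off, for illustration only
    cols = [feature_names.index(f) for f in keep]

    # Retrain on the reduced feature set and compare recall with the full model.
    small = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr[:, cols], y_tr)
    recall_full = recall_score(y_te, full.predict(X_te))
    recall_small = recall_score(y_te, small.predict(X_te[:, cols]))
    print("kept features:", keep)
    print("reduced/full recall ratio: %.2f" % (recall_small / max(recall_full, 1e-9)))

The ratio printed at the end corresponds to the kind of "retained capacity versus reduced effort" trade-off reported in the paper, although the numbers here come from synthetic data.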

Section snippets

Just-in-Time Defect prediction technology

During software development and maintenance, code modification is required to remove inherent software defects, improve existing functions, refactor existing code, or improve operating performance. However, some code changes may accidentally introduce new defects while completing the modification task (Wong et al., 2010). Therefore, developers want a defect prediction model that can quickly and accurately determine whether a committed code change is a buggy code change (that is, a code change
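For readers unfamiliar with change-level metrics, the following is a small illustrative sketch of how a few Kamei-style metrics for a single commit, the number of modified files (NF), lines added (LA), and lines deleted (LD), could be computed with plain git. The metric names follow the literature; the extraction script itself is our own illustration rather than the authors' tooling.

    import subprocess

    def change_metrics(repo_path: str, commit: str) -> dict:
        """Return number of modified files (NF), lines added (LA) and lines deleted (LD)."""
        out = subprocess.run(
            ["git", "-C", repo_path, "show", "--numstat", "--format=", commit],
            capture_output=True, text=True, check=True).stdout
        nf = la = ld = 0
        for line in out.splitlines():
            parts = line.split("\t")
            if len(parts) != 3:
                continue
            added, deleted, _path = parts
            nf += 1
            la += int(added) if added.isdigit() else 0    # binary files report "-"
            ld += int(deleted) if deleted.isdigit() else 0
        return {"NF": nf, "LA": la, "LD": ld}

    # Example: change_metrics("/path/to/repo", "HEAD")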

Case study design

In this part, we mainly describe the preliminary preparation for answering the three research questions. We introduce the data sets used in the experiment and their pre-processing.

RQ1: How efficient is our prediction model?

Overview: To answer RQ1, we use the feature criteria selected in the table to build a software change risk prediction model based on RandomForest. In order to accurately evaluate the performance of the prediction model, we used an open-source project data set for validation.

Validation technique and data used: Before the experiment, we used the 10-Fold cross-validation method for the preliminary processing of the data set (Efron, 1983). Firstly, the data set is randomly selected, and then the
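For reference, the snippet below shows a minimal version of this setup: a random forest classifier with the default 100 trees evaluated by stratified 10-fold cross-validation on an imbalanced synthetic data set standing in for the projects' change-level data. scikit-learn is assumed, and the numbers it prints are not the paper's results.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_validate

    # Synthetic, imbalanced stand-in for the change-level data of the six projects.
    X, y = make_classification(n_samples=2000, n_features=14, weights=[0.8, 0.2],
                               random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)   # default tree count
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # 10-fold cross-validation
    scores = cross_validate(clf, X, y, cv=cv, scoring=["precision", "recall"])

    print("precision: %.2f  recall: %.2f"
          % (scores["test_precision"].mean(), scores["test_recall"].mean()))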

Limitations and threats to validity

Construct validity. A large body of previous work shows that the parameters of the random forest classification technique affect the performance of defect models (Mitchell, 2011, Mende, 2010, Mende and Koschke, 2009, Tantithamthavorn et al., 2016, Tantithamthavorn et al., 2018). Although the number of trees we used for the random forest prediction model is the default value of 100, recent studies show that the parameters of the random forest model do not affect our research results (

Conclusion

In this paper, an interpretability model is used to explain and optimize the defect model. We validated our experiment with an in-depth study of six open source projects. Our experimental results show that the random forest model in RQ1 can predict software defects well, with an accuracy of 71.52% and a recall rate of 68.88%. In RQ2, we innovatively used the LIME model to explain the software defect prediction model and its results. The existing Just-in-Time Defect Prediction technique only predicts

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research was supported by the 2021 Key R&D Program in Shaanxi Province, China (2021GY-041) and the National Natural Science Foundation of China special project “Capability-based construction method and execution mechanisms for ubiquitous operating systems” (62141208).

References (56)

  • Eyolfson, J., Tan, L., Lam, P., 2011. Do time of day and developer experience affect commit bugginess? In: Proceedings...
  • Fu, W., Menzies, T., 2017. Revisiting unsupervised learning for defect prediction. In: Proceedings of the 2017 11th...
  • Fukushima, T., Kamei, Y., McIntosh, S., Yamashita, K., Ubayashi, N., 2014. An empirical study of just-in-time defect...
  • Hassan, A.E. Predicting faults using the complexity of code changes.
  • Herzig, K., et al. It’s not a bug, it’s a feature: how misclassification impacts bug prediction.
  • Kamei, Y., et al., 2016. Studying just-in-time defect prediction using cross-project models. Empir. Softw. Eng.
  • Kamei, Y., et al. The effects of over and under sampling on fault-prone module detection.
  • Kamei, Y., et al. Defect prediction: Accomplishments and future challenges.
  • Kamei, Y., et al., 2012. A large-scale empirical study of just-in-time quality assurance. IEEE Trans. Softw. Eng.
  • Keung, J., et al., 2013. Finding conclusion stability for selecting the best effort predictor in software effort estimation. Autom. Softw. Eng.
  • Khuat, T.T., et al., 2020. Evaluation of sampling-based ensembles of classifiers on imbalanced data for software defect prediction problems. SN Comput. Sci.
  • Kim, S., et al., 2008. Classifying software changes: Clean or buggy? IEEE Trans. Softw. Eng.
  • Kim, S., et al. Dealing with noise in defect prediction.
  • Kim, S., et al. Automatic identification of bug-introducing changes.
  • LaToza, T.D., Venolia, G., DeLine, R., 2006. Maintaining mental models: a study of developer work habits. In:...
  • Lessmann, S., et al., 2008. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans. Softw. Eng.
  • Li, Y., et al., 2019. Using tri-relation networks for effective software fault-proneness prediction. IEEE Access.
  • Liu, W., et al., 2018. Predicting the severity of bug reports based on feature selection. Int. J. Softw. Eng. Knowl. Eng.