Interpretability application of the Just-in-Time software defect prediction model

https://doi.org/10.1016/j.jss.2022.111245

Abstract

Software defect prediction is one of the most active research areas in software engineering. Recently, researchers have proposed Just-in-Time defect prediction, which has become a hot topic in defect prediction because of its directness and fine granularity: it predicts whether each code change submitted by a developer introduces a defect. The method also has the advantages of speed and easy traceability. However, its biggest challenge is that prediction accuracy is affected by class imbalance in the data set. Defects in software engineering tend to be concentrated in a small fraction of modules, so code changes that do not introduce defects account for a large proportion of the data. This imbalance between the minority and majority classes degrades the classification performance of the model. Furthermore, because most features are unrelated to defect-inducing code changes, it is difficult to achieve the desired results in practice even when the model is highly predictive. In addition, the data set contains many irrelevant and redundant features, which increase the complexity of the prediction model and reduce prediction efficiency. To improve the efficiency of Just-in-Time defect prediction, we trained a Just-in-Time defect prediction model based on random forest classification using six open source projects from different domains, and used the LIME interpretability technique to explain the model. By applying interpretable methods to extract meaningful, relevant features, the model retains 96% of its original predictive capability while requiring only 45% of the original effort. Therefore, applying interpretability techniques can significantly reduce developers' workload and improve work efficiency.

Introduction

Software defect prediction technology has been one of the most dynamic areas of software engineering since the 1970s. Software has become an essential factor affecting the national economy, the military, politics, and even social life, and large, complex systems depend heavily on the reliability of the software they employ. Software defects are a potential source of system errors, failures, and even crashes (Wong et al., 2017). Therefore, defect repair is a critical activity in software maintenance, but it also consumes considerable time and resources (Marks et al., 2011).

Defects have an essential impact on software quality and even on software economics. For example, the National Institute of Standards and Technology (NIST) estimates that software defects cost the United States as much as $60 billion a year, and that identifying and fixing these flaws earlier could save the United States $22 billion (Newman, 2002).

Statistics show that fixing defects accounts for 50% to 75% of the total cost of software development (LaToza et al., 2006). This reflects both the complexity and variability of defect distributions and the deficiencies of existing defect prediction techniques in solving practical problems. Existing software defect prediction methods can be divided into static and dynamic defect prediction methods (Zubrow, 2009). Static defect prediction methods use defect-related metric data to predict the defect proneness, defect density, or defect count of program modules. Dynamic defect prediction methods predict the distribution of system defects over time based on when defects or failures occur, in order to discover how defects are distributed over the software life cycle or some of its stages. As software is developed, increases in workload and changes in staffing, requirements, and coding work introduce more defects, while reviews and testing reduce them; in general, assuming that process factors such as technical capability remain stable, the number of defects is roughly proportional to the size of the software. Traditional defect prediction techniques can no longer discover software defects in a timely manner and suffer from obvious inefficiency (Eyolfson et al., 2011).

To address these challenges, researchers in software engineering proposed Just-in-Time defect prediction technology. Just-in-Time defect prediction refers to predicting defects in every code change submitted by developers; the predicted software entity is a code change. Its immediacy lies in the fact that a code change can be analyzed for defects as soon as a developer submits it, predicting the likelihood that the change is defective. This technology can therefore effectively cope with the challenges faced by traditional defect prediction, mainly in the following three aspects:

(1) Fine-grained. Code change level prediction focuses more on fine-grained software entities than module or file level defect prediction. As a result, developers can spend less time and effort reviewing code changes predicted to be defective.

(2) Just-in-Time. Just-in-Time defect prediction technology can be used to predict defects when a code change is submitted. At this point, developers still have a deep memory of the changed code and do not need to spend time re-understanding their submitted code changes, which helps to fix defects in a more timely manner.

(3) Easy to trace. The developer’s information is saved in the code changes the developer submits. As a result, the project manager can more easily find the developer who introduced the defect, which facilitates timely analysis of the cause of the defect and helps complete defect allocation (Kamei et al., 2012).

Although machine learning models perform outstandingly in many fields, such as face recognition, image classification, and natural language processing, this performance depends heavily on highly nonlinear models and parameter-tuning techniques. There is no way to fathom what machine learning models learn from the data or how they make their final decisions. This “end-to-end” decision-making mode makes machine learning models exceptionally hard to explain. From a human perspective, the decision-making process of such a model is incomprehensible; that is, the model is unexplainable. The inexplicability of machine learning models carries many potential dangers, and it is difficult to establish trust between humans and machines. Since an unexplainable model cannot provide reliable supporting information, its actual deployment in many fields will be minimal. For example, an automatic medical diagnosis model that lacks interpretability may lead to wrong treatment plans and even seriously threaten patients' lives. Therefore, the lack of interpretability has become one of the main obstacles to developing and applying machine learning in real-world tasks.

Machine learning model interpretability has a wide range of potential applications, including model validation, model diagnosis, auxiliary analysis, and knowledge discovery. Interpretability means that we have enough understandable information to solve a problem. Specifically, in artificial intelligence, an explainable deep model can provide the decision basis for each prediction result. For example, Fig. 1 describes how a model used for assisted medical diagnosis proves its credibility to doctors: the model not only gives its prediction result (flu) but also provides the basis for that conclusion, namely sneezing and headache as supporting evidence and the absence of fatigue as a counter-example. Only by doing so can doctors have reason to believe that the diagnosis is justified and well-founded, and tragic misdiagnoses can be avoided.
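To make this idea concrete, the following is a minimal, self-contained sketch of such a local explanation using the LIME tabular explainer. The symptom features, labels, and decision rule are synthetic inventions that merely mirror the Fig. 1 illustration; the lime and scikit-learn packages are assumed.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from lime.lime_tabular import LimeTabularExplainer

    rng = np.random.default_rng(0)
    feature_names = ["sneeze", "headache", "fatigue"]      # hypothetical symptom indicators
    X = rng.integers(0, 2, size=(500, 3)).astype(float)
    # Synthetic rule: "flu" when sneeze and headache are present and fatigue is absent.
    y = ((X[:, 0] == 1) & (X[:, 1] == 1) & (X[:, 2] == 0)).astype(int)

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                     class_names=["healthy", "flu"],
                                     discretize_continuous=False)
    patient = np.array([1.0, 1.0, 0.0])                    # sneeze, headache, no fatigue
    explanation = explainer.explain_instance(patient, clf.predict_proba, num_features=3)
    print(explanation.as_list())  # per-feature weights for and against the "flu" prediction

The printed list pairs each symptom with a signed weight, which is exactly the kind of for-and-against evidence a doctor would need in order to judge the prediction.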

We illustrate our research in the form of three research questions:

  • RQ1: How efficient is our prediction model? In previous studies, Mockus and Weiss evaluated their prediction model on only a single large-scale telecommunications system project (Lessmann et al., 2008), which may lead to unreliable results. To better evaluate our prediction model and strengthen the experimental evidence, we used data sets from six open source projects published by Lessmann et al. (2008). Furthermore, to better identify defects caused by code changes, a new defect prediction model was established based on Kamei's previous work. In the new prediction model, the prediction accuracy for software defects caused by code changes is 68%, and the recall rate is 64%.

  • RQ2: Which features do interpretability techniques identify as playing a significant role in the prediction? Up to now, Just-in-Time defect prediction has only predicted the likelihood that a change is defective. What types of defects are being predicted? Where are the defects located? At present, there is no research on these questions. The defect type describes the cause and characteristics of a defect, and the defect location refers to the module, file, function, or even line of code where the defect lies; having this information has great potential to help developers fix defects quickly. Although many researchers have proposed defect classification and localization techniques, there is no related research on predicting defect type and location at change time. In this experiment, interpretable models identified the number of modified files (NF), the relative churn measures (LA/LF and LT/NF), and the time interval between the last change and the current change as influential features, and found that whether a change repairs a defect (PD) was the most crucial feature (a sketch of this explanation-driven screening follows the list below).

  • RQ3: After removing unimportant features, what is the performance of the defect model?

At present, Just-in-Time software defect prediction is still inefficient for large software projects because of the heavy workload it entails. We hope that the most influential features can be screened out through the explanatory model so that the model retains high predictive power with as little effort as possible. Our results show that we can achieve 96% of the prediction model's original capacity at 45% of the original effort.
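As a rough illustration of this screening idea, the sketch below ranks features by their aggregated LIME weights and retrains a reduced random forest on the top-ranked subset. The data are synthetic stand-ins for change-level metrics, the Kamei-style feature names are only labels, and the cut-off of four features is arbitrary; this is a sketch of the procedure, not the paper's exact pipeline.

    import numpy as np
    from collections import defaultdict
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split
    from lime.lime_tabular import LimeTabularExplainer

    # Synthetic, imbalanced stand-in for change-level data; names loosely follow Kamei-style metrics.
    feature_names = ["NF", "LA", "LD", "LT", "AGE", "NDEV", "EXP", "FIX"]
    X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                               weights=[0.8, 0.2], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)

    full = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    # Rank features by average absolute LIME weight over a sample of training instances.
    explainer = LimeTabularExplainer(X_tr, feature_names=feature_names,
                                     discretize_continuous=False)
    totals = defaultdict(float)
    for i in np.random.default_rng(0).choice(len(X_tr), size=50, replace=False):
        exp = explainer.explain_instance(X_tr[i], full.predict_proba,
                                         num_features=len(feature_names),
                                         num_samples=1000)
        for name, weight in exp.as_list():
            totals[name] += abs(weight)   # labels are bare feature names when not discretizing

    ranked = sorted(feature_names, key=lambda f: totals[f], reverse=True)
    keep = ranked[:4]                      # arbitrary cut-off, for illustration only
    cols = [feature_names.index(f) for f in keep]

    # Retrain on the reduced feature set and compare recall with the full model.
    small = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr[:, cols], y_tr)
    recall_full = recall_score(y_te, full.predict(X_te))
    recall_small = recall_score(y_te, small.predict(X_te[:, cols]))
    print("kept features:", keep)
    print("reduced/full recall ratio: %.2f" % (recall_small / max(recall_full, 1e-9)))

The ratio printed at the end corresponds to the kind of "retained capacity versus reduced effort" trade-off reported in the paper, although the numbers here come from synthetic data.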

Section snippets

Just-in-Time Defect prediction technology

During software development and maintenance, code modification is required to remove inherent software defects, improve existing functions, refactor existing code, or improve operating performance. However, some code changes may accidentally introduce new defects while completing the modification task (Wong et al., 2010). Therefore, developers want a defect prediction model that can quickly and accurately determine whether a committed code change is a buggy code change (that is, a code change
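For readers unfamiliar with change-level metrics, the following is a small illustrative sketch of how a few Kamei-style metrics for a single commit, the number of modified files (NF), lines added (LA), and lines deleted (LD), could be computed with plain git. The metric names follow the literature; the extraction script itself is our own illustration rather than the authors' tooling.

    import subprocess

    def change_metrics(repo_path: str, commit: str) -> dict:
        """Return number of modified files (NF), lines added (LA) and lines deleted (LD)."""
        out = subprocess.run(
            ["git", "-C", repo_path, "show", "--numstat", "--format=", commit],
            capture_output=True, text=True, check=True).stdout
        nf = la = ld = 0
        for line in out.splitlines():
            parts = line.split("\t")
            if len(parts) != 3:
                continue
            added, deleted, _path = parts
            nf += 1
            la += int(added) if added.isdigit() else 0    # binary files report "-"
            ld += int(deleted) if deleted.isdigit() else 0
        return {"NF": nf, "LA": la, "LD": ld}

    # Example: change_metrics("/path/to/repo", "HEAD")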

Case study design

In this part, we mainly describe the preliminary preparation for answering the three research questions. We introduce the data sets used in the experiment and their pre-processing.

RQ1: How efficient is our prediction model?

Overview: To answer RQ1, we use the feature criteria selected in the table to build a software change risk prediction model based on RandomForest. In order to accurately evaluate the performance of the prediction model, we used an open-source project data set for validation.

Validation technique and data used: Before the experiment, we used the 10-Fold cross-validation method for the preliminary processing of the data set (Efron, 1983). Firstly, the data set is randomly selected, and then the
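For reference, the snippet below shows a minimal version of this setup: a random forest classifier with the default 100 trees evaluated by stratified 10-fold cross-validation on an imbalanced synthetic data set standing in for the projects' change-level data. scikit-learn is assumed, and the numbers it prints are not the paper's results.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_validate

    # Synthetic, imbalanced stand-in for the change-level data of the six projects.
    X, y = make_classification(n_samples=2000, n_features=14, weights=[0.8, 0.2],
                               random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)   # default tree count
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # 10-fold cross-validation
    scores = cross_validate(clf, X, y, cv=cv, scoring=["precision", "recall"])

    print("precision: %.2f  recall: %.2f"
          % (scores["test_precision"].mean(), scores["test_recall"].mean()))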

Limitations and threats to validity

Construct validity. A large body of previous work shows that the parameters of the random forest classification technique affect the performance of defect models (Mitchell, 2011, Mende, 2010, Mende and Koschke, 2009, Tantithamthavorn et al., 2016, Tantithamthavorn et al., 2018). Although the number of trees we used for the random forest prediction model is the default value of 100, recent studies show that the parameters of the random forest model do not affect our research results (

Conclusion

In this paper, an interpretability model is used to explain and optimize the defect model. We validated our experiment with an in-depth study of six open source projects. Our experimental results show that the random forest model in RQ1 can predict software defects well, with an accuracy of 71.52% and a recall rate of 68.88%. In RQ2, we innovatively used the LIME model to explain the software defect prediction model and its results. The existing Just-in-Time Defect Prediction technique only predicts

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research was supported by the 2021 Key R&D Program in Shaanxi Province, China (2021GY-041) and the National Natural Science Foundation of China special project “Capability-based construction method and execution mechanisms for ubiquitous operating systems” (62141208).

References (56)

  • Eyolfson, J., Tan, L., Lam, P., 2011. Do time of day and developer experience affect commit bugginess? In: Proceedings...
  • Fu, W., Menzies, T., 2017. Revisiting unsupervised learning for defect prediction. In: Proceedings of the 2017 11th...
  • Fukushima, T., Kamei, Y., McIntosh, S., Yamashita, K., Ubayashi, N., 2014. An empirical study of just-in-time defect...
  • Hassan, A.E. Predicting faults using the complexity of code changes.
  • Herzig, K., et al. It’s not a bug, it’s a feature: how misclassification impacts bug prediction.
  • Kamei, Y., et al., 2016. Studying just-in-time defect prediction using cross-project models. Empir. Softw. Eng.
  • Kamei, Y., et al. The effects of over and under sampling on fault-prone module detection.
  • Kamei, Y., et al. Defect prediction: Accomplishments and future challenges.
  • Kamei, Y., et al., 2012. A large-scale empirical study of just-in-time quality assurance. IEEE Trans. Softw. Eng.
  • Keung, J., et al., 2013. Finding conclusion stability for selecting the best effort predictor in software effort estimation. Autom. Softw. Eng.
  • Khuat, T.T., et al., 2020. Evaluation of sampling-based ensembles of classifiers on imbalanced data for software defect prediction problems. SN Comput. Sci.
  • Kim, S., et al., 2008. Classifying software changes: Clean or buggy? IEEE Trans. Softw. Eng.
  • Kim, S., et al. Dealing with noise in defect prediction.
  • Kim, S., et al. Automatic identification of bug-introducing changes.
  • LaToza, T.D., Venolia, G., DeLine, R., 2006. Maintaining mental models: a study of developer work habits. In:...
  • Lessmann, S., et al., 2008. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans. Softw. Eng.
  • Li, Y., et al., 2019. Using tri-relation networks for effective software fault-proneness prediction. IEEE Access.
  • Liu, W., et al., 2018. Predicting the severity of bug reports based on feature selection. Int. J. Softw. Eng. Knowl. Eng.