Practical development of an Eclipse-based software fault prediction tool using Naive Bayes algorithm

https://doi.org/10.1016/j.eswa.2010.08.022

Abstract

Despite the effort software engineers have put into developing fault prediction models, software fault prediction still poses great challenges. Research using machine learning and statistical techniques has been ongoing for 15 years, yet there has been no breakthrough. None of these prediction models has achieved widespread applicability in the software industry, owing to a lack of software tools that automate the prediction process. Historical project data, including software faults, together with a robust software fault prediction tool can enable quality managers to focus on fault-prone modules and thus improve the testing process. We developed an Eclipse-based software fault prediction tool for Java programs to simplify the fault prediction process. We also integrated a machine learning algorithm called Naive Bayes into the plug-in because of its proven high performance on this problem. This article presents a practical view of the software fault prediction problem and shows how we combined software metrics with software fault data to apply the Naive Bayes technique inside an open source platform.

Research highlights

► Machine learning algorithms can be easily integrated into software fault prediction tools.
► The Eclipse framework simplifies developing software fault prediction tools.
► The end user should not feel the computational complexity of machine learning algorithms for software fault prediction.

Introduction

Over the last 15 years, software fault prediction models have received a lot of attention from software engineering researchers and machine learning experts. However, a robust fault prediction model alone is not enough in today’s competitive software industry. There is a great need for software tools that make it easier for software quality professionals or project managers to predict faults before they occur. Building and applying a fault prediction model within a software company is time-consuming, detailed, meticulous work, and most commercial projects do not have enough resources to carry out this activity (Ostrand & Weyuker, 2006).

On the other hand, the benefits of this Quality Assurance (QA) activity are impressive. By using such models, one can identify the refactoring candidate modules, improve the software testing process, select the best design from design alternatives with class level metrics, and reach a dependable software system (Catal & Diri, 2008). Hence, a tool for simplifying the prediction process is extremely useful to projects that might not be able to allocate necessary resources for this QA activity (Ostrand & Weyuker, 2006).

Software fault prediction approaches use software metrics and fault data from previous releases to predict the fault-prone modules in the next release of the software. If an error is reported for a module during system tests or in the field, that module’s fault data is marked as 1; otherwise it is marked as 0. For prediction modeling, software metrics are used as independent variables and fault data (1 or 0) is used as the dependent variable. Therefore, we need a version control system (VCS) such as Subversion to store source code, a change management system (CMS) such as ClearQuest to record fault data, and a tool to collect product metrics (method-level or class-level) from source code. The parameters of the prediction model are calculated from previous software metrics and fault data.
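The data layout described above can be sketched as follows. This is a minimal illustration in Python rather than the tool's own code: the `ModuleRecord` type, the helper functions, the metric names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class ModuleRecord:
    """Metrics for one module (class or method) in a release. Hypothetical type."""
    name: str
    metrics: Dict[str, float]  # independent variables, e.g. lines of code


def label_modules(
    modules: List[ModuleRecord], faulty_names: Set[str]
) -> List[Tuple[ModuleRecord, int]]:
    """Attach the dependent variable: 1 if a fault was reported, else 0."""
    return [(m, 1 if m.name in faulty_names else 0) for m in modules]


def to_matrix(
    labeled: List[Tuple[ModuleRecord, int]], metric_order: List[str]
) -> Tuple[List[List[float]], List[int]]:
    """Split labeled modules into a feature matrix X and a label vector y."""
    X = [[m.metrics[k] for k in metric_order] for m, _ in labeled]
    y = [label for _, label in labeled]
    return X, y


# Made-up values, for illustration only.
modules = [
    ModuleRecord("Parser.parse", {"loc": 120.0, "cyclomatic": 14.0}),
    ModuleRecord("Util.trim", {"loc": 8.0, "cyclomatic": 1.0}),
]
labeled = label_modules(modules, faulty_names={"Parser.parse"})
X, y = to_matrix(labeled, ["loc", "cyclomatic"])
```

Here `X` holds one row of metrics per module and `y` holds the matching 1/0 fault labels, which is the shape most classifiers expect for training.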

Different version control systems and change management systems may expose different Application Programming Interfaces (APIs); therefore, we did not target a specific VCS or CMS during the development of our software fault prediction tool. Moreover, designing a one-size-fits-all tool would not be as easy as it seems. We decided to let the prediction tool calculate the software metrics from a source code directory instead of a VCS. In addition, our prediction tool does not read fault data directly from a CMS because not every code change corresponds to a fault. With this strategy in mind, we let the user add the fault data for each module from an Eclipse editor. Later sections depict this easy-to-use operation with a figure. As a result, our prediction tool does not depend on any Commercial-Off-The-Shelf (COTS) tool. Different project groups in our institute will be able to use this tool for their projects. The only current limitation is the programming language that the tool supports: it can calculate software metrics (Halstead, McCabe and Chidamber–Kemerer) only for Java programs.

We modified the source code of an open source project called Lachesis (Vatkov, 2005) to develop our software fault prediction tool, called RUBY. “Lachesis is a software complexity measurement program for Object-Oriented source code” (Vatkov, 2005), and it can analyze both Java source code and byte code. However, Lachesis itself is not a fault prediction tool. Because the Lachesis plug-in did not show the user primitive Halstead metrics such as total operands and unique operators, we changed the source code to expose these metrics. Furthermore, dependency metrics (Martin, 1994) such as afferent couplings, abstractness, and instability have been removed from the graphical user interface because software fault prediction studies have not used these metrics so far and they have not yet been validated for fault prediction. In addition, the decision density metric, calculated by dividing the cyclomatic complexity metric by the lines of code metric, has been removed because it has not yet been validated for software fault prediction. The metrics calculated and displayed by our tool are shown in Table 1.
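To make the metric definitions mentioned above concrete, the following Python sketch derives the usual Halstead measures from the four primitive counts and computes decision density as cyclomatic complexity divided by lines of code. The Halstead formulas are the standard textbook definitions, not taken from the Lachesis or RUBY source, and the type and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class HalsteadPrimitives:
    """Primitive Halstead counts for one module. Hypothetical type."""
    unique_operators: int  # n1
    unique_operands: int   # n2
    total_operators: int   # N1
    total_operands: int    # N2


def halstead_derived(p: HalsteadPrimitives) -> Dict[str, float]:
    """Derive vocabulary, length, volume, difficulty and effort."""
    vocabulary = p.unique_operators + p.unique_operands
    length = p.total_operators + p.total_operands
    # Volume is defined as N * log2(n); guard against an empty module.
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    # Difficulty is (n1 / 2) * (N2 / n2); guard against division by zero.
    if p.unique_operands > 0:
        difficulty = (p.unique_operators / 2.0) * (
            p.total_operands / p.unique_operands
        )
    else:
        difficulty = 0.0
    return {
        "vocabulary": float(vocabulary),
        "length": float(length),
        "volume": volume,
        "difficulty": difficulty,
        "effort": difficulty * volume,
    }


def decision_density(cyclomatic_complexity: int, lines_of_code: int) -> float:
    """Cyclomatic complexity divided by lines of code, as described above."""
    return cyclomatic_complexity / lines_of_code if lines_of_code > 0 else 0.0
```

For example, a module with 4 unique operators, 4 unique operands, 10 total operators and 6 total operands has vocabulary 8 and length 16, so its volume is 16 × log2(8) = 48.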

After software metrics are calculated, a MethodLevelTrain.txt or ClassLevelTrain.txt file is created in the project tree, depending on the RUBY preference page settings. The user should then add fault or fault-free information for each module (class or method) by right-clicking in the editor that opens automatically after metrics calculation. For this purpose, we contributed two actions, Add Fault Info and Add Fault-Free Info, to the metrics analysis editor. Module class labels, initialized with the ? character, are updated to 1 by the Add Fault Info action and to 0 by the Add Fault-Free Info action. Users can copy this training file into a new software version and then choose the Predict with Naive Bayes option by right-clicking the project. After this selection, the current software metrics are calculated and stored in a MethodLevelTest.txt or ClassLevelTest.txt file. The Naive Bayes machine learning algorithm automatically uses the training and test files, according to the preference page settings, and predicts the fault-prone and fault-free modules. After the predictions have been made, the results are shown in an Eclipse view called Result view, developed by our project group.
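The labeling step in this workflow can be sketched as below. The layout of the training files is not given here, so the sketch assumes a hypothetical comma-separated layout (module name, metric values, then the label as the last field); the function names are also illustrative and are not the tool's actions.

```python
from typing import List


def set_label(lines: List[str], module_name: str, faulty: bool) -> List[str]:
    """Return a copy of the lines with one module's label set to 1 or 0.

    Assumed (hypothetical) line layout: name,metric1,...,metricN,label
    """
    new_label = "1" if faulty else "0"
    updated = []
    for line in lines:
        fields = line.split(",")
        if fields[0] == module_name:
            fields[-1] = new_label  # replaces "?" or a previous label
        updated.append(",".join(fields))
    return updated


def unlabeled_modules(lines: List[str]) -> List[str]:
    """Names of modules whose label is still the initial "?"."""
    return [ln.split(",")[0] for ln in lines if ln.split(",")[-1] == "?"]


# Made-up training lines, as they might look right after metrics calculation.
train = ["Parser.parse,120,14,?", "Util.trim,8,1,?"]
pending_before = unlabeled_modules(train)

train = set_label(train, "Parser.parse", faulty=True)   # like Add Fault Info
train = set_label(train, "Util.trim", faulty=False)     # like Add Fault-Free Info
pending_after = unlabeled_modules(train)
```

Checking `unlabeled_modules` before prediction is one way to make sure no module is still marked with ? when the training file is used.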

The architecture of this tool and its reusable parts let us add new machine learning algorithms easily. In that case, the training and test files are used by the new algorithm instead of Naive Bayes, and the other parts of the tool are not affected. As far as we know, this is the first study to develop an industrial-level Eclipse-based software fault prediction tool using machine learning algorithms. The idea and the machine learning method behind this tool are quite influential, and any software company can develop its own tool provided that it knows the steps we took to develop this prediction tool.
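One way to picture this pluggable design is a small classifier interface with Naive Bayes as one implementation. The sketch below is written in Python for brevity and is not the RUBY code: the `FaultClassifier` interface and `run_prediction` helper are hypothetical, and the Gaussian likelihood for numeric metrics is an assumption, since the variant used by the tool is not stated here.

```python
import math
from typing import Dict, List, Protocol, Sequence


class FaultClassifier(Protocol):
    """Hypothetical interface any prediction algorithm would satisfy."""
    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> None: ...
    def predict(self, X: Sequence[Sequence[float]]) -> List[int]: ...


class GaussianNaiveBayes:
    """Naive Bayes with a per-class Gaussian for each metric (an assumption)."""

    def __init__(self, var_floor: float = 1e-9) -> None:
        self.var_floor = var_floor  # avoids division by zero variance
        self.priors: Dict[int, float] = {}
        self.means: Dict[int, List[float]] = {}
        self.variances: Dict[int, List[float]] = {}

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> None:
        n = len(y)
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            cols = list(zip(*rows))  # one tuple of values per metric
            self.priors[label] = len(rows) / n
            self.means[label] = [sum(c) / len(c) for c in cols]
            self.variances[label] = [
                max(sum((v - m) ** 2 for v in c) / len(c), self.var_floor)
                for c, m in zip(cols, self.means[label])
            ]

    def _log_posterior(self, x: Sequence[float], label: int) -> float:
        # log P(C) + sum of log P(x_i | C), up to a constant shared by classes
        score = math.log(self.priors[label])
        for v, m, var in zip(x, self.means[label], self.variances[label]):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score

    def predict(self, X: Sequence[Sequence[float]]) -> List[int]:
        return [
            max(self.priors, key=lambda c: self._log_posterior(x, c)) for x in X
        ]


def run_prediction(clf: FaultClassifier, X_train, y_train, X_test) -> List[int]:
    """Train on the previous release, predict on the current one."""
    clf.fit(X_train, y_train)
    return clf.predict(X_test)


# Made-up metrics (lines of code, cyclomatic complexity) and labels.
X_train = [[120.0, 14.0], [95.0, 11.0], [8.0, 1.0], [15.0, 2.0]]
y_train = [1, 1, 0, 0]
predictions = run_prediction(GaussianNaiveBayes(), X_train, y_train,
                             [[110.0, 12.0], [10.0, 1.0]])
```

Because `run_prediction` depends only on the `fit` and `predict` methods, another algorithm could be swapped in without touching the code that prepares the training and test data.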

This paper is organized as follows: Section 2 presents related work. Section 3 explains the Naive Bayes algorithm. Section 4 introduces our Eclipse-based software fault prediction tool. Section 5 presents the conclusions and future work.

Related work

Ostrand and Weyuker (2006) discussed the issues in building an automated software fault prediction tool. They had experience working with four industrial software projects, all of which used the same fully integrated VCS and CMS. Therefore, each Modification Request (MR) to the system is available, and the information about each MR is a written description of the reason for the change. However, it is a crucial problem to determine whether a modification was made because of a fault,

Naive Bayes algorithm

Classification is a procedure that assigns a class label to a sample, given a set of samples that already have labels (Zhang & Sheng, 2004). Naive Bayesian Classification (also known as the Simple Bayesian Classifier) is among the best-known and most widely used classification methods. It is not only easy to implement on various kinds of datasets but also quite efficient. Let X be a data sample which has no class label. Let H be a hypothesis such that X belongs to a specified class, C. We aim to ascertain P(H|X), the

RUBY: Eclipse-based software fault prediction tool

This section introduces our Eclipse-based software fault prediction tool, which uses the Naive Bayes machine learning algorithm. Section 4.1 explains the Eclipse platform, and Section 4.2 presents the RUBY plug-ins and preferences. Section 4.3 presents the RUBY analysis and prediction operations, and Section 4.4 explains the RUBY flow graph.

Conclusions and future works

We developed an Eclipse-based software fault prediction tool for Java programs to simplify the fault prediction process, and we also integrated a machine learning algorithm called Naive Bayes into this plug-in because of its proven high performance on this problem. As this tool shows, machine learning algorithms can be easily used for real-world applications and problems. We examined the probabilistic results of our Naive Bayes implementation and noticed that we have the same results as WEKA

Acknowledgement

This project is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under Grant 107E213. The findings and opinions in this study belong solely to the authors, and are not necessarily those of the sponsor.

References (29)

  • Catal, C., & Diri, B. (2008). A fault prediction model with limited fault data to improve test process. In 9th...
  • Clayberg, E., et al. (2004). Eclipse: Building commercial quality plug-ins.
  • Daum, B. (2005). Professional Eclipse 3 for Java developers....
  • Dueñas, J. C., et al. (2007). Apache and Eclipse: Comparing open source project incubators. IEEE Software.