Practical development of an Eclipse-based software fault prediction tool using Naive Bayes algorithm

doi:10.1016/j.eswa.2010.08.022

Expert Systems with Applications

Volume 38, Issue 3, March 2011, Pages 2347-2353

https://doi.org/10.1016/j.eswa.2010.08.022

Abstract

Despite the effort software engineers have put into developing fault prediction models, software fault prediction still poses great challenges. Research using machine learning and statistical techniques has been ongoing for 15 years, and yet there has been no breakthrough. Unfortunately, none of these prediction models has achieved widespread applicability in the software industry because of a lack of software tools that automate the prediction process. Historical project data that include software faults, combined with a robust software fault prediction tool, can enable quality managers to focus on fault-prone modules and thus improve the testing process. We developed an Eclipse-based software fault prediction tool for Java programs to simplify the fault prediction process. We also integrated a machine learning algorithm called Naive Bayes into the plug-in because of its proven high performance for this problem. This article presents a practical view of the software fault prediction problem and shows how we combined software metrics with software fault data to apply the Naive Bayes technique inside an open source platform.

Research highlights

► Machine learning algorithms can be easily integrated into software fault prediction tools.
► The Eclipse framework simplifies developing software fault prediction tools.
► The end user should not feel the computational complexity of machine learning algorithms for software fault prediction.

Introduction

Over the last 15 years, software fault prediction models have received a lot of attention from software engineering researchers and machine learning experts. However, a robust fault prediction model alone is not enough in today’s competitive software industry. There is a great need for software tools that make it easier for software quality professionals or project managers to predict faults before they occur. Building and applying a fault prediction model within a software company is time-consuming, detailed, meticulous work, and most commercial projects do not have enough resources to carry out this activity (Ostrand & Weyuker, 2006).

On the other hand, the benefits of this Quality Assurance (QA) activity are impressive. By using such models, one can identify the refactoring candidate modules, improve the software testing process, select the best design from design alternatives with class level metrics, and reach a dependable software system (Catal & Diri, 2008). Hence, a tool for simplifying the prediction process is extremely useful to projects that might not be able to allocate necessary resources for this QA activity (Ostrand & Weyuker, 2006).

Software fault prediction approaches use software metrics and fault data from previous releases to predict the fault-prone modules for the next release of the software. If an error is reported for a module during system tests or in the field, that module’s fault data is marked as 1; otherwise it is marked as 0. For the prediction modeling, software metrics are used as independent variables and fault data (1 or 0) is used as the dependent variable. Therefore, we need a version control system (VCS) such as Subversion to store source code, a change management system (CMS) such as ClearQuest to record fault data, and a tool to collect product metrics (method-level or class-level) from source code. Parameters of the prediction model are calculated using previous software metrics and fault data.
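As a rough illustration of this setup, the sketch below pairs each module's metric values (the independent variables) with a 0/1 fault label (the dependent variable). The module names, metric names, and values are invented for illustration and are not taken from the paper or from RUBY.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ModuleRecord:
    """One module (class or method) from a previous release. Hypothetical structure."""
    name: str
    metrics: Dict[str, float]   # product metrics collected from source code
    fault_reported: bool        # True if a fault was reported in test or in the field


def to_training_row(record: ModuleRecord, metric_order: List[str]) -> Tuple[List[float], int]:
    """Turn a module record into (independent variables, dependent variable)."""
    features = [record.metrics[name] for name in metric_order]
    label = 1 if record.fault_reported else 0
    return features, label


# Illustrative metric names and values only.
METRIC_ORDER = ["loc", "cyclomatic_complexity", "unique_operators"]
records = [
    ModuleRecord("Parser.parse", {"loc": 120.0, "cyclomatic_complexity": 14.0, "unique_operators": 25.0}, True),
    ModuleRecord("Logger.write", {"loc": 18.0, "cyclomatic_complexity": 2.0, "unique_operators": 7.0}, False),
]
rows = [to_training_row(r, METRIC_ORDER) for r in records]
```

Fixing the metric order in one place keeps the training rows and the later prediction rows aligned column by column.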

Different version control systems and change management systems may have different kinds of Application Programming Interfaces (APIs); therefore, we did not aim to support a specific type of VCS or CMS during the development of our software fault prediction tool. In addition, designing a one-size-fits-all tool would not be as easy as it seems. We decided to let the prediction tool calculate the software metrics from a source code directory instead of a VCS. Furthermore, our prediction tool does not read fault data directly from a CMS because not every code change necessarily corresponds to a fault. Following this strategy, we let the user add the fault data for each module from an Eclipse editor. The following sections depict this easy-to-use operation with a figure. For this reason, our prediction tool is not dependent on any Commercial-Off-The-Shelf (COTS) tool. Different project groups will be able to use this tool for their projects in our institute. The only limitation at present is the programming language that the tool works on: our prediction tool can calculate software metrics (Halstead, McCabe and Chidamber–Kemerer) only for Java programs.

We modified the source code of an open source project called Lachesis (Vatkov, 2005) to develop our software fault prediction tool called RUBY. “Lachesis is a software complexity measurement program for Object-Oriented source code” (Vatkov, 2005) and it can analyze both Java source code and byte code. However, Lachesis itself is not a fault prediction tool. Because the Lachesis plug-in did not show the user primitive Halstead metrics such as total operands and unique operators, we changed the source code to display these metrics. Furthermore, dependency metrics (Martin, 1994) such as afferent couplings, abstractness, and instability have been removed from the graphical user interface because software fault prediction studies have not taken these metrics into account so far and they have not yet been validated for fault prediction. In addition, the decision density metric, calculated by dividing the cyclomatic complexity metric by the lines of code metric, has been removed because it has not yet been validated for software fault prediction. The metrics calculated and shown to the user by our tool are listed in Table 1.
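For readers unfamiliar with these metrics, the sketch below shows how the standard Halstead derived measures follow from the four primitive counts (unique and total operators and operands), and how decision density is defined as described above. This is not RUBY or Lachesis code; the function names and sample counts are illustrative assumptions.

```python
import math


def halstead_measures(unique_operators, unique_operands, total_operators, total_operands):
    """Derive the standard Halstead measures from the four primitive counts."""
    vocabulary = unique_operators + unique_operands
    length = total_operators + total_operands
    volume = length * math.log2(vocabulary)
    difficulty = (unique_operators / 2) * (total_operands / unique_operands)
    effort = difficulty * volume
    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
    }


def decision_density(cyclomatic_complexity, lines_of_code):
    """Cyclomatic complexity divided by lines of code."""
    return cyclomatic_complexity / lines_of_code


# Illustrative counts for a small method.
sample = halstead_measures(
    unique_operators=4, unique_operands=4, total_operators=8, total_operands=8
)
density = decision_density(cyclomatic_complexity=10, lines_of_code=200)
```

Because every derived Halstead measure is a function of the same four counts, exposing the primitive counts to the user, as the modified plug-in does, is enough to recompute the rest.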

After the software metrics are calculated, a MethodLevelTrain.txt or ClassLevelTrain.txt file is created in the project tree, depending on the RUBY preference page settings. The user should then mark each module (class or method) as faulty or fault-free by right-clicking in the editor that opens automatically after metrics calculation. For this purpose, we contributed two actions, Add Fault Info and Add Fault-Free Info, to the metrics analysis editor. Module class labels are initialized with the ? character and are updated to 1 by the Add Fault Info action and to 0 by the Add Fault-Free Info action. Users can copy this training file into a new software version and then choose the Predict with Naive Bayes option by right-clicking on the project. After this selection, the current software metrics are calculated and stored in a MethodLevelTest.txt or ClassLevelTest.txt file. The Naive Bayes machine learning algorithm automatically uses the training and test files, according to the preference page settings, and predicts the fault-prone and fault-free modules. After the predictions have been made, the results are shown in an Eclipse view called Result view, developed by our project group.
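The labeling step can be pictured with a small sketch. The actual layout of MethodLevelTrain.txt is not described here, so the comma-separated format below (module name first, class label last) is purely an assumption, as are the function names; only the ? / 1 / 0 label convention comes from the text.

```python
def set_fault_label(lines, module_name, faulty):
    """Return a copy of the training lines with one module's label set to 1 or 0.

    Assumed (hypothetical) line format: name,metric1,...,metricN,label
    """
    label = "1" if faulty else "0"
    updated = []
    for line in lines:
        fields = line.split(",")
        if fields[0] == module_name:
            fields[-1] = label   # replaces the initial "?" placeholder
        updated.append(",".join(fields))
    return updated


def unlabeled_modules(lines):
    """Names of modules whose label is still the initial '?' placeholder."""
    return [line.split(",")[0] for line in lines if line.split(",")[-1] == "?"]


# Illustrative content; real files are produced by the tool.
train_lines = ["Parser.parse,120,14,?", "Logger.write,18,2,?"]
train_lines = set_fault_label(train_lines, "Parser.parse", faulty=True)    # like Add Fault Info
train_lines = set_fault_label(train_lines, "Logger.write", faulty=False)   # like Add Fault-Free Info
remaining = unlabeled_modules(train_lines)
```

A check such as `unlabeled_modules` is one way to confirm that no module is left with a ? label before the training file is reused for prediction.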

The architecture of this tool and its reusable parts let us add new machine learning algorithms easily. In that case, the training and test files will be used by the new algorithm instead of Naive Bayes, and other parts of the tool will not be affected. As far as we know, this is the first study to develop an industrial-level Eclipse-based software fault prediction tool using machine learning algorithms. The idea and the machine learning method behind this tool are quite influential, and any software company can develop its own tool provided that it knows the steps we took to develop this prediction tool.

This paper is organized as follows: Section 2 presents the related work. Section 3 explains the Naive Bayes algorithm. Section 4 introduces our Eclipse-based software fault prediction tool. Section 5 presents the conclusions and future work.

Section snippets

Related work

Ostrand and Weyuker (2006) discussed the issues in building an automated software fault prediction tool. They had experience working with four industrial software projects and all of these projects used the same VCS and CMS which are fully integrated. Therefore, each Modification Request (MR) to the system is available and the information about each MR is a written description of the change reason. However, it is a crucial problem to determine whether a modification was made because of a fault,

Naive Bayes algorithm

Classification is a procedure that assigns a class label to a sample, given a set of samples that already have labels (Zhang & Sheng, 2004). Naive Bayesian Classification (also known as the Simple Bayesian Classifier) is one of the best-known and most widely used classification methods. It is not only easy to implement on various kinds of datasets but also quite efficient. Let X be a data sample which has no class label. Let H be a hypothesis such that X belongs to a specified class, C. We aim to ascertain P(H|X), the
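As a minimal sketch of the idea, the code below estimates P(C|X) with Bayes' rule under the naive assumption that metrics are independent given the class. It assumes continuous metric values modeled with a per-class Gaussian; the excerpt does not state which variant RUBY uses, so this choice, the function names, and the sample data are all illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import defaultdict


def train(rows):
    """rows: list of (features, label). Returns {label: (prior, means, variances)}."""
    by_class = defaultdict(list)
    for features, label in rows:
        by_class[label].append(features)
    model = {}
    for label, samples in by_class.items():
        n = len(samples)
        columns = list(zip(*samples))
        means = [sum(col) / n for col in columns]
        # Small constant avoids a zero variance when a metric never varies.
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model


def log_gaussian(x, mean, variance):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)


def posterior(model, features):
    """P(C|X) for each class: prior times per-metric likelihoods, then normalized."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        log_scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(features, means, variances))
    top = max(log_scores.values())   # subtract the maximum for numerical stability
    scaled = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(scaled.values())
    return {label: value / total for label, value in scaled.items()}


def predict(model, features):
    """Return the class with the highest posterior probability."""
    probabilities = posterior(model, features)
    return max(probabilities, key=probabilities.get)


# Illustrative rows: [lines of code, cyclomatic complexity] -> fault label.
rows = [([10.0, 2.0], 0), ([12.0, 3.0], 0), ([14.0, 2.0], 0),
        ([80.0, 15.0], 1), ([95.0, 18.0], 1), ([110.0, 20.0], 1)]
model = train(rows)
large_module = predict(model, [100.0, 17.0])
small_module = predict(model, [11.0, 2.0])
```

Working in log space and normalizing at the end avoids numeric underflow when many metric likelihoods are multiplied together.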

RUBY: Eclipse-based software fault prediction tool

This section introduces our Eclipse-based software fault prediction tool, which uses the Naive Bayes machine learning algorithm. Section 4.1 explains the Eclipse platform, and Section 4.2 presents the RUBY plug-ins and preferences. Section 4.3 presents the RUBY analysis and prediction operations, and Section 4.4 explains the RUBY flow graph.

Conclusions and future works

We developed an Eclipse-based software fault prediction tool for Java programs to simplify the fault prediction process, and we integrated a machine learning algorithm called Naive Bayes into this plug-in because of its proven high performance for this problem. As this tool shows, machine learning algorithms can be easily used for real-world applications and problems. We examined the probabilistic results of our Naive Bayes implementation and noticed that we have the same results as WEKA

Acknowledgement

This project is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under Grant 107E213. The findings and opinions in this study belong solely to the authors, and are not necessarily those of the sponsor.

References (29)

K. El Emam et al. Comparing case-based reasoning classifiers for predicting high risk software components. Journal of Systems and Software (2001)
K.O. Elish et al. Predicting defect-prone software modules using support vector machines. Journal of Systems and Software (2008)
N. Fenton et al. Predicting software defects in varying development lifecycles using Bayesian nets. Information and Software Technology (2007)
I. Gondra. Applying machine learning to software fault-proneness prediction. Journal of Systems and Software (2008)
A. Janes et al. Identification of defect-prone classes in telecommunication software systems using design metrics. Information Sciences (2006)
S. Kanmani et al. Object-oriented software fault prediction using neural networks. Information and Software Technology (2007)
T. Quah et al. Prediction of software development faults in PL/SQL files using neural network models. Information and Software Technology (2004)
P. Tomaszewski et al. Statistical models vs. expert estimation for fault prediction in modified code – An industrial case study. Journal of Systems and Software (2007)
O. Vandecruys et al. Mining software repositories for comprehensible software fault prediction models. Journal of Systems and Software (2008)
Catal, C., & Diri, B. (2007). Software fault prediction with object-oriented metrics based artificial immune...
Catal, C., & Diri, B. (2008). A fault prediction model with limited fault data to improve test process. In 9th...
E. Clayberg et al. Eclipse: Building commercial quality plug-ins (2004)
Daum, B. (2005). Professional Eclipse 3 for Java developers....
J.C. Dueñas et al. Apache and Eclipse: Comparing open source project incubators. IEEE Software (2007)

