Practical development of an Eclipse-based software fault prediction tool using Naive Bayes algorithm
Research highlights
► Machine learning algorithms can be easily integrated into software fault prediction tools.
► The Eclipse framework simplifies the development of software fault prediction tools.
► The end user should not feel the computational complexity of machine learning algorithms for software fault prediction.
Introduction
Over the last 15 years, software fault prediction models have received a lot of attention from software engineering researchers and machine learning experts. However, a robust fault prediction model alone is not enough in today’s competitive software industry. There is a great need for software tools that make it easier for software quality professionals or project managers to predict faults before they occur. Building and applying a fault prediction model within a software company is time-consuming, detailed, meticulous work, and most commercial projects do not have enough resources to carry out this activity (Ostrand & Weyuker, 2006).
On the other hand, the benefits of this Quality Assurance (QA) activity are impressive. By using such models, one can identify the refactoring candidate modules, improve the software testing process, select the best design from design alternatives with class level metrics, and reach a dependable software system (Catal & Diri, 2008). Hence, a tool for simplifying the prediction process is extremely useful to projects that might not be able to allocate necessary resources for this QA activity (Ostrand & Weyuker, 2006).
Software fault prediction approaches use software metrics and fault data from previous releases to predict the fault-prone modules for the next release of software. If an error is reported for a module during system tests or in the field, that module’s fault data is marked as 1; otherwise it is marked as 0. For the prediction modeling, software metrics are used as independent variables and fault data (1 or 0) is used as the dependent variable. Therefore, we need a version control system (VCS) such as Subversion to store source code, a change management system (CMS) such as ClearQuest to record fault data, and a tool to collect product metrics (method-level or class-level) from source code. Parameters of the prediction model are calculated using previous software metrics and fault data.
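As a minimal sketch of this setup, the following Python fragment pairs each module's metrics (the independent variables) with a 0/1 fault label (the dependent variable). The module names, metric names, and values are invented for illustration and are not taken from the tool.

```python
# Hypothetical metrics for three modules of a previous release (illustrative values only).
metrics = {
    "Parser.parse": {"loc": 120, "cyclomatic_complexity": 14},
    "Parser.reset": {"loc": 8, "cyclomatic_complexity": 1},
    "Lexer.next": {"loc": 45, "cyclomatic_complexity": 6},
}

# Modules for which a fault was reported during system tests or in the field.
faulty_modules = {"Parser.parse"}


def build_dataset(metrics, faulty_modules):
    """Return (names, X, y): metric vectors as inputs, 0/1 fault labels as output."""
    names = sorted(metrics)
    feature_names = sorted(next(iter(metrics.values())))
    X = [[metrics[name][f] for f in feature_names] for name in names]
    y = [1 if name in faulty_modules else 0 for name in names]
    return names, X, y


names, X, y = build_dataset(metrics, faulty_modules)
for name, row, label in zip(names, X, y):
    print(name, row, label)
```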
Different version control systems and change management systems may have different kinds of Application Programming Interfaces (APIs); therefore, we did not aim to use a specific type of VCS or CMS during the development of our software fault prediction tool. In addition, designing a one-size-fits-all tool would not be as easy as it seems. We decided to let the prediction tool calculate the software metrics from a source code directory instead of a VCS. Furthermore, our prediction tool does not read fault data directly from a CMS because not every code change necessarily corresponds to a fault. With this strategy in mind, we let the user add the fault data for each module from an Eclipse editor. The next sections will depict this easy-to-use operation with a figure. For this reason, our prediction tool is not dependent on any Commercial-Off-The-Shelf (COTS) tool. Different project groups in our institute will be able to use this tool for their projects. The only limitation now is the programming language that this tool works on: our prediction tool can calculate software metrics (Halstead, McCabe and Chidamber–Kemerer) only for Java programs.
We modified the source code of an open source project called Lachesis (Vatkov, 2005) to develop our software fault prediction tool, called RUBY. “Lachesis is a software complexity measurement program for Object-Oriented source code” (Vatkov, 2005) and it can analyze both Java source code and byte code. However, Lachesis itself is not a fault prediction tool. Because the Lachesis plug-in did not provide information to the user about primitive Halstead metrics such as total operands and unique operators, we changed the source code to let the user see these metrics. Furthermore, dependency metrics (Martin, 1994) such as afferent couplings, abstractness, and instability have been removed from the graphical user interface because software fault prediction studies have not taken these metrics into account so far and they have not yet been validated for fault prediction. In addition, the decision density metric, calculated by dividing cyclomatic complexity by lines of code, has been removed because it has not yet been validated for software fault prediction. The metrics calculated and shown to the user by our tool are listed in Table 1.
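To make the metric definitions concrete, here is a small sketch using the standard Halstead definitions (vocabulary, length, volume) and the decision density ratio described above. The function names and input values are illustrative assumptions and do not reflect how Lachesis or RUBY compute these metrics internally.

```python
import math


def halstead_volume(unique_operators, unique_operands, total_operators, total_operands):
    """Halstead volume V = N * log2(n), built from the four primitive counts."""
    vocabulary = unique_operators + unique_operands   # n = n1 + n2
    length = total_operators + total_operands         # N = N1 + N2
    return length * math.log2(vocabulary)


def decision_density(cyclomatic_complexity, lines_of_code):
    """Decision density: cyclomatic complexity divided by lines of code."""
    return cyclomatic_complexity / lines_of_code


# Invented counts for a single method.
volume = halstead_volume(unique_operators=10, unique_operands=6,
                         total_operators=30, total_operands=34)
density = decision_density(cyclomatic_complexity=14, lines_of_code=120)
print(volume, density)
```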
After the software metrics are calculated, a MethodLevelTrain.txt or ClassLevelTrain.txt file is created in the project tree, depending on the RUBY preference page settings. The user should then add fault or fault-free information for each module (class or method) by right-clicking in the editor that opens automatically after metrics calculation. For this purpose, we contributed two actions, Add Fault Info and Add Fault-Free Info, to the metrics analysis editor. Module class labels are initialized with the ? character and are updated to 1 by the Add Fault Info action and to 0 by the Add Fault-Free Info action. Users can copy this training file into a new software version and then choose the Predict with Naive Bayes option by right-clicking the project. After this selection, the current software metrics are calculated and stored in a MethodLevelTest.txt or ClassLevelTest.txt file. The Naive Bayes machine learning algorithm automatically uses the training and test files according to the preference page settings and predicts the fault-prone and fault-free modules. After the predictions have been made, the results are shown in an Eclipse view called Result view, developed by our project group.
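The labeling step can be sketched as follows. The actual layout of MethodLevelTrain.txt is not described here, so this sketch assumes one comma-separated line per module with the class label as the last field; the function name `set_fault_label` and the sample rows are hypothetical.

```python
def set_fault_label(lines, module_name, faulty):
    """Replace the '?' class label of one module with 1 (faulty) or 0 (fault-free).

    Assumed line layout (an assumption, not the tool's documented format):
    module_name,metric_1,...,metric_n,label
    """
    updated = []
    for line in lines:
        fields = line.split(",")
        if fields[0] == module_name and fields[-1] == "?":
            fields[-1] = "1" if faulty else "0"
        updated.append(",".join(fields))
    return updated


# Hypothetical training rows as they might look right after metrics calculation.
train = ["Parser.parse,120,14,?", "Parser.reset,8,1,?"]
train = set_fault_label(train, "Parser.parse", faulty=True)    # mimics Add Fault Info
train = set_fault_label(train, "Parser.reset", faulty=False)   # mimics Add Fault-Free Info
print(train)
```

Keeping the labels in a plain text file, as the tool does, means the same file can be copied into the next software version and reused as training data without any dependency on a VCS or CMS.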
The architecture of this tool and its reusable parts let us add new machine learning algorithms easily. In that case, the training and test files will be used by the new algorithms instead of Naive Bayes, and the other parts of the tool will not be affected. As far as we know, this is the first study to develop an industrial-level Eclipse-based software fault prediction tool using machine learning algorithms. The idea and the machine learning method behind this tool are quite influential, and any software company can develop its own tool provided that it knows the steps we took to develop this prediction tool.
This paper is organized as follows: Section 2 presents the related work. Section 3 explains the Naive Bayes algorithm. Section 4 introduces our Eclipse-based software fault prediction tool. Section 5 presents the conclusions and future work.
Section snippets
Related work
Ostrand and Weyuker (2006) discussed the issues in building an automated software fault prediction tool. They had experience working with four industrial software projects and all of these projects used the same VCS and CMS which are fully integrated. Therefore, each Modification Request (MR) to the system is available and the information about each MR is a written description of the change reason. However, it is a crucial problem to determine whether a modification was made because of a fault,
Naive Bayes algorithm
Classification is a procedure that assigns a class label to a sample, based on a given set of samples that already have labels (Zhang & Sheng, 2004). Naive Bayesian Classification (also known as the Simple Bayesian Classifier) is one of the best-known and most widely used classification methods. It is not only easy to implement on various kinds of datasets but also quite efficient. Let X be a data sample which has no class label. Let H be a hypothesis such that X belongs to a specified class, C. We aim to ascertain P(H|X), the
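As an illustration of how P(H|X) can be estimated in practice, the sketch below implements a small Gaussian Naive Bayes classifier: it applies Bayes' rule, P(H|X) ∝ P(X|H)P(H), and treats the metrics as conditionally independent given the class. The Gaussian likelihood for numeric metrics is an assumption of this sketch, not necessarily the variant used in RUBY, and the training data are invented.

```python
import math
from collections import defaultdict


def train_gaussian_nb(X, y):
    """Estimate class priors and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def posterior(model, x):
    """Return P(class | x) for every class, normalised to sum to 1."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        log_p = math.log(prior)
        for v, m, var in zip(x, means, variances):
            log_p += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_scores[label] = log_p
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: s / total for c, s in exp_scores.items()}


# Invented training data: [lines of code, cyclomatic complexity] -> fault label.
X_train = [[120, 14], [95, 11], [8, 1], [15, 2], [45, 6], [130, 17]]
y_train = [1, 1, 0, 0, 0, 1]

model = train_gaussian_nb(X_train, y_train)
probs = posterior(model, [110, 13])
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

Working in log space avoids numerical underflow when many metrics are multiplied together, which matters once the full set of method-level or class-level metrics is used.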
RUBY: Eclipse-based software fault prediction tool
This section introduces our Eclipse-based software fault prediction tool, which uses the Naive Bayes machine learning algorithm. Section 4.1 explains the Eclipse platform, Section 4.2 presents the RUBY plug-ins and preferences, Section 4.3 presents the RUBY analysis and prediction operations, and Section 4.4 explains the RUBY flow graph.
Conclusions and future works
We developed an Eclipse-based software fault prediction tool for Java programs to simplify the fault prediction process, and we integrated a machine learning algorithm called Naive Bayes into this plug-in because of its proven high performance on this problem. As this tool shows, machine learning algorithms can be easily used for real-world applications and problems. We examined the probabilistic results of our Naive Bayes implementation and noticed that we have the same results as WEKA
Acknowledgement
This project is supported by the Scientific and Technological Research Council of TURKEY (TUBITAK) under Grant 107E213. The findings and opinions in this study belong solely to the authors, and are not necessarily those of the sponsor.
References (29)
- et al. (2001). Comparing case-based reasoning classifiers for predicting high risk software components. Journal of Systems and Software.
- et al. (2008). Predicting defect-prone software modules using support vector machines. Journal of Systems and Software.
- et al. (2007). Predicting software defects in varying development lifecycles using Bayesian nets. Information and Software Technology.
- Applying machine learning to software fault-proneness prediction. (2008). Journal of Systems and Software.
- et al. (2006). Identification of defect-prone classes in telecommunication software systems using design metrics. Information Sciences.
- et al. (2007). Object-oriented software fault prediction using neural networks. Information and Software Technology.
- et al. (2004). Prediction of software development faults in PL/SQL files using neural network models. Information and Software Technology.
- et al. (2007). Statistical models vs. expert estimation for fault prediction in modified code – An industrial case study. Journal of Systems and Software.
- et al. (2008). Mining software repositories for comprehensible software fault prediction models. Journal of Systems and Software.
- Catal, C., & Diri, B. (2007). Software fault prediction with object-oriented metrics based artificial immune...
- Eclipse: Building commercial quality plug-ins.
- Apache and Eclipse: Comparing open source project incubators. IEEE Software.