Dynamic selection of fitness function for software change prediction using Particle Swarm Optimization

https://doi.org/10.1016/j.infsof.2019.04.007

Abstract

Context

Over the past few years, researchers have been actively searching for an effective classifier which correctly predicts change prone classes. Although a few researchers have ascertained the predictive capability of search-based algorithms in this domain, the effectiveness of these algorithms depends heavily on the selection of an optimum fitness function. The criterion for selecting one fitness function over another is the improved predictive capability of the developed model on the entire dataset. However, different subsets of instances of a dataset may give their best results with different fitness functions.

Objective

The aim of this study is to choose the best fitness function for each instance, rather than for the entire dataset, so as to create models which correctly ascertain the change prone nature of the majority of instances. Therefore, we propose a novel framework which adaptively selects an optimum fitness function for each instance of the dataset, so that the change prone nature of that instance is correctly determined.

Method

The predictive models in this study are developed using seven different fitness variants of the Particle Swarm Optimization (PSO) algorithm. The proposed framework predicts the best-suited fitness variant among the seven investigated variants on the basis of the structural characteristics of the corresponding instance.

Results

The results of the study are empirically validated on fifteen datasets collected from popular open-source software. The proposed adaptive framework was found to be efficient in determining change prone classes, as it yielded improved results when compared with models developed using individual fitness variants and fitness-based voting ensemble classifiers.

Conclusion

The performance of the models developed using the proposed adaptive framework was statistically better than that of the models developed using individual fitness variants of the PSO algorithm, and competitive with that of models developed using machine learning ensemble classifiers.

Introduction

Change is important in a software product to correct existing defects, incorporate enhancements according to varying requirements, or adapt the software to changes in its environment. Change prone classes will require modifications in the forthcoming releases of the software because i) there may be errors present in such classes which need correction, ii) a software enhancement may result in changes to a change prone class, or iii) the design of such classes may be altered to enhance the maintainability of the software. On the other hand, a class that is not change prone persists in future versions of the software without any design or structural changes. Therefore, such a class is less likely to require testing and maintenance resources. Thus, the development of models to determine the change prone classes of a software product is crucial, so that constrained software project resources may be used effectively by allocating proper resources to change prone classes [1], [2]. Moreover, rigorous verification and testing activities can be performed in the initial stages of the software development life cycle if we are aware of change prone classes, as such classes are likely sources of errors [1], [2], [3], [4]. Also, software practitioners can focus refactoring activities on prospective change prone classes to localize the impact of changes [5].

Previously, various researchers have been successful in developing efficient Software Change Prediction (SCP) models, which identify change prone classes by using software design metrics as predictor variables [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. Software design metrics are representatives of various characteristics of Object-Oriented (OO) software. These characteristics include size, coupling, inheritance, cohesion, etc.

There has been recent interest in applying Search Based Algorithms (SBA) to the software change prediction domain [7], [8], [11], [12]. SBA are metaheuristic in nature and search for an optimal or near-optimal solution among a multitude of possible solutions. This search for an effective solution in an SBA is guided by a fitness function. According to Harman and Clark [13], a researcher can employ performance metrics as fitness evaluators for developing predictive models. The choice of fitness function is critical, as it ascertains the suitability of a candidate solution by determining whether it is better or worse than the current solution [14]. Researchers in the past have validated that the selection of a fitness function influences the results of a prediction model [15], [16]. Therefore, the decision of selecting a specific fitness function is crucial.

In order to allocate maintenance resources effectively, the software industry needs efficient SCP models which correctly predict the change prone or not change prone nature of the maximum number of classes (instances). Though researchers in the past have used SBA for developing SCP models [7], [8], [11], they have used only a single fitness function for the entire dataset. This approach does not take into consideration the possibility that some instances of a dataset could be best predicted by a certain fitness function, while others could be correctly predicted by a different fitness function. This may happen because of the varied structural characteristics (design metrics) of the instances, which may be effectively learned and predicted by different fitness variants of the same algorithm (the same algorithm using different fitness criteria). Using different fitness functions for different subsets of a dataset could substantially improve the predictive performance of the developed models, as it would combine the correct classifications of the different fitness variants. Thus, this study proposes an adaptive framework, namely Adaptive Selection of Optimum Fitness (ASOF). The framework predicts a dynamic fitness function for each specific instance of a sample on the basis of its structural characteristics.
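The motivation above can be illustrated with a minimal Python sketch. It is not taken from the study: the variant names, the toy labels and the helper functions `correct_mask` and `coverage` are hypothetical. It only shows how two fitness variants with the same overall accuracy may be correct on different instances, so that a per-instance selector has a higher attainable ceiling than either variant alone.

```python
from typing import Dict, List


def correct_mask(predictions: List[int], actual: List[int]) -> List[bool]:
    """Mark, per instance, whether a variant's prediction matches the true label."""
    return [p == a for p, a in zip(predictions, actual)]


def coverage(variant_predictions: Dict[str, List[int]],
             actual: List[int]) -> Dict[str, float]:
    """Fraction of instances each variant classifies correctly, plus the
    fraction classified correctly by at least one variant."""
    n = len(actual)
    masks = {name: correct_mask(preds, actual)
             for name, preds in variant_predictions.items()}
    result = {name: sum(mask) / n for name, mask in masks.items()}
    # Ceiling that a perfect per-instance selector could reach.
    result["any_variant"] = sum(
        any(mask[i] for mask in masks.values()) for i in range(n)
    ) / n
    return result


# Hypothetical toy data: 1 = change prone, 0 = not change prone.
actual = [1, 0, 1, 1, 0, 0]
variant_predictions = {
    "accuracy": [1, 0, 0, 0, 0, 0],
    "g_mean1": [1, 1, 1, 1, 0, 1],
}
scores = coverage(variant_predictions, actual)
print(scores)
```

In this toy case each variant is correct on four of six instances, yet every instance is handled correctly by at least one of them.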

Although a number of SBA, such as Genetic Programming (GP), Genetic Algorithm (GA), simulated annealing and Tabu Search, could serve as the base algorithm for the ASOF framework, this study chose PSO because of its numerous advantages. The PSO algorithm is characterized by fast convergence, a low number of internal parameters and little sensitivity to changes in problem dimensionality [17], [18], [19]. It is a population-based algorithm which simulates the social behavior of birds and is commonly used for the optimization of specific objectives [20]. Constricted Particle Swarm Optimization (CPSO) is a PSO variant which has been successfully used for classification tasks in previous studies [21], [22], [23]. The CPSO algorithm uses constriction coefficients to provide faster convergence and to prevent search-space explosion [23]. Thus, this study uses the CPSO algorithm for developing prediction models. Seven different fitness variants of CPSO were coded using seven performance metrics, namely Accuracy, Balance, F-measure, G-Mean1, G-Mean2, G-measure and Precision, as fitness functions. The coded fitness variants were implemented in the KEEL software tool (http://www.keel.es/). This study evaluates the impact of variation in the fitness function on the performance of the SCP models developed using the various CPSO fitness variants.
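For readers unfamiliar with constriction, the following Python sketch shows the widely used Clerc-Kennedy constriction coefficient and the velocity update it scales. It is a generic continuous optimizer written for illustration, not the KEEL classifier used in the study. The function names, the default values c1 = c2 = 2.05, the swarm size and the toy objective are all assumptions.

```python
import math
import random
from typing import Callable, List, Tuple


def constriction_coefficient(c1: float = 2.05, c2: float = 2.05) -> float:
    """Clerc-Kennedy constriction factor; requires phi = c1 + c2 > 4."""
    phi = c1 + c2
    if phi <= 4:
        raise ValueError("phi = c1 + c2 must exceed 4")
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))


def cpso_maximize(fitness: Callable[[List[float]], float], dim: int,
                  n_particles: int = 20, iterations: int = 100,
                  c1: float = 2.05, c2: float = 2.05,
                  seed: int = 0) -> Tuple[List[float], float]:
    """Maximize `fitness` with a constricted particle swarm (illustrative)."""
    rng = random.Random(seed)
    chi = constriction_coefficient(c1, c2)
    positions = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in positions]
    pbest_fit = [fitness(p) for p in positions]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # The constriction factor scales the whole update, which
                # damps the velocity and keeps the swarm from diverging.
                velocities[i][d] = chi * (
                    velocities[i][d]
                    + c1 * r1 * (pbest[i][d] - positions[i][d])
                    + c2 * r2 * (gbest[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]
            f = fitness(positions[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = positions[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = positions[i][:], f
    return gbest, gbest_fit


# Toy objective (an assumption): the maximum of -(x^2 + y^2) is 0 at the origin.
best, best_fit = cpso_maximize(lambda x: -sum(v * v for v in x), dim=2)
print(round(constriction_coefficient(), 4), best_fit)
```

Replacing the toy objective with a performance metric computed on training data gives one fitness variant per metric, which is the sense in which the study speaks of fitness variants of the same algorithm.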

A previous study by the authors [24] proposed four ensembles of multiple fitness variants of CPSO, which were aggregated using weighted votes. The present study is distinct from the earlier one. Here, a specific individual fitness variant is output as the “best one” for each instance of the dataset. Instances for which the same fitness variant is selected are then grouped together, and their actual predictions are obtained from the corresponding fitness variant. No weighted voting is involved, unlike in the previous study.
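The grouping step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `asof_predict`, the variant names and the toy predictions are hypothetical, and the per-instance selection is taken as given rather than produced by a trained predictor.

```python
from collections import defaultdict
from typing import Dict, List


def asof_predict(selected_variant: List[str],
                 variant_predictions: Dict[str, List[int]]) -> List[int]:
    """Return, for each instance, the prediction made by the fitness
    variant that was selected for that instance."""
    # Group instance indices by their selected fitness variant.
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, name in enumerate(selected_variant):
        groups[name].append(idx)
    final = [0] * len(selected_variant)
    # Each group takes its predictions from the corresponding variant only;
    # no votes are combined across variants.
    for name, indices in groups.items():
        preds = variant_predictions[name]
        for idx in indices:
            final[idx] = preds[idx]
    return final


# Hypothetical toy data: 1 = change prone, 0 = not change prone.
selected = ["accuracy", "g_mean1", "g_mean1", "accuracy"]
variant_predictions = {
    "accuracy": [1, 0, 0, 0],
    "g_mean1": [0, 1, 1, 1],
}
final_predictions = asof_predict(selected, variant_predictions)
print(final_predictions)
```

Unlike a voting ensemble, exactly one variant determines the prediction for each instance.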

The current study also performs an extensive comparison of the models developed using the ASOF framework with those developed by individual CPSO fitness variants. The ASOF technique can be recognized as an ensemble of fitness variants. The study compares the results of ASOF with nine other baseline techniques. Eight of the baseline techniques are ensemble classifiers as the ASOF technique is based on ensemble methodology. The baseline techniques chosen for comparison are as follows:

  • Four of the chosen baseline techniques are Fitness-based Voting Ensemble Classifiers (FVEC), namely Majority Voting Ensemble Classifier (MVEC), Weighted Voting Ensemble Classifier (WVEC), Hard Instance Ensemble Classifier (HIEC) and Weighted Voting Hard Instance Ensemble Classifier (WVHIEC), which were proposed by the authors in a previous study [24].

  • As we are proposing an ensemble classifier of multiple fitness variants of a search-based algorithm, we also compare our proposed approach with four traditional Machine Learning (ML) ensembles, which have been successfully validated in previous studies for the determination of change prone classes [1], [2], [7]. These ML classifiers are Random Forests (RF), Bagging (BG), Adaptive Boosting (AB) and LogitBoost (LB).

  • Certain literature studies [1], [5] have found Logistic Regression (LR), a statistical technique, to be an effective classifier for determining change prone classes. Thus, we also compare our proposed approach with models developed using the LR technique.

Thus, the primary contributions of the study are a) assessing the influence of variation in fitness functions on the predictive capability of the developed SCP models, b) developing a unique adaptive framework for selecting an optimum fitness function for each instance of the training sample to predict change prone classes, and c) evaluating the capability of models developed using the ASOF framework by comparing them with individual fitness variants, four FVEC, four ML ensemble classifiers and the LR technique.

The results of the study confirm the superiority of the models developed using the ASOF framework, as they outperformed the individual fitness variant models. Furthermore, the proposed models were found to be comparable with the baseline techniques. Software practitioners can use this framework to correctly identify change prone classes. Focusing attention on these classes during the initial phases of the software development life cycle would help ensure a maintainable, better-quality software product.

The study is organized in the following manner. A summary of related literature studies is given in Section 2. The study's empirical research background is explained in Section 3. The proposed ASOF framework is described in Section 4. Section 5 states the Research Questions (RQs) and the design of the experiment. Section 6 briefly describes the baseline techniques used for comparison. Section 7 examines the answers to the RQs of the study. Section 8 discusses the threats to the validity of the study's results, while the conclusions and the proposed future work are given in Section 9.

Section snippets

Related work

This section discusses the related literature studies. It has been divided into four parts which are as follows:

Background of empirical research

This section briefly describes the CPSO algorithm and the various fitness functions used in the study.
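As a reference point, the seven performance metrics named in the Introduction can be computed from a binary confusion matrix as in the Python sketch below. The formulas are the definitions commonly used in the defect and change prediction literature and are an assumption here, since this excerpt does not state them; the paper's own definitions may differ in detail. The function name `fitness_metrics` and the example counts are hypothetical.

```python
import math
from typing import Dict


def fitness_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Candidate fitness values from a binary confusion matrix
    (positive class = change prone). Definitions are assumed, not quoted."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # probability of detection
    precision = ratio(tp, tp + fp)
    specificity = ratio(tn, tn + fp)
    pf = 1.0 - specificity             # probability of false alarm
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        # Normalized distance from the ideal point (pf = 0, recall = 1).
        "balance": 1.0 - math.sqrt(pf ** 2 + (1.0 - recall) ** 2) / math.sqrt(2.0),
        "f_measure": ratio(2.0 * precision * recall, precision + recall),
        "g_mean1": math.sqrt(precision * recall),
        "g_mean2": math.sqrt(recall * specificity),
        "g_measure": ratio(2.0 * recall * specificity, recall + specificity),
        "precision": precision,
    }


# Hypothetical counts for one candidate model on one dataset.
metrics = fitness_metrics(tp=30, fp=10, tn=50, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Each entry of the returned dictionary could serve as the objective of one fitness variant, so the same search algorithm is steered toward a different trade-off between the two classes.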

ASOF framework

The ASOF framework involves developing a predictor which outputs the optimum fitness function based on the structural characteristics of a class. The structural characteristics are quantified by the OO metrics of the corresponding dataset. We implemented a tool for the ASOF framework. However, we first need to prepare appropriate training data for the ASOF framework. In order to do so, we need to first evaluate the performance of the individual CPSO fitness variant models on a specific dataset. Thus, we

Research questions & experimental framework

This section discusses in detail the research questions and the experimental framework. The experimental framework includes a description of the independent and dependent variables, data collection, feature selection, performance measures, validation framework and statistical tests used in this study.

Analysis and results

This section presents in detail the results of the study and also answers the RQs of the study (Section 1).

Threats to validity

This section discusses the various possible threats to the results of the study. It is important to evaluate these threats to strengthen the validity of the obtained results.

Conclusion validity is commonly addressed as the statistical validity of the obtained conclusions. We address conclusion validity threats by a) using two non-parametric statistical tests, namely the Friedman and Wilcoxon tests, to ensure the statistical validity of the study's results, b) using 30 runs and ten-fold cross validation to

Conclusions

This study proposes a novel framework, namely Adaptive Selection of Optimum Fitness function (ASOF). It is used for predicting the optimum fitness function for each instance of the dataset based on its structural characteristics (OO metrics). The output of the ASOF model is one of the seven possible fitness functions (Accuracy, Balance, G-Mean1, G-Mean2, G-measure, F-measure and Precision) explored in the study. Thereafter, SCP models, which are developed using Constricted Particle Swarm

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (70)

  • S. Hosseini et al.

    A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction

    Inf. Softw. Technol.

    (2018)
  • D. Ryu et al.

    Effective multi-objective naïve bayes learning for cross-project defect prediction

    Appl. Soft Comput.

    (2016)
  • L.C. Briand et al.

    Exploring the relationships between design measures and software quality in object-oriented systems

    J. Syst. Softw.

    (2000)
  • R. Malhotra et al.

    Investigation of relationship between object-oriented metrics and change proneness

    Int. J. Mach. Learn. Cyber.

    (2013)
  • R. Malhotra et al.

    An empirical study for software change prediction using imbalanced data

    Empir. Softw. Eng.

    (2017)
  • A.G. Koru et al.

    Identifying and characterizing change-prone classes in two large-scale open-source products

    J. Syst. Softw.

    (2007)
  • A.G. Koru et al.

    Comparing high-change modules and modules with the highest measurement values in two large-scale open-source products

    IEEE Trans. Softw. Eng.

    (2005)
  • M.O. Elish et al.

    A suite of metrics for quantifying historical changes to predict future change-prone classes in object-oriented software

    J. Softw.

    (2013)
  • E. Giger et al.

    Can we predict type of code changes? An empirical analysis

  • R. Malhotra et al.

    An exploratory study for software change prediction in object-oriented systems using hybridized techniques

    Autom. Softw. Eng.

    (2017)
  • Y. Zhou et al.

    Examining the potentially confounding effect of class size on the associations between object metrics and change proneness

    IEEE Trans. Softw. Eng.

    (2009)
  • H. Lu et al.

    The ability of object-oriented metrics to predict change-proneness: a meta-analysis

    Empir. Softw. Eng. J.

    (2012)
  • M. Harman et al.

    Metrics are fitness functions too

  • M. Harman et al.

    Search Based Software Engineering: Techniques, Taxonomy, Tutorial, in Empirical Software Engineering and Verification

    (2012)
  • M.W. Aslam

    Selection of fitness function in genetic programming for binary classification

  • F. Ferrucci et al.

    Genetic programming for effort estimation: an analysis of the impact of different fitness functions

  • M.B. Abdelhalim et al.

    Particle swarm optimization for HW/SW partitioning

    Particle Swarm Optimization

    (2009)
  • Y. Abdi et al.

    A hybrid one-class rule learning approach based on swarm intelligence for software fault prediction

    Innov. Syst. Softw. Eng.

    (2015)
  • J. Kennedy et al.

    Matching algorithms to problems: an experimental test of the particle swarm and some genetic algorithms on the multimodal problem generator

  • R. Eberhart et al.

    A new optimizer using particle swarm theory

  • M. Clerc et al.

    The particle swarm-explosion, stability, and convergence in a multidimensional complex space

    IEEE Trans. Evol. Comput.

    (2002)
  • T. Sousa et al.

    A particle swarm data miner

  • D. Romano et al.

    Using source code metrics to predict change-prone java interfaces

  • V.K. Bardsiri et al.

    A flexible method to estimate the software development effort based on the classification of projects and localization of comparisons

    Empir. Softw. Eng.

    (2014)
  • A.F. Sheta et al.

    Evaluating software cost estimation models using particle swarm optimisation and fuzzy logic for NASA projects: a comparative study

    Int. J. Bio-Inspir. Comput.

    (2010)

Ruchika Malhotra is Associate Head and Associate Professor in the Discipline of Software Engineering, Department of Computer Science & Engineering, Delhi Technological University (formerly Delhi College of Engineering), Delhi, India. She is Associate Dean in Industrial Research and Development, Delhi Technological University. She was awarded the prestigious Raman Fellowship for pursuing postdoctoral research at Indiana University Purdue University Indianapolis, USA. She received her master's and doctorate degrees in software engineering from the University School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India. She was an Assistant Professor at the University School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India. She received the IBM Faculty Award 2013. She is a recipient of the Commendable Research Award from Delhi Technological University. Her h-index is 26 as reported by Google Scholar. She is the author of the book titled “Empirical Research in Software Engineering” published by CRC Press and co-author of a book on Object Oriented Software Engineering published by PHI Learning. Her research interests are in software testing, improving software quality, statistical and adaptive prediction models, software metrics and the definition and validation of software metrics. She has published more than 160 research papers in international journals and conferences.

Megha Khanna is currently pursuing her doctoral degree at Delhi Technological University. She is currently working at Sri Guru Gobind Singh College of Commerce, University of Delhi. She completed her master's degree in software engineering in 2010 at the University School of Information Technology, Guru Gobind Singh Indraprastha University, India. She received her undergraduate degree in computer science (Hons.) in 2007 from Acharya Narendra Dev College, University of Delhi. She is a recipient of the Commendable Research Award from Delhi Technological University. Her research interests are in software quality improvement, applications of machine learning techniques in change prediction, and the definition and validation of software metrics. She has various publications in international conferences and journals.
