GA-based method for feature selection and parameters optimization for machine learning regression applied to software effort estimation

https://doi.org/10.1016/j.infsof.2010.05.009

Abstract

Context

In the software industry, project managers usually rely on their previous experience to estimate the number of man-hours required for each software project. The accuracy of such estimates is a key factor in the efficient allocation of human resources. Machine learning techniques such as radial basis function (RBF) neural networks, multi-layer perceptron (MLP) neural networks, support vector regression (SVR), bagging predictors and regression-based trees have recently been applied to estimating software development effort. Some works have demonstrated that the accuracy of software effort estimates strongly depends on the values of the parameters of these methods. In addition, it has been shown that the selection of the input features may also have an important influence on estimation accuracy.

Objective

This paper proposes and investigates a genetic algorithm method for simultaneously (1) selecting an optimal input feature subset and (2) optimizing the parameters of machine learning methods, aiming at a higher accuracy level for software effort estimates.

Method

Simulations are carried out using six benchmark data sets of software projects, namely, Desharnais, NASA, COCOMO, Albrecht, Kemerer and Koten and Gray. The results are compared to those obtained by methods proposed in the literature using neural networks, support vector machines, multiple additive regression trees, bagging, and Bayesian statistical models.

Results

In all data sets, the simulations showed that the proposed GA-based method was able to improve the performance of the machine learning methods. The simulations also demonstrated that the proposed method outperforms several methods recently reported in the literature for software effort estimation. Furthermore, the use of GA for feature selection considerably reduced the number of input features for five of the six data sets used in our analysis.

Conclusions

The combination of input feature selection and parameter optimization of machine learning methods improves the accuracy of software development effort estimates. In addition, it reduces model complexity, which may help in understanding the relevance of each input feature. Consequently, some input features can be ignored without loss of accuracy in the estimates.

Introduction

Experienced software project managers develop the ability to find the trade-off between software quality and time-to-market. Efficiency in resource allocation is one of the main factors in finding such an equilibrium point. In this context, estimating software development effort is essential.

A study published by the Standish Group's CHAOS report states that 66% of the software projects analyzed were delivered late or over budget or, worse, were never finished [8]. The failure rate of software projects is still very high [9], [10]; it is estimated that over the last 5 years such software project failures have cost the US economy between $25 billion and $75 billion [9], [10]. In this context, both overestimates and underestimates of software effort are harmful to software companies [7]. Indeed, one of the major causes of such failures is inaccurate estimation of effort in software projects [10]. Hence, investigating novel methods for improving the accuracy of such estimates is essential to strengthening software companies' competitive strategy.

Several methods have been investigated for software effort estimation, including traditional methods such as the constructive cost model (COCOMO) [11], and, more recently, machine learning techniques such as radial basis function (RBF) neural networks [12], MLP neural networks [26], multiple additive regression trees [30], wavelet neural networks [27], bagging predictors [13] and support vector regression (SVR) [10]. Machine learning techniques use data from past projects to build a regression model that is subsequently employed to predict the effort of new software projects.

Genetic algorithms (GAs) have been shown to be very efficient at finding optimal or near-optimal solutions in a great variety of problems. They avoid pitfalls of traditional optimization algorithms, such as getting trapped in local minima [14]. Recently, Huang and Wang proposed a genetic algorithm to simultaneously optimize the parameters and the input feature subset of support vector machines (SVMs) without loss of accuracy in classification problems [15]. Two factors substantially influence the accuracy and computation time of machine learning techniques: (1) the choice of the input feature subset and (2) the choice of the parameter values of the technique. Hence, according to Huang and Wang, simultaneously optimizing these two factors improves the accuracy of machine learning techniques for classification problems.

Oliveira employed grid selection for optimizing SVR parameters for software effort estimation [10]. His work did not investigate feature selection methods; all input features were used for building the regression models. Huang and Wang demonstrated that the simultaneous optimization of the parameters and the feature subset improves the accuracy of SVM results for classification problems [15]. Their results showed that the GA-based method outperforms grid selection for SVM parameter optimization in classification problems [15]. These results motivated us to adapt the ideas of Huang and Wang to machine learning regression methods. In particular, we aim at reducing the number of input features while keeping the accuracy level of software effort estimates.

In this context, this paper adapts the method proposed by Huang and Wang [15] for feature selection and parameter optimization of machine learning methods applied to software effort estimation (a regression problem). The idea behind our method is to adapt the fitness function of the genetic algorithm and the set of parameters to be optimized. Notice that support vector machines for regression (SVR) have three important parameters, whereas SVMs for classification have only two. Furthermore, our method generalizes the method of Huang and Wang [15], since we apply it to three different machine learning techniques (SVR, MLP neural networks and M5P model trees), whereas their method was developed and investigated solely for SVMs [15].

The main contributions of this paper are threefold: (1) to develop a novel method for software effort estimation based on genetic algorithms applied to input feature selection and parameter optimization of machine learning methods; (2) to investigate the proposed method by applying it to three machine learning techniques, namely, (i) support vector regression (SVR), (ii) multi-layer perceptron (MLP) neural networks, and (iii) model trees; and (3) to show that our method outperforms recent methods proposed and investigated in the literature for software effort estimation [5], [30], [26], [33], [13], [25], [10].

This paper is organized as follows. Section 2 reviews the regression methods used in this paper and Section 3 reviews some basic genetic algorithm (GA) concepts. In Section 4, we present our GA-based method for feature selection and optimization of machine learning parameters for software effort estimation. The experiments and results are discussed in Section 5. Finally, Section 6 concludes the paper.

Section snippets

Regression methods

The goal of regression methods is to build a function f(x) that adequately maps a set of independent variables (X1, X2, …, Xn) into a dependent variable Y. In our case, we aim to build regression models from a training data set and subsequently use them to predict the total effort, in man-months, for the development of new software projects.
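As a minimal illustration of this mapping (the data and the single-feature linear model below are assumptions for the sketch, not from the paper, which uses SVR, MLP and M5P), a regression model can be fit to past projects and then queried for a new one:

```python
# Sketch: fit a one-feature linear regression f(x) = w*x + b by ordinary
# least squares on hypothetical past projects, then predict the effort of
# a new project. Real effort models use many features and richer learners.

def fit_linear(xs, ys):
    """Return (w, b) minimizing sum((w*x + b - y)^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return w, my - w * mx

# Hypothetical training data: project size (function points) -> effort (man-months)
sizes  = [100.0, 200.0, 300.0, 400.0]
effort = [ 10.0,  19.0,  31.0,  40.0]

w, b = fit_linear(sizes, effort)

def predict_effort(size):
    return w * size + b

print(predict_effort(250.0))  # -> 25.0 man-months
```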

Genetic algorithm

A genetic algorithm (GA) is a global optimization algorithm based on the theory of natural selection. In a GA, individuals of a population with good phenotypic characteristics have greater chances of survival and reproduction, while individuals less adapted to the environment tend to disappear. Thus, GAs favor the combination of the fittest individuals, i.e., the candidates most promising for the solution of the problem [14]. A GA uses a random strategy of parallel search,
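The selection-crossover-mutation cycle can be sketched on a toy problem (this is an illustrative GA, not the paper's algorithm; the "OneMax" fitness, population size and operator rates are assumptions):

```python
import random

# Toy GA: maximize the number of 1-bits in a 20-bit chromosome ("OneMax").
# Tournament selection, one-point crossover and bit-flip mutation are
# common, simple operator choices; all constants here are illustrative.

random.seed(0)
N_BITS, POP, GENS = 20, 30, 60

def fitness(ch):
    return sum(ch)

def tournament(pop):
    a, b = random.sample(pop, 2)       # pick two, keep the fitter
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    cut = random.randrange(1, N_BITS)  # one-point crossover
    return p1[:cut] + p2[cut:]

def mutate(ch, rate=0.02):
    return [bit ^ 1 if random.random() < rate else bit for bit in ch]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(tournament(pop), tournament(pop)))
           for _ in range(POP)]

best = max(pop, key=fitness)
print(fitness(best))  # converges to (or near) the optimum of 20
```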

Chromosome design

Machine learning techniques have parameters that almost always significantly influence their performance. For instance, the complexity parameter, C, of support vector machines (SVMs) has an important influence on performance and thereby needs to be carefully selected for a given data set. In this article we investigate three machine learning techniques, namely, SVR, MLP and M5P (briefly reviewed in Section 2). Table 1 shows the parameters of each of these techniques.
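One common way to realize such a chromosome is a binary feature mask concatenated with normalized parameter genes. The encoding and parameter ranges below are illustrative assumptions for the sketch, not the paper's exact design:

```python
import random

# Illustrative chromosome for joint feature selection and parameter tuning:
# one bit per input feature, plus three real-valued genes in [0, 1) that are
# scaled into the (assumed) ranges of SVR's C, epsilon and RBF-kernel gamma.

N_FEATURES = 8
PARAM_RANGES = {"C": (0.1, 1000.0), "epsilon": (0.001, 1.0), "gamma": (0.0001, 8.0)}

def random_chromosome(rng):
    mask = [rng.randint(0, 1) for _ in range(N_FEATURES)]
    genes = [rng.random() for _ in PARAM_RANGES]   # normalized parameter genes
    return mask + genes

def decode(ch):
    """Split a chromosome into the selected feature indices and SVR parameters."""
    mask, genes = ch[:N_FEATURES], ch[N_FEATURES:]
    selected = [i for i, bit in enumerate(mask) if bit == 1]
    params = {name: lo + g * (hi - lo)
              for (name, (lo, hi)), g in zip(PARAM_RANGES.items(), genes)}
    return selected, params

rng = random.Random(42)
selected, params = decode(random_chromosome(rng))
print(selected, params)
```

The GA's fitness function would then train the learner using only `selected` features and the decoded `params`, rewarding low estimation error (and, optionally, fewer features).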

Experiments

In this article we use six benchmark software effort data sets to evaluate the proposed method: (1) the Desharnais data set [24], [25], (2) a data set from NASA [12], [10], (3) the COCOMO data set [26], (4) the Albrecht data set [4], [6], (5) the Kemerer data set [11], and (6) a data set used by Koten and Gray [5], which contains data from database-oriented software systems developed using a specific 4GL tool suite. These data sets are used in many articles to evaluate the performance of novel

Conclusion

This article proposed and investigated a novel method for software effort estimation. The proposed method applies a genetic algorithm to simultaneously select the optimal input feature subset and the parameters of a machine learning technique used for regression.

Although very popular in the literature, the metrics MMRE and PRED(25) do not, as discussed in [1], [31], [32], provide enough support for adequate statistical analysis. However, most of the recently published articles on
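For reference, the two metrics mentioned above are typically defined via the magnitude of relative error, MRE = |actual − predicted| / actual; MMRE is its mean and PRED(25) is the fraction of projects with MRE ≤ 0.25. A sketch with hypothetical effort values:

```python
# Standard software-effort accuracy metrics, computed on hypothetical
# actual/predicted effort pairs (the values are illustrative only).

def mre(actual, predicted):
    return abs(actual - predicted) / actual

def mmre(actuals, predictions):
    """Mean Magnitude of Relative Error."""
    return sum(mre(a, p) for a, p in zip(actuals, predictions)) / len(actuals)

def pred(actuals, predictions, level=0.25):
    """PRED(25): fraction of estimates with MRE <= 0.25."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= level)
    return hits / len(actuals)

actuals     = [10.0, 20.0, 30.0, 40.0]   # man-months
predictions = [12.0, 18.0, 45.0, 41.0]

print(round(mmre(actuals, predictions), 3))  # -> 0.206
print(pred(actuals, predictions))            # -> 0.75
```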

Acknowledgements

This work was partially supported by the National Institute of Science and Technology for Software Engineering (INES), funded by CNPq and FACEPE, grants 573964/2008-4 and APQ-1037-1.03/08 and by Petrobrás.

References (33)

  • R. Agarwal et al., Estimating software projects, SIGSOFT Software Engineering Notes (2001)
  • Standish, Project Success Rates Improved Over 10 Years, Tech. Rep., Standish Group (2004)
  • R.N. Charette, Why software fails, IEEE Spectrum (2005)
  • C.F. Kemerer, An empirical validation of software cost estimation models, Communications of the ACM (1987)
  • M. Shin et al., Empirical data modeling in software engineering using radial basis functions, IEEE Transactions on Software Engineering (2000)
  • P.L. Braga et al., Bagging predictors for estimation of software project effort, IEEE International Joint Conference on Neural Networks (IJCNN 2007) (2007)