GA-based method for feature selection and parameters optimization for machine learning regression applied to software effort estimation

https://doi.org/10.1016/j.infsof.2010.05.009

Abstract

Context

In the software industry, project managers usually rely on their previous experience to estimate the number of man-hours required for each software project. The accuracy of such estimates is a key factor in the efficient allocation of human resources. Machine learning techniques such as radial basis function (RBF) neural networks, multi-layer perceptron (MLP) neural networks, support vector regression (SVR), bagging predictors and regression-based trees have recently been applied to estimating software development effort. Some works have demonstrated that the accuracy of software effort estimates strongly depends on the values of the parameters of these methods. In addition, it has been shown that the selection of the input features may also have an important influence on estimation accuracy.

Objective

This paper proposes and investigates a genetic algorithm method for simultaneously (1) selecting an optimal input feature subset and (2) optimizing the parameters of machine learning methods, aiming at a higher accuracy level for software effort estimates.

Method

Simulations are carried out using six benchmark data sets of software projects, namely, Desharnais, NASA, COCOMO, Albrecht, Kemerer and Koten and Gray. The results are compared to those obtained by methods proposed in the literature using neural networks, support vector machines, multiple additive regression trees, bagging, and Bayesian statistical models.

Results

In all data sets, the simulations showed that the proposed GA-based method was able to improve the performance of the machine learning methods. The simulations also demonstrated that the proposed method outperforms several methods recently reported in the literature for software effort estimation. Furthermore, the use of GA for feature selection considerably reduced the number of input features for five of the six data sets used in our analysis.

Conclusions

The combination of input feature selection and parameter optimization of machine learning methods improves the accuracy of software development effort estimates. In addition, it reduces model complexity, which may help in understanding the relevance of each input feature. Consequently, some input features can be ignored without loss of accuracy in the estimates.

Introduction

Experienced software project managers develop the ability to find the trade-off between software quality and time-to-market. Efficiency in resource allocation is one of the main factors in finding such an equilibrium point. In this context, estimating software development effort is essential.

A study published by the Standish Group's CHAOS report states that 66% of the software projects analyzed were delivered late or over budget or, worse, were never finished [8]. The failure rate of software projects is still very high [9], [10]; it is estimated that over the last 5 years such software project failures have cost the US economy between $25 billion and $75 billion [9], [10]. In this context, both overestimates and underestimates of software effort are harmful to software companies [7]. Indeed, one of the major causes of such failures is inaccurate estimation of effort in software projects [10]. Hence, investigating novel methods for improving the accuracy of such estimates is essential to strengthening software companies' competitive strategy.

Several methods have been investigated for software effort estimation, including traditional methods such as the constructive cost model (COCOMO) [11], and, more recently, machine learning techniques such as radial basis function (RBF) neural networks [12], MLP neural networks [26], multiple additive regression trees [30], wavelet neural networks [27], bagging predictors [13] and support vector regression (SVR) [10]. Machine learning techniques use data from past projects to build a regression model that is subsequently employed to predict the effort of new software projects.

Genetic algorithms (GAs) have been shown to be very efficient at finding optimal or near-optimal solutions in a great variety of problems. They avoid pitfalls of traditional optimization algorithms, such as getting trapped in local minima [14]. Recently, Huang and Wang proposed a genetic algorithm to simultaneously optimize the parameters and the input feature subset of support vector machines (SVMs) without loss of accuracy in classification problems [15]. Two factors substantially influence the accuracy and computation time of machine learning techniques: (1) the choice of the input feature subset and (2) the choice of the parameter values of the technique. Hence, according to Huang and Wang, simultaneously optimizing these two factors improves the accuracy of machine learning techniques for classification problems.

Oliveira employed grid selection for optimizing SVR parameters for software effort estimation [10]. His work did not investigate feature selection methods; all input features were used for building the regression models. Huang and Wang demonstrated that the simultaneous optimization of the parameters and the feature subset improves the accuracy of SVM results for classification problems [15]. Their results showed that the GA-based method outperforms grid selection for SVM parameter optimization in classification problems [15]. These results motivated us to adapt the ideas of Huang and Wang to machine learning regression methods. In particular, we aim at reducing the number of input features while keeping the accuracy level of software effort estimates.

In this context, this paper adapts the method proposed by Huang and Wang [15] for feature selection and parameter optimization of machine learning methods applied to software effort estimation (a regression problem). The idea behind our method is to adapt the fitness function of the genetic algorithm and the set of parameters to be optimized. Notice that support vector machines for regression (SVR) have three important parameters, whereas SVMs for classification have only two. Furthermore, our method generalizes the method of Huang and Wang [15], since we apply it to three different machine learning techniques (SVR, MLP neural networks and M5P model trees), whereas their method was developed and investigated solely for SVMs [15].

The main contributions of this paper are threefold: (1) to develop a novel method for software effort estimation based on genetic algorithms applied to input feature selection and parameter optimization of machine learning methods; (2) to investigate the proposed method by applying it to three machine learning techniques, namely, (i) support vector regression (SVR), (ii) multi-layer perceptron (MLP) neural networks, and (iii) model trees; and (3) to show that our method outperforms recent methods proposed and investigated in the literature for software effort estimation [5], [30], [26], [33], [13], [25], [10].

This paper is organized as follows. Section 2 reviews the regression methods used in this paper and Section 3 reviews some basic genetic algorithm (GA) concepts. In Section 4, we present our GA-based method for feature selection and optimization of machine learning parameters for software effort estimation. The experiments and results are discussed in Section 5. Finally, Section 6 concludes the paper.

Section snippets

Regression methods

The goal of regression methods is to build a function f(x) that adequately maps a set of independent variables (X1, X2, …, Xn) into a dependent variable Y. In our case, we aim to build regression models from a training data set and subsequently use them to predict the total effort, in man-months, for the development of new software projects.
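As a minimal illustration of this mapping (the data and the single-feature linear model below are assumptions for the sketch, not from the paper, which uses SVR, MLP and M5P), a regression model can be fit to past projects and then queried for a new one:

```python
# Sketch: fit a one-feature linear regression f(x) = w*x + b by ordinary
# least squares on hypothetical past projects, then predict the effort of
# a new project. Real effort models use many features and richer learners.

def fit_linear(xs, ys):
    """Return (w, b) minimizing sum((w*x + b - y)^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return w, my - w * mx

# Hypothetical training data: project size (function points) -> effort (man-months)
sizes  = [100.0, 200.0, 300.0, 400.0]
effort = [ 10.0,  19.0,  31.0,  40.0]

w, b = fit_linear(sizes, effort)

def predict_effort(size):
    return w * size + b

print(predict_effort(250.0))  # -> 25.0 man-months
```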

Genetic algorithm

A genetic algorithm (GA) is a global optimization algorithm based on the theory of natural selection. In a GA, individuals of a population with good phenotypic characteristics have greater chances of survival and reproduction, while individuals less adapted to the environment tend to disappear. Thus, GAs favor the combination of the fittest individuals, i.e., the candidates most promising for the solution of the problem [14]. A GA uses a random strategy of parallel search,
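The selection-crossover-mutation cycle can be sketched on a toy problem (this is an illustrative GA, not the paper's algorithm; the "OneMax" fitness, population size and operator rates are assumptions):

```python
import random

# Toy GA: maximize the number of 1-bits in a 20-bit chromosome ("OneMax").
# Tournament selection, one-point crossover and bit-flip mutation are
# common, simple operator choices; all constants here are illustrative.

random.seed(0)
N_BITS, POP, GENS = 20, 30, 60

def fitness(ch):
    return sum(ch)

def tournament(pop):
    a, b = random.sample(pop, 2)       # pick two, keep the fitter
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    cut = random.randrange(1, N_BITS)  # one-point crossover
    return p1[:cut] + p2[cut:]

def mutate(ch, rate=0.02):
    return [bit ^ 1 if random.random() < rate else bit for bit in ch]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(tournament(pop), tournament(pop)))
           for _ in range(POP)]

best = max(pop, key=fitness)
print(fitness(best))  # converges to (or near) the optimum of 20
```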

Chromosome design

Machine learning techniques have parameters that almost always significantly influence their performance. For instance, the complexity parameter, C, of support vector machines (SVMs) has an important influence on performance and thereby needs to be carefully selected for a given data set. In this article we investigate three machine learning techniques, namely, SVR, MLP and M5P (briefly reviewed in Section 2). Table 1 shows the parameters of each of these techniques.
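One common way to realize such a chromosome is a binary feature mask concatenated with normalized parameter genes. The encoding and parameter ranges below are illustrative assumptions for the sketch, not the paper's exact design:

```python
import random

# Illustrative chromosome for joint feature selection and parameter tuning:
# one bit per input feature, plus three real-valued genes in [0, 1) that are
# scaled into the (assumed) ranges of SVR's C, epsilon and RBF-kernel gamma.

N_FEATURES = 8
PARAM_RANGES = {"C": (0.1, 1000.0), "epsilon": (0.001, 1.0), "gamma": (0.0001, 8.0)}

def random_chromosome(rng):
    mask = [rng.randint(0, 1) for _ in range(N_FEATURES)]
    genes = [rng.random() for _ in PARAM_RANGES]   # normalized parameter genes
    return mask + genes

def decode(ch):
    """Split a chromosome into the selected feature indices and SVR parameters."""
    mask, genes = ch[:N_FEATURES], ch[N_FEATURES:]
    selected = [i for i, bit in enumerate(mask) if bit == 1]
    params = {name: lo + g * (hi - lo)
              for (name, (lo, hi)), g in zip(PARAM_RANGES.items(), genes)}
    return selected, params

rng = random.Random(42)
selected, params = decode(random_chromosome(rng))
print(selected, params)
```

The GA's fitness function would then train the learner using only `selected` features and the decoded `params`, rewarding low estimation error (and, optionally, fewer features).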

Experiments

In this article we use six benchmark software effort data sets to evaluate the proposed method: (1) the Desharnais data set [24], [25], (2) a data set from NASA [12], [10], (3) the COCOMO data set [26], (4) the Albrecht data set [4], [6], (5) the Kemerer data set [11], and (6) a data set used by Koten and Gray [5], which contains data from database-oriented software systems developed using a specific 4GL tool suite. These data sets are used in many articles to evaluate the performance of novel

Conclusion

This article proposed and investigated a novel method for software effort estimation. The proposed method applies a genetic algorithm to simultaneously select the optimal input feature subset and the parameters of a machine learning technique used for regression.

Although very popular in the literature, the metrics MMRE and PRED(25) do not, as discussed in [1], [31], [32], provide enough support for adequate statistical analysis. However, most of the recently published articles on
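For reference, the two metrics mentioned above are typically defined via the magnitude of relative error, MRE = |actual − predicted| / actual; MMRE is its mean and PRED(25) is the fraction of projects with MRE ≤ 0.25. A sketch with hypothetical effort values:

```python
# Standard software-effort accuracy metrics, computed on hypothetical
# actual/predicted effort pairs (the values are illustrative only).

def mre(actual, predicted):
    return abs(actual - predicted) / actual

def mmre(actuals, predictions):
    """Mean Magnitude of Relative Error."""
    return sum(mre(a, p) for a, p in zip(actuals, predictions)) / len(actuals)

def pred(actuals, predictions, level=0.25):
    """PRED(25): fraction of estimates with MRE <= 0.25."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= level)
    return hits / len(actuals)

actuals     = [10.0, 20.0, 30.0, 40.0]   # man-months
predictions = [12.0, 18.0, 45.0, 41.0]

print(round(mmre(actuals, predictions), 3))  # -> 0.206
print(pred(actuals, predictions))            # -> 0.75
```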

Acknowledgements

This work was partially supported by the National Institute of Science and Technology for Software Engineering (INES), funded by CNPq and FACEPE, grants 573964/2008-4 and APQ-1037-1.03/08 and by Petrobrás.

References (33)

  • R. Agarwal et al., Estimating software projects, SIGSOFT Software Engineering Notes (2001)
  • Standish, Project Success Rates Improved Over 10 Years, Tech. Rep., Standish Group (2004)
  • R.N. Charette, Why software fails, IEEE Spectrum (2005)
  • C.F. Kemerer, An empirical validation of software cost estimation models, Communications of the ACM (1987)
  • M. Shin et al., Empirical data modeling in software engineering using radial basis functions, IEEE Transactions on Software Engineering (2000)
  • P.L. Braga et al., Bagging predictors for estimation of software project effort, IEEE International Joint Conference on Neural Networks (IJCNN 2007) (2007)