Adaptive ridge regression system for software cost estimating on multi-collinear datasets

https://doi.org/10.1016/j.jss.2010.07.032

Abstract

Cost estimation is one of the most critical activities in the software life cycle. In the past decades, a number of techniques have been proposed for cost estimation, and linear regression remains the most frequently applied method in the literature. However, a number of studies point out that linear regression is prone to low prediction accuracy. The low accuracy stems from a number of causes, such as non-linearity and non-normality. A less addressed cause is multi-collinearity, which may lead to unstable regression coefficients. Moreover, multi-collinearity has been reported to be widespread across software engineering datasets. To tackle this problem and improve regression accuracy, we propose a holistic problem-solving approach, the adaptive ridge regression system, which integrates data transformation, multi-collinearity diagnosis, the ridge regression technique, and multi-objective optimization. The proposed system is tested on two real-world datasets and compared with OLS regression, stepwise regression, and machine learning methods. The results indicate that the adaptive ridge regression system can significantly improve the performance of regression on multi-collinear datasets and produces more explainable results than machine learning methods.

Introduction

The production of low-cost, high-quality software in a short time has become the ultimate goal of the software industry. To achieve this goal, software development processes need to be well managed and controlled. One of the most important activities is estimating the cost devoted to building the software, a task known as Software Cost Estimation (Boehm, 1981). Aiming at accurate estimation, several techniques have been published in the past decades, such as expert judgment (Jorgensen, 2004, Gruschke and Jorgensen, 2008), parametric models (Boehm, 1981, Albrecht and Gaffney, 1983, Putnam and Myers, 1991), machine learning methods (Shepperd and Schofield, 1997, Heiat, 2002, Pendharkar et al., 2005, Kumar et al., 2008, Keung et al., 2008, Mendes and Mosley, 2008, Li et al., 2009a, Li et al., 2009b) and linear regression methods (Miyazaki et al., 1994, Costagliola et al., 2005, Berlin et al., 2009, Huang et al., 2008).

According to a recent overview by Jorgensen and Shepperd (2007), linear regression is still the most frequently applied method in the cost estimation literature. A large number of studies used linear regression methods (especially OLS regression) as a benchmark against their proposed methods and concluded that regressions have lower accuracy than the new methods. However, as Kitchenham and Mendes (2009) recently pointed out, these conclusions might be biased due to inappropriate use of regression methods. Some assumptions of linear regression, such as homoscedasticity, moderate outliers, and normal errors, might be ignored by users. Among all these assumptions, independence of the explanatory variables (or project features) is one of the least addressed in the cost estimation literature. This assumption is violated when the multi-collinearity phenomenon appears: two or more explanatory variables in a multiple regression model are highly correlated in linear form, which often causes linear regression to lose stability and effectiveness (Kutner et al., 2005).
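
To make this instability concrete, the following sketch, which is purely illustrative and not part of the original study, fits OLS repeatedly to data generated from the same process with two nearly collinear predictors. The sum of the two coefficients stays stable, but the individual coefficients swing wildly between samples:

```python
# Illustrative only: OLS on two nearly collinear predictors.
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(n=30):
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)  # x2 is almost a copy of x1
    y = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

for _ in range(3):
    # Prints intercept, b1, b2: b1 + b2 stays near 4, but b1 and b2
    # individually vary wildly from sample to sample.
    print(np.round(fit_ols(), 2))
```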

On the other hand, inter-correlated explanatory variables (or project features) are widespread across software engineering datasets (Shepperd and Kadoda, 2001, Mendes et al., 2003). For example, the ‘number of lines of source code’ is often regarded as positively related to the ‘number of inputs’, ‘number of outputs’, etc. In the literature, there are techniques to alleviate the effects of multi-collinearity, such as collecting additional explanatory variables or dropping some of them (Neter et al., 1983). However, it is either difficult to select appropriate variables from a large set of project features, such as those provided by ISBSG (ISBSG, 2007), or time-consuming to gather the information needed to create a new variable. Dropping variables, in turn, might leave only a few explanatory variables in the regression equation; too few variables reduce the explainability of the equation and make it vulnerable to changing collinearities among the explanatory variables.

The ridge regression technique (Hoerl and Kennard, 1970) provides a good alternative for dealing with multi-collinearity: it makes full use of the existing data and avoids adding or dropping explanatory variables. It also holds the potential (via the ridge parameter) to improve the fitting and prediction accuracy of regression methods. As ridge regression implements the central idea of regularization theory (Tikhonov and Arsenin, 1977), some studies (Agarwal et al., 2007, Yu and Liong, 2007) have pointed out that it can achieve prediction performance equal to or even better than machine learning methods such as support vector machines and artificial neural networks. In addition, the ridge regression equation is more transparent than black-box machine learning methods, especially neural networks.
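
For reference, ridge regression replaces the OLS normal equations with a regularized version. A minimal sketch, assuming a centered and standardized design matrix, follows:

```python
# Minimal sketch of the ridge estimator: beta = (X'X + kI)^{-1} X'y.
# k = 0 recovers OLS; increasing k shrinks and stabilizes the coefficients.
import numpy as np

def ridge_coefficients(X, y, k):
    """Ridge estimate for a centered/standardized design matrix X."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
```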

Ridge regression has been successfully applied in a number of research areas such as biology (Goeman, 2008), environmental science (Hessami et al., 2008), hydrology (Chokmani et al., 2008), and nuclear science (Zhou et al., 2001). More recently, ridge regression has been introduced for estimation issues in software engineering. Nguyen et al. (2008) applied the ridge regression technique to estimate the constrained coefficients of the COCOMO models. Papadopoulos et al. (2009) utilized ridge regression to generate confidence intervals for effort estimation. Parsa et al. (2008) applied ridge regression to produce classification scores for filtering out redundant features. However, none of these studies focused on using ridge regression to resolve the multi-collinearity problem in software cost datasets.

Based on ridge regression, we propose a novel problem-solving approach, the adaptive ridge regression (ARR) system, which combines different techniques for cost estimation on multi-collinear datasets. The ARR system consists of data transformation, multi-collinearity diagnosis, the ridge regression technique, and a multi-objective optimization to train the ridge parameter. The rest of this paper is organized as follows: Section 2 introduces multi-collinearity diagnosis and ridge regression; Section 3 describes the proposed adaptive ridge regression (ARR) system; Section 4 presents the real world datasets and the procedures for the empirical validation; Section 5 describes the experimental results and comparisons; Section 6 presents the threats to validity; the last section summarizes this work and points out possible future directions.

Section snippets

Multi-collinearity diagnosis and ridge regression

Prior to the detailed description of multi-collinearity, the multiple linear regression equation is presented. In general, a multiple linear regression equation has the following form: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + e$, where $y$ denotes the dependent variable, $x_i$, $i = 1, \ldots, p$, stands for the $i$th explanatory variable, $\beta_i$ is the $i$th regression coefficient, $\beta_0$ is the intercept, and the error term $e$ is random noise following a standard normal distribution.
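
This snippet does not show which diagnostic indicator the system uses; two standard choices, sketched below purely for illustration, are the variance inflation factor and a Belsley-style condition number (Belsley, 1991):

```python
# Illustrative diagnostics; the paper's exact indicator is not shown
# in the snippet above.
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation
    matrix of the predictors. VIF > 10 is a common warning threshold."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def condition_number(X):
    """Ratio of the largest to the smallest singular value of the
    column-scaled design matrix; values above ~30 suggest harmful
    collinearity."""
    Xs = X / np.linalg.norm(X, axis=0)  # scale columns to unit length
    s = np.linalg.svd(Xs, compute_uv=False)
    return s[0] / s[-1]
```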

Adaptive ridge regression for cost estimation

Based upon the concepts of ridge regression and the indicator of multi-collinearity diagnosis, this section proposes the adaptive ridge regression (ARR) system for cost estimation. Different from previous studies on ridge regression, this system includes a multi-objective optimization problem to search for the ridge parameter that maximizes accuracy and minimizes multi-collinearity. Before the details of the ARR system, the accuracy metrics are introduced.
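
The exact objectives and scalarization appear later in the paper; as a hedged sketch only, a grid search trading off fitting accuracy (here training MMRE) against a collinearity measure (here the log condition number of the regularized normal-equation matrix) through a weighted sum might look as follows, where the weight w and the scaling are assumptions made for illustration:

```python
# Hedged sketch of the ridge-parameter search, not the paper's exact scheme.
import numpy as np

def mmre(actual, predicted):
    return np.mean(np.abs(actual - predicted) / actual)

def search_ridge_parameter(X, y, ks, w=0.5):
    """Pick the ridge parameter k minimizing an (assumed) weighted sum of
    the training MMRE and a log-scaled collinearity indicator."""
    p = X.shape[1]
    best_k, best_obj = None, np.inf
    for k in ks:
        A = X.T @ X + k * np.eye(p)
        beta = np.linalg.solve(A, X.T @ y)
        acc = mmre(y, X @ beta)             # accuracy objective
        coll = np.log10(np.linalg.cond(A))  # collinearity objective
        obj = w * acc + (1.0 - w) * coll / 10.0  # crude scaling, assumed
        if obj < best_obj:
            best_k, best_obj = k, obj
    return best_k

# Example usage: k = search_ridge_parameter(X, y, ks=np.logspace(-3, 2, 50))
```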

Experiment set-up

In this section, we describe the structured experiment procedures on real world cost estimation datasets.

Results on Albrecht dataset

Table 3 summarizes the testing results across all 10 experiments. The table shows that ARR achieves the best average MMRE, average PRED(0.25), and average MdMRE. In terms of the standard deviation of the error metrics, ARR has the smallest std. of MMRE, the fifth smallest std. of PRED(0.25), and the third smallest std. of MdMRE.
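
For readers unfamiliar with these metrics, they follow the standard definitions (Conte et al., 1986), based on the magnitude of relative error MRE = |actual − estimated| / actual:

```python
# Standard accuracy metrics reported in Table 3.
import numpy as np

def mre(actual, estimated):
    return np.abs(actual - estimated) / actual

def mmre(actual, estimated):
    """Mean magnitude of relative error (lower is better)."""
    return np.mean(mre(actual, estimated))

def mdmre(actual, estimated):
    """Median magnitude of relative error; robust to extreme MREs."""
    return np.median(mre(actual, estimated))

def pred(actual, estimated, level=0.25):
    """Fraction of projects with MRE <= level, e.g. PRED(0.25)
    (higher is better)."""
    return np.mean(mre(actual, estimated) <= level)
```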

To further analyze the error metrics, we present box plots of MMRE, PRED(0.25), and MdMRE in Fig. 3.

The box plots show that ARR has the best medians and the shortest inter-quartile ranges.

Discussions

This section discusses the threats to validity, and the theoretical and practical implications of this study.

Summary and future works

Linear regression is the most frequently applied method for cost estimation. A number of studies have pointed out that linear regression is prone to low prediction accuracy. One less addressed reason for the low prediction accuracy is multi-collinearity. To tackle this problem and improve regression accuracy, we propose a holistic approach, the adaptive ridge regression (ARR) system, which consists of data transformation, multi-collinearity diagnosis, the ridge regression technique and a multi-objective optimization to train the ridge parameter.

Acknowledgement

This research was partially supported by a grant from A*Star (SERC grant number 072 1340050) in Singapore.

References (73)

  • C. Mair et al.

    An investigation of machine learning based prediction systems

    Journal of Systems and Software

    (2000)
  • E. Mendes et al.

    Investigating Web size metrics for early Web cost estimation

    Journal of Systems and Software

    (2005)
  • Y. Miyazaki et al.

    Robust regression for developing software estimation models

    Journal of Systems and Software

    (1994)
  • P. Sentas et al.

    Software productivity and effort prediction with ordinal regression

    Information and Software Technology

    (2005)
  • Q.B. Song et al.

    A new imputation method for small software project data sets

    Journal of Systems and Software

    (2007)
  • X.Y. Yu et al.

    Forecasting of hydrologic time series with ridge regression in feature space

    Journal of Hydrology

    (2007)
  • A.J. Albrecht et al.

    Software function, source lines of code, and development effort prediction

    IEEE Transactions on Software Engineering

    (1983)
  • L. Angelis et al.

    A simulation tool for efficient analogy based cost estimation

    Empirical Software Engineering

    (2000)
  • L. Angelis et al.

    Building a software cost estimation model based on categorical data

  • M. Auer et al.

    Optimal project feature weights in analogy-based cost estimation: improvement and limitations

    IEEE Transactions on Software Engineering

    (2006)
  • D.A. Belsley

    Conditioning Diagnostics: Collinearity and Weak Data in Regression

    (1991)
  • B. Boehm

    Software Engineering Economics

    (1981)
  • G.E.P. Box et al.

    An analysis of transformations (with discussion)

    Journal of the Royal Statistical Society, Series B

    (1964)
  • N. Brauner et al.

    Role of range and precision of the independent variable in regression of data

    AIChE Journal

    (1998)
  • L.C. Briand et al.

    An assessment and comparison of common cost estimation modeling techniques

  • S. Chatterjee et al.

    Regression Analysis by Example

    (1977)
  • S. Conte et al.

    Software Engineering Metrics and Models

    (1986)
  • G. Costagliola et al.

    Class point: an approach for the size estimation of object-oriented systems

    IEEE Transactions on Software Engineering

    (2005)
  • I. Das et al.

    Normal-boundary intersection: a new method for generating the Pareto surface in nonlinear multicriteria optimization problems

    SIAM Journal on Optimization

    (1998)
  • K. Deb

    Multi-objective Optimization using Evolutionary Algorithms

    (2001)
  • J.M. Desharnais

    Analyse statistique de la productivité des projets de développement en informatique à partir de la technique des points de fonction

    (1989)
  • N.R. Draper et al.

    Applied Regression Analysis

    (1981)
  • T. Evgeniou et al.

    Regularization networks and support vector machines

    Advances in Computational Mathematics

    (2000)
  • T. Foss et al.

    A simulation study of the model evaluation criterion MMRE

    IEEE Transactions on Software Engineering

    (2003)
  • S. Gass et al.

    The computational algorithm for the parametric objective function

    Naval Research Logistics Quarterly

    (1955)
  • J.J. Goeman

    Autocorrelated logistic ridge regression for prediction based on proteomics spectra

    Statistical Applications in Genetics and Molecular Biology

    (2008)
    Yanfu Li received his PhD in 2010 from Department of Industrial & Systems Engineering at National University of Singapore. He is currently a Research Associate at University of Tennessee, USA. His research interests include software cost estimation, data mining, software reliability and quality engineering, and cloud computing. He has publications on Journal of Systems and Software, Empirical Software Engineering, IEEE Transactions on Reliability, Expert Systems with Applications, Applied Software Computing and several international conferences. He is an invited reviewer of six international journals. Dr. Li is a member of IEEE.

    Min Xie received his PhD in Quality Technology in 1987 from Linkoping University in Sweden. He was awarded the prestigious LKY research fellowship in 1991 and currently he is a Professor at National University of Singapore. Prof Xie has authored or co-authored numerous papers and six books on quality and reliability engineering, including Software Reliability Modelling in 1991 by World Scientific Publisher, Weibull Models by John Wiley in 2003, and Computing Systems Reliability by Kluwer Academic in 2004. He is an Editor of Int Journal of Reliability, Quality and Safety Engineering, Department Editor of IIE Transactions, Associate Editor of IEEE Transactions on Reliability, and on the editorial board of a number other international journals. Prof Xie is an elected fellow of IEEE.

    T.N. Goh holds a BE from the University of Saskatchewan, Canada and a PhD from the University of Wisconsin-Madison. Positions that he has held include Dean, Faculty of Engineering; Head, Department of Industrial and Systems Engineering; Director, NUS Office of Quality Management, and CEO, Design Technology Institute International. Dr. Goh is an elected Academician of the International Academy for Quality, Fellow of the American Society for Quality (ASQ), and Honorary Member of the Singapore Quality Institute. He is author or co-author of more than a hundred research papers and three books and currently serves on the editorial boards of eight international research journals.
