Investigating the use of Support Vector Regression for web effort estimation

Corazza, Anna; Di Martino, Sergio; Ferrucci, Filomena; Gravino, Carmine; Mendes, Emilia

doi:10.1007/s10664-010-9138-4

Published: 29 July 2010

Empirical Software Engineering, Volume 16, pages 211–243 (2011)

Anna Corazza¹,
Sergio Di Martino¹,
Filomena Ferrucci²,
Carmine Gravino² &
Emilia Mendes³

Abstract

Support Vector Regression (SVR) is a new generation of Machine Learning algorithms, suitable for predictive data modeling problems. The objective of this paper is twofold: first, to investigate the effectiveness of SVR for Web effort estimation using a cross-company dataset; second, to compare different SVR configurations looking at the one that presents the best performance. In particular, we took into account three variables’ preprocessing strategies (no-preprocessing, normalization, and logarithmic), in combination with two different dependent variables (effort and inverse effort). As a result, SVR was applied using six different data configurations. Moreover, to understand the suitability of kernel functions to handle non-linear problems, SVR was applied without a kernel, and in combination with the Radial Basis Function (RBF) and the Polynomial kernels, thus obtaining 18 different SVR configurations. To identify, for each configuration, which were the best values for each of the parameters we defined a procedure based on a leave-one-out cross-validation approach. The dataset employed was the Tukutuku database, which has been adopted in many previous Web effort estimation studies. Three different training and test set splits were used, including respectively 130 and 65 projects. The SVR-based predictions were also benchmarked against predictions obtained using Manual StepWise Regression and Case-Based Reasoning. Our results showed that the configuration corresponding to the logarithmic features’ preprocessing, in combination with the RBF kernel provided the best results for all three data splits. In addition, SVR provided significantly superior prediction accuracy than all the considered benchmarking techniques.

Notes

The norm was computed based on all the input information available in the test set, that is, all the attributes except Effort.
SVM-light is freely available at http://svmlight.joachims.org/ for scientific use.
In a leave-one-out cross-validation, a single observation from the original sample is used to evaluate the model that is trained using the remaining observations. This is repeated until each observation in the sample has been used once as validation data. Applying leave-one-out cross-validation to the training set allowed us to limit model overfitting, which could otherwise prevent the model from achieving good prediction accuracy on an unseen dataset.
Observe that we removed an outlier for the boxplot of C6(L) and an outlier for the boxplot of C6(R) to improve the readability of the plots.
Observe that we removed an outlier for the boxplot of C6(L) to improve the readability of the plots.
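
The leave-one-out procedure described in the notes above can be sketched in a few lines of Python. This is only an illustration: the function names, the use of mean absolute error as the score, and the trivial mean-effort model are assumptions made for the example, not the SVR setup or the accuracy measure used in the study.

```python
def leave_one_out_error(fit, predict, xs, ys):
    """Mean absolute error over all leave-one-out folds.

    fit(train_x, train_y) returns a model; predict(model, x) returns a number.
    The error measure (mean absolute error) is an assumption of this sketch.
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Hold out observation i and train on all the others.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        total += abs(ys[i] - predict(model, xs[i]))
    return total / n


# Hypothetical stand-in model: always predicts the mean training effort.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


score = leave_one_out_error(fit_mean, predict_mean, [1, 2, 3], [1.0, 2.0, 3.0])
```

To choose among candidate parameter values, one would compute this score on the training set for each candidate and keep the candidate with the lowest error.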

References

Bailey JW, Basili VR (1981) A meta model for software development resource expenditure. Procs. Fifth International Conference on Software Engineering, San Diego, California, USA, pp. 107–116
Braga PL, Oliveira ALI, Meira SRL (2007) Software Effort Estimation using Machine Learning Techniques with Robust Confidence Intervals. HIS :352-357
Braga PL, Oliveira ALI, Meira SRL (2008) A GA-based Feature Selection and Parameters Optimization for Support Vector Regression Applied to Software Effort Estimation. Proceedings of the ACM symposium on Applied computing :1788-1792
Briand L, Langley T, Wieczorek I (2000) A Replicated Assessment and Comparison of Common Software Cost Modeling Techniques. In Proceedings of International Conference on Software Engineering, IEEE press, pp 377–386
Briand L, Labiche Y, Di Penta M, Yan-Bondoc H (2005) An experimental investigation of formality in UML-based development. IEEE TSE 31(10):833–849
Christodoulou SP, Zafiris PA, Papatheodorou TS (2000) WWW2000: The Developer's view and a practitioner's approach to Web Engineering. Procs. ICSE Workshop on Web Engineering, Limerick, Ireland, pp 75–92
Chulani S, Boehm B, Steece B (1999) Bayesian Analysis of Empirical Software Engineering Cost Models. IEEE TSE 25:573–583
Conover WJ (1998) Practical nonparametric statistics, 3rd edn. Wiley, New York
Conte SD, Dunsmore HE, Shen VY (1986) Software Engineering Metrics and Models. Benjamin-Cummins
Cook RD (1977) Detection of influential observations in linear regression. Technometrics 19:15–18
Corazza A, Di Martino S, Ferrucci F, Gravino C, Mendes E (2009) Applying Support Vector Regression for Web Effort Estimation using a Cross-Company Dataset. In Proceedings of Empirical Software Engineering and Measurement (ESEM’09), Lake Buena Vista Florida, pp 17-19, October
Cortes C, Vapnik V (1995) Support-Vector Networks. Mach Learn 20
Costagliola G, Di Martino S, Ferrucci F, Gravino C, Tortora G, Vitiello G (2006) Effort estimation modeling techniques: a case study for web applications. Procs. Intl. Conference on Web Engineering (ICWE’06), 9-16
Desharnais JM (1989) Analyse statistique de la productivitie des projets informatique a partie de la technique des point des fonction, Ph.D. thesis, Unpublished Masters Thesis, University of Montreal
Di Martino S, Ferrucci F, Gravino C, Mendes E (2007) Comparing Size Measures for Predicting Web Application Development Effort: A Case Study. Procs. Empirical Software Engineering and Measurement, IEEE press, pp. 324–333
Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat 36(3):1171–1220
Huang C-L, Wang C-J (2006) A GA-based feature selection and parameters optimization for support vector machines. Expert Syst Appl 31(2):231–240
Joachims T (1999) Making large-Scale SVM Learning Practical. In: Schölkopf B, Burges CJC, Smola AJ (eds) Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA
Kitchenham BA (1998) A Procedure for Analyzing Unbalanced Datasets. IEEE TSE 24(4):278–301
Kitchenham BA, Mendes E (2004) A Comparison of Cross-company and Single-company Effort Estimation Models for Web Applications. Proc EASE 2004:47–55
Kitchenham B, Pickard L, Pfleeger SL (1995) Case studies for method and tool evaluation. IEEE Softw 12(4):52–62
Kitchenham B, Pickard LM, MacDonell SG, Shepperd MJ (2001) What accuracy statistics really measure. IEE Proc Softw 148(3):81–85
Kitchenham BA, Mendes E, Travassos G (2006) A Systematic Review of Cross- and Within-company Cost Estimation Studies, Procs. Empirical Assessment in Software Engineering, pp 89-98
Kitchenham B, Mendes E, Travassos G (2007) Cross versus Within-Company Cost Estimation Studies: A systematic Review. IEEE TSE 33(5):316–329
Maxwell K (2002) Applied Statistics for Software Managers. Software Quality Institute Series, Prentice Hall
Mendes E (2008) The Use of Bayesian Networks for Web Effort Estimation: Further Investigation. Procs. International Conference on Web Engineering
Mendes E, Kitchenham BA (2004) Further Comparison of Cross-company and Within-company Effort Estimation Models for Web Applications. Procs. IEEE Metrics, pp 348-357
Mendes E, Mosley N (2008) Bayesian Network Models for Web Effort Prediction: A Comparative Study. IEEE TSE 34(6):723–737
Mendes E, Mosley N, Counsell S (2002) Comparison of Length, complexity and functionality as size measures for predicting Web design and authoring effort. IEE Proc Softw 149(3):86–92
Mendes E, Counsell S, Mosley N, Triggs C, Watson I (2003c) A Comparative Study of Cost Estimation Models for Web Hypermedia Applications. Empir Software Eng 8(23):163–196
Mendes E, Mosley N, Counsell S (2005a) Investigating Web Size Metrics for Early Web Cost Estimation. J Syst Softw 77(2):157–172
Mendes E, Mosley N, Counsell S (2005) Web Effort Estimation. In: Mendes E, Mosley N (eds) Web Engineering, Springer-Verlag, ISBN: 3-540-28196-7
Mendes E, Di Martino S, Ferrucci F, Gravino C (2008) Cross-company vs. single-company web effort models using the Tukutuku database: An extended study. J Syst Softw 81(5):673–690
Oliveira ALI (2006) Estimation of software project effort with support vector regression. Neurocomputing 69(13–15):1749–1753
Schölkopf B (1997) Support Vector Learning. R. Oldenbourg Verlag, Munchen. Doktorarbeit, TU Berlin. Download: http://www.kernel-machines.org
Schölkopf B, Smola AJ (2002) Learning with Kernels. MIT Press
Shepperd MJ, Kadoda G (2001) Using Simulation to Evaluate Prediction Techniques. Procs IEEE Metrics’01, London, UK, pp 349-358
Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Stat Comput 14(3):199–222
Standish Group International. The Chaos Report; www.standishgroup.com/sample_research/PDFpages/Chaos1994.pdf
Vapnik V (1998) Statistical Learning Theory. Wiley
Vapnik V, Chervonenkis A (1964) A note on one class of perceptrons. Automatics and Remote Control 25
Vapnik V, Lerner A (1963) Pattern recognition using generalized portrait method. Autom Remote Control 24:774–780

Acknowledgments

The authors wish to thank all the companies that volunteered data to the Tukutuku database.

Author information

Authors and Affiliations

University of Napoli “Federico II”, Via Cinthia, 80126, Naples, Italy
Anna Corazza & Sergio Di Martino
University of Salerno, Via Ponte Don Melillo, 84084, Fisciano, SA, Italy
Filomena Ferrucci & Carmine Gravino
The University of Auckland, Private Bag, 92019, Auckland, New Zealand
Emilia Mendes

Corresponding author

Correspondence to Filomena Ferrucci.

Additional information

Editor: James Miller

Appendix

1.1 Manual Stepwise Regression

Stepwise Regression (Maxwell 2002) is a statistical technique whereby a prediction model (an equation) is built to represent the relationship between independent variables (e.g. number of Web pages) and a dependent variable (e.g. total Effort). This technique builds the model by adding, at each stage, the independent variable with the highest association to the dependent variable, taking into account all variables currently in the model. It aims to find the set of independent variables (predictors) that best explains the variation in the dependent variable (response). In particular, we applied a Manual Stepwise Regression using the technique proposed by Kitchenham (Kitchenham 1998). The idea is to use this technique to select the important independent variables, and then to use linear regression to obtain the final model.

In our study we employed the variables shown in Table 1 with Manual Stepwise Regression in order to select the most important size measures. Once selected, these variables were used for cross-validation on every split. That is, we did not perform a separate Manual Stepwise Regression for each split; we simply performed a regression using the variables previously selected by the manual stepwise procedure.

Whenever variables were highly skewed they were transformed before being used in the forward stepwise procedure. This was done in order to comply with the assumptions underlying stepwise regression (Maxwell 2002) (i.e. residuals should be independent and normally distributed; relationship between dependent and independent variables should be linear). The transformation employed was to take the natural log (Ln), which makes larger values smaller and brings the data values closer to each other (Maxwell 2002). A new variable containing the transformed values was created for each original variable that needed to be transformed.

In addition, whenever a variable needed to be transformed but had zero values, the natural logarithmic transformation was applied to the variable’s value after adding 1.
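
The transformation rule above can be illustrated with a minimal Python sketch. The function name `log_transform` is hypothetical, and the sketch assumes the offset of 1 is applied to every value of a variable that contains at least one zero.

```python
import math


def log_transform(values):
    """Natural-log transform of one variable.

    If the variable contains any zero, 1 is added to every value first,
    since ln(0) is undefined (assumed reading of the rule in the text).
    """
    offset = 1.0 if any(v == 0 for v in values) else 0.0
    return [math.log(v + offset) for v in values]


pages = [0, 9, 99]            # hypothetical skewed size measure with a zero
transformed = log_transform(pages)
```

The transformed values are stored as a new variable, leaving the original variable untouched, as described above.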

To verify the stability of each effort model built using forward stepwise regression, the following steps were employed (Kitchenham and Mendes 2004):

Use of a residual plot showing residuals vs. fitted values to investigate if the residuals are randomly and normally distributed.
Calculate Cook’s distance values (Cook 1977) for all projects to identify influential data points. Any projects with distances higher than 3 × (4/n), where n represents the total number of projects, are immediately removed from the data analysis (Maxwell 2002). Those with distances higher than 4/n but smaller than (3 × (4/n)) are removed in order to test the model stability, by observing the effect of their removal on the model. If the model coefficients remain stable and the adjusted R² (goodness of fit) improves, the highly influential projects are retained in the data analysis.
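
The two Cook's distance thresholds above can be expressed as a small Python sketch. It assumes the Cook's distances have already been computed by a regression tool; the function name and the returned grouping are hypothetical.

```python
def classify_by_cooks_distance(distances):
    """Split project indices by the 4/n and 3 * (4/n) thresholds.

    distances: one precomputed Cook's distance per project.
    Returns (remove_now, test_stability) as two lists of indices.
    """
    n = len(distances)
    low = 4.0 / n          # candidates for the stability check
    high = 3.0 * low       # removed immediately
    remove_now, test_stability = [], []
    for i, d in enumerate(distances):
        if d > high:
            remove_now.append(i)
        elif d > low:
            test_stability.append(i)
    return remove_now, test_stability


# Hypothetical distances for 10 projects: thresholds are 0.4 and 1.2.
example = [0.1, 0.5, 1.3, 0.0, 0.2, 0.3, 0.39, 0.41, 0.9, 0.05]
remove_now, test_stability = classify_by_cooks_distance(example)
```

Projects in the second group would then be removed temporarily and the model refitted, so that the stability of the coefficients and the change in adjusted R² can be observed.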

1.2 Case-Based Reasoning

Case-Based Reasoning (CBR) is a branch of Artificial Intelligence where knowledge of similar past cases is used to solve new cases (Shepperd & Kadoda 2001). Within the context of our investigation, the idea behind the use of the CBR technique is to predict the effort of a new project by considering similar projects previously developed. In particular, the completed projects are characterized in terms of a set of p features and form the case base. The new project is also characterized in terms of the same p features and is referred to as the target case. Then, the similarity between the target case and the other cases in the p-dimensional feature space is measured, and the most similar cases are used, possibly with adaptations, to obtain a prediction for the target case. To apply the method, we have to select the relevant project features, the appropriate similarity function, the number of analogies (i.e. how many similar projects to consider for the estimation), and the analogy adaptation strategy for generating the estimation. The selection of the similarity function and the number of analogies are crucial decisions. The similarity measure used in this study is the Euclidean distance, as this is the measure that has given the best results in the literature (Mendes et al. 2003c). In addition, all the project attributes considered by the similarity function had equal influence upon the selection of the most similar project(s).

Estimates were based on the average effort of the most similar projects in the case base, with no different weights for attributes or adaptation of the estimated effort.
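
The estimation step described above can be sketched as follows. This is a minimal illustration, assuming unweighted Euclidean distance and a plain average over the k most similar cases; the function name, the data layout, and the value of k are hypothetical and are not taken from the study.

```python
import math


def estimate_effort(case_base, target, k=1):
    """Predict effort as the mean effort of the k most similar cases.

    case_base: list of (features, effort) pairs for completed projects.
    target: feature tuple of the new project (same p features).
    All features carry equal weight and no adaptation is applied.
    """
    ranked = sorted(case_base, key=lambda case: math.dist(case[0], target))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


# Hypothetical case base: (feature vector, effort in person-hours).
case_base = [((1.0, 1.0), 10.0), ((2.0, 2.0), 20.0), ((10.0, 10.0), 100.0)]
one_analogy = estimate_effort(case_base, (1.2, 1.2), k=1)
two_analogies = estimate_effort(case_base, (1.2, 1.2), k=2)
```

Because every attribute contributes equally to the distance, features measured on very different scales would dominate the ranking unless they are rescaled first; whether and how to rescale is left open in this sketch.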

Cite this article

Corazza, A., Di Martino, S., Ferrucci, F. et al. Investigating the use of Support Vector Regression for web effort estimation. Empir Software Eng 16, 211–243 (2011). https://doi.org/10.1007/s10664-010-9138-4

Issue Date: April 2011