Investigating the use of Support Vector Regression for web effort estimation

Empirical Software Engineering

Abstract

Support Vector Regression (SVR) belongs to a new generation of Machine Learning algorithms suitable for predictive data modeling problems. The objective of this paper is twofold: first, to investigate the effectiveness of SVR for Web effort estimation using a cross-company dataset; second, to compare different SVR configurations in order to identify the one with the best performance. In particular, we considered three preprocessing strategies for the variables (no preprocessing, normalization, and logarithmic transformation), in combination with two different dependent variables (effort and inverse effort). As a result, SVR was applied using six different data configurations. Moreover, to understand the suitability of kernel functions for handling non-linear problems, SVR was applied without a kernel and in combination with the Radial Basis Function (RBF) and Polynomial kernels, thus obtaining 18 different SVR configurations. To identify the best parameter values for each configuration, we defined a procedure based on a leave-one-out cross-validation approach. The dataset employed was the Tukutuku database, which has been adopted in many previous Web effort estimation studies. Three different training and test set splits were used, containing 130 and 65 projects respectively. The SVR-based predictions were also benchmarked against predictions obtained using Manual StepWise Regression and Case-Based Reasoning. Our results showed that the configuration combining logarithmic preprocessing of the features with the RBF kernel provided the best results for all three data splits. In addition, SVR provided significantly better prediction accuracy than all the considered benchmarking techniques.
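The configuration counts given in the abstract (six data configurations, 18 SVR configurations) follow from a simple cross product. The short sketch below only illustrates that arithmetic; the string labels are hypothetical names chosen here and are not the identifiers used in the paper.

```python
from itertools import product

# Illustrative labels paraphrasing the abstract (assumed names, not the paper's own).
PREPROCESSING = ["none", "normalization", "logarithmic"]
DEPENDENT = ["effort", "inverse_effort"]
KERNELS = ["no_kernel", "rbf", "polynomial"]

# Preprocessing strategy x dependent variable gives the data configurations.
data_configs = list(product(PREPROCESSING, DEPENDENT))

# Each data configuration is then paired with each kernel choice.
svr_configs = list(product(PREPROCESSING, DEPENDENT, KERNELS))

print(len(data_configs))  # 6
print(len(svr_configs))   # 18
```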

Notes

  1. The norm was computed based on all the input information available in the test set, that is, all the attributes except Effort.

  2. SVM-light is freely available at http://svmlight.joachims.org/ for scientific use.

  3. In a leave-one-out cross-validation, a single observation from the original sample is used to evaluate the model that is trained using the remaining observations. This is repeated until each observation in the sample has been used once as validation data. Applying leave-one-out cross-validation on the training set allowed us to guard against model overfitting, which could prevent the model from achieving good prediction accuracy on unseen data.

  4. Observe that we removed an outlier for the boxplot of C6(L) and an outlier for the boxplot of C6(R) to improve the readability of the plots.

  5. Observe that we removed an outlier for the boxplot of C6(L) to improve the readability of the plots.
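The leave-one-out procedure described in note 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: a closed-form ridge regressor stands in for SVR so that the example runs with NumPy alone, the candidate grid and the synthetic data are invented for the example, and mean absolute error is an assumed selection criterion.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression with intercept (stand-in for SVR, an assumption)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0  # leave the intercept unpenalised
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def predict_ridge(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w

def loo_error(X, y, lam):
    """Mean absolute error over all leave-one-out folds (assumed criterion)."""
    n = len(y)
    errors = []
    for i in range(n):
        keep = np.arange(n) != i              # train on every observation except i
        w = fit_ridge(X[keep], y[keep], lam)
        pred = predict_ridge(w, X[i:i + 1])[0]
        errors.append(abs(y[i] - pred))       # validate on the held-out observation
    return float(np.mean(errors))

def select_parameter(X, y, grid):
    """Return the candidate value with the lowest leave-one-out error."""
    scores = {lam: loo_error(X, y, lam) for lam in grid}
    return min(scores, key=scores.get), scores

# Synthetic training data, invented purely for illustration.
rng = np.random.default_rng(0)
X_train = rng.uniform(1, 10, size=(30, 2))
y_train = 3.0 * X_train[:, 0] + 2.0 * X_train[:, 1] + rng.normal(0, 0.5, 30)

best_lam, loo_scores = select_parameter(X_train, y_train, [0.01, 1.0, 100.0])
print(best_lam, loo_scores)
```

Only the training set is used for the selection, so the test set stays untouched until the chosen parameters are evaluated.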

References

  • Bailey JW, Basili VR (1981) A meta model for software development resource expenditure. Procs. Fifth International Conference on Software Engineering, San Diego, California, USA, pp. 107–116

  • Braga PL, Oliveira ALI, Meira SRL (2007) Software Effort Estimation using Machine Learning Techniques with Robust Confidence Intervals. HIS :352-357

  • Braga PL, Oliveira ALI, Meira SRL (2008) A GA-based Feature Selection and Parameters Optimization for Support Vector Regression Applied to Software Effort Estimation. Proceedings of the ACM symposium on Applied computing :1788-1792

  • Briand L, Langley T, Wieczorek I (2000) A Replicated Assessment and Comparison of Common Software Cost Modeling Techniques. In Proceedings of International Conference on Software Engineering, IEEE press, pp 377–386

  • Briand L, Labiche Y, Di Penta M, Yan-Bondoc H (2005) An experimental investigation of formality in UML-based development. IEEE TSE 31(10):833–849

  • Christodoulou SP, Zafiris PA, Papatheodorou TS (2000) WWW2000: The Developer's view and a practitioner's approach to Web Engineering. Procs. ICSE Workshop on Web Engineering, Limerick, Ireland, pp 75–92

  • Chulani S, Boehm B, Steece B (1999) Bayesian Analysis of Empirical Software Engineering Cost Models. IEEE TSE 25:573–583

  • Conover WJ (1998) Practical nonparametric statistics, 3rd edn. Wiley, New York

  • Conte SD, Dunsmore HE, Shen VY (1986) Software Engineering Metrics and Models. Benjamin-Cummins

  • Cook RD (1977) Detection of influential observations in linear regression. Technometrics 19:15–18

  • Corazza A, Di Martino S, Ferrucci F, Gravino C, Mendes E (2009) Applying Support Vector Regression for Web Effort Estimation using a Cross-Company Dataset. In Proceedings of Empirical Software Engineering and Measurement (ESEM’09), Lake Buena Vista Florida, pp 17-19, October

  • Cortes C, Vapnik V (1995) Support-Vector Networks. Mach Learn 20

  • Costagliola G, Di Martino S, Ferrucci F, Gravino C, Tortora G, Vitiello G (2006) Effort estimation modeling techniques: a case study for web applications. Procs. Intl. Conference on Web Engineering (ICWE’06), 9-16

  • Desharnais JM (1989) Analyse statistique de la productivitie des projets informatique a partie de la technique des point des fonction, Unpublished Masters Thesis, University of Montreal

  • Di Martino S, Ferrucci F, Gravino C, Mendes E (2007) Comparing Size Measures for Predicting Web Application Development Effort: A Case Study. Procs. Empirical Software Engineering and Measurement, IEEE press, pp. 324–333

  • Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat 36(3):1171–1220

  • Huang C-L, Wang C-J (2006) A GA-based feature selection and parameters optimization for support vector machines. Expert Syst Appl 31(2):231–240

  • Joachims T (1999) Making large-Scale SVM Learning Practical. In: Schölkopf B, Burges CJC, Smola AJ (eds) Advances in Kernel Methods— Support Vector Learning. MIT Press, Cambridge, MA

  • Kitchenham BA (1998) A Procedure for Analyzing Unbalanced Datasets. IEEE TSE 24(4):278–301

  • Kitchenham BA, Mendes E (2004) A Comparison of Cross-company and Single-company Effort Estimation Models for Web Applications. Proc EASE 2004:47–55

  • Kitchenham B, Pickard L, Pfleeger S (1995) Case studies for method and tool evaluation. IEEE Softw 12(4):52–62

  • Kitchenham B, Pickard LM, MacDonell SG, Shepperd MJ (2001) What accuracy statistics really measure. IEE Proc Softw 148(3):81–85

  • Kitchenham BA, Mendes E, Travassos G (2006) A Systematic Review of Cross- and Within-company Cost Estimation Studies. Procs. Empirical Assessment in Software Engineering, pp 89-98

  • Kitchenham B, Mendes E, Travassos G (2007) Cross versus Within-Company Cost Estimation Studies: A systematic Review. IEEE TSE 33(5):316–329

  • Maxwell K (2002) Applied Statistics for Software Managers. Software Quality Institute Series, Prentice Hall

  • Mendes E (2008) The Use of Bayesian Networks for Web Effort Estimation: Further Investigation. Procs. International Conference on Web Engineering

  • Mendes E, Kitchenham BA (2004) Further Comparison of Cross-company and Within-company Effort Estimation Models for Web Applications. Procs. IEEE Metrics, pp 348-357

  • Mendes E, Mosley N (2008) Bayesian Network Models for Web Effort Prediction: A Comparative Study. IEEE TSE 34(6):723–737

  • Mendes E, Mosley N, Counsell S (2002) Comparison of Length, complexity and functionality as size measures for predicting Web design and authoring effort. IEE Proc Softw 149(3):86–92

  • Mendes E, Counsell S, Mosley N, Triggs C, Watson I (2003c) A Comparative Study of Cost Estimation Models for Web Hypermedia Applications. Empir Software Eng 8(2):163–196

  • Mendes E, Mosley N, Counsell S (2005a) Investigating Web Size Metrics for Early Web Cost Estimation. J Syst Softw 77(2):157–172

  • Mendes E, Mosley N, Counsell S (2005) Web Effort Estimation. In: Mendes E, Mosley N (eds) Web Engineering, Springer-Verlag, ISBN: 3-540-28196-7

  • Mendes E, Di Martino S, Ferrucci F, Gravino C (2008) Cross-company vs. single-company web effort models using the Tukutuku database: An extended study. J Syst Softw 81(5):673–690

  • Oliveira ALI (2006) Estimation of software project effort with support vector regression. Neurocomputing 69(13–15):1749–1753

  • Schölkopf B (1997) Support Vector Learning. R. Oldenbourg Verlag, München. Doktorarbeit, TU Berlin. Download: http://www.kernel-machines.org

  • Schölkopf B, Smola AJ (2002) Learning with Kernels. MIT Press

  • Shepperd MJ, Kadoda G (2001) Using Simulation to Evaluate Prediction Techniques. Procs IEEE Metrics’01, London, UK, pp 349-358

  • Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Stat Comput 14(3):199–222

  • Standish Group International. The Chaos Report; www.standishgroup.com/sample_research/PDFpages/Chaos1994.pdf

  • Vapnik V (1998) Statistical Learning Theory. Wiley

  • Vapnik V, Chervonenkis A (1964) A note on one class of perceptrons. Autom Remote Control 25

  • Vapnik V, Lerner A (1963) Pattern recognition using generalized portrait method. Autom Remote Control 24:774–780

Acknowledgments

The authors wish to thank all companies that volunteered data to the Tukutuku database.

Author information

Corresponding author

Correspondence to Filomena Ferrucci.

Additional information

Editor: James Miller

Appendix

1.1 Manual Stepwise Regression

Stepwise Regression (Maxwell 2002) is a statistical technique whereby a prediction model (an equation) is built to represent the relationship between the independent variables (e.g. number of Web pages) and the dependent variable (e.g. total Effort). This technique builds the model by adding, at each stage, the independent variable with the highest association to the dependent variable, taking into account all variables currently in the model. It aims to find the set of independent variables (predictors) that best explains the variation in the dependent variable (response). In particular, we applied Manual Stepwise Regression using the technique proposed by Kitchenham (Kitchenham 1998). The idea is to use this technique to select the important independent variables, and then to use linear regression to obtain the final model.
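The general idea of adding one variable per stage can be sketched as a plain forward selection. This is only an illustrative approximation and not the manual procedure of Kitchenham (1998): the association measure (reduction in residual sum of squares), the `min_gain` stopping threshold, the variable names, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an ordinary least squares fit with intercept."""
    Xb = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    resid = y - Xb @ coef
    return float(resid @ resid)

def forward_select(X, y, names, min_gain=0.05):
    """Add, at each stage, the variable that most reduces the RSS.

    min_gain is a hypothetical stopping rule: stop when the best candidate
    reduces the current RSS by less than this fraction.
    """
    selected = []
    current = rss(X[:, :0], y)  # intercept-only model
    while len(selected) < X.shape[1] and current > 0:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = {j: rss(X[:, selected + [j]], y) for j in candidates}
        best = min(scores, key=scores.get)
        if (current - scores[best]) / current < min_gain:
            break
        selected.append(best)
        current = scores[best]
    return [names[j] for j in selected]

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(1)
n = 200
pages = rng.uniform(1, 50, n)
images = rng.uniform(1, 20, n)
unrelated = rng.uniform(0, 1, n)
effort = 5.0 * pages + 2.0 * images + rng.normal(0, 1, n)

X = np.column_stack([pages, images, unrelated])
chosen = forward_select(X, effort, ["pages", "images", "unrelated"])
print(chosen)

# Final model: linear regression restricted to the selected variables.
final_rss = rss(X[:, [0, 1]], effort)
```

A statistical significance test would normally replace the fixed `min_gain` threshold; the threshold is used here only to keep the sketch free of extra dependencies.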

In our study we employed the variables shown in Table 1 with Manual Stepwise Regression in order to select the most important size measures. Once selected, these variables were used for cross-validation on each split; that is, we did not perform a separate Manual Stepwise Regression for each split, but simply performed a regression using the variables previously selected by the manual stepwise procedure.

Whenever variables were highly skewed they were transformed before being used in the forward stepwise procedure. This was done in order to comply with the assumptions underlying stepwise regression (Maxwell 2002) (i.e. residuals should be independent and normally distributed; relationship between dependent and independent variables should be linear). The transformation employed was to take the natural log (Ln), which makes larger values smaller and brings the data values closer to each other (Maxwell 2002). A new variable containing the transformed values was created for each original variable that needed to be transformed.

In addition, whenever a variable needed to be transformed but had zero values, the natural logarithmic transformation was applied to the variable’s value after adding 1.
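A minimal sketch of this transformation is shown below. The function name `ln_transform` and the sample values are hypothetical; the rule itself (natural log, with 1 added to every value of a variable that contains zeros) follows the description above.

```python
import math

def ln_transform(values):
    """Natural log of each value; add 1 first if the variable contains zeros."""
    shift = 1.0 if any(v == 0 for v in values) else 0.0
    return [math.log(v + shift) for v in values]

# A skewed variable without zeros is transformed directly.
print(ln_transform([1, 10, 1000]))

# A variable containing zeros is shifted by 1 before taking the log.
print(ln_transform([0, 9, 99]))
```

The transformed values are stored in a new variable, so the original measurements stay available.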

To verify the stability of each effort model built using forward stepwise regression, the following steps were employed (Kitchenham and Mendes 2004):

  • Use a residual plot showing residuals vs. fitted values to investigate whether the residuals are randomly and normally distributed.

  • Calculate Cook’s distance values (Cook 1977) for all projects to identify influential data points. Any projects with distances higher than 3 × (4/n), where n represents the total number of projects, are immediately removed from the data analysis (Maxwell 2002). Those with distances higher than 4/n but smaller than 3 × (4/n) are removed in order to test the model stability, by observing the effect of their removal on the model. If the model coefficients remain stable and the adjusted R² (goodness of fit) improves, the highly influential projects are retained in the data analysis.
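The Cook's distance screening in the second step can be sketched as follows. This is an illustration under assumptions, not the authors' tooling: the distance is computed with the standard formula for an ordinary least squares fit, and the function names and synthetic data are invented for the example. Only the two thresholds, 4/n and 3 × (4/n), come from the text above.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of each observation for an OLS fit with intercept."""
    Xb = np.column_stack([np.ones(len(y)), X])
    n, p = Xb.shape
    hat = Xb @ np.linalg.pinv(Xb.T @ Xb) @ Xb.T   # hat (projection) matrix
    h = np.diag(hat)                              # leverages
    resid = y - hat @ y
    s2 = resid @ resid / (n - p)                  # residual variance estimate
    return (resid ** 2 / (p * s2)) * h / (1.0 - h) ** 2

def classify_projects(d):
    """Split projects using the 4/n and 3 x (4/n) thresholds."""
    n = len(d)
    low, high = 4.0 / n, 3.0 * (4.0 / n)
    remove = np.where(d > high)[0]                # removed immediately
    check = np.where((d > low) & (d <= high))[0]  # removed only to test stability
    return remove, check

# Synthetic data with one deliberately distorted project (index 19).
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 20)
y[19] += 50.0

d = cooks_distance(x.reshape(-1, 1), y)
remove, check = classify_projects(d)
print(remove, check)
```

Projects in `check` would then be dropped temporarily and the model refitted, so that the coefficients and adjusted R² can be compared with and without them.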

1.2 Case-Based Reasoning

Case-Based Reasoning (CBR) is a branch of Artificial Intelligence where knowledge of similar past cases is used to solve new cases (Shepperd & Kadoda 2001). Within the context of our investigation, the idea behind the use of the CBR technique is to predict the effort of a new project by considering similar projects previously developed. In particular, the completed projects are characterized in terms of a set of p features and form the case base. The new project is also characterized in terms of the same p features and is referred to as the target case. Then, the similarity between the target case and the other cases in the p-dimensional feature space is measured, and the most similar cases are used, possibly with adaptations, to obtain a prediction for the target case. To apply the method, we have to select the relevant project features, the appropriate similarity function, the number of analogies (that is, how many similar projects to consider for the estimation), and the analogy adaptation strategy for generating the estimation. The selection of the similarity function and the number of analogies are crucial decisions. The similarity measure used in this study is the Euclidean distance, as this is the measure that has given the best results in the literature (Mendes et al. 2003c). In addition, all the project attributes considered by the similarity function had equal influence upon the selection of the most similar project(s).

Estimates were based on the average effort of the most similar projects in the case base, with no different weights for attributes or adaptation of the estimated effort.
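The estimation scheme described above (unweighted Euclidean distance, average effort of the most similar cases, no adaptation) can be sketched as below. The function name `cbr_estimate`, the parameter `k`, and the tiny case base are hypothetical and serve only as an illustration.

```python
import numpy as np

def cbr_estimate(case_base, efforts, target, k=1):
    """Average effort of the k cases closest to the target (Euclidean distance)."""
    # Every attribute enters the distance with the same weight.
    distances = np.sqrt(((case_base - target) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # No adaptation: the estimate is the plain mean of the retrieved efforts.
    return float(efforts[nearest].mean())

# Invented case base: two features per completed project and its actual effort.
case_base = np.array([[10.0, 2.0], [12.0, 3.0], [50.0, 20.0], [55.0, 25.0]])
efforts = np.array([100.0, 120.0, 600.0, 700.0])
target = np.array([11.0, 2.0])

print(cbr_estimate(case_base, efforts, target, k=1))  # 100.0
print(cbr_estimate(case_base, efforts, target, k=2))  # 110.0
```

The sketch applies no rescaling; with features measured on very different scales, the larger-valued features would dominate an unscaled Euclidean distance, so some form of rescaling is usually needed for the attributes to have comparable influence.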

About this article

Cite this article

Corazza, A., Di Martino, S., Ferrucci, F. et al. Investigating the use of Support Vector Regression for web effort estimation. Empir Software Eng 16, 211–243 (2011). https://doi.org/10.1007/s10664-010-9138-4

  • DOI: https://doi.org/10.1007/s10664-010-9138-4
