Abstract
The importance of Software Cost Estimation in the early stages of the development life cycle is reflected in the many models and methods that have appeared in the literature. Researchers' interest has focused on two well-known techniques: the parametric Regression Analysis and the non-parametric Estimation by Analogy. Despite several comparison studies, there is no agreement on which of the two is the better prediction technique. In this paper, we introduce a semi-parametric technique, called LSEbA, that combines the two methods while retaining the advantages of both. Furthermore, the proposed method is consistent with the mixed nature of Software Cost Estimation data and uses all the available information in the dataset even when a large proportion of values is missing. The paper illustrates in detail the process of building such a model and presents experiments on three representative datasets that verify the benefits of the proposed model in terms of accuracy, bias and spread. LSEbA is compared with linear regression, estimation by analogy and a combination of the two based on the average of their outcomes, using accuracy metrics, statistical tests and a graphical tool, the Regression Error Characteristic curves.
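As an illustration of the graphical tool named in the abstract, the following minimal Python sketch computes the points of a Regression Error Characteristic (REC) curve: for each error tolerance, the fraction of projects whose prediction error does not exceed that tolerance. The function name `rec_curve`, the use of absolute error as the error measure, and the effort values are assumptions made for this example and are not taken from the paper.

```python
from bisect import bisect_right


def rec_curve(actual, predicted):
    """Return (tolerances, accuracies) describing a REC curve.

    Assumption: the error measure is the absolute error |actual - predicted|.
    accuracies[i] is the fraction of cases with error <= tolerances[i].
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")

    # Sorted absolute errors, one per project.
    errors = sorted(abs(a - p) for a, p in zip(actual, predicted))
    n = len(errors)

    # Evaluate the curve at zero and at every distinct observed error.
    tolerances = sorted(set(errors) | {0.0})

    # bisect_right counts how many errors are <= the tolerance.
    accuracies = [bisect_right(errors, t) / n for t in tolerances]
    return tolerances, accuracies


# Hypothetical actual and predicted effort values (illustrative only).
actual_effort = [100, 200, 300, 400]
predicted_effort = [110, 180, 300, 460]

tolerances, accuracies = rec_curve(actual_effort, predicted_effort)
for t, acc in zip(tolerances, accuracies):
    print(f"tolerance <= {t}: {acc:.2f} of projects")
```

When the curves of two models are drawn on the same axes, the model whose curve lies higher predicts a larger share of projects within any given tolerance.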




Acknowledgement
We would like to thank the reviewers and the editor for their valuable comments which helped us to improve the paper.
Additional information
Editor: Emilia Mendes
Cite this article
Mittas, N., Angelis, L. LSEbA: least squares regression and estimation by analogy in a semi-parametric model for software cost estimation. Empir Software Eng 15, 523–555 (2010). https://doi.org/10.1007/s10664-010-9128-6
DOI: https://doi.org/10.1007/s10664-010-9128-6