Abstract
Abstract. Producing accurate and reliable software effort estimates has long been a challenge for both academic research and the software industry. Data quality is an important factor in the accuracy of effort estimation methods. To assess its impact, we investigated the effect of eliminating outliers on the estimation accuracy of commonly used software effort estimation methods. Guided by three research questions, we analyzed the influence of outlier elimination on estimation accuracy by combining five outlier elimination methods (least trimmed squares, Cook's distance, K-means clustering, box plot, and the Mantel leverage metric) with two effort estimation methods (least squares regression and estimation by analogy, under varied parameter settings). Empirical experiments were performed on industrial data sets (ISBSG Release 9, the Bank and Stock data sets collected from financial companies, and the Desharnais data set from the PROMISE repository). The effect of the outlier elimination methods was evaluated with statistical tests (the Friedman test and the Wilcoxon signed-rank test). According to the evaluation criteria, there was no substantial difference between the effort estimation results with and without outlier elimination. However, statistical analysis indicated that outlier elimination leads to a significant improvement in estimation accuracy on the Stock data set for some combinations of outlier elimination and effort estimation methods. Moreover, although outlier elimination did not significantly improve accuracy on the other data sets, our graphical analysis of errors showed that it can increase the likelihood of producing more accurate effort estimates for new software projects.
Therefore, from a practical point of view, software organizations should consider outlier elimination and conduct a detailed analysis of the effort estimation results to improve the accuracy of software effort estimation.
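To make the experimental setup concrete, the following minimal sketch shows one of the studied combinations: box-plot (IQR) outlier elimination followed by least squares regression. The data points and thresholds are hypothetical, purely for illustration; the paper itself also studies least trimmed squares, Cook's distance, K-means clustering, and the Mantel leverage metric on the industrial data sets named above.

```python
import statistics

def iqr_outlier_filter(points, k=1.5):
    """Box-plot rule: drop (size, effort) points whose effort lies
    outside [Q1 - k*IQR, Q3 + k*IQR]."""
    efforts = sorted(e for _, e in points)
    q1, _, q3 = statistics.quantiles(efforts, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [(s, e) for s, e in points if lo <= e <= hi]

def least_squares(points):
    """Ordinary least squares fit of effort = a + b * size."""
    n = len(points)
    mean_s = sum(s for s, _ in points) / n
    mean_e = sum(e for _, e in points) / n
    b = (sum((s - mean_s) * (e - mean_e) for s, e in points)
         / sum((s - mean_s) ** 2 for s, _ in points))
    a = mean_e - b * mean_s
    return a, b

# Hypothetical projects (size in function points, effort in
# person-hours), with one gross outlier included.
data = [(100, 1000), (200, 2100), (300, 2900), (400, 4100),
        (500, 5000), (250, 30000)]

cleaned = iqr_outlier_filter(data)   # the 30000-hour point is dropped
a, b = least_squares(cleaned)        # fit on the cleaned sample
```

Comparing the model fitted on `cleaned` with one fitted on `data` is, in miniature, the with/without-elimination comparison the experiments carry out at scale.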
Notes
The situation in Fig. 2 may be caused by data points with very different types of solutions, which is not necessarily an outlier problem per se. However, it can be regarded as part of the outlier problem under the definition of outliers as applied in software organizations: because that definition is subjective and usually differs from one organization to another, data points with very different types of solutions can themselves be identified as outliers.
CMMI is awarded by Carnegie Mellon University’s Software Engineering Institute (SEI) and is a software development process improvement approach for which the goal is to help organizations improve their performance. At maturity level 3, the organization’s set of standard processes is well established and improved over time. Projects establish their defined processes by tailoring the organization’s set of standard processes according to tailoring guidelines (Chrissis et al. 2003).
Note that when K is equal to 1, the mean, the median, and the weighted mean all yield the same final effort estimate, whichever similarity function is selected. Likewise, when K is equal to 2, the mean and the median yield the same result for any similarity function.
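The property stated in this note can be checked directly with a minimal estimation-by-analogy sketch. The history data and the single-feature distance below are hypothetical assumptions for illustration only; the paper's experiments use richer similarity functions over the industrial data sets.

```python
import statistics

def analogy_estimate(new_size, history, k, combine="mean"):
    """Estimation by analogy: take the K historical projects closest
    in size (absolute difference on one feature, for illustration)
    and combine their efforts."""
    neighbors = sorted(history, key=lambda p: abs(p[0] - new_size))[:k]
    efforts = [e for _, e in neighbors]
    if combine == "mean":
        return statistics.mean(efforts)
    if combine == "median":
        return statistics.median(efforts)
    if combine == "weighted_mean":
        # weight each neighbour by inverse distance
        # (epsilon avoids division by zero on an exact match)
        weights = [1.0 / (abs(s - new_size) + 1e-9) for s, _ in neighbors]
        return sum(w * e for w, e in zip(weights, efforts)) / sum(weights)
    raise ValueError(combine)

history = [(120, 1300), (210, 2000), (330, 3600), (480, 5100)]

# K = 1: every combination rule reduces to the single nearest
# neighbour's effort, so all three estimates coincide.
e1 = {c: analogy_estimate(200, history, k=1, combine=c)
      for c in ("mean", "median", "weighted_mean")}

# K = 2: the median of two values is their mean, so mean and median
# coincide; the weighted mean may still differ.
e2_mean = analogy_estimate(200, history, k=2, combine="mean")
e2_median = analogy_estimate(200, history, k=2, combine="median")
```

Running this confirms that the three K = 1 estimates are identical, and that the K = 2 mean and median agree.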
References
Agulló J, Croux C, Van Aelst S (2008) The multivariate least-trimmed squares estimator. J Multivar Anal 99(3):311–338
Alpaydin E (2004) Introduction to machine learning. MIT Press, Cambridge
Barnett V, Lewis T (1994) Outliers in statistical data, 3rd edn. Wiley, New York
Boetticher GD, Menzies T, Ostrand TJ (2007) PROMISE Repository of empirical software engineering data. http://promisedata.org/repository, West Virginia University, Department of Computer Science
Cartwright MH, Shepperd MJ, Song Q (2003) Dealing with missing software project data. In: Proc 9th international software metrics symposium (METRICS ’03), pp 154–165
Chan V, Wong W (2007) Outlier elimination in construction of software metric models. In: Proc the 22nd ACM symposium on applied computing (SAC ’07), pp 1484–1488
Chiu NH, Huang SJ (2007) The adjusted analogy-based software effort estimation based on similarity distances. J Syst Softw 80(4):628–640
Chrissis MB, Konrad M, Shrum S (2003) CMMI: guidelines for process integration and product improvement. Addison-Wesley Professional
Conte S, Dunsmore H, Shen V (1986) Software engineering metrics and models. Benjamin/Cummings Publishing Company
Cook RD (1977) Detection of influential observation in linear regression. Technometrics 19(1):15–18
de Barcelos Tronto I, da Silva J, Sant’Anna N (2007) Comparison of artificial neural network and regression models in software effort estimation. In: Proc 2007 international joint conference on neural networks (IJCNN ’07), pp 771–776
Desharnais J (1989) Analyse statistique de la productivité des projets informatiques à partir de la technique des points de fonction. Masters thesis, University of Montreal
Field A (2009) Discovering statistics using SPSS, 3rd edn. Sage Publications Ltd
Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995
Hamilton L (1992) Regression with graphics: a second course in applied statistics. Duxbury Press
Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann
Huang SJ, Chiu NH (2006) Optimization of analogy weights by genetic algorithm for software effort estimation. Inf Softw Technol 48(11):1034–1045
IFPUG (1994) Function point counting practices manual. International Function Point Users Group. www.ifpug.org
ISBSG (2005) International Software Benchmarking Standards Group. www.isbsg.org
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Jeffery R, Ruhe M, Wieczorek I (2000) A comparative study of two software development cost modeling techniques using multi-organizational and company-specific data. Inf Softw Technol 42(14):1009–1016
Jeffery R, Ruhe M, Wieczorek I (2001) Using public domain metrics to estimate software development effort. In: Proc 7th IEEE international software metrics symposium (METRICS ’01), pp 16–27
Jorgensen M, Shepperd MJ (2007) A systematic review of software development cost estimation studies. IEEE Trans Softw Eng 33(1):33–53
Keung JW, Kitchenham BA, Jeffery DR (2008) Analogy-X: providing statistical inference to analogy-based software cost estimation. IEEE Trans Softw Eng 34(4):471–484
Kirsopp C, Shepperd MJ (2002) Making inferences with small numbers of training sets. IEE Proc Softw 149(5):123–130
Kitchenham B, MacDonell S, Pickard L, Shepperd MJ (1999) Assessing prediction systems. The Information Science Discussion Paper Series, University of Otago
Kocaguneli E, Menzies T, Bener A, Keung J (2012) Exploiting the essential assumptions of analogy-based effort estimation. IEEE Trans Softw Eng 38(2):425–438
Kultur Y, Turhan B, Bener AB (2008) ENNA: software effort estimation using ensemble of neural networks with associative memory. In: Proc 16th ACM SIGSOFT international symposium on foundations of software engineering (FSE ’08), pp 330–338
Li YF, Xie M, Goh TN (2009) A study of project selection and feature weighting for analogy based software cost estimation. J Syst Softw 82(2):241–252
Liu Q, Qin W, Mintram R, Ross M (2008) Evaluation of preliminary data analysis framework in software cost estimation based on ISBSG R9 data. Softw Q J 16(3):411–458
Lokan C, Mendes E (2006) Cross-company and single-company effort models using the ISBSG database: a further replicated study. In: Proc 2006 ACM/IEEE international symposium on empirical software engineering (ISESE ’06), pp 75–84
MacDonell SG, Shepperd MJ (2003) Combining techniques to optimize effort predictions in software project management. J Syst Softw 66(2):91–98
Mair C, Shepperd MJ (2005) The consistency of empirical comparisons of regression and analogy-based software project cost prediction. In: Proc 2005 ACM/IEEE international symposium on empirical software engineering (ISESE ’05), pp 509–518
Maxwell KD (2002) Applied statistics for software managers. Prentice Hall
Mendes E, Lokan C (2008) Replicating studies on cross- vs single-company effort models using the ISBSG database. Empir Software Eng 13(1):3–37
Mendes M, Pala A (2003) Type I error rate and power of three normality tests. Pakistan J Inf Technol 2(2):135–139
Menzies T, Jalali O, Hihn J, Baker D, Lum K (2010) Stable rankings for different effort models. Autom Softw Eng 17(4):409–437
Menzies T, Butcher A, Marcus A, Zimmermann T, Cok DR (2011) Local vs. global models for effort estimation and defect prediction. In: Proc 26th IEEE/ACM international conference on automated software engineering (ASE ’11), pp 343–351
Mittas N, Angelis L (2008) Combining regression and estimation by analogy in a semi-parametric model for software cost estimation. In: Proc second ACM-IEEE international symposium on empirical software engineering and measurement (ESEM ’08), pp 70–79
Miyazaki Y, Takanou A, Nozaki H, Nakagawa N, Okada K (1991) Method to estimate parameter values in software prediction models. Inf Softw Technol 33(3):239–243
Miyazaki Y, Terakado M, Ozaki K, Nozaki H (1994) Robust regression for developing software estimation models. J Syst Softw 27(1):3–16
Myrtveit I, Stensrud E, Shepperd MJ (2005) Reliability and validity in comparative studies of software prediction models. IEEE Trans Softw Eng 31(5):380–391
Ott RL, Longnecker MT (2008) An introduction to statistical methods and data analysis, 6th edn. Duxbury Press
Pendharkar P, Subramanian G, Rodger J (2005) A probabilistic model for predicting software development effort. IEEE Trans Softw Eng 31(7):615–624
Rousseeuw PJ (1984) Least median of squares regression. J Am Stat Assoc 79(388):871–880
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20(1):53–65
Rousseeuw P, Leroy A (1987) Robust regression and outlier detection. Wiley, New York
Rousseeuw P, van Driessen K (2006) Computing LTS regression for large data sets. Data Min Knowl Discovery 12(1):29–45
Seo YS, Yoon KA, Bae DH (2008) An empirical analysis of software effort estimation with outlier elimination. In: Proc 4th international workshop on predictor models in software engineering (PROMISE ’08), pp 25–32
Seo YS, Yoon KA, Bae DH (2009) Improving the accuracy of software effort estimation based on multiple least square regression models by estimation error-based data partitioning. In: Proc 2009 16th Asia–Pacific software engineering conference (APSEC ’09), pp 3–10
Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples). Biometrika 52(3–4):591–611
Shepperd MJ, Kadoda G (2001) Comparing software prediction techniques using simulation. IEEE Trans Softw Eng 27(11):1014–1022
Shepperd MJ, Schofield C (1997) Estimating software project effort using analogies. IEEE Trans Softw Eng 23(11):736–743
Strike K, El Emam K, Madhavji N (2001) Software cost estimation with incomplete data. IEEE Trans Softw Eng 27(10):890–908
Van Hulse J, Khoshgoftaar TM (2008) A comprehensive empirical evaluation of missing value imputation in noisy software measurement data. J Syst Softw 81(5):691–708
Wen J, Li S, Tang L (2009) Improve analogy-based software effort estimation using principal components analysis and correlation weighting. In: Proc 2009 16th Asia–Pacific software engineering conference (APSEC ’09), pp 179–186
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, which improved the quality of the paper. This work was partially supported by the Defense Acquisition Program Administration and the Agency for Defense Development under contract.
Additional information
Editor: Martin Shepperd
Cite this article
Seo, YS., Bae, DH. On the value of outlier elimination on software effort estimation research. Empir Software Eng 18, 659–698 (2013). https://doi.org/10.1007/s10664-012-9207-y