On the value of outlier elimination on software effort estimation research

Abstract

Producing accurate and reliable software effort estimates has always been a challenge for both academic research and the software industry. In this regard, data quality is an important factor affecting the accuracy of effort estimation methods. To assess its impact, we investigated how eliminating outliers affects the estimation accuracy of commonly used software effort estimation methods. Guided by three research questions, we analyzed the influence of outlier elimination on estimation accuracy by combining five outlier elimination methods (least trimmed squares, Cook’s distance, K-means clustering, box plot, and the Mantel leverage metric) with two effort estimation methods (least squares regression and estimation by analogy with varying parameters). Empirical experiments were performed on industrial data sets (ISBSG Release 9, the Bank and Stock data sets collected from financial companies, and the Desharnais data set from the PROMISE repository), and the effect of the outlier elimination methods was evaluated with statistical tests (the Friedman test and the Wilcoxon signed-rank test). The experimental results based on the evaluation criteria showed no substantial difference between the effort estimates obtained with and without outlier elimination. However, the statistical analysis indicated that outlier elimination significantly improves estimation accuracy on the Stock data set for some combinations of outlier elimination and effort estimation methods. Although outlier elimination did not significantly improve accuracy on the other data sets, our graphical analysis of errors showed that it can increase the likelihood of producing more accurate estimates for new software projects. Therefore, from a practical point of view, outlier elimination should be considered, and the estimation results should be analyzed in detail, to improve the accuracy of software effort estimation in software organizations.
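As a concrete illustration of the kind of pipeline evaluated in this study, the sketch below (not the authors' code) pairs one outlier elimination method, the box plot rule, with one estimation method, least squares regression on log-transformed size and effort, and compares the absolute errors obtained with and without elimination using the Wilcoxon signed-rank test. The synthetic data, column choices, log-log model form, and hold-out split are illustrative assumptions rather than the paper's exact experimental protocol.

```python
# Minimal sketch: box-plot outlier elimination + least squares regression,
# compared against no elimination with a Wilcoxon signed-rank test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical project data: functional size (FP) and actual effort (hours).
size = rng.uniform(50, 2000, 100)
effort = 12.0 * size ** 0.95 * rng.lognormal(0.0, 0.4, 100)

def boxplot_filter(y):
    """Keep projects whose log-effort lies within the 1.5*IQR whiskers."""
    q1, q3 = np.percentile(np.log(y), [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (np.log(y) >= lo) & (np.log(y) <= hi)

def fit_ols_loglog(x, y):
    """Least squares fit of log(effort) = b0 + b1 * log(size)."""
    X = np.column_stack([np.ones_like(x), np.log(x)])
    beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    return beta

def predict(beta, x):
    return np.exp(beta[0] + beta[1] * np.log(x))

# Hold out the last 30 projects as the "new projects to be estimated".
train, test = slice(0, 70), slice(70, 100)

beta_all = fit_ols_loglog(size[train], effort[train])
keep = boxplot_filter(effort[train])
beta_clean = fit_ols_loglog(size[train][keep], effort[train][keep])

mre_all = np.abs(predict(beta_all, size[test]) - effort[test]) / effort[test]
mre_clean = np.abs(predict(beta_clean, size[test]) - effort[test]) / effort[test]

print(f"MMRE without elimination: {mre_all.mean():.3f}")
print(f"MMRE with elimination:    {mre_clean.mean():.3f}")

# Paired, non-parametric comparison of the per-project relative errors.
print(stats.wilcoxon(mre_all, mre_clean))
```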

Notes

  1. The situation in Fig. 2 may be caused by data points with very different types of solutions, which is not necessarily an outlier issue. However, it can be treated as part of the outlier problem under the definition of outliers used in software organizations. That is, because the definition of an outlier is subjective and usually differs between organizations, data points with very different types of solutions may themselves be identified as outliers.

  2. CMMI, developed by Carnegie Mellon University’s Software Engineering Institute (SEI), is a software development process improvement approach whose goal is to help organizations improve their performance. At maturity level 3, the organization’s set of standard processes is well established and improved over time. Projects establish their defined processes by tailoring the organization’s set of standard processes according to tailoring guidelines (Chrissis et al. 2003).

  3. Note that, when K is equal to 1 and any similarity function is selected, all of the calculations for the final effort estimate (mean, median, and weighted mean) give the same results. Moreover, when K is equal to 2 and any similarity function is selected, the mean and the median give the same results.
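A toy check of this note, under an assumed analogy-based setup (the analogue efforts and similarity weights below are invented for illustration):

```python
# Sketch: with K = 1 analogue, the mean, median, and similarity-weighted mean
# of the analogues' efforts coincide; with K = 2, the mean equals the median.
import numpy as np

def combine(efforts, similarities):
    """Return (mean, median, weighted mean) of the K nearest analogues."""
    efforts = np.asarray(efforts, dtype=float)
    weights = np.asarray(similarities, dtype=float)
    weighted = np.sum(weights * efforts) / np.sum(weights)
    return efforts.mean(), np.median(efforts), weighted

print(combine([420.0], [0.8]))              # K = 1: all three equal 420.0
print(combine([420.0, 500.0], [0.8, 0.3]))  # K = 2: mean == median == 460.0; weighted mean differs
```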

References

  • Agulló J, Croux C, Van Aelst S (2008) The multivariate least-trimmed squares estimator. J Multivar Anal 99(3):311–338

  • Alpaydin E (2004) Introduction to machine learning. MIT Press, Cambridge

  • Barnett V, Lewis T (1994) Outliers in statistical data, 3rd edn. Wiley, New York

  • Boetticher GD, Menzies T, Ostrand TJ (2007) PROMISE Repository of empirical software engineering data. http://promisedata.org/repository, West Virginia University, Department of Computer Science

  • Cartwright MH, Shepperd MJ, Song Q (2003) Dealing with missing software project data. In: Proc 9th international software metrics symposium (METRICS ’03), pp 154–165

  • Chan V, Wong W (2007) Outlier elimination in construction of software metric models. In: Proc the 22nd ACM symposium on applied computing (SAC ’07), pp 1484–1488

  • Chiu NH, Huang SJ (2007) The adjusted analogy-based software effort estimation based on similarity distances. J Syst Softw 80(4):628–640

  • Chrissis MB, Konrad M, Shrum S (2003) CMMI: guidelines for process integration and product improvement. Addison-Wesley Professional

  • Conte S, Dunsmore H, Shen V (1986) Software engineering metrics and models. Benjamin/Cummings Publishing Company

  • Cook RD (1977) Detection of influential observation in linear regression. Technometrics 19(1):15–18

  • de Barcelos Tronto I, da Silva J, Sant’Anna N (2007) Comparison of artificial neural network and regression models in software effort estimation. In: Proc 2007 international joint conference on neural networks (IJCNN ’07), pp 771–776

  • Desharnais J (1989) Analyse statistique de la productivité des projets informatiques à partir de la technique des points de fonction. Master’s thesis, University of Montreal

  • Field A (2009) Discovering statistics using SPSS, 3rd edn. Sage Publications Ltd

  • Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995

  • Hamilton L (1992) Regression with graphics: a second course in applied statistics. Duxbury Press

  • Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann

  • Huang SJ, Chiu NH (2006) Optimization of analogy weights by genetic algorithm for software effort estimation. Inf Softw Technol 48(11):1034–1045

  • IFPUG (1994) Function point counting practices manual. International Function Point Users Group. www.ifpug.org

  • ISBSG (2005) International Software Benchmarking Standards Group. www.isbsg.org

  • Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

  • Jeffery R, Ruhe M, Wieczorek I (2000) A comparative study of two software development cost modeling techniques using multi-organizational and company-specific data. Inf Softw Technol 42(14):1009–1016

  • Jeffery R, Ruhe M, Wieczorek I (2001) Using public domain metrics to estimate software development effort. In: Proc 7th IEEE international software metrics symposium (METRICS ’01), pp 16–27

  • Jorgensen M, Shepperd MJ (2007) A systematic review of software development cost estimation studies. IEEE Trans Softw Eng 33(1):33–53

  • Keung JW, Kitchenham BA, Jeffery DR (2008) Analogy-X: providing statistical inference to analogy-based software cost estimation. IEEE Trans Softw Eng 34(4):471–484

  • Kirsopp C, Shepperd MJ (2002) Making inferences with small numbers of training sets. IEE Proc Softw 149(5):123–130

  • Kitchenham B, MacDonell S, Pickard L, Shepperd MJ (1999) Assessing prediction systems. The Information Science Discussion Paper Series, University of Otago

  • Kocaguneli E, Menzies T, Bener A, Keung J (2012) Exploiting the essential assumptions of analogy-based effort estimation. IEEE Trans Softw Eng 38(2):425–438

  • Kultur Y, Turhan B, Bener AB (2008) ENNA: software effort estimation using ensemble of neural networks with associative memory. In: Proc 16th ACM SIGSOFT international symposium on foundations of software engineering (FSE ’08), pp 330–338

  • Li YF, Xie M, Goh TN (2009) A study of project selection and feature weighting for analogy based software cost estimation. J Syst Softw 82(2):241–252

  • Liu Q, Qin W, Mintram R, Ross M (2008) Evaluation of preliminary data analysis framework in software cost estimation based on ISBSG R9 data. Softw Q J 16(3):411–458

  • Lokan C, Mendes E (2006) Cross-company and single-company effort models using the ISBSG database: a further replicated study. In: Proc 2006 ACM/IEEE international symposium on empirical software engineering (ISESE ’06), pp 75–84

  • MacDonell SG, Shepperd MJ (2003) Combining techniques to optimize effort predictions in software project management. J Syst Softw 66(2):91–98

  • Mair C, Shepperd MJ (2005) The consistency of empirical comparisons of regression and analogy-based software project cost prediction. In: Proc 2005 ACM/IEEE international symposium on empirical software engineering (ISESE ’05), pp 509–518

  • Maxwell KD (2002) Applied statistics for software managers. Prentice Hall

  • Mendes E, Lokan C (2008) Replicating studies on cross- vs single-company effort models using the ISBSG database. Empir Software Eng 13(1):3–37

  • Mendes M, Pala A (2003) Type I error rate and power of three normality tests. Pakistan J Inf Technol 2(2):135–139

  • Menzies T, Jalali O, Hihn J, Baker D, Lum K (2010) Stable rankings for different effort models. Autom Softw Eng 17(4):409–437

  • Menzies T, Butcher A, Marcus A, Zimmermann T, Cok DR (2011) Local vs. global models for effort estimation and defect prediction. In: Proc 26th IEEE/ACM international conference on automated software engineering (ASE ’11), pp 343–351

  • Mittas N, Angelis L (2008) Combining regression and estimation by analogy in a semi-parametric model for software cost estimation. In: Proc second ACM-IEEE international symposium on empirical software engineering and measurement (ESEM ’08), pp 70–79

  • Miyazaki Y, Takanou A, Nozaki H, Nakagawa N, Okada K (1991) Method to estimate parameter values in software prediction models. Inf Softw Technol 33(3):239–243

  • Miyazaki Y, Terakado M, Ozaki K, Nozaki H (1994) Robust regression for developing software estimation models. J Syst Softw 27(1):3–16

  • Myrtveit I, Stensrud E, Shepperd MJ (2005) Reliability and validity in comparative studies of software prediction models. IEEE Trans Softw Eng 31(5):380–391

  • Ott RL, Longnecker MT (2008) An introduction to statistical methods and data analysis, 6th edn. Duxbury Press

  • Pendharkar P, Subramanian G, Rodger J (2005) A probabilistic model for predicting software development effort. IEEE Trans Softw Eng 31(7):615–624

  • Rousseeuw PJ (1984) Least median of squares regression. J Am Stat Assoc 79(388):871–880

  • Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20(1):53–65

  • Rousseeuw P, Leroy A (1987) Robust regression and outlier detection. Wiley, New York

  • Rousseeuw P, van Driessen K (2006) Computing LTS regression for large data sets. Data Min Knowl Discovery 12(1):29–45

  • Seo YS, Yoon KA, Bae DH (2008) An empirical analysis of software effort estimation with outlier elimination. In: Proc 4th international workshop on predictor models in software engineering (PROMISE ’08), pp 25–32

  • Seo YS, Yoon KA, Bae DH (2009) Improving the accuracy of software effort estimation based on multiple least square regression models by estimation error-based data partitioning. In: Proc 2009 16th Asia–Pacific software engineering conference (APSEC ’09), pp 3–10

  • Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples). Biometrika 52(3–4):591–611

  • Shepperd MJ, Kadoda G (2001) Comparing software prediction techniques using simulation. IEEE Trans Softw Eng 27(11):1014–1022

  • Shepperd MJ, Schofield C (1997) Estimating software project effort using analogies. IEEE Trans Softw Eng 23(11):736–743

  • Strike K, El Emam K, Madhavji N (2001) Software cost estimation with incomplete data. IEEE Trans Softw Eng 27(10):890–908

  • Van Hulse J, Khoshgoftaar TM (2008) A comprehensive empirical evaluation of missing value imputation in noisy software measurement data. J Syst Softw 81(5):691–708

  • Wen J, Li S, Tang L (2009) Improve analogy-based software effort estimation using principal components analysis and correlation weighting. In: Proc 2009 16th Asia–Pacific software engineering conference (APSEC ’09), pp 179–186

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. This work was partially supported by Defense Acquisition Program Administration and Agency for Defense Development under the contract.

Author information

Corresponding author

Correspondence to Yeong-Seok Seo.

Additional information

Editor: Martin Shepperd

About this article

Cite this article

Seo, YS., Bae, DH. On the value of outlier elimination on software effort estimation research. Empir Software Eng 18, 659–698 (2013). https://doi.org/10.1007/s10664-012-9207-y
