On the value of outlier elimination on software effort estimation research

Abstract

Producing accurate and reliable software effort estimates has always been a challenge for both academic research and the software industry. In this regard, data quality is an important factor affecting the accuracy of effort estimation methods. To assess its impact, we investigated how eliminating outliers affects the estimation accuracy of commonly used software effort estimation methods. Guided by three research questions, we analyzed the influence of outlier elimination on estimation accuracy by combining five outlier elimination methods (least trimmed squares, Cook’s distance, K-means clustering, box plot, and the Mantel leverage metric) with two effort estimation methods (least squares regression and estimation by analogy with varying parameters). Empirical experiments were performed on industrial data sets (ISBSG Release 9, the Bank and Stock data sets collected from financial companies, and the Desharnais data set from the PROMISE repository), and the effect of the outlier elimination methods was evaluated with statistical tests (the Friedman test and the Wilcoxon signed-rank test). The experimental results based on the evaluation criteria showed no substantial difference between the effort estimates obtained with and without outlier elimination. However, the statistical analysis indicated that outlier elimination significantly improves estimation accuracy on the Stock data set for some combinations of outlier elimination and effort estimation methods. Although outlier elimination did not significantly improve accuracy on the other data sets, our graphical analysis of errors showed that it can increase the likelihood of producing more accurate estimates for new software projects. Therefore, from a practical point of view, outlier elimination should be considered, and the estimation results should be analyzed in detail, to improve the accuracy of software effort estimation in software organizations.
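As a concrete illustration of the kind of pipeline evaluated in this study, the sketch below (not the authors' code) pairs one outlier elimination method, the box plot rule, with one estimation method, least squares regression on log-transformed size and effort, and compares the absolute errors obtained with and without elimination using the Wilcoxon signed-rank test. The synthetic data, column choices, log-log model form, and hold-out split are illustrative assumptions rather than the paper's exact experimental protocol.

```python
# Minimal sketch: box-plot outlier elimination + least squares regression,
# compared against no elimination with a Wilcoxon signed-rank test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical project data: functional size (FP) and actual effort (hours).
size = rng.uniform(50, 2000, 100)
effort = 12.0 * size ** 0.95 * rng.lognormal(0.0, 0.4, 100)

def boxplot_filter(y):
    """Keep projects whose log-effort lies within the 1.5*IQR whiskers."""
    q1, q3 = np.percentile(np.log(y), [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (np.log(y) >= lo) & (np.log(y) <= hi)

def fit_ols_loglog(x, y):
    """Least squares fit of log(effort) = b0 + b1 * log(size)."""
    X = np.column_stack([np.ones_like(x), np.log(x)])
    beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    return beta

def predict(beta, x):
    return np.exp(beta[0] + beta[1] * np.log(x))

# Hold out the last 30 projects as the "new projects to be estimated".
train, test = slice(0, 70), slice(70, 100)

beta_all = fit_ols_loglog(size[train], effort[train])
keep = boxplot_filter(effort[train])
beta_clean = fit_ols_loglog(size[train][keep], effort[train][keep])

mre_all = np.abs(predict(beta_all, size[test]) - effort[test]) / effort[test]
mre_clean = np.abs(predict(beta_clean, size[test]) - effort[test]) / effort[test]

print(f"MMRE without elimination: {mre_all.mean():.3f}")
print(f"MMRE with elimination:    {mre_clean.mean():.3f}")

# Paired, non-parametric comparison of the per-project relative errors.
print(stats.wilcoxon(mre_all, mre_clean))
```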

Notes

  1. The situation in Fig. 2 may be caused by data points with very different types of solutions, which is not necessarily an outlier issue. However, it can be treated as part of the outlier problem under the definition of outliers used in software organizations. That is, because the definition of an outlier is subjective and usually differs between organizations, data points with very different types of solutions may themselves be identified as outliers.

  2. CMMI, developed by Carnegie Mellon University’s Software Engineering Institute (SEI), is a software development process improvement approach whose goal is to help organizations improve their performance. At maturity level 3, the organization’s set of standard processes is well established and improved over time. Projects establish their defined processes by tailoring the organization’s set of standard processes according to tailoring guidelines (Chrissis et al. 2003).

  3. Note that, when K is equal to 1 and any similarity function is selected, all of the calculations for the final effort estimate (mean, median, and weighted mean) give the same results. Moreover, when K is equal to 2 and any similarity function is selected, the mean and the median give the same results.
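A toy check of this note, under an assumed analogy-based setup (the analogue efforts and similarity weights below are invented for illustration):

```python
# Sketch: with K = 1 analogue, the mean, median, and similarity-weighted mean
# of the analogues' efforts coincide; with K = 2, the mean equals the median.
import numpy as np

def combine(efforts, similarities):
    """Return (mean, median, weighted mean) of the K nearest analogues."""
    efforts = np.asarray(efforts, dtype=float)
    weights = np.asarray(similarities, dtype=float)
    weighted = np.sum(weights * efforts) / np.sum(weights)
    return efforts.mean(), np.median(efforts), weighted

print(combine([420.0], [0.8]))              # K = 1: all three equal 420.0
print(combine([420.0, 500.0], [0.8, 0.3]))  # K = 2: mean == median == 460.0; weighted mean differs
```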

References

  • Agulló J, Croux C, Van Aelst S (2008) The multivariate least-trimmed squares estimator. J Multivar Anal 99(3):311–338

  • Alpaydin E (2004) Introduction to machine learning. MIT Press, Cambridge

  • Barnett V, Lewis T (1994) Outliers in statistical data, 3rd edn. Wiley, New York

  • Boetticher GD, Menzies T, Ostrand TJ (2007) PROMISE Repository of empirical software engineering data. http://promisedata.org/repository, West Virginia University, Department of Computer Science

  • Cartwright MH, Shepperd MJ, Song Q (2003) Dealing with missing software project data. In: Proc 9th international software metrics symposium (METRICS ’03), pp 154–165

  • Chan V, Wong W (2007) Outlier elimination in construction of software metric models. In: Proc the 22nd ACM symposium on applied computing (SAC ’07), pp 1484–1488

  • Chiu NH, Huang SJ (2007) The adjusted analogy-based software effort estimation based on similarity distances. J Syst Softw 80(4):628–640

  • Chrissis MB, Konrad M, Shrum S (2003) CMMI: guidelines for process integration and product improvement. Addison-Wesley Professional

  • Conte S, Dunsmore H, Shen V (1986) Software engineering metrics and models. Benjamin/Cummings Publishing Company

  • Cook RD (1977) Detection of influential observation in linear regression. Technometrics 19(1):15–18

  • de Barcelos Tronto I, da Silva J, Sant’Anna N (2007) Comparison of artificial neural network and regression models in software effort estimation. In: Proc 2007 international joint conference on neural networks (IJCNN ’07), pp 771–776

  • Desharnais J (1989) Analyse statistique de la productivité des projets informatiques à partir de la technique des points de fonction. Master’s thesis, University of Montreal

  • Field A (2009) Discovering statistics using SPSS, 3rd edn. Sage Publications Ltd

  • Foss T, Stensrud E, Kitchenham B, Myrtveit I (2003) A simulation study of the model evaluation criterion MMRE. IEEE Trans Softw Eng 29(11):985–995

  • Hamilton L (1992) Regression with graphics: a second course in applied statistics. Duxbury Press

  • Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann

  • Huang SJ, Chiu NH (2006) Optimization of analogy weights by genetic algorithm for software effort estimation. Inf Softw Technol 48(11):1034–1045

  • IFPUG (1994) Function point counting practices manual. International Function Point Users Group. www.ifpug.org

  • ISBSG (2005) International Software Benchmarking Standards Group. www.isbsg.org

  • Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

  • Jeffery R, Ruhe M, Wieczorek I (2000) A comparative study of two software development cost modeling techniques using multi-organizational and company-specific data. Inf Softw Technol 42(14):1009–1016

  • Jeffery R, Ruhe M, Wieczorek I (2001) Using public domain metrics to estimate software development effort. In: Proc 7th IEEE international software metrics symposium (METRICS ’01), pp 16–27

  • Jorgensen M, Shepperd MJ (2007) A systematic review of software development cost estimation studies. IEEE Trans Softw Eng 33(1):33–53

  • Keung JW, Kitchenham BA, Jeffery DR (2008) Analogy-X: providing statistical inference to analogy-based software cost estimation. IEEE Trans Softw Eng 34(4):471–484

  • Kirsopp C, Shepperd MJ (2002) Making inferences with small numbers of training sets. IEE Proc Softw 149(5):123–130

  • Kitchenham B, MacDonell S, Pickard L, Shepperd MJ (1999) Assessing prediction systems. The Information Science Discussion Paper Series, University of Otago

  • Kocaguneli E, Menzies T, Bener A, Keung J (2012) Exploiting the essential assumptions of analogy-based effort estimation. IEEE Trans Softw Eng 38(2):425–438

  • Kultur Y, Turhan B, Bener AB (2008) ENNA: software effort estimation using ensemble of neural networks with associative memory. In: Proc 16th ACM SIGSOFT international symposium on foundations of software engineering (FSE ’08), pp 330–338

  • Li YF, Xie M, Goh TN (2009) A study of project selection and feature weighting for analogy based software cost estimation. J Syst Softw 82(2):241–252

  • Liu Q, Qin W, Mintram R, Ross M (2008) Evaluation of preliminary data analysis framework in software cost estimation based on ISBSG R9 data. Softw Q J 16(3):411–458

  • Lokan C, Mendes E (2006) Cross-company and single-company effort models using the ISBSG database: a further replicated study. In: Proc 2006 ACM/IEEE international symposium on empirical software engineering (ISESE ’06), pp 75–84

  • MacDonell SG, Shepperd MJ (2003) Combining techniques to optimize effort predictions in software project management. J Syst Softw 66(2):91–98

  • Mair C, Shepperd MJ (2005) The consistency of empirical comparisons of regression and analogy-based software project cost prediction. In: Proc 2005 ACM/IEEE international symposium on empirical software engineering (ISESE ’05), pp 509–518

  • Maxwell KD (2002) Applied statistics for software managers. Prentice Hall

  • Mendes E, Lokan C (2008) Replicating studies on cross- vs single-company effort models using the ISBSG database. Empir Software Eng 13(1):3–37

  • Mendes M, Pala A (2003) Type I error rate and power of three normality tests. Pakistan J Inf Technol 2(2):135–139

  • Menzies T, Jalali O, Hihn J, Baker D, Lum K (2010) Stable rankings for different effort models. Autom Softw Eng 17(4):409–437

  • Menzies T, Butcher A, Marcus A, Zimmermann T, Cok DR (2011) Local vs. global models for effort estimation and defect prediction. In: Proc 26th IEEE/ACM international conference on automated software engineering (ASE ’11), pp 343–351

  • Mittas N, Angelis L (2008) Combining regression and estimation by analogy in a semi-parametric model for software cost estimation. In: Proc second ACM-IEEE international symposium on empirical software engineering and measurement (ESEM ’08), pp 70–79

  • Miyazaki Y, Takanou A, Nozaki H, Nakagawa N, Okada K (1991) Method to estimate parameter values in software prediction models. Inf Softw Technol 33(3):239–243

  • Miyazaki Y, Terakado M, Ozaki K, Nozaki H (1994) Robust regression for developing software estimation models. J Syst Softw 27(1):3–16

  • Myrtveit I, Stensrud E, Shepperd MJ (2005) Reliability and validity in comparative studies of software prediction models. IEEE Trans Softw Eng 31(5):380–391

  • Ott RL, Longnecker MT (2008) An introduction to statistical methods and data analysis, 6th edn. Duxbury Press

  • Pendharkar P, Subramanian G, Rodger J (2005) A probabilistic model for predicting software development effort. IEEE Trans Softw Eng 31(7):615–624

  • Rousseeuw PJ (1984) Least median of squares regression. J Am Stat Assoc 79(388):871–880

  • Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20(1):53–65

  • Rousseeuw P, Leroy A (1987) Robust regression and outlier detection. Wiley, New York

  • Rousseeuw P, van Driessen K (2006) Computing LTS regression for large data sets. Data Min Knowl Discovery 12(1):29–45

  • Seo YS, Yoon KA, Bae DH (2008) An empirical analysis of software effort estimation with outlier elimination. In: Proc 4th international workshop on predictor models in software engineering (PROMISE ’08), pp 25–32

  • Seo YS, Yoon KA, Bae DH (2009) Improving the accuracy of software effort estimation based on multiple least square regression models by estimation error-based data partitioning. In: Proc 2009 16th Asia–Pacific software engineering conference (APSEC ’09), pp 3–10

  • Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples). Biometrika 52(3–4):591–611

  • Shepperd MJ, Kadoda G (2001) Comparing software prediction techniques using simulation. IEEE Trans Softw Eng 27(11):1014–1022

  • Shepperd MJ, Schofield C (1997) Estimating software project effort using analogies. IEEE Trans Softw Eng 23(11):736–743

  • Strike K, El Emam K, Madhavji N (2001) Software cost estimation with incomplete data. IEEE Trans Softw Eng 27(10):890–908

  • Van Hulse J, Khoshgoftaar TM (2008) A comprehensive empirical evaluation of missing value imputation in noisy software measurement data. J Syst Softw 81(5):691–708

  • Wen J, Li S, Tang L (2009) Improve analogy-based software effort estimation using principal components analysis and correlation weighting. In: Proc 2009 16th Asia–Pacific software engineering conference (APSEC ’09), pp 179–186

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. This work was partially supported by Defense Acquisition Program Administration and Agency for Defense Development under the contract.

Author information

Corresponding author

Correspondence to Yeong-Seok Seo.

Additional information

Editor: Martin Shepperd

About this article

Cite this article

Seo, YS., Bae, DH. On the value of outlier elimination on software effort estimation research. Empir Software Eng 18, 659–698 (2013). https://doi.org/10.1007/s10664-012-9207-y
