Towards improving statistical modeling of software engineering data: think locally, act globally!

Bettenburg, Nicolas; Nagappan, Meiyappan; Hassan, Ahmed E.

doi:10.1007/s10664-013-9292-6

Published: 15 January 2014

Empirical Software Engineering, Volume 20, pages 294–335 (2015)

Abstract

Much research in software engineering (SE) is focused on modeling data collected from software repositories. Insights gained over the last decade suggest that such datasets contain a high amount of variability. Such variability has a detrimental effect on model quality, as suggested by recent research. In this paper, we propose to split the data into smaller homogeneous subsets and learn sets of individual statistical models, one for each subset, as a way around the high variability in such data. Our case study on a variety of SE datasets demonstrates that such local models can significantly outperform traditional models with respect to model fit and predictive performance. However, we find that analysts need to be aware of potential pitfalls when building local models. Firstly, the choice of clustering algorithm and its parameters can have a substantial impact on model quality. Secondly, the data being modeled needs to have enough variability to take full advantage of local modeling. For example, our case study on social data shows no advantage of local over global modeling, as clustering fails to derive appropriate subsets. Lastly, the interpretation of local models can become very complex when there is a large number of variables or data subsets. Overall, we find that a hybrid approach between local and traditional global modeling, such as Multivariate Adaptive Regression Splines (MARS), combines the best of both worlds. MARS models are non-parametric and thus do not require prior calibration of parameters, are easily interpretable by analysts, and outperform local as well as traditional models out of the box in four out of five datasets in our case study.
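
The contrast between a global model and a set of local models can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (two regimes with opposite trends), the clustering is a hand-written one-dimensional k-means standing in for the clustering techniques studied in the article, and the models are simple least-squares lines. It compares model fit only (sum of squared errors on the fitted data) and does not reproduce the article's datasets, tooling, or evaluation of predictive performance.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def sse(model, xs, ys):
    """Sum of squared errors of a fitted line on (xs, ys)."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def assign(xs, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: abs(x - centers[c])) for x in xs]

def kmeans_1d(xs, k=2, iters=20):
    """Lloyd's algorithm in one dimension with quantile-based initial centers."""
    ordered = sorted(xs)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        labels = assign(xs, centers)
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean(members)
    return assign(xs, centers)

# Synthetic, hypothetical data: two regimes with opposite trends plus noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(100)] + \
     [random.uniform(20, 30) for _ in range(100)]
ys = [2 * x + random.gauss(0, 1) if x < 15 else 100 - 3 * x + random.gauss(0, 1)
      for x in xs]

# Global model: one line fitted to all of the data.
global_sse = sse(fit_line(xs, ys), xs, ys)

# Local models: cluster first, then fit one line per cluster.
labels = kmeans_1d(xs, k=2)
local_sse = 0.0
for c in set(labels):
    cx = [x for x, lab in zip(xs, labels) if lab == c]
    cy = [y for y, lab in zip(ys, labels) if lab == c]
    local_sse += sse(fit_line(cx, cy), cx, cy)

print(f"global SSE: {global_sse:.1f}, local SSE: {local_sse:.1f}")
```

On data like this the local fit is much tighter, because a single line cannot follow both trends. The first pitfall named in the abstract is visible here as well: the result depends on the choice of `k` and on how the clusters are formed, and when the data contain no separable regimes the split brings little benefit.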

References

Ackerman M, Ben-David S (2009) Clusterability: a theoretical study. In: Proceedings of the twelfth international conference on artificial intelligence and statistics (AISTATS’09), JMLR workshop and conference proceedings, vol 5. pp 1–8
Akaike H (1974) A new look at the statistical model identification. Autom Control IEEE Trans 19(6):716–723
Andreou A, Papatheocharous E (2008) Software cost estimation using fuzzy decision trees. In: Automated software engineering, 2008. ASE 2008. 23rd IEEE/ACM International Conference, pp 371–374
Attoh-Okine N, Mensah S, Nawaiseh M, Hall D (2001) Using multivariate adaptive regression splines (MARS) in pavement roughness prediction. Strategy
Barkmann H, Lincke R, Lowe W (2009) Quantitative evaluation of software quality metrics in open-source projects. In: Proceedings of the 2009 international conference on advanced information networking and applications workshops, WAINA ’09. IEEE Computer Society, Washington, DC, pp 1067–1072
Bettenburg N, Hassan AE (2010) Studying the impact of social structures on software quality. In: Proceedings of the 2010 IEEE 18th international conference on program comprehension, ICPC ’10. IEEE Computer Society, Washington, DC, pp 124–133
Bettenburg N, Nagappan M, Hassan A (2012) Think locally, act globally: improving defect and effort prediction models. In: Mining software repositories (MSR), 2012 9th IEEE working conference. pp 60–69
Di Penta M (2011) Nothing else matters: what predictive model should I use? In: Proceedings of the 7th international conference on predictive models in software engineering, PROMISE ’11. ACM, New York, NY, pp 10:1–10:3
Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. J Syst Softw 81:649–660
Fox J (2008) Applied regression analysis and generalized linear models, 2nd edn. Sage, Los Angeles, London
Fraley C, Raftery AE (2007) Bayesian regularization for normal mixture estimation and model-based clustering. J Classif 24(2):155–181
Fraley C, Raftery AE (2009) Mclust version 3 for R: Normal mixture modeling and model-based clustering. Technical Report 504, University of Washington, Department of Statistics, Seattle, 2006 (subsequent revisions)
Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67
Harrell FE (2001) Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer Series in Statistics. Springer, New York, 571 pp
Hartigan JA, Wong MA (1979) A k-means clustering algorithm. Appl Stat 28(1):100–108
Kamei Y, Matsumoto S, Monden A, Matsumoto Ki, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort-aware models. In: Proceedings of the 2010 IEEE international conference on software maintenance, ICSM ’10. IEEE Computer Society, pp 1–10
Li M, Zhang H, Wu R, Zhou ZH (2012) Sample-based software defect prediction with active and semi-supervised learning. Autom Softw Eng 19(2):201–230. doi:10.1007/s10515-011-0092-1
McQuitty L (1966) Similarity analysis by reciprocal pairs for discrete and continuous data. Educ Psychol Meas 26(4):825–831
Menzies T, Butcher A, Cok D, Layman L, Marcus A, Shull F, Turhan B, Zimmermann T (2013) Local vs. global lessons from defect prediction and effort estimation. IEEE Trans Softw Eng (to appear)
Menzies T, Butcher A, Marcus A, Zimmermann T, Cok D (2011) Local vs global models for effort estimation and defect prediction. In: Proceedings of the 26th IEEE/ACM international conference on automated software engineering
Menzies T, Caglayan B, Kocaguneli E, Krall J, Peters F, Turhan B (2012) The promise repository of empirical software engineering data. http://promisedata.googlecode.com
Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13
Mockus A, Weiss DM (2000) Predicting risk of software changes. Bell Labs Tech J 5:169–180
Mockus A, Weiss DM, Zhang P (2003) Understanding and predicting effort in software projects. In: Proceedings of the 25th international conference on software engineering, ICSE ’03. IEEE Computer Society, Washington, DC, pp 274–284
Mockus A, Zhang P, Li PL (2005) Predictors of customer perceived software quality. In: Proceedings of the 27th international conference on Software engineering, ICSE ’05. ACM, New York, NY, pp 225–233
Nagappan N, Ball T (2005) Use of relative code churn measures to predict system defect density. In: Proceedings of the 27th international conference on software engineering, ICSE ’05. ACM, pp 284–292
Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on software engineering, ICSE ’06. ACM, New York, NY, pp 452–461
Nguyen THD, Adams B, Hassan AE (2010) Studying the impact of dependency network measures on software quality. In: Proceedings of the 2010 IEEE international conference on software maintenance. IEEE Computer Society, pp 1–10
Osei-Bryson KM, Ko M (2004) Exploring the relationship between information technology investments and firm performance using regression splines analysis. Inf Manag 42(1):1–13
Posnett D, Filkov V, Devanbu P (2011) Ecological inference in empirical software engineering. Int Conf Autom Softw Eng 362–371
Raftery AE (2007) Bayesian model selection in social research. Sociol Methodol 25:111–163
Rahman F, Devanbu P (2013) How, and why, process metrics are better. In: Proceedings of the 2013 international conference on software engineering, ICSE ’13. IEEE Computer Society, Washington, DC, pp 432–441
Rice JA (2001) Mathematical statistics and data analysis, Series: Wadsworth statistics/probability series, 2nd ed. illustrated, 1995. Duxbury Press, Pacific Grove, 602 pp
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shepperd M (1988) A critique of cyclomatic complexity as a software metric. Softw Eng J 3(2):30–36
Shihab E, Bird C, Zimmermann T (2012) The effect of branching strategies on software quality. In: ESEM. pp 301–310
Witten IH, Frank E (2005) Data Mining: practical machine learning tools and techniques, 2nd edn (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco
York TP, Eaves LJ (2001) Common disease analysis using multivariate adaptive regression splines (MARS): Genetic Analysis Workshop 12 simulated sequence data. Genet Epidemiol 21(Suppl 1):S649–S654
Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC/FSE ’09. ACM, New York, NY, pp 91–100
Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for Eclipse. In: Proceedings of the third international workshop on predictor models in software engineering, PROMISE ’07. IEEE Computer Society, Washington, DC, p 9

Author information

Authors and Affiliations

Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen’s University, Kingston, Ontario, K1N 3L6, Canada
Nicolas Bettenburg, Meiyappan Nagappan & Ahmed E. Hassan

Corresponding author

Correspondence to Nicolas Bettenburg.

Additional information

Communicated by: Massimiliano Di Penta and Tao Xie

Appendices

Appendix A

Further descriptions of the individual metrics used in these datasets can be found in the work by Menzies et al. (2011) and in our previous work on social metrics (Bettenburg and Hassan 2010).

1.1 Metrics in the Lucene 2.4 Dataset

Dependent Variables: bug
Independent Variables: amc, avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, max_cc, mfa, moa, noc, npm

1.2 Metrics in the Xalan 2.6 Dataset

Dependent Variables: bug
Independent Variables: avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, npm

1.3 Metrics in the CHINA Dataset

Dependent Variables: Effort
Independent Variables: Input, Output, Enquiry, File, Interface, Changed, PDR_UFP, Resource, Duration

1.4 Metrics in the NASACOC Dataset

Dependent Variables: months
Independent Variables: pmat, rely, data, cplx, time, stor, pvol, pcap, apex, plex, ltex, site, sced, kloc, effort

1.5 Metrics in the Eclipse 3.0 (Code) Dataset

Dependent Variables: post
Independent Variables: pre, ImportDeclaration, VG_sum, PrefixExpression, NOM_avg, NOF_avg, TLOC, NullLiteral, NOM_max, BooleanLiteral, SwitchCase, SimpleName, SuperMethodInvocation, LabeledStatement, Block, NORM_Assignment, Initializer, NSM_avg, InfixExpression, Assignment, NumberLiteral, VariableDeclarationFragment, Javadoc, NORM_FieldDeclaration, ForStatement, Modifier, NORM_ArrayCreation, MethodInvocation, VariableDeclarationExpression, ArrayCreation, ExpressionStatement, NORM_PostfixExpression, InstanceofExpression, SwitchStatement, ArrayType, SynchronizedStatement

1.6 Metrics in the Eclipse 3.0 (Social) Dataset

Dependent Variables: post
Independent Variables: NSCOM, PATCHS, NSOURCE, NPATCH, NTRACE, TRACES, NLINK, NPART, NDEVS, NUSERS, SNACENT, NMSG, REPLY, REPLYE, DLEN, DLENE, INT, INTE, WA, WAE

Appendix B

Table 9 Results: Using different clustering techniques for building local models in comparison to global models and global models with local considerations (denoted with MARS)

Table 10 Performance of local models built through hierarchical clustering, with different values for parameter k (number of clusters)

Table 11 Performance of local models built through k-means clustering, with different values for parameter k (number of clusters)

About this article

Cite this article

Bettenburg, N., Nagappan, M. & Hassan, A.E. Towards improving statistical modeling of software engineering data: think locally, act globally!. Empir Software Eng 20, 294–335 (2015). https://doi.org/10.1007/s10664-013-9292-6

Issue Date: April 2015