Towards improving statistical modeling of software engineering data: think locally, act globally!

Empirical Software Engineering

Abstract

Much research in software engineering (SE) focuses on modeling data collected from software repositories. Insights gained over the last decade suggest that such datasets contain a high degree of variability, and recent research suggests that this variability has a detrimental effect on model quality. In this paper, we propose to work around the high variability by splitting the data into smaller, homogeneous subsets and learning a set of individual statistical models, one for each subset. Our case study on a variety of SE datasets demonstrates that such local models can significantly outperform traditional models with respect to model fit and predictive performance. However, we find that analysts need to be aware of potential pitfalls when building local models. First, the choice of clustering algorithm and its parameters can have a substantial impact on model quality. Second, the data being modeled needs to have enough variability to take full advantage of local modeling. For example, our case study on social data shows no advantage of local over global modeling, as clustering fails to derive appropriate subsets. Third, the interpretation of local models can become very complex when there is a large number of variables or data subsets. Overall, we find that a hybrid approach between local and traditional global modeling, such as Multivariate Adaptive Regression Splines (MARS), combines the best of both worlds. MARS models are non-parametric and thus do not require prior calibration of parameters, are easily interpretable by analysts, and outperform both local and traditional models out of the box in four out of five datasets in our case study.
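
The local-versus-global idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the function names (`fit_line`, `sse`, `kmeans_1d`) are hypothetical, and a one-dimensional k-means stands in for whichever clustering algorithm an analyst might choose. The sketch fits one regression line to all observations (global model), then clusters the observations and fits one line per cluster (local models), and compares the sum of squared errors as a rough measure of model fit.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def sse(model, xs, ys):
    """Sum of squared errors of a fitted line on the given points."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def kmeans_1d(xs, k=2, iters=20):
    """Lloyd's algorithm on a single feature; returns one cluster label per point."""
    lo, hi = min(xs), max(xs)
    # Deterministic initial centers, spread evenly over the data range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels

# Synthetic data (an assumption for illustration only): the response follows
# two different linear regimes, plus a small deterministic perturbation.
xs = list(range(1, 21))
noise = [((7 * x) % 5 - 2) * 0.1 for x in xs]
ys = [(0.5 * x if x <= 10 else 5 + 3 * (x - 10)) + e for x, e in zip(xs, noise)]

# Global model: a single line fitted to all observations.
global_sse = sse(fit_line(xs, ys), xs, ys)

# Local models: cluster first, then fit one line per cluster.
labels = kmeans_1d(xs, k=2)
local_sse = 0.0
for c in sorted(set(labels)):
    cx = [x for x, lab in zip(xs, labels) if lab == c]
    cy = [y for y, lab in zip(ys, labels) if lab == c]
    local_sse += sse(fit_line(cx, cy), cx, cy)

print(f"global SSE = {global_sse:.2f}, local SSE = {local_sse:.2f}")
```

On data like this, where the clusters coincide with the two regimes, the local models fit far better than the single global line. The pitfalls named in the abstract also show up here: a different value of `k` or a different clustering algorithm changes the subsets, and with them the quality of the local models.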

Notes

  1. http://www.msrconf.org

  2. http://promisedata.org

References

  • Ackerman M, Ben-David S (2009) Clusterability: a theoretical study. In: Proceedings of the twelfth international conference on artificial intelligence and statistics (AISTATS’09), JMLR workshop and conference proceedings, vol 5, pp 1–8

  • Akaike H (1974) A new look at the statistical model identification. Autom Control IEEE Trans 19(6):716–723

  • Andreou A, Papatheocharous E (2008) Software cost estimation using fuzzy decision trees. In: Automated software engineering, 2008. ASE 2008. 23rd IEEE/ACM International Conference, pp 371–374

  • Attoh-Okine N, Mensah S, Nawaiseh M, Hall D (2001) Using multivariate adaptive regression splines (MARS) in pavement roughness prediction. Strategy

  • Barkmann H, Lincke R, Lowe W (2009) Quantitative evaluation of software quality metrics in open-source projects. In: Proceedings of the 2009 international conference on advanced information networking and applications workshops, WAINA ’09. IEEE Computer Society, Washington, DC, pp 1067–1072

  • Bettenburg N, Hassan AE (2010) Studying the impact of social structures on software quality. In: Proceedings of the 2010 IEEE 18th international conference on program comprehension, ICPC ’10. IEEE Computer Society, Washington, DC, pp 124–133

  • Bettenburg N, Nagappan M, Hassan A (2012) Think locally, act globally: improving defect and effort prediction models. In: Mining software repositories (MSR), 2012 9th IEEE working conference, pp 60–69

  • Di Penta M (2011) Nothing else matters: what predictive model should I use? In: Proceedings of the 7th international conference on predictive models in software engineering, PROMISE ’11. ACM, New York, NY, pp 10:1–10:3

  • Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. J Syst Softw 81:649–660

  • Fox J (2008) Applied regression analysis and generalized linear models, 2nd edn. Sage, Los Angeles, London

  • Fraley C (2007) Bayesian regularization for normal mixture estimation and model-based clustering. J Classif 24(2):155–181

  • Fraley C, Raftery AE (2009) Mclust version 3 for R: Normal mixture modeling and model-based clustering. Technical Report 504, University of Washington, Department of Statistics, Seattle, 2006 (subsequent revisions)

  • Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67

  • Harrell FE (2001) Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer series in statistics. Springer, New York, 571 pp

  • Hartigan JA, Wong MA (1979) A k-means clustering algorithm. Appl Stat 28(1):100–108

  • Kamei Y, Matsumoto S, Monden A, Matsumoto Ki, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort-aware models. In: Proceedings of the 2010 IEEE international conference on software maintenance, ICSM ’10. IEEE Computer Society, pp 1–10

  • Li M, Zhang H, Wu R, Zhou ZH (2012) Sample-based software defect prediction with active and semi-supervised learning. Autom Softw Eng 19(2):201–230. doi:10.1007/s10515-011-0092-1

  • McQuitty L (1966) Similarity analysis by reciprocal pairs for discrete and continuous data. Educ Psychol Meas 26(4):825–831

  • Menzies T, Butcher A, Cok D, Layman L, Marcus A, Shull F, Turhan B, Zimmermann T (2013) Local vs. global lessons from defect prediction and effort estimation. IEEE Trans Softw Eng (to appear)

  • Menzies T, Butcher A, Marcus A, Zimmermann T, Cok D (2011) Local vs global models for effort estimation and defect prediction. In: Proceedings of the 26th IEEE/ACM international conference on automated software engineering

  • Menzies T, Caglayan B, Kocaguneli E, Krall J, Peters F, Turhan B (2012) The promise repository of empirical software engineering data. http://promisedata.googlecode.com

  • Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13

  • Mockus A, Weiss DM (2000) Predicting risk of software changes. Bell Labs Tech J 5:169–180

  • Mockus A, Weiss DM, Zhang P (2003) Understanding and predicting effort in software projects. In: Proceedings of the 25th international conference on software engineering, ICSE ’03. IEEE Computer Society, Washington, DC, pp 274–284

  • Mockus A, Zhang P, Li PL (2005) Predictors of customer perceived software quality. In: Proceedings of the 27th international conference on Software engineering, ICSE ’05. ACM, New York, NY, pp 225–233

  • Nagappan N, Ball T (2005) Use of relative code churn measures to predict system defect density. In: Proceedings of the 27th international conference on software engineering, ICSE ’05. ACM, pp 284–292

  • Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on software engineering, ICSE ’06. ACM, New York, NY, pp 452–461

  • Nguyen THD, Adams B, Hassan AE (2010) Studying the impact of dependency network measures on software quality. In: Proceedings of the 2010 IEEE international conference on software maintenance. IEEE Computer Society, pp 1–10

  • Osei-Bryson KM, Ko M (2004) Exploring the relationship between information technology investments and firm performance using regression splines analysis. Inf Manag 42(1):1–13

  • Posnett D, Filkov V, Devanbu P (2011) Ecological inference in empirical software engineering. In: International conference on automated software engineering, pp 362–371

  • Raftery AE (2007) Bayesian model selection in social research. Sociol Methodol 25:111–163

  • Rahman F, Devanbu P (2013) How, and why, process metrics are better. In: Proceedings of the 2013 international conference on software engineering, ICSE ’13. IEEE Computer Society, Washington, DC, pp 432–441

  • Rice JA (2001) Mathematical statistics and data analysis, 2nd edn. Wadsworth statistics/probability series. Duxbury Press, Pacific Grove, 602 pp

  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  • Shepperd M (1988) A critique of cyclomatic complexity as a software metric. Softw Eng J 3(2):30–36

  • Shihab E, Bird C, Zimmermann T (2012) The effect of branching strategies on software quality. In: ESEM, pp 301–310

  • Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco

  • York TP, Eaves LJ (2001) Common disease analysis using multivariate adaptive regression splines (MARS): genetic analysis workshop 12 simulated sequence data. Genet Epidemiol 21 Suppl 1:S649–S654

  • Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC/FSE ’09. ACM, New York, NY, pp 91–100

  • Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for Eclipse. In: Proceedings of the third international workshop on predictor models in software engineering, PROMISE ’07. IEEE Computer Society, Washington, DC, p 9

Author information

Correspondence to Nicolas Bettenburg.

Additional information

Communicated by: Massimiliano Di Penta and Tao Xie

Appendices

Appendix A

Further descriptions of the individual metrics used in these datasets can be found in the work by Menzies et al. (2011) and in our previous work on social metrics (Bettenburg and Hassan 2010).

1.1 Metrics in the Lucene 2.4 Dataset

  • Dependent Variables: bug

  • Independent Variables: amc, avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, max_cc, mfa, moa, noc, npm

1.2 Metrics in the Xalan 2.6 Dataset

  • Dependent Variables: bug

  • Independent Variables: avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, npm

1.3 Metrics in the CHINA Dataset

  • Dependent Variables: Effort

  • Independent Variables: Input, Output, Enquiry, File, Interface, Changed, PDR_UFP, Resource, Duration

1.4 Metrics in the NASACOC Dataset

  • Dependent Variables: months

  • Independent Variables: pmat, rely, data, cplx, time, stor, pvol, pcap, apex, plex, ltex, site, sced, kloc, effort

1.5 Metrics in the Eclipse 3.0 (Code) Dataset

  • Dependent Variables: post

  • Independent Variables: pre, ImportDeclaration, VG_sum, PrefixExpression, NOM_avg, NOF_avg, TLOC, NullLiteral, NOM_max, BooleanLiteral, SwitchCase, SimpleName, SuperMethodInvocation, LabeledStatement, Block, NORM_Assignment, Initializer, NSM_avg, InfixExpression, Assignment, NumberLiteral, VariableDeclarationFragment, Javadoc, NORM_FieldDeclaration, ForStatement, Modifier, NORM_ArrayCreation, MethodInvocation, VariableDeclarationExpression, ArrayCreation, ExpressionStatement, NORM_PostfixExpression, InstanceofExpression, SwitchStatement, ArrayType, SynchronizedStatement

1.6 Metrics in the Eclipse 3.0 (Social) Dataset

  • Dependent Variables: post

  • Independent Variables: NSCOM, PATCHS, NSOURCE, NPATCH, NTRACE, TRACES, NLINK, NPART, NDEVS, NUSERS, SNACENT, NMSG, REPLY, REPLYE, DLEN, DLENE, INT, INTE, WA, WAE

Appendix B

Table 9 Results: Using different clustering techniques for building local models in comparison to global models and global models with local considerations (denoted with MARS)
Table 10 Performance of local models built through hierarchical clustering, with different values for parameter k (number of clusters)
Table 11 Performance of local models built through k-means clustering, with different values for parameter k (number of clusters)

About this article

Cite this article

Bettenburg, N., Nagappan, M. & Hassan, A.E. Towards improving statistical modeling of software engineering data: think locally, act globally! Empir Software Eng 20, 294–335 (2015). https://doi.org/10.1007/s10664-013-9292-6

  • DOI: https://doi.org/10.1007/s10664-013-9292-6
