Abstract
Empirical software engineering studies apply methods such as linear regression, statistical tests, or correlation analysis to better understand software engineering scenarios. Assuring the validity of such methods and of the corresponding results is challenging but critical. This is also reflected by validity-related quality criteria that are part of the reviewing process for the corresponding research results. However, such criteria are often hard to define operationally and thus hard for reviewers to judge. In this paper, we describe a new strategy to define and communicate the validity of methods and results. We conceptually decompose a study into an empirical scenario, a method used, and the produced results. Validity can only be described as the relationship between these three parts. To make the empirical scenario fully operational, we convert informal assumptions about it into executable simulation code that leverages artificial data to replace (or complement) the real data. We can then run the method on the artificial data and examine the impact of our assumptions on the quality of the results. This may operationally i) support the validity of a method for a valid result, ii) threaten the validity of a method for an invalid result if the assumptions are controversial, or iii) invalidate a method for an invalid result if the assumptions are plausible. We encourage researchers to submit simulations as additional artifacts to the reviewing process to make such statements explicit. Rating whether a simulated scenario is plausible or controversial is subjective and may benefit from involving a reviewer. We show that existing empirical software engineering studies can benefit from such additional validation artifacts.
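The strategy described in the abstract can be illustrated with a minimal sketch (not taken from the paper; the scenario, variable names, and parameters are hypothetical): an assumed empirical scenario is encoded as simulation code, artificial data with a known ground-truth effect is generated, a method (here, ordinary least squares) is run on that data, and the estimate is checked against the encoded truth.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Assumed scenario (hypothetical): developer experience reduces
# defect density with a true effect of -0.5; noise is Gaussian.
true_effect = -0.5
experience = rng.normal(0.0, 1.0, size=n)
defects = 2.0 + true_effect * experience + rng.normal(0.0, 1.0, size=n)

# Method under test: ordinary least squares with an intercept term.
X = np.column_stack([np.ones(n), experience])
coef, *_ = np.linalg.lstsq(X, defects, rcond=None)
estimated_effect = coef[1]

# If the method is valid for this simulated scenario, the estimate
# should be close to the ground truth encoded above.
print(abs(estimated_effect - true_effect) < 0.05)
```

Varying the simulated assumptions (e.g., adding a confounder or selection bias) and re-running the method is then what makes validity statements operational rather than informal.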
Data Availability
All artifacts and data sets are provided online on GitHub (https://github.com/topleet/MSR2022).
Notes
In R, random number generators are vectorized and start with the letter r followed by an abbreviation for the distribution family (we will see rbinom, rnorm and rpois).
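For readers more familiar with Python, NumPy's random generator offers vectorized counterparts to these R sampling functions (binomial, normal, and Poisson draws); a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# Vectorized draws, analogous to R's binomial, normal and Poisson samplers:
flips = rng.binomial(n=1, p=0.5, size=5)   # five coin flips
noise = rng.normal(0.0, 1.0, size=5)       # five standard-normal draws
counts = rng.poisson(lam=3.0, size=5)      # five Poisson counts, mean 3

print(len(flips), len(noise), len(counts))
```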
All our reproductions of other papers' analyses are available online to guarantee the reproducibility of this paper.
Acknowledgements
We want to acknowledge the original work of the authors of the studies used in our illustrations. All studies were selected because of their originality. However, we believe that this meta-validation of simulation-based testing would not be credible on unpublished examples.
Ethics declarations
Conflicts of interest
The authors have no conflict of interest.
Additional information
Communicated by: Nicole Novielli, Shane McIntosh, David Lo.
This article belongs to the Topical Collection: Mining Software Repositories (MSR)
About this article
Cite this article
Härtel, J., Lämmel, R. Operationalizing validity of empirical software engineering studies. Empir Software Eng 28, 153 (2023). https://doi.org/10.1007/s10664-023-10370-3