ABSTRACT
Background: Statistical concepts and techniques are often applied incorrectly, even in mature disciplines such as medicine or psychology. Surprisingly, very few works study statistical problems in software engineering (SE). Aim: Assess the existence of statistical errors in SE experiments. Method: Compile the most common statistical errors in experimental disciplines. Survey experiments published in ICSE to assess whether errors occur in high-quality SE publications. Results: The same errors identified in other disciplines were found in ICSE experiments, where 30% of the reviewed papers included several error types, such as: a) missing statistical hypotheses, b) missing sample size calculation, c) failure to assess statistical test assumptions, and d) uncorrected multiple testing. This rather large error rate is greater for research papers where experiments are confined to the validation section. The origin of the errors can be traced back to: a) researchers not having sufficient statistical training, and b) a profusion of exploratory research. Conclusions: This paper provides preliminary evidence that SE research suffers from the same statistical problems as other experimental disciplines. However, the SE community appears to be unaware of any shortcomings in its experiments, whereas other disciplines work hard to avoid these threats. Further research is necessary to find the underlying causes and set up corrective measures, but some potentially effective actions are a priori easy to implement: a) improve the statistical training of SE researchers, and b) enforce quality assessment and reporting guidelines in SE publications.
Statistical errors in software engineering experiments: a preliminary literature review