Abstract
Regression testing is the activity developers perform to verify that new modifications have not introduced bugs. A crucial requirement for effective regression testing is that test cases are deterministic. Unfortunately, this is not always the case: some tests suffer from so-called flakiness, i.e., they exhibit both a passing and a failing outcome on the same code. Flaky tests are widely recognized as a serious issue, since they hide real bugs and increase software inspection costs. While previous research has focused on understanding the root causes of test flakiness and on devising techniques that automatically fix flaky tests, in this paper we explore an orthogonal perspective: the relation between flaky tests and test smells, i.e., suboptimal development choices applied when developing tests. Relying on (1) an analysis of the state of the art and (2) interviews with industrial developers, we first identify five flakiness-inducing test smell types, namely Resource Optimism, Indirect Testing, Test Run War, Fire and Forget, and Conditional Test Logic, and automate their detection. We then perform a large-scale empirical study on 19,532 JUnit test methods from 18 software systems, discovering that the five considered test smells causally co-occur with flaky tests in 75% of the cases. Furthermore, we evaluate the effect of refactoring, showing that it not only removes the design flaws, but also fixes all of the flaky tests that causally co-occur with test smells.
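To make one of the five smell types concrete, the sketch below illustrates Resource Optimism: a test that optimistically assumes an external resource exists, so its outcome depends on the environment rather than on the code under test. This is a minimal illustration in plain Java, without the JUnit harness; the class, method, and file names are hypothetical and not taken from the article.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResourceOptimismSketch {

    // Flaky variant: optimistically assumes an externally managed file
    // already exists. The same code can pass on one machine and fail on
    // another, depending on whether the file happens to be there.
    static boolean flakyCheck() {
        Path external = Paths.get("/tmp/config.properties"); // hypothetical path
        return Files.exists(external); // outcome depends on the environment
    }

    // Refactored variant: the test creates and owns the resource it needs,
    // so the outcome is deterministic regardless of the environment.
    static boolean deterministicCheck() {
        try {
            Path owned = Files.createTempFile("config", ".properties");
            try {
                return Files.exists(owned); // always true: the fixture created it
            } finally {
                Files.deleteIfExists(owned); // clean up to avoid Test Run War
            }
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(deterministicCheck()); // prints "true"
    }
}
```

The refactoring direction follows the general advice in the test smell literature (e.g., van Deursen et al.'s "Refactoring test code"): move environment assumptions into a fixture the test controls.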
Change history
26 March 2020
The authors have retracted this article Palomba and Zaidman (2019). Upon re-review of the experiment presented in the article, the authors identified errors in the flaky test detection strategy. After careful analysis of the replication study, the results presented in this article are rendered unreliable. All authors agree to this retraction.
Acknowledgments
We would like to thank the 10 developers who participated in the interviews and the 2 external inspectors who helped us categorize the test smells. We thank the anonymous reviewers, whose comments and feedback significantly improved this paper. This work was partially sponsored by the EU Horizon 2020 ICT-10-2016-RIA “STAMP” project (No. 731529), the NWO “TestRoots” project (No. 639.022.314), and the SNF “Data-Driven Code Review” project (No. PP00P2_170529).
Additional information
Communicated by: Lu Zhang, Thomas Zimmermann, Xin Peng, and Hong Mei
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Palomba, F., Zaidman, A. RETRACTED ARTICLE: The smell of fear: on the relation between test smells and flaky tests. Empir Software Eng 24, 2907–2946 (2019). https://doi.org/10.1007/s10664-019-09683-z