
RETRACTED ARTICLE: The smell of fear: on the relation between test smells and flaky tests

Published in Empirical Software Engineering.

This article was retracted on 26 March 2020


Abstract

Regression testing is the activity developers perform to check that new modifications have not introduced bugs. A crucial requirement for regression testing to be effective is that test cases are deterministic. Unfortunately, this is not always the case: some tests suffer from so-called flakiness, i.e., they exhibit both a passing and a failing outcome on the same code. Flaky tests are widely recognized as a serious issue, since they hide real bugs and increase software inspection costs. While previous research has focused on understanding the root causes of test flakiness and devising techniques that automatically fix them, in this paper we explore an orthogonal perspective: the relation between flaky tests and test smells, i.e., suboptimal development choices applied when developing tests. Relying on (1) an analysis of the state of the art and (2) interviews with industrial developers, we first identify five flakiness-inducing test smell types, namely Resource Optimism, Indirect Testing, Test Run War, Fire and Forget, and Conditional Test Logic, and automate their detection. Then, we perform a large-scale empirical study on 19,532 JUnit test methods of 18 software systems, discovering that the five considered test smells causally co-occur with flaky tests in 75% of the cases. Furthermore, we evaluate the effect of refactoring, showing that it not only removes design flaws but also fixes all of the 75% of flaky tests causally co-occurring with test smells.
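The paper's listings are not reproduced in this preview. As a minimal, hypothetical sketch (the class and method names below are ours, not taken from the paper), the following contrasts a check afflicted by Conditional Test Logic, one of the five flakiness-inducing smells named above, with a refactored, deterministic counterpart:

```java
import java.util.List;

// Hypothetical illustration of the "Conditional Test Logic" smell;
// the names are ours, not the paper's listings.
public class ConditionalTestLogicDemo {

    // Smelly: the real check only runs when the guard holds, so on runs
    // where the fixture comes up empty the test vacuously "passes",
    // hiding real failures and making the outcome environment-dependent.
    static boolean smellyCheck(List<Integer> values) {
        if (!values.isEmpty()) {          // Conditional Test Logic
            return values.get(0) == 42;
        }
        return true;                      // vacuous pass
    }

    // Refactored: the precondition is folded into the assertion itself,
    // so the same code always yields the same, meaningful outcome.
    static boolean refactoredCheck(List<Integer> values) {
        return !values.isEmpty() && values.get(0) == 42;
    }

    public static void main(String[] args) {
        System.out.println(smellyCheck(List.of()));      // prints: true
        System.out.println(refactoredCheck(List.of()));  // prints: false
        System.out.println(smellyCheck(List.of(42)));    // prints: true
    }
}
```

In an actual JUnit test the smelly variant corresponds to wrapping an assertion inside an `if`; the refactoring makes the expectation unconditional, so the test either fails loudly or passes for the right reason.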


Change history

  • 26 March 2020

The authors have retracted this article (Palomba and Zaidman 2019). Upon re-review of the experiment presented in the article, the authors identified errors in the flaky test detection strategy. After careful analysis of the replication study, the results presented in this article are rendered unreliable. All authors agree to this retraction.

Notes

  1. Available here: https://github.com/apache

  2. https://fbinfer.com

  3. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Condition.html

  4. http://www.eclemma.org/jacoco/


Acknowledgments

We would like to thank the 10 developers who participated in the interviews and the 2 external inspectors who helped us categorize the test smells. We thank the anonymous reviewers, whose comments and feedback significantly improved this paper. This work was partially sponsored by the EU Horizon 2020 ICT-10-2016-RIA “STAMP” project (No. 731529), the NWO “TestRoots” project (No. 639.022.314), and the SNF “Data-Driven Code Review” project (No. PP00P2_170529).

Author information

Corresponding author: Fabio Palomba.

Additional information

Communicated by: Lu Zhang, Thomas Zimmermann, Xin Peng, and Hong Mei

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Palomba, F., Zaidman, A. RETRACTED ARTICLE: The smell of fear: on the relation between test smells and flaky tests. Empir Software Eng 24, 2907–2946 (2019). https://doi.org/10.1007/s10664-019-09683-z
