Journal of Systems and Software

Volume 123, January 2017, Pages 223-238

Test coverage of impacted code elements for detecting refactoring faults: An exploratory study

https://doi.org/10.1016/j.jss.2016.02.001

Highlights

  • An analysis of test suites’ capacity to reveal refactoring faults.

  • Identification of which code elements are most likely to reveal refactoring faults.

  • Guidelines to help developers decide when it is safe to perform a refactoring.

  • A model for predicting a suite’s likelihood of detecting refactoring faults.

Abstract

Refactoring validation by testing is critical for quality in agile development. However, this activity may be misleading when a test suite is not robust enough to reveal faults. In particular, refactoring faults can be subtle and difficult to detect. Coverage analysis is a standard practice for evaluating the fault detection capability of test suites; however, there is usually a low correlation between coverage and fault detection. In this paper, we present an exploratory study on the use of coverage data of the most impacted code elements to identify shortcomings in a test suite. We consider three real open source projects and their original test suites. The results show that a test suite that does not directly call the refactored method and/or its callers has a higher chance of missing the fault. Additional analysis of branch coverage in test cases shows that the chances of detecting a refactoring fault are higher when branch coverage is high. These results give evidence that a combination of impact analysis and branch coverage could be highly effective in detecting faults introduced by refactoring edits. Furthermore, we propose a statistical model that evidences the correlation between coverage of certain code elements and a suite’s capability of revealing refactoring faults.

Introduction

Refactoring improves quality factors of a program while preserving its external behavior (Fowler et al., 1999; Mens and Tourwé, 2004). Refactoring edits are one of the foundations of agile software development. In the agile community, the refactoring activity is known to keep source code complexity in check, improving non-functional aspects of software such as decreased coupling and increased cohesion (Moser et al., 2008). Fowler et al. (1999) list four advantages that refactoring brings in the context of agile methods: (i) it helps developers program faster; (ii) it improves the design of the software; (iii) it makes software easier to understand; and (iv) it helps developers find bugs.

Recent studies have shown that nearly 30% of the changes performed during software development are likely to be refactorings (Soares et al., 2011). For example, code clones spread throughout several methods of a class can be unified into a single method, with each clone replaced by a call to this new method; this is the Extract Method refactoring (Fowler et al., 1999), one of the most widely applied (Murphy et al., 2006).
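To make the mechanics concrete, the following is a minimal Java sketch of an Extract Method edit; the class and method names are hypothetical and not taken from the study’s subject projects.

    // Before the edit: the two header-printing lines are cloned
    // in both printSummary() and printDetails().
    class ReportPrinterBefore {
        void printSummary() {
            System.out.println("ACME Corp");         // clone
            System.out.println("Quarterly Report");  // clone
            System.out.println("Summary: ...");
        }

        void printDetails() {
            System.out.println("ACME Corp");         // clone
            System.out.println("Quarterly Report");  // clone
            System.out.println("Details: ...");
        }
    }

    // After the edit: the clones are unified into printHeader(),
    // and each original site now calls the new method.
    class ReportPrinterAfter {
        private void printHeader() {
            System.out.println("ACME Corp");
            System.out.println("Quarterly Report");
        }

        void printSummary() {
            printHeader();
            System.out.println("Summary: ...");
        }

        void printDetails() {
            printHeader();
            System.out.println("Details: ...");
        }
    }

Even an edit this small can go wrong when applied by hand, for instance if one clone site is overlooked or the extracted statements are reordered.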

Although there are several automatic refactoring tools in popular IDEs, developers still perform most refactorings manually. Murphy et al. (2006) find that about 90% of refactoring edits are applied manually. Negara et al. (2013) agree, showing that expert developers prefer manual refactoring over automated tools. Usability issues seem to have a negative impact on developers’ confidence in those tools (Lee et al., 2013). Moreover, recent works show that incorrect refactorings – ones that unexpectedly change behavior – are present even in the most used tools (Daniel et al., 2007; Soares et al., 2013).

In this scenario, developers widely use regression test suites to validate manually applied refactoring edits, since these edits are error prone and subtle faults may pass unnoticed. Dig and Johnson (2005) state that nearly 80% of the changes that break client applications are API-level refactoring edits. In addition, 77% of the participants in Kim et al.’s survey of Microsoft developers (Kim et al., 2012) confirm that refactoring may introduce subtle bugs and functionality regressions. A regression test suite, however, may be ineffective at finding refactoring faults. Also, it may be impractical to rerun and analyze the execution results of the whole test suite after each refactoring edit. Techniques that minimize the test suite, while maintaining its effectiveness, are therefore desirable.

Nevertheless, there is little scientific evidence behind this intuition; it is important to identify which impacted methods, if called by the test suite, are most effective in detecting faults that refactoring might introduce. In this paper, we present an exploratory study, performed on three real open-source Java projects, with seeded faults related to two of the most common refactoring edits, Extract Method and Move Method. Using the actual test suites of the selected projects, we measure the direct calls (first-level coverage) to several groups of methods possibly impacted by a refactoring edit, relating these data to the status of the test case – whether or not it detects the seeded fault.

Overall, only 67% of the 270 seeded faults were detected by the projects’ test suites. The lack of test cases calling the method whose body is changed seems to be very relevant – 70% of the unrevealed faults present this property. Similarly, 51% of the undetected faults are missed by test cases that directly call the callers of the changed method. On the other hand, for 78% of the detected faults, the test suite included at least one test case that calls the refactored method directly. Considering callers, this rate was also high (70%). In 62% of these suites there were test cases that cover, at first level, both the refactored method and its callers. The detection results did not present statistical dependence on the type of refactoring (nor on the type of seeded fault). Based on our results, we propose a statistical model that uses first-level coverage data to foresee the chances a test suite has of detecting refactoring faults.
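For illustration only, one plausible shape for such a model is a logistic regression over the first-level coverage counts; the form below is a sketch under that assumption, not the fitted model reported in the paper, and the coefficients are placeholders:

    \Pr(\text{fault detected}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 M + \beta_2 C)}}

where, following the notation used later in the paper, M and C count the test cases directly calling the changed method and its callers, respectively.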

First-level coverage addresses a direct need of the agile developer – identifying the calls that must be made in a test case to improve its chance of detecting refactoring faults. On the other hand, indirect coverage of impacted elements may not be applicable in an agile context, since assessing fault detection capability through indirect calls can be tricky and costly for a developer. For instance, after applying an Extract Method, it seems intuitive that tests directly calling the changed method, its callers, and callees have good chances of detecting any newly-introduced fault. By considering first-level coverage, we also focus on test case expressiveness regarding refactoring edits: when a test case that directly calls a method fails, it may be more helpful for locating the fault.
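The contrast can be illustrated with a minimal JUnit 4 sketch; the class under test and all names are hypothetical. The first test has first-level coverage of the refactored method, while the second reaches it only transitively.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    // Hypothetical class under test: computeTax() is the refactored
    // method; total() is one of its callers.
    class Invoice {
        private final double amount;
        Invoice(double amount) { this.amount = amount; }
        double computeTax() { return amount * 0.15; }
        double total() { return amount + computeTax(); }
    }

    public class InvoiceTest {
        // First-level coverage: the test calls the refactored method
        // directly, so a fault in it is exposed without interference
        // and is easier to locate when the test fails.
        @Test
        public void directCallToRefactoredMethod() {
            assertEquals(15.0, new Invoice(100.0).computeTax(), 0.001);
        }

        // Indirect coverage only: computeTax() is reached through
        // total(); the intermediate logic may mask the fault, and a
        // failure here is harder to trace back to the edit.
        @Test
        public void indirectCallThroughCaller() {
            assertEquals(115.0, new Invoice(100.0).total(), 0.001);
        }
    }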

Considering that several faults were missed even by test cases with first-level coverage of impacted methods, we additionally analyzed test cases that exercise the modified method and its callers. If at least one test case in the suite called the changed method, and the branch coverage of this method was greater than 75%, 91% of the faults were detected. If callers of the changed method were directly accessed, the faults were detected in 88% of the cases with high branch coverage. For suites with low branch coverage (less than 25%), detection dropped to 66% and 62%, respectively. These results provide a good case for tests with direct calls combined with high branch coverage.
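A small, hypothetical example shows why a direct call alone may not be enough when branch coverage is low. Suppose the refactoring seeded a fault into one branch of the changed method:

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    // Hypothetical changed method with two branches; assume the
    // refactoring introduced a fault in the discount branch.
    class PriceCalculator {
        double finalPrice(double amount, boolean hasDiscount) {
            if (hasDiscount) {
                return amount * 0.80;  // faulty: should be 0.90
            }
            return amount;             // unaffected branch
        }
    }

    public class PriceCalculatorTest {
        // Direct call, but only 50% branch coverage of finalPrice():
        // the faulty branch is never executed, so the fault is missed.
        @Test
        public void lowBranchCoverageMissesTheFault() {
            assertEquals(100.0,
                new PriceCalculator().finalPrice(100.0, false), 0.001);
        }

        // Covering the other branch as well raises branch coverage to
        // 100% and exposes the fault (expected 90.0, actual 80.0).
        @Test
        public void highBranchCoverageExposesTheFault() {
            assertEquals(90.0,
                new PriceCalculator().finalPrice(100.0, true), 0.001);
        }
    }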

As another additional study, we explored the relationship between first-level coverage of impacted elements and binding issues with refactored variables within class hierarchies. Previous research (Soares et al., 2011) reported that several subtle faults in manual and automated refactoring are due to homonymous variables or methods being confused by refactored statements, so we extended our investigation to relate test cases to this kind of refactoring fault. Similarly to the other studies, this investigation showed that the better a test suite covers the refactored method and its callers, the better its chances of detecting binding-related faults introduced when refactoring.

We published a preliminary version of this study (Alves et al., 2014a) in which a single refactoring type and a single refactoring fault were analyzed. The current paper extends our previous study by investigating new refactoring types and new refactoring faults, by adding statistical validation to the conclusions, and by proposing new artifacts to help evaluate a test suite with respect to its detection of refactoring faults.

Section 2 presents a motivating example for the problem of test cases that miss refactoring faults. Next, we present the setup and research questions investigated by the experimental studies (Section 3); Section 4 then includes the results and discussion for the main experimental study. In Section 5, we extend the study to relate its results to branch coverage within the exercised methods, while Section 6 presents an exploratory study on binding-related refactoring faults. Section 7 discusses threats to validity. The last two sections cover related work and concluding remarks, respectively.

Section snippets

Motivating example

In agile methodologies, even simple solutions may need improvement when requirement changes must be incorporated into the code base, so refactoring is frequent. Although tool automation of refactoring is common, here we focus on manual refactoring.

Opportunities for code improvement often involve code duplication, and its minimization or elimination is often desirable. For this task, the Extract Method refactoring (Fowler et al., 1999) encompasses small changes that group together multiple code

Research questions

We conducted three exploratory studies with real open source programs to address the following research questions. We discuss RQ1–RQ4 in the first study (Section 4), while we discuss RQ5 and RQ6 in Sections 5 and 6, respectively.

  • RQ1: Is the type of refactoring a factor that influences a suite’s capacity to detect refactoring faults?

  • RQ2: Is the type of refactoring fault a factor that influences a suite’s capacity to detect refactoring faults?

  • RQ3: Are direct calls to the most commonly

Study on test coverage for impacted program elements

The discussion in this section considers research questions RQ1–RQ4, analyzing the influence of refactoring or fault type on suite effectiveness, and the statistical model that approximates the chance of detecting a fault by means of the first-level coverage of test cases.

Variation: branch coverage analysis

As mentioned before, not all test suites that include test cases that cover M and/or C detect the fault. For instance, for one XML-Security faulty version, although its suite has 36 test cases that call M and C (37% of the suite), they do not detect the fault. In order to provide a more thorough analysis, and discuss RQ5, we extended the first study by selecting the subset of faults for which there was at least one test case directly calling the changed method and/or its callers (M > 0 and/or C

Study on binding-related faults

Reference binding problems are among the most common faults developers face when refactoring. Since refactoring edits usually involve moving, renaming, and replacing source code elements, a developer may have problems with method and variable references and their overriding/overloading constraints. Recent studies emphasize how easy it is to introduce a binding problem when refactoring, even when using well-known automatic refactoring tools (Soares, Catao, Varjao, Aguiar, Gheyi,
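As an illustration of such a fault, consider the following hypothetical Java sketch, in which a Move Method edit silently re-binds a field reference to a homonymous field elsewhere in the hierarchy:

    // Before the edit: 'rate' in interest() binds to the field
    // declared in SavingsAccount, which shadows Account.rate.
    class Account {
        protected double rate = 0.05;
    }

    class SavingsAccount extends Account {
        private double rate = 0.02;   // homonymous field

        double interest(double balance) {
            return balance * rate;    // binds to 0.02
        }
    }

    // After a faulty Move Method edit that pulls interest() up into
    // the superclass, the same expression re-binds to Account.rate.
    // The code still compiles, yet its behavior silently changes.
    class AccountAfterMove {
        protected double rate = 0.05;

        double interest(double balance) {
            return balance * rate;    // now binds to 0.05
        }
    }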

Threats to validity

In terms of internal validity, the accuracy of EclEmma and the coded R functions in calculating coverage and statistics might directly affect our study results. However, both tools have been widely used in practice and applied by researchers, which attests to the reliability of their results. Moreover, we manually validated EclEmma’s and our regression functions’ results for a limited sample. Likewise, we measured only first-level coverage for the first study, and branch coverage as an extension,

Testing coverage and suite effectiveness

Long-established texts on software testing (e.g., Perry, 2006) recommend the use of coverage, in general, to gain confidence that a test suite is effective at detecting faults. Extensive research on this topic, however, presents divergent conclusions. Frankl and Iakounenko (1998) report that the chance of detecting a fault increases sharply at very high coverage rates. Wong et al. (1994) find that suite effectiveness is highly correlated with block coverage, while Cai and Lyu (2005) observe a moderate

Concluding remarks

This article reports studies on the relationship between program methods impacted by the Extract Method or Move Method refactoring and first-level test coverage of impacted elements. Three real open-source Java systems were subjected to seeded refactoring faults. Using the actual test suites employed by the selected projects, we measured the coverage of several groups of methods impacted by a refactoring edit, relating these data to the status of the test case (whether or not it detects the

Acknowledgments

This work was partially supported by the National Institute of Science and Technology for Software Engineering, funded by CNPq/Brasil, Grant 573964/2008-4.


References (45)

  • Dig, D. et al.

    The role of refactorings in API evolution

    Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM’05)

    (2005)
  • Ge, X. et al.

    Manual refactoring changes with automated refactoring validation

    Proceedings of the 36th International Conference on Software Engineering

    (2014)
  • Alves, E.

    Investigating Test Case Prioritization Techniques for Refactoring Activities Validation: Evaluating Suite Characteristics

    Technical Report SPLab-2012-002

    (2012)
  • Alves, E.

    Investigating Test Case Prioritization Techniques for Refactoring Activities Validation: Evaluating the Behavior

    Technical Report SPLab-2012-001

    (2012)
  • Alves, E.L. et al.

    Test coverage and impact analysis for detecting refactoring faults: a study on the Extract Method refactoring

    Proceedings of the 30th ACM/SIGAPP Symposium on Applied Computing - SAC 2014 (To appear)

    (2014)
  • Alves, E.L. et al.

    A refactoring-based approach for test case selection and prioritization

    Proceedings of the 2013 8th International Workshop on Automation of Software Test (AST)

    (2013)
  • Alves, E.L. et al.

    RefDistiller: a refactoring aware code review tool for inspecting manual refactoring edits

    Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, Research Demonstration Track (To appear)

    (2014)
  • Andrews, J.H. et al.

    Is mutation an appropriate tool for testing experiments?

    Proceedings of the 27th International Conference on Software Engineering (ICSE 2005)

    (2005)
  • Andrews, J.H. et al.

    General test result checking with log file analysis

    IEEE Trans. Softw. Eng.

    (2003)
  • Bavota, G. et al.

    When does a refactoring induce bugs? An empirical study

    Proceedings of the IEEE 12th International Working Conference on Source Code Analysis and Manipulation (SCAM 2012)

    (2012)
  • Briand, L. et al.

    Using simulation for assessing the real impact of test coverage on defect coverage

    Proceedings of the 10th International Symposium on Software Reliability Engineering

    (1999)
  • Briand, L.C. et al.

    Using simulation to empirically investigate test coverage criteria based on statechart

    Proceedings of the 26th International Conference on Software Engineering

    (2004)
  • Cai, X. et al.

    The effect of code coverage on fault detection under different testing profiles

    ACM SIGSOFT Software Engineering Notes

    (2005)
  • Cornélio, M. et al.

    Sound refactorings

    Science of Computer Programming

    (2010)
  • Daniel, B. et al.

    Automated testing of refactoring engines

    Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering

    (2007)
  • Field, A., Miles, J., Field, Z., 2012. Discovering Statistics Using...
  • Fowler, M., Beck, K., Brant, J., Opdyke, W., Roberts, D., 1999. Refactoring: Improving the Design of Existing...
  • Frankl, P.G. et al.

    Further empirical studies of test effectiveness

    ACM SIGSOFT Software Engineering Notes

    (1998)
  • Frankl, P.G. et al.

    An experimental comparison of the effectiveness of the all-uses and all-edges adequacy criteria

    Proceedings of the Symposium on Testing, Analysis, and Verification

    (1991)
  • Frankl, P.G. et al.

    All-uses vs mutation testing: an experimental comparison of effectiveness

    J. Syst. Softw.

    (1997)
  • Gargantini, A. et al.

    Extending coverage criteria by evaluating their robustness to code structure changes

    Testing Software and Systems

    (2012)
  • Gligoric, M. et al.

    Comparing non-adequate test suites using coverage criteria

    Proceedings of the 2013 International Symposium on Software Testing and Analysis

    (2013)

Everton L.G. Alves is a doctoral candidate in the Computing and Systems Department at the Federal University of Campina Grande (UFCG), Brazil. He received a Master’s degree in Computer Science from the Federal University of Campina Grande, Brazil, in 2011 and a Bachelor’s degree in Computer Science in 2009, from the same university. His main interests include regression testing, software maintenance, model-driven development and testing, real-time systems, and integration.

Tiago Massoni is a professor in the Computing and Systems Department at the Federal University of Campina Grande. His research interests include software design and evolution, and formal methods. In addition to his academic posts, he has also worked as a programmer at IBM in California. He holds a Doctoral degree in Computer Science from the Federal University of Pernambuco, and is a member of the ACM.

Patrícia Duarte de Lima Machado has been a Professor in the Computing and Systems Department at the Federal University of Campina Grande (UFCG), Brazil, since 1995. She received her PhD in Computer Science from the University of Edinburgh, UK, in 2001, a Master’s degree in Computer Science from the Federal University of Pernambuco, Brazil, in 1994, and a Bachelor’s degree in Computer Science from the Federal University of Paraiba, Brazil, in 1992. Her research interests include software testing, formal methods, mobile computing, component-based software development, and model-driven development. Since 1998, she has produced a number of contributions in the area of software testing, including research projects, publications, tools, supervision, national/international cooperation, and teaching.
