An investigation of the fault-proneness of clone evolutionary patterns

Barbour, Liliane; An, Le; Khomh, Foutse; Zou, Ying; Wang, Shaohua

doi:10.1007/s11219-017-9375-5

Published: 13 June 2017

Volume 26, pages 1187–1222, (2018)

Software Quality Journal

Liliane Barbour¹,
Le An²,
Foutse Khomh ORCID: orcid.org/0000-0002-5704-4173²,
Ying Zou¹ &
…
Shaohua Wang³

Abstract

Two identical or similar code fragments form a clone pair. Previous studies have identified cloning as a risky practice. Therefore, a developer needs to be aware of any clone pairs in order to properly propagate any changes between clones. A clone pair may experience many changes during the creation and maintenance of a software system. A change can either maintain or remove the similarity between clones in a clone pair. If a change maintains the similarity between clones, the clone pair is left in a consistent state. When a change makes the clones no longer similar, the clone pair is left in an inconsistent state. The set of states and changes experienced by clone pairs over time form an evolution history known as a clone genealogy. In this paper, we examine clone genealogies to identify fault-prone “patterns” of states and changes. We explore the use of clone genealogy information in fault prediction. We conduct a quasi-experiment with four long-lived software systems (i.e., Apache Ant, ArgoUML, JEdit, Maven) and identify clones using the NiCad and iClones clone detection tools. Overall, we find that the size of the clone can impact the fault-proneness of a clone pair. However, there is no clear impact of the time interval between changes to a clone pair on the fault-proneness of the clone pair. We also discover that adding clone genealogy information can increase the explanatory power of fault prediction models.

1 Introduction

Cloning occurs when two code segments are highly similar or identical to each other. Each code segment is known as a clone, and the two clones form a clone pair. A clone group is a set of code segments, where any two of them form a clone pair. Cloning is a common practice in software development. Some clones are introduced intentionally through the copy and paste actions of developers, while others are introduced accidentally (Roy and Cordy 2007).

Once created, clones evolve as they are modified during both the development and maintenance phases of software systems. A clone pair is in a consistent state if the clones are identical or similar. A clone pair is in an inconsistent state if the clones are no longer similar. At any point in time, a clone pair is in either a consistent state or an inconsistent state. Clones that become inconsistent can later be re-synchronized, and consistent clones can diverge. The set of states and changes between the states experienced by a clone pair across versions of a system is known as a “clone pair genealogy.” Furthermore, a clone genealogy can exhibit a specific “clone evolutionary pattern,” which defines a specific ordering of states and changes that occurs frequently in clone genealogies over the lifetime of a software system. For example, a consistent clone pair that transitions to an inconsistent state and then re-synchronizes to a consistent state exhibits late propagation (Thummalapenta et al. 2010). Clone pairs that transition to inconsistent states during their evolution are difficult to monitor using clone detection tools. Hence, they are more at risk of faults due to a lack of propagation of changes.

Previous studies (Thummalapenta et al. 2010; Aversano et al. 2007; Barbour et al. 2011; 2013) on clone genealogies have defined specific clone evolutionary patterns and studied their relationship with faults. Specific clone evolutionary patterns have been identified as fault-prone, but without detailed information. More specifically, a genealogy only provides details about the past and cannot inform a developer about whether the current state or the next change will be risky. Moreover, the history of clone groups has not been considered when predicting faults in clones. In our work, we examine clone evolutionary patterns and changes within clone genealogies, study their relationship with faults, and strive to provide insights on the characteristics of fault-prone clones. For each clone group, we analyze all the clone pairs within the clone group. We chose to study clone pairs instead of clone groups since clone pairs within the same clone group are not equally risky. Additionally, we investigate whether metrics collected from clone pair genealogies can improve the performance of prediction models when identifying clone pairs that are at a higher risk of faults.

We investigate the clone genealogies of four open-source software systems (i.e., Apache Ant, ArgoUML, JEdit, Maven). Using the cloning information from each system, we address the following research questions:

RQ1: Which clone evolutionary patterns and clone changes are most at risk of faults? We examine if a specific evolutionary pattern or change is found to be more prone to faults. Clone pairs exhibiting a fault-prone pattern or experiencing a fault-prone change should be flagged for future monitoring.
RQ2: Does the size of a clone or the time interval between changes affect the fault-proneness of a clone pair? We expand on the previous question to determine if the size of the clone (in LOC) or the time interval between consecutive changes to a clone pair can be used to highlight fault-prone clone pairs. We suggest that these characteristics may influence the fault-proneness of clone evolutionary patterns and changes. Our results can be used to refine the identification of clone pairs at risk of faults. This helps determine where testing and review efforts should be focused.
RQ3: Can we predict faults in software clones using clone genealogy information? One snapshot of a software system provides limited information that can be used to predict faults in a clone pair. However, genealogy information about a clone pair takes more effort to collect and track. We propose metrics to capture information about the genealogy of a clone pair and we use statistical models to establish and inspect dependencies between the metrics and faults in clone pairs.

We provide three contributions in this paper:

We give a formal definition of clone pair genealogies and clone evolutionary patterns.
We identify characteristics of fault-prone clone pair states and changes in clone pair genealogies that can be used to locate the most fault-prone clones.
We show that clone genealogy information can increase the explanatory power of fault prediction models. In particular, the number of previous faults in a clone pair can help predict future faults in the clone fragments.

Organization

Section 2 summarizes related studies on clone genealogies and prediction of faults. Section 3 discusses the building blocks of genealogies and clone evolutionary patterns. Section 4 outlines our study approach. Section 5 summarizes the study results. Section 6 reports on a qualitative evaluation of the genealogies and discusses the results of the study. Section 7 discusses the threats to the validity of our study. Section 8 concludes the paper and outlines avenues for future work.

2 Related work

2.1 Clone genealogies

Kim et al. (2005) performed the first study of clone evolution. They analyzed groups of clone snippets, known as clone classes, and described the types of changes that can be experienced by a clone class. In our work, we examine clones at the clone pair level to identify which clone pairs are most at risk of faults. A clone class with dozens of members may only contain a few risky clone pairs. Kim et al. also performed a case study using CCFinder and found that clones are very volatile. Half the clones became inconsistent within eight check-ins. In our work, we continue to examine clones after they become inconsistent to examine their fault-proneness.

2.2 Bug-proneness of code clone

Rahman et al. (2012) explored the relationship between defect-proneness and code clones by analyzing four open-source C projects. They did not observe a strong correlation between bugs and code clones, nor a correlation between bug-proneness and cloned code size. Their findings challenge Fowler et al.’s (2009) claim that code clones are “bad code smells.” However, their study considered neither the evolution of code clones nor Java, a more popular programming language. Juergens et al. (2009) conducted a case study on open-source and commercial systems to investigate whether inconsistent changes to code clones can lead to defects. They observed that nearly half of the unintentionally inconsistent changes caused defects. Although this work studies only one type of clone evolution, it motivates us to investigate the relationship between bug-proneness and other types of clone evolution.

2.3 Analysis of clone genealogies

Krinke (2007) examined inconsistent and consistent changes to clones in 200 weekly snapshots of five open-source systems. The study examined identical clones. About half of the changes to identical clones were consistent. The results may have been affected by the time interval of 1 week between snapshots. Changes to the clones, including inconsistent changes, may have occurred between snapshots. In our work, we examine the relationship between delay since the last change and faults.

Krinke performed a second study on the stability of cloned code, which was repeated and extended by Göde and Harder (2011). Göde and Harder performed clone detection using a token-based clone detection tool, but used a time interval of one commit between snapshots. Overall, their findings agreed with Krinke's that cloned code is much more stable than non-cloned code. They also experimented with the parameters of their clone detection tool and showed that the results are impacted by the choice of parameters. Because of this result, we use two different clone detection tools in our study, each implementing a different clone detection technique. These two clone detection tools (i.e., NiCad and iClones) were found to achieve higher precision and recall by Svajlenko and Roy (2014). For each clone detection tool, we use the same parameters as in the study of Svajlenko and Roy (2014), which compared 11 different clone detection tools.

Göde and Koschke (2011) performed a different study on code clones and found that over half of the clones in three systems were stagnant. In other words, once they were formed, they were never modified. About 12% of the clones experienced a change. In a similar study (Göde and Koschke 2011), Göde et al. found that 87.8% of clones are never changed or only changed once. They suggest that these clones are irrelevant to developers. In our study, we consider all the changes that occurred during the evolutionary history of the clones and identify the most fault-prone ones. We also consider metrics that contain historical information taken from clone genealogies to identify fault-prone clone pairs.

Göde and Harder (2011) performed a study examining consecutive change pairs within clone genealogies. They defined four different types of consecutive changes. They examined the relationship between the change pair type, the delay between changes, the change author, and the location of the clones in the project structure and whether an inconsistent change was intentional. Overall, they found that two consecutive changes are the most common change pair, and that few inconsistent changes were accidental.

Thummalapenta et al. (2010) performed a study that looked at four different types of clone evolutionary patterns within clone classes. They classified their clone classes into consistent evolution, independent, delayed propagation, and late propagation evolutionary patterns. They found that the first two patterns were the most common types. They concluded that each pattern experienced a different proportion of faults within a software system. In our work, we examine clones in more detail and define further clone evolutionary patterns.

Barbour et al. (2011, 2013) investigated faults in eight different types of late propagation and found that late propagation clones are more fault-prone when (i) clones in the pair undergo a diverging modification followed by a reconciling change that modifies both clones in the clone pair or (ii) clones in the pair experience diverging changes, followed by a reconciling change that modifies only the diverging clone in the clone pair. They also reported that the size of the clones experiencing late propagation has an effect on the fault-proneness of specific types of late propagation genealogies. Recently, Mondal et al. (2016) investigated the frequency of late propagation for different types of clones (i.e., type 1, type 2, and type 3) using the NiCad clone detection tool. They found that late propagation occurs more frequently in type 3 clones. They also observed that late propagations of type 3 clones are more fault-prone than late propagations of either type 1 or type 2 clones. In this paper, we build on these previous works to analyze the fault-proneness of all types of clone evolutionary patterns (i.e., not only late propagation).

Xie et al. (2013) investigated two evolutionary phenomena on clones: the mutation of the type of a clone during the evolution of a system, and the migration of clone segments across the repositories of a software system. They observed that clone migration and clone mutation occur frequently in clone genealogies, and that increasing the distance between code segments in a clone group during the evolution of the system increases the risk for faults. They also found that mutating clones to type 2 or type 3 increases the risk for faults. In a follow-up study (Xie et al. 2014), they examined the fault-proneness of clone migration in clone genealogies and found that migrated clone segments, clone groups, and clone genealogies are not equally fault-prone. They also found that when a clone mutation occurs during a clone migration, the risk for faults in the migrated clone is increased. The migration of a clone that was not changed for a long period of time is also reported to be risky.

2.4 Statistical explanatory models

Several studies have investigated the use of process and product metrics to build fault prediction and explanatory models.

Khoshgoftaar et al. (1996) analyzed two consecutive releases of a large software system used in telecommunications and showed that the number of past added/removed lines of code is a good predictor of future faults at the module level. Bernstein et al. (2007) used the number of revisions and corrections on a file, recorded in a given amount of time, to predict the location of faults. Graves et al. (2000) investigated different predictors of faults using statistical models and found that the sum of contributions from all changes to a module is the best predictor of faults in a module. Nagappan and Ball (2005) analyzed the relation between code churn (i.e., the number of lines added, modified, or deleted in a file) and fault density in Windows Server 2003 and concluded that relative code churn measures are better predictors of fault density than absolute code churn measures. Hassan (2009) introduced the notion of entropy of changes to capture the complexity of a source code change process. He performed a case study using six open-source software systems and found that the entropy of changes is a better predictor of faults than traditional predictors like the number of changes or the number of previous faults.

El Emam et al. (2001) combined Chidamber and Kemerer metrics (Chidamber and Kemerer 1994) with Briand et al.’s coupling metrics (Briand et al. 1999) to predict faults in a large commercial Java system. Nagappan et al. (2006) investigated the use of source code metrics to predict post-release faults at the module level using five Microsoft software systems. They found that complexity metrics can successfully predict post-release faults, but that the set of best predictors was system-dependent. Zimmermann et al. (2007) also used source code metrics to predict faults in Eclipse. Arisholm and Briand (2006) proposed the use of code quality, class structure, changes in class structure, and the history of class-level changes and faults to predict faulty classes.

Moser et al. (2008) performed a comparative analysis of the predictive power of process and source code metrics for fault prediction and found that process metrics are better predictors of faults than product metrics. In this work, we examine whether clone genealogy metrics can be used to increase the performance of fault prediction models built using product and process metrics.

Kononenko et al. (2015) investigated the relationships between the quality of code review and technical, personal, and participation factors in the code review process. They found that both personal and participation factors can influence the quality of code review. McIntosh et al. (2015) built explanatory models to explore the impact of the code review process on software quality. They found a significant correlation between code review quality and factors related to code review coverage, participation, and reviewers’ expertise. In this paper, we build explanatory models to investigate the relationship between clone genealogy metrics and fault-proneness.

3 Clone evolutionary patterns

3.1 States and transitions of clone pairs

A clone pair can either be in a consistent state ($C_s$) or an inconsistent state ($I_s$). We define the set of states of a clone pair as $S = \{C_s, I_s\}$. The two states are shown as circles in Fig. 1. A clone pair is in a consistent state if the code segments in the pair are identical or similar (i.e., have a cloned-relationship). A clone pair is in an inconsistent state if the code segments in the pair are no longer similar (i.e., the cloned-relationship has been removed). An inconsistent clone pair can transition back to a consistent state ($C_s$) at a later time, so we continue to study inconsistent clone pairs.

A change is an input action that modifies the content of one or both of the code segments in a clone pair. A change can transition the clone pair between states, or maintain the clone pair’s current state. For example, if a clone pair is in a consistent state and experiences a change that removes the cloned-relationship between the code segments in the pair, the clone pair transitions into an inconsistent state. If the change preserves the cloned-relationship between the code segments, the clone pair remains in a consistent state.

There are four possible changes:

Consistent change ($CON_c$): a change modifies one or both code segments of a clone pair in a consistent state. Such a change keeps the code segments in the clone pair in a cloned-relationship (i.e., consistent change $CON_c$ is the transition from consistent state $C_s$ to consistent state $C_s$).
Inconsistent change ($INC_c$): a change modifies one or both code segments of a clone pair in an inconsistent state. The code segments continue to be dissimilar, so the clone pair remains in an inconsistent state (i.e., inconsistent change $INC_c$ is the transition from inconsistent state $I_s$ to inconsistent state $I_s$).
Re-synchronizing change ($RESYNC_c$): a change modifies one or both code segments of a clone pair in an inconsistent state. The change causes the code segments to have a cloned-relationship. The clone pair transitions to a consistent state (i.e., re-synchronizing change $RESYNC_c$ is the transition from inconsistent state $I_s$ to consistent state $C_s$).
Diverging change ($DIV_c$): a change modifies one or both code segments of a clone pair in a consistent state. The change removes the cloned-relationship between the code segments (i.e., diverging change $DIV_c$ is the transition from consistent state $C_s$ to inconsistent state $I_s$).

A clone genealogy describes the evolutionary history of a clone pair. We define a clone genealogy as a finite transition system, $G = \{S, Act, Trans, I_0, A\}$, where:

The set of states is $S = \{C_s, I_s\}$;
The set of actions (i.e., changes) is $Act = \{CON_c, INC_c, RESYNC_c, DIV_c\}$;
The transition relations are $Trans = \{(C_s, CON_c, C_s), (C_s, DIV_c, I_s), (I_s, INC_c, I_s), (I_s, RESYNC_c, C_s)\}$;
The set of initial states is $I_0 = \{C_s\}$; and
The accepting states are $A = \{C_s, I_s\}$.

Figure 1 is a pictorial representation of the clone genealogy transition system. A genealogy is a finite model, and grows as changes are applied to a clone pair, terminating in either a consistent or an inconsistent state. A clone pair starts from a consistent state when the clone pair can be detected. Therefore, a clone genealogy is always initiated in a consistent state.
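The transition system can be made concrete with a small sketch. The following Python fragment is illustrative only and is not taken from the authors' scripts: the short labels ("C", "I", "CON", "DIV", "INC", "RESYNC") and the function name `replay_changes` are assumptions chosen for this example. It encodes the four transition relations as a lookup table and replays a sequence of changes from the consistent initial state, rejecting any change that is not defined for the current state.

```python
# Transition relation Trans as a lookup table: (state, change) -> next state.
# "C" / "I" stand for the consistent / inconsistent states; the labels are
# hypothetical shorthand for this sketch.
TRANS = {
    ("C", "CON"): "C",     # consistent change keeps the pair consistent
    ("C", "DIV"): "I",     # diverging change removes the cloned-relationship
    ("I", "INC"): "I",     # inconsistent change keeps the pair inconsistent
    ("I", "RESYNC"): "C",  # re-synchronizing change restores consistency
}

INITIAL_STATE = "C"  # a genealogy always starts in the consistent state


def replay_changes(changes):
    """Return the state path (e.g. "CCIC") produced by a list of changes."""
    state = INITIAL_STATE
    path = [state]
    for change in changes:
        key = (state, change)
        if key not in TRANS:
            raise ValueError(f"change {change!r} is not allowed in state {state!r}")
        state = TRANS[key]
        path.append(state)
    return "".join(path)
```

For instance, `replay_changes(["CON", "DIV", "INC", "RESYNC"])` yields the path `"CCIIC"`, while `replay_changes(["RESYNC"])` raises an error because a re-synchronizing change is only defined for an inconsistent clone pair.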

3.2 Six clone evolutionary patterns

A “clone pair evolutionary pattern” is a path in a graph G. The graph G represents the history of states and changes for a clone pair. It is a finite sequence of states $P = s_0 s_1 s_2 \ldots s_n$, where $s_0, s_1, s_2, \ldots, s_n \in S = \{C_s, I_s\}$. The following six evolutionary patterns define all possible paths in graph G, where n is an integer ≥ 1:

Unchanged pattern ($UNC_p$): the clone pair is formed, but never experiences any changes (i.e., $UNC_p$ is defined as the path $C_s$ in graph G).
Synchronous pattern ($SYNC_p$): the clone pair has experienced one or more changes, but remains in a consistent state (i.e., $SYNC_p$ is defined as the path $C_s C_s^n$ in graph G).
Inconsistent pattern ($INC_p$): after the creation of the clone pair, it transitions to an inconsistent state without ever experiencing any consistent changes (i.e., $INC_p$ is defined as the path $C_s I_s^n$ in graph G).
Divergent pattern ($DIV_p$): the clone pair experiences one or more consistent changes before transitioning to an inconsistent state (i.e., $DIV_p$ is defined as the path $C_s C_s^n I_s^n$ in graph G).
Late propagation pattern ($LP_p$): the clone pair transitions from a consistent state to an inconsistent state. Later, it experiences a re-synchronizing change that transitions it back to a consistent state (i.e., $LP_p$ is defined as the path $(C_s^n I_s^n)^n C_s^n$ in graph G).
Late propagation with diversion pattern ($LPDIV_p$): the clone pair undergoes late propagation, but later it experiences a diverging change that brings it back to an inconsistent state (i.e., $LPDIV_p$ is defined as the path $(C_s^n I_s^n)^n C_s^n I_s^n$ in graph G).

A clone pair with an unchanged pattern ($UNC_p$) never changes and therefore has no evolutionary history. These clone pairs are excluded from our study.
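As an illustration of how the six path definitions partition the possible state paths, the sketch below classifies a path with regular expressions. It is a reading of the definitions above and not the authors' implementation: the state path is assumed to be encoded as a string of "C" (consistent) and "I" (inconsistent) characters, every exponent n is read as an independent "one or more" repetition, and the function name `classify_pattern` is hypothetical.

```python
import re

# (pattern name, regular expression over the state path); each exponent n in
# the path definitions is interpreted here as "one or more" (an assumption).
PATTERNS = [
    ("UNC", r"C"),               # C_s
    ("SYNC", r"CC+"),            # C_s C_s^n
    ("INC", r"CI+"),             # C_s I_s^n
    ("DIV", r"CC+I+"),           # C_s C_s^n I_s^n
    ("LP", r"(C+I+)+C+"),        # (C_s^n I_s^n)^n C_s^n
    ("LPDIV", r"(C+I+)+C+I+"),   # (C_s^n I_s^n)^n C_s^n I_s^n
]


def classify_pattern(path):
    """Map a state path such as "CCIC" to the name of its evolutionary pattern."""
    for name, regex in PATTERNS:
        if re.fullmatch(regex, path):
            return name
    raise ValueError(f"not a valid genealogy path: {path!r}")
```

Under this encoding, `classify_pattern("CI")` returns `"INC"`, `classify_pattern("CCI")` returns `"DIV"`, and `classify_pattern("CICI")` returns `"LPDIV"`. Any path that does not start in the consistent state is rejected, which mirrors the requirement that a genealogy is always initiated in a consistent state.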

Figure 2 shows an example of an inconsistent clone genealogy. The example is a code segment from a clone containing 18 lines of code and is taken from ArgoUML using the clone detection tool NiCad. When the clone pair is created in revision 7646, it is in a consistent state ($C_s$). Its genealogy is described by the graph G and it exhibits an unchanged pattern ($UNC_p$). Clone A then experiences a diverging change ($DIV_c$) that modifies several lines of code. The clone pair is now in an inconsistent state ($I_s$). This gives it the path $C_s I_s$ in graph G, which belongs to the inconsistent evolutionary pattern ($INC_p$).

The inconsistent and divergent evolutionary patterns are similar. However, in a divergent evolutionary pattern ($DIV_p$), a clone pair must experience at least one consistent change before a diverging change occurs. A clone pair demonstrating an inconsistent evolutionary pattern ($INC_p$) diverges immediately after the clone pair is formed. Clone pairs exhibiting an inconsistent pattern ($INC_p$) may be “false positive” clones, since the clone pair never experiences any consistent or re-synchronizing changes. They may also be intentionally transitioned to an inconsistent state. For example, a developer may copy code and then extensively modify it for a new environment (Kapser and Godfrey 2006). Because clones exhibiting inconsistent and divergent patterns cannot be identified by a clone detection tool, they are more difficult to monitor, and could be more at risk of faults due to a lack of propagation of changes.

Late propagation ($LP_p$) occurs much less frequently than other evolutionary patterns (Thummalapenta et al. 2010). However, previous studies (Thummalapenta et al. 2010) have shown that late propagation is risky and fault-prone. For example, the diverging change in a late propagation may be accidental, given that the clone pair is later re-synchronized. Accidental changes to clones are considered risky. Therefore, late propagation is considered risky (Thummalapenta et al. 2010). Late propagation with diversion ($LPDIV_p$) is a special case of the late propagation evolutionary pattern. A clone pair first experiences a late propagation evolutionary pattern ($LP_p$), i.e., a diverging change later followed by a re-synchronizing change. The clone pair then diverges a second time, creating the late propagation with diversion ($LPDIV_p$) evolutionary pattern. The frequent change of state in the late propagation with diversion pattern might indicate that developers have difficulty in monitoring and propagating changes between clone pairs.

4 Study design

This section describes the setup of our quasi-experiment that aims to identify fault-prone states and changes in clone genealogies. Figure 3 shows an overview of the steps we use to extract clone information from a source code repository and build clone genealogies. We describe our steps in more detail in the following subsections. We share our analytic scripts and data at: https://github.com/swatlab/clone_genealogies.

4.1 Subject systems

We select four open-source Java systems as the subject systems. All of the subject systems possess a long development history, which makes them suitable for our clone genealogy study.

Apache Ant is an open-source build-tool with an extensive Java library. We study its revision history from January 2000 to July 2016.
ArgoUML is a UML-modeling software system. We study its commit history from January 1998 to January 2015 (i.e., until the most recent version of the project).
JEdit is an open-source text editor built for programmers. It is written in Java, and provides support for editing more than 200 programming languages. Many plug-ins have been written for JEdit. In this study, we only examine the editor. The project started in 1998 and is still under development. We examine its revision history from September 2001 to July 2015.
Maven is a build automation tool used primarily for Java projects. We study its commit history from September 2003 to July 2016.

Table 1 summarizes the characteristics of each system. We use the SLOCCount tool (Wheeler 2016) to count the total number of lines of code (LOC) and the percentage of Java code in each project. For each project, we provide LOC for the last studied revision. Table 2 shows the numbers of faulty changes and the numbers of clean changes for each subject system.

Table 1 Characteristics of the systems
