1 Introduction

Cloning occurs when two code segments are highly similar or identical to each other. Each code segment is known as a clone, and the two clones form a clone pair. A clone group is a set of code segments, where any two of them form a clone pair. Cloning is a common practice in software development. Some clones are introduced intentionally through the copy and paste actions of developers, while others are introduced accidentally (Roy and Cordy 2007).

Once created, clones evolve as they are modified during both the development and maintenance phases of software systems. A clone pair is in a consistent state if the clones are identical or similar. A clone pair is in an inconsistent state if they are no longer similar. Over time, a clone pair is either in a consistent state or an inconsistent state. Clones that become inconsistent can be later re-synchronized, and consistent clones can diverge. The set of states and changes between the states experienced by a clone pair across versions of a system is known as a “clone pair genealogy.” Furthermore, a clone genealogy can exhibit a specific “clone evolutionary pattern”, which defines a specific ordering of states and changes that occur frequently in clone genealogies over the lifetime of a software system. For example, a consistent clone pair that transitions to an inconsistent state, and then re-synchronizes to a consistent state is known as late propagation (Thummalapenta et al. 2010). Clone pairs that transition to inconsistent states during their evolution are difficult to monitor using clone detection tools. Hence, they are more at risk of faults due to a lack of propagation of changes.

Previous studies (Thummalapenta et al. 2010; Aversano et al. 2007; Barbour et al. 2011; 2013) on clone genealogies have defined specific clone evolutionary patterns and studied their relationship with faults. Specific clone evolutionary patterns have been identified as fault-prone without detailed information. More specifically, a genealogy only provides details about the past, but cannot inform a developer about whether the current state or the next change will be risky. Moreover, the history of the clone groups has also not been considered when predicting faults in clones. In our work, we examine clone evolutionary patterns and changes within clone genealogies and their relationship with faults and strive to provide insights on the characteristics of fault-prone clones. For each clone group, we analyze all the clone pairs within the clone group. We chose to study clone pairs instead of clone groups since clone pairs within the same clone group are not equally risky. Additionally, we investigate if metrics collected from clone pair genealogies can improve the performance of prediction models when identifying clone pairs that are at a higher risk of faults.

We investigate the clone genealogies of four open-source software systems (i.e., Apache Ant, ArgoUML, JEdit, Maven). Using the cloning information from each system, we address the following research questions:

  • RQ1: Which clone evolutionary patterns and clone changes are most at risk of faults? We examine if a specific evolutionary pattern or change is found to be more prone to faults. Clone pairs exhibiting a fault-prone pattern or experiencing a fault-prone change should be flagged for future monitoring.

  • RQ2: Does the size of a clone or the time interval between changes affect the fault-proneness of a clone pair? We expand on the previous question to determine if the size of the clone (in LOC) or the time interval between consecutive changes to a clone pair can be used to highlight fault-prone clone pairs. We suggest that these characteristics may influence the fault-proneness of clone evolutionary patterns and changes. Our results can be used to refine the identification of clone pairs at risk of faults. This helps determine where testing and review efforts should be focused.

  • RQ3: Can we predict faults in software clones using clone genealogy information? One snapshot of a software system provides limited information that can be used to predict faults in a clone pair. However, genealogy information about a clone pair takes more effort to collect and track. We propose metrics to capture information about the genealogy of a clone pair and we use statistical models to establish and inspect dependencies between the metrics and faults in clone pairs.

We provide three contributions in this paper:

  • We give a formal definition of clone pair genealogies and clone evolutionary patterns.

  • We identify characteristics of fault-prone clone pair states and changes in clone pair genealogies that can be used to locate the most fault-prone clones.

  • We show that clone genealogy information can increase the explanatory power of fault prediction models. In particular, the number of previous faults in a clone pair can help predict future faults in the clone fragments.

Organization

Section 2 summarizes related studies on clone genealogies and prediction of faults. Section 3 discusses the building blocks of genealogies and clone evolutionary patterns. Section 4 outlines our study approach. Section 5 summarizes the study results. Section 6 reports on a qualitative evaluation of the genealogies and discusses the results of the study.

Section 7 discusses the threats to the validity of our study. Section 8 concludes the paper and outlines avenues for future work.

2 Related work

2.1 Clone genealogies

Kim et al. (2005) performed the first study of clone evolution. They analyzed groups of clone snippets, known as clone classes, and described the types of changes that can be experienced by a clone class. In our work, we examine clones at the clone pair level to identify which clone pairs are most at risk of faults. A clone class with dozens of members may only contain a few risky clone pairs. Kim et al. also performed a case study using CCFinder and found that clones are very volatile. Half the clones became inconsistent within eight check-ins. In our work, we continue to examine clones after they become inconsistent to examine their fault-proneness.

2.2 Bug-proneness of code clones

Rahman et al. (2012) explored the relationship between defect-proneness and code clones by analyzing four open-source C projects. They did not observe a strong correlation between bugs and code clones, nor a correlation between bug-proneness and cloned code size. Their findings challenge the claim of Fowler et al. (2009) that code clones are “bad code smells.” However, this study considered neither the evolution of code clones nor a more popular programming language, Java. Juergens et al. (2009) conducted a case study on open-source and commercial systems to investigate whether inconsistent changes to code clones can lead to defects. They observed that nearly half of the unintentionally inconsistent changes caused defects. Although this work only studies one type of clone evolution, it motivates us to further explore the relationship between bug-proneness and other clone evolution types.

2.3 Analysis of clone genealogies

Krinke (2007) examined inconsistent and consistent changes to clones in 200 weekly snapshots of five open-source systems. The study examined identical clones. About half of the changes to identical clones were consistent. The results may have been affected by the time interval of 1 week between snapshots. Changes to the clones, including inconsistent changes, may have occurred between snapshots. In our work, we examine the relationship between delay since the last change and faults.

Krinke performed a second study on the stability of cloned code, which was repeated and extended by Göde and Harder (2011). In Göde et al.’s work, they performed clone detection using a token-based clone detection tool, but used a time interval between snapshots of one commit. Overall, their findings agreed with Krinke that cloned code is much more stable than non-cloned code. They also experimented with the parameters of their clone detection tool and showed that the results are impacted by the choice of parameters. Because of this result, we use two different clone detection tools in our study, with each one implementing a different clone detection technique. These two clone detection tools (i.e., NiCad and iClones) were found by Svajlenko and Roy (2014) to achieve higher precision and recall than the other tools they compared. For each clone detection tool, we use the same parameters as in the study of Svajlenko and Roy (2014), which compared 11 different clone detection tools.

Göde and Koschke (2011) performed a different study on code clones and found that over half of the clones in three systems were stagnant. In other words, once they were formed, they were never modified. About 12% of the clones experienced a change. In a similar study (Göde and Koschke 2011), Göde et al. found that 87.8% of clones are never changed or only changed once. They suggest that these clones are irrelevant to developers. In our study, we consider all the changes that occurred during the evolutionary history of the clones and identify the most fault-prone ones. We also consider metrics that contain historical information taken from clone genealogies to identify fault-prone clone pairs.

Göde and Harder (2011) performed a study examining consecutive change pairs within clone genealogies. They defined four different types of consecutive changes. They examined the relationship between the change pair type, the delay between changes, the change author, and the location of the clones in the project structure and whether an inconsistent change was intentional. Overall, they found that two consecutive changes are the most common change pair, and that few inconsistent changes were accidental.

Thummalapenta et al. (2010) performed a study that looked at four different types of clone evolutionary patterns within clone classes. They classified their clone classes into consistent evolution, independent, delayed propagation, and late propagation evolutionary patterns. They found that the first two patterns were the most common types. They concluded that each pattern experienced a different proportion of faults within a software system. In our work, we examine clones in more detail and define further clone evolutionary patterns.

Barbour et al. (2011, 2013) investigated faults in eight different types of late propagation and found that late propagation clones are more fault-prone when (i) clones in the pair undergo a diverging modification followed by a reconciling change that modifies both clones in the clone pair or (ii) clones in the pair experience diverging changes, followed by a reconciling change that modifies only the diverging clone in the clone pair. They also reported that the size of the clones experiencing late propagation has an effect on the fault-proneness of specific types of late propagation genealogies. Recently, Mondal et al. (2016) investigated the frequency of late propagation for different types of clones (i.e., type 1, type 2, and type 3) using the NiCad clone detection tool. They found that late propagation occurs more frequently in type 3 clones. They also observed that late propagations of type 3 clones are more fault-prone than late propagations of either type 1 or type 2 clones. In this paper, we build on these previous works to analyze the fault-proneness of all types of clone evolutionary patterns (i.e., not only late propagation).

Xie et al. (2013) investigated two evolutionary phenomena on clones: the mutation of the type of a clone during the evolution of a system, and the migration of clone segments across the repositories of a software system. They observed that clone migration and clone mutation occur frequently in clone genealogies, and that increasing the distance between code segments in a clone group during the evolution of the system increases the risk for faults. They also found that mutating clones to type 2 or type 3 increases the risk for faults. In a follow-up study (Xie et al. 2014), they examined the fault-proneness of clone migration in clone genealogies and found that migrated clone segments, clone groups, and clone genealogies are not equally fault-prone. They also found that when a clone mutation occurs during a clone migration, the risk for faults in the migrated clone is increased. The migration of a clone that was not changed for a long period of time is also reported to be risky.

2.4 Statistical explanatory models

Several studies have investigated the use of process and product metrics to build fault prediction and explanatory models.

Khoshgoftaar et al. (1996) analyzed two consecutive releases of a large software system used in telecommunications and showed that the number of past added/removed lines of code is a good predictor of future faults at the module level. Bernstein et al. (2007) used the number of revisions and corrections on a file, recorded in a given amount of time, to predict the location of faults. Graves et al. (2000) investigated different predictors of faults using statistical models and found that the sum of contributions from all changes to a module is the best predictor of faults in a module. Nagappan and Ball (2005) analyzed the relation between code churn (i.e., the number of lines added to, modified in, or deleted from a file) and fault density in Windows Server 2003 and concluded that relative code churn measures are better predictors of fault density than absolute code churn measures. Hassan (2009) introduced the notion of entropy of changes to capture the complexity of a source code change process. He performed a case study using six open-source software systems and found that the entropy of changes is a better predictor of faults than traditional predictors like the number of changes or the number of previous faults.

El Emam et al. (2001) combined Chidamber and Kemerer metrics (Chidamber and Kemerer 1994) with Briand et al.’s coupling metrics (Briand et al. 1999) to predict faults in a large commercial Java system. Nagappan et al. (2006) investigated the use of source code metrics to predict post-release faults at the module level using five Microsoft software systems. They found that complexity metrics can successfully predict post-release faults, but that the set of best predictors was system-dependent. Zimmermann et al. (2007) also used source code metrics to predict faults in Eclipse. Arisholm and Briand (2006) proposed the use of code quality, class structure, changes in class structure, and the history of class-level changes and faults to predict faulty classes.

Moser et al. (2008) performed a comparative analysis of the predictive power of process and source code metrics for fault prediction and found that process metrics are better predictors of faults than product metrics. In this work, we examine whether clone genealogy metrics can be used to increase the performance of fault prediction models built using product and process metrics.

Kononenko et al. (2015) investigated the relationships between the quality of code review and technical, personal, and participation factors in the code review process. They found that both personal and participation factors can influence the quality of code review. McIntosh et al. (2015) built explanatory models to explore the impact of the code review process on software quality. They found a significant correlation between code review quality and factors related to code review coverage, participation, and reviewers’ expertise. In this paper, we build explanatory models to investigate the relationship between clone genealogy metrics and fault-proneness.

3 Clone evolutionary patterns

3.1 States and transitions of clone pairs

A clone pair can either be in a consistent state (\(C_{s}\)) or an inconsistent state (\(I_{s}\)). We define the set of states of a clone pair as \(S = \{C_{s}, I_{s}\}\). The two states are shown as circles in Fig. 1. A clone pair is in a consistent state if the code segments in the pair are identical or similar (i.e., have a cloned-relationship). A clone pair is in an inconsistent state if the code segments in the pair are no longer similar (i.e., the cloned-relationship has been removed). An inconsistent clone pair can transition back to a consistent state (\(C_{s}\)) at a later time, so we continue to study inconsistent clone pairs.

Fig. 1 Clone pair states and changes

A change is an input action that modifies the content of one or both of the code segments in a clone pair. A change can transition the clone pair between states, or maintain the clone pair’s current state. For example, if a clone pair is in a consistent state and experiences a change that removes the cloned-relationship between the code segments in the pair, the clone pair transitions into an inconsistent state. If the change preserves the cloned-relationship between the code segments, the clone pair remains in a consistent state.

There are four possible changes:

  • Consistent change (\(CON_{c}\)): a change modifies one or both code segments of a clone pair in a consistent state. Such a change keeps the code segments in the clone pair in a cloned-relationship (i.e., consistent change \(CON_{c}\) is the transition from consistent state \(C_{s}\) to consistent state \(C_{s}\)).

  • Inconsistent change (\(INC_{c}\)): a change modifies one or both code segments of a clone pair in an inconsistent state. The code segments continue to be dissimilar, so the clone pair remains in an inconsistent state (i.e., inconsistent change \(INC_{c}\) is the transition from inconsistent state \(I_{s}\) to inconsistent state \(I_{s}\)).

  • Re-synchronizing change (\(RESYNC_{c}\)): a change modifies one or both code segments of a clone pair in an inconsistent state. The change causes the code segments to have a cloned-relationship. The clone pair transitions to a consistent state (i.e., re-synchronizing change \(RESYNC_{c}\) is the transition from inconsistent state \(I_{s}\) to consistent state \(C_{s}\)).

  • Diverging change (\(DIV_{c}\)): a change modifies one or both code segments in a clone pair in a consistent state. The change removes the cloned-relationship between the code segments (i.e., diverging change \(DIV_{c}\) is the transition from consistent state \(C_{s}\) to inconsistent state \(I_{s}\)).

A clone genealogy describes the evolutionary history of a clone pair. We define a clone genealogy as a finite transition system, \(G = \{S, Act, Trans, I_{0}, A\}\), where:

  • The set of states is \(S = \{C_{s}, I_{s}\}\);

  • The set of actions (i.e., changes) is \(Act = \{CON_{c}, INC_{c}, RESYNC_{c}, DIV_{c}\}\);

  • The transition relations are \(Trans = \{(C_{s}, CON_{c}, C_{s}), (C_{s}, DIV_{c}, I_{s}), (I_{s}, INC_{c}, I_{s}), (I_{s}, RESYNC_{c}, C_{s})\}\);

  • The set of initial states is \(I_{0} = \{C_{s}\}\); and

  • The accepting states are \(A = \{C_{s}, I_{s}\}\).

Figure 1 is a pictorial representation of the clone genealogy transition system. A genealogy is a finite model, and grows as changes are applied to a clone pair, terminating in either a consistent or an inconsistent state. A clone pair starts from a consistent state when the clone pair can be detected. Therefore, a clone genealogy is always initiated in a consistent state.
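The transition system can be sketched in a few lines of Python. This is only an illustration of the definition above, not the implementation used in the study; the names `State`, `Change`, `TRANS`, and `run_genealogy` are hypothetical.

```python
from enum import Enum


class State(Enum):
    CONSISTENT = "Cs"
    INCONSISTENT = "Is"


class Change(Enum):
    CON = "CONc"
    DIV = "DIVc"
    INC = "INCc"
    RESYNC = "RESYNCc"


# Transition relation Trans, keyed by (source state, change).
TRANS = {
    (State.CONSISTENT, Change.CON): State.CONSISTENT,
    (State.CONSISTENT, Change.DIV): State.INCONSISTENT,
    (State.INCONSISTENT, Change.INC): State.INCONSISTENT,
    (State.INCONSISTENT, Change.RESYNC): State.CONSISTENT,
}


def run_genealogy(changes):
    """Return the state sequence visited when applying `changes` in order.

    Every genealogy starts in the consistent state (the only initial state).
    Raises ValueError if a change is not allowed in the current state.
    """
    states = [State.CONSISTENT]
    for change in changes:
        key = (states[-1], change)
        if key not in TRANS:
            raise ValueError(
                f"{change.name} is not allowed in state {states[-1].name}"
            )
        states.append(TRANS[key])
    return states
```

For instance, applying a consistent change, a diverging change, an inconsistent change, and a re-synchronizing change in that order visits the consistent state twice, the inconsistent state twice, and ends in the consistent state. Since both states are accepting, every sequence of allowed changes yields a valid genealogy.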

3.2 Six clone evolutionary patterns

A “clone pair evolutionary pattern” is a path in a graph G. The graph G represents the history of states and changes for a clone pair. It is a finite sequence of states \(P = s_{0} s_{1} s_{2} \ldots s_{n}\), where \(s_{0}, s_{1}, s_{2}, \ldots, s_{n} \in S = \{C_{s}, I_{s}\}\). The following six evolutionary patterns define all possible paths in graph G, where n is an integer ≥ 1:

  • Unchanged pattern (\(UNC_{p}\)): the clone pair is formed, but never experiences any changes (i.e., \(UNC_{p}\) is defined as the path \(C_{s}\) in graph G).

  • Synchronous (\(SYNC_{p}\)): the clone pair has experienced one or more changes, but remains in a consistent state (i.e., \(SYNC_{p}\) is defined as the path \(C_{s}{C_{s}^{n}}\) in graph G).

  • Inconsistent pattern (\(INC_{p}\)): after the creation of the clone pair, it transitions to an inconsistent state without ever experiencing any consistent changes (i.e., \(INC_{p}\) is defined as the path \(C_{s}{I_{s}^{n}}\) in graph G).

  • Divergent pattern (\(DIV_{p}\)): the clone pair experiences one or more consistent changes before transitioning to an inconsistent state (i.e., \(DIV_{p}\) is defined as the path \(C_{s}{C_{s}^{n}}{I_{s}^{n}}\) in graph G).

  • Late propagation pattern (\(LP_{p}\)): the clone pair transitions from a consistent state to an inconsistent state. Later, it experiences a re-synchronizing change that transitions it back to a consistent state (i.e., \(LP_{p}\) is defined as the path \(({C_{s}^{n}}{I_{s}^{n}})^{n}{C_{s}^{n}}\) in graph G).

  • Late propagation with diversion pattern (\(LPDIV_{p}\)): the clone pair undergoes late propagation, but later it experiences a diverging change that brings it back to an inconsistent state (i.e., \(LPDIV_{p}\) is defined as the path \(({C_{s}^{n}}{I_{s}^{n}})^{n}{C_{s}^{n}}{I_{s}^{n}}\) in graph G).

A clone pair with an unchanged pattern (\(UNC_{p}\)) never changes and therefore has no evolutionary history. These clone pairs are excluded from our study.
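As an illustration, the six patterns can be recognized by collapsing a state sequence into runs of equal states. The sketch below is one possible reading of the path definitions, assuming that each exponent n may take a different value at each occurrence; the function name `classify_pattern` and the one-letter encoding ("C" for consistent, "I" for inconsistent) are hypothetical.

```python
from itertools import groupby


def classify_pattern(states):
    """Classify a state sequence such as "CCIIC" into one of the six patterns."""
    if not states or states[0] != "C" or set(states) - {"C", "I"}:
        raise ValueError("a genealogy is a non-empty C/I sequence starting in C")
    # Collapse the sequence into runs, e.g. "CCIIC" -> [("C", 2), ("I", 2), ("C", 1)].
    runs = [(state, len(list(group))) for state, group in groupby(states)]
    if len(runs) == 1:
        # Only consistent states: unchanged if no change ever happened.
        return "UNC" if runs[0][1] == 1 else "SYNC"
    if len(runs) == 2:
        # One divergence and no re-synchronization. A single leading
        # consistent state means no consistent change preceded the divergence.
        return "INC" if runs[0][1] == 1 else "DIV"
    # Three or more runs: at least one re-synchronization happened.
    return "LP" if runs[-1][0] == "C" else "LPDIV"
```

Under this reading, a pair that diverges right after creation and is later re-synchronized (the sequence "CIC") is classified as late propagation, and a further divergence ("CICI") turns it into late propagation with diversion.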

Figure 2 shows an example of an inconsistent clone genealogy. The example is a code segment from a clone containing 18 lines of code and is taken from ArgoUML using the clone detection tool NiCad. When the clone pair is created in revision 7646, it is in a consistent state (\(C_{s}\)). Its genealogy is described by the graph G and it exhibits an unchanged pattern (\(UNC_{p}\)). Clone A then experiences a diverging change (\(DIV_{c}\)) that modifies several lines of code. The clone pair is now in an inconsistent state (\(I_{s}\)). This gives it the path \(C_{s}I_{s}\) in graph G, which belongs to the inconsistent evolutionary pattern (\(INC_{p}\)).

Fig. 2 An example of an inconsistent genealogy from ArgoUML using NiCad (inconsistent lines are highlighted)

The inconsistent and divergent evolutionary patterns are similar. However, in a divergent evolutionary pattern (\(DIV_{p}\)), a clone pair must experience at least one consistent change before a diverging change occurs. A clone pair demonstrating an inconsistent evolutionary pattern (\(INC_{p}\)) diverges immediately after the clone pair is formed. Clone pairs exhibiting an inconsistent pattern (\(INC_{p}\)) may be “false positive” clones, since the clone pair never experiences any consistent or re-synchronizing changes. They may also be intentionally transitioned to an inconsistent state. For example, a developer may copy a code fragment and then extensively modify it for a new environment (Kapser and Godfrey 2006). Because clones exhibiting inconsistent and divergent patterns can no longer be identified by a clone detection tool, they are more difficult to monitor, and could be more at risk of faults due to a lack of propagation of changes.

Late propagation (\(LP_{p}\)) occurs much less frequently than other evolutionary patterns (Thummalapenta et al. 2010). However, previous studies (Thummalapenta et al. 2010) have shown that late propagation is risky and fault-prone. For example, the diverging change in a late propagation may be accidental, given that the clone pair is later re-synchronized. Because accidental changes to clones are considered risky, late propagation is considered risky as well (Thummalapenta et al. 2010). Late propagation with diversion (\(LPDIV_{p}\)) is a special case of the late propagation evolutionary pattern. A clone pair first experiences a late propagation evolutionary pattern (\(LP_{p}\)), i.e., a diverging change later followed by a re-synchronizing change. The clone pair then diverges a second time, creating the late propagation with diversion (\(LPDIV_{p}\)) evolutionary pattern. The frequent state changes in the late propagation with diversion pattern might indicate that developers have difficulty in monitoring and propagating changes between clone pairs.

4 Study design

This section describes the setup of our quasi-experiment that aims to identify fault-prone states and changes in clone genealogies. Figure 3 shows an overview of the steps we use to extract clone information from a source code repository and build clone genealogies. We describe our steps in more detail in the following subsections. We share our analytic scripts and data at: https://github.com/swatlab/clone_genealogies.

Fig. 3 Overview of the analysis process

4.1 Subject systems

We select four open-source Java systems as the subject systems. All of the subject systems possess a long development history, which is suitable for our clone genealogy study.

  • Apache Ant is an open-source build-tool with an extensive Java library. We study its revision history from January 2000 to July 2016.

  • ArgoUML is a UML-modeling software system. We study its commit history from January 1998 to January 2015 (i.e., until the most recent version of the project).

  • JEdit is an open-source text editor built for programmers. It is written in Java, and provides support for editing more than 200 programming languages. Many plug-ins have been written for JEdit. In this study, we only examine the editor. The project started in 1998 and is still under development. We examine its revision history from September 2001 to July 2015.

  • Maven is a build automation tool used primarily for Java projects. We study its commit history from September 2003 to July 2016.

Table 1 summarizes the characteristics of each system. We use the SLOCCount tool (Wheeler 2016) to count the total number of lines of code (LOC) and the percentage of Java code in each project. For each project, we provide LOC for the last studied revision. Table 2 shows the numbers of faulty changes and the numbers of clean changes for each subject system.

Table 1 Characteristics of the systems
Table 2 Number of faulty and clean clone changes in each system

We examined the length of the clone genealogies contained in the selected software systems and observed that more than 50% of the genealogies only experienced 1–2 changes.

Figures 4 and 5 show the frequency of the number of changes in each of the clone genealogies. Overall, although the studied systems contain high numbers of genealogies, the genealogies tend to be short. In this paper, we do not consider the unchanged clone pattern (\(UNC_{p}\)); hence, all the studied clone genealogies experienced at least one change. Figure 6 depicts the number of clone genealogies deriving from a specific commit. In this figure, we eliminated outliers. The median value for each project is less than 3, implying that only a few clone genealogies start from each commit.

Fig. 4 Percentage of the frequency of the number of changes in a studied clone genealogy detected by NiCad

Fig. 5 Percentage of the frequency of the number of changes in a studied clone genealogy detected by iClones

Fig. 6 Number of clone genealogies starting from a specific commit

4.2 Data preprocessing

Git provides high-performance commands to analyze a repository’s history, for example to extract changed files, to extract renamed files, and to blame the lines of faulty files. Since the source code of ArgoUML and JEdit is managed by SVN, we use Git’s git-svn command to convert the two systems’ repositories to Git. Then, we use the following command to extract each commit’s commit ID, committer email, commit date, and commit message:

  • git log --pretty=format:"%H,%ae,%ai,%s"
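The output of this command can be parsed with a short script. The following is a minimal sketch rather than the script used in the study; the function names and the dictionary keys are hypothetical, and it assumes that the email field contains no comma.

```python
import subprocess


def parse_log_line(line):
    """Split one `%H,%ae,%ai,%s` line into its four fields.

    The subject (%s) may itself contain commas, so split at most three times.
    """
    commit_id, email, date, message = line.split(",", 3)
    return {"id": commit_id, "email": email, "date": date, "message": message}


def read_commits(repo_path):
    """Run the log command in `repo_path` and parse every output line."""
    out = subprocess.run(
        ["git", "log", "--pretty=format:%H,%ae,%ai,%s"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    return [parse_log_line(line) for line in out.splitlines() if line]
```

Limiting the split to three commas keeps commit messages such as "Fix bug 42, again" intact.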

4.3 Detecting faulty changes

We leverage the SZZ algorithm (Śliwerski et al. 2005) to detect changes that introduced faults. We first apply Fischer et al.’s heuristic (Fischer et al. 2003) to identify fault-fixing commits by using regular expressions to detect bug IDs in the studied commit messages. We then mine the subject systems’ bug tracking systems (issuezilla for ArgoUML, Jira for Ant and Maven, and SourceForge for JEdit) to extract their bugs’ creation dates. Next, we extract the modified files of each fault-fixing commit through the following Git command:

  • git log [commit-id] -n 1 --name-status

In this paper, we only take modified Java files into account. Given each file F in a commit C, we extract C’s parent commit \(C^{\prime }\). For Ant and Maven, we use the [commit-id]^ notation to obtain \(C^{\prime }\); for ArgoUML and JEdit, since their repositories were converted from SVN, we find C’s preceding commit \(C^{\prime }\) by time, i.e., \(C^{\prime }\) is the nearest commit prior to C. Then, we use Git’s diff command to extract F’s deleted lines. We apply Git’s blame command to identify the commits that introduced these deleted lines, which we call the “candidate faulty changes.” We eliminate the commits that only changed blank and comment lines. Finally, we filter out the commits that were submitted after their corresponding bugs’ creation dates.
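Two of these steps, detecting bug IDs in commit messages and the final date filter, can be sketched as follows. The regular expression is a hypothetical example and not the one used in the study (for instance, it does not match Jira-style keys); the function names and the tuple layout are assumptions as well.

```python
import re
from datetime import datetime

# Hypothetical pattern: matches messages such as "Fix bug 1234" or "issue #1234".
BUG_ID = re.compile(r"\b(?:bug|issue|fix(?:es|ed)?)\s*#?\s*(\d+)", re.IGNORECASE)


def bug_ids(message):
    """Return the bug IDs mentioned in a commit message."""
    return [int(num) for num in BUG_ID.findall(message)]


def faulty_changes(candidates, bug_created):
    """Keep candidate commits that were submitted before the bug was reported.

    candidates: iterable of (commit_id, commit_date, bug_id) tuples obtained by
    blaming the lines deleted in a fault-fixing commit.
    bug_created: dict mapping bug_id to its creation datetime.
    """
    return [
        commit_id
        for commit_id, commit_date, bug_id in candidates
        if bug_id in bug_created and commit_date < bug_created[bug_id]
    ]
```

A candidate committed after the bug was reported cannot have introduced that bug, which is why the date comparison removes it.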

4.4 Extracting clone genealogies

Extracting clone genealogies from each subject system requires three steps: removing test files, detecting clones, and building clone genealogies.

Removing test files

Test files are frequently copied and then modified to create multiple test cases, so they often contain clones. These files are used for development purposes and not used during the normal execution of the system. They may also contain syntactically incorrect code. For all these reasons, we believe that clones in test code should be studied separately from clones in production code. Therefore, we exclude test files from our study. In future work, we plan to examine the evolution of clones in test code which are nevertheless clones and need to be maintained. To remove the test files, we perform a search on each system for files and folders with a filename containing the word “test.” We then manually verify each file before removing it from the study to prevent the automatic removal of a non-test file, such as a file with the name “updateState.java.” At the end of this semi-automatic process, we also manually verify all the remaining files in our data set, to ensure that no semantically test-related files remain in the data set of our study.
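The automatic part of this search can be sketched as below. The sketch assumes a case-insensitive match on every path component, which is consistent with the “updateState.java” example; the function names are hypothetical, and the manual verification step is not modeled.

```python
from pathlib import PurePosixPath


def is_test_candidate(path):
    """True if any file or folder name in `path` contains the word "test"."""
    return any("test" in part.lower() for part in PurePosixPath(path).parts)


def split_candidates(paths):
    """Separate paths into (candidates for manual review, remaining files)."""
    candidates = [p for p in paths if is_test_candidate(p)]
    remaining = [p for p in paths if not is_test_candidate(p)]
    return candidates, remaining
```

A name such as “updateState.java” is flagged by this search because “updatestate” contains “test”, which illustrates why each candidate is checked by hand before removal.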

Detecting clones

We use two existing clone detection tools to detect clones in the four systems: NiCad (Roy and Cordy 2008) and iClones (Göde and Koschke 2009). We use the most recent versions of both tools: NiCad-4 and iClones-0.2. We select these two clone detection tools because they are recommended by Svajlenko and Roy (2014), who compared the performance of 11 clone detection tools from the literature. Today, NiCad and iClones are considered state-of-the-art tools by the clone community (Svajlenko and Roy 2014).

Both NiCad and iClones use a hybrid approach to detect clones. We use the default settings of NiCad to detect clones greater than 10 lines of code, and the default settings of iClones to detect clones with a minimum of 100 tokens. We detect identical clones and clones where the variable names are different (i.e., “blindrename”). The same settings were used by Svajlenko and Roy (2014) in their comparison of 11 clone detection tools. With these settings, NiCad and iClones were found to achieve higher precision and recall in comparison to the other nine clone detection tools that were studied.

We use the Git checkout command to retrieve a system’s snapshot for a specific commit. Then we perform clone detection on each of the snapshots of the studied systems. Table 1 summarizes the number of clone pairs and clone genealogies found in each subject system using both clone detection tools. For ArgoUML and Ant, NiCad detected more clone pairs than iClones, while for JEdit and Maven, iClones detected more clone pairs. This difference in the number of clones found by the two detection tools is likely due to the lack of agreement on the definition of code clones (Lakhotia et al. 2003) and to the implementation of the tools.

Building clone genealogies

Each clone detection tool outputs a list of clones within each source code repository. To create a set of clone pair genealogies, we link the clone pairs between each commit. A change to a clone can affect its size. A change to the file containing the clone, even if it does not affect the clone itself, can shift the clone’s line numbers. To account for these changes when mapping clones, we use the Git diff command to query for a list of changes to each Java file. We limit our genealogies to describe only changes that modify the clone contents, not the clone line numbers. This is because a shift in the line numbers cannot cause the clone pair to transition to a different state.

We build a clone genealogy for each clone pair detected by the clone detection tool. We first extract a system’s commit sequence list. For ArgoUML and JEdit, which were originally managed by SVN, we sort their commits by time in ascending order. For Ant and Maven, which are managed by Git, we make a list and put a system’s last commit as the first element. Then we recursively look for the list’s last element’s parent commit until the system’s first commit is met. We reverse the lists to obtain Ant and Maven’s commit sequence lists.
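The parent-walking step for the Git-managed systems can be sketched as follows. This is a minimal illustration, not the scripts used in the study: the `parent_of` dictionary is a hypothetical stand-in for the repository history, and the sketch assumes that each commit has a single parent to follow (how merge commits are handled is not specified above).

```python
def commit_sequence(last_commit, parent_of):
    """Return the commits ordered from the first commit to the last one.

    `parent_of` is a hypothetical mapping from a commit id to the id of its
    parent commit (None for the first commit of the system).
    """
    sequence = [last_commit]
    # Walk backwards from the last commit until the first commit is met.
    while parent_of.get(sequence[-1]) is not None:
        sequence.append(parent_of[sequence[-1]])
    # The walk produced newest-to-oldest order, so reverse it.
    sequence.reverse()
    return sequence


# Hypothetical three-commit history.
history = {"c3": "c2", "c2": "c1", "c1": None}
print(commit_sequence("c3", history))
```

An iterative walk is used here instead of recursion only to avoid Python's recursion limit on long histories; the resulting order is the same.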

For each clone pair, we track its modifications in every commit along the commit sequence list. If a commit, C new , changed a file that contains code in the clone pair, we use the diff command to compare the commit with its previous commit, C old , in order to check whether the clone snippets are modified and to map the start and end line numbers from C old to C new . We use the third-party Python patch parsing library whatthepatch (Corley 2016) to extract the line mapping of a clone file between C old and C new . If the first lines, L 1 to L n , of a clone snippet are deleted in C old and no corresponding lines are added in C new to replace them, we map L n+1 from C old to C new as the start line. Similarly, if the last lines, L x to L x+n , are deleted in C old and no corresponding lines are added in C new to replace them, we map L x−1 from C old to C new as the end line.
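The boundary mapping can be illustrated with a small, self-contained sketch. It does not call whatthepatch; instead it assumes that the parsed diff has already been reduced to two hypothetical sets: the old line numbers that were deleted and the new line numbers that were added. It covers only the pure deletion case described above and does not treat replacement lines specially.

```python
def map_line(old_line, deleted_old, added_new):
    """Map a surviving line number in C_old to its line number in C_new."""
    if old_line in deleted_old:
        return None  # a deleted line has no counterpart in C_new
    # Position of this line among the lines of C_old that survive the change.
    rank = old_line - sum(1 for d in deleted_old if d < old_line)
    # The rank-th line of C_new that was not freshly added is its counterpart.
    new_line, seen = 0, 0
    while seen < rank:
        new_line += 1
        if new_line not in added_new:
            seen += 1
    return new_line


def map_clone_range(start, end, deleted_old, added_new):
    """Map a clone's (start, end) lines from C_old to C_new."""
    # Deleted leading lines: move the start to the first surviving line.
    while start <= end and start in deleted_old:
        start += 1
    # Deleted trailing lines: move the end to the last surviving line.
    while end >= start and end in deleted_old:
        end -= 1
    if start > end:
        return None  # every line of the clone was deleted
    return (map_line(start, deleted_old, added_new),
            map_line(end, deleted_old, added_new))


# Hypothetical change: lines 10 and 11 of C_old are deleted and one line is
# added at position 3 of C_new; the clone spans lines 10 to 20 in C_old.
print(map_clone_range(10, 20, deleted_old={10, 11}, added_new={3}))
```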

We consider a clone to be changed when any line is added or deleted within the clone’s boundaries. If a clone is modified, we determine whether the new state of the clone pair is inconsistent (I s ) or consistent (C s ). We verify this by searching the clone pair list generated by a clone detection tool. We query the list for a matching clone pair in the new commit, C new , that contains the start and end line numbers of the clone pair. If no clone pair is found, then the state of the clone pair is inconsistent and an inconsistent state (I s ) is added to the genealogy. If a clone pair is found, then a consistent state (C s ) is added to the genealogy. This process is repeated for each commit in the commit sequence list or until one or both of the clones are deleted. We use the following command to extract renamed files in a new commit:

  • git diff [old-commit] [new-commit] --name-status -M

This command extracts file pairs where one file is deleted and another file is added in the new commit, and the two files have a code similarity greater than 50%. In this paper, we only consider file pairs with more than 99% code similarity as renamed files. When searching the clone pair list, we allow a matching clone to be bigger than the tracked clone, as long as it contains the smaller clone. For example, if one of the clones in a clone pair spans lines 1 to 10, a matching clone in the clone pair list could span lines 1 to 20. Although we add the bigger clone from the clone pair list to our genealogy, we continue to monitor only the smaller clone to generate the genealogy. The bigger clone (i.e., lines 1 to 20 in our example) might disappear in a future revision, while the smaller clone (i.e., lines 1 to 10 in our example) persists after the bigger clone is removed.
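The containment-based lookup of the new state can be sketched as follows. The tuple layout `(file, start_line, end_line)` and the function names are assumptions made for this illustration only; they are not the output format of NiCad or iClones.

```python
def contains(detected, tracked):
    """True if the detected clone covers the tracked clone in the same file."""
    d_file, d_start, d_end = detected
    t_file, t_start, t_end = tracked
    return d_file == t_file and d_start <= t_start and d_end >= t_end


def pair_state(tracked_pair, detected_pairs):
    """Return 'Cs' if some detected pair contains the tracked pair, else 'Is'."""
    first, second = tracked_pair
    for left, right in detected_pairs:
        # A detected pair may list its two clones in either order.
        if (contains(left, first) and contains(right, second)) or \
           (contains(left, second) and contains(right, first)):
            return "Cs"
    return "Is"


# Hypothetical example: the tracked clone in A.java spans lines 1 to 10 and is
# covered by a bigger detected clone spanning lines 1 to 20.
tracked = (("A.java", 1, 10), ("B.java", 5, 14))
detected = [(("A.java", 1, 20), ("B.java", 5, 24))]
print(pair_state(tracked, detected))
```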

5 Study results

This section reports and discusses the results of our study.

5.1 RQ1: Which clone evolutionary patterns and clone changes are most at risk of faults?

Motivation

Developers are interested in identifying areas of a software application that have a higher likelihood of faults. Previous studies (Kamiya et al. 2002) have identified clones as more fault-prone than non-cloned code. Clones occur frequently, with as much as one fifth of a software system containing duplicate code (Roy and Cordy 2007). However, not all clones lead to faults, and it can be resource consuming to monitor all clone pairs for faults. If we can identify the characteristics of fault-prone clone pairs, risky clone pairs can be highlighted for monitoring. In this research question, we examine whether the evolutionary pattern exhibited by a clone pair can be used to locate fault-prone clone pairs. Additionally, we study the different changes described in Section 3 to determine whether some types of changes are more likely to induce faults than others. This will make developers more aware of the potential risk of performing a specific type of change to a system.

Approach

We examine this research question using the odds ratio (OR) and validate the statistical significance of the results using Fisher’s exact test. Fisher’s exact test (Sheskin 2007) determines whether there is a nonrandom association between two categorical variables (e.g., a clone evolutionary pattern and the occurrence of faults). In this paper, we use a 95% confidence level (i.e., α = 0.05) as the cutoff to decide whether there is a statistically significant association between a clone evolutionary pattern and the occurrence of faults. Since we perform more than one comparison, we use the Bonferroni correction (Dmitrienko et al. 2005) to control the familywise error rate. Concretely, we use the adjusted p value, which is the p value multiplied by the number of comparisons. The odds ratio compares the odds of an event occurring in two different groups, the “control” group and the “experimental” group. An OR = 1 implies that the event is equally likely in both the control and experimental groups, an OR > 1 implies that the event is more likely in the experimental group, and an OR < 1 implies that it is more likely in the control group. An OR value close to zero or to infinity means that the odds of experiencing a fault differ greatly between the two groups.
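To make these quantities concrete, the following sketch computes an odds ratio, a two-sided Fisher’s exact test p value, and its Bonferroni adjustment for a single 2×2 contingency table. It is an illustration using only the Python standard library, not the analysis scripts of this study, and the counts in the example are hypothetical.

```python
from math import comb


def odds_ratio(exp_faulty, exp_clean, ctl_faulty, ctl_clean):
    """Odds of a fault in the experimental group over the odds in the control group.

    Assumes no zero cell in the denominator (exp_clean and ctl_faulty > 0).
    """
    return (exp_faulty * ctl_clean) / (exp_clean * ctl_faulty)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(total - col1, row1 - x) / comb(total, row1)

    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    observed = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p_value)


def bonferroni(p_value, comparisons):
    """Adjusted p value: the p value multiplied by the number of comparisons."""
    return min(1.0, p_value * comparisons)


# Hypothetical counts: experimental pattern (3 faulty, 1 clean) versus
# control pattern (1 faulty, 3 clean), with 4 comparisons in total.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(odds_ratio(3, 1, 1, 3), p, bonferroni(p, 4))
```

Capping the adjusted p value at 1 is a common convention and an assumption of this sketch.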

After building the set of clone genealogies for a subject system, we identify all clone evolutionary patterns within the genealogies. For each genealogy graph G, we visit each state in G and identify the clone evolutionary pattern (i.e., the path P). Using the SZZ algorithm described in Section 4.3, we identify faulty states. We also check each change within the genealogy graph G to determine the type of the change, and verify whether the change is fault-inducing.

For the result of each clone detection tool, we perform the following three tests:

Faults vs. clone evolutionary patterns

Using the synchronous (SYNC p ) evolutionary pattern as the control group, we calculate the odds ratios between the control group and each of the different evolutionary patterns (the “experimental” groups). We test the following null hypothesis H 01: Each type of clone evolutionary pattern has the same proportion of clone pairs that experienced a fault-inducing change.

We chose the SYNC p evolutionary pattern as our control group because we expect that clones that are maintained consistently (all the changes are propagated on time consistently) throughout their evolution history would be less prone to faults than others.

Faults vs. Changes

Using consistent changes (CON c ) as our control group, we calculate the odds ratios between the consistent changes and each of the different types of changes. We test the following null hypothesis H 02: Each change type has the same proportion of clone pairs that experienced a fault fix as a consequence of the change.

We chose CON c changes as our control group because we expect a change that keeps two clone fragments in a consistent state to be less risky (i.e., to have a low probability of introducing a fault in the system).

Faults vs. evolutionary patterns and changes

We examine evolutionary patterns and changes together to determine the most fault-prone changes when a clone pair exhibits a specific clone evolutionary pattern (e.g., late propagation followed by a consistent change). Using the inconsistent (INC p ) evolutionary pattern followed by an inconsistent change (INC c ) as the control group, we calculate the odds ratio between the control group and each of the different combinations of evolutionary patterns and changes. Each evolutionary pattern can be followed by only two of the four types of changes. The final state of a clone evolutionary pattern is always the same for a given pattern. For example, a synchronous pattern (SYNC p ) will always end in a consistent state (C s ). At any time, a clone pair can only be in one of two states (i.e., consistent or inconsistent). Each state only has two possible transitions, with each transition representing a change to a clone pair. For example, since a late propagation (LP p ) ends in a consistent state (C s ), it can only be followed by a consistent change (CON c ) or a diverging change (DIV c ). We test the following null hypothesis: H 03: Each combination of evolutionary pattern and change type has the same proportion of clone pairs that experienced a fault fix as a consequence of the change.
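The restriction to two change types per pattern follows from a two-state transition table, which can be written down as a small sketch. The source and target states assigned to RESYNC c and INC c below are inferred from the names of the changes and from the example above; they are an assumption of this illustration.

```python
# Each change type is a transition (state before, state after) between the
# consistent state Cs and the inconsistent state Is.
CHANGES = {
    "CONc": ("Cs", "Cs"),     # consistent change
    "DIVc": ("Cs", "Is"),     # diverging change
    "INCc": ("Is", "Is"),     # inconsistent change
    "RESYNCc": ("Is", "Cs"),  # re-synchronizing change
}


def possible_changes(final_state):
    """Change types that can follow a pattern ending in `final_state`."""
    return sorted(name for name, (before, _) in CHANGES.items()
                  if before == final_state)


# A late propagation ends in Cs, so only two change types can follow it.
print(possible_changes("Cs"))
```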

We chose the inconsistent (INC p ) evolutionary pattern followed by an inconsistent change (INC c ) as our control group because we expect this combination of pattern and operation to be the riskiest. Clones that experience these operations cannot be tracked with a clone detection tool; hence, developers can easily fail to propagate changes to clone fragments. The combination of INC p and INC c is therefore a good reference against which we can compare the odds of faults occurring in the other combinations of genealogies and change operations.

Results

We now discuss the results of the aforementioned three tests. Each of the following subsections summarizes the results for one of the three tests. For each evolutionary pattern, change, and combination of genealogy and change, we provide the number of faulty and clean occurrences in Tables 3, 5, and 7.

Table 3 Contingency tables for clone evolutionary patterns

Faults vs. clone evolutionary patterns

Table 4 summarizes the results of the odds ratio and Fisher’s exact test. For each clone evolutionary pattern we show the obtained odds ratios and p-values. If an adjusted p-value of the Fisher’s exact test is less than 0.05, it is marked in italics (Table 5).

Table 4 Statistical analyses for clone evolutionary patterns
Table 5 Contingency tables for clone pair changes

For all studied systems, when the p value is less than 0.05 (i.e., the difference is statistically significant), the OR values of INC p , DIV p , LP p , and LPDIV p are greater than 1, meaning that the risk of faults is higher when clones follow other patterns than when they follow the SYNC p pattern. In JEdit and Maven, we could not find enough occurrences of LP p or LPDIV p , which may explain some of the insignificant p values. Since all the obtained OR values are ≠ 1, we reject H 01.

Faults vs. changes

The results of the odds ratio and Fisher’s exact test are summarized in Table 6. For each type of change, we show the obtained odds ratios and p-values.

Table 6 Statistical analyses for clone pair changes

For all studied systems (with the exception of RESYNC c detected by NiCad for ArgoUML), when the p-value is less than 0.05 (i.e., the difference is statistically significant), the OR values of DIV c , INC c , and RESYNC c are greater than 1, meaning that all of these changes are more fault-prone than consistent changes (CON c ). These results are expected, because clone pairs experiencing inconsistent changes are difficult to monitor using clone detection tools and are more likely to cause bugs. For Maven, none of the results is statistically significant (all adjusted p-values are > 0.05). Hence, we cannot reject H 02. We explain this outcome by the low number of DIV c , INC c , and RESYNC c changes performed in Maven, compared to the other systems (Table 7).

Table 7 Contingency tables for evolutionary patterns and changes

Faults vs. evolutionary patterns and changes

The results of the odds ratio and Fisher’s exact test are summarized in Table 8. For each combination of clone evolutionary pattern and type of change, we show the obtained odds ratios and p values.

Table 8 Statistical analyses for evolutionary patterns and changes

Using the NiCad clone detection tool, we obtained the following results: In ArgoUML, Ant, and JEdit, a consistent change on a clone pair that follows the SYNC p pattern is less likely to introduce a fault than an inconsistent change on a clone pair that follows the INC p pattern. This result is statistically significant (adjusted p value < 0.01).

In ArgoUML and Ant, a re-synchronizing change on a clone pair that follows the INC p pattern or follows the DIV p pattern is less likely to introduce a fault than an inconsistent change on a clone pair that follows the INC p pattern. This result is statistically significant (adjusted p value < 0.01).

In Ant, an inconsistent change on a clone pair that follows the DIV p pattern is more likely to introduce a fault than an inconsistent change on a clone pair that follows the INC p pattern. This result is statistically significant (adjusted p value < 0.01).

In ArgoUML, an inconsistent change on a clone pair that follows the DIV p pattern, a re-synchronizing change that follows the late propagation pattern, as well as a consistent change that follows the late propagation pattern are more likely to introduce a fault than an inconsistent change on a clone pair that follows the INC p pattern. This result is statistically significant (adjusted p value < 0.01).

Using the iClones clone detection tool, we obtained the following results: In Ant and JEdit, a consistent change on a clone pair that follows the SYNC p pattern is less likely to introduce a fault than an inconsistent change on a clone pair that follows the INC p pattern. This result is statistically significant (adjusted p value < 0.01).

In ArgoUML and Ant, a re-synchronizing change on a clone pair that follows the INC p pattern is less likely to introduce a fault than an inconsistent change on a clone pair that follows the INC p pattern. This result is statistically significant (adjusted p value < 0.01).

In ArgoUML and JEdit, a diverging change on a clone pair that follows the SYNC p pattern is more likely to introduce a fault than an inconsistent change on a clone pair that follows the INC p pattern. This result is statistically significant (adjusted p value < 0.01).

In ArgoUML, a consistent change following the SYNC p pattern, an inconsistent change following the DIV p pattern, as well as a consistent change following the LP p pattern are more likely to introduce a fault than an inconsistent change on a clone pair that follows the INC p pattern. This result is statistically significant (adjusted p-value < 0.05).

In Ant, an inconsistent change on a clone pair that follows the LPDIV p pattern is more likely to introduce a fault than an inconsistent change on a clone pair that follows the INC p pattern. In addition, a diverging change following the SYNC p pattern, a re-synchronizing change following the DIV p or the LPDIV p pattern, as well as a consistent change following the LP p pattern are less likely to introduce a fault than an inconsistent change following the INC p pattern. This result is statistically significant (adjusted p-value < 0.05).

In the case of Maven, there is no statistically significant result. Hence, we cannot reject H 03.

5.2 RQ2: Does the size of a clone or the time interval between changes affect the fault-proneness of a clone pair?

Motivation

In this question we examine the effect of two metrics on fault-proneness: the time interval since the last change and the size of the clone. We examine the time interval because it is believed that a long time interval between changes will lead a developer to become unfamiliar with the code, causing an increase in the number of faults. It is also expected that a smaller clone will be less prone to faults, as it is less complex and may require less effort to modify. Using our set of clone pair genealogies, we examine whether the time interval between changes or the size of the clone relates to faults. An evolutionary history of a clone pair tracks the types and frequency of changes to clone pairs. By examining the evolutionary history of clone pairs, we can determine whether fault-proneness is affected by either of these two metrics.

Approach

In this question, we classify each change by the time interval since the last change. We divide the changes into five time periods: one day, one week, one month, one year, and more than one year. We performed this discretization because the Fisher test requires categorical variables. A change is flagged if it is fault-inducing. Using “One Day” as the control group, we calculate the odds ratios between the control group and each of the other time periods and perform Fisher’s exact test. We test the following null hypothesis H 04: The time interval between modifications to a clone pair has no relationship with faults.
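The discretization can be sketched as a simple bucketing function. The exact boundaries are assumptions of this illustration: each period is treated as an inclusive upper bound, a month is taken as 30 days, and a year as 365 days.

```python
from datetime import datetime


def interval_bucket(previous_change, current_change):
    """Assign the time since the previous change to one of five periods."""
    days = (current_change - previous_change).total_seconds() / 86400
    if days <= 1:
        return "one day"
    if days <= 7:
        return "one week"
    if days <= 30:   # assumed length of a month
        return "one month"
    if days <= 365:  # assumed length of a year
        return "one year"
    return "more than one year"


# Hypothetical timestamps of two consecutive changes to a clone pair.
print(interval_bucket(datetime(2010, 1, 1), datetime(2010, 1, 4)))
```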

When examining the effect of clone size on faults, we examine each state from each genealogy graph G. For each state, we identify the evolutionary pattern of the clone pair and measure the number of lines of cloned code in a clone pair. The size of the clone is then labeled as either “big” if it is greater than or equal to the median lines of clone of a studied system detected by the tool, or “small” if it is smaller than the median lines of clone of the system detected by the tool. For each state, we use the SZZ algorithm to determine whether it is faulty or not. We calculated the odds ratios and the p value of the Fisher’s exact test, and test the following null hypothesis H 05: The size of the clone has no relationship with faults. When calculating the odds ratio, we select the synchronous evolutionary pattern with a small clone size as our control group. Since a large size is known to be correlated with a high risk of fault, we expect the synchronous evolutionary pattern with a small clone size to be less fault-prone than the other patterns, hence our choice of this pattern as our control group.
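The size labeling can be sketched as follows, assuming that the clone sizes (in lines of code) detected by one tool for one system are available as a plain list; the values in the example are hypothetical.

```python
from statistics import median


def label_sizes(clone_sizes):
    """Label each clone 'big' if its size is >= the median size, else 'small'."""
    cutoff = median(clone_sizes)
    return ["big" if size >= cutoff else "small" for size in clone_sizes]


# Hypothetical clone sizes in lines of code; the median is 15.
print(label_sizes([8, 12, 15, 30, 45]))
```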

To better understand the correlational relationship between bug-proneness and time interval (respectively clone size), we build a linear regression model for each studied system. The linear regression models have the following form:

$$ Faulty = \alpha \cdot Interval + \beta \cdot Size + \gamma \qquad (1) $$

We use R to create the GLM models, in which time interval and clone size are the independent variables, and whether a clone is faulty is the dependent variable. We will compare the explanatory power of time interval and clone size with other metrics in RQ3.

Results

In this subsection we summarize our results when investigating the relationship between the time interval between changes or the size of the clone and faults. For each time interval and evolutionary pattern considering cloned code size, we provide the number of faulty and clean occurrences in Tables 9 and 11.

Table 9 Contingency tables for evolutionary patterns considering the time interval between changes

Faults and time interval between changes

Table 10 summarizes the results of the odds ratio and Fisher’s exact tests. We obtained the following results with NiCad: For ArgoUML, changes occurring after one week are always less fault-prone than changes performed within a day. For Ant, on the contrary, changes occurring after one week are always more fault-prone than changes performed within a day. For JEdit, changes occurring after one week and within one year are less fault-prone than changes performed within a day, while changes occurring after a year become more fault-prone than changes performed within a day. Regarding Maven, none of the results is statistically significant (Table 11).

Table 10 Statistical analyses for evolutionary patterns considering the time interval between changes
Table 11 Contingency tables for evolutionary patterns considering the cloned code sizes

But when we look at the results obtained with iClones, we see that for ArgoUML, changes occurring after one week but within one month, as well as changes occurring after more than one year are less fault-prone than changes performed within a day. For Ant, we obtained the same results as those of NiCad. For JEdit, changes occurring after one week are always less fault-prone than changes performed within a day. For Maven, changes occurring after one year are more fault-prone than changes performed within a day.

Table 13 shows the coefficients and p values in the linear regression model that investigates how the time interval of changes and the clone size impact fault-proneness. Figures 7 and 8 depict the trends of the fault-proneness probability with respect to the time interval. Based on the results of both clone detection tools, the probability of fault-proneness increases with the time interval for Ant, while it decreases with the time interval for ArgoUML. For JEdit and Maven, the trends diverge depending on the clone detection tool. Some of the results seem inconsistent with those from Table 10 because time interval and size can interfere with each other in the regression model. In the future, we plan to build non-linear regression models (Harrell 2013) to explore whether the models contain any knot that changes the direction of the trends. This kind of model can better reflect the trends obtained for JEdit based on NiCad’s detection in Table 10. In summary, the results are system dependent, so we cannot reject H 04 in general.

Fig. 7 Estimated fault-proneness probability for various time interval sizes (in days) based on NiCad’s detection
Fig. 8 Estimated fault-proneness probability for various time interval sizes (in days) based on iClones’ detection

Faults and size of clone

The odds ratios of the evolutionary patterns classified by the size of the clone are summarized in Table 12.

Table 12 Statistical analyses for evolutionary patterns considering the cloned code sizes

Consistent with our findings from RQ1, most patterns involved with inconsistent change are more fault-prone than SYNC p small, except SYNC p big for ArgoUML and Ant based on iClones’ detection. This exception may be due to the relatively small time interval observed between the big changes, which can also explain the trend of ArgoUML in Fig. 8. Once again, the time interval and size factors may interfere with each other. From the results of Table 12, we can reject H 05.

Table 13 shows the coefficients and p-values in the linear regression model that investigates how cloned code sizes affect fault-proneness. Figures 9 and 10 depict the trends of the fault-proneness probability changes with respect to the size of the clones. Based on the results of both clone detection tools, the probability of fault-proneness increases with the increase of the cloned size for Ant; while the probability decreases with the increase of the cloned size for JEdit. For ArgoUML and Maven, the trends diverge depending on different clone detection tools.

Table 13 Coefficients and p values of the linear regression model on the relationship between fault-proneness and time interval of changes as well as cloned code size
Fig. 9 Estimated fault-proneness probability for various clone sizes based on NiCad’s detection
Fig. 10 Estimated fault-proneness probability for various clone sizes based on iClones’ detection

5.3 RQ3: Can we predict faults in software clones using clone genealogy information?

Motivation

Tracking the genealogy of all clone pairs in an entire system is resource intensive. When building prediction models to identify faulty code clones, developers face a tradeoff between relying on only traditional fault prediction metrics or collecting additional genealogy metrics which provide richer information on the history of a clone pair. Knowing the gain achieved by adding genealogy metrics to fault explanatory models is important to help developers decide whether the added effort justifies the results.

Approach

In this question, we propose metrics to capture the genealogy information of a clone pair. We combine these metrics with traditional product and process metrics and investigate their statistical relationships with future faults in cloned code. Table 14 presents the description of all the metrics used in this study. The metrics are divided into three categories: product metrics, process metrics and genealogy metrics. Product metrics can be collected using the snapshot of the system that contains the clone pair. For example, “CPathDepth” describes the number of folders that the clones in a clone pair have in common within the system directory structure. Process metrics are collected using the history of changes on clone pairs. For example, “TPC” measures the total number of changes in the history of a clone. Genealogy metrics capture state changes in the history of clone pairs. For example, “EConStChg” measures the number of consistent changes of states within a clone pair genealogy.

Table 14 Clone pair metrics

For each state in a clone genealogy instance, we collect all the metrics from Table 14. Since each clone in the clone pair will have its own set of metrics (e.g., MLOC), we compute the maximum value of each metric across the two clones. To reduce the skewness observed on metric values, we apply a standard log transformation to each metric. From the measurements obtained, we create linear regression models that set the number of reported faults in relation to our three groups of metrics. The linear regression models have the following form:
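The aggregation of a metric over the two clones of a pair can be sketched as below. The exact form of the log transformation is not stated above; this illustration assumes log(1 + x), which keeps metrics with a value of zero defined.

```python
import math


def pair_metric(value_clone_a, value_clone_b):
    """Maximum of a metric over the two clones, log-transformed as log(1 + x)."""
    return math.log1p(max(value_clone_a, value_clone_b))


# Hypothetical metric values (e.g., MLOC) for the two clones of a pair.
print(pair_metric(120, 95))
```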

$$\begin{aligned} Faults &= \sum_{i}\alpha_{i}\,ProductM_{i} + \sum_{j}\beta_{j}\,ProcessM_{j} \\ &\quad + \sum_{k}\gamma_{k}\,GenealogyM_{k} + \delta \end{aligned} \qquad (2)$$

With this model, we investigate the statistical relationships between product, process and genealogy metrics, which are represented by the regression variables (ProductM i , ProcessM j , and GenealogyM k ), and the number of reported faults, represented by the dependent variable of the model (Faults). We follow the same methodology as in the work of Cataldo et al. (2009). First, we compute the variance inflation factors (VIF) (Kutner et al. 2004) of each metric to examine multi-collinearity between the variables of our regression model. Next, we construct Generalized Linear Models to investigate the relative impact of each of our three groups of metrics on future faults. We remove from the models all variables with VIF > 5, as recommended by Rogerson (2010).
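For reference, the variance inflation factor of a metric j is conventionally computed from the coefficient of determination \(R_{j}^{2}\) obtained when that metric is regressed on all the other explanatory metrics:

$$ VIF_{j} = \frac{1}{1 - R_{j}^{2}} $$

Under this standard definition, the threshold VIF > 5 corresponds to \(R_{j}^{2} > 0.8\), i.e., to a metric for which more than 80% of the variance is explained by the other metrics.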

We create the models following a hierarchical modeling approach: we start out with a baseline model that contains only product metrics as regression variables. We then build subsequent models by adding step by step, respectively, process metrics and clone genealogy metrics. We chose to follow a hierarchical modeling approach because contrary to a step-wise modeling approach, the hierarchical approach has the advantage of minimizing the artificial inflation of errors and therefore the overfitting (Cataldo et al. 2009).

For each statistical model, we report its explanatory power (i.e., the deviance of the model) and the percentage of deviance explained. The deviance of a model M is defined as D(M) = −2 · LL(M), where LL(M) is the log-likelihood of the model M. The deviance explained is the ratio between \(D(Faults\thicksim Intercept)\) and D(M). For each subsequent model M S + E derived from a model M S , we also test the statistical significance of the difference between M S + E and M S . For each explanatory metric, we report its corresponding p-value. We use the varImp package in R to calculate the importance of the metrics, and report the top 3 metrics, which have the strongest explanatory power.
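As a small worked sketch of these quantities, the functions below compute the deviance from a log-likelihood and a percentage of deviance explained. The second function assumes one common reading of that percentage, namely the relative reduction of the deviance with respect to the intercept-only model; the log-likelihood values in the example are hypothetical.

```python
def deviance(log_likelihood):
    """D(M) = -2 * LL(M)."""
    return -2.0 * log_likelihood


def deviance_explained(ll_intercept_only, ll_model):
    """Relative reduction in deviance with respect to the intercept-only model.

    This reading (1 - D(M) / D(Faults ~ Intercept)) is an assumption of the sketch.
    """
    return 1.0 - deviance(ll_model) / deviance(ll_intercept_only)


# Hypothetical log-likelihoods of an intercept-only model and a fitted model.
print(deviance(-80.0), deviance_explained(-100.0, -80.0))
```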

Results

In this subsection we describe the results for RQ3. Tables 15 and 16 present the results of our hierarchical analysis. In these tables, M S represents a model built using product metrics only (i.e., the basic model). M S + E is a model built using product and process metrics, while M S + E + G is a model containing product, process and genealogy metrics. The results of Tables 15 and 16 show that genealogy metrics only slightly contribute to the explanatory power of the fault-proneness models. The biggest improvement is obtained on Maven (i.e., 2.1%) thanks to the EchgTimeInt metric.

Table 15 Hierarchical analysis of linear regression models for NiCad
Table 16 Hierarchical Analysis of Linear Regression Models for iClones

On average, the explanatory power of a fault prediction model built using both product and process metrics (i.e., M S + E ) is increased by 4.3% when genealogy information is added to the model. This increase is statistically significant. The increase is the highest for Ant when the iClones detection tool is used (i.e., 8.5%).

6 Discussion

Our identification of clone genealogies is based on a line mapping algorithm, which may not be 100% accurate. To examine the accuracy of our results, we manually examined 50 commits that each generated more than 10 new clone genealogies. We found that all of these commits involved a large number of new classes or reconstructions. Examining these genealogies helped us to better understand why a large number of clone genealogies are detected in some systems, such as ArgoUML, and helped us validate our clone detection scripts. For example, in ArgoUML, based on iClones’ detection, 230 different clone genealogies start from commit 559aca3 (SVN revision 122992), because 1818 new files were created in this commit. As another example, in Ant, based on NiCad’s detection, 3257 different clone genealogies start from commit d1064de, because 1903 new Java classes were created in this commit. In our manual validation, we also examined whether a clone genealogy was introduced in the first commit of the genealogy, and whether it disappeared after the last commit of the genealogy. For example, the zipFile method was introduced in two classes (proposal/myrmidon/src/main/org/apache/tools/ant/taskdefs/Ear.java and proposal/sandbox/antlib/src/main/org/apache/tools/ant/taskdefs/Antjar.java) in Ant commit d1064de. The two methods were very similar at the beginning. They experienced three consistent changes (b8c5034, 7c0bc50C, and 669a7ea). However, at commit 0a07be8, the second file changed the algorithm of the zipFile method, i.e., the clone pair became inconsistent. Finally, the second file was removed from the system at commit 99cdb67. As another example, in ArgoUML, the buildConnection method was introduced at commit a6a72d7 (SVN revision 11634) in two new files, src/model-euml/src/org/argouml/model/euml/UmlFactoryEUMLImpl.java and src/model-mdr/src/org/argouml/model/mdr/UmlFactoryMDRImpl.java. The two methods were identical at the beginning. They experienced consistent changes at commits 1eb1d05 (revision 11993) and 964f121 (revision 11994). But since commit 4e6285c (revision 12105), the first file changed its exception handling statements. The clone pair remained dissimilar until the last studied commit.

We expected that the INC pattern would be the most fault-prone. However, according to the results, the DIV pattern is highly fault-prone, because a fault could be fixed by propagating changes performed on one clone segment to the other segments. Here are two examples that we manually examined in Ant. In Bug 41353, running tasks in parallel caused a problem. The solution to the fault was to clone the properties in data that was accessed in parallel, which resulted in an inconsistent change on the clone contained in files src/main/org/apache/tools/ant/PropertyHelper.java @ 472:480 and proposal/embed/src/java/org/apache/tools/ant/PropertyHelper.java @ 501:513. As a result, the clone pair evolved into a DIV p genealogy pattern. In the case of Bug 42736, in order to encapsulate the reference to a method inside the delegate object, one clone created an interface with an add(...) method and invoked the getDelegates and getDelegateInterfaces methods to retrieve a collection of delegates of the specified type. This modification caused the two clone segments to diverge, resulting in an INC p genealogy pattern.

An interesting phenomenon is the migration of clones across repositories. Among the genealogies that were analyzed, we observed that faults occur more frequently among clones from files located in different directories. To fix these faults, developers often propagate changes from one clone segment to the other. This was the case, for example, for Ant’s bugs 19897, 22326, and 7552. A closer look at the files involved in these clones reveals that developers duplicated code to experiment with new changes. However, instead of doing this in separate branches, they did it in the main code base and committed their experiments to the trunk. Whenever they were satisfied with their experiments, the modifications were propagated to the main files of Ant that would be released to the public. We found that this phenomenon explains a large proportion of the fault fixes observed in the DIV genealogies.

7 Threats to validity

In this section we discuss the threats to validity of our study.

Construct validity threats involve the relationship between theory and observation. The threats in this study stem from measurement errors made by the clone detection tools. To reduce the number of false positive clone detection results, we repeated the study using two clone detection tools that use different clone detection techniques, and that have both been used in previous studies and reported to achieve good precision and recall (see Svajlenko and Roy 2014).

In this study, we have chosen to analyze clone pairs instead of clone groups since clone pairs within the same clone group are not equally risky. However, all analyses presented in this paper can easily be replicated on clone groups.

The SZZ heuristic used to identify fault-inducing changes is not 100% accurate. However, it has been used successfully in multiple previous studies, with satisfactory results. In our implementation, we remove all fault-inducing commit candidates that only changed blank lines or comment lines.
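The filtering step described above can be sketched as follows. This is a minimal illustration and not the implementation used in this study: the function names, the commit identifiers, and the prefix-based test for Java comment lines are all assumptions, and the prefix test is a rough heuristic that would, for instance, misclassify a line that mixes a block comment with code.

```python
def is_cosmetic(line):
    """Return True for a blank line or a line that looks like a Java comment."""
    stripped = line.strip()
    # "//" line comments, "/*" block openers, and "*" continuation or closing lines.
    return stripped == "" or stripped.startswith(("//", "/*", "*"))


def keep_candidate(changed_lines):
    """Keep a candidate commit only if it changed at least one code line."""
    return any(not is_cosmetic(line) for line in changed_lines)


def filter_candidates(candidates):
    """candidates maps a commit id to the contents of the lines it changed."""
    return {
        commit: lines
        for commit, lines in candidates.items()
        if keep_candidate(lines)
    }


# Hypothetical fault-inducing commit candidates.
candidates = {
    "c1": ["", "// fix typo in comment"],       # blank and comment lines only
    "c2": ["int count = 0;", ""],               # touches real code
    "c3": ["/**", " * Updated Javadoc.", " */"],  # Javadoc only
}

kept = filter_candidates(candidates)
print(sorted(kept))
```

In this sketch only the hypothetical candidate `c2` survives, since it is the only one that changes a line of code.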

Threats to internal validity do not affect this study, as it is an exploratory study (Yin 2002). We cannot claim causation; we simply report observations and correlations, although our discussion tries to explain these observations.

Threats to conclusion validity address the relationship between the treatment and the outcome. We are careful to acknowledge the assumptions of each statistical test. We used non-parametric tests that do not require making assumptions about the data set distribution. To exclude test files from our study, we manually examined all files in our data set.

Threats to external validity address the generalizability of our results. We examine four systems of different sizes from four different domains. Nevertheless, more studies on more systems should be done to further validate our results. All of our subject systems are written in Java, so our results may not generalize to systems written in other programming languages. However, Java, C, and C++ all belong to the “C-family programming languages” (Wikipedia 2017), i.e., they share some common syntactic features. We believe that our approach can yield similar results on C/C++ systems. In the future, we plan to extend this study to more programming languages, such as C and C++. We also welcome software practitioners and researchers to replicate and validate our work on other programming languages.

Threats to reliability validity take into account the possibility of replicating our study. In this paper, we provide all the details needed to replicate our study. All four of our subject systems are publicly available for study. The data and scripts used in this study are also publicly available and can be downloaded (see Footnote 1).

8 Conclusion

In this paper, we examine the states within clone genealogies and changes to clone pairs to determine their relationship with faults in software systems. We formally define six different clone evolutionary patterns and four types of changes experienced by a clone pair. Using these definitions, we show that clone pairs exhibiting inconsistent and divergent patterns are more likely to experience a fault than clone pairs that are maintained consistently. We also show that the size of the cloned region of a clone pair can impact the fault-proneness of the clone pair. However, there is no clear relationship between the time at which the cloned code is changed and the fault-proneness of a clone pair. Next, we investigate the statistical relationships between product, process, and genealogy metrics and the number of future faults in cloned code. Our results show that adding genealogy information to a fault prediction model built using product and process metrics can increase the explanatory power of the model. We found that clone pairs that caused faults in the past can help indicate future faults in the clone fragments. In the future, we intend to explore more factors that may be correlated with the fault-proneness of code clones, such as the number of different maintainers and the domain of the system. We also plan to replicate our study on more systems using different clone detection tools. Moreover, we will use the results of our study to build recommendation systems to assist maintenance teams in the management of software clones. The data used in this study is publicly available and can be found at: https://github.com/swatlab/clone_genealogies.