Elsevier

Journal of Systems and Software

Volume 134, December 2017, Pages 105-119

Predicting change consistency in a clone group

https://doi.org/10.1016/j.jss.2017.08.045

Highlights

  • Developed a Bayesian network for predicting clone consistency-requirement.

  • Introduced code, context and clone evolution attributes to quantify clone change.

  • Performed experiments on 4 open source repositories to demonstrate its effectiveness.

  • Provided a plug-in prototype in Eclipse to avoid clone consistency-defect.

Abstract

Code cloning has been accepted as one of the general code reuse methods in software development, thanks to the increasing demand for rapid software production. The introduction of clone groups and clone genealogies enables software developers to be aware of the presence of, and changes to, clones as a collective group; they also allow developers to understand how clone groups evolve throughout the software life cycle. Because the code within a clone group is similar, a change to one clone may require developers to make consistent changes to other clones in the group. Failure to make such consistent changes to a clone group when necessary is commonly known as a “clone consistency-defect”, which can adversely impact software reusability.

In this work, we propose an approach to predict the need for making consistent change in clones within a clone group at the time when changes have been made to one of its clones. We build a variant of clone genealogies to collect all consistent/inconsistent changes to clone groups, and extract three attribute sets from clone groups as input for predicting the need for consistent clone change. These three attribute sets are code attributes, context attributes and evolution attributes. Together, they provide a holistic view of clone changes. We conduct experiments on four open source projects. Our experiments show that our approach has reasonable precision and recall in predicting whether a clone group requires (or is free of) consistent change. This holistic approach can aid developers in maintaining clone changes and help avoid potential consistency-defects, which can improve software quality and reusability.

Introduction

A clone fragment (or simply a clone) generally refers to a code fragment that is “similar” to another code fragment; the notion of “similarity” between two code fragments is typically defined at the textual or syntactic level (Koschke, 2007). The “copy-and-paste” operation is the most noticeable way of physically reusing existing code, and it can introduce abundant clone fragments. The presence of clones in software has given rise to the question of whether clones can adversely affect software quality, starting with Fowler et al. identifying some clones as a “bad smell” (Fowler and Beck, 1999). The clone research community has since debated whether changing a set of clones inconsistently may cause software defects, and whether the requirement to change clones consistently may lead to extra maintenance cost. If clone fragments in a group of clones need to be changed consistently and developers forget to do so, defects may be introduced at a later stage of software evolution (Bettenburg et al., 2009; Juergens et al., 2009). On the other hand, when consistent change within a clone group is not required, developers might unnecessarily spend time verifying and attempting to maintain clone consistency, resulting in additional software maintenance overhead (Aversano et al., 2007; Barbour et al., 2011).

This work is an extension of our conference paper (Zhang et al., 2016), which outlines a predictive model that warns software developers about the need to perform consistent change in clones, so as to reduce clone maintenance cost specifically and improve software maintainability in general. The extension here includes experimental details as well as the inclusion of another software repository as an experiment subject. Moreover, we extend the technique to include prediction of clone changes which do NOT require consistent change to the corresponding clone group. Thus, in this work, we develop a more holistic approach which predicts whether consistent change is needed for a clone group when one of the clone fragments in the group has been modified. Specifically, when a developer modifies a piece of code which is a clone of other code, our model makes its prediction and offers one of two possible warnings to developers:

  • 1.

    When similar changes are indeed required for at least one other clone in a clone group, we say that the clone group satisfies the clone consistency-requirement. If this requirement is predicted, our model will alert the developer, and appropriate management action can be taken to avoid a consistency-defect. Although this leads to an increase in software maintenance cost, it reduces the risk of a clone consistency-defect.

  • 2.

    When none of the clones in the clone group requires consistent change, we say that the clone group is consistency-free. If this status is predicted, our model will inform the developer, who can then change the clones freely with more confidence. This in turn saves time that would otherwise be spent unnecessarily on verifying consistency.
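The prediction-to-warning flow described in the two items above can be sketched as follows. This is a minimal illustration, not the authors' model: it uses a naive Bayes classifier (the simplest Bayesian network structure) written from scratch rather than the WEKA-built network, and the attribute names `code_size`, `same_file` and `past_pattern`, their values, and the toy training rows are all hypothetical stand-ins for one code, one context and one evolution attribute.

```python
import math
from collections import Counter, defaultdict

REQUIRED = "consistency-requirement"
FREE = "consistency-free"


class NaiveBayes:
    """Categorical naive Bayes with add-one smoothing (a stand-in predictor)."""

    def __init__(self):
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (attribute, label) -> value counts
        self.values_seen = defaultdict(set)       # attribute -> distinct values

    def fit(self, rows, labels):
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for attr, value in row.items():
                self.value_counts[(attr, label)][value] += 1
                self.values_seen[attr].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each attribute value.
            score = math.log(count / total)
            for attr, value in row.items():
                seen = self.value_counts[(attr, label)][value]
                distinct = len(self.values_seen[attr])
                score += math.log((seen + 1) / (count + distinct))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def warning_for(label):
    """Map a predicted label to one of the two developer-facing messages."""
    if label == REQUIRED:
        return "Warning: other clones in this group likely need the same change."
    return "Note: this clone group is predicted to be consistency-free."


# Hypothetical, hand-made change instances (not data from the paper).
rows = [
    {"code_size": "small", "same_file": "yes", "past_pattern": "consistent"},
    {"code_size": "small", "same_file": "no", "past_pattern": "consistent"},
    {"code_size": "large", "same_file": "yes", "past_pattern": "consistent"},
    {"code_size": "large", "same_file": "no", "past_pattern": "inconsistent"},
    {"code_size": "small", "same_file": "no", "past_pattern": "inconsistent"},
    {"code_size": "large", "same_file": "no", "past_pattern": "static"},
]
labels = [REQUIRED, REQUIRED, REQUIRED, FREE, FREE, FREE]
model = NaiveBayes().fit(rows, labels)

changed = {"code_size": "small", "same_file": "yes", "past_pattern": "consistent"}
print(warning_for(model.predict(changed)))
```

Add-one smoothing keeps an attribute value that never co-occurred with a class from zeroing out that class entirely, which matters when the training history of a repository is small.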

Related work in this direction of clone consistency-requirement prediction, which has inspired the current work, was conducted by Wang et al. (2012, 2014). In that work, they define a code cloning operation as “consistency-maintenance-requirement” if its generated code clones experience consistent changes in the software evolution history. They aim to automatically predict whether a code cloning operation requires consistency-maintenance at the time when a copy-and-paste operation is performed. They employ the Bayesian network machine learning technique (Friedman et al., 1997) to develop and train a prediction model. Their predictor is built with the following two categories of inputs: (1) the syntactic characteristics of the code and its copy-and-paste counterpart and (2) the physical context of the code and its copy-and-paste counterpart. Tested on two open source projects and two large-scale internal projects, their predictor is able to recommend that developers perform more than 50% of cloning operations with a precision of at least 94% in these four subjects; in addition, it is able to avoid 37% to 72% of consistency-maintenance-required code clones by warning developers on only 13% to 40% of code clones.

While Wang et al. aim to perform prediction at copy-and-paste time, we perform prediction at almost any time in the software life cycle when a clone has been modified. Our technique can thus be applied to existing clones in an established project, rather than only to new clones formed via copy-and-paste. To achieve that, we need to be aware of the clone group to which the modified code belongs. A clone group is a group of clones within a piece of software which are known to be similar by some similarity measure. In order to train a predictor, it is natural to investigate the evolution of clone groups during software evolution. To this end, we adapt the notion of clone genealogy as explained by Kim et al. (2005). A clone genealogy describes the evolution of clones, and defines various clone patterns to describe how clones in a group have been changed from the earlier version of the project. We hypothesize that how a clone has been modified genealogically with respect to its clone group has an impact on predicting whether the clone group requires consistent change in the future. We thus build our predictor based on three categories of inputs: two of them are adapted from the work by Wang et al. (2012, 2014), and the last one, called evolution attributes, captures the characteristics of clone genealogy. This combination of three attribute sets provides a holistic view of clone groups; the presence of evolution attributes enables the predictor to be customized to an individual software repository. We develop and construct, via WEKA (Hall et al., 2009), a Bayesian network as the predictor, and experiment on its predictive power on four software projects. Our experiments show that the predictor performs reasonably well, with stable precision and recall for its prediction of both clone consistency-requirement and consistency-free: precision ranges from 70% to 80%, and recall from 63% to 83%. In addition, each of the attribute sets contributes positively in its own way to the predictive power, and the absence of any of these attribute sets can adversely affect the recall of the predictor.
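Since precision and recall are reported separately for the consistency-requirement and consistency-free predictions, it may help to see how the two measures are computed per class. The sketch below is illustrative only; the label strings and the five toy instances are made up and are not results from the paper.

```python
def precision_recall(actual, predicted, positive):
    """Return (precision, recall) when `positive` is treated as the target class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if p == positive and a == positive)
    fp = sum(1 for a, p in pairs if p == positive and a != positive)
    fn = sum(1 for a, p in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: "required" = consistency-requirement, "free" = consistency-free.
actual = ["required", "required", "required", "free", "free"]
predicted = ["required", "required", "free", "required", "free"]

for target in ("required", "free"):
    p, r = precision_recall(actual, predicted, target)
    print(f"{target}: precision={p:.2f} recall={r:.2f}")
```

Evaluating both classes this way shows whether a predictor is strong on one kind of warning at the expense of the other.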

The contributions of this paper are as follows:

  • 1.

    We propose an approach to predict the need for consistent change in a clone group arising from the occurrence of a clone change.

  • 2.

    We identify a new set of attributes for prediction based on information related to clone genealogies. The results show that this set of attributes has a positive impact on the recall of the predictor.

  • 3.

    We demonstrate the feasibility of this prediction via an evaluation on four open source projects. The results show that our approach can predict consistent change in clone group effectively with good precision and reasonable recall, and can help improve software reusability through predictive clone maintenance.

This paper is organized as follows. Section 2 discusses related work, giving a brief introduction to code clone research and defect prediction. Preliminaries are provided in Section 3, where we explain code clone types, clone genealogy and clone changes, and give a real-world example of consistent change. Section 4 details our approach to consistent clone change prediction. We provide the details of the implementation in Section 5. Section 6 describes our evaluation through experimentation on four projects. Section 7 discusses threats to validity. We conclude and point to future work in Section 8.

Section snippets

Related works

Ever since code clones were identified as a “bad smell” (Fowler and Beck, 1999), there have been discussions over their harmfulness to software. Proponents of the view that clones are harmful opine that their existence can lead to software defects and incur additional maintenance effort. To this end, Lozano and Wermelinger, studying the changeability of code clones, show that the maintenance effort of changing a method may increase significantly when the method has a clone (Lozano and Wermelinger, 2008).

Preliminaries

We group together similar clone fragments within a piece of software to form a clone group. There are typically four ways to categorize clones into groups (Roy and Cordy, 2007), and in this work, we consider Type-1, Type-2 and Type-3 clones:

  • Type-1 clone is an exact clone; i.e., all clones in a clone group are identical without any modifications, except for differences in code layout and comments.

  • Type-2 clone is a syntactically identical clone; i.e., all clones in a group are
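The Type-1 definition above (identical code up to layout and comments) can be illustrated with a small token comparison. This is a rough sketch under stated assumptions, not the clone detector used in this work: it assumes C/Java-style `//` and `/* */` comments, uses a simplistic tokenizer, and does not handle comment markers that appear inside string literals. The function names are hypothetical.

```python
import re

_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")
_TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(fragment):
    """Strip comments, then split into identifiers, numbers and single symbols."""
    text = _BLOCK_COMMENT.sub(" ", fragment)
    text = _LINE_COMMENT.sub(" ", text)
    return _TOKEN.findall(text)


def is_type1_clone(a, b):
    """True when two fragments differ only in layout and comments."""
    return tokens(a) == tokens(b)


first = "int sum = a + b; // add the operands\n"
second = "int sum=a+b;   /* add the\n operands */"
renamed = "int total = a + b;"

print(is_type1_clone(first, second))   # layout and comments differ only
print(is_type1_clone(first, renamed))  # identifier differs, so not Type-1
```

Comparing token sequences rather than raw text is what makes the check insensitive to spacing and line breaks.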

Definitions for clone consistency-requirement

We propose a formal definition of “consistent change” between two pairs of clone fragments, as follows:

Definition 1 (τ-Consistent change)

Given that two clone fragments $c_1$, $c_2$ are modified to $c_1'$ and $c_2'$ respectively, we say this modification between $c_1$ and $c_2$ is a τ-consistent change if, for some very small threshold τ,

$$\mathit{textSim}(c_i, c_i') < 1 \;\; \forall i \in \{1, 2\} \;\wedge\; \left|\mathit{textSim}(c_1, c_1') - \mathit{textSim}(c_2, c_2')\right| < \tau$$

Note that $\mathit{textSim}(c, c') = 1 - \mathit{UPI}(c, c')$. The first constraint above states that both $c_1$ and $c_2$ have been modified. The second constraint dictates that
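A minimal sketch of the two constraints in Definition 1 follows. It is an illustration under assumptions rather than the measure used in this work: UPI is not defined in this excerpt, so the ratio from Python's `difflib.SequenceMatcher` stands in for textSim, and the default threshold `tau=0.05` and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def text_sim(before, after):
    """Stand-in similarity in [0, 1]; 1.0 means the text is unchanged.

    Assumption: difflib's ratio replaces the paper's 1 - UPI(c, c').
    """
    return SequenceMatcher(None, before, after).ratio()


def is_tau_consistent(c1, c1_new, c2, c2_new, tau=0.05):
    """Check both constraints of Definition 1 for one pair of clones."""
    sim1 = text_sim(c1, c1_new)
    sim2 = text_sim(c2, c2_new)
    both_modified = sim1 < 1.0 and sim2 < 1.0   # first constraint
    similar_extent = abs(sim1 - sim2) < tau     # second constraint
    return both_modified and similar_extent


old = "int total = a + b;"
same_edit = "long total = a + b;"
rewrite = "return;"

print(is_tau_consistent(old, same_edit, old, same_edit))  # same edit on both clones
print(is_tau_consistent(old, same_edit, old, old))        # second clone untouched
print(is_tau_consistent(old, same_edit, old, rewrite))    # very different edits
```

Note that the second constraint compares how much each clone changed, not what the change was, so the choice of similarity measure and of τ directly affects which changes count as consistent.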

Implementation

We implemented our approach as a prototype tool in Java, and integrated it into the Eclipse IDE as a plug-in. Our tool combines three key functions to achieve this prediction task.

Methodology

We conducted experiments on the repositories of four open source projects. Table 1 gives the details of these projects. As observed from the table, there are hundreds of change instances for each project, ranging from 159 to 1040, with project jEdit being the smallest repository. Among them, the number of change instances which meet the consistency-requirement (i.e., leading to consistent change of clone groups in the future) is shown in column 3, whereas column 2 reflects the number of change

Threats to validity

As an empirical study, our experimental results may be subject to some threats to validity, including construct validity, internal validity and external validity.

Conclusions and future work

The presence of clones in software adds to the burden of software maintenance due to the potential need to maintain consistent change in clones when software evolves. In this paper, we propose an approach to predict clone consistency-requirement for any clone group which has some of its constituent clone fragments undergoing code modification. We build clone genealogies from the software repositories to extract clone change instances. For each change instance, we extract three sets of

Acknowledgments

We would like to thank the anonymous reviewers for their valuable and thorough comments. This work is supported by the 13th Five-Year National Science and Technology Major Project of China (Grant no. 2017YFC0702204) and the National Natural Science Foundation of China (Grant no. 61672191 and 61173021).

References (38)

  • L. Aversano et al.

    How clones are maintained: an empirical study

    Software Maintenance and Reengineering, 2007. CSMR’07. 11th European Conference on

    (2007)
  • T. Bakota et al.

    Clone smells in software evolution

    Software Maintenance, 2007. ICSM 2007. IEEE International Conference on

    (2007)
  • L. Barbour et al.

    Late propagation in software clones

    Software Maintenance (ICSM), 2011 27th IEEE International Conference on

    (2011)
  • N. Bettenburg et al.

    An empirical study on inconsistent changes to code clones at release level

    2009 16th Working Conference on Reverse Engineering

    (2009)
  • M. Ci et al.

    A new clone group mapping algorithm for extracting clone genealogy on multi-version software

    Instrumentation, Measurement, Computer, Communication and Control (IMCCC), 2013 Third International Conference on

    (2013)
  • E. Duala-Ekoko et al.

    CloneTracker: tool support for code clone management

    Proceedings of the 30th international conference on Software engineering

    (2008)
  • E. Duala-Ekoko et al.

    Clone region descriptors: representing and tracking duplication in source code

    ACM Trans. Softw. Eng. Methodol.

    (2010)
  • M. Fowler et al.

    Refactoring: Improving the Design of Existing Code

    (1999)
  • N. Friedman et al.

    Bayesian network classifiers

    Mach. Learn.

    (1997)
  • N. Göde et al.

    Incremental clone detection

    Software Maintenance and Reengineering, 2009. CSMR’09. 13th European Conference on

    (2009)
  • N. Göde et al.

    Frequency and risks of changes to clones

    Proceedings of the 33rd International Conference on Software Engineering

    (2011)
  • M. Hall et al.

    The WEKA data mining software: an update

    ACM SIGKDD Explor. Newslett.

    (2009)
  • J. Harder et al.

    Cloned code: stable code

    J. Softw.

    (2013)
  • A.E. Hassan

    Predicting faults using the complexity of code changes

    Proceedings of the 31st International Conference on Software Engineering

    (2009)
  • E. Juergens et al.

    Do code clones matter?

    Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on

    (2009)
  • T. Kamiya et al.

    CCFinder: a multilinguistic token-based code clone detection system for large scale source code

    IEEE Trans. Softw. Eng.

    (2002)
  • C. Kapser et al.

    Cloning considered harmful considered harmful

    2006 13th Working Conference on Reverse Engineering

    (2006)
  • M. Kim et al.

    An empirical study of code clone genealogies

    ACM SIGSOFT Software Engineering Notes

    (2005)
  • R. Koschke

    Survey of research on software clones

    Dagstuhl Seminar Proceedings

    (2007)
    Fanlong Zhang was born in 1987. He is a Ph.D. candidate of Computer Science at Harbin Institute of Technology. His research interests include software engineering, program analysis, and code clone analysis and maintenance.

    Siau-cheng Khoo received a Ph.D. degree in Computer Science from Yale University. He is an Associate Professor with the Department of Computer Science at National University of Singapore. His research interests include program analysis, optimizations and software engineering.

    Xiaohong Su (corresponding author) was born in 1966. She is a Professor at Harbin Institute of Technology. Her research interests include software fault localization, clone detection and analysis, and program analysis.

    1

    Fanlong Zhang is the main author, and most of the work was done by him.

    2

    The major part of this work was done when the first author was on a PhD internship at National University of Singapore.
