Predicting change consistency in a clone group

doi:10.1016/j.jss.2017.08.045

Journal of Systems and Software

Volume 134, December 2017, Pages 105-119

https://doi.org/10.1016/j.jss.2017.08.045

Highlights

• Developed a Bayesian network for predicting clone consistency-requirement.
• Introduced code, context and clone evolution attributes to quantify clone change.
• Performed experiments on 4 open source repositories to demonstrate its effectiveness.
• Provided a plug-in prototype in Eclipse to avoid clone consistency-defect.

Abstract

Code cloning has been accepted as one of the general code reuse methods in software development, thanks to the increasing demand for rapid software production. The introduction of clone groups and clone genealogies enables software developers to be aware of the presence of, and changes to, clones as a collective group; it also allows developers to understand how clone groups evolve throughout the software life cycle. Due to the similarity of code within a clone group, a change to one piece of code may require developers to make consistent changes to other clones in the group. Failure to make such consistent changes to a clone group when necessary is commonly known as a “clone consistency-defect”, which can adversely impact software reusability.

In this work, we propose an approach to predict the need for making consistent change in clones within a clone group at the time when changes have been made to one of its clones. We build a variant of clone genealogies to collect all consistent/inconsistent changes to clone groups, and extract three attribute sets from clone groups as input for predicting the need for consistent clone change. These three attribute sets are code attributes, context attributes and evolution attributes respectively. Together, they provide a holistic view about clone changes. We conduct experiments on four open source projects. Our experiments show that our approach has reasonable precision and recall in predicting whether a clone group requires (or is free of) consistent change. This holistic approach can aid developers in maintaining clone changes, and avoid potential consistency-defect, which can improve software quality and reusability.

Introduction

A clone fragment (or simply a clone) generally refers to a code fragment that is “similar” to another code fragment; the notion of “similarity” between two code fragments is typically defined at the textual or syntactic level (Koschke, 2007). The “copy-and-paste” operation is the most noticeable way of physically reusing existing code, and it can introduce abundant clone fragments. The presence of clones in software has given rise to the question of whether clones can adversely affect software quality, starting with Fowler et al. identifying some clones as a “bad smell” (Fowler and Beck, 1999). The clone research community has since debated whether changing a set of clones inconsistently may cause software defects, and whether the requirement to change clones consistently may lead to extra maintenance cost. If clone fragments in a group of clones need to be changed consistently and developers forget to do so, defects may be introduced at a later stage of software evolution (Bettenburg et al., 2009; Juergens et al., 2009). On the other hand, when consistent change within a clone group is not required, developers might unnecessarily spend time verifying and attempting to maintain clone consistency, resulting in additional software maintenance overhead (Aversano et al., 2007; Barbour et al., 2011).

This work is an extension of our conference paper (Zhang et al., 2016), which outlines a predictive model that warns software developers about the need to perform consistent change in clones, so as to reduce clone maintenance cost in particular and improve software maintainability in general. The extension here includes experimental details as well as the inclusion of another software repository as an experiment subject. Moreover, we extend the technique to include prediction of clone changes which do NOT require consistent change to the corresponding clone group. Thus, in this work, we develop a more holistic approach which predicts whether consistent change is needed for a clone group when one of the clone fragments in the group has been modified. Specifically, when a developer modifies a piece of code which is a clone of other code, our model makes its prediction and offers one of two possible warnings to the developer:

1. When similar changes are indeed required for at least one other clone in a clone group, we say that the clone group satisfies the clone consistency-requirement. If this requirement is predicted, our model will alert the developer, and appropriate management action can be taken to avoid a consistency-defect. Although this leads to an increase in software maintenance cost, it reduces the risk of a clone consistency-defect.
2. When none of the clones in the clone group requires consistent change, we say that the clone group is consistency-free. If this is predicted, our model will inform the developer, who can then change the clones freely with more confidence. This in turn saves time that would otherwise be spent unnecessarily on verifying consistency.

A related work on clone consistency-requirement prediction, which inspired the current work, was conducted by Wang et al. (2012, 2014). In that work, they define a code cloning operation as “consistency-maintenance-requirement” if its generated code clones experience consistent changes in software evolution history. They aim to automatically predict whether a code cloning operation requires consistency-maintenance at the time when a copy-and-paste operation is performed. They employ a Bayesian network machine learning technique (Friedman et al., 1997) to develop and train a prediction model. Their predictor is built with the following two categories of inputs: (1) the syntactic characteristics of the code and its copy-and-paste counterpart and (2) the physical context of the code and its copy-and-paste counterpart. Tested on two open source projects and two large-scale internal projects, they show that their predictor is able to recommend developers to perform more than 50% of cloning operations with a precision of at least 94% in these four subjects; in addition, it is also able to avoid 37% to 72% of consistency-maintenance-required code clones by warning developers on only 13% to 40% of code clones.

While Wang et al. aim to perform prediction at copy-and-paste time, we perform prediction at almost any time in the software life cycle when a clone has been modified. Our technique can thus be applied to existing clones in an established project, rather than only to new clones formed via copy-and-paste. To achieve that, we need to be aware of the clone group to which the modified code belongs. A clone group is a group of clones within a piece of software which are known to be similar by some similarity measure. In order to train a predictor, it is natural to investigate the evolution of clone groups during software evolution. To this end, we adapt the notion of clone genealogy as explained by Kim et al. (2005). A clone genealogy describes the evolution of clones, and defines various clone patterns to describe how clones in a group have been changed from the earlier version of the project. We hypothesize that how a clone has been modified genealogically with respect to its clone group has an impact on predicting whether the clone group will require consistent change in future. We thus build our predictor on three categories of inputs: two of them are adapted from the work by Wang et al. (2012, 2014), and the last one, called evolution attributes, captures the characteristics of clone genealogy. This combination of three attribute sets provides a holistic view of clone groups; the presence of evolution attributes enables the predictor to be customized to an individual software repository. We construct, via WEKA (Hall et al., 2009), a Bayesian network as the predictor, and evaluate its predictive power on four software projects. Our experiments show that the predictor performs reasonably well, with stable precision and recall for both clone consistency-requirement and consistency-free predictions: precision ranges between 70% and 80%, and recall between 63% and 83%. In addition, each of the attribute sets contributes positively in its own way to the predictive power, and the absence of any of these attribute sets can adversely affect the recall of the predictor.
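
As a rough illustration of the prediction step, the sketch below trains a naive Bayes classifier, which is the simplest Bayesian network structure, on discretized attributes. It is not the WEKA-based network built in this work: the class `NaiveBayes`, the attribute names (`fragment_size`, `same_file`, `past_pattern`), their values and the toy training data are all hypothetical and serve only to show how one attribute from each category (code, context, evolution) could feed a probabilistic predictor.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing (a stand-in, not the paper's model)."""

    def fit(self, rows, labels):
        self.labels = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(label, attribute)][value] -> number of training rows
        self.value_counts = defaultdict(Counter)
        # domains[attribute] -> set of values seen during training
        self.domains = defaultdict(set)
        for row, label in zip(rows, labels):
            for attr, value in row.items():
                self.value_counts[(label, attr)][value] += 1
                self.domains[attr].add(value)
        return self

    def log_posteriors(self, row):
        """Unnormalized log P(label | row) for every label."""
        scores = {}
        for label in self.labels:
            score = math.log(self.class_counts[label] / self.total)
            for attr, value in row.items():
                count = self.value_counts[(label, attr)][value]
                denom = self.class_counts[label] + len(self.domains[attr])
                score += math.log((count + 1) / denom)  # add-one smoothing
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_posteriors(row)
        return max(scores, key=scores.get)


# Hypothetical change instances: one code, one context and one evolution attribute.
train_rows = [
    {"fragment_size": "small", "same_file": "yes", "past_pattern": "consistent"},
    {"fragment_size": "small", "same_file": "yes", "past_pattern": "consistent"},
    {"fragment_size": "large", "same_file": "yes", "past_pattern": "consistent"},
    {"fragment_size": "large", "same_file": "no", "past_pattern": "inconsistent"},
    {"fragment_size": "large", "same_file": "no", "past_pattern": "inconsistent"},
    {"fragment_size": "small", "same_file": "no", "past_pattern": "inconsistent"},
]
train_labels = ["required", "required", "required", "free", "free", "free"]

model = NaiveBayes().fit(train_rows, train_labels)
query = {"fragment_size": "small", "same_file": "yes", "past_pattern": "consistent"}
prediction = model.predict(query)
print(prediction)
```

A full Bayesian network can also model dependencies between attributes, which naive Bayes ignores; the sketch only conveys how attribute values are turned into a class prediction.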

The contributions of this paper are as follows:

1. We propose an approach to predict the need for consistent change in a clone group arising from the occurrence of a clone change.
2. We identify a new set of attributes for prediction based on information related to clone genealogies. The results show that this set of attributes has a positive impact on the recall of the predictor.
3. We demonstrate the feasibility of this prediction via an evaluation on four open source projects. The results show that our approach can predict consistent change in a clone group effectively with good precision and reasonable recall, and can help improve software reusability through predictive clone maintenance.

This paper is organized as follows: Section 2 discusses related work, giving a brief introduction to code clone research and defect prediction. Preliminaries are provided in Section 3, where we explain code clone types, clone genealogy and clone changes, and give a real-world example of a consistent change. Section 4 details our approach to consistent clone change prediction. We provide the details of the implementation in Section 5. Section 6 describes our evaluation through experimentation on four projects. Section 7 discusses threats to validity. We conclude and point to future work in Section 8.

Section snippets

Related works

Ever since code clones were identified as a “bad smell” (Fowler and Beck, 1999), there have been discussions over their harmfulness to software. Proponents of the view that clones are harmful argue that their existence can lead to software defects and incur additional maintenance effort. To this end, Lozano and Wermelinger, studying the changeability of code clones, show that the maintenance effort of changing a method may increase significantly when the method has a clone (Lozano and Wermelinger, 2008).

Preliminaries

We group together similar clone fragments within a piece of software to form a clone group. There are typically four ways to categorize clones into groups (Roy and Cordy, 2007), and in this work, we consider Type-1, Type-2 and Type-3 cloning strategy:

• Type-1 clone is an exact clone; i.e., all clones in a clone group are identical without any modifications, except for differences in code layout and comments.
• Type-2 clone is a syntactically identical clone; i.e., all clones in a group are
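
To make the Type-1 definition concrete, the following minimal sketch compares two fragments after discarding layout and comments. It is an illustration only, not the clone detector used in this work: the helper names `tokens` and `is_type1_clone` are hypothetical, Java-style comments are assumed, and the regular expressions are naive (comment markers or whitespace inside string literals are not handled).

```python
import re


def tokens(code: str) -> list:
    """Strip Java-style comments, then split into word and punctuation tokens."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                     # line comments
    return re.findall(r"\w+|[^\w\s]", code)                   # whitespace is dropped


def is_type1_clone(a: str, b: str) -> bool:
    """True if the fragments differ only in layout and comments."""
    return tokens(a) == tokens(b)


original = "int sum(int a, int b) {\n    return a + b; // add\n}"
reformatted = "int sum(int a,int b){ /* add */ return a+b; }"
renamed = "int sum(int x, int y) { return x + y; }"

print(is_type1_clone(original, reformatted))
print(is_type1_clone(original, renamed))
```

The `renamed` fragment is not a Type-1 clone of `original` under this check because its identifiers differ, even though the code structure is the same.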

Definitions for clone consistency-requirement

We propose a formal definition of “consistent change” between two pairs of clone fragments, as follows:

Definition 1 (τ-Consistent change)

Given two clone fragments $c_1$ and $c_2$ that are modified to $c_1'$ and $c_2'$ respectively, we say that this modification of $c_1$ and $c_2$ is a τ-consistent change if, for some very small threshold τ,

$$\begin{aligned} \mathrm{textSim}(c_i, c_i') &< 1 \quad \forall i \in \{1, 2\} \\ \left| \mathrm{textSim}(c_1, c_1') - \mathrm{textSim}(c_2, c_2') \right| &< \tau \end{aligned}$$

Note that $\mathrm{textSim}(c, c') = 1 - \mathrm{UPI}(c, c')$. The first constraint above states that both $c_1$ and $c_2$ have been modified. The second constraint dictates that
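
The two constraints of Definition 1 can be checked mechanically, as in the sketch below. UPI is not defined in this excerpt, so the sketch substitutes the line-based similarity ratio from Python's `difflib` for textSim; that substitution, the function names `text_sim` and `is_tau_consistent`, the default `tau=0.05` and the sample fragments are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def text_sim(before: str, after: str) -> float:
    """Stand-in for textSim = 1 - UPI: line-based similarity ratio in [0, 1]."""
    return SequenceMatcher(None, before.splitlines(), after.splitlines()).ratio()


def is_tau_consistent(c1, c1_new, c2, c2_new, tau=0.05):
    """Check both constraints of a tau-consistent change."""
    s1 = text_sim(c1, c1_new)
    s2 = text_sim(c2, c2_new)
    both_modified = s1 < 1 and s2 < 1          # first constraint
    similar_extent = abs(s1 - s2) < tau        # second constraint
    return both_modified and similar_extent


# Two hypothetical clones that receive the same one-line change.
c1 = "a = read()\nb = parse(a)\nsave(b)"
c1_new = "a = read()\nb = parse(a, strict=True)\nsave(b)"
c2 = "x = read()\ny = parse(x)\nsave(y)"
c2_new = "x = read()\ny = parse(x, strict=True)\nsave(y)"

print(is_tau_consistent(c1, c1_new, c2, c2_new))
print(is_tau_consistent(c1, c1_new, c2, c2))
```

In the second call only `c1` is modified, so the first constraint fails and the change is not counted as consistent.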

Implementation

We implemented our approach as a prototype tool in Java and integrated it into the Eclipse IDE as a plug-in. Our tool combines three key functions to achieve this prediction task.

Methodology

We conducted experiments on repositories of four open source projects. Table 1 gives the details of these projects. As observed from the table, there are hundreds of change instances for each project, ranging from 159 to 1040, with jEdit being the smallest repository. Among them, the number of change instances which meet the consistency-requirement (i.e., leading to consistent change of clone groups in the future) is shown in column 3, whereas column 2 reflects the number of change

Threats to validity

As an empirical study, our experimental results may be subject to some threats to validity, including construct validity, internal validity and external validity.

Conclusions and future work

The presence of clones in software adds to the burden of software maintenance due to the potential need to maintain consistent change in clones when software evolves. In this paper, we propose an approach to predict clone consistency-requirement for any clone group which has some of its constituent clone fragments undergoing code modification. We build clone genealogies from the software repositories to extract clone change instances. For each change instance, we extract three sets of

Acknowledgments

We would like to thank the anonymous reviewers for their valuable and thorough comments. This work is supported by the 13th Five-Year National Science and Technology Major Project of China (Grant no. 2017YFC0702204) and the National Natural Science Foundation of China (Grant no. 61672191 and 61173021).

Fanlong Zhang was born in 1987. He is a Ph.D. candidate of Computer Science at Harbin Institute of Technology. His research interests include software engineering, program analysis, and code clone analysis and maintenance. (Email: [email protected])

References (38)

L. Aversano et al. How clones are maintained: an empirical study. Software Maintenance and Reengineering, 2007. CSMR’07. 11th European Conference on (2007)
T. Bakota et al. Clone smells in software evolution. Software Maintenance, 2007. ICSM 2007. IEEE International Conference on (2007)
L. Barbour et al. Late propagation in software clones. Software Maintenance (ICSM), 2011 27th IEEE International Conference on (2011)
N. Bettenburg et al. An empirical study on inconsistent changes to code clones at release level. 2009 16th Working Conference on Reverse Engineering (2009)
M. Ci et al. A new clone group mapping algorithm for extracting clone genealogy on multi-version software. Instrumentation, Measurement, Computer, Communication and Control (IMCCC), 2013 Third International Conference on (2013)
E. Duala-Ekoko et al. CloneTracker: tool support for code clone management. Proceedings of the 30th International Conference on Software Engineering (2008)
E. Duala-Ekoko et al. Clone region descriptors: representing and tracking duplication in source code. ACM Trans. Softw. Eng. Methodol. (2010)
M. Fowler et al. Refactoring: Improving the Design of Existing Code (1999)
N. Friedman et al. Bayesian network classifiers. Mach. Learn. (1997)
N. Göde et al. Incremental clone detection. Software Maintenance and Reengineering, 2009. CSMR’09. 13th European Conference on (2009)
N. Göde et al. Frequency and risks of changes to clones. Proceedings of the 33rd International Conference on Software Engineering (2011)
M. Hall et al. The WEKA data mining software: an update. ACM SIGKDD Explor. Newslett. (2009)
J. Harder et al. Cloned code: stable code. J. Softw. (2013)
A.E. Hassan. Predicting faults using the complexity of code changes. Proceedings of the 31st International Conference on Software Engineering (2009)
E. Juergens et al. Do code clones matter? Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on (2009)
T. Kamiya et al. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng. (2002)
C. Kapser et al. Cloning considered harmful considered harmful. 2006 13th Working Conference on Reverse Engineering (2006)
M. Kim et al. An empirical study of code clone genealogies. ACM SIGSOFT Software Engineering Notes (2005)
R. Koschke. Survey of research on software clones. Dagstuhl Seminar Proceedings (2007)


Siau-cheng Khoo received his Ph.D. degree in Computer Science from Yale University. He is an Associate Professor with the Department of Computer Science at the National University of Singapore. His research interests include program analysis, optimizations and software engineering. (Email: [email protected])

Xiaohong Su (corresponding author) was born in 1966. She is a Professor at Harbin Institute of Technology. Her research interests include software fault localization, clone detection and analysis, and program analysis. (Email: [email protected])

¹: Fanlong Zhang is the main author, and most of the work was done by him.

²: The major part of this work was done when the first author was on a PhD internship at National University of Singapore.
