Predicting change consistency in a clone group
Introduction
A clone fragment (or simply a clone) generally refers to a code fragment that is “similar” to another code fragment; the notion of “similarity” between two code fragments is typically defined at the textual or syntactic level (Koschke, 2007). The “copy-and-paste” operation is the most noticeable way of physically reusing existing code, and it can introduce abundant clone fragments. The presence of clones in software has raised the question of whether clones adversely affect software quality, starting with Fowler et al. identifying some clones as a “bad smell” (Fowler and Beck, 1999). The clone research community has since debated whether changing a set of clones inconsistently may cause software defects, and whether the requirement to change clones consistently may lead to extra maintenance cost. If the clone fragments in a clone group need to be changed consistently and developers forget to do so, defects may be introduced at a later stage of software evolution (Bettenburg, Shang, Ibrahim, Adams, Zou, Hassan, 2009; Juergens, Deissenboeck, Hummel, Wagner, 2009). On the other hand, when consistent change within a clone group is not required, developers might unnecessarily spend time verifying and attempting to maintain clone consistency, resulting in additional software maintenance overhead (Aversano, Cerulo, Di Penta, 2007; Barbour, Khomh, Zou, 2011).
This work is an extension of our conference paper (Zhang et al., 2016), which outlines a predictive model that warns software developers about the need to perform consistent changes in clones, so as to reduce clone maintenance cost in particular and improve software maintainability in general. The extension includes further experimental details as well as an additional software repository as an experimental subject. Moreover, we extend the technique to predict clone changes that do NOT require a consistent change to the corresponding clone group. Thus, in this work, we develop a more holistic approach that predicts whether consistent change is needed for a clone group when one of the clone fragments in the group has been modified. Specifically, when a developer modifies a piece of code that is a clone of other code, our model makes a prediction and offers one of two possible warnings to the developer:
1. When similar changes are indeed required for at least one other clone in a clone group, we say that the clone group satisfies the clone consistency-requirement. If this requirement is predicted, our model alerts the developer, and appropriate management action can be taken to avoid a consistency-defect. Although this increases software maintenance cost, it reduces the risk of a clone consistency-defect.
2. When none of the clones in the clone group requires consistent change, we say that the clone group is consistency-free. If this is predicted, our model informs the developer, who can then change the clones freely with more confidence. This in turn saves time that would otherwise be spent unnecessarily on verifying consistency.
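The two outcomes above can be sketched as a small decision step. This is an illustrative sketch only, not the authors' tool: the names `ConsistencyPrediction` and `warning_for`, as well as the message texts, are hypothetical, and the predicted label is assumed to come from an already trained classifier.

```python
from enum import Enum


class ConsistencyPrediction(Enum):
    """Hypothetical labels for the two outcomes described in the text."""
    CONSISTENCY_REQUIRED = "consistency-requirement"
    CONSISTENCY_FREE = "consistency-free"


def warning_for(prediction: ConsistencyPrediction, group_size: int) -> str:
    """Map a predicted label to the message shown to the developer.

    group_size is the number of fragments in the clone group,
    including the fragment that was just modified.
    """
    if group_size < 2:
        raise ValueError("a clone group needs at least two fragments")
    others = group_size - 1
    if prediction is ConsistencyPrediction.CONSISTENCY_REQUIRED:
        # Outcome 1: at least one other clone likely needs a similar change.
        return (f"Warning: review the other {others} clone(s) in this group; "
                "a similar change is likely needed.")
    # Outcome 2: no other clone is expected to need the same change.
    return (f"Info: the other {others} clone(s) in this group are unlikely "
            "to need the same change.")
```

A caller would pass the classifier's label and the size of the affected clone group, for example `warning_for(ConsistencyPrediction.CONSISTENCY_FREE, 3)`.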
A related line of work on clone consistency-requirement prediction, which inspired the current work, was conducted by Wang, Dang, Zhang, Zhang, Lan, and Mei (2012, 2014). They define a code cloning operation as “consistency-maintenance-requirement” if the code clones it generates experience consistent changes in the software evolution history. They aim to predict automatically, at the time a copy-and-paste operation is performed, whether the cloning operation requires consistency maintenance. They employ the Bayesian network machine learning technique (Friedman et al., 1997) to develop and train a prediction model. Their predictor is built with two categories of inputs: (1) the syntactic characteristics of the code and its copy-and-paste counterpart and (2) the physical context of the code and its copy-and-paste counterpart. Tested on two open source projects and two large-scale internal projects, their predictor is able to recommend that developers perform more than 50% of cloning operations with a precision of at least 94% in these four subjects; in addition, it is able to avoid 37% to 72% of consistency-maintenance-required code clones by warning developers about only 13% to 40% of code clones.
While Wang et al. aim to perform prediction at copy-and-paste time, we perform prediction at almost any time in the software life cycle when a clone has been modified. Our technique can thus be applied to existing clones in an established project, rather than only to new clones formed via copy-and-paste. To achieve that, we need to be aware of the clone group to which the modified code belongs. A clone group is a group of clones within a piece of software that are known to be similar under some similarity measure. To train a predictor, it is natural to investigate how clone groups evolve during software evolution. To this end, we adapt the notion of clone genealogy as explained by Kim et al. (2005). A clone genealogy describes the evolution of clones and defines various clone patterns that describe how the clones in a group have changed from the earlier version of the project. We hypothesize that how a clone has been modified genealogically with respect to its clone group affects the prediction of whether the clone group will require consistent change in the future. We thus build our predictor on three categories of inputs: two are adapted from the work of Wang, Dang, Zhang, Zhang, Lan, and Mei (2012, 2014), and the third, called evolution attributes, captures the characteristics of the clone genealogy. This combination of three attribute sets provides a holistic view of clone groups; the presence of evolution attributes enables the predictor to be customized to an individual software repository. We construct a Bayesian network as the predictor via WEKA (Hall et al., 2009), and evaluate its predictive power on four software projects. Our experiments show that the predictor performs reasonably well, with stable precision and recall for both clone consistency-requirement and consistency-free predictions: precision ranges between 70% and 80%, and recall between 63% and 83%.
In addition, each of the attribute sets contributes positively in its own way to the predictive power, and an absence of any of these attribute sets can adversely affect the recall ability of the predictor.
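For reference, precision and recall are computed separately for each of the two predicted classes. The sketch below applies the standard definitions; the function name, the label strings, and the toy data are illustrative assumptions and are not taken from the paper's experiments.

```python
from typing import Sequence, Tuple


def precision_recall(actual: Sequence[str],
                     predicted: Sequence[str],
                     label: str) -> Tuple[float, float]:
    """Return (precision, recall) for one class label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    # True positives: instances of `label` that were predicted as `label`.
    tp = sum(1 for a, p in pairs if a == label and p == label)
    # False positives: other instances wrongly predicted as `label`.
    fp = sum(1 for a, p in pairs if a != label and p == label)
    # False negatives: instances of `label` predicted as something else.
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy, made-up change instances ("required" = consistency-requirement,
# "free" = consistency-free); not data from the paper.
actual = ["required", "required", "free", "free", "required"]
predicted = ["required", "free", "free", "required", "required"]

required_scores = precision_recall(actual, predicted, "required")
free_scores = precision_recall(actual, predicted, "free")
```

Reporting both classes matters here because a predictor could score well on consistency-requirement while performing poorly on consistency-free, or the reverse.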
The contributions of this paper are as follows:
1. We propose an approach to predict the need for consistent change in a clone group arising from the occurrence of a clone change.
2. We identify a new set of attributes for prediction based on information related to clone genealogies. The results show that this set of attributes has a positive impact on the recall of the predictor.
3. We demonstrate the feasibility of this prediction via an evaluation on four open source projects. The results show that our approach can predict consistent change in clone groups effectively, with good precision and reasonable recall, and can help improve software reusability through predictive clone maintenance.
This paper is organized as follows. Section 2 discusses related work, giving a brief introduction to code clone research and defect prediction. Preliminaries are provided in Section 3, where we explain code clone types, clone genealogy, and clone changes, and give a real-world example of a consistent change. Section 4 details our approach to consistent clone change prediction. Section 5 provides the implementation details. Section 6 describes our evaluation through experiments on four projects. Section 7 discusses threats to validity. We conclude and point to future work in Section 8.
Related works
Ever since code clones were identified as a “bad smell” (Fowler and Beck, 1999), there have been discussions about their harmfulness to software. Proponents of the view that clones are harmful argue that their existence can lead to software defects and incur additional maintenance effort. To this end, Lozano and Wermelinger, studying the changeability of code clones, show that the maintenance effort of changing a method may increase significantly when the method has a clone (Lozano and Wermelinger, 2008).
Preliminaries
We group together similar clone fragments within a piece of software to form a clone group. Clones are typically categorized into four types (Roy and Cordy, 2007); in this work, we consider Type-1, Type-2 and Type-3 clones:
- Type-1 clone is an exact clone; i.e., all clones in a clone group are identical without any modifications, except for differences in code layout and comments.
- Type-2 clone is a syntactically identical clone; i.e., all clones in a group are
Definitions for clone consistency-requirement
We propose a formal definition of “consistent change” between two pairs of clone fragments, as follows:
Definition 1 (τ-Consistent change) Given that two clone fragments c1, c2 are modified to c1′ and c2′, respectively. We say this modification between c1 and c2 is a τ-consistent change if for some very small threshold τ,
Note that 1 - UPI(c, c′). The first constraint above states that both c1 and c2 have been modified. The second constraint dictates that
Implementation
We implemented our approach as a prototype tool in Java, and integrated it into the Eclipse IDE as a plug-in. Our tool combines three key functions to achieve this prediction task.
Methodology
We conducted experiments on the repositories of four open source projects. Table 1 gives the details of these projects. As the table shows, each project has hundreds of change instances, ranging from 159 to 1040, with jEdit being the smallest repository. Among them, the number of change instances that meet the consistency-requirement (i.e., leading to consistent change of clone groups in the future) is shown in column 3, whereas column 2 reflects the number of change
Threats to validity
As an empirical study, our experimental results may be subject to threats to validity, including construct validity, internal validity and external validity.
Conclusions and future work
The presence of clones in software adds to the burden of software maintenance due to the potential need to maintain consistent change in clones when software evolves. In this paper, we propose an approach to predict clone consistency-requirement for any clone group which has some of its constituent clone fragments undergoing code modification. We build clone genealogies from the software repositories to extract clone change instances. For each change instance, we extract three sets of
Acknowledgments
We would like to thank the anonymous reviewers for their valuable and thorough comments. This work is supported by the 13th Five-Year National Science and Technology Major Project of China (Grant no. 2017YFC0702204) and the National Natural Science Foundation of China (Grant no. 61672191 and 61173021).
Fanlong Zhang was born in 1987. He is a Ph.D. candidate of Computer Science at Harbin Institute of Technology. His research interests include software engineering, program analysis, and code clone analysis and maintenance.
References (38)
- How clones are maintained: an empirical study. 11th European Conference on Software Maintenance and Reengineering (CSMR’07), 2007.
- Clone smells in software evolution. IEEE International Conference on Software Maintenance (ICSM 2007), 2007.
- Late propagation in software clones. 27th IEEE International Conference on Software Maintenance (ICSM), 2011.
- An empirical study on inconsistent changes to code clones at release level. 16th Working Conference on Reverse Engineering, 2009.
- A new clone group mapping algorithm for extracting clone genealogy on multi-version software. Third International Conference on Instrumentation, Measurement, Computer, Communication and Control (IMCCC), 2013.
- Clonetracker: tool support for code clone management. Proceedings of the 30th International Conference on Software Engineering, 2008.
- Clone region descriptors: representing and tracking duplication in source code. ACM Trans. Softw. Eng. Methodol., 2010.
- Refactoring: Improving the Design of Existing Code. 1999.
- Bayesian network classifiers. Mach. Learn., 1997.
- Incremental clone detection. 13th European Conference on Software Maintenance and Reengineering (CSMR’09), 2009.
- Frequency and risks of changes to clones. Proceedings of the 33rd International Conference on Software Engineering.
- The weka data mining software: an update. ACM SIGKDD Explor. Newslett.
- Cloned code: stable code. J. Softw.
- Predicting faults using the complexity of code changes. Proceedings of the 31st International Conference on Software Engineering.
- Do code clones matter? IEEE 31st International Conference on Software Engineering (ICSE 2009), 2009.
- Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng.
- Cloning considered harmful considered harmful. 13th Working Conference on Reverse Engineering, 2006.
- An empirical study of code clone genealogies. ACM SIGSOFT Software Engineering Notes.
- Survey of research on software clones. Dagstuhl Seminar Proceedings.
Siau-cheng Khoo received a Ph.D. degree in Computer Science from Yale University. He is an Associate Professor with the Department of Computer Science at National University of Singapore. His research interests include program analysis, optimizations and software engineering.
Xiaohong Su (corresponding author) was born in 1966. She is a Professor at Harbin Institute of Technology. Her research interests include software fault localization, clone detection and analysis, and program analysis.
- 1
Fanlong Zhang is the main author, and most of the work was done by him.
- 2
The major part of this work was done when the first author was on a PhD internship at National University of Singapore.