Abstract
Code reuse through copying and pasting leads to so-called software clones. These clones can be roughly categorized into identical fragments (type-1 clones), fragments with parameter substitution (type-2 clones), and similar fragments that differ through modified, deleted, or added statements (type-3 clones). Although there has been extensive research on detecting clones, the detection of type-3 clones remains an open research issue because their definition is inherently vague. In this paper, we analyze type-3 clones detected by state-of-the-art tools and investigate their syntactic differences. We then derive the underlying semantic abstractions from these syntactic differences. Finally, we investigate whether there are code characteristics that indicate that a tool-suggested clone candidate is a real type-3 clone from a human’s perspective. Our findings can help developers of clone detectors and clone refactoring tools to improve their tools.
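The three clone categories can be illustrated with a small sketch. It is not the method or tooling evaluated in this article: the fragments, the helper names `tokens` and `classify`, and the 0.7 similarity threshold are all assumptions chosen for illustration. The threshold in particular is arbitrary, which reflects the vagueness of the type-3 definition noted above.

```python
import difflib
import io
import keyword
import tokenize

# Tokens that carry layout only; their text is replaced by the token type name.
LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}
# Tokens ignored entirely (comments, blank lines, end marker).
SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}


def tokens(source, normalize=False):
    """Token strings of `source`; optionally abstract identifiers and literals."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type in LAYOUT:
            result.append(tokenize.tok_name[tok.type])
        elif normalize and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")   # parameter substitution: any identifier
        elif normalize and tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")  # parameter substitution: any literal
        else:
            result.append(tok.string)
    return result


def classify(a, b, threshold=0.7):
    """Rough clone category of two fragments (threshold is an assumption)."""
    if tokens(a) == tokens(b):
        return "type-1"
    norm_a, norm_b = tokens(a, True), tokens(b, True)
    if norm_a == norm_b:
        return "type-2"
    ratio = difflib.SequenceMatcher(None, norm_a, norm_b).ratio()
    return "type-3" if ratio >= threshold else "no clone"


ORIGINAL = """\
def total(items):
    result = 0
    for item in items:
        result += item.price
    return result
"""

# Same code, different comments and blank lines.
TYPE1 = """\
def total(items):
    # sum up all prices
    result = 0

    for item in items:
        result += item.price
    return result
"""

# Same structure, renamed identifiers.
TYPE2 = """\
def sum_costs(products):
    acc = 0
    for p in products:
        acc += p.cost
    return acc
"""

# An added statement (the `if` guard) inside the loop.
TYPE3 = """\
def total(items):
    result = 0
    for item in items:
        if item.in_stock:
            result += item.price
    return result
"""

for name, fragment in [("TYPE1", TYPE1), ("TYPE2", TYPE2), ("TYPE3", TYPE3)]:
    print(name, classify(ORIGINAL, fragment))
```

Raising or lowering `threshold` changes which candidates count as type-3 clones, which is exactly the kind of judgment that a purely syntactic measure cannot settle on its own.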
Notes
The semantic information is not needed here.
Acknowledgments
We want to thank Pierre Frenzel for sharing his validated clones with us. Furthermore, we want to thank our industrial partner for giving us the opportunity to analyze industrial code of a software product line and the anonymous reviewers for their valuable comments.
Cite this article
Tiarks, R., Koschke, R. & Falke, R. An extended assessment of type-3 clones as detected by state-of-the-art tools. Software Qual J 19, 295–331 (2011). https://doi.org/10.1007/s11219-010-9115-6
DOI: https://doi.org/10.1007/s11219-010-9115-6