DOI: 10.1145/1295014.1295029
Article

Efficient token based clone detection with flexible tokenization

Published: 03 September 2007

Abstract

Code clones are similar code fragments that occur at multiple locations in a software system. Detection of code clones provides useful information for maintenance, reengineering, program understanding and reuse. Several techniques have been proposed to detect code clones. These techniques differ in the code representation used for analysis of clones, ranging from plain text to parse trees and program dependence graphs. Clone detection based on lexical tokens involves minimal code transformation and gives good results, but is computationally expensive because of the large number of tokens that need to be compared. We explored string algorithms to find suitable data structures and algorithms for efficient token-based clone detection and implemented them in our tool, Repeated Tokens Finder (RTF). Instead of using a suffix tree for string matching, we use a more memory-efficient suffix array. RTF incorporates a suffix-array-based linear-time algorithm to detect string matches. It also provides a simple and customizable tokenization mechanism. Initial analysis and experiments show that our clone detection is simple and scalable, and performs better than previous well-known tools.
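
The general idea behind token-based detection with a suffix array can be sketched as follows. This is only an illustration under stated assumptions, not RTF's implementation: the function names, the regular-expression tokenizer, the keyword set, and the `normalize` hook are all hypothetical, and the suffix array is built by naive sorting rather than by a linear-time algorithm. The sketch tokenizes the source, sorts the token suffixes, computes longest common prefixes between neighbouring suffixes (Kasai's algorithm), and reports neighbouring pairs that share at least `min_len` tokens.

```python
import re
from typing import Callable, List, Tuple


def tokenize(source: str, normalize: Callable[[str], str] = lambda t: t) -> List[str]:
    """Split source into crude lexical tokens.

    `normalize` is a hypothetical hook standing in for customizable tokenization.
    """
    raw = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return [normalize(t) for t in raw]


def abstract_identifiers(token: str) -> str:
    """Example normalizer: map every non-keyword identifier to the token 'ID'."""
    keywords = {"if", "else", "for", "while", "return", "int"}  # illustrative subset
    if re.fullmatch(r"[A-Za-z_]\w*", token) and token not in keywords:
        return "ID"
    return token


def suffix_array(tokens: List[str]) -> List[int]:
    """Start positions of all suffixes in sorted order (naive, not linear time)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def lcp_array(tokens: List[str], sa: List[int]) -> List[int]:
    """Kasai's algorithm: lcp[k] is the common prefix length of sa[k-1] and sa[k]."""
    n = len(tokens)
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and tokens[i + h] == tokens[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeated_runs(tokens: List[str], min_len: int) -> List[Tuple[int, int, int]]:
    """Return (pos_a, pos_b, length) for neighbouring suffixes sharing >= min_len tokens.

    Matches may overlap and are not merged into maximal clone classes.
    """
    sa = suffix_array(tokens)
    lcp = lcp_array(tokens, sa)
    return [(sa[k - 1], sa[k], lcp[k]) for k in range(1, len(sa)) if lcp[k] >= min_len]


if __name__ == "__main__":
    code = "a = b + c ; x = y + z ;"
    print(repeated_runs(tokenize(code), 2))
    print(repeated_runs(tokenize(code, abstract_identifiers), 6))
```

With the raw tokens, the two statements in the example share no run of two or more tokens. Once identifiers are abstracted to `ID`, they are reported as a single six-token match, which illustrates why the choice of tokenization affects which clones are found.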

Published In

ESEC-FSE companion '07: The 6th Joint Meeting on European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering: companion papers
September 2007
189 pages
ISBN:9781595938121
DOI:10.1145/1295014
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clone detection
  2. reverse engineering
  3. software maintenance
  4. token-based clone detection

Qualifiers

  • Article

Conference

ESEC/FSE07
Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Cited By

  • (2024) A Generative AI-Driven Method-Level Semantic Clone Detection Based on the Structural and Semantical Comparison of Methods. IEEE Access 12 (70773-70791). DOI: 10.1109/ACCESS.2024.3401770. Online publication date: 2024
  • (2023) Distributed Representation for Assembly Code. Computers 12:11 (222). DOI: 10.3390/computers12110222. Online publication date: 1-Nov-2023
  • (2023) C³: Code Clone-Based Identification of Duplicated Components. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1832-1843). DOI: 10.1145/3611643.3613883. Online publication date: 30-Nov-2023
  • (2021) From Simple to Structural Clones: Tapping the Benefits of Non-redundancy. Enterprise Information Systems (563-590). DOI: 10.1007/978-3-030-75418-1_26. Online publication date: 1-May-2021
  • (2019) Interactive Near Duplicate Search in Software Documentation. Programming and Computer Software 45:6 (346-355). DOI: 10.1134/S0361768819060045. Online publication date: 3-Dec-2019
  • (2019) CASFinder: Detecting Common Attack Surface. Data and Applications Security and Privacy XXXIII (338-358). DOI: 10.1007/978-3-030-22479-0_18. Online publication date: 11-Jun-2019
  • (2018) Duplicate finder toolkit. Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings (171-172). DOI: 10.1145/3183440.3195081. Online publication date: 27-May-2018
  • (2018) Detecting Near Duplicates in Software Documentation. Programming and Computing Software 44:5 (335-343). DOI: 10.1134/S0361768818050079. Online publication date: 1-Sep-2018
  • (2016) Clone Detection in Reuse of Software Technical Documentation. Perspectives of System Informatics (170-185). DOI: 10.1007/978-3-319-41579-6_14. Online publication date: 28-Jun-2016
  • (2013) Software clone detection: A systematic review. Information and Software Technology 55:7 (1165-1199). DOI: 10.1016/j.infsof.2013.01.008. Online publication date: Jul-2013
