Clone detection via structural abstraction

Evans, William S.; Fraser, Christopher W.; Ma, Fei

doi:10.1007/s11219-009-9074-y

Published: 14 March 2009

Software Quality Journal, Volume 17, pages 309–330 (2009)

Abstract

This paper describes the design, implementation, and application of a new algorithm to detect cloned code. It operates on the abstract syntax trees formed by many compilers as an intermediate representation. It extends prior work by identifying clones even when arbitrary subtrees have been changed. These subtrees may represent structural rather than simply lexical code differences. In several hundred thousand lines of Java and C# code, 20–50% of the clones that we find involve these structural changes, which are not accounted for by previous methods. Our method also identifies cloning in declarations, so it is somewhat more general than conventional procedural abstraction.
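
To make the idea of a clone with changed subtrees concrete, the following is a minimal sketch, not the authors' algorithm. It compares just two small syntax trees top-down and replaces each mismatched subtree with a hole, which yields a shared pattern of the kind the abstract describes. The tree encoding (nested tuples), the `HOLE` marker, and the function names `tree`, `anti_unify`, `size`, and `holes` are all assumptions made for illustration.

```python
# Illustrative sketch only: pairwise abstraction of two syntax trees.
# A tree is (label, children); HOLE marks a subtree that differs.
HOLE = ("?", ())  # hypothetical marker; assumes no real node is labelled "?"


def tree(label, *children):
    """Build a tree node from a label and zero or more child trees."""
    return (label, tuple(children))


def anti_unify(a, b):
    """Return the common top-down pattern of a and b.

    Where labels or child counts differ, the whole subtree becomes a hole.
    """
    (label_a, kids_a), (label_b, kids_b) = a, b
    if label_a != label_b or len(kids_a) != len(kids_b):
        return HOLE
    return (label_a, tuple(anti_unify(x, y) for x, y in zip(kids_a, kids_b)))


def size(t):
    """Count the non-hole nodes in a pattern."""
    if t == HOLE:
        return 0
    return 1 + sum(size(c) for c in t[1])


def holes(t):
    """Count the holes in a pattern."""
    if t == HOLE:
        return 1
    return sum(holes(c) for c in t[1])


# x = a + b * c   versus   x = a + f(d): the right operand differs structurally.
first = tree("assign", tree("x"),
             tree("+", tree("a"), tree("*", tree("b"), tree("c"))))
second = tree("assign", tree("x"),
              tree("+", tree("a"), tree("call", tree("f"), tree("d"))))

pattern = anti_unify(first, second)
print(size(pattern), holes(pattern))
```

Here the two statements share the pattern `x = a + ?`, with one hole standing for a subtree that differs in structure rather than only in identifier names. A detector in the spirit of the abstract would have to find such patterns across a whole program and weigh pattern size against the number of holes; this sketch only shows what a single pattern with a hole looks like.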

Notes

The notation emphasizes the fact that each hole may be filled with a different pattern.
This does not consider the cost of the r − 1 call instructions that replace r − 1 of the occurrences.
http://java.sun.com/j2se/1.4.2/download.html.
Let n, c, t, and ℓ be the number of nodes, characters, tokens, and lines in a file. For Java, n ≈ 0.55 c ≈ 4.0 t ≈ 13.5 ℓ. For C#, n ≈ 0.39 c ≈ 1.45 t ≈ 14.9 ℓ.
The entire clone comprises 231 nodes (21 source lines), contains one hole, and the clone occurs twice.
http://javadoc.netbeans.org.
http://www.eclipse.org.
http://java.sun.com.
A trademark of Semantic Designs Inc.
They actually reported the fraction of sampled clone pairs that were unacceptable.

References

Badros, G. J. (2000). Java ML: A markup language for Java source code. Computer Networks (Amsterdam, Netherlands), 33(1–6), 159–177.
Baker, B. S. (1995). On finding duplication and near-duplication in large software systems. In Proceedings of the IEEE working conference on reverse engineering (pp. 86–95).
Baker, B. S. (1997). Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM Journal on Computing, 26(5), 1343–1362.
Baker, B. S. (2007). Finding clones with Dup: Analysis of an experiment. IEEE Transactions on Software Engineering, 33(9), 608–621.
Baker, B. S., & Manber, U. (1998). Deducing similarities in Java sources from bytecodes. In Proceedings of USENIX annual technical conference (pp. 179–190).
Baxter, I. D., Yahin, A., Moura, L., Sant’Anna, M., & Bier, L. (1998). Clone detection using abstract syntax trees. In Proceedings of the international conference on software maintenance (pp. 368–377).
Bellon, S. (2002). Vergleich von Techniken zur Erkennung duplizierten Quellcodes. Master’s thesis, University of Stuttgart. Thesis number 1998.
Bellon, S., Koschke, R., Antoniol, G., Krinke, J., & Merlo, E. (2007). Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering, 33(9), 577–591.
Cheung, W., Evans, W., & Moses, J. (2003). Predicated instructions for code compaction. In Proceedings of the 7th international workshop on software and compilers for embedded systems (pp. 17–32).
Chi, Y., Nijssen, S., Muntz, R. R., & Kok, J. N. (2005). Frequent subtree mining—an overview. Fundamenta Informaticae, 66(1–2), 161–198.
Church, K., & Helfman, J. (1993). Dotplot: A program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, 2(2), 153–174.
Cooper, K. D., & McIntosh, N. (1999). Enhanced code compression for embedded RISC processors. In ACM conference on programming language design and implementation (pp. 139–149).
Debray, S. K., Evans, W., Muth, R., & de Sutter, B. (2000). Compiler techniques for code compaction. ACM Transactions on Programming Languages and Systems, 22(2), 378–415.
Ducasse, S., Rieger, M., & Demeyer, S. (1999). A language independent approach for detecting duplicated code. In Proceedings of the IEEE international conference on software maintenance (ICSM) (pp. 109–118).
Evans, W., Fraser, C. W., & Ma, F. (2007). Clone detection via structural abstraction. In Proceedings of the IEEE working conference on reverse engineering (pp. 150–159).
Fraser, C., Myers, E., & Wendt, A. (1984). Analyzing and compressing assembly code. In Proceedings of the ACM SIGPLAN symposium on compiler construction (Vol. 19, pp. 117–121).
Griswold, R. E., & Griswold, M. T. (1996). The Icon programming language. Peer-to-Peer Communications.
Griswold, W. G., & Notkin, D. (1993). Automated assistance for program restructuring. ACM Transactions on Software Engineering and Methodology, 2(3), 228–279.
Hanson, D. R., & Proebsting, T. A. (2004). A research C# compiler. Software: Practice and Experience, 34(13), 1211–1224.
Jiang, L., Misherghi, G., Su, Z., & Glondu, S. (2007). DECKARD: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th international conference on software engineering (pp. 96–105).
Kamiya, T., Kusumoto, S., & Inoue, K. (2002). CCFinder: A multi-linguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7), 654–670.
Karp, R. M., Miller, R. E., & Rosenberg, A. L. (1972). Rapid identification of repeated patterns in strings, trees, and arrays. In Proceedings of the ACM symposium on theory of computing (pp. 125–136).
Komondoor, R., & Horwitz, S. (2001). Using slicing to identify duplication in source code. In Proceedings of the eighth international symposium on static analysis (pp. 40–56).
Kontogiannis, K. A., DeMori, R., Merlo, E., Galler, M., & Bernstein, M. (1996). Pattern matching for clone and concept detection. Automated Software Engineering, 3, 77–108.
Koschke, R., Falke, R., & Frenzel, P. (2006). Clone detection using abstract syntax suffix trees. In Proceedings of the IEEE working conference on reverse engineering (pp. 253–262).
Li, Z., Lu, S., Myagmar, S., & Zhou, Y. (2006). CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Transactions on Software Engineering, 32(3), 176–192.
Ma, F. (2006). On the study of tree pattern matching algorithms and applications. Master’s thesis, Department of Computer Science, University of British Columbia.
Mayrand, J., Leblanc, C., & Merlo, E. (1996). Experiment on the automatic detection of function clones in a software system using metrics. In Proceedings of the IEEE international conference on software maintenance (pp. 244–253).
Seal, D. (Ed.) (2001). ARM architecture reference manual (2nd ed.). Addison-Wesley.
De Sutter, B., De Bus, B., & De Bosschere, K. (2002). Sifting out the mud: Low level C++ code reuse. In Proceedings of the 17th ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications (pp. 275–291).
Yang, W. (1991). Identifying syntactic differences between two programs. Software: Practice and Experience, 21(7), 739–755.
Zaki, M. J. (2002). Efficiently mining frequent trees in a forest. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80).

Acknowledgements

W. S. Evans was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant.

Author information

Authors and Affiliations

Department of Computer Science, University of British Columbia, Vancouver, BC, V6T1Z4, Canada
William S. Evans
Seattle, WA, USA
Christopher W. Fraser
Microsoft, One Microsoft Way, Redmond, WA, 98052, USA
Fei Ma

Corresponding author

Correspondence to William S. Evans.

About this article

Cite this article

Evans, W.S., Fraser, C.W. & Ma, F. Clone detection via structural abstraction. Software Qual J 17, 309–330 (2009). https://doi.org/10.1007/s11219-009-9074-y

Issue Date: December 2009
