Detection of semantically similar code

Wang, Tiantian; Wang, Kechao; Su, Xiaohong; Ma, Peijun

doi:10.1007/s11704-014-3430-1

Research Article
Published: 22 October 2014

Frontiers of Computer Science, Volume 8, pages 996–1011 (2014)

Tiantian Wang¹, Kechao Wang¹,², Xiaohong Su¹ & Peijun Ma¹

262 Accesses
13 Citations

Abstract

Traditional approaches to similar code detection are limited in their ability to detect semantically similar code, which impedes their application in practice. In this paper, we improve both the traditional metrics-based approach and the graph-based approach, and present an approach that combines the two. First, source code is represented as augmented system dependence graphs. Then, metrics-based candidate extraction filters out most dissimilar code pairs to lower the computational complexity. Next, code normalization is applied to the candidate similar code to remove code variations, so that similar code can be detected at the semantic level. Finally, program matching is performed on the normalized control dependence trees to output semantically similar code. Experimental results show that our approach can detect similar code in the presence of code variations and that it can be applied to large software systems.
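The filter-then-match idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the tuple-based tree representation, the `CANONICAL` label table, the label-count metric vector, the `max_distance` threshold, and the exact-equality matching step are all assumptions made for illustration, and real augmented system dependence graphs carry far more information. Sorting child nodes during normalization is likewise a simplification that ignores data dependences.

```python
from collections import Counter
from itertools import combinations

# Toy control dependence tree: (label, [children]). Hypothetical representation.
def node(label, *children):
    return (label, list(children))

# Hypothetical normalization table: syntactic variants mapped to one canonical label.
CANONICAL = {"for": "loop", "while": "loop", "do-while": "loop",
             "if": "branch", "switch": "branch"}

def normalize(tree):
    """Canonicalize labels and sort children so that simple variations disappear."""
    label, children = tree
    kids = sorted((normalize(c) for c in children), key=repr)
    return (CANONICAL.get(label, label), kids)

def metrics(tree):
    """Count canonical node labels; a stand-in for a real metrics vector."""
    label, children = tree
    counts = Counter([CANONICAL.get(label, label)])
    for child in children:
        counts += metrics(child)
    return counts

def metric_distance(a, b):
    """Sum of absolute differences between two label-count vectors."""
    return sum(abs(a[k] - b[k]) for k in set(a) | set(b))

def candidate_pairs(fragments, max_distance=1):
    """Cheap filter: keep only pairs whose metric vectors are close."""
    vectors = {name: metrics(tree) for name, tree in fragments.items()}
    return [(p, q) for p, q in combinations(sorted(fragments), 2)
            if metric_distance(vectors[p], vectors[q]) <= max_distance]

def similar_pairs(fragments, max_distance=1):
    """Costlier step: compare normalized trees, but only for surviving candidates."""
    return [(p, q) for p, q in candidate_pairs(fragments, max_distance)
            if normalize(fragments[p]) == normalize(fragments[q])]

fragments = {
    "sum_for": node("entry", node("assign"),
                    node("for", node("assign")), node("return")),
    "sum_while": node("entry", node("assign"),
                      node("while", node("assign")), node("return")),
    "max_if": node("entry", node("assign"),
                   node("if", node("assign"), node("assign")),
                   node("return"), node("call")),
}

print(similar_pairs(fragments))
```

In this sketch the metric filter discards both pairs involving `max_if` before any tree comparison takes place, so the structural comparison runs only on the `for`/`while` pair, which becomes identical once both loop forms are mapped to the same canonical label.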

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
Tiantian Wang, Kechao Wang, Xiaohong Su & Peijun Ma
School of Software, Harbin University, Harbin, 150086, China
Kechao Wang

Corresponding author

Correspondence to Tiantian Wang.

Additional information

Tiantian Wang is an associate professor at Harbin Institute of Technology, China. She received her PhD degree from Harbin Institute of Technology in 2009. Her current research interests are program analysis, automatic software debugging, and computer-aided education.

Kechao Wang received his MS degree from Huazhong University of Science and Technology, China, in 2006. Since 2012, he has been a PhD candidate in the Computer Science Department of Harbin Institute of Technology. His current research interests are software fault localization and program analysis.

Xiaohong Su is a professor at Harbin Institute of Technology. She is a senior member of the China Computer Federation. Her main research interests are software bug detection, graphics and image processing, information fusion, and intelligent computation.

Peijun Ma is a professor at Harbin Institute of Technology. His main research interests are software engineering, information fusion, color matching, image processing, and intelligent control.

About this article

Cite this article

Wang, T., Wang, K., Su, X. et al. Detection of semantically similar code. Front. Comput. Sci. 8, 996–1011 (2014). https://doi.org/10.1007/s11704-014-3430-1

Received: 28 October 2013
Accepted: 24 June 2014
Published: 22 October 2014
Issue Date: December 2014
DOI: https://doi.org/10.1007/s11704-014-3430-1
