Functional code clone detection with syntax and semantics fusion learning

Authors:
Chunrong Fang, Nanjing University, China
Zixi Liu, Nanjing University, China
Yangyang Shi, Nanjing University, China
Jeff Huang, Texas A&M University, USA
Qingkai Shi, Hong Kong University of Science and Technology, China

ISSTA 2020: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, July 2020, Pages 516–527
https://doi.org/10.1145/3395363.3397362

Published: 18 July 2020

ABSTRACT

Clone detection of source code is among the most fundamental software engineering techniques. Despite intensive research in the past decade, existing techniques remain unsatisfactory at detecting "functional" code clones. In particular, they cannot efficiently extract syntactic and semantic information from source code. In this paper, we propose a novel joint code representation that applies fusion embedding techniques to learn hidden syntactic and semantic features of source code. We also introduce a new granularity for functional code clone detection: our approach treats a group of methods connected by caller-callee relationships as one functionality, and treats a method that has no caller-callee relationship with any other method as a single functionality on its own. We then train a supervised deep learning model to detect functional code clones. We evaluate the approach on a large dataset of C++ programs, and the experimental results show that fusion learning can significantly outperform state-of-the-art techniques in detecting functional code clones.
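
The functionality granularity described in the abstract can be pictured as grouping methods by connectivity in a call graph. The following is only a minimal sketch of that idea, not the authors' implementation: the function name `group_functionalities`, the input format (method names plus caller-callee pairs), and the choice to treat call edges as undirected when deciding what is "connected" are all assumptions made for illustration.

```python
from collections import defaultdict


def group_functionalities(methods, calls):
    """Group methods into functionalities (hypothetical helper).

    methods: iterable of method names (illustrative identifiers).
    calls:   iterable of (caller, callee) pairs.
    Returns a list of sorted name lists, one per functionality.
    """
    # Assumption: caller-callee edges are treated as undirected for
    # grouping; the abstract only says the methods are "connected".
    neighbours = defaultdict(set)
    for caller, callee in calls:
        neighbours[caller].add(callee)
        neighbours[callee].add(caller)

    seen = set()
    groups = []
    for start in sorted(set(methods) | set(neighbours)):
        if start in seen:
            continue
        # Depth-first traversal collects every method reachable from
        # `start` through call edges in either direction.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(component))
    return groups


if __name__ == "__main__":
    methods = ["main", "parse", "tokenize", "log"]
    calls = [("main", "parse"), ("parse", "tokenize")]
    print(group_functionalities(methods, calls))
```

In this toy input, `main`, `parse`, and `tokenize` are linked by calls and form one functionality, while `log` has no caller-callee relationship with any other method and forms a functionality by itself.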

Index Terms

Functional code clone detection with syntax and semantics fusion learning
1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems
      1. Software management
        Software maintenance
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Maintaining software
  2. Software organization and properties
    1. Software functional properties
      1. Correctness
        Functionality

Published in
ISSTA 2020: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2020
591 pages
ISBN: 9781450380089
DOI: 10.1145/3395363
General Chair: Sarfraz Khurshid, University of Texas at Austin, USA
Program Chair: Corina S. Păsăreanu, Carnegie Mellon University Silicon Valley / NASA Ames Research Center, USA
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Code clone detection
code representation
functional clone detection
syntax and semantics fusion learning
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 58 of 213 submissions, 27%