ABSTRACT
Clone detection of source code is among the most fundamental software engineering techniques. Despite intensive research over the past decade, existing techniques remain unsatisfactory at detecting "functional" code clones. In particular, they cannot efficiently extract syntactic and semantic information from source code. In this paper, we propose a novel joint code representation that applies fusion embedding techniques to learn hidden syntactic and semantic features of source code. We also introduce a new granularity for functional code clone detection: our approach treats a set of methods connected by caller-callee relationships as one functionality, while a method with no caller-callee relationship to any other method represents a single functionality on its own. We then train a supervised deep learning model to detect functional code clones. We evaluate the approach on a large dataset of C++ programs, and the experimental results show that fusion learning significantly outperforms state-of-the-art techniques in detecting functional code clones.
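The functionality granularity described above can be illustrated with a minimal sketch. It assumes that "connected" means membership in the same connected component of the call graph when caller-callee edges are treated as undirected; the abstract does not spell this out, so this is one plausible reading and not necessarily the authors' implementation. The function name `group_functionalities` and the example method names are hypothetical.

```python
from collections import defaultdict


def group_functionalities(methods, calls):
    """Group methods into functionalities.

    Assumption: a functionality is a connected component of the call
    graph with caller-callee edges treated as undirected. A method with
    no call edges forms a functionality by itself.
    """
    # Build an undirected adjacency map from (caller, callee) pairs.
    neighbors = defaultdict(set)
    for caller, callee in calls:
        neighbors[caller].add(callee)
        neighbors[callee].add(caller)

    seen = set()
    groups = []
    for method in methods:
        if method in seen:
            continue
        # Depth-first traversal collects every method reachable from `method`.
        component = []
        stack = [method]
        seen.add(method)
        while stack:
            current = stack.pop()
            component.append(current)
            for nxt in sorted(neighbors[current]):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(component))
    return groups


# Hypothetical program: main calls parse, parse calls tokenize,
# and helper neither calls nor is called by any other method.
methods = ["main", "parse", "tokenize", "helper"]
calls = [("main", "parse"), ("parse", "tokenize")]
functionalities = group_functionalities(methods, calls)
print(functionalities)
```

Under this reading, `main`, `parse`, and `tokenize` form one functionality, and `helper` forms a second one alone. Clone detection would then compare functionalities rather than individual methods.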
Index Terms
- Functional code clone detection with syntax and semantics fusion learning