DOI: 10.1145/3395363.3397362 (ISSTA conference proceedings)

Functional code clone detection with syntax and semantics fusion learning

Published: 18 July 2020

ABSTRACT

Source code clone detection is among the most fundamental software engineering techniques. Despite intensive research over the past decade, existing techniques remain unsatisfactory at detecting "functional" code clones. In particular, they cannot efficiently extract syntactic and semantic information from source code. In this paper, we propose a novel joint code representation that applies fusion embedding techniques to learn hidden syntactic and semantic features of source code. We also introduce a new granularity for functional code clone detection: methods connected by caller-callee relationships are treated together as one functionality, and a method with no caller-callee relationship to any other method is treated as a single functionality on its own. We then train a supervised deep learning model to detect functional code clones. We evaluate the approach on a large dataset of C++ programs, and the experimental results show that fusion learning can significantly outperform state-of-the-art techniques in detecting functional code clones.
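The functionality granularity described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "connected" means connected when call direction is ignored, and the function name `group_functionalities`, the input format, and the example method names are all hypothetical.

```python
from collections import defaultdict


def group_functionalities(methods, calls):
    """Group methods into functionalities (illustrative sketch only).

    methods: iterable of method names.
    calls:   iterable of (caller, callee) pairs.

    Assumption: two methods belong to the same functionality when they are
    linked by a chain of caller-callee relationships, ignoring direction.
    A method with no caller-callee relationship forms its own group.
    """
    # Build an undirected adjacency map from the call pairs.
    neighbors = defaultdict(set)
    for caller, callee in calls:
        neighbors[caller].add(callee)
        neighbors[callee].add(caller)

    seen = set()
    groups = []
    for method in methods:
        if method in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack = [method]
        seen.add(method)
        component = []
        while stack:
            current = stack.pop()
            component.append(current)
            for nxt in sorted(neighbors[current]):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(component))
    return groups


if __name__ == "__main__":
    # Hypothetical program: main -> parse -> tokenize, plus an isolated log().
    methods = ["main", "parse", "tokenize", "log"]
    calls = [("main", "parse"), ("parse", "tokenize")]
    print(group_functionalities(methods, calls))
```

In this sketch, the three methods linked by calls form one functionality, while the isolated method forms a functionality by itself. How the paper actually extracts call relationships from C++ code is not stated in the abstract.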
