DOI: 10.1145/3324884.3416562
research-article

SCDetector: software functional clone detection based on semantic tokens analysis

Published: 27 January 2021

ABSTRACT

Code clone detection aims to find code fragments with similar functionality, and it has become increasingly important in software engineering. Many approaches have been proposed to detect code clones. Among them, token-based methods are the most scalable, but they cannot handle semantic clones because they do not consider program semantics. To address this issue, researchers apply program analysis to distill program semantics into a graph representation and detect clones by matching the graphs. However, such approaches scale poorly because graph matching is typically time-consuming.

In this paper, we propose SCDetector, which combines the scalability of token-based methods with the accuracy of graph-based methods for software functional clone detection. Given the source code of a function, we first extract its control flow graph by static analysis. Instead of using traditional heavyweight graph matching, we treat the graph as a social network and apply social-network-centrality analysis to compute the centrality of each basic block. We then assign that centrality to each token in the basic block and sum the centrality of the same token across different basic blocks. In this way, a graph is turned into a set of tokens annotated with graph details (i.e., centrality), called semantic tokens. Finally, these semantic tokens are fed into a Siamese neural network to train a code clone detector. We evaluate SCDetector on two large datasets of functionally similar code. Experimental results indicate that our system is superior to four state-of-the-art methods (i.e., SourcererCC, Deckard, RtvNN, and ASTNN) and that the time cost of SCDetector is 14 times less than that of a traditional graph-based method (i.e., CCSharp) when detecting semantic clones.
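
The token-weighting step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which centrality measure is used, so degree centrality is assumed here, and the toy control flow graph, the block names (`B0` to `B3`), and the helper names `degree_centrality` and `semantic_tokens` are all hypothetical. The sketch also assumes that every occurrence of a token contributes its block's centrality.

```python
from collections import defaultdict


def degree_centrality(nodes, edges):
    """Assumed measure: (in-degree + out-degree) / (n - 1) for each node."""
    degree = {node: 0 for node in nodes}
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    scale = len(nodes) - 1
    return {node: (deg / scale if scale > 0 else 0.0)
            for node, deg in degree.items()}


def semantic_tokens(block_tokens, edges):
    """Give each token its block's centrality and sum per token over blocks."""
    centrality = degree_centrality(list(block_tokens), edges)
    totals = defaultdict(float)
    for block, tokens in block_tokens.items():
        for token in tokens:
            totals[token] += centrality[block]
    return dict(totals)


# Hypothetical CFG for: i = 0; while (i < n) { i++; } return i;
block_tokens = {
    "B0": ["i", "=", "0"],           # entry block
    "B1": ["while", "i", "<", "n"],  # loop condition
    "B2": ["i", "++"],               # loop body
    "B3": ["return", "i"],           # exit block
}
edges = [("B0", "B1"), ("B1", "B2"), ("B2", "B1"), ("B1", "B3")]

tokens = semantic_tokens(block_tokens, edges)
for token, weight in sorted(tokens.items()):
    print(f"{token}: {weight:.3f}")
```

In this toy graph the loop-condition block has the most edges, so tokens that appear there receive the largest weights. The result is a flat token-to-weight mapping that can be compared without any graph matching, which is the property the abstract relies on for scalability.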

    Published in

      ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering
      December 2020
      1449 pages
      ISBN: 9781450367684
      DOI: 10.1145/3324884

      Copyright © 2020 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 82 of 337 submissions, 24%
