research-article

SCDetector: software functional clone detection based on semantic tokens analysis

Authors:
Yueming Wu, Huazhong University of Science and Technology, China
Deqing Zou, Huazhong University of Science and Technology, China
Shihan Dou, Huazhong University of Science and Technology, China
Siru Yang, Huazhong University of Science and Technology, China
Wei Yang, University of Texas at Dallas
Feng Cheng, Huazhong University of Science and Technology, China
Hong Liang, Huazhong University of Science and Technology, China
Hai Jin, Huazhong University of Science and Technology, China

ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, December 2020, Pages 821–833
https://doi.org/10.1145/3324884.3416562

Published: 27 January 2021

ABSTRACT

Code clone detection aims to find code fragments with similar functionality and has become increasingly important in software engineering. Many approaches have been proposed to detect code clones. Among them, token-based methods are the most scalable but cannot handle semantic clones because they do not consider program semantics. To address this issue, researchers apply program analysis to distill program semantics into a graph representation and detect clones by matching the graphs. However, such approaches scale poorly because graph matching is typically time-consuming.

In this paper, we propose SCDetector, which combines the scalability of token-based methods with the accuracy of graph-based methods for software functional clone detection. Given the source code of a function, we first extract its control flow graph by static analysis. Instead of using traditional heavyweight graph matching, we treat the graph as a social network and apply social-network-centrality analysis to compute the centrality of each basic block. We then assign that centrality to each token in the basic block and sum the centrality of the same token across different basic blocks. In this way, a graph is turned into a set of tokens carrying graph details (i.e., centrality), called semantic tokens. Finally, these semantic tokens are fed into a Siamese neural network to train a code clone detector. We evaluate SCDetector on two large datasets of functionally similar code. Experimental results indicate that our system outperforms four state-of-the-art methods (i.e., SourcererCC, Deckard, RtvNN, and ASTNN) and that its time cost is 14 times lower than that of a traditional graph-based method (i.e., CCSharp) when detecting semantic clones.
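The semantic-token step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which centrality measure is used, so degree centrality is assumed here. The function names, the block labels B0 to B3, the toy control flow graph, and the choice to count every occurrence of a token in a block are all hypothetical.

```python
from collections import defaultdict


def degree_centrality(nodes, edges):
    """Assumed measure: (in-degree + out-degree) / (n - 1) for each node."""
    degree = {node: 0 for node in nodes}
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    scale = 1.0 / (len(nodes) - 1) if len(nodes) > 1 else 0.0
    return {node: d * scale for node, d in degree.items()}


def semantic_tokens(blocks, edges):
    """Map each token to the summed centrality of the blocks it occurs in.

    blocks: dict of basic-block id -> list of tokens in that block
    edges:  list of (source block id, target block id) control flow edges
    """
    centrality = degree_centrality(list(blocks), edges)
    weights = defaultdict(float)
    for block_id, tokens in blocks.items():
        for token in tokens:
            # Assumption: every occurrence of a token adds its block's centrality.
            weights[token] += centrality[block_id]
    return dict(weights)


# Hypothetical CFG for: if (x > 0) y = x; else y = 0; return y;
blocks = {
    "B0": ["if", "x", ">", "0"],
    "B1": ["y", "=", "x"],
    "B2": ["y", "=", "0"],
    "B3": ["return", "y"],
}
edges = [("B0", "B1"), ("B0", "B2"), ("B1", "B3"), ("B2", "B3")]

tokens = semantic_tokens(blocks, edges)
```

In this toy graph every block has two incident edges, so tokens that occur in more blocks (such as `y`) receive a larger weight than tokens that occur once (such as `if`). The resulting token-to-weight mapping is the kind of input that would then be encoded and passed to the Siamese network.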

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Software maintenance tools

Published in
ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering
December 2020
1449 pages
ISBN: 9781450367684
DOI: 10.1145/3324884
General Chair: John Grundy
Program Chairs: Claire Le Goues, David Lo
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
semantic tokens
siamese network
social network centrality
Acceptance Rates
Overall Acceptance Rate: 82 of 337 submissions, 24%