Unsupervised Graph Neural Networks for Source Code Similarity Detection

Cassagne, Julien; Merlo, Ettore; Branco, Paula; Jourdan, Guy-Vincent; Onut, Iosif-Viorel

doi:10.1007/978-3-031-45275-8_36

Julien Cassagne ORCID: orcid.org/0000-0002-6617-9010¹²,
Ettore Merlo¹²,
Paula Branco¹³,
Guy-Vincent Jourdan¹³ &
Iosif-Viorel Onut¹⁴

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14276)

Included in the following conference series:

International Conference on Discovery Science

Abstract

In this paper, we propose a novel unsupervised approach for code similarity and clone detection based on Graph Neural Networks. The approach is hybrid: it detects similarities within source code using centroid distances and a Graph Auto-Encoder that takes raw abstract syntax trees as input. Compared to $\mathrm{R_{TV}NN}$ [33], the state-of-the-art unsupervised approach for code similarity and clone detection, our method significantly improves training and inference time efficiency while preserving or improving precision. In our experiments, our algorithm is on average 77 times faster during training and 21 times faster during inference. This shows that Graph Auto-Encoders are the better option for source code similarity analysis in an industrial context or a production environment. We illustrate this by using our approach to compute source code similarity within a large dataset of phishing kits written in PHP, provided by our industry partner.
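
The abstract does not spell out how the centroid distance is computed, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It parses Python source with the standard `ast` module (the paper's dataset is PHP), replaces the trained Graph Auto-Encoder encoder with a single untrained round of neighbor averaging over one-hot node types, and compares two snippets by the Euclidean distance between the centroids of their node embeddings. All function names (`ast_graph`, `embed`, `centroid`, `centroid_distance`) are hypothetical.

```python
import ast
import math


def ast_graph(source):
    """Return (node_types, edges) for the raw AST of a Python snippet."""
    types, edges = [], []

    def visit(node, parent):
        idx = len(types)
        types.append(type(node).__name__)
        if parent is not None:
            edges.append((parent, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(ast.parse(source), None)
    return types, edges


def embed(types, edges, vocab):
    """Assumption: one untrained message-passing step stands in for a
    learned encoder. Each node averages the one-hot type vectors of
    itself and its neighbors."""
    neighbors = [[i] for i in range(len(types))]  # self-loops
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    vectors = []
    for nbrs in neighbors:
        vec = [0.0] * len(vocab)
        for j in nbrs:
            vec[vocab[types[j]]] += 1.0 / len(nbrs)
        vectors.append(vec)
    return vectors


def centroid(vectors):
    """Mean of all node embeddings of one graph."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def centroid_distance(src_a, src_b):
    """Euclidean distance between the embedding centroids of two snippets."""
    graph_a, graph_b = ast_graph(src_a), ast_graph(src_b)
    labels = sorted(set(graph_a[0]) | set(graph_b[0]))
    vocab = {label: i for i, label in enumerate(labels)}
    return math.dist(centroid(embed(*graph_a, vocab)),
                     centroid(embed(*graph_b, vocab)))


original = "def add(x, y):\n    return x + y\n"
renamed = "def plus(p, q):\n    return p + q\n"   # same structure, new names
different = "for i in range(10):\n    print(i)\n"

print(centroid_distance(original, renamed))
print(centroid_distance(original, different))
```

Because this sketch only looks at node types and tree structure, a clone that merely renames identifiers has distance zero from the original, while structurally different code lands farther away. A trained encoder would replace `embed` with learned node embeddings.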

References

[1] Repository. https://gitlab.com/polymtl-static-analysis/vgae-code-analysis
[2] Baxter, I., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using abstract syntax trees. In: Proceedings of the International Conference on Software Maintenance (Cat. No. 98CB36272), pp. 368–377 (1998)
[3] Ducasse, S., Nierstrasz, O., Rieger, M.: On the effectiveness of clone detection by string matching: research articles. J. Softw. Maint. Evol. 18(1) (2006)
[4] Feng, S., Duarte, M.F.: Graph autoencoder-based unsupervised feature selection with broad and local data structure preservation. Neurocomputing (2018)
[5] Fey, M., Lenssen, J.E.: Fast Graph Representation Learning with PyTorch Geometric (2019)
[6] Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. CoRR abs/1704.01212 (2017)
[7] Gori, M., Monfardini, G., Scarselli, F.: A new model for learning in graph domains. In: Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (2005)
[8] Jiang, S., Hong, Y., Fu, C., Qian, Y., Han, L.: Function-level obfuscation detection method based on graph convolutional networks. J. Inf. Secur. Appl. 61, 102953 (2021)
[9] Kingma, D.P., Welling, M.: Auto-encoding variational bayes (2014)
[10] Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv:1611.07308 [cs, stat] (2016)
[11] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2017)
[12] Li, Y., Gu, C., Dullien, T., Vinyals, O., Kohli, P.: Graph matching networks for learning the similarity of graph structured objects (2019)
[13] Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks (2015)
[14] Liu, C., Lin, Z., Lou, J.G., Wen, L., Zhang, D.: Can neural clone detection generalize to unseen functionalities? In: 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 617–629 (2021)
[15] Liu, S.: A unified framework to learn program semantics with graph neural networks. In: 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2020)
[16] Ma, G., Ahmed, N.K., Willke, T.L., Yu, P.S.: Deep graph similarity learning: a survey. Data Min. Knowl. Disc. 35(3), 688–725 (2021)
[17] McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (2020)
[18] Mehrotra, N., Agarwal, N., Gupta, P., Anand, S., Lo, D., Purandare, R.: Modeling functional similarity in source code with graph-based siamese networks. arXiv:2011.11228 [cs] (2020)
[19] Merlo, E., Antoniol, G., Di Penta, M., Rollo, V.: Linear complexity object-oriented similarity for clone detection and software evolution analyses. In: Proceedings of the 20th IEEE International Conference on Software Maintenance, pp. 412–416 (2004)
[20] Nair, A., Roy, A., Meinke, K.: funcGNN: a graph neural network approach to program similarity. In: Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1–11 (2020). arXiv:2007.13239
[21] Nguyen, V.A., Nguyen, D.Q., Nguyen, V., Le, T., Tran, Q.H., Phung, D.: ReGVD: revisiting graph neural networks for vulnerability detection. In: 2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (2022)
[22] Pan, S., Hu, R., Long, G., Jiang, J., Yao, L., Zhang, C.: Adversarially regularized graph autoencoder for graph embedding. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018. AAAI Press (2018)
[23] Park, J., Lee, M., Chang, H., Lee, K., Choi, J.: Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
[24] Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019)
[25] Roy, C.K., Cordy, J.R., Koschke, R.: Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci. Comput. Program. 74(7), 470–495 (2009)
[26] Rozi, M.F., Ban, T., Ozawa, S., Kim, S., Takahashi, T., Inoue, D.: JStrack: enriching malicious JavaScript detection based on AST graph analysis and attention mechanism. In: Neural Information Processing: ICONIP (2021)
[27] Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. Neural Netw. 20(1), 61–80 (2009)
[28] Siow, J.K., Liu, S., Xie, X., Meng, G., Liu, Y.: Learning program semantics with code representations: an empirical study. In: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 554–565 (2022)
[29] Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China, pp. 1556–1566. Association for Computational Linguistics (2015)
[30] Wang, L., et al.: Inductive and unsupervised representation learning on graph structured objects. In: International Conference on Learning Representations (2020)
[31] Wang, W., Li, G., Ma, B., Xia, X., Jin, Z.: Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 261–271 (2020)
[32] Wei, H., Li, M.: Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017 (2017)
[33] White, M., Tufano, M., Vendome, C., Poshyvanyk, D.: Deep learning code fragments for code clone detection. In: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 87–98 (2016)
[34] Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.S.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 4–24 (2020)
[35] Yahya, M.A., Kim, D.K.: CLCD-I: cross-language clone detection by using deep learning with infercode. Computers 12(1) (2023)
[36] Yu, H., Lam, W., Chen, L., Li, G., Xie, T., Wang, Q.: Neural detection of semantic code clones via tree-based convolution. In: 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pp. 70–80 (2019)
[37] Zeng, J., Ben, K., Li, X., Zhang, X.: Fast code clone detection based on weighted recursive autoencoders. IEEE Access 7, 125062–125078 (2019)
[38] Zhang, J., Wang, X., Zhang, H., Sun, H., Wang, K., Liu, X.: A novel neural source code representation based on abstract syntax tree. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 783–794 (2019)
[39] Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Sun, M.: Graph neural networks: a review of methods and applications. AI Open 1, 57–81 (2020)

Author information

Authors and Affiliations

Polytechnique Montreal, Montreal, Canada
Julien Cassagne & Ettore Merlo
University of Ottawa, Ottawa, Canada
Paula Branco & Guy-Vincent Jourdan
IBM Centre for Advanced Studies, Toronto, Canada
Iosif-Viorel Onut

Corresponding author

Correspondence to Julien Cassagne.

Editor information

Editors and Affiliations

Waikato University, Hamilton, New Zealand
Albert Bifet
Aeronautics Institute of Technology, São José dos Campos, Brazil
Ana Carolina Lorena
University of Porto, Porto, Portugal
Rita P. Ribeiro
University of Porto, Porto, Portugal
João Gama
University of Coimbra, Coimbra, Portugal
Pedro H. Abreu

Cite this paper

Cassagne, J., Merlo, E., Branco, P., Jourdan, GV., Onut, IV. (2023). Unsupervised Graph Neural Networks for Source Code Similarity Detection. In: Bifet, A., Lorena, A.C., Ribeiro, R.P., Gama, J., Abreu, P.H. (eds) Discovery Science. DS 2023. Lecture Notes in Computer Science, vol 14276. Springer, Cham. https://doi.org/10.1007/978-3-031-45275-8_36

DOI: https://doi.org/10.1007/978-3-031-45275-8_36
Published: 08 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45274-1
Online ISBN: 978-3-031-45275-8
eBook Packages: Computer Science, Computer Science (R0)
