Combining Holistic Source Code Representation with Siamese Neural Networks for Detecting Code Clones

Patel, Smit; Sinha, Roopak

doi:10.1007/978-3-031-04673-5_12

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13045)

Included in the following conference series:

IFIP International Conference on Testing Software and Systems

Abstract

Code clones can be defined as two identical pieces of code having the same or similar functionality. Code clone detection is critical to improving and sustaining code quality. Current methods cannot satisfactorily extract semantic and syntactic features or classify code bases. We propose a novel two-stage machine-learning approach to code clone detection. First, multiple intermediate representations of source code are extracted and combined to generate a holistic embedding based on a recently proposed technique. Next, we use these embeddings to train an Intermediate Merge Siamese Neural Network to detect functional code clones. Siamese Neural Networks are a state-of-the-art machine learning architecture particularly suited to code clone detection. This novel combination allows for learning subtle syntactic and semantic features and identifying previously undetectable similarities. Experimental evaluation on the OJClone C++ dataset shows that our solution significantly improves code clone detection.
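To make the second stage concrete, the following is a minimal sketch of how a Siamese network scores a pair of code embeddings. It is not the authors' implementation: the abstract does not specify the layer sizes, the merge operation, or the training procedure, so the names make_encoder, merge and clone_score, the element-wise absolute difference used as the merge, and the untrained random weights are all assumptions made for illustration. The only property taken from the text is that both inputs pass through the same (weight-sharing) branch before their intermediate representations are merged and classified.

```python
import math
import random


def make_encoder(in_dim, hidden_dim, seed=0):
    """Build one encoder branch. Both inputs reuse it, so weights are shared.

    The weights are random and untrained (illustrative assumption).
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(hidden_dim)]
    bias = [0.0] * hidden_dim

    def encode(x):
        # One dense layer with tanh activation.
        return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(weights, bias)]

    return encode


def merge(h_a, h_b):
    """Merge the two hidden representations.

    Element-wise absolute difference is an assumed choice, not the paper's.
    """
    return [abs(a - b) for a, b in zip(h_a, h_b)]


def clone_score(encode, head_weights, head_bias, emb_a, emb_b):
    """Return a value in (0, 1); higher means 'more likely a clone pair'."""
    merged = merge(encode(emb_a), encode(emb_b))
    logit = sum(w * m for w, m in zip(head_weights, merged)) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical 4-dimensional "holistic embeddings" of two code fragments.
encode = make_encoder(in_dim=4, hidden_dim=3)
head_weights = [-2.0, -2.0, -2.0]  # larger differences lower the score
head_bias = 1.0
frag_a = [0.2, -0.1, 0.7, 0.4]
frag_b = [0.9, 0.3, -0.5, 0.1]

print(clone_score(encode, head_weights, head_bias, frag_a, frag_a))
print(clone_score(encode, head_weights, head_bias, frag_a, frag_b))
```

Because both fragments go through the same encoder and the merge is symmetric, the score does not depend on the order of the pair, which is the property that makes this family of architectures a natural fit for pairwise similarity tasks.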

Notes

1. All implementation artefacts are available from https://github.com/smit25/Code-Clone-Detection-Using-Intermediate-Merge-Siamese-Network.

Author information

Authors and Affiliations

Indian Institute of Technology Indore, Indore, India
Smit Patel
IT & Software Engineering, Auckland University of Technology, Auckland, New Zealand
Roopak Sinha

Corresponding author

Correspondence to Smit Patel.

Editor information

Editors and Affiliations

University College London, London, UK
David Clark
Middlesex University, London, UK
Hector Menendez
Telecom SudParis, Evry Cedex, France
Ana Rosa Cavalli

About this paper

Cite this paper

Patel, S., Sinha, R. (2022). Combining Holistic Source Code Representation with Siamese Neural Networks for Detecting Code Clones. In: Clark, D., Menendez, H., Cavalli, A.R. (eds) Testing Software and Systems. ICTSS 2021. Lecture Notes in Computer Science, vol 13045. Springer, Cham. https://doi.org/10.1007/978-3-031-04673-5_12

DOI: https://doi.org/10.1007/978-3-031-04673-5_12
Published: 10 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04672-8
Online ISBN: 978-3-031-04673-5
eBook Packages: Computer Science, Computer Science (R0)
