Abstract
Code clones are pairs of code fragments that have the same or similar functionality. Detecting them is critical to improving and sustaining code quality. Current methods do not extract syntactic and semantic features well enough to classify code fragments satisfactorily. We propose a novel two-stage machine-learning approach to code clone detection. First, multiple intermediate representations of the source code are extracted and combined into a holistic embedding, following a recently proposed technique. Next, these embeddings are used to train an Intermediate Merge Siamese Neural Network to detect functional code clones. Siamese Neural Networks are a state-of-the-art machine learning architecture particularly suited to code clone detection. This combination allows the model to learn subtle syntactic and semantic features and to identify similarities that were previously undetectable. Experimental evaluation on the OJClone C++ dataset shows a significant improvement in code clone detection.
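The second stage can be illustrated with a minimal sketch of a Siamese forward pass. It is not the authors' implementation: the class name `SiameseSketch`, the layer sizes, the random untrained weights, and the choice of an element-wise absolute difference as the merge step are all assumptions made for illustration. "Intermediate merge" is read here as merging the two branches at a hidden layer and passing the result through further layers, rather than comparing final outputs with a fixed distance. The input vectors stand in for the holistic embeddings produced by the first stage.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class SiameseSketch:
    """Hypothetical, untrained Siamese network with a merge at a hidden layer."""

    def __init__(self, embed_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # One set of branch weights, shared by both inputs (the Siamese property).
        self.branch = init_layer(embed_dim, hidden_dim, rng)
        # Layers applied after the two branches have been merged.
        self.merged = init_layer(hidden_dim, hidden_dim, rng)
        self.out_w = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
        self.out_b = 0.0

    def predict(self, emb_a, emb_b):
        """Return a score in (0, 1) for a pair of code embeddings."""
        h_a = dense(emb_a, *self.branch)
        h_b = dense(emb_b, *self.branch)
        # Assumed merge step: element-wise absolute difference of hidden states.
        merged = [abs(a - b) for a, b in zip(h_a, h_b)]
        h = dense(merged, *self.merged)
        logit = sum(w * v for w, v in zip(self.out_w, h)) + self.out_b
        return 1.0 / (1.0 + math.exp(-logit))


# Toy embeddings standing in for two code fragments.
model = SiameseSketch(embed_dim=4, hidden_dim=3, seed=1)
score = model.predict([0.1, 0.2, 0.3, 0.4], [0.4, 0.1, 0.0, 0.9])
```

Because both inputs pass through the same branch weights and the merge is symmetric, the score does not depend on the order of the pair. In practice the weights would be trained on labelled clone and non-clone pairs.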
Notes
1. All implementation artefacts are available from https://github.com/smit25/Code-Clone-Detection-Using-Intermediate-Merge-Siamese-Network.
Copyright information
© 2022 IFIP International Federation for Information Processing
About this paper
Cite this paper
Patel, S., Sinha, R. (2022). Combining Holistic Source Code Representation with Siamese Neural Networks for Detecting Code Clones. In: Clark, D., Menendez, H., Cavalli, A.R. (eds) Testing Software and Systems. ICTSS 2021. Lecture Notes in Computer Science, vol 13045. Springer, Cham. https://doi.org/10.1007/978-3-031-04673-5_12
DOI: https://doi.org/10.1007/978-3-031-04673-5_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04672-8
Online ISBN: 978-3-031-04673-5
eBook Packages: Computer Science, Computer Science (R0)