Combining Holistic Source Code Representation with Siamese Neural Networks for Detecting Code Clones

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13045)

Abstract

Code clones can be defined as two identical pieces of code having the same or similar functionality. Code clone detection is critical to improve and sustain code quality. Current methods are unable to extract semantic and syntactic features and classify code bases satisfactorily. We propose a novel two-stage machine-learning approach towards code clone detection. Firstly, multiple intermediate representations of source code are extracted and combined to generate a holistic embedding based on a recently proposed technique. Next, we use these embeddings to train an Intermediate Merge Siamese Neural Network to detect functional code clones. Siamese Neural Networks are a state-of-the-art machine learning architecture particularly suited to code clone detection. This novel combination allows for learning subtle syntactic and semantic features and identifying previously undetectable similarities. Our solution shows a significant improvement in code clone detection, as shown by experimental evaluation over the OJClone C++ dataset.
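
The core Siamese idea described above (two inputs passed through one shared encoder, then compared) can be illustrated with a minimal sketch. This is not the authors' Intermediate Merge architecture: the single untrained tanh layer, the cosine-similarity comparison, and the names `make_encoder` and `siamese_score` are all assumptions chosen for illustration, and the input vectors stand in for the holistic code embeddings produced in the first stage.

```python
import math
import random


def make_encoder(in_dim, out_dim, seed=0):
    """Build one dense tanh layer with fixed random weights (hypothetical, untrained)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(vec):
        # Each output unit is tanh of a weighted sum of the input embedding.
        return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in weights]

    return encode


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def siamese_score(encode, emb_a, emb_b):
    """Apply the SAME encoder to both embeddings, then compare the two outputs."""
    return cosine_similarity(encode(emb_a), encode(emb_b))


if __name__ == "__main__":
    encode = make_encoder(in_dim=4, out_dim=3)
    snippet_a = [0.2, -0.5, 0.9, 0.1]   # placeholder embedding of one code fragment
    snippet_b = [0.1, -0.4, 0.8, 0.3]   # placeholder embedding of another fragment
    print(siamese_score(encode, snippet_a, snippet_b))
```

Because both branches share one set of weights, the score is symmetric in its two arguments, and a fragment compared with itself scores 1. In a trained network the shared weights would be learned from labelled clone and non-clone pairs, and a threshold on the score would yield the clone decision.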

Notes

  1. All implementation artefacts are available from https://github.com/smit25/Code-Clone-Detection-Using-Intermediate-Merge-Siamese-Network.

Author information

Correspondence to Smit Patel.

Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper

Cite this paper

Patel, S., Sinha, R. (2022). Combining Holistic Source Code Representation with Siamese Neural Networks for Detecting Code Clones. In: Clark, D., Menendez, H., Cavalli, A.R. (eds) Testing Software and Systems. ICTSS 2021. Lecture Notes in Computer Science, vol 13045. Springer, Cham. https://doi.org/10.1007/978-3-031-04673-5_12

  • DOI: https://doi.org/10.1007/978-3-031-04673-5_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04672-8

  • Online ISBN: 978-3-031-04673-5

  • eBook Packages: Computer Science (R0)
