A Machine Learning Approach for Source Code Similarity via Graph-Focused Features

Boldini, Giacomo; Diana, Alessio; Arceri, Vincenzo; Bonnici, Vincenzo; Bagnara, Roberto

doi:10.1007/978-3-031-53969-5_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14505)

Included in the following conference series:

International Conference on Machine Learning, Optimization, and Data Science

Abstract

Source code similarity aims to recognize common characteristics between two programs by comparing their components. It plays a significant role in many software development and analysis activities that can assist software teams working on large codebases. Existing approaches compute the similarity between two programs from a suitable representation that captures their syntactic and semantic properties. However, they lack explainability and do not generalize to comparisons across multiple languages. Here, we present a preliminary result: a graph-focused representation of code that makes clustering and classification of programs possible while offering explainability and generalizability.

Notes

1. In C/C++, executable programs are obtained by linking together the code coming from a complete set of translation units. A translation unit is the portion of a program a compiler operates upon, and is constituted by a main file (typically with a .c/.C/.cxx/.cpp extension) along with all header files (typically with a .h/.H/.hpp extension) that the main file includes, directly or indirectly. A prerequisite of our approach is that the translation units to be analyzed are complete, so that the Clang compiler can process them without errors.
2. This phase is managed by using a custom LLVM-IR parser. The parser is generated using ANTLR [21] starting from the LLVM-IR 7.0.0 grammar.
3. In LLVM-IR, op-code refers to the instruction code (or name) that specifies the operation to be performed by the instruction.
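
To make the notion of op-code in the notes above concrete, the following is a minimal sketch that counts op-codes in textual LLVM-IR (which Clang can emit for a complete translation unit with `clang -S -emit-llvm`). It is not the ANTLR-generated parser mentioned in the notes: it is a line-based heuristic, and the names `opcode_histogram` and `SAMPLE` are hypothetical. It assumes one instruction per line, so multi-line instructions such as `switch` and quoted value names are not handled.

```python
import re
from collections import Counter

# Optional "<result> = " prefix of an instruction, e.g. "%sum = " or "%0 = ".
_RESULT_PREFIX = re.compile(r'^%[\w.$-]+\s*=\s*')
# Markers that may precede "call" and are not op-codes themselves.
_CALL_MARKERS = {"tail", "musttail", "notail"}


def opcode_histogram(llvm_ir: str) -> Counter:
    """Count the op-codes of instructions found inside function bodies.

    Heuristic sketch only: assumes one instruction per line.
    """
    counts = Counter()
    in_function = False
    for raw in llvm_ir.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("define "):
            in_function = line.endswith("{")
            continue
        if line == "}":
            in_function = False
            continue
        if not in_function or line.endswith(":"):  # skip globals and labels
            continue
        tokens = _RESULT_PREFIX.sub("", line).split()
        if not tokens:
            continue
        if tokens[0] in _CALL_MARKERS and len(tokens) > 1:
            tokens = tokens[1:]  # "tail call ..." counts as "call"
        counts[tokens[0]] += 1
    return counts


# Hypothetical hand-written IR fragment used only for illustration.
SAMPLE = """
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add nsw i32 %a, %b
  %twice = add nsw i32 %sum, %sum
  ret i32 %twice
}
"""

if __name__ == "__main__":
    print(opcode_histogram(SAMPLE))
```

Such a histogram is only one simple way to summarize op-codes; it makes no claim about the features the approach actually derives from the parsed LLVM-IR.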

References

Allamanis, M.: The adverse effects of code duplication in machine learning models of code. In: Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153 (2019). https://doi.org/10.1145/3359591.3359735
Alon, U., et al.: Code2vec: learning distributed representations of code. Proc. ACM Program. Lang. 3(POPL) (2019). https://doi.org/10.1145/3290353
Arceri, V., Mastroeni, I.: Analyzing dynamic code: a sound abstract interpreter for Evil eval. ACM Trans. Priv. Secur. 24(2), 10:1–10:38 (2021). https://doi.org/10.1145/3426470
Arceri, V., Olliaro, M., Cortesi, A., Mastroeni, I.: Completeness of abstract domains for string analysis of JavaScript programs. In: Hierons, R.M., Mosbah, M. (eds.) ICTAC 2019. LNCS, vol. 11884, pp. 255–272. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32505-3_15
Bonnici, V., et al.: Enhancing graph database indexing by suffix tree structure. In: Dijkstra, T.M.H., Tsivtsivadze, E., Marchiori, E., Heskes, T. (eds.) PRIB 2010. LNCS, pp. 195–203. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16001-1_17
Dalla Preda, M., et al.: Abstract symbolic automata: Mixed syntactic/semantic similarity analysis of executables. In: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 329–341 (2015). https://doi.org/10.1145/2676726.2676986
Rattan, D., et al.: Software clone detection: a systematic review. Inf. Softw. Technol. 55(7), 1165–1199 (2013). https://doi.org/10.1016/j.infsof.2013.01.008
Nielson, F., et al.: Principles of Program Analysis. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-03811-6
Geurts, P., et al.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006). https://doi.org/10.1007/s10994-006-6226-1
Giugno, R., et al.: Grapes: a software for parallel searching on biological graphs targeting multi-core architectures. PLoS ONE 8(10), e76911 (2013)
Hubert, L.J., Arabie, P.: Comparing partitions. J. Classif. 2, 193–218 (1985)
Pewny, J., et al.: Leveraging semantic signatures for bug search in binary programs. In: Proceedings of the 30th Annual Computer Security Applications Conference, pp. 406–415 (2014). https://doi.org/10.1145/2664243.2664269
Zeng, J., et al.: Fast code clone detection based on weighted recursive autoencoders. IEEE Access 7, 125062–125078 (2019). https://doi.org/10.1109/ACCESS.2019.2938825
Krinke, J., Ragkhitwetsagul, C.: Code similarity in clone detection. In: Inoue, K., Roy, C.K. (eds.) Code Clone Analysis, pp. 135–150. Springer, Singapore (2021). https://doi.org/10.1007/978-981-16-1927-4_10
Lei, M., et al.: Deep learning application on code clone detection: a review of current knowledge. J. Syst. Softw. 184, 111141 (2022). https://doi.org/10.1016/j.jss.2021.111141
Licheri, N., et al.: GRAPES-DD: exploiting decision diagrams for index-driven search in biological graph databases. BMC Bioinform. 22, 1–24 (2021)
Liu, F.T., et al.: Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422 (2008). https://doi.org/10.1109/ICDM.2008.17
Mikolov, T., et al.: Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, 2–4 May 2013, Workshop Track Proceedings (2013)
Müllner, D.: Modern hierarchical, agglomerative clustering algorithms (2011)
Narayanan, A., et al.: graph2vec: learning distributed representations of graphs. CoRR abs/1707.05005 (2017)
Parr, T.J., Quong, R.W.: ANTLR: a predicated-LL(k) parser generator. Softw. Pract. Exp. 25(7), 789–810 (1995). https://doi.org/10.1002/spe.4380250705
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Puri, R., et al.: Project CodeNet: a large-scale AI for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655 (2021)
Roy, C.K., Cordy, J.R.: A survey on software clone detection research. Queen’s Sch. Comput. TR 541(115), 64–68 (2007)
Saini, N., et al.: Code clones: detection and management. Procedia Comput. Sci. 132, 718–727 (2018). https://doi.org/10.1016/j.procs.2018.05.080. International Conference on Computational Intelligence and Data Science
The LLVM Development Team: LLVM Language Reference Manual (Version 7.0.0) (2018)
Đurić, Z., Gašević, D.: A source code similarity system for plagiarism detection. Comput. J. 56(1), 70–86 (2013). https://doi.org/10.1093/comjnl/bxs018
Virtanen, P., et al.: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020). https://doi.org/10.1038/s41592-019-0686-2

Acknowledgement

This project has been partially funded by the University of Parma (Italy), project number MUR_DM737_B_MAFI_BONNICI. V. Bonnici is partially supported by INdAM-GNCS, project number CUP_E55F22000270001, and by the CINI InfoLife laboratory.

Author information

Authors and Affiliations

Department of Mathematical, Physical and Computer Sciences, University of Parma, Parco Area delle Scienze, 53/A, 43124, Parma, Italy
Giacomo Boldini, Alessio Diana, Vincenzo Arceri, Vincenzo Bonnici & Roberto Bagnara

Corresponding author

Correspondence to Vincenzo Bonnici.

Editor information

Editors and Affiliations

University of Catania, Catania, Italy
Giuseppe Nicosia
Newcastle University, Newcastle upon Tyne, UK
Varun Ojha
University of Oxford, Oxford, UK
Emanuele La Malfa
University of Cambridge, Cambridge, UK
Gabriele La Malfa
University of Florida, Gainesville, FL, USA
Panos M. Pardalos
Dana-Farber Cancer Institute, Boston, MA, USA
Renato Umeton

About this paper

Cite this paper

Boldini, G., Diana, A., Arceri, V., Bonnici, V., Bagnara, R. (2024). A Machine Learning Approach for Source Code Similarity via Graph-Focused Features. In: Nicosia, G., Ojha, V., La Malfa, E., La Malfa, G., Pardalos, P.M., Umeton, R. (eds) Machine Learning, Optimization, and Data Science. LOD 2023. Lecture Notes in Computer Science, vol 14505. Springer, Cham. https://doi.org/10.1007/978-3-031-53969-5_5

DOI: https://doi.org/10.1007/978-3-031-53969-5_5
Published: 16 February 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53968-8
Online ISBN: 978-3-031-53969-5
eBook Packages: Computer Science, Computer Science (R0)
