Abstract
The growing popularity of enterprise technologies for decentralized systems has led to common patterns of component use. This trend, however, poses new challenges for code clone detection: approaches can no longer look only at low-level code but must deal with higher-level component semantics. Yet few works have addressed this trend. One quality issue that arises in large systems is duplicated behavior implemented with different syntactic structures, referred to as a semantic clone. Detecting such clones is crucial for enterprises, where a software codebase grows and evolves and maintenance costs rise significantly. Detecting semantic clones requires semantic information about the given program. Unfortunately, although many code clone detection techniques have been proposed, few solutions target enterprise systems explicitly, and even fewer are dedicated to semantic clones. To reason about semantic clones, we compare pairs of component call-graphs in the system. Because enterprise systems share common component types, targeted enterprise metadata lets us ensure that only relevant fragments are matched. When applied to an established system benchmark, our method shows high accuracy in detecting semantic clones. We also assessed different system versions to examine the method’s applicability to decentralized system evolution.
Data availability
The dataset generated in this work is available at https://doi.org/10.5281/zenodo.7632839 and https://doi.org/10.5281/zenodo.7632842. Our prototype tools are available on GitHub as open source: Semantic Clone Detector (https://github.com/cloudhubs/Distributed-Systems-Semantic-Clone-Detector), Gradle Plugin (https://github.com/cloudhubs/prophet-gradle-plugin), and Interactive Tool (https://github.com/svacina/prophet).
Notes
Train-Ticket benchmark: https://github.com/FudanSELab/train-ticket, accessed on 2/5/2023.
Wanxin benchmark: https://github.com/mikuhuyo/wanxin-p2p, accessed on 2/5/2023.
Swarm benchmark: https://github.com/macrozheng/mall-swarm, accessed on 2/5/2023.
Syntactic Clone results from [22]: https://microservicedata.github.io, accessed on 2/5/2023.
Note that while the examples and implementation demonstrations of our method are specific to the Java platform, the method itself is not limited to this platform.
The domain of this binary output is usually modeled in one of two ways: \(\{1,0\}\) or \(\{1,-1\}\) for positive and negative classes respectively. We assume the \(\{1,0\}\) model for the purposes of the explanation.
Our Prototype: https://github.com/cloudhubs/Distributed-Systems-Semantic-Clone-Detector, accessed on 2/5/2023.
WS4J: https://github.com/Sciss/ws4j, accessed on 2/5/2023.
Our Semantic Clone Dataset (V1): https://zenodo.org/record/7632839, accessed on 2/11/2023.
Our Semantic Clone Dataset (V2): https://zenodo.org/record/7632842, accessed on 2/11/2023.
SonarQube: https://www.sonarqube.org, accessed on 2/5/2023.
Our Interactive Tool: https://github.com/svacina/prophet, accessed on 2/5/2023.
VSCode IDE: https://code.visualstudio.com, accessed on 2/5/2023.
Our Plugin: https://github.com/cloudhubs/prophet-gradle-plugin, accessed on 2/5/2023.
References
Besker T, Martini A, Bosch J. Technical debt cripples software developer productivity: a longitudinal study on developers’ daily software development work. In: Proceedings of the 2018 International Conference on Technical Debt (TechDebt ’18). New York, NY, USA: Association for Computing Machinery; 2018. p. 105–114. https://doi.org/10.1145/3194164.3194178.
Ain QU, Butt WH, Anwar MW, Azam F, Maqbool B. A systematic review on code clone detection. IEEE Access. 2019;7:86121–44.
Baker BS. On finding duplication and near-duplication in large software systems. In: Proceedings of 2nd working conference on reverse engineering. IEEE; 1995. p. 86–95. https://doi.org/10.1109/WCRE.1995.514697.
Ducasse S, Rieger M, Demeyer S. A language independent approach for detecting duplicated code. In: Proceedings IEEE international conference on software maintenance 1999 (ICSM’99), ‘Software maintenance for business change’ (Cat. No. 99CB36360). IEEE; 1999. p. 109–118. https://doi.org/10.1109/ICSM.1999.792593.
Higo Y, Kusumoto S, Inoue K. A metric-based approach to identifying refactoring opportunities for merging code clones in a java software system. J Softw Maint Evol: Res Pract. 2008;20(6):435–61. https://doi.org/10.1002/smr.394.
Kumar A, Yadav R, Kumar K. A systematic review of semantic clone detection techniques in software systems. In: IOP conference series: materials science and engineering, vol. 1022. IOP Publishing; 2021. p. 012074. https://doi.org/10.1088/1757-899X/1022/1/012074.
Vislavski T, Rakić G, Cardozo N, Budimac Z. Licca: a tool for cross-language clone detection. In: 2018 IEEE 25th international conference on software analysis, evolution and reengineering (SANER). IEEE; 2018. p. 512–516. https://doi.org/10.1109/SANER.2018.8330250.
Saini V, Farmahinifarahani F, Lu Y, Baldi P, Lopes CV. Oreo: detection of clones in the twilight zone. In: Proceedings of the 2018 26th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 2018. p. 354–365. https://doi.org/10.5281/zenodo.1317760.
Svacina J, Bushong V, Das D, Cerny T. Semantic code clone detection method for distributed enterprise systems. In: CLOSER. 2022. p. 27–37. https://doi.org/10.5220/0011032200003200.
Roy CK, Cordy JR. A survey on software clone detection research. Queen’s Sch Comput TR. 2007;541(115):64–8.
Svajlenko J, Roy CK. Evaluating clone detection tools with bigclonebench. In: 2015 IEEE international conference on software maintenance and evolution (ICSME). IEEE; 2015. p. 131–140. https://doi.org/10.1109/ICSM.2015.7332459.
Nasirloo H, Azimzadeh F. Semantic code clone detection using abstract memory states and program dependency graphs. In: 2018 4th international conference on web research (ICWR). IEEE; 2018. p. 19–27. https://doi.org/10.1109/ICWR.2018.8387232.
Wu Y, Zou D, Dou S, Yang S, Yang W, Cheng F, Liang H, Jin H. Scdetector: software functional clone detection based on semantic tokens analysis. In: Proceedings of the 35th IEEE/ACM international conference on automated software engineering. 2020. p. 821–833. https://doi.org/10.1145/3324884.3416562.
Vislavski T, Rakić G, Cardozo N, Budimac Z. Licca: a tool for cross-language clone detection. In: 2018 IEEE 25th international conference on software analysis, evolution and reengineering (SANER). IEEE; 2018. p. 512–516. https://doi.org/10.1109/SANER.2018.8330250.
Alomari HW, Stephan M. Clone detection through srcclone: a program slicing based approach. J Syst Softw. 2022;184:111115. https://doi.org/10.1016/j.jss.2021.111115.
Juergens E, Deissenboeck F, Hummel B. Code similarities beyond copy & paste. In: 2010 14th European conference on software maintenance and reengineering. IEEE; 2010. p. 78–87. https://doi.org/10.1109/CSMR.2010.33.
Sheneamer A, Kalita J. A survey of software clone detection techniques. Int J Comput Appl. 2016;137(10):1–21.
Marcus A, Maletic JI. Identification of high-level concept clones in source code. In: Proceedings 16th annual international conference on automated software engineering (ASE 2001). IEEE; 2001. p. 107–114. https://doi.org/10.1109/ASE.2001.989796.
Sheneamer A, Roy S, Kalita J. A detection framework for semantic code clones and obfuscated code. Expert Syst Appl. 2018;97:405–20. https://doi.org/10.1016/j.eswa.2017.12.040.
Fang C, Liu Z, Shi Y, Huang J, Shi Q. Functional code clone detection with syntax and semantics fusion learning. In: Proceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis. 2020. p. 516–527. https://doi.org/10.1145/3395363.3397362.
Alrabaee S, Wang L, Debbabi M. Bingold: towards robust binary analysis by extracting the semantics of binary code as semantic flow graphs (sfgs). Digit Investig. 2016;18:11–22. https://doi.org/10.1016/j.diin.2016.04.002.
Zhao Y, Mo R, Zhang Y, Zhang S, Xiong P. Exploring and understanding cross-service code clones in microservice projects. In: 2022 IEEE/ACM 30th international conference on program comprehension (ICPC). IEEE; 2022. p. 449–459. https://doi.org/10.1145/3524610.3527925.
Kamiya T, Kusumoto S, Inoue K. A token-based code clone detection tool-ccfinder and its empirical evaluation. Technical report, Osaka University, Department of Information and Computer Sciences, Inoue Laboratory; 2000.
Papadimitriou CH, Raghavan P, Tamaki H, Vempala S. Latent semantic indexing: a probabilistic analysis. J Comput Syst Sci. 2000;61(2):217–35. https://doi.org/10.1006/jcss.2000.1711.
Hou C, Nie F, Li X, Yi D, Wu Y. Joint embedding learning and sparse regression: a framework for unsupervised feature selection. IEEE Trans Cybern. 2013;44(6):793–804. https://doi.org/10.1109/TCYB.2013.2272642.
Baldi P, Chauvin Y. Neural networks for fingerprint recognition. Neural Comput. 1993;5(3):402–18. https://doi.org/10.1162/neco.1993.5.3.402.
Weiser M. Program slicing. IEEE Trans Softw Eng. 1984;SE-10(4):352–7. https://doi.org/10.1109/TSE.1984.5010248.
Alomari HW, Collard ML, Maletic JI, Alhindawi N, Meqdadi O. srcslice: very efficient and scalable forward static slicing. J Softw: Evol Proc. 2014;26(11):931–61. https://doi.org/10.1002/smr.1651.
Rakić G. Extendable and adaptable framework for input language independent static analysis. PhD thesis, University of Novi Sad (Serbia); 2015.
Koschke R, Falke R, Frenzel P. Clone detection using abstract syntax suffix trees. In: 2006 13th working conference on reverse engineering. IEEE; 2006. p. 253–262. https://doi.org/10.1109/WCRE.2006.18.
Cordy JR, Roy CK. The nicad clone detector. In: 2011 IEEE 19th international conference on program comprehension. IEEE; 2011. p. 219–220. https://doi.org/10.1109/ICPC.2011.26.
Koschke R, Baxter ID, Conradt M, Cordy JR. Software clone management towards industrial application (dagstuhl seminar 12071). In: Dagstuhl Reports, vol. 2. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik; 2012. https://doi.org/10.4230/DagRep.2.2.21.
Schiewe M, Curtis J, Bushong V, Cerny T. Advancing static code analysis with language-agnostic component identification. IEEE Access. 2022;10:30743–61. https://doi.org/10.1109/ACCESS.2022.3160485.
JBoss. Javassist: Java bytecode engineering toolkit. 2020. https://www.javassist.org. Accessed 2021-06-18.
Fellbaum C. Wordnet and wordnets. In: Brown K, editor. Encyclopedia of Language and Linguistics. Oxford, UK: Elsevier; 2005. p. 665–70.
Bishop CM, Nasrabadi NM. Pattern recognition and machine learning. 2006;4(4).
James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning, vol. 112. 2013.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
Manning CD. Introduction to information retrieval. 2008.
Lemnaru C, Potolea R. Imbalanced classification problems: systematic study, issues and best practices. In: Enterprise information systems: 13th international conference, ICEIS 2011, Beijing, China, June 8-11, 2011, Revised Selected Papers 13. Springer; 2012. p. 35–50. https://doi.org/10.1007/978-3-642-29958-2_3.
Abu-Mostafa YS, Magdon-Ismail M, Lin H-T. Learning from data, vol. 4. New York, NY, USA: AMLBook; 2012.
Wohlin C, Runeson P, Höst M, Ohlsson M, Regnell B, Wesslén A. Experimentation in Software Engineering: An Introduction. Germany: The Kluwer International Series In Software Engineering. Springer; 2000.
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. 1854049 and a grant from Red Hat Research.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Advances on Cloud Computing and Services Science” guest edited by Donald F. Ferguson, Claus Pahl and Maarten van Steen.
About this article
Cite this article
Abdelfattah, A.S., Rodriguez, A., Walker, A. et al. Detecting Semantic Clones in Microservices Using Components. SN COMPUT. SCI. 4, 470 (2023). https://doi.org/10.1007/s42979-023-01910-1
DOI: https://doi.org/10.1007/s42979-023-01910-1