Abstract
Although a number of techniques have been proposed over the years to detect clones for improving software maintenance, reusability, or security, there is still a lack of language-agnostic approaches with flexible code granularity for near-miss clone detection in big code. Detecting near-miss clones at this scale is challenging because it requires more computing and memory resources as the source code grows. In this paper, we present FastDCF, a fast and scalable distributed clone finder, which is partial-index based and optimized with a multithreading strategy. Furthermore, it overcomes single-node CPU and memory limitations through scalable distributed parallelization with MapReduce and HDFS, which further improves efficiency. It can detect not only Type-1 and Type-2 clones but also the most computationally expensive Type-3 clones in large repositories. It works at both function and file granularity and supports many different programming languages. Experimental results show that FastDCF detects clones in 250 million lines of code within 24 min, which is more efficient than existing clone detection techniques, with recall and precision comparable to state-of-the-art approaches. On BigCloneBench, a recent and widely used benchmark, FastDCF achieves both high recall and high precision, which is competitive with other existing tools.
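The abstract does not spell out how the partial index is built, so the following is only a minimal single-node sketch of one common way such an index can work: token-set overlap with prefix filtering, where only the rarest tokens of each code block are indexed. The function names `tokenize` and `find_clones`, the regular-expression tokenizer, and the 0.7 threshold are illustrative assumptions, not FastDCF's actual design, and the multithreading and MapReduce/HDFS layers are not modeled.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(code):
    """Hypothetical tokenizer: the set of distinct identifiers and numbers."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def find_clones(blocks, threshold=0.7):
    """Return ordered id pairs whose shared-token count is at least
    threshold * max(size_a, size_b). `blocks` maps block id -> source text.
    Assumed similarity measure; the paper may use a different one."""
    bags = {bid: tokenize(src) for bid, src in blocks.items()}
    # Global token frequency: rare tokens sort first, so prefixes are selective.
    freq = Counter(tok for bag in bags.values() for tok in bag)
    index = defaultdict(set)  # token -> ids of blocks whose prefix contains it
    clones = set()
    for bid in sorted(bags):
        bag = bags[bid]
        if not bag:
            continue
        ordered = sorted(bag, key=lambda t: (freq[t], t))
        required = math.ceil(threshold * len(bag))
        # Partial index: only this prefix is ever indexed or looked up.
        prefix = ordered[: len(bag) - required + 1]
        candidates = set()
        for tok in prefix:
            candidates |= index[tok]
        # Verify each candidate with the full overlap computation.
        for other in candidates:
            overlap = len(bag & bags[other])
            needed = math.ceil(threshold * max(len(bag), len(bags[other])))
            if overlap >= needed:
                clones.add((other, bid))
        for tok in prefix:
            index[tok].add(bid)
    return clones


blocks = {
    "a": "int sum(int a, int b) { int r = a + b; return r; }",
    "b": "int add(int a, int b) { int r = a + b; return r; }",
    "c": "void log(char *msg) { puts(msg); }",
}
result = find_clones(blocks)
```

Because two blocks that meet the overlap threshold must share at least one token in their prefixes, indexing only the prefix keeps the index small without losing candidate pairs under this similarity measure. Blocks "a" and "b" differ only in the function name, which makes them a near-miss pair in this toy setting.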
Acknowledgement
The work in this paper is supported by the Natural Science Foundation of China (under Grant Nos. 61872444 and U19A2060) and the National Key Research and Development Program of China (2018YFB1003602).
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, L. et al. (2022). FastDCF: A Partial Index Based Distributed and Scalable Near-Miss Code Clone Detection Approach for Very Large Code Repositories. In: Shen, H., et al. Parallel and Distributed Computing, Applications and Technologies. PDCAT 2021. Lecture Notes in Computer Science, vol 13148. Springer, Cham. https://doi.org/10.1007/978-3-030-96772-7_20
DOI: https://doi.org/10.1007/978-3-030-96772-7_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96771-0
Online ISBN: 978-3-030-96772-7
eBook Packages: Computer Science, Computer Science (R0)