FastDCF: A Partial Index Based Distributed and Scalable Near-Miss Code Clone Detection Approach for Very Large Code Repositories

  • Conference paper
Parallel and Distributed Computing, Applications and Technologies (PDCAT 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 13148)

Abstract

Although a number of techniques have been proposed over the years to detect clones for improving software maintenance, reusability, or security, there is still a lack of language-agnostic approaches with flexible code granularity for near-miss clone detection in big code at scale. Detecting near-miss clones in big code is challenging because it requires more computing and memory resources as the scale of the source code increases. In this paper, we present FastDCF, a fast and scalable distributed clone finder that is partial index based and optimized with a multithreading strategy. Furthermore, it overcomes single-node CPU and memory resource limitations through scalable distributed parallelization with MapReduce and HDFS, which further improves efficiency. It detects not only Type-1 and Type-2 clones but also Type-3 clones, the most computationally expensive to find, in large repositories. It works at both function and file granularity and supports many different programming languages. Experimental results show that FastDCF detects clones in 250 million lines of code within 24 min, which is more efficient than existing clone detection techniques, with recall and precision comparable to state-of-the-art approaches. On BigCloneBench, a recent and widely used benchmark, FastDCF achieves both high recall and precision, which is competitive with other existing tools.
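
The abstract does not spell out how the partial index is built, so the following is only a minimal sketch of one common way such an index can work, assuming a token-set overlap similarity and a prefix filter. It is not FastDCF's actual algorithm. The names `tokenize`, `find_clones`, `EXAMPLE_BLOCKS`, and the threshold `theta` are hypothetical. Only a short prefix of each block's rarest tokens is indexed, and two blocks are compared in full only if their prefixes share a token.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(code):
    """Split source text into a set of distinct tokens (assumed tokenizer)."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|\S", code))

def find_clones(blocks, theta=0.7):
    """Return pairs of block names whose token overlap is at least
    ceil(theta * max(len(a), len(b))). Illustrative sketch only."""
    token_sets = {name: tokenize(src) for name, src in blocks.items()}
    # Global token frequency: rare tokens are the most selective.
    freq = Counter(tok for toks in token_sets.values() for tok in toks)
    index = defaultdict(set)  # partial index: prefix token -> block names
    clones = set()
    for name in sorted(token_sets):
        # Same global ordering for every block: rarest first, ties by text.
        toks = sorted(token_sets[name], key=lambda t: (freq[t], t))
        # Prefix filter: two blocks that meet the threshold must share
        # at least one token within these prefixes.
        prefix = toks[: len(toks) - math.ceil(theta * len(toks)) + 1]
        candidates = set()
        for tok in prefix:
            candidates |= index[tok]
        # Verify each candidate with the full overlap computation.
        for other in candidates:
            a, b = token_sets[name], token_sets[other]
            if len(a & b) >= math.ceil(theta * max(len(a), len(b))):
                clones.add(tuple(sorted((name, other))))
        for tok in prefix:
            index[tok].add(name)
    return clones

# Hypothetical input: two near-miss variants and one unrelated function.
EXAMPLE_BLOCKS = {
    "sum_v1": "int sum(int a, int b) { int c = a + b; return c; }",
    "sum_v2": "int sum(int a, int b) { int c = a + b; log(c); return c; }",
    "print": "void print(String s) { System.out.println(s); }",
}

if __name__ == "__main__":
    print(find_clones(EXAMPLE_BLOCKS))
```

Because the index holds only prefixes, its memory footprint grows more slowly than a full inverted index. In a distributed setting, the candidate generation and verification steps could in principle be split across map and reduce tasks, though the paper's own partitioning scheme is not described in the abstract.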



Acknowledgement

The work in this paper is supported by the Natural Science Foundation of China (under Grant Nos. 61872444 and U19A2060) and the National Key Research and Development Program of China (2018YFB1003602).

Author information

Correspondence to Yi Ren or Jianbo Guan.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, L. et al. (2022). FastDCF: A Partial Index Based Distributed and Scalable Near-Miss Code Clone Detection Approach for Very Large Code Repositories. In: Shen, H., et al. Parallel and Distributed Computing, Applications and Technologies. PDCAT 2021. Lecture Notes in Computer Science, vol 13148. Springer, Cham. https://doi.org/10.1007/978-3-030-96772-7_20

  • DOI: https://doi.org/10.1007/978-3-030-96772-7_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96771-0

  • Online ISBN: 978-3-030-96772-7

  • eBook Packages: Computer Science (R0)
