Surfacing code in the dark: an instant clone search approach

  • Regular Paper
Knowledge and Information Systems

Abstract

In this paper, we study how to “surface” code for instant reference. The traditional way of surfacing code has been to treat code as text and apply keyword search techniques. However, many prior studies have observed the limitations of this approach: (1) the semantic description of code is limited to comments, and (2) syntactic keywords are often not selective enough. In contrast, we discuss enabling techniques and scenarios for instant semantic-based surfacing. For example, during a development session, a developer may look up existing code with similar semantics, using the code written so far as a query. In addition to such semantic-based surfacing, we also enhance keyword-based surfacing with semantics by instantly adding semantic tags to code submitted to the repository. To achieve this goal, we first propose scalable indexing structures on vector abstractions of code. Our experimental results show that our techniques outperform a state-of-the-art tool in efficiency without compromising accuracy. We then deploy our techniques in instant search and tagging scenarios. For the instant code search scenario, we demonstrate an instant clone search tool built on our techniques, supporting sub-second search over 54 million LOC. For the instant code tagging scenario, we propose an automatic instant code tagging algorithm that mines meaningful tags from clones.
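To make the idea of searching over "vector abstractions of code" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a characteristic vector is a count of syntax-tree node types, uses Python's `ast` module rather than the Java code the paper targets, and replaces the proposed indexing structures with a brute-force scan. The names `NODE_TYPES`, `characteristic_vector`, and `nearest_clones` are hypothetical.

```python
import ast
import math
from collections import Counter

# Hypothetical choice of dimensions: one counter per tracked AST node type.
NODE_TYPES = ["For", "While", "If", "Assign", "AugAssign",
              "Call", "Return", "BinOp", "Compare"]


def characteristic_vector(source):
    """Count how often each tracked node type occurs in the snippet's AST."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    return [counts[name] for name in NODE_TYPES]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_clones(query, corpus, k=1):
    """Brute-force stand-in for an index: the k snippets closest to the query."""
    q = characteristic_vector(query)
    scored = [(euclidean(q, characteristic_vector(src)), name)
              for name, src in corpus.items()]
    return sorted(scored)[:k]


# Toy repository: the query is the first snippet with every identifier renamed.
corpus = {
    "sum_list": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    "greet": "def greet(name):\n    print('hi', name)\n",
}
query = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
best = nearest_clones(query, corpus, k=1)
```

Because the vector records only structure, the renamed query lands at distance zero from `sum_list`, which keyword search over identifiers would miss. A linear scan like this does not scale to millions of lines, which is the gap the indexing structures in the paper are meant to close.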

Notes

  1. It took about 580 ms to extract characteristic vectors for a 190-line piece of code (the average length of a piece of code in the Java open source project code set is about 189 lines). It took about 72 s to reduce the dimensions \((\mathcal{D}'=20)\) using the variance-based approach.

  2. http://www.ohloh.net//.
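Note 1 mentions a variance-based reduction to \(\mathcal{D}'=20\) dimensions without giving details here. As a hedged illustration only, the sketch below assumes this means keeping the \(\mathcal{D}'\) dimensions with the largest variance across all vectors; the function names and the toy data are hypothetical.

```python
from statistics import pvariance


def top_variance_dims(vectors, d_prime):
    """Indices of the d_prime dimensions with the largest population variance."""
    dims = len(vectors[0])
    variances = [pvariance([v[i] for v in vectors]) for i in range(dims)]
    ranked = sorted(range(dims), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:d_prime])


def project(vector, keep):
    """Keep only the selected dimensions of a vector."""
    return [vector[i] for i in keep]


# Toy characteristic vectors with four dimensions, reduced to two.
vectors = [
    [1, 0, 5, 2],
    [1, 3, 9, 2],
    [1, 6, 1, 3],
]
keep = top_variance_dims(vectors, d_prime=2)
reduced = [project(v, keep) for v in vectors]
```

Dropping coordinates can only shrink Euclidean distances, so a distance computed on the reduced vectors is a lower bound on the full distance. That property is what allows a reduced-dimension index to filter candidates before an exact check.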

References

  1. Lee M-W, Roh J-W, Hwang S-W, Kim S (2010) Instant code clone search. In: ACM SIGSOFT/FSE

  2. Kim J, Lee S, Hwang S-W, Kim S (2009) Adding examples into Java documents. In: ASE

  3. Kim J, Lee S, Hwang S-W, Kim S (2010) Towards an intelligent code search engine. In: AAAI

  4. Kim J, Lee S, Hwang S-W, Kim S (2013) Enriching documents with examples: a corpus mining approach. ACM Trans Inf Syst 31(1):1:1–1:27

  5. Kim M, Bergman L, Lau T, Notkin D (2004) An ethnographic study of copy and paste programming practices in OOPL. In: ISESE

  6. Brandt J, Dontcheva M, Weskamp M, Klemmer S (2010) Example-centric programming: integrating web search into the development environment. In: SIGCHI

  7. Baxter ID, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSM

  8. Jiang L, Misherghi G, Su Z, Glondu S (2007) Deckard: scalable and accurate tree-based detection of code clones. In: ICSE

  9. Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28(7):654–670

  10. Livieri S, Higo Y, Matsushita M, Inoue K (2007) Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder. In: ICSE

  11. Roy CK, Cordy JR, Koschke R (2009) Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci Comput Program 74(7):470–495

  12. Wahler V, Seipel D, Wolff J, Fischer G (2004) Clone detection in source code by frequent itemset techniques. In: SCAM

  13. Jürgens E, Hummel B, Deissenboeck F, Feilkas M (2008) Static bug detection through analysis of inconsistent clones. In: Software Engineering (Workshops), pp 443–446

  14. Kim M, Sazawal V, Notkin D, Murphy G (2005) An empirical study of code clone genealogies. SIGSOFT Softw Eng Notes 30(5):187–196

  15. Beckmann N, Kriegel H-P, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles. In: SIGMOD

  16. Wang X-J, Zhang L, Liu M, Li Y, Ma W-Y (2010) Arista-image search to annotation on billions of web photos. In: CVPR

  17. Zhang K, Shasha D (1989) Simple fast algorithms for the editing distance between trees and related problems. SIAM J Comput 18(6):1245–1262

  18. Hjaltason GR, Samet H (1999) Distance browsing in spatial databases. ACM TODS 24:265–318

  19. Böhm C, Krebs F (2004) The k-nearest neighbour join: turbo charging the KDD process. Knowl Inf Syst 6(6):728–749

  20. Korn F, Pagel B-U, Faloutsos C (2001) On the ‘dimensionality curse’ and the ‘self-similarity blessing’. TKDE 13(1):96–111

  21. Korn F, Sidiropoulos N, Faloutsos C, Siegel E, Protopapas Z (1996) Fast nearest neighbor search in medical image databases. In: VLDB

  22. Seidl T, Kriegel H-P (1998) Optimal multi-step k-nearest neighbor search. In: SIGMOD

  23. Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. In: SIGMOD

  24. Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: VLDB

  25. Liu H, Motoda H (1998) Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers, Dordrecht

  26. Bercken J, Seeger B (2001) An evaluation of generic bulk loading techniques. In: VLDB

  27. Kamel I, Faloutsos C (1993) On packing R-trees. In: CIKM

  28. Leutenegger ST, Edgington JM, Lopez MA (1997) STR: a simple and efficient algorithm for R-tree packing. In: ICDE

  29. Berchtold S, Böhm C, Kriegel H-P (1998) Improving the query performance of high-dimensional index structures by bulk-load operations. In: EDBT

  30. Silberschatz A, Galvin PB, Gagne G (2008) Operating system concepts, 8th edn. Wiley, New York

  31. Jolliffe IT (2002) Principal component analysis. Springer Series in Statistics, 2nd edn. Springer, Berlin

  32. Yi B-K, Faloutsos C (2000) Fast time sequence indexing for arbitrary lp norms. In: VLDB

  33. Van Rijsbergen CJ (1979) Information retrieval. Butterworth-Heinemann, London

  34. Li Z, Shan L, Myagmar S, Zhou Y (2004) CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In: OSDI

  35. Bianchini M, Gori M, Scarselli F (2005) Inside PageRank. ACM Trans Int Technol 5(1):92–128

  36. Keivanloo I, Rilling J, Charland P (2011) Internet-scale real-time code clone search via multi-level indexing. In: WCRE

  37. Xie T, Acharya M, Thummalapenta S, Taneja K (2008) Improving software reliability and productivity via mining program source code. In: NSFNGS

  38. Li Y, Zhang L, Li G, Xie B, Sun J (2008) Recommending typical usage examples for component retrieval in reuse repositories. In: ICSR

  39. Holmes R, Murphy GC (2005) Using structural context to recommend source code examples. In: ICSE

  40. Holmes R, Walker RJ, Murphy GC (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE Trans Softw Eng 32(12):952–970

  41. Bajracharya SK, Ngo TC, Linstead E, Dou Y, Rigor P, Baldi P, Lopes CV (2006) Sourcerer: a search engine for open source code supporting structure-based search. In: OOPSLA Companion

  42. Wang X, Lo D, Cheng J, Zhang L, Mei H, Yu JX (2010) Matching dependence-related queries in the system dependence graph. In: ASE

  43. McMillan C, Grechanik M, Poshyvanyk D, Xie Q, Chen F (2011) Portfolio: finding relevant functions and their usage. In: ICSE

  44. Wang S, Lo D, Jiang L (2011) Code search via topic-enriched dependence graph matching. In: WCRE

  45. McMillan C, Grechanik M, Poshyvanyk D, Fu C, Xie Q (2012) Exemplar: a source code search engine for finding highly relevant applications. IEEE Trans Softw Eng 38(5):1069–1087

  46. Chan W-K, Cheng H, Lo D (2012) Searching connected API subgraph via text phrases. In: FSE

  47. McMillan C, Grechanik M, Poshyvanyk D (2012) Detecting similar software applications. In: ICSE

  48. Thung F, Lo D, Jiang L (2012) Detecting similar applications with collaborative tagging

  49. Al-Kofahi JM, Tamrawi A, Nguyen TT, Nguyen HA, Nguyen TN (2010) Fuzzy set approach for automatic tagging in evolving software. In: ICDM

  50. Preisach C, Marinho LB, Schmidt-Thieme L (2010) Semi-supervised tag recommendation-using untagged resources to mitigate cold-start problems. In: PAKDD

  51. Rendle S, Marinho LB, Nanopoulos A, Schmidt-Thieme L (2009) Learning optimal ranking with tensor factorization for tag recommendation. In: KDD

Acknowledgments

This work was supported by the Engineering Research Center of Excellence Program of Korea Ministry of Science, ICT & Future Planning (MSIP)/National Research Foundation of Korea (NRF) (Grant NRF-2008-0062609).

Author information

Corresponding author

Correspondence to Seung-won Hwang.

Additional information

This work builds on and significantly extends our preliminary work [1].

About this article

Cite this article

Park, Jw., Lee, MW., Roh, JW. et al. Surfacing code in the dark: an instant clone search approach. Knowl Inf Syst 41, 727–759 (2014). https://doi.org/10.1007/s10115-013-0677-z

  • DOI: https://doi.org/10.1007/s10115-013-0677-z
