Abstract
In this paper, we study how to “surface” code for instant reference. Traditionally, code has been surfaced by treating it as text and applying keyword search techniques. However, much prior work has observed the limitations of this approach: (1) the semantic description of code is limited to comments, and (2) syntactic keywords are often not selective enough. In contrast, we discuss enabling techniques and scenarios for instant semantic-based surfacing. For example, during a development session, a developer may reference existing code with similar semantics, using the code written so far as a query. In addition to such semantic-based surfacing, we also enhance keyword-based surfacing with semantics by instantly adding semantic tags to code submitted to the repository. To achieve this goal, we first propose scalable indexing structures over vector abstractions of code. Our experimental results show that our techniques outperform a state-of-the-art tool in efficiency without compromising accuracy. We then deploy our techniques in instant search and tagging scenarios. For the instant code search scenario, we demonstrate an instant clone search tool built on our techniques, supporting sub-second search over 54 million LOC. For the instant code tagging scenario, we propose an automatic instant code tagging algorithm that mines meaningful tags from clones.
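
To make the idea of a vector abstraction of code concrete, the following is a minimal sketch and not the tool described here. It assumes Python's standard `ast` module as a stand-in for a Java parser, counts AST node types as a crude characteristic vector, and uses a plain linear scan where the paper proposes scalable index structures. The function names and the toy corpus are hypothetical.

```python
import ast
import math
from collections import Counter


def characteristic_vector(source: str) -> Counter:
    """Count AST node types in a code fragment (a crude vector abstraction)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def distance(u: Counter, v: Counter) -> float:
    """Euclidean distance between two sparse count vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in keys))


def nearest(query: str, corpus: dict) -> str:
    """Return the name of the corpus fragment closest to the query.

    This is a linear scan for illustration only; it is not the
    indexing structure proposed in the paper.
    """
    q = characteristic_vector(query)
    return min(corpus, key=lambda name: distance(q, characteristic_vector(corpus[name])))


# Hypothetical toy repository: two small fragments.
corpus = {
    "sum_list": "def f(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n",
    "greet": "def g(name):\n    print('hello ' + name)\n",
}

# The query has different identifiers but the same structure as "sum_list".
query = "def total(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"

best_match = nearest(query, corpus)
```

Because the vector ignores identifier names, the query matches the structurally identical fragment even though no keyword is shared, which is the gap that keyword search leaves open.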

Notes
It took about 580 ms to extract characteristic vectors from a 190-line piece of code (the average length in the Java open source project code set is about 189 lines). Reducing the dimensionality \((\mathcal{D}'=20)\) with the variance-based approach took about 72 s.
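
As a rough illustration of variance-based dimension reduction, the sketch below keeps the \(\mathcal{D}'\) dimensions with the largest variance across all vectors. This is one common reading of a "variance-based approach" and is an assumption; the exact procedure used in the article may differ. The function names and the sample vectors are hypothetical.

```python
from statistics import pvariance


def top_variance_dims(vectors, d_prime):
    """Return the indices of the d_prime dimensions with the largest variance."""
    n_dims = len(vectors[0])
    # Population variance of each dimension across all vectors.
    variances = [pvariance([v[i] for v in vectors]) for i in range(n_dims)]
    ranked = sorted(range(n_dims), key=lambda i: variances[i], reverse=True)
    # Keep the selected indices in their original order.
    return sorted(ranked[:d_prime])


def project(vector, dims):
    """Keep only the selected dimensions of a vector."""
    return [vector[i] for i in dims]


# Hypothetical characteristic vectors with four dimensions each.
vectors = [
    [1, 5, 0, 2],
    [1, 9, 0, 4],
    [1, 1, 0, 3],
]

dims = top_variance_dims(vectors, 2)             # dimensions 0 and 2 are constant
reduced = [project(v, dims) for v in vectors]
```

Dimensions that are constant across the code set carry no information for distinguishing fragments, so dropping them shrinks the index without changing relative distances much.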
References
Lee M-W, Roh J-W, Hwang SW, Kim S (2010) Instant code clone search. In: ACM SIGSOFT/FSE
Kim J, Lee S, Hwang SW, Kim S (2009) Adding examples into Java documents. In: ASE
Kim J, Lee S, Hwang SW, Kim S (2010) Towards an intelligent code search engine. In: AAAI
Kim J, Lee S, Hwang S-W, Kim S (2013) Enriching documents with examples: a corpus mining approach. ACM Trans Inf Syst 31(1):1:1–1:27
Kim M, Bergman L, Lau T, Notkin D (2004) An ethnographic study of copy and paste programming practices in OOPL. In: ISESE
Brandt J, Dontcheva M, Weskamp M, Klemmer S (2010) Example-centric programming: integrating web search into the development environment. In: SIGCHI
Baxter ID, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSM
Jiang L, Misherghi G, Su Z, Glondu S (2007) Deckard: scalable and accurate tree-based detection of code clones. In: ICSE
Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28(7):654–670
Livieri S, Higo Y, Matsushita M, Inoue K (2007) Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder. In: ICSE
Roy CK, Cordy JR, Koschke R (2009) Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci Comput Program 74(7):470–495
Wahler V, Seipel D, Wolff J, Fischer G (2004) Clone detection in source code by frequent itemset techniques. In: SCAM
Jürgens E, Hummel B, Deissenboeck F, Feilkas M (2008) Static bug detection through analysis of inconsistent clones. In: Software Engineering (Workshops), pp 443–446
Kim M, Sazawal V, Notkin D, Murphy G (2005) An empirical study of code clone genealogies. SIGSOFT Softw Eng Notes 30(5):187–196
Beckmann N, Kriegel H-P, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles. In: SIGMOD
Wang X-J, Zhang L, Liu M, Li Y, Ma WY (2010) Arista-image search to annotation on billions of web photos. In: CVPR
Zhang K, Shasha D (1989) Simple fast algorithms for the editing distance between trees and related problems. SIAM J Comput 18(6):1245–1262
Hjaltason GR, Samet H (1999) Distance browsing in spatial databases. ACM TODS 24:265–318
Böhm C, Krebs F (2004) The k-nearest neighbour join: turbo charging the KDD process. Knowl Inf Syst 6(6):728–749
Korn F, Pagel B-U, Faloutsos C (2001) On the ‘dimensionality curse’ and the ‘self-similarity blessing’. TKDE 13(1):96–111
Korn F, Sidiropoulos N, Faloutsos C, Siegel E, Protopapas Z (1996) Fast nearest neighbor search in medical image databases. In: VLDB
Seidl T, Kriegel H-P (1998) Optimal multi-step k-nearest neighbor search. In: SIGMOD
Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. In: SIGMOD
Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: VLDB
Liu H, Motoda H (1998) Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers, Dordrecht
Bercken J, Seeger B (2001) An evaluation of generic bulk loading techniques. In: VLDB
Kamel I, Faloutsos C (1993) On packing R-trees. In: CIKM
Leutenegger ST, Edgington JM, Lopez MA (1997) STR: a simple and efficient algorithm for R-tree packing. In: ICDE
Berchtold S, Böhm C, Kriegel H-P (1998) Improving the query performance of high-dimensional index structures by bulk-load operations. In: EDBT
Silberschatz A, Galvin PB, Gagne G (2008) Operating system concepts, 8th edn. Wiley, New York
Jolliffe IT (2002) Principal component analysis. Springer Series in Statistics, 2nd edn. Springer, Berlin
Yi B-K, Faloutsos C (2000) Fast time sequence indexing for arbitrary Lp norms. In: VLDB
Van Rijsbergen CJ (1979) Information retrieval. Butterworth-Heinemann, London
Li Z, Lu S, Myagmar S, Zhou Y (2004) CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In: OSDI
Bianchini M, Gori M, Scarselli F (2005) Inside PageRank. ACM Trans Int Technol 5(1):92–128
Keivanloo I, Rilling J, Charland P (2011) Internet-scale real-time code clone search via multi-level indexing. In: WCRE
Xie T, Acharya M, Thummalapenta S, Taneja K (2008) Improving software reliability and productivity via mining program source code. In: NSFNGS
Li Y, Zhang L, Li G, Xie B, Sun J (2008) Recommending typical usage examples for component retrieval in reuse repositories. In: ICSR
Holmes R, Murphy GC (2005) Using structural context to recommend source code examples. In: ICSE
Holmes R, Walker RJ, Murphy GC (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE Trans Softw Eng 32(12):952–970
Bajracharya SK, Ngo TC, Linstead E, Dou Y, Rigor P, Baldi P, Lopes CV (2006) Sourcerer: a search engine for open source code supporting structure-based search. In: OOPSLA Companion
Wang X, Lo D, Cheng J, Zhang L, Mei H, Yu JX (2010) Matching dependence-related queries in the system dependence graph. In: ASE
McMillan C, Grechanik M, Poshyvanyk D, Xie Q, Chen F (2011) Portfolio: finding relevant functions and their usage. In: ICSE
Wang S, Lo D, Jiang L (2011) Code search via topic-enriched dependence graph matching. In: WCRE
McMillan C, Grechanik M, Poshyvanyk D, Fu C, Xie Q (2012) Exemplar: a source code search engine for finding highly relevant applications. IEEE Trans Softw Eng 38(5):1069–1087
Chan W-K, Cheng H, Lo D (2012) Searching connected API subgraph via text phrases. In: FSE
McMillan C, Grechanik M, Poshyvanyk D (2012) Detecting similar software applications. In: ICSE
Thung F, Lo D, Jiang L (2012) Detecting similar applications with collaborative tagging
Al-Kofahi JM, Tamrawi A, Nguyen TT, Nguyen HA, Nguyen TN (2010) Fuzzy set approach for automatic tagging in evolving software. In: ICDM
Preisach C, Marinho LB, Schmidt-Thieme L (2010) Semi-supervised tag recommendation-using untagged resources to mitigate cold-start problems. In: PAKDD
Rendle S, Marinho LB, Nanopoulos A, Schmidt-Thieme L (2009) Learning optimal ranking with tensor factorization for tag recommendation. In: KDD
Acknowledgments
This work was supported by the Engineering Research Center of Excellence Program of Korea Ministry of Science, ICT & Future Planning (MSIP)/National Research Foundation of Korea (NRF) (Grant NRF-2008-0062609).
Additional information
This work builds on and significantly extends our preliminary work [1].
Cite this article
Park, Jw., Lee, MW., Roh, JW. et al. Surfacing code in the dark: an instant clone search approach. Knowl Inf Syst 41, 727–759 (2014). https://doi.org/10.1007/s10115-013-0677-z
DOI: https://doi.org/10.1007/s10115-013-0677-z