Surfacing code in the dark: an instant clone search approach

Park, Jin-woo; Lee, Mu-Woong; Roh, Jong-Won; Hwang, Seung-won; Kim, Sunghun

doi:10.1007/s10115-013-0677-z

Regular Paper
Published: 03 August 2013

Volume 41, pages 727–759, (2014)

Knowledge and Information Systems

Jin-woo Park¹,
Mu-Woong Lee¹,
Jong-Won Roh¹,
Seung-won Hwang¹ &
…
Sunghun Kim²

Abstract

In this paper, we study how to “surface” code for instant reference. The traditional way of surfacing code has been to treat code as text and apply keyword search techniques. However, much prior work has observed the limitations of this approach: (1) the semantic description of code is limited to comments, and (2) syntactic keywords are often not selective enough. In contrast, we discuss enabling techniques and scenarios for instant semantic-based surfacing. For example, during a development session, a developer may look up existing code with similar semantics, using the code written so far as a query. In addition to such semantic-based surfacing, we also enhance keyword-based surfacing with semantics by instantly adding semantic tags to code submitted to the repository. To achieve this goal, we first propose scalable indexing structures on vector abstractions of code. Our experimental results show that our techniques outperform a state-of-the-art tool in efficiency without compromising accuracy. We then deploy our techniques in instant search and tagging scenarios. For the instant code search scenario, we demonstrate an instant clone search tool built on our techniques that supports sub-second search over 54 million LOC. For the instant code tagging scenario, we propose an automatic instant code tagging algorithm that mines meaningful tags from clones.
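
The idea of searching clones over vector abstractions of code can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: the vector abstraction is taken to be a count of syntax-tree node types, Python's `ast` module stands in for a Java parser, and a linear scan stands in for the scalable index. The names `characteristic_vector`, `distance`, and `nearest_clones` are hypothetical and are not the authors' API.

```python
import ast
import math
from collections import Counter


def characteristic_vector(source: str) -> Counter:
    """Count syntax-tree node types in a snippet (assumed vector abstraction)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def distance(u: Counter, v: Counter) -> float:
    """Euclidean distance between two sparse count vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in keys))


def nearest_clones(query: str, corpus: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k corpus snippets closest to the query.

    A linear scan is used here only for illustration; it stands in for
    the index structure, which this sketch does not implement.
    """
    q = characteristic_vector(query)
    ranked = sorted(
        corpus,
        key=lambda name: distance(q, characteristic_vector(corpus[name])),
    )
    return ranked[:k]


# Hypothetical two-snippet repository and a query written "so far".
corpus = {
    "sum_list": "def f(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n",
    "greet": "def g(name):\n    print('hello ' + name)\n",
}
query = "def total(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"

best = nearest_clones(query, corpus)
```

Because the query differs from `sum_list` only in identifier names, their node-type counts coincide, so a keyword search on `total` or `acc` would miss a match that the vector comparison finds.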

Notes

Extracting characteristic vectors for a 190-line piece of code took about 580 ms (the average code length in the Java open source project code set is about 189 lines). Reducing the dimensionality to $\mathcal{D}'=20$ with the variance-based approach took about 72 s.
http://www.ohloh.net//.
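
The variance-based reduction mentioned in the first note can be pictured with a small sketch. It assumes that "variance-based" means keeping the $\mathcal{D}'$ dimensions with the highest variance across the data set, which is an interpretation and not something the note states. The function names and the toy vectors are hypothetical.

```python
from statistics import pvariance


def top_variance_dims(vectors, d_prime):
    """Return the indices of the d_prime dimensions with the highest variance."""
    dims = range(len(vectors[0]))
    variances = [pvariance([v[i] for v in vectors]) for i in dims]
    keep = sorted(dims, key=lambda i: variances[i], reverse=True)[:d_prime]
    return sorted(keep)  # restore the original dimension order


def project(vector, keep):
    """Drop every dimension that is not listed in keep."""
    return [vector[i] for i in keep]


# Toy characteristic vectors with four dimensions each (made-up values).
vectors = [
    [1, 5, 0, 9],
    [1, 7, 0, 1],
    [1, 6, 1, 5],
]
keep = top_variance_dims(vectors, d_prime=2)
reduced = [project(v, keep) for v in vectors]
```

In this toy data the first dimension is constant and the third barely changes, so both are dropped, and the reduced vectors keep only the two dimensions that separate the snippets.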

References

Lee M-W, Roh J-W, Hwang SW, Kim S (2010) Instant code clone search. In: ACM SIGSOFT/FSE
Kim J, Lee S, Hwang SW, Kim S (2009) Adding examples into Java documents. In: ASE
Kim J, Lee S, Hwang SW, Kim S (2010) Towards an intelligent code search engine. In: AAAI
Kim J, Lee S, Hwang S-W, Kim S (2013) Enriching documents with examples: a corpus mining approach. ACM Trans Inf Syst 31(1):1:1–1:27
Kim M, Bergman L, Lau T, Notkin D (2004) An ethnographic study of copy and paste programming practices in OOPL. In: ISESE
Brandt J, Dontcheva M, Weskamp M, Klemmer S (2010) Example-centric programming: integrating web search into the development environment. In: SIGCHI
Baxter ID, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSM
Jiang L, Misherghi G, Su Z, Glondu S (2007) Deckard: scalable and accurate tree-based detection of code clones. In: ICSE
Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28(7):654–670
Livieri S, Higo Y, Matsushita M, Inoue K (2007) Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder. In: ICSE
Roy CK, Cordy JR, Koschke R (2009) Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci Comput Program 74(7):470–495
Wahler V, Seipel D, Wolff J, Fischer G (2004) Clone detection in source code by frequent itemset techniques. In: SCAM
Jürgens E, Hummel B, Deissenboeck F, Feilkas M (2008) Static bug detection through analysis of inconsistent clones. In: Software Engineering (Workshops), pp 443–446
Kim M, Sazawal V, Notkin D, Murphy G (2005) An empirical study of code clone genealogies. SIGSOFT Softw Eng Notes 30(5):187–196
Beckmann N, Kriegel H-P, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles. In: SIGMOD
Wang X-J, Zhang L, Liu M, Li Y, Ma WY (2010) Arista-image search to annotation on billions of web photos. In: CVPR
Zhang K, Shasha D (1989) Simple fast algorithms for the editing distance between trees and related problems. SIAM J Comput 18(6):1245–1262
Hjaltason GR, Samet H (1999) Distance browsing in spatial databases. ACM TODS 24:265–318
Böhm C, Krebs F (2004) The k-nearest neighbour join: turbo charging the KDD process. Knowl Inf Syst 6(6):728–749
Korn F, Pagel B-U, Faloutsos C (2001) On the ‘dimensionality curse’ and the ‘self-similarity blessing’. TKDE 13(1):96–111
Korn F, Sidiropoulos N, Faloutsos C, Siegel E, Protopapas Z (1996) Fast nearest neighbor search in medical image databases. In: VLDB
Seidl T, Kriegel H-P (1998) Optimal multi-step k-nearest neighbor search. In: SIGMOD
Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. In: SIGMOD
Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: VLDB
Liu H, Motoda H (1998) Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers, Dordrecht
Bercken J, Seeger B (2001) An evaluation of generic bulk loading techniques. In: VLDB
Kamel I, Faloutsos C (1993) On packing R-trees. In: CIKM
Leutenegger ST, Edgington JM, Lopez MA (1997) STR: a simple and efficient algorithm for R-tree packing. In: ICDE
Berchtold S, Böhm C, Kriegel H-P (1998) Improving the query performance of high-dimensional index structures by bulk-load operations. In: EDBT
Silberschatz A, Galvin PB, Gagne G (2008) Operating system concepts, 8th edn. Wiley, New York
Jolliffe IT (2002) Principal component analysis. Springer Series in Statistics, 2nd edn. Springer, Berlin
Yi B-K, Faloutsos C (2000) Fast time sequence indexing for arbitrary Lp norms. In: VLDB
Van Rijsbergen CJ (1979) Information retrieval. Butterworth-Heinemann, London
Li Z, Shan L, Myagmar S, Zhou Y (2004) CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In: OSDI
Bianchini M, Gori M, Scarselli F (2005) Inside PageRank. ACM Trans Int Technol 5(1):92–128
Keivanloo I, Rilling J, Charland P (2011) Internet-scale real-time code clone search via multi-level indexing. In: WCRE
Xie T, Acharya M, Thummalapenta S, Taneja K (2008) Improving software reliability and productivity via mining program source code. In: NSFNGS
Li Y, Zhang L, Li G, Xie B, Sun J (2008) Recommending typical usage examples for component retrieval in reuse repositories. In: ICSR
Holmes R, Murphy GC (2005) Using structural context to recommend source code examples. In: ICSE
Holmes R, Walker RJ, Murphy GC (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE Trans Softw Eng 32(12):952–970
Bajracharya SK, Ngo TC, Linstead E, Dou Y, Rigor P, Baldi P, Lopes CV (2006) Sourcerer: a search engine for open source code supporting structure-based search. In: OOPSLA Companion
Wang X, Lo D, Cheng J, Zhang L, Mei H, Yu JX (2010) Matching dependence-related queries in the system dependence graph. In: ASE
McMillan C, Grechanik M, Poshyvanyk D, Xie Q, Chen F (2011) Portfolio: finding relevant functions and their usage. In: ICSE
Wang S, Lo D, Jiang L (2011) Code search via topic-enriched dependence graph matching. In: WCRE
McMillan C, Grechanik M, Poshyvanyk D, Fu C, Xie Q (2012) Exemplar: a source code search engine for finding highly relevant applications. IEEE Trans Softw Eng 38(5):1069–1087
Chan W-K, Cheng H, Lo D (2012) Searching connected API subgraph via text phrases. In: FSE
McMillan C, Grechanik M, Poshyvanyk D (2012) Detecting similar software applications. In: ICSE
Thung F, Lo D, Jiang L (2012) Detecting similar applications with collaborative tagging
Al-Kofahi JM, Tamrawi A, Nguyen TT, Nguyen HA, Nguyen TN (2010) Fuzzy set approach for automatic tagging in evolving software. In: ICDM
Preisach C, Marinho LB, Schmidt-Thieme L (2010) Semi-supervised tag recommendation-using untagged resources to mitigate cold-start problems. In: PAKDD
Rendle S, Marinho LB, Nanopoulos A, Schmidt-Thieme L (2009) Learning optimal ranking with tensor factorization for tag recommendation. In: KDD

Acknowledgments

This work was supported by the Engineering Research Center of Excellence Program of Korea Ministry of Science, ICT & Future Planning (MSIP)/National Research Foundation of Korea (NRF) (Grant NRF-2008-0062609).

Author information

Authors and Affiliations

Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea
Jin-woo Park, Mu-Woong Lee, Jong-Won Roh & Seung-won Hwang
Hong Kong University of Science and Technology (HKUST), Hong Kong, China
Sunghun Kim


Corresponding author

Correspondence to Seung-won Hwang.

Additional information

This work builds on and significantly extends our preliminary work [1].

About this article

Cite this article

Park, Jw., Lee, MW., Roh, JW. et al. Surfacing code in the dark: an instant clone search approach. Knowl Inf Syst 41, 727–759 (2014). https://doi.org/10.1007/s10115-013-0677-z

Received: 26 November 2012
Revised: 22 May 2013
Accepted: 19 July 2013
Published: 03 August 2013
Issue Date: December 2014
DOI: https://doi.org/10.1007/s10115-013-0677-z
