research-article

Deep code search

Authors:
Xiaodong Gu

The Hong Kong University of Science and Technology, Hong Kong

The Hong Kong University of Science and Technology, Hong Kong
View Profile

,
Hongyu Zhang

The University of Newcastle, Callaghan, Australia

The University of Newcastle, Callaghan, Australia
View Profile

,
Sunghun Kim

The Hong Kong University of Science and Technology, Hong Kong and Clova AI Research, NAVER

The Hong Kong University of Science and Technology, Hong Kong and Clova AI Research, NAVER
View Profile

ICSE '18: Proceedings of the 40th International Conference on Software EngineeringMay 2018Pages 933–944https://doi.org/10.1145/3180155.3180167

Published:27 May 2018Publication History

ICSE '18: Proceedings of the 40th International Conference on Software Engineering

Pages 933–944

ABSTRACT

To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language query. They lack a deep understanding of the semantics of queries and source code.

In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized and irrelevant/noisy keywords in queries can be handled.

As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.

References

Camel case, https://en.wikipedia.org/wiki/camelcase.Google Scholar
Eclipse JDT. http://www.eclipse.org/jdt/.Google Scholar
Github. https://github.com.Google Scholar
Keras. https://keras.io/.Google Scholar
Lucene. https://lucene.apache.org/.Google Scholar
Theano, http://deeplearning.net/software/theano/.Google Scholar
M. Allamanis, H. Peng, and C. Sutton. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning (ICML), 2016.Google Scholar
J. Anvik and G. C. Murphy. Reducing the effort of bug report triage: Recommenders for development-oriented decisions. ACM Transactions on Software Engineering and Methodology (TOSEM), 20(3):10, 2011. Google ScholarDigital Library
A. Bacchelli, M. Lanza, and R. Robbes. Linking e-mails and source code artifacts. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, pages 375--384. ACM, 2010. Google ScholarDigital Library
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.Google Scholar
O. Barzilay, C. Treude, and A. Zagalsky. Facilitating crowd sourced software engineering via stack overflow. In Finding Source Code on the Web for Remix and Reuse, pages 289--308. Springer, 2013.Google ScholarCross Ref
T. J. Biggerstaff, B. G. Mitbander, and D. E. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72--82, 1994. Google ScholarDigital Library
J. Brandt, M. Dontcheva, M. Weskamp, and S. R. Klemmer. Example-centric programming: integrating web search into the development environment. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 513--522. ACM, 2010. Google ScholarDigital Library
B. A. Campbell and C. Treude. NLP2Code: Code snippet content assist via natural language tasks. arXiv preprint arXiv:1701.05648, 2017.Google Scholar
W.-K. Chan, H. Cheng, and D. Lo. Searching connected API subgraph via text phrases. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, page 10. ACM, 2012. Google ScholarDigital Library
O. Chaparro and A. Marcus. On the reduction of verbose queries in text retrieval based software maintenance. In Proceedings of the 38th International Conference on Software Engineering Companion, pages 716--718. ACM, 2016. Google ScholarDigital Library
K. Cho, B. Van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724--1734, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.Google ScholarCross Ref
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493--2537, 2011. Google ScholarDigital Library
C. S. Corley, K. Damevski, and N. A. Kraft. Exploring the use of deep learning for feature location. In Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on, pages 556--560. IEEE, 2015. Google ScholarDigital Library
B. Dagenais and M. P. Robillard. Recovering traceability links between an api and its learning resources. In 2012 34th International Conference on Software Engineering (ICSE), pages 47--57. IEEE, 2012. Google ScholarDigital Library
M. Feng, B. Xiang, M. R. Glass, L. Wang, and B. Zhou. Applying deep learning to answer selection: A study and an open task. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 813--820. IEEE, 2015.Google ScholarCross Ref
A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121--2129, 2013. Google ScholarDigital Library
X. Ge, D. C. Shepherd, K. Damevski, and E. Murphy-Hill. Design and evaluation of a multi-recommendation system for local code search. Journal of Visual Languages & Computing, 2016.Google Scholar
G. Gousios, M. Pinzger, and A. v. Deursen. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, pages 345--355. ACM, 2014. Google ScholarDigital Library
A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE transactions on pattern analysis and machine intelligence, 31(5):855--868, 2009. Google ScholarDigital Library
M. Grechanik, C. Fu, Q. Xie, C. McMillan, D. Poshyvanyk, and C. Cumby. A search engine for finding highly relevant applications. In 2010 ACM/IEEE 32nd International Conference on Software Engineering, volume 1, pages 475--484. IEEE, 2010. Google ScholarDigital Library
X. Gu, H. Zhang, D. Zhang, and S. Kim. Deep API learning. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE'16), 2016. Google ScholarDigital Library
X. Gu, H. Zhang, D. Zhang, and S. Kim. DeepAM: Migrate APIs with multi-modal sequence to sequence learning. In Proceedings of the Twenty-Sixth International Joint Conferences on Artifical Intelligence (IJCAI'17), 2017. Google ScholarDigital Library
S. Haiduc, G. Bavota, A. Marcus, R. Oliveto, A. De Lucia, and T. Menzies. Automatic query reformulations for text retrieval in software engineering. In Proceedings of the 2013 International Conference on Software Engineering, pages 842--851. IEEE Press, 2013. Google ScholarDigital Library
E. Hill, L. Pollock, and K. Vijay-Shanker. Improving source code search with natural language phrasal representations of method signatures. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, pages 524--527. IEEE Computer Society, 2011. Google ScholarDigital Library
E. Hill, M. Roldan-Vega, J. A. Fails, and G. Mallet. NL-based query refinement and contextualized code search results: A user study. In Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week-IEEE Conference on, pages 34--43. IEEE, 2014.Google ScholarCross Ref
R. Holmes, R. Cottrell, R. J. Walker, and J. Denzinger. The end-to-end use of source code examples: An exploratory study. In Software Maintenance, 2009. ICSM 2009. IEEE International Conference on, pages 555--558. IEEE, 2009.Google ScholarCross Ref
A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128--3137, 2015.Google ScholarCross Ref
Y. Ke, K. T. Stolee, C. Le Goues, and Y. Brun. Repairing programs with semantic code search (T). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pages 295--306. IEEE, 2015.Google ScholarDigital Library
I. Keivanloo, J. Rilling, and Y. Zou. Spotting working code examples. In Proceedings of the 36th International Conference on Software Engineering, pages 664--675. ACM, 2014. Google ScholarDigital Library
Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.Google Scholar
D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.Google Scholar
A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. Combining deep learning with information retrieval to localize buggy files for bug reports (n). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pages 476--481. IEEE, 2015.Google ScholarDigital Library
Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188--1196, 2014. Google ScholarDigital Library
M. Li, T. Zhang, Y. Chen, and A. J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661--670. ACM, 2014. Google ScholarDigital Library
X. Li, Z. Wang, Q. Wang, S. Yan, T. Xie, and H. Mei. Relationship-aware code search for JavaScript frameworks. In Proceedings of the ACM SIGSOFT 24th International Symposium on the Foundations of Software Engineering. ACM, 2016. Google ScholarDigital Library
W. Ling, E. Grefenstette, K. M. Hermann, T. Kocisky, A. Senior, F. Wang, and P. Blunsom. Latent predictor networks for code generation. arXiv preprint arXiv:1603.06744, 2016.Google Scholar
E. Linstead, S. Bajracharya, T. Ngo, P. Rigor, C. Lopes, and P. Baldi. Sourcerer: mining and searching internet-scale software repositories. Data Mining and Knowledge Discovery, 18:300--336, 2009. Google ScholarDigital Library
M. Lu, X. Sun, S. Wang, D. Lo, and Y. Duan. Query expansion via wordnet for effective code search. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pages 545--549. IEEE, 2015.Google Scholar
F. Lv, H. Zhang, J. Lou, S. Wang, D. Zhang, and J. Zhao. CodeHow: Effective code search based on API understanding and extended boolean model. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015). IEEE, 2015.Google ScholarDigital Library
C. McMillan, M. Grechanik, D. Poshyvanyk, C. Fu, and Q. Xie. Exemplar: A source code search engine for finding highly relevant applications. IEEE Transactions on Software Engineering, 38(5):1069--1087, 2012. Google ScholarDigital Library
C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu. Portfolio: finding relevant functions and their usage. In Proceedings of the 33rd International Conference on Software Engineering (ICSE'11), pages 111--120. IEEE, 2011. Google ScholarDigital Library
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.Google Scholar
T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26--30, 2010, pages 1045--1048, 2010.Google ScholarCross Ref
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111--3119, 2013. Google ScholarDigital Library
I. J. Mojica, B. Adams, M. Nagappan, S. Dienst, T. Berger, and A. E. Hassan. A large scale empirical study on software reuse in mobile apps. IEEE Software, 31(2):78--86, 2014.Google ScholarCross Ref
D. J. Montana and L. Davis. Training feedforward neural networks using genetic algorithms. In IJCAI, volume 89, pages 762--767, 1989. Google ScholarDigital Library
L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 1287--1293. AAAI Press, 2016. Google ScholarDigital Library
L. Mou, R. Men, G. Li, L. Zhang, and Z. Jin. On end-to-end program generation from user intention by deep neural networks. arXiv, 2015.Google Scholar
A. Nederlof, A. Mesbah, and A. v. Deursen. Software engineering for the web: the state of the practice. In Companion Proceedings of the 36th International Conference on Software Engineering, pages 4--13. ACM, 2014. Google ScholarDigital Library
T. D. Nguyen, A. T. Nguyen, H. D. Phan, and T. N. Nguyen. Exploring api embedding for api usages and applications. In Proceedings of the 39th International Conference on Software Engineering, pages 438--449. IEEE Press, 2017. Google ScholarDigital Library
L. Nie, H. Jiang, Z. Ren, Z. Sun, and X. Li. Query expansion based on crowd knowledge for code search. IEEE Transactions on Services Computing, 9(5):771--783, 2016.Google ScholarCross Ref
H. Niu, I. Keivanloo, and Y. Zou. Learning to rank code examples for code search engines. Empirical Software Engineering, pages 1--33, 2016. Google ScholarDigital Library
H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. K. Ward. Deep sentence embedding using the long short term memory network: Analysis and application to information retrieval. CoRR, abs/1502.06922, 2015. Google ScholarDigital Library
H. Peng, L. Mou, G. Li, Y. Liu, L. Zhang, and Z. Jin. Building program vector representations for deep learning. In Proceedings of the 8th International Conference on Knowledge Science, Engineering and Management - Volume 9403, KSEM 2015, pages 547--553, New York, NY, USA, 2015. Springer-Verlag New York, Inc. Google ScholarDigital Library
L. Ponzanelli, G. Bavota, M. Di Penta, R. Oliveto, and M. Lanza. Mining stackoverflow to turn the ide into a self-confident programming prompter. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 102--111. ACM, 2014. Google ScholarDigital Library
M. Raghothaman, Y. Wei, and Y. Hamadi. SWIM: synthesizing what I mean: code search and idiomatic snippet synthesis. In Proceedings of the 38th International Conference on Software Engineering, pages 357--367. ACM, 2016. Google ScholarDigital Library
M. Rahimi and J. Cleland-Huang. Patterns of co-evolution between requirements and source code. In 2015 IEEE Fifth International Workshop on Requirements Patterns (RePa), pages 25--31. IEEE, 2015. Google ScholarDigital Library
V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 2014. Google ScholarDigital Library
S. P. Reiss. Semantics-based code search. In Proceedings of the 31st International Conference on Software Engineering, pages 243--253. IEEE Computer Society, 2009. Google ScholarDigital Library
M. Renieres and S. P. Reiss. Fault localization with nearest neighbor queries. In Automated Software Engineering, 2003. Proceedings. 18th IEEE International Conference on, pages 30--39, Oct 2003. Google ScholarDigital Library
P. C. Rigby and M. P. Robillard. Discovering essential code elements in informal documentation. In Proceedings of the 2013 International Conference on Software Engineering, pages 832--841. IEEE Press, 2013. Google ScholarDigital Library
J. Singer, T. Lethbridge, N. Vinson, and N. Anquetil. An examination of software engineering work practices. In CASCON First Decade High Impact Papers, pages 174--188. IBM Corp., 2010. Google ScholarDigital Library
K. T. Stolee, S. Elbaum, and D. Dobos. Solving the search for source code. ACM Transactions on Software Engineering and Methodology (TOSEM), 23(3):26, 2014. Google ScholarDigital Library
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104--3112, 2014. Google ScholarDigital Library
M. Tan, B. Xiang, and B. Zhou. Lstm-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108, 2015.Google Scholar
J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 384--394. Association for Computational Linguistics, 2010. Google ScholarDigital Library
Y. Uneno, O. Mizuno, and E.-H. Choi. Using a distributed representation of words in localizing relevant files for bug reports. In Software Quality, Reliability and Security (QRS), 2016 IEEE International Conference on, pages 183--190. IEEE, 2016.Google ScholarCross Ref
J. Weston, S. Bengio, and N. Usunier. Wsabie: scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence-Volume Volume Three, pages 2764--2770. AAAI Press, 2011. Google ScholarDigital Library
M. White, M. Tufano, M. Martinez, M. Monperrus, and D. Poshyvanyk. Sorting and transforming program repair ingredients via deep learning code similarities. arXiv preprint arXiv:1707.04742, 2017.Google Scholar
M. White, M. Tufano, C. Vendome, and D. Poshyvanyk. Deep learning code fragments for code clone detection. In Proceedings of the 31th IEEE/ACM International Conference on Automated Software Engineering (ASE 2016), 2016. Google ScholarDigital Library
M. White, C. Vendome, M. Linares-Vásquez, and D. Poshyvanyk. Toward deep learning software repositories. In Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on, pages 334--345. IEEE, 2015. Google ScholarDigital Library
R. Xu, C. Xiong, W. Chen, and J. J. Corso. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In AAAI, pages 2346--2352. Citeseer, 2015. Google ScholarDigital Library
X. Ye, R. Bunescu, and C. Liu. Learning to rank relevant files for bug reports using domain knowledge. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 689--699. ACM, 2014. Google ScholarDigital Library
X. Ye, H. Shen, X. Ma, R. Bunescu, and C. Liu. From word embeddings to document similarities for improved information retrieval in software engineering. In Proceedings of the 38th International Conference on Software Engineering, pages 404--415. ACM, 2016. Google ScholarDigital Library
H. Zhang, A. Jain, G. Khandelwal, C. Kaushik, S. Ge, and W. Hu. Bing developer assistant: Improving developer productivity by recommending sample code. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, pages 956--961. ACM, 2016. Google ScholarDigital Library
J. Zhou and R. J. Walker. API Deprecation: A retrospective analysis and detection method for code examples on the web. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE'16). ACM, 2016. Google ScholarDigital Library

Index Terms

Deep code search
1. Software and its engineering
  1. Software creation and management
    1. Software development techniques
      1. Reusability

Recommendations

Learning to rank code examples for code search engines

Source code examples are used by developers to implement unfamiliar tasks by learning from existing solutions. To better support developers in finding existing solutions, code search engines are designed to locate and rank code examples relevant to user'...
Read More
Code semantic enrichment for deep code search
Abstract
Code search aims to retrieve code snippets from a large-scale codebase, where the semantics of the searched code match developers’ query intent. Code is a low-level implementation of programming intents, but query is always expressed as clear and ...
Graphical abstract

Display Omitted
Highlights
- Finding that the code semantics can be enriched by incorporating with the description of its most similar code.
- Proposing a code semantic enrichment approach named SemEnr for deep code search.
- Evaluating the performance of SemEnr ...
Read More
Active code search: incorporating user feedback to improve code search relevance
ASE '14: Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering

Code search techniques return relevant code fragments given a user query. They typically work in a passive mode: given a user query, a static list of code fragments sorted by the relevance scores decided by a code search technique is returned to the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
ICSE '18: Proceedings of the 40th International Conference on Software Engineering
May 2018
1307 pages
ISBN:9781450356381
DOI:10.1145/3180155
Conference Chair:
Michel Chaudron
Chalmers University of Technology, University of Gothenburg, Sweden
,
General Chair:
Ivica Crnkovic
Chalmers University of Technology, University of Gothenburg, Sweden
,
Program Chairs:
Marsha Chechik
University of Toronto, Canada
,
Mark Harman
Facebook and University College London, United Kingdom
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 May 2018
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
code search
deep learning
joint embedding
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate276of1,856submissions,15%

Upcoming Conference

ICSE 2025

2025 IEEE/ACM 46th International Conference on Software Engineering

April 26 - May 3, 2025

Ottawa , ON , Canada
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 329
  Total Citations
  View Citations
- 1,784
  Total Downloads
- Downloads (Last 12 months)295
- Downloads (Last 6 weeks)29
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Deep code search

ICSE '18: Proceedings of the 40th International Conference on Software Engineering

ABSTRACT

References

Cited By

Index Terms

Recommendations

Learning to rank code examples for code search engines

Code semantic enrichment for deep code search

Active code search: incorporating user feedback to improve code search relevance