ABSTRACT
To implement a piece of program functionality, developers can reuse previously written code snippets by searching a large-scale codebase. Over the years, many code search tools have been proposed to help developers. Existing approaches often treat source code as textual documents and use information retrieval models to retrieve code snippets that match a given query. Because these approaches rely mainly on the textual similarity between source code and the natural language query, they lack a deep understanding of the semantics of queries and source code.
In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that a code snippet and its corresponding description have similar vectors. Using this unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized, and irrelevant or noisy keywords in queries can be handled.
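The retrieval step described above can be illustrated with a minimal sketch. It assumes that code snippets and the query have already been embedded into the same vector space, and it ranks snippets by cosine similarity. The similarity measure, the function names (`cosine_similarity`, `search`), and the toy three-dimensional vectors are all assumptions made for illustration; they are not taken from the abstract and do not reproduce the CODEnn model itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, code_vectors, top_k=2):
    """Return the names of the top_k snippets whose vectors are closest to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in code_vectors.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical, hand-picked embeddings; a trained model would produce these.
code_vectors = {
    "read_file_lines": [0.9, 0.1, 0.0],
    "parse_json_string": [0.1, 0.9, 0.1],
    "open_socket": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a query such as "read a text file line by line".
query_vector = [0.8, 0.2, 0.1]

results = search(query_vector, code_vectors)
print(results)
```

In this toy setting the query vector lies closest to the vector for `read_file_lines`, so that snippet is ranked first even though no keyword matching takes place.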
As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large-scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.