research-article

OCoR: an overlapping-aware code retriever

Authors:
Qihao Zhu (Peking University),
Zeyu Sun (Peking University),
Xiran Liang (Peking University),
Yingfei Xiong (Peking University),
Lu Zhang (Peking University)

ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, December 2020, Pages 883–894. https://doi.org/10.1145/3324884.3416530

Published: 27 January 2021

ABSTRACT

Code retrieval helps developers reuse code snippets from open-source projects. Given a natural language description, code retrieval aims to find the most relevant code among a set of code snippets. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. Overlaps between the names chosen by different people indicate that two different names may be related (e.g., "message" and "msg"), and overlaps between identifiers in code and words in natural language descriptions indicate that a code snippet and a description may be related.

To address this problem, we propose a novel neural architecture named OCoR, in which we introduce two specifically designed components to capture overlaps: the first embeds names character by character to capture the overlaps between names, and the second introduces a novel overlap matrix to represent the degree of overlap between each natural language word and each identifier.
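As a rough illustration of the second component, the sketch below builds a word-by-identifier overlap matrix. The abstract does not define how the degree of overlap is computed, so the measure used here (longest common subsequence length divided by the longer string's length) is an assumption chosen for illustration, and the function names `lcs_length`, `overlap_degree`, and `overlap_matrix` are hypothetical rather than taken from OCoR.

```python
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_degree(word: str, identifier: str) -> float:
    """Assumed overlap measure in [0, 1]; NOT the definition used in the paper."""
    word, identifier = word.lower(), identifier.lower()
    longest = max(len(word), len(identifier))
    if longest == 0:
        return 0.0
    return lcs_length(word, identifier) / longest


def overlap_matrix(words: List[str], identifiers: List[str]) -> List[List[float]]:
    """Entry [i][j] is the overlap between description word i and identifier j."""
    return [[overlap_degree(w, ident) for ident in identifiers] for w in words]


if __name__ == "__main__":
    description = ["send", "message"]   # words from a natural language query
    code_names = ["msg", "send"]        # identifiers from a code snippet
    for word, row in zip(description, overlap_matrix(description, code_names)):
        print(word, [round(v, 2) for v in row])
```

Under this assumed measure, "message" and "msg" share the subsequence "msg", so they receive a nonzero score even though neither name is a token of the other. A neural model could consume such a matrix as an extra input feature alongside learned embeddings.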

We evaluated OCoR on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches, improving on them by 13.1% to 22.3%. We also conducted several in-depth experiments to help understand the contribution of the different components of OCoR.

Published in
ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering
December 2020
1449 pages
ISBN: 9781450367684
DOI: 10.1145/3324884
General Chair: John Grundy
Program Chairs: Claire Le Goues, David Lo
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 January 2021
Author Tags
code retrieval
neural network
overlap
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 82 of 337 submissions, 24%