A Survey of Source Code Search: A 3-Dimensional Perspective

Published: 28 June 2024

Abstract

Source code search has attracted wide attention from software engineering researchers because it can improve the productivity and quality of software development. Given a functionality requirement, usually described in a natural language sentence, a code search system retrieves code snippets that satisfy the requirement from a large-scale code corpus, e.g., GitHub. Many techniques have been proposed to make code search effective and efficient. These techniques improve code search performance mainly by optimizing three core components: the query understanding component, the code understanding component, and the query-code matching component. In this article, we provide a 3-dimensional perspective survey of code search. Specifically, we categorize existing code search studies into query-end optimization techniques, code-end optimization techniques, and match-end optimization techniques according to the components they optimize. Each technique is proposed to enhance the performance of a specific component and thereby the overall performance of code search. Because each end can be optimized independently and contributes to code search performance, we treat each end as a dimension. This survey is therefore 3-dimensional in nature, and it summarizes each dimension in detail. To understand the research trends of the three dimensions in existing code search studies, we systematically review 68 relevant studies. Unlike existing code search surveys, which either focus only on the query end or the code end or cover various aspects shallowly (including codebase, evaluation metrics, modeling technique, etc.), our survey provides a more nuanced analysis and review of the evolution and development of the underlying techniques used in the three ends. Based on a systematic review and summary of existing work, we outline several open challenges and opportunities at the three ends that remain to be addressed in future work.
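
As a rough illustration of the three components named in the abstract, the following minimal sketch wires a query-end step, a code-end step, and a match-end step into one toy pipeline. It is not taken from the survey: the function names (understand_query, understand_code, match, search), the two-snippet corpus, and the use of bag-of-words cosine similarity are all assumptions chosen for brevity, standing in for the far richer techniques the surveyed studies propose.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens, breaking camelCase and snake_case."""
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def understand_query(query):
    """Query end (hypothetical): map a natural-language query to term counts."""
    return Counter(tokenize(query))


def understand_code(snippet):
    """Code end (hypothetical): map a code snippet to term counts."""
    return Counter(tokenize(snippet))


def match(query_vec, code_vec):
    """Match end (hypothetical): cosine similarity of two term-count vectors."""
    dot = sum(count * code_vec[term] for term, count in query_vec.items())
    q_norm = math.sqrt(sum(v * v for v in query_vec.values()))
    c_norm = math.sqrt(sum(v * v for v in code_vec.values()))
    if q_norm == 0 or c_norm == 0:
        return 0.0
    return dot / (q_norm * c_norm)


def search(query, corpus, top_k=1):
    """Rank every snippet in the corpus against the query and keep the top_k."""
    query_vec = understand_query(query)
    ranked = sorted(
        corpus,
        key=lambda snippet: match(query_vec, understand_code(snippet)),
        reverse=True,
    )
    return ranked[:top_k]


# Toy corpus (assumption): two tiny snippets standing in for a large codebase.
corpus = [
    "def read_file(path):\n    return open(path).read()",
    "def sort_list(items):\n    return sorted(items)",
]

best = search("read a file from path", corpus)[0]
```

Because the three steps are separate functions, each can be swapped out on its own (for example, replacing understand_query with a query-expansion step) without touching the other two, which mirrors the abstract's point that each end can be optimized independently.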

    Published In

    ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 6
    July 2024
    951 pages
    EISSN: 1557-7392
    DOI: 10.1145/3613693
    Editor: Mauro Pezzé

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 June 2024
    Online AM: 06 April 2024
    Accepted: 21 March 2024
    Revised: 16 February 2024
    Received: 12 November 2023
    Published in TOSEM Volume 33, Issue 6

    Author Tags

    1. Source code search
    2. deep learning
    3. query-end optimization
    4. code-end optimization
    5. match-end optimization

    Qualifiers

    • Survey

    Funding Sources

    • National Natural Science Foundation of China
    • Science, Technology, and Innovation Commission of Shenzhen Municipality
    • Program B for Outstanding PhD Candidate of Nanjing University
    • National Research Foundation, Singapore
    • Cyber Security Agency under its National Cybersecurity R&D Programme
