CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework

Authors

  • Rui Li: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
  • Liyang He: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
  • Qi Liu: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
  • Yuze Zhao: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
  • Zheng Zhang: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
  • Zhenya Huang: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
  • Yu Su: School of Computer Science and Artificial Intelligence, Hefei Normal University
  • Shijin Wang: State Key Laboratory of Cognitive Intelligence; iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd.

DOI:

https://doi.org/10.1609/aaai.v38i8.28713

Keywords:

DMKM: Applications, NLP: Other

Abstract

Multilingual code retrieval aims to find code snippets relevant to a user's query from a multilingual codebase. It plays a crucial role in software development and covers broader application scenarios than classical monolingual code retrieval. Despite the performance improvements achieved by previous studies, two crucial problems are overlooked in the multilingual scenario. First, certain programming languages face data scarcity in specific domains, resulting in limited representation capabilities within those domains. Second, different programming languages can be used interchangeably within the same domain, making it challenging for multilingual models to accurately identify the intended programming language of a user's query. To address these issues, we propose the CommONalities and SpecIalties Driven Multilingual CodE Retrieval Framework (CONSIDER), which includes two modules. The first module enhances the representation of various programming languages by modeling pairwise and global commonalities among them. The second module introduces a novel negative sampling algorithm for contrastive learning that leverages language confusion to automatically extract language-specific features. Our experiments confirm the significant benefits of our model across various aspects of real-world multilingual code retrieval scenarios. Furthermore, an additional evaluation demonstrates that the proposed CONSIDER framework is also effective in monolingual scenarios. Our source code is available at https://github.com/smsquirrel/consider.

Published

2024-03-24

How to Cite

Li, R., He, L., Liu, Q., Zhao, Y., Zhang, Z., Huang, Z., Su, Y., & Wang, S. (2024). CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework. Proceedings of the AAAI Conference on Artificial Intelligence, 38(8), 8679-8687. https://doi.org/10.1609/aaai.v38i8.28713

Issue

Vol. 38 No. 8 (2024)

Section

AAAI Technical Track on Data Mining & Knowledge Management