CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework

Rui Li; Liyang He; Qi Liu; Yuze Zhao; Zheng Zhang; Zhenya Huang; Yu Su; Shijin Wang

doi:10.1609/aaai.v38i8.28713

Authors

Rui Li: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
Liyang He: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
Qi Liu: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
Yuze Zhao: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
Zheng Zhang: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
Zhenya Huang: Anhui Province Key Laboratory of Big Data Analysis and Application; School of Computer Science and Technology, University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
Yu Su: School of Computer Science and Artificial Intelligence, Hefei Normal University
Shijin Wang: State Key Laboratory of Cognitive Intelligence; iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd.

Keywords:

DMKM: Applications, NLP: Other

Abstract

Multilingual code retrieval aims to find code snippets relevant to a user's query in a multilingual codebase. It plays a crucial role in software development and covers broader application scenarios than classical monolingual code retrieval. Despite the performance improvements achieved by previous studies, two crucial problems have been overlooked in the multilingual scenario. First, certain programming languages face data scarcity in specific domains, which limits their representation capability in those domains. Second, different programming languages can be used interchangeably within the same domain, making it difficult for multilingual models to identify the programming language that a user's query is intended for. To address these issues, we propose the CommONalities and SpecIalties Driven Multilingual CodE Retrieval Framework (CONSIDER), which includes two modules. The first module enhances the representation of various programming languages by modeling pairwise and global commonalities among them. The second module introduces a novel negative sampling algorithm for contrastive learning that leverages language confusion to automatically extract language-specific features. Our experiments confirm significant benefits of our model in real-world multilingual code retrieval scenarios across several aspects. A further evaluation shows that the proposed CONSIDER framework is also effective in monolingual scenarios. Our source code is available at https://github.com/smsquirrel/consider.
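
To make the idea of confusion-aware negative sampling more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that each query and snippet is already embedded as a plain list of floats, that "confusable" negatives are the snippets in other languages closest to the query, and that training uses a standard InfoNCE contrastive loss. The function names, the toy vectors, and the temperature value are all hypothetical; the actual method is in the paper and the linked repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_confusable_negatives(query_vec, target_lang, corpus, k=2):
    """Return the k snippets written in a language other than target_lang
    that are most similar to the query (assumed proxy for language confusion).

    corpus is a list of (language, vector, snippet_id) tuples.
    """
    others = [item for item in corpus if item[0] != target_lang]
    others.sort(key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return others[:k]


def info_nce(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """Standard InfoNCE loss for one query: -log softmax of the positive."""
    logits = [cosine(query_vec, positive_vec) / temperature]
    logits += [cosine(query_vec, n) / temperature for n in negative_vecs]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy data: a query that is meant to retrieve Python code.
query = [1.0, 0.0]
corpus = [
    ("python", [1.0, 0.0], "py_a"),
    ("java", [0.9, 0.1], "java_a"),
    ("go", [0.0, 1.0], "go_a"),
    ("java", [0.5, 0.5], "java_b"),
]

negatives = pick_confusable_negatives(query, "python", corpus, k=2)
hard_loss = info_nce(query, [1.0, 0.0], [vec for _, vec, _ in negatives])
easy_loss = info_nce(query, [1.0, 0.0], [[0.0, 1.0]])
```

In this toy setting the negatives drawn from a confusable language yield a larger loss than an unrelated negative, which is the usual motivation for hard negative sampling: the model receives a stronger signal to separate snippets that differ mainly in programming language.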
