DOI: 10.1145/3609437.3609456

MCodeSearcher: Multi-View Contrastive Learning for Code Search

Published: 05 October 2023

ABSTRACT

Code search is a critical software development activity that helps developers retrieve a suitable code snippet from open-source repositories given a user intent. In recent years, large-scale pre-trained models have shown impressive performance on code representation learning and have achieved state-of-the-art results on the code search task. However, these models struggle to distinguish functionally equivalent code snippets with dissimilar implementations, as well as non-equivalent code snippets that look similar. Given the diversity of code implementations, a code search engine must identify the functional similarities and dissimilarities of source code so as to return functionally matched code for a given query. In addition, existing pre-trained models focus mainly on learning semantic representations of code snippets, and the semantic correlation between code snippets and natural language queries is not sufficiently exploited. An effective code search tool needs not only to understand the relationship between queries and code snippets but also to identify the relationships among diverse code snippets. To address these limitations, we propose MCodeSearcher, a novel multi-view contrastive learning model for code retrieval that aims to fully exploit (1) the semantic correlation between queries and code snippets and (2) the relationship between functionally equivalent code snippets. To achieve this, we design contrastive training objectives from three views and pre-train our model with these objectives. Experimental results on five representative code search datasets show that our approach significantly outperforms state-of-the-art methods.
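The abstract does not spell out the three contrastive views, so the following Python sketch is only an illustration of how query-code and code-code contrastive objectives of this kind are commonly combined (an InfoNCE-style loss with in-batch negatives). The function names, the temperature value, and the particular choice of views are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, assuming an InfoNCE-style contrastive objective with
# in-batch negatives; not the paper's actual training code.
import torch
import torch.nn.functional as F


def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: row i of `anchor` matches row i of `positive`;
    every other row in the batch serves as a negative."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # diagonal entries are the positives
    return F.cross_entropy(logits, targets)


def multi_view_loss(query_emb, code_emb, equiv_code_emb):
    """Combine contrastive terms over different views (hypothetical decomposition):
    query <-> code, code <-> query, and code <-> functionally equivalent code."""
    qc = info_nce(query_emb, code_emb)        # query-to-code view
    cq = info_nce(code_emb, query_emb)        # code-to-query view (symmetric direction)
    cc = info_nce(code_emb, equiv_code_emb)   # code-to-code view (functional equivalence)
    return qc + cq + cc


if __name__ == "__main__":
    # Random tensors stand in for encoder outputs purely for illustration;
    # in practice the embeddings would come from a pre-trained code encoder.
    B, D = 8, 256
    q, c, c_eq = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    print(multi_view_loss(q, c, c_eq).item())
```

In such a setup, the code-to-code term is what pulls functionally equivalent but syntactically different snippets together in the embedding space, while the query-code terms align natural language intents with matching implementations.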



Published in

Internetware '23: Proceedings of the 14th Asia-Pacific Symposium on Internetware
August 2023
332 pages
ISBN: 9798400708947
DOI: 10.1145/3609437

            Copyright © 2023 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 5 October 2023


            Qualifiers

            • research-article
            • Research
            • Refereed limited

            Acceptance Rates

Overall Acceptance Rate: 55 of 111 submissions, 50%
