DOI: 10.1145/3661167.3661233
Research article · Open access

Leveraging Statistical Machine Translation for Code Search

Published: 18 June 2024

Abstract

Machine Translation (MT) has numerous applications in Software Engineering (SE). Recently, it has been employed not only for programming language translation but also as an oracle that derives information for various SE research problems. In these applications, MT's impact is assessed with task-specific accuracy metrics rather than traditional translation evaluation metrics. For code search, a recent work, ASTTrans, introduced an MT-based model that extracts relevant non-terminal nodes from the Abstract Syntax Tree (AST) of an implementation based on its natural language description. While ASTTrans demonstrated that MT can enhance code search on small datasets with low embedding dimensions, it failed to improve code search accuracy on the standard benchmark CodeSearchNet.
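To make the idea of "non-terminal AST nodes" concrete, here is a minimal sketch using Python's standard ast module (an illustration of the general concept only; ASTTrans's actual parser and node-selection scheme are not shown in this abstract). It lists the types of AST nodes that have at least one AST child:

    import ast

    def nonterminal_node_types(code: str) -> list[str]:
        """Return the types of AST nodes that have at least one AST child."""
        kinds = []
        for node in ast.walk(ast.parse(code)):
            # A node is non-terminal if iterating its children yields anything.
            if next(ast.iter_child_nodes(node), None) is not None:
                kinds.append(type(node).__name__)
        return kinds

    print(nonterminal_node_types("def add(a, b):\n    return a + b"))
    # ['Module', 'FunctionDef', 'arguments', 'Return', 'BinOp', 'Name', 'Name']
    # (each Name appears because its load/store context is itself an AST child)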
In this work, we present Oracle4CS, a novel approach that integrates classical Statistical Machine Translation (SMT) to support modern code search models. To this end, we introduce a new code representation, ASTSum, which summarizes each code snippet with a small number of AST nodes. We also devise a new approach to code search that replaces each natural language query with a representation incorporating the output of our query-to-ASTSum translation. Our experiments show that Oracle4CS improves code search performance for both the original BERT-based model UniXcoder and the optimized BERT-based model CoCoSoDa, by up to 1.18% and 2% in Mean Reciprocal Rank (MRR), respectively, across eight well-known datasets. We also find that ASTSum is a promising code representation for code search in its own right, improving MRR by over 17% on average when paired with an optimal SMT model for query-to-ASTSum translation.
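Two of the moving parts above can be sketched briefly. The query augmentation below is hypothetical (the concatenation scheme and the translate_to_astsum stub are illustrative assumptions, not the paper's exact representation), while the Mean Reciprocal Rank computation follows the standard definition:

    def mean_reciprocal_rank(ranks: list[int]) -> float:
        """Standard MRR: ranks[i] is the 1-based rank at which the correct
        snippet is retrieved for query i (0 if it is never retrieved)."""
        return sum(1.0 / r for r in ranks if r > 0) / len(ranks)

    # Hypothetical query augmentation in the spirit of Oracle4CS: append the
    # SMT-predicted ASTSum sequence to the natural language query before
    # encoding it with the retrieval model (UniXcoder, CoCoSoDa, ...).
    def augment_query(query: str, translate_to_astsum) -> str:
        astsum = translate_to_astsum(query)  # e.g. "FunctionDef Return BinOp"
        return f"{query} {astsum}"

    # Worked example: ground-truth snippets retrieved at ranks 1, 3, and 2
    # give MRR = (1/1 + 1/3 + 1/2) / 3 ≈ 0.611.
    print(mean_reciprocal_rank([1, 3, 2]))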

Supplemental Material

MP4 File
Video presentation of the paper.


Cited By

  • (2025) Promises and perils of using Transformer-based models for SE research. Neural Networks 184, 107067. Online publication date: Apr-2025. https://doi.org/10.1016/j.neunet.2024.107067
  • (2024) C2B: A Semantic Source Code Retrieval Model Using CodeT5 and Bi-LSTM. Applied Sciences 14(13), 5795. Online publication date: 2-Jul-2024. https://doi.org/10.3390/app14135795



    Published In

    EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
    June 2024, 728 pages
    ISBN: 9798400717017
    DOI: 10.1145/3661167
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Abstract Syntax Tree
    2. Information Retrieval
    3. Statistical Machine Translation

    Qualifiers

    • Research-article
    • Research
    • Refereed limited


    Funding Sources

    • National Science Foundation

    Conference

    EASE 2024

    Acceptance Rates

    Overall Acceptance Rate 71 of 232 submissions, 31%

    Article Metrics

    • Downloads (last 12 months): 370
    • Downloads (last 6 weeks): 87
    Reflects downloads up to 05 Mar 2025
