DOI: 10.1145/3661167.3661233
Research article · Open access

Leveraging Statistical Machine Translation for Code Search

Published: 18 June 2024

Abstract

Machine Translation (MT) has numerous applications in Software Engineering (SE). Recently, it has been employed not only for programming language translation but also as an oracle that derives information for various SE research problems. In these applications, MT's impact is assessed with task-specific accuracy metrics rather than traditional translation evaluation metrics. For code search, a recent work, ASTTrans, introduced an MT-based model that extracts relevant non-terminal nodes from the Abstract Syntax Tree (AST) of an implementation based on its natural language description. While ASTTrans demonstrated that MT can enhance code search on small datasets with low embedding dimensions, it failed to improve code search accuracy on the standard benchmark CodeSearchNet.
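To make the idea of "non-terminal AST nodes" concrete, here is a minimal sketch using Python's standard ast module (an illustration of the general concept only; ASTTrans's actual parser and node-selection scheme are not shown in this abstract). It lists the types of AST nodes that have at least one AST child:

    import ast

    def nonterminal_node_types(code: str) -> list[str]:
        """Return the types of AST nodes that have at least one AST child."""
        kinds = []
        for node in ast.walk(ast.parse(code)):
            # A node is non-terminal if iterating its children yields anything.
            if next(ast.iter_child_nodes(node), None) is not None:
                kinds.append(type(node).__name__)
        return kinds

    print(nonterminal_node_types("def add(a, b):\n    return a + b"))
    # ['Module', 'FunctionDef', 'arguments', 'Return', 'BinOp', 'Name', 'Name']
    # (each Name appears because its load/store context is itself an AST child)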
In this work, we present Oracle4CS, a novel approach that integrates classical Statistical Machine Translation (SMT) to support modern code search models. To this end, we introduce a new code representation, ASTSum, which summarizes each code snippet with a small number of AST nodes. We also devise a new approach to code search that replaces each natural language query with a representation incorporating the output of our query-to-ASTSum translation. Our experiments show that Oracle4CS improves code search performance for both the original BERT-based model UniXcoder and the optimized BERT-based model CoCoSoDa, by up to 1.18% and 2% in Mean Reciprocal Rank (MRR), respectively, across eight well-known datasets. We also find that ASTSum is a promising code representation for code search in its own right, improving MRR by over 17% on average when paired with an optimal SMT model for query-to-ASTSum translation.
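Two of the moving parts above can be sketched briefly. The query augmentation below is hypothetical (the concatenation scheme and the translate_to_astsum stub are illustrative assumptions, not the paper's exact representation), while the Mean Reciprocal Rank computation follows the standard definition:

    def mean_reciprocal_rank(ranks: list[int]) -> float:
        """Standard MRR: ranks[i] is the 1-based rank at which the correct
        snippet is retrieved for query i (0 if it is never retrieved)."""
        return sum(1.0 / r for r in ranks if r > 0) / len(ranks)

    # Hypothetical query augmentation in the spirit of Oracle4CS: append the
    # SMT-predicted ASTSum sequence to the natural language query before
    # encoding it with the retrieval model (UniXcoder, CoCoSoDa, ...).
    def augment_query(query: str, translate_to_astsum) -> str:
        astsum = translate_to_astsum(query)  # e.g. "FunctionDef Return BinOp"
        return f"{query} {astsum}"

    # Worked example: ground-truth snippets retrieved at ranks 1, 3, and 2
    # give MRR = (1/1 + 1/3 + 1/2) / 3 ≈ 0.611.
    print(mean_reciprocal_rank([1, 3, 2]))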

Supplemental Material

MP4 File
Video presentation of the paper.


Cited By

  • (2025) Promises and perils of using Transformer-based models for SE research. Neural Networks 184, 107067. Online publication date: Apr-2025. https://doi.org/10.1016/j.neunet.2024.107067
  • (2024) C2B: A Semantic Source Code Retrieval Model Using CodeT5 and Bi-LSTM. Applied Sciences 14(13), 5795. Online publication date: 2-Jul-2024. https://doi.org/10.3390/app14135795



    Published In

    EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
    June 2024, 728 pages
    ISBN: 9798400717017
    DOI: 10.1145/3661167
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Abstract Syntax Tree
    2. Information Retrieval
    3. Statistical Machine Translation

    Qualifiers

    • Research-article
    • Research
    • Refereed limited


    Funding Sources

    • National Science Foundation

    Conference

    EASE 2024

    Acceptance Rates

    Overall Acceptance Rate 71 of 232 submissions, 31%

    Article Metrics

    • Downloads (last 12 months): 370
    • Downloads (last 6 weeks): 87
    Reflects downloads up to 05 Mar 2025
