
Is neural machine translation approach accurate enough for coding assistance?

Published: 17 October 2021

Abstract

Coding assistance with deep learning is an emerging topic that has recently attracted much attention in the software development community. To integrate deep learning into coding assistance compactly, we focus on neural machine translation (NMT), which lets users translate natural-language descriptions into expressions in a programming language such as Python. A key obstacle is the limited availability of parallel corpora, which are essential for training better NMT models.
To overcome this problem, we propose transcompiler-based back-translation, a data augmentation method that generates parallel corpora from numerous source code repositories. In this paper, we present initial experimental results comparing several NMT models built on existing corpora and on ours. The resulting BLEU scores indicate that our proposed model is accurate enough to support coding assistance in the future.
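The abstract reports BLEU scores as the accuracy measure. As a rough, self-contained illustration of how sentence-level BLEU is computed (uniform n-gram weights up to 4, brevity penalty; real evaluations typically use smoothed, corpus-level tools such as sacreBLEU rather than this unsmoothed sketch):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipped n-gram matches: a candidate n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An exact match scores 1.0; changing a single token usually collapses the unsmoothed score to 0.0 because some higher-order n-gram precision becomes zero, which is why smoothed variants are preferred for short code snippets.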

Supplementary Material

Auxiliary Presentation Video (splashws21bcncmain-p97-p-video.mp4)
This is a presentation video of our talk at BCNC 2021. In the presentation, we focus on the neural machine translation approach to coding assistance. To overcome the shortage of parallel corpora, we developed a generative approach that enlarges the corpus in both quality and quantity. Our experiments show that NMT can be accurate enough for code generation.
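The generative idea behind transcompiler-based back-translation is to derive natural-language descriptions mechanically from existing code, yielding synthetic (description, code) training pairs. The paper's actual transcompiler is not reproduced on this page; the rule set below (`describe_assignment`, `OP_WORDS`) is a hypothetical stand-in that only handles simple binary-operator assignments, purely to illustrate the pairing step:

```python
import ast

# Hypothetical rules mapping AST operator nodes to English phrases.
OP_WORDS = {ast.Add: "sum", ast.Sub: "difference", ast.Mult: "product"}

def describe_assignment(source: str) -> str:
    """Toy code-to-description step: render `x = a <op> b` in English."""
    stmt = ast.parse(source).body[0]
    assert isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.BinOp)
    target = stmt.targets[0].id
    op = OP_WORDS[type(stmt.value.op)]
    left = ast.unparse(stmt.value.left)
    right = ast.unparse(stmt.value.right)
    return f"assign the {op} of {left} and {right} to {target}"

# Mining snippets from repositories (here just a toy list) then pairing
# each with its generated description yields a synthetic parallel corpus.
snippets = ["x = a + b", "y = n * 2"]
corpus = [(describe_assignment(s), s) for s in snippets]
```

Each pair in `corpus` can then be fed to an NMT model as (source sentence, target code), which is how a transcompiler can enlarge a parallel corpus without manual annotation.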


Cited By

  • (2024) Training AI Model that Suggests Python Code from Student Requests in Natural Language. Journal of Information Processing, 32, 69–76. https://doi.org/10.2197/ipsjjip.32.69
  • (2023) KOGI: A Seamless Integration of ChatGPT into Jupyter Environments for Programming Education. In Proceedings of the 2023 ACM SIGPLAN International Symposium on SPLASH-E, 50–59. https://doi.org/10.1145/3622780.3623648
  • (2023) Who evaluates the evaluators? On automatic metrics for assessing AI-based offensive code generators. Expert Systems with Applications, 225:C. https://doi.org/10.1016/j.eswa.2023.120073

Published In

BCNC 2021: Proceedings of the 1st ACM SIGPLAN International Workshop on Beyond Code: No Code
October 2021
35 pages
ISBN:9781450391252
DOI:10.1145/3486949

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. back-translation
  2. code generation
  3. neural machine translation

Qualifiers

  • Research-article

Conference

SPLASH '21
Sponsor: SPLASH '21: Software for Humanity
October 17, 2021
Chicago, IL, USA

