
Is neural machine translation approach accurate enough for coding assistance?

Published: 17 October 2021

Abstract

Coding assistance with deep learning is an emerging topic that has recently attracted much attention in the software development community. To integrate deep learning into coding assistance compactly, we focus on neural machine translation (NMT), which lets users translate natural-language descriptions into expressions in a programming language such as Python. A key obstacle is the limited availability of parallel corpora, which are essential for training better NMT models.
To overcome this problem, we propose transcompiler-based back-translation, a data augmentation method that generates parallel corpora from numerous source code repositories. In this paper, we present initial experimental results comparing several NMT models built on existing corpora and on ours. The resulting BLEU scores indicate that our proposed model is accurate enough to support coding assistance in the future.
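The abstract reports BLEU scores as the accuracy measure. As a rough, self-contained illustration of how sentence-level BLEU is computed (uniform n-gram weights up to 4, brevity penalty; real evaluations typically use smoothed, corpus-level tools such as sacreBLEU rather than this unsmoothed sketch):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipped n-gram matches: a candidate n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An exact match scores 1.0; changing a single token usually collapses the unsmoothed score to 0.0 because some higher-order n-gram precision becomes zero, which is why smoothed variants are preferred for short code snippets.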

Supplementary Material

Auxiliary Presentation Video (splashws21bcncmain-p97-p-video.mp4)
This is a presentation video of our talk at BCNC 2021. In the presentation, we focus on the neural machine translation approach to coding assistance. To overcome the shortage of parallel corpora, we developed a generative approach that enlarges the corpus in both quality and quantity. Our experiments show that NMT can be accurate enough for code generation.
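The generative idea behind transcompiler-based back-translation is to derive natural-language descriptions mechanically from existing code, yielding synthetic (description, code) training pairs. The paper's actual transcompiler is not reproduced on this page; the rule set below (`describe_assignment`, `OP_WORDS`) is a hypothetical stand-in that only handles simple binary-operator assignments, purely to illustrate the pairing step:

```python
import ast

# Hypothetical rules mapping AST operator nodes to English phrases.
OP_WORDS = {ast.Add: "sum", ast.Sub: "difference", ast.Mult: "product"}

def describe_assignment(source: str) -> str:
    """Toy code-to-description step: render `x = a <op> b` in English."""
    stmt = ast.parse(source).body[0]
    assert isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.BinOp)
    target = stmt.targets[0].id
    op = OP_WORDS[type(stmt.value.op)]
    left = ast.unparse(stmt.value.left)
    right = ast.unparse(stmt.value.right)
    return f"assign the {op} of {left} and {right} to {target}"

# Mining snippets from repositories (here just a toy list) then pairing
# each with its generated description yields a synthetic parallel corpus.
snippets = ["x = a + b", "y = n * 2"]
corpus = [(describe_assignment(s), s) for s in snippets]
```

Each pair in `corpus` can then be fed to an NMT model as (source sentence, target code), which is how a transcompiler can enlarge a parallel corpus without manual annotation.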


Cited By

  • (2024) Training AI Model that Suggests Python Code from Student Requests in Natural Language. Journal of Information Processing, 32, 69–76. https://doi.org/10.2197/ipsjjip.32.69
  • (2023) KOGI: A Seamless Integration of ChatGPT into Jupyter Environments for Programming Education. In Proceedings of the 2023 ACM SIGPLAN International Symposium on SPLASH-E, 50–59. https://doi.org/10.1145/3622780.3623648
  • (2023) Who evaluates the evaluators? On automatic metrics for assessing AI-based offensive code generators. Expert Systems with Applications, 225:C. https://doi.org/10.1016/j.eswa.2023.120073

Published In

BCNC 2021: Proceedings of the 1st ACM SIGPLAN International Workshop on Beyond Code: No Code
October 2021
35 pages
ISBN:9781450391252
DOI:10.1145/3486949

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. back-translation
  2. code generation
  3. neural machine translation

Qualifiers

  • Research-article

Conference

SPLASH '21
Sponsor: SPLASH '21: Software for Humanity
October 17, 2021
Chicago, IL, USA

