Are NLP Metrics Suitable for Evaluating Generated Code?

Takaichi, Riku; Higo, Yoshiki; Matsumoto, Shinsuke; Kusumoto, Shinji; Kurabayashi, Toshiyuki; Kirinuki, Hiroyuki; Tanno, Haruto

doi:10.1007/978-3-031-21388-5_38

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13709)

Included in the following conference series:

International Conference on Product-Focused Software Process Improvement

Abstract

Code generation is a technique that generates program source code without human intervention. Automated methods for writing code, including code generation, have been studied extensively. However, many techniques are still in their infancy and often generate syntactically incorrect code. Therefore, automated metrics from natural language processing (NLP) are sometimes used to evaluate existing code generation techniques. At present, it is unclear which NLP metrics are more suitable than others for evaluating generated code. In this study, we clarify which NLP metrics are applicable to syntactically incorrect code and suitable for evaluating techniques that automatically generate code. Our results show that METEOR is the best of the automated metrics compared in this study.
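
As an illustration of why token-overlap metrics can be applied to code that does not parse, the following is a minimal sketch of a clipped n-gram precision, the building block of BLEU-style scores. It is not the authors' evaluation setup and it is not METEOR. The tokenizer, the function names, and the example snippets are assumptions made for this sketch.

```python
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    """Split code into identifiers, numbers, and single punctuation characters.

    This regex tokenizer is an assumption for illustration; it never parses
    the code, so syntactically incorrect input is handled like any other text.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every contiguous window of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference
    (the "clipping" used by BLEU-style metrics).
    """
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical example pair: the generated snippet is missing ")" and does not parse.
reference = "def add(a, b): return a + b"
generated = "def add(a, b: return a + b"

print(clipped_precision(generated, reference, n=1))  # 1.0
print(clipped_precision(generated, reference, n=2))  # 0.9
```

In this sketch the unigram score cannot see the missing parenthesis, while the bigram score drops slightly. Differences of this kind between metrics are what makes the choice of metric matter when the generated code is syntactically incorrect.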

Notes

1. https://github.com/nazim1021/neural-machine-translation-using-gan

Acknowledgements

This research was supported by JSPS KAKENHI, Japan (grant numbers JP20H04166, JP21K18302, JP21K11820, JP21H04877, JP22H03567, and JP22K11985).

Author information

Authors and Affiliations

Graduate School of Information Science and Technology, Osaka University, Suita, Osaka, Japan
Riku Takaichi, Yoshiki Higo, Shinsuke Matsumoto & Shinji Kusumoto
Nippon Telegraph and Telephone Corporation, Minato, Tokyo, Japan
Toshiyuki Kurabayashi, Hiroyuki Kirinuki & Haruto Tanno

Corresponding author

Correspondence to Riku Takaichi.

Editor information

Editors and Affiliations

Tampere University, Tampere, Finland
Davide Taibi
Reutlingen University, Reutlingen, Germany
Marco Kuhrmann
University of Jyväskylä, Jyväskylä, Finland
Tommi Mikkonen
Leibniz University Hannover, Hannover, Germany
Jil Klünder
University of Jyväskylä, Jyväskylä, Finland
Pekka Abrahamsson

About this paper

Cite this paper

Takaichi, R. et al. (2022). Are NLP Metrics Suitable for Evaluating Generated Code? In: Taibi, D., Kuhrmann, M., Mikkonen, T., Klünder, J., Abrahamsson, P. (eds) Product-Focused Software Process Improvement. PROFES 2022. Lecture Notes in Computer Science, vol 13709. Springer, Cham. https://doi.org/10.1007/978-3-031-21388-5_38

DOI: https://doi.org/10.1007/978-3-031-21388-5_38
Published: 14 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21387-8
Online ISBN: 978-3-031-21388-5
eBook Packages: Computer Science, Computer Science (R0)
