Abstract
We study the ability of pretrained large language models (LLMs) to answer questions from online question answering forums such as Stack Overflow. We consider question–answer pairs where the main part of the answer consists of source code. On two benchmark datasets—CoNaLa and a newly collected dataset based on Stack Overflow—we investigate how a closed-book question answering system can be improved by fine-tuning the LLM for the downstream task, prompt engineering, and data preprocessing. Using publicly available autoregressive language models such as GPT-Neo, CodeGen, and PanGu-Coder, we achieve a BLEU score of 0.4432 on the CoNaLa test set after the proposed fine-tuning, significantly exceeding the previous state of the art for this task.
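The abstract describes fine-tuning publicly available autoregressive models on question–answer pairs and evaluating with BLEU on CoNaLa. As a minimal sketch of what such a setup could look like (not the authors' actual pipeline: their prompt template, preprocessing, and hyperparameters are not given in this excerpt), the following assumes the Hugging Face transformers, datasets, and sacrebleu libraries, a small GPT-Neo checkpoint, and a hypothetical comment-style prompt format:

```python
# Illustrative sketch only, NOT the authors' pipeline: every concrete choice
# below (checkpoint, prompt template, training settings) is an assumption
# made for the sake of a small runnable example.
import sacrebleu
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "EleutherAI/gpt-neo-125M"  # small GPT-Neo checkpoint for a demo

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# CoNaLa pairs a natural-language "intent" with a Python "snippet".
dataset = load_dataset("neulab/conala", "curated")

def to_features(example):
    # Hypothetical prompt template; the paper's actual template is not shown.
    text = (f"# Question: {example['intent']}\n# Answer:\n"
            f"{example['snippet']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=256)

train = dataset["train"].map(to_features,
                             remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-neo-conala",
                           num_train_epochs=3,
                           per_device_train_batch_size=8,
                           learning_rate=5e-5),
    train_dataset=train,
    # mlm=False yields plain causal-LM (next-token prediction) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Generate an answer for one test question and score it with BLEU.
model.eval()
prompt = f"# Question: {dataset['test'][0]['intent']}\n# Answer:\n"
ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
out = model.generate(ids, max_new_tokens=64,
                     pad_token_id=tokenizer.eos_token_id)
hypothesis = tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
reference = dataset["test"][0]["snippet"]
print(sacrebleu.corpus_bleu([hypothesis], [[reference]]).score)
```

Note that sacrebleu reports BLEU on a 0-100 scale, whereas the 0.4432 above is on a 0-1 scale, and the exact BLEU variant used for the CoNaLa benchmark may differ from this sketch.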
Acknowledgements
The work of Sergey Nikolenko was carried out within the framework of the strategic project "Digital Business" under the Strategic Academic Leadership Program "Priority 2030" at NUST MISiS.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lomshakov, V., Kovalchuk, S., Omelchenko, M., Nikolenko, S., Aliev, A. (2023). Fine-Tuning Large Language Models for Answering Programming Questions with Code Snippets. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_15
Print ISBN: 978-3-031-36020-6
Online ISBN: 978-3-031-36021-3
eBook Packages: Computer Science (R0)