Abstract
We study the ability of pretrained large language models (LLMs) to answer questions from online question answering forums such as Stack Overflow. We consider question–answer pairs where the main part of the answer consists of source code. On two benchmark datasets—CoNaLa and a newly collected dataset based on Stack Overflow—we investigate how a closed-book question answering system can be improved by fine-tuning the LLM for the downstream task, prompt engineering, and data preprocessing. Using publicly available autoregressive language models such as GPT-Neo, CodeGen, and PanGu-Coder, we achieve a BLEU score of 0.4432 on the CoNaLa test set after the proposed fine-tuning, significantly exceeding the previous state of the art for this task.
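The abstract describes fine-tuning publicly available autoregressive models on question–answer pairs and evaluating with BLEU on CoNaLa. As a minimal sketch of what such a setup could look like (not the authors' actual pipeline: their prompt template, preprocessing, and hyperparameters are not given in this excerpt), the following assumes the Hugging Face transformers, datasets, and sacrebleu libraries, a small GPT-Neo checkpoint, and a hypothetical comment-style prompt format:

```python
# Illustrative sketch only, NOT the authors' pipeline: every concrete choice
# below (checkpoint, prompt template, training settings) is an assumption
# made for the sake of a small runnable example.
import sacrebleu
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "EleutherAI/gpt-neo-125M"  # small GPT-Neo checkpoint for a demo

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# CoNaLa pairs a natural-language "intent" with a Python "snippet".
dataset = load_dataset("neulab/conala", "curated")

def to_features(example):
    # Hypothetical prompt template; the paper's actual template is not shown.
    text = (f"# Question: {example['intent']}\n# Answer:\n"
            f"{example['snippet']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=256)

train = dataset["train"].map(to_features,
                             remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-neo-conala",
                           num_train_epochs=3,
                           per_device_train_batch_size=8,
                           learning_rate=5e-5),
    train_dataset=train,
    # mlm=False yields plain causal-LM (next-token prediction) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Generate an answer for one test question and score it with BLEU.
model.eval()
prompt = f"# Question: {dataset['test'][0]['intent']}\n# Answer:\n"
ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
out = model.generate(ids, max_new_tokens=64,
                     pad_token_id=tokenizer.eos_token_id)
hypothesis = tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
reference = dataset["test"][0]["snippet"]
print(sacrebleu.corpus_bleu([hypothesis], [[reference]]).score)
```

Note that sacrebleu reports BLEU on a 0-100 scale, whereas the 0.4432 above is on a 0-1 scale, and the exact BLEU variant used for the CoNaLa benchmark may differ from this sketch.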
Acknowledgements
The work of Sergey Nikolenko was carried out within the framework of the strategic project "Digital Business" under the Strategic Academic Leadership Program "Priority 2030" at NUST MISiS.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lomshakov, V., Kovalchuk, S., Omelchenko, M., Nikolenko, S., Aliev, A. (2023). Fine-Tuning Large Language Models for Answering Programming Questions with Code Snippets. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_15
Print ISBN: 978-3-031-36020-6
Online ISBN: 978-3-031-36021-3
eBook Packages: Computer Science (R0)