Abstract
Current software development agents built on large language models (LLMs) are often defined heuristically, which can limit their flexibility and effectiveness. Moreover, the entry barriers for new researchers in this field are high, largely because of the complex infrastructure required to develop and optimize such agents. This paper proposes a new approach: modeling LLM-based software development agents as a partially observable Markov decision process (POMDP) to enable data-driven optimization. To support this approach, we introduce SEGym, a framework based on the Gym interface for reinforcement learning agents. SEGym simplifies the setup of optimization experiments for software development agents within the POMDP framework, lowering the barrier for researchers to enter the field.
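To make the POMDP framing concrete, the following minimal sketch shows what a Gym-style environment for issue fixing could look like: the hidden state (the full repository and the held-out tests) is only partially observable through the issue text and test feedback, actions are candidate patches, and the reward is the fraction of hidden tests that pass. This is an illustrative toy under assumed names (`RepairEnv`, `hidden_tests`), not the actual SEGym API.

```python
# Illustrative sketch only: RepairEnv and its fields are hypothetical
# and do not correspond to the actual SEGym interface.
import gymnasium as gym
from gymnasium import spaces


class RepairEnv(gym.Env):
    """Toy POMDP view of issue fixing: the true state (repository + hidden
    tests) is observed only through the issue text and test output."""

    def __init__(self, issue_text: str, hidden_tests):
        super().__init__()
        self.issue_text = issue_text
        self.hidden_tests = hidden_tests  # callables mapping a patch to True/False
        # Observations and actions are plain text: issue/test feedback and patches.
        self.observation_space = spaces.Text(max_length=10_000)
        self.action_space = spaces.Text(max_length=10_000)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.issue_text, {}

    def step(self, action: str):
        # `action` is a candidate patch produced by an LLM-based policy.
        passed = sum(test(action) for test in self.hidden_tests)
        reward = passed / len(self.hidden_tests)        # fraction of tests passing
        terminated = passed == len(self.hidden_tests)   # episode ends when all pass
        observation = f"{passed}/{len(self.hidden_tests)} tests passed"
        return observation, reward, terminated, False, {}
```

Any data-driven optimizer, whether classical reinforcement learning or prompt search, can then interact with such an environment purely through `reset()` and `step()`.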
Notes
- 10. It is not clear whether the sets of possible inputs and outputs of an LLM are in fact equal, but for the purposes of this paper, edge cases in which certain messages cannot be elicited from an LLM do not appear relevant.
- 11. Note that the results presented here are meant for illustration. They are not statistically significant, as we only run a single instance of our setup. We plan to perform an in-depth evaluation of SEGym on SWE-bench (lite) in future work.
- 12. Code and dataset are available online: https://github.com/kyrillschmid/SEGym.
References
Belzner, L., Gabor, T., Wirsing, M.: Large language model assisted software engineering: prospects, challenges, and a case study. In: International Conference on Bridging the Gap between AI and Reality, pp. 355–374. Springer (2023)
Cassano, F., et al.: MultiPL-E: a scalable and extensible approach to benchmarking neural code generation (2022)
Chang, Y., et al.: A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 15(3), March 2024. https://doi.org/10.1145/3641289
Chen, M., et al.: Evaluating large language models trained on code (2021)
Cheng, Y., et al.: Exploring large language model based intelligent agents: definitions, methods, and prospects (2024)
Du, M., Luu, A.T., Ji, B., Ng, S.K.: Mercury: an efficiency benchmark for LLM code synthesis (2024)
Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797 (2023)
Gioacchini, L., et al.: AgentQuest: a modular benchmark framework to measure progress and improve LLM agents (2024)
Guo, T., et al.: Large language model based multi-agents: a survey of progress and challenges (2024)
Guo, Z., et al.: Evaluating large language models: a comprehensive survey (2023)
Hendrycks, D., et al.: Measuring coding challenge competence with APPS (2021)
Hong, S., et al.: MetaGPT: meta programming for a multi-agent collaborative framework (2023)
Hou, X., et al.: Large language models for software engineering: a systematic literature review (2024)
Huang, D., Zhang, J.M., Luck, M., Bu, Q., Qing, Y., Cui, H.: AgentCoder: multi-agent-based code generation with iterative testing and optimisation (2024)
Huang, D., Zhang, J.M., Qing, Y., Cui, H.: EffiBench: benchmarking the efficiency of automatically generated code (2024)
Jain, N., et al.: LiveCodeBench: holistic and contamination free evaluation of large language models for code (2024)
Jimenez, C.E., et al.: SWE-bench: can language models resolve real-world GitHub issues? In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=VTF8yNQM66
Lavie, A., Agarwal, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments, pp. 228–231. Association for Computational Linguistics, USA (2007)
Li, B., et al.: DevBench: a comprehensive benchmark for software development (2024)
Li, J., Li, G., Zhang, X., Dong, Y., Jin, Z.: EvoCodeBench: an evolving code generation benchmark aligned with real-world code repositories (2024)
Liu, X., et al.: AgentBench: evaluating LLMs as agents (2023)
Liu, Z., et al.: AgentLite: a lightweight library for building and advancing task-oriented LLM agent system (2024)
Lozhkov, A., et al.: StarCoder 2 and The Stack v2: the next generation (2024)
Packer, C., et al.: MemGPT: towards LLMs as operating systems (2024)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation, pp. 311–318. Association for Computational Linguistics, USA (2002). https://doi.org/10.3115/1073083.1073135
Qian, C., et al.: Communicative agents for software development (2023)
Ren, S., et al.: CodeBLEU: a method for automatic evaluation of code synthesis (2020)
Ridnik, T., Kredo, D., Friedman, I.: Code generation with AlphaCodium: from prompt engineering to flow engineering (2024)
Romera-Paredes, B., et al.: Mathematical discoveries from program search with large language models. Nature 625(7995), 468–475 (2024)
Sai, A.B., Mohankumar, A.K., Khapra, M.M.: A survey of evaluation metrics used for NLG systems (2020)
Si, C., Zhang, Y., Yang, Z., Liu, R., Yang, D.: Design2Code: how far are we from automating front-end engineering? (2024)
Tao, W., Zhou, Y., Zhang, W., Cheng, Y.: MAGIS: LLM-based multi-agent framework for GitHub issue resolution (2024)
Towers, M., et al.: Gymnasium, March 2023. https://doi.org/10.5281/zenodo.8127026, https://zenodo.org/record/8127025
Wang, L., et al.: A survey on large language model based autonomous agents. Frontiers Comput. Sci. 18(6), March 2024. https://doi.org/10.1007/s11704-024-40231-1
Wu, Q., et al.: AutoGen: enabling next-gen LLM applications via multi-agent conversation (2023)
Xie, Y., Xie, A., Sheth, D., Liu, P., Fried, D., Rose, C.: CodeBenchGen: creating scalable execution-based code generation benchmarks (2024)
Yadav, A., Singh, M.: PythonSaga: redefining the benchmark to evaluate code generating LLM (2024)
Yang, H., Yue, S., He, Y.: Auto-GPT for online decision making: benchmarks and additional opinions (2023)
Yang, J., et al.: SWE-agent: agent computer interfaces enable software engineering language models (2024)
Zhang, F., et al.: RepoCoder: repository-level code completion through iterative retrieval and generation (2023)
Zhang, K., Li, J., Li, G., Shi, X., Jin, Z.: CodeAgent: enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges (2024)
Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: evaluating text generation with BERT (2020)
Zheng, Z., et al.: A survey of large language models for code: evolution, benchmarking, and future trends (2024)
Zhuo, T.Y.: ICE-Score: instructing large language models to evaluate code (2024)
Appendices
A Test Observability
A critical design decision for LLM-generated bug fixes is which tests the coding agent can access. All pre-existing tests that are run during regression testing can be made available to the LLM without hesitation, as they help the agent understand the function's context. However, the tests used to verify that the current issue is resolved should be excluded from the training data; otherwise the agent could simply memorize the test cases and their expected outcomes and thereby game the system. Excluding them also improves the agent's real-world applicability, since issues reported on platforms like GitHub typically provide only vague problem descriptions and, at most, stack traces.
Conversely, providing the agent with tests, potentially handcrafted, can help it accurately reproduce the issue and understand the bug's context. This is particularly valuable for complex issues where the agent may struggle to grasp the problem from the description alone.
To reconcile these conflicting objectives, we propose a hybrid approach in which the coding agent is equipped with a LeetCode-style test suite containing both training tests and evaluation tests. The agent can learn from the training tests while being forced to generalize to the unseen evaluation tests in order to pass them. Such a suite can be produced by a separate agent tasked with generating additional tests for the same issue, improving the comprehensiveness of the training data.
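The sketch below illustrates one way such a split could be implemented: the issue's tests are partitioned into a visible training subset shown to the agent and a hidden evaluation subset used only for scoring. The function and field names (`split_tests`, `TestSplit`) are hypothetical and do not reflect the SEGym implementation.

```python
# Hypothetical sketch of the LeetCode-style test split described above.
import random
from dataclasses import dataclass


@dataclass
class TestSplit:
    training: list[str]    # test ids shown to the coding agent
    evaluation: list[str]  # held-out test ids used only for scoring


def split_tests(issue_tests: list[str], train_fraction: float = 0.5,
                seed: int = 0) -> TestSplit:
    """Partition the issue's tests into visible training tests and hidden
    evaluation tests, so the agent must generalize instead of memorizing."""
    rng = random.Random(seed)
    shuffled = issue_tests[:]
    rng.shuffle(shuffled)
    # Keep at least one hidden evaluation test whenever possible.
    cut = min(len(shuffled) - 1, max(1, int(len(shuffled) * train_fraction)))
    return TestSplit(training=shuffled[:cut], evaluation=shuffled[cut:])
```

Regression tests from the existing test suite would remain fully visible; only the issue-specific tests are subject to this split.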
B Speed Evaluation
The reported runtimes cover the cumulative duration of generating a valid patch, including unsuccessful attempts and the non-LLM fuzzy matching of the generated code segments (Table 1).
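A minimal sketch of this measurement is shown below; `generate_patch` and `passes_tests` are hypothetical helpers standing in for the LLM-based patch generation (including fuzzy matching) and the test run, respectively.

```python
# Illustrative timing sketch: the reported runtime accumulates over all
# attempts, successful or not, including non-LLM post-processing such as
# fuzzy matching of the generated code segments.
import time


def timed_repair(issue, generate_patch, passes_tests, max_attempts: int = 5):
    start = time.perf_counter()
    for attempt in range(1, max_attempts + 1):
        patch = generate_patch(issue)   # LLM call plus fuzzy patch matching
        if passes_tests(patch):
            return patch, attempt, time.perf_counter() - start
    # No valid patch found; still report the total time spent.
    return None, max_attempts, time.perf_counter() - start
```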
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Stenzel, G. et al. (2025). SEGym: Optimizing Large Language Model Assisted Software Engineering Agents with Reinforcement Learning. In: Steffen, B. (eds) Bridging the Gap Between AI and Reality. AISoLA 2024. Lecture Notes in Computer Science, vol 15217. Springer, Cham. https://doi.org/10.1007/978-3-031-75434-0_8
DOI: https://doi.org/10.1007/978-3-031-75434-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-75433-3
Online ISBN: 978-3-031-75434-0