Abstract
Current software development agents built on large language models (LLMs) are often defined heuristically, which can limit their flexibility and effectiveness. Moreover, the entry barriers for new researchers in this field are high, largely because of the complex infrastructure required to develop and optimize such agents. This paper proposes a new approach: modeling LLM-based software development agents as a partially observable Markov decision process (POMDP) to enable data-driven optimization. To support this approach, we introduce SEGym, a framework based on the Gym interface for reinforcement learning agents. SEGym simplifies the setup of optimization experiments for software development agents within the POMDP framework, lowering the barrier for researchers to enter the field.
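To make the POMDP framing concrete, the following minimal sketch shows what a Gym-style environment for issue fixing could look like: the hidden state (the full repository and the held-out tests) is only partially observable through the issue text and test feedback, actions are candidate patches, and the reward is the fraction of hidden tests that pass. This is an illustrative toy under assumed names (`RepairEnv`, `hidden_tests`), not the actual SEGym API.

```python
# Illustrative sketch only: RepairEnv and its fields are hypothetical
# and do not correspond to the actual SEGym interface.
import gymnasium as gym
from gymnasium import spaces


class RepairEnv(gym.Env):
    """Toy POMDP view of issue fixing: the true state (repository + hidden
    tests) is observed only through the issue text and test output."""

    def __init__(self, issue_text: str, hidden_tests):
        super().__init__()
        self.issue_text = issue_text
        self.hidden_tests = hidden_tests  # callables mapping a patch to True/False
        # Observations and actions are plain text: issue/test feedback and patches.
        self.observation_space = spaces.Text(max_length=10_000)
        self.action_space = spaces.Text(max_length=10_000)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.issue_text, {}

    def step(self, action: str):
        # `action` is a candidate patch produced by an LLM-based policy.
        passed = sum(test(action) for test in self.hidden_tests)
        reward = passed / len(self.hidden_tests)        # fraction of tests passing
        terminated = passed == len(self.hidden_tests)   # episode ends when all pass
        observation = f"{passed}/{len(self.hidden_tests)} tests passed"
        return observation, reward, terminated, False, {}
```

Any data-driven optimizer, whether classical reinforcement learning or prompt search, can then interact with such an environment purely through `reset()` and `step()`.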
Notes
- 10. It is not clear whether the sets of possible inputs and outputs of an LLM are in fact equal, but for the purposes of this paper, edge cases in which certain messages cannot be elicited from an LLM do not appear relevant.
- 11. Note that the results presented here are meant for illustration. They are not statistically significant, as we only run a single instance of our setup. We plan to perform an in-depth evaluation of SEGym on SWE-bench (lite) in future work.
- 12. Code and dataset are available online: https://github.com/kyrillschmid/SEGym.
References
Belzner, L., Gabor, T., Wirsing, M.: Large language model assisted software engineering: prospects, challenges, and a case study. In: International Conference on Bridging the Gap between AI and Reality, pp. 355–374. Springer (2023)
Cassano, F., et al.: MultiPL-E: a scalable and extensible approach to benchmarking neural code generation (2022)
Chang, Y., et al.: A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 15(3), March 2024. https://doi.org/10.1145/3641289
Chen, M., et al.: Evaluating large language models trained on code (2021)
Cheng, Y., et al.: Exploring large language model based intelligent agents: definitions, methods, and prospects (2024)
Du, M., Luu, A.T., Ji, B., Ng, S.K.: Mercury: an efficiency benchmark for LLM code synthesis (2024)
Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.: Promptbreeder: self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797 (2023)
Gioacchini, L., et al.: AgentQuest: a modular benchmark framework to measure progress and improve LLM agents (2024)
Guo, T., et al.: Large language model based multi-agents: a survey of progress and challenges (2024)
Guo, Z., et al.: Evaluating large language models: a comprehensive survey (2023)
Hendrycks, D., et al.: Measuring coding challenge competence with APPS (2021)
Hong, S., et al.: MetaGPT: meta programming for a multi-agent collaborative framework (2023)
Hou, X., et al.: Large language models for software engineering: a systematic literature review (2024)
Huang, D., Zhang, J.M., Luck, M., Bu, Q., Qing, Y., Cui, H.: AgentCoder: multi-agent-based code generation with iterative testing and optimisation (2024)
Huang, D., Zhang, J.M., Qing, Y., Cui, H.: EffiBench: benchmarking the efficiency of automatically generated code (2024)
Jain, N., et al.: LiveCodeBench: holistic and contamination free evaluation of large language models for code (2024)
Jimenez, C.E., et al.: SWE-bench: can language models resolve real-world GitHub issues? In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=VTF8yNQM66
Lavie, A., Agarwal, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments, pp. 228–231. Association for Computational Linguistics, USA (2007)
Li, B., et al.: DevBench: a comprehensive benchmark for software development (2024)
Li, J., Li, G., Zhang, X., Dong, Y., Jin, Z.: EvoCodeBench: an evolving code generation benchmark aligned with real-world code repositories (2024)
Liu, X., et al.: AgentBench: evaluating LLMs as agents (2023)
Liu, Z., et al.: AgentLite: a lightweight library for building and advancing task-oriented LLM agent system (2024)
Lozhkov, A., et al.: StarCoder 2 and The Stack v2: the next generation (2024)
Packer, C., et al.: MemGPT: towards LLMs as operating systems (2024)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation, pp. 311–318. Association for Computational Linguistics, USA (2002). https://doi.org/10.3115/1073083.1073135
Qian, C., et al.: Communicative agents for software development (2023)
Ren, S., et al.: CodeBLEU: a method for automatic evaluation of code synthesis (2020)
Ridnik, T., Kredo, D., Friedman, I.: Code generation with AlphaCodium: from prompt engineering to flow engineering (2024)
Romera-Paredes, B., et al.: Mathematical discoveries from program search with large language models. Nature 625(7995), 468–475 (2024)
Sai, A.B., Mohankumar, A.K., Khapra, M.M.: A survey of evaluation metrics used for NLG systems (2020)
Si, C., Zhang, Y., Yang, Z., Liu, R., Yang, D.: Design2Code: how far are we from automating front-end engineering? (2024)
Tao, W., Zhou, Y., Zhang, W., Cheng, Y.: MAGIS: LLM-based multi-agent framework for GitHub issue resolution (2024)
Towers, M., et al.: Gymnasium, March 2023. https://doi.org/10.5281/zenodo.8127026, https://zenodo.org/record/8127025
Wang, L., et al.: A survey on large language model based autonomous agents. Frontiers Comput. Sci. 18(6), March 2024. https://doi.org/10.1007/s11704-024-40231-1
Wu, Q., et al.: AutoGen: enabling next-gen LLM applications via multi-agent conversation (2023)
Xie, Y., Xie, A., Sheth, D., Liu, P., Fried, D., Rose, C.: CodeBenchGen: creating scalable execution-based code generation benchmarks (2024)
Yadav, A., Singh, M.: PythonSaga: redefining the benchmark to evaluate code generating LLM (2024)
Yang, H., Yue, S., He, Y.: Auto-GPT for online decision making: benchmarks and additional opinions (2023)
Yang, J., et al.: SWE-agent: agent computer interfaces enable software engineering language models (2024)
Zhang, F., et al.: RepoCoder: repository-level code completion through iterative retrieval and generation (2023)
Zhang, K., Li, J., Li, G., Shi, X., Jin, Z.: CodeAgent: enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges (2024)
Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: evaluating text generation with BERT (2020)
Zheng, Z., et al.: A survey of large language models for code: evolution, benchmarking, and future trends (2024)
Zhuo, T.Y.: ICE-Score: instructing large language models to evaluate code (2024)
Appendices
A Test Observability
A critical design decision for LLM-generated bug fixes is which tests the coding agent can access. All pre-existing tests that are run during regression testing can be made available to the LLM without hesitation, as they help the agent understand the function's context. However, the tests used to verify that the current issue is resolved should be excluded from the training data; otherwise the agent could simply memorize the test cases and their expected outcomes and thereby game the system. Excluding them also improves the agent's real-world applicability, since issues reported on platforms like GitHub typically provide only vague problem descriptions and, at most, stack traces.
Conversely, providing the agent with tests, potentially handcrafted, can help it accurately reproduce the issue and understand the bug's context. This is particularly valuable for complex issues where the agent may struggle to grasp the problem from the description alone.
To reconcile these conflicting objectives, we propose a hybrid approach in which the coding agent is equipped with a LeetCode-style test suite containing both training tests and evaluation tests. The agent can learn from the training tests while being forced to generalize to the unseen evaluation tests in order to pass them. Such a suite can be produced by a separate agent tasked with generating additional tests for the same issue, improving the comprehensiveness of the training data.
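The sketch below illustrates one way such a split could be implemented: the issue's tests are partitioned into a visible training subset shown to the agent and a hidden evaluation subset used only for scoring. The function and field names (`split_tests`, `TestSplit`) are hypothetical and do not reflect the SEGym implementation.

```python
# Hypothetical sketch of the LeetCode-style test split described above.
import random
from dataclasses import dataclass


@dataclass
class TestSplit:
    training: list[str]    # test ids shown to the coding agent
    evaluation: list[str]  # held-out test ids used only for scoring


def split_tests(issue_tests: list[str], train_fraction: float = 0.5,
                seed: int = 0) -> TestSplit:
    """Partition the issue's tests into visible training tests and hidden
    evaluation tests, so the agent must generalize instead of memorizing."""
    rng = random.Random(seed)
    shuffled = issue_tests[:]
    rng.shuffle(shuffled)
    # Keep at least one hidden evaluation test whenever possible.
    cut = min(len(shuffled) - 1, max(1, int(len(shuffled) * train_fraction)))
    return TestSplit(training=shuffled[:cut], evaluation=shuffled[cut:])
```

Regression tests from the existing test suite would remain fully visible; only the issue-specific tests are subject to this split.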
B Speed Evaluation
The reported runtimes cover the cumulative duration of generating a valid patch, including unsuccessful attempts and the non-LLM fuzzy matching of the generated code segments (Table 1).
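A minimal sketch of this measurement is shown below; `generate_patch` and `passes_tests` are hypothetical helpers standing in for the LLM-based patch generation (including fuzzy matching) and the test run, respectively.

```python
# Illustrative timing sketch: the reported runtime accumulates over all
# attempts, successful or not, including non-LLM post-processing such as
# fuzzy matching of the generated code segments.
import time


def timed_repair(issue, generate_patch, passes_tests, max_attempts: int = 5):
    start = time.perf_counter()
    for attempt in range(1, max_attempts + 1):
        patch = generate_patch(issue)   # LLM call plus fuzzy patch matching
        if passes_tests(patch):
            return patch, attempt, time.perf_counter() - start
    # No valid patch found; still report the total time spent.
    return None, max_attempts, time.perf_counter() - start
```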
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Stenzel, G. et al. (2025). SEGym: Optimizing Large Language Model Assisted Software Engineering Agents with Reinforcement Learning. In: Steffen, B. (eds) Bridging the Gap Between AI and Reality. AISoLA 2024. Lecture Notes in Computer Science, vol 15217. Springer, Cham. https://doi.org/10.1007/978-3-031-75434-0_8
DOI: https://doi.org/10.1007/978-3-031-75434-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-75433-3
Online ISBN: 978-3-031-75434-0