SEGym: Optimizing Large Language Model Assisted Software Engineering Agents with Reinforcement Learning

  • Conference paper
  • In: Bridging the Gap Between AI and Reality (AISoLA 2024)

Abstract

Current software development agents based on large language models (LLMs) are often defined using heuristic methods, which can limit their flexibility and effectiveness. Moreover, the entry barriers for new researchers in this field are high, largely due to the complex infrastructure required to develop and optimize these agents. This paper proposes a new approach: modeling LLM-based software development agents as partially observable Markov decision processes (POMDPs) to enable data-driven optimization. To support this approach, we introduce SEGym, a framework based on the Gym interface for reinforcement learning agents. SEGym simplifies the setup of optimization experiments for software development agents within the POMDP framework, making it easier for researchers to enter the field.
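To make the POMDP framing concrete, here is a minimal sketch of how a patch-generation task might be exposed through the Gym interface. The class name `PatchEnv`, the string-valued observation and action spaces, and the pass-rate reward are illustrative assumptions for this sketch, not SEGym's actual API.

```python
# Minimal, hypothetical Gym-style environment for issue resolution.
# Names, spaces, and the reward signal are assumptions for illustration only.
import gymnasium as gym
from gymnasium import spaces


class PatchEnv(gym.Env):
    """One episode = one issue; one action = one candidate patch (a diff string)."""

    def __init__(self, issue_text, run_tests):
        super().__init__()
        self.issue_text = issue_text    # natural-language problem description
        self.run_tests = run_tests      # callable: patch string -> fraction of tests passing
        self.observation_space = spaces.Text(max_length=100_000)
        self.action_space = spaces.Text(max_length=100_000)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # The agent observes only the issue text (and later test feedback),
        # not the full repository state: a partial observation.
        return self.issue_text, {}

    def step(self, action):
        pass_rate = self.run_tests(action)   # apply the patch and run the test suite
        terminated = pass_rate == 1.0        # every test passes: the episode ends
        observation = f"{self.issue_text}\n\nTest pass rate: {pass_rate:.2f}"
        return observation, pass_rate, terminated, False, {}
```

An LLM-driven policy then maps the observation string to a patch string, and the pieces that shape this mapping, such as prompt templates or sampling parameters, become the quantities a reinforcement learning procedure can optimize.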

Notes

  1. https://github.com/gpt-engineer-org/gpt-engineer.

  2. https://github.com/smol-ai/developer.

  3. https://github.com/Pythagora-io/gpt-pilot.

  4. https://github.com/wasp-lang/wasp.

  5. https://github.com/melih-unsal/DemoGPT.

  6. https://github.com/stitionai/devika.

  7. https://github.com/OpenDevin/OpenDevin.

  8. https://github.com/rjmacarthy/twinny.

  9. https://github.com/paul-gauthier/aider.

  10. It is not clear whether the sets of possible inputs and outputs of an LLM are in fact equal, but for the purposes of this paper such edge cases, in which certain messages might be impossible to elicit from an LLM, do not appear relevant.

  11. Note that the results presented here are meant for illustration. They are not statistically significant, as we ran only a single instance of our setup. We plan to perform an in-depth evaluation of SEGym on SWE-bench (lite) in future work.

  12. Code and dataset are available online: https://github.com/kyrillschmid/SEGym.

Author information

Correspondence to Martin Wirsing.

Appendices

A Test Observability

A critical decision in the context of LLM-generated bug fixes is which tests are visible to the coding agent. All previously created tests, i.e. those executed during regression testing, can be made available to the LLM without reservation, as they may help the agent understand the function's context. However, the tests used to verify that the current issue has been resolved should be excluded from the training data. This prevents the agent from simply memorizing the test cases and their expected outcomes and thereby gaming the system. It also improves the agent's real-world applicability, since issues reported on platforms like GitHub typically provide only vague problem descriptions and, if anything, stack traces.

Conversely, providing potentially handcrafted tests can help the agent reproduce the issue accurately and understand the bug's context. This is particularly valuable for complex issues, where the agent may struggle to grasp the problem from the description alone.

To reconcile these conflicting objectives, we propose a hybrid approach in which the coding agent is equipped with a LeetCode-style test suite containing both training tests and evaluation tests. The agent can learn from the training tests while being forced to generalize to unseen tests in order to pass the evaluation tests. Such a suite can be generated by a separate agent tasked with writing additional tests for the same issue, thereby enriching the training data; the sketch below illustrates one possible split.
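The following is a minimal, hypothetical sketch of such a train/evaluation split. The `TestSplit` container, the pytest-style test identifiers, and the fifty-fifty split ratio are illustrative assumptions, not part of SEGym's actual data model.

```python
# Illustrative sketch of a train/evaluation split over issue-specific tests.
# All names, identifiers, and the split ratio are assumptions for this example.
import random
from dataclasses import dataclass


@dataclass
class TestSplit:
    training: list[str]    # shown to the agent (may be handcrafted or generated)
    evaluation: list[str]  # held out; used only to score the final patch


def split_tests(issue_tests: list[str], train_fraction: float = 0.5,
                seed: int = 0) -> TestSplit:
    """Partition issue-specific tests so the agent cannot memorize the
    exact cases that decide success."""
    rng = random.Random(seed)
    shuffled = list(issue_tests)
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * train_fraction))
    return TestSplit(training=shuffled[:cut], evaluation=shuffled[cut:])


# Regression tests stay fully visible; only the issue-specific tests are split.
split = split_tests([
    "tests/test_parser.py::test_handles_empty_input",
    "tests/test_parser.py::test_unicode_escape",
    "tests/test_parser.py::test_nested_quotes",
    "tests/test_parser.py::test_crlf_line_endings",
])
prompt_tests = split.training    # included in the agent's observation
score_tests = split.evaluation   # run only when grading the patch
```

Splitting deterministically by seed keeps episodes reproducible while still hiding the evaluation tests from the agent.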

B Speed Evaluation

The runtimes cover the cumulative duration of generating a valid patch, including both unsuccessful attempts and the non-LLM fuzzy matching of the generated code segments (Table 1).

Table 1. Model speed comparison in seconds (wall time)
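As a reading aid, the sketch below shows how such a cumulative wall-time measurement could be taken; `generate_patch`, `fuzzy_apply`, and `is_valid` are hypothetical helpers standing in for the corresponding components, not SEGym's actual interfaces.

```python
# Hypothetical sketch of the cumulative wall-time measurement described above:
# the clock runs across all attempts, including failed ones and the non-LLM
# fuzzy-matching step, and stops once a valid patch is produced (or we give up).
import time


def timed_patch_generation(issue, generate_patch, fuzzy_apply, is_valid,
                           max_attempts=5):
    """Return (patch or None, cumulative wall time in seconds)."""
    start = time.perf_counter()
    for _ in range(max_attempts):
        raw_edit = generate_patch(issue)   # LLM call; may yield an unusable edit
        patch = fuzzy_apply(raw_edit)      # non-LLM fuzzy matching of code segments
        if patch is not None and is_valid(patch):
            break
    else:
        patch = None                       # all attempts failed
    return patch, time.perf_counter() - start  # failed attempts count toward the total
```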

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Stenzel, G. et al. (2025). SEGym: Optimizing Large Language Model Assisted Software Engineering Agents with Reinforcement Learning. In: Steffen, B. (eds) Bridging the Gap Between AI and Reality. AISoLA 2024. Lecture Notes in Computer Science, vol 15217. Springer, Cham. https://doi.org/10.1007/978-3-031-75434-0_8

  • DOI: https://doi.org/10.1007/978-3-031-75434-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-75433-3

  • Online ISBN: 978-3-031-75434-0
