Abstract
This paper benchmarks three multi-agent systems powered by large language models, presenting a comparative analysis of AutoGen, CrewAI, and TaskWeaver. Large language models have emerged as powerful tools able to assist users in various areas, and their integration into multi-agent systems increases their potential for collaborative problem-solving. The study uses a machine learning code generation task as a case study to evaluate the frameworks' performance. Each solution is asked to generate code for energy forecasting models from the same base dataset. The generated models are then tested on a new dataset, and their performance is measured using the root mean square error. All three solutions produced results with multiple large language models. The best result was achieved by TaskWeaver using GPT-3.5, with a root mean square error of 25.04.
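The evaluation metric named in the abstract, the root mean square error, can be illustrated with a minimal sketch. The function name `rmse` and the sample values below are hypothetical and chosen only for illustration; they are not the dataset, the generated forecasting code, or the results reported in this paper.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Return the root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    # Square each forecast error, average the squares, then take the root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical held-out observations and forecasts (illustrative values only).
held_out_actual = [100.0, 120.0, 90.0, 110.0]
held_out_forecast = [103.0, 116.0, 90.0, 110.0]

score = rmse(held_out_actual, held_out_forecast)
print(f"RMSE on held-out data: {score:.2f}")
```

A lower value indicates forecasts closer to the observed data. The error is expressed in the same unit as the forecast quantity, so scores are comparable only when computed on the same test dataset.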
Acknowledgments
This work has been supported by the European Union under the Next Generation EU, through a grant of the Portuguese Republic’s Recovery and Resilience Plan (PRR) Partnership Agreement, within the scope of the project PRODUTECH R3 – “Agenda Mobilizadora da Fileira das Tecnologias de Produção para a Reindustrialização”, Total project investment: 166.988.013,71 Euros; Total Grant: 97.111.730,27 Euros. The authors acknowledge the work facilities and equipment provided by GECAD research center (UIDB/00760/2020), DOI: https://doi.org/10.54499/UIDB/00760/2020 to the project team.
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Barbarroxa, R., Gomes, L., Vale, Z. (2025). Benchmarking Large Language Models for Multi-agent Systems: A Comparative Analysis of AutoGen, CrewAI, and TaskWeaver. In: Mathieu, P., De la Prieta, F. (eds) Advances in Practical Applications of Agents, Multi-Agent Systems, and Digital Twins: The PAAMS Collection. PAAMS 2024. Lecture Notes in Computer Science, vol 15157. Springer, Cham. https://doi.org/10.1007/978-3-031-70415-4_4
DOI: https://doi.org/10.1007/978-3-031-70415-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70414-7
Online ISBN: 978-3-031-70415-4
eBook Packages: Computer Science (R0)