
Tool learning with large language models: a survey

  • Review Article
  • Published in: Frontiers of Computer Science

Abstract

Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems. Despite growing attention and rapid advancements in this field, the existing literature remains fragmented and lacks systematic organization, posing barriers to entry for newcomers. This gap motivates us to conduct a comprehensive survey of existing work on tool learning with LLMs. In this survey, we review the existing literature from two primary aspects: (1) why tool learning is beneficial and (2) how tool learning is implemented, enabling a comprehensive understanding of tool learning with LLMs. We first explore the “why” by reviewing both the benefits of tool integration and the inherent benefits of the tool learning paradigm from six specific aspects. In terms of “how”, we systematically review the literature according to a taxonomy of four key stages in the tool learning workflow: task planning, tool selection, tool calling, and response generation. Additionally, we provide a detailed summary of existing benchmarks and evaluation methods, categorizing them according to their relevance to these stages. Finally, we discuss current challenges and outline potential future directions, aiming to inspire both researchers and industrial developers to further explore this emerging and promising area.
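To make the four-stage workflow named above concrete, the sketch below is a minimal, hypothetical Python rendering of a tool learning loop. Every name in it (Tool, plan_tasks, select_tool, call_tool, generate_response) is illustrative rather than an API from any surveyed system, and the simple heuristics stand in for steps that a real system would delegate to an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tool:
    """A callable tool with a natural-language description (hypothetical schema)."""
    name: str
    description: str
    func: Callable[[str], str]


# A toy tool registry; real systems index thousands of API descriptions.
TOOLS: Dict[str, Tool] = {
    "calculator": Tool("calculator", "evaluate arithmetic expressions",
                       lambda q: str(eval(q, {"__builtins__": {}}))),
    "search": Tool("search", "look up a fact on the web",
                   lambda q: f"(stub) top result for '{q}'"),
}


def plan_tasks(query: str) -> List[str]:
    """Stage 1, task planning: decompose the query into sub-tasks.
    A real system would prompt an LLM; splitting on ' and ' is a stand-in."""
    return [part.strip() for part in query.split(" and ")]


def select_tool(subtask: str) -> Tool:
    """Stage 2, tool selection: pick the most relevant tool, typically by
    retrieval over tool descriptions. Here: a crude keyword heuristic."""
    return TOOLS["calculator"] if any(c.isdigit() for c in subtask) else TOOLS["search"]


def call_tool(tool: Tool, subtask: str) -> str:
    """Stage 3, tool calling: format the parameters and invoke the tool."""
    return tool.func(subtask)


def generate_response(query: str, observations: List[str]) -> str:
    """Stage 4, response generation: synthesize tool outputs into an answer.
    A real system would condition an LLM on the observations."""
    return f"Answer to '{query}': " + "; ".join(observations)


if __name__ == "__main__":
    query = "2 * 21 and capital of France"
    obs = [call_tool(select_tool(t), t) for t in plan_tasks(query)]
    print(generate_response(query, obs))
```

In practice each stage is a research problem in its own right: planning may decompose queries iteratively or with search, selection is often a retrieval problem over large tool collections, and calling must handle parameter extraction and execution errors before a response can be generated.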



Acknowledgements

This work was funded by the National Key R&D Program of China (2023YFA1008704), the National Natural Science Foundation of China (Grant No. 62377044), Beijing Key Laboratory of Big Data Management and Analysis Methods, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, funds for building world-class universities (disciplines) of Renmin University of China, and PCC@RUC. The authors would like to extend their sincere gratitude to Yankai Lin for his constructive feedback throughout the development of this work.

Author information


Corresponding author

Correspondence to Jun Xu.

Ethics declarations

Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Changle Qu is currently pursuing a PhD at the Gaoling School of Artificial Intelligence, Renmin University of China, China. His current research interests mainly include tool learning with large language models and information retrieval.

Sunhao Dai is a PhD candidate at Gaoling School of Artificial Intelligence, Renmin University of China, China. His current research interests lie in recommender systems and information retrieval. He has published several papers in top-tier conferences such as KDD, SIGIR, ICDE, CIKM, and RecSys.

Xiaochi Wei received his PhD from Beijing Institute of Technology, China in 2018, under the supervision of Prof. Heyan Huang. He visited the National University of Singapore, Singapore from 2015 to 2016, under the supervision of Prof. Tat-Seng Chua. He is a senior R&D engineer at Baidu Inc. His research interests include question answering, multimedia information retrieval, and recommender systems. He has served as a PC member for several conferences, e.g., AAAI, IJCAI, ACL, and EMNLP.

Hengyi Cai received his PhD from the Institute of Computing Technology, Chinese Academy of Sciences (Outstanding Graduate), China in 2021. He joined JD’s doctoral management trainee program in the summer of 2021. Previously, he was a research intern at Baidu’s Search Science Team in 2020, under the supervision of Dr. Dawei Yin. His research interests include dialogue systems, question answering, and information retrieval. He has served or is serving as a PC member for top-tier conferences including ACL, EMNLP, KDD, NeurIPS, and SIGIR.

Shuaiqiang Wang received the BSc and PhD degrees in computer science from Shandong University, China in 2004 and 2009, respectively. He is currently a principal algorithm engineer at Baidu Inc. Previously, he was a research scientist with JD.com. Before that, he was an assistant professor with the University of Manchester, UK and the University of Jyväskylä, Finland. In recent years, he has served as a senior PC member of IJCAI and a PC member of WWW, SIGIR, and WSDM. He is broadly interested in several research areas including information retrieval, recommender systems, and data mining.

Dawei Yin received his PhD from Lehigh University, USA in 2013. He is a senior director of engineering at Baidu Inc., where he manages the search science team. Previously, he was a senior director managing the recommendation engineering team at JD.com between 2016 and 2019. Prior to JD.com, he was a senior research manager at Yahoo Labs, leading the relevance science team and in charge of core search relevance for Yahoo Search. His research interests include data mining, applied machine learning, information retrieval, and recommender systems. He has published more than 100 research papers in premium conferences and journals, and is the recipient of the WSDM 2016 Best Paper Award, the KDD 2016 Best Paper Award, and the WSDM 2018 Best Student Paper Award.

Jun Xu is a professor with the Gaoling School of Artificial Intelligence, Renmin University of China, China. His research interests focus on learning to rank and semantic matching in web search. He has served or is serving as an SPC member for SIGIR, WWW, and AAAI, an editorial board member for JASIST, and an associate editor for ACM TOIS. He won the Test of Time Award Honorable Mention at SIGIR (2019), the Best Paper Award at AIRS (2010), and the Best Paper Runner-up at CIKM (2017).

Ji-Rong Wen is a professor at Renmin University of China (RUC), China. He is also the dean of the School of Information and the executive dean of the Gaoling School of Artificial Intelligence at RUC. His main research interests include information retrieval, data mining, and machine learning. He was a senior researcher and group manager of the Web Search and Mining Group at Microsoft Research Asia (MSRA).



About this article


Cite this article

Qu, C., Dai, S., Wei, X. et al. Tool learning with large language models: a survey. Front. Comput. Sci. 19, 198343 (2025). https://doi.org/10.1007/s11704-024-40678-2
