Abstract
This paper presents LLaVA-Plus (Large Language and Vision Assistants that Plug and Learn to Use Skills), a general-purpose multimodal assistant trained end-to-end to systematically expand the capabilities of large multimodal models (LMMs). LLaVA-Plus maintains a skill repository containing a wide range of vision and vision-language pre-trained models (tools); given users’ multimodal inputs, it activates the relevant tools and composes their execution results on the fly to fulfill many real-world tasks. To acquire the ability to use tools, LLaVA-Plus is trained on multimodal instruction-following data that we have curated, covering tool use for visual understanding, generation, external knowledge retrieval, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits many new ones. Compared with tool-augmented LLMs, LLaVA-Plus is distinct in that the image query is directly grounded in, and actively engaged throughout, the entire human-AI interaction session, significantly improving tool use performance and enabling new scenarios.
S. Liu, H. Liu, H. Zhang, F. Li and X. Zou—Work performed during an internship at Microsoft.
C. Li—Project Lead.
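To make the workflow described in the abstract concrete, below is a minimal Python sketch of the plan-execute-compose loop: the assistant selects a tool from a skill repository, runs it on the user's multimodal input, and grounds its reply in the tool's output. All names here (SKILL_REPOSITORY, plan_tool_call, compose_answer) and the stubbed tools are illustrative assumptions, not the paper's actual implementation or API.

```python
# Illustrative sketch of the tool-use loop described in the abstract.
# The repository contents and planning/composition logic are placeholders,
# not the authors' actual models or code.
from typing import Any, Callable, Dict

# A "skill repository": tool name -> callable over the (image, query) context.
SKILL_REPOSITORY: Dict[str, Callable[[str, str], Any]] = {
    "detection": lambda image, query: [{"label": "dog", "box": [10, 20, 200, 220]}],
    "caption": lambda image, query: "a dog lying on a couch",
}

def plan_tool_call(image: str, query: str) -> str:
    """Stand-in for the LMM's planning step: choose a skill for the request."""
    return "detection" if "where" in query.lower() else "caption"

def compose_answer(query: str, tool_name: str, tool_output: Any) -> str:
    """Stand-in for the LMM composing the tool result into a reply."""
    return f"[{tool_name}] result for '{query}': {tool_output}"

def answer(image: str, query: str) -> str:
    tool_name = plan_tool_call(image, query)                   # 1. decide which skill to invoke
    tool_output = SKILL_REPOSITORY[tool_name](image, query)    # 2. execute the tool on the input
    return compose_answer(query, tool_name, tool_output)       # 3. ground the reply in its output

print(answer("photo.jpg", "Where is the dog in this image?"))
```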
Notes
1. The term “tools” in this paper is used to describe the APIs or pre-built models that the LMM interfaces with.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liu, S. et al. (2025). LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_8
DOI: https://doi.org/10.1007/978-3-031-72970-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72969-0
Online ISBN: 978-3-031-72970-6