Abstract
This paper presents LLaVA-Plus (Large Language and Vision Assistants that Plug and Learn to Use Skills), a general-purpose multimodal assistant trained end-to-end to systematically expand the capabilities of large multimodal models (LMMs). LLaVA-Plus maintains a skill repository containing a wide range of vision and vision-language pre-trained models (tools); given users’ multimodal inputs, it activates the relevant tools and composes their execution results on the fly to fulfill many real-world tasks. To acquire the ability to use tools, LLaVA-Plus is trained on multimodal instruction-following data that we have curated, covering tool use for visual understanding, generation, external knowledge retrieval, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits many new ones. Compared with tool-augmented LLMs, LLaVA-Plus is distinct in that the image query is directly grounded in, and actively engaged throughout, the entire human-AI interaction session, significantly improving tool use performance and enabling new scenarios.
S. Liu, H. Liu, H. Zhang, F. Li and X. Zou—Work performed during an internship at Microsoft.
C. Li—Project Lead.
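To make the workflow described in the abstract concrete, below is a minimal Python sketch of the plan-execute-compose loop: the assistant selects a tool from a skill repository, runs it on the user's multimodal input, and grounds its reply in the tool's output. All names here (SKILL_REPOSITORY, plan_tool_call, compose_answer) and the stubbed tools are illustrative assumptions, not the paper's actual implementation or API.

```python
# Illustrative sketch of the tool-use loop described in the abstract.
# The repository contents and planning/composition logic are placeholders,
# not the authors' actual models or code.
from typing import Any, Callable, Dict

# A "skill repository": tool name -> callable over the (image, query) context.
SKILL_REPOSITORY: Dict[str, Callable[[str, str], Any]] = {
    "detection": lambda image, query: [{"label": "dog", "box": [10, 20, 200, 220]}],
    "caption": lambda image, query: "a dog lying on a couch",
}

def plan_tool_call(image: str, query: str) -> str:
    """Stand-in for the LMM's planning step: choose a skill for the request."""
    return "detection" if "where" in query.lower() else "caption"

def compose_answer(query: str, tool_name: str, tool_output: Any) -> str:
    """Stand-in for the LMM composing the tool result into a reply."""
    return f"[{tool_name}] result for '{query}': {tool_output}"

def answer(image: str, query: str) -> str:
    tool_name = plan_tool_call(image, query)                   # 1. decide which skill to invoke
    tool_output = SKILL_REPOSITORY[tool_name](image, query)    # 2. execute the tool on the input
    return compose_answer(query, tool_name, tool_output)       # 3. ground the reply in its output

print(answer("photo.jpg", "Where is the dog in this image?"))
```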
Notes
1. The term “tools” in this paper is used to describe the APIs or pre-built models that the LMM interfaces with.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liu, S. et al. (2025). LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_8
DOI: https://doi.org/10.1007/978-3-031-72970-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72969-0
Online ISBN: 978-3-031-72970-6