
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15105)


Abstract

This paper presents LLaVA-Plus (Large Language and Vision Assistants that Plug and Learn to Use Skills), a general-purpose multimodal assistant trained with an end-to-end approach that systematically expands the capabilities of large multimodal models (LMMs). LLaVA-Plus maintains a skill repository containing a wide range of vision and vision-language pre-trained models (tools), and it can activate relevant tools in response to users’ multimodal inputs and compose their execution results on the fly to fulfill many real-world tasks. To acquire the ability to use tools, LLaVA-Plus is trained on multimodal instruction-following data that we have curated. The training data covers many tool-use examples spanning visual understanding, generation, external knowledge retrieval, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits many new ones. Compared with tool-augmented LLMs, LLaVA-Plus is distinct in that the image query is directly grounded and actively engaged throughout entire human-AI interaction sessions, significantly improving tool-use performance and enabling new scenarios.
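To make the tool-use workflow described in the abstract concrete, below is a minimal Python sketch of a skill repository and a single assistant turn. The SkillRepository, Tool, and lmm.plan/lmm.respond names, as well as the dictionary-based plan format, are illustrative assumptions, not the paper's actual interface or prompt format.

    from dataclasses import dataclass
    from typing import Any, Callable, Dict

    # Hypothetical skill repository: each "tool" wraps a pre-trained vision or
    # vision-language model (or an external API) behind a common callable.
    @dataclass
    class Tool:
        name: str
        description: str
        run: Callable[..., Any]

    class SkillRepository:
        def __init__(self) -> None:
            self._tools: Dict[str, Tool] = {}

        def register(self, tool: Tool) -> None:
            self._tools[tool.name] = tool

        def invoke(self, name: str, **kwargs: Any) -> Any:
            # Execute the requested tool and return its raw output so the LMM
            # can compose it into the final reply.
            return self._tools[name].run(**kwargs)

    def assistant_turn(lmm: Any, skills: SkillRepository, image: Any, user_text: str) -> str:
        """One dialogue turn: the LMM may request a tool, observe its output,
        and then answer while staying grounded in the original image query."""
        # Assumed LMM interface: plan() returns either {"tool": ..., "args": {...}}
        # or {"answer": ...}; respond() composes a reply from the tool output.
        plan = lmm.plan(image=image, prompt=user_text)
        if "tool" in plan:
            result = skills.invoke(plan["tool"], image=image, **plan["args"])
            return lmm.respond(image=image, prompt=user_text, tool_output=result)
        return plan["answer"]

In this sketch the repository is just a lookup table; the substantive behavior lies in the LMM's decision of when to call a tool and how to fold its output back into the conversation.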

S. Liu, H. Liu, H. Zhang, F. Li and X. Zou—Work performed during an internship at Microsoft.

C. Li—Project Lead.


Notes

  1. The term “tools” in this paper is used to describe the APIs or pre-built models that the LMM interfaces with.
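As a purely illustrative companion to this footnote, the snippet below shows how a pre-built model or API could be exposed to the assistant as a named callable; the registry and the run_ocr placeholder are hypothetical, not part of the paper.

    from typing import Any, Callable, Dict, List

    # Hypothetical registry: a "tool" is simply a named wrapper around a
    # pre-built model or an external API that the LMM can ask to run.
    TOOLS: Dict[str, Callable[..., Any]] = {}

    def register_tool(name: str, fn: Callable[..., Any]) -> None:
        TOOLS[name] = fn

    def run_ocr(image_path: str) -> List[str]:
        # Placeholder body: a real skill would invoke a pre-built OCR model here.
        return []

    register_tool("ocr", run_ocr)
    print(TOOLS["ocr"]("receipt.png"))  # -> []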


Author information


Corresponding author

Correspondence to Shilong Liu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 17683 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, S. et al. (2025). LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_8


  • DOI: https://doi.org/10.1007/978-3-031-72970-6_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72969-0

  • Online ISBN: 978-3-031-72970-6

  • eBook Packages: Computer Science, Computer Science (R0)
