
MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identify several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published multimodal pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has a substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models, including both dense variants up to 30B and mixture-of-experts (MoE) variants up to 64B, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
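
To make the data-mixing lesson concrete, the following is a minimal illustrative sketch, not the paper's training code: it draws pre-training examples from the three source types named in the abstract (image-caption pairs, interleaved image-text documents, and text-only documents) in fixed proportions. The function name mix_datasets, the source names, the toy streams, and the mixing weights are placeholders chosen for illustration only.

    # Minimal illustrative sketch (not the paper's implementation): sample
    # pre-training examples from the three data sources named in the abstract
    # according to fixed mixing weights. Weights and streams are placeholders.
    import random
    from typing import Any, Dict, Iterator

    def mix_datasets(sources: Dict[str, Iterator[Any]],
                     weights: Dict[str, float],
                     seed: int = 0) -> Iterator[Any]:
        """Yield examples, picking a source each step in proportion to its weight."""
        rng = random.Random(seed)
        names = list(sources)
        probs = [weights[n] for n in names]
        while True:
            name = rng.choices(names, weights=probs, k=1)[0]
            try:
                yield next(sources[name])
            except StopIteration:
                return  # simplification: stop once any source is exhausted

    # Toy usage with in-memory stand-ins for the real data loaders.
    streams = {
        "image_caption": (f"caption_{i}" for i in range(100)),
        "interleaved": (f"interleaved_{i}" for i in range(100)),
        "text_only": (f"text_{i}" for i in range(100)),
    }
    mixture = mix_datasets(streams, {"image_caption": 0.45,
                                     "interleaved": 0.45,
                                     "text_only": 0.10})
    batch = [next(mixture) for _ in range(8)]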

B. McKinzie, Z. Gan—First authors; J. P. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, P. Grasch—Core authors; A. Toshev—Senior author.

Notes

  1. https://github.com/apple/axlearn.

Acknowledgements

The authors would like to thank Floris Weers for his assistance with multimodal evaluation infrastructure; Vaishaal Shankar, Alaa El-Nouby, Yang Zhao, Shuangfei Zhai, Russ Webb, Hadi Pouransari, Hong-You Chen, Yanghao Li, and David Mizrahi for valuable guidance, suggestions, and feedback; Chen Chen and Qibin Chen for help on instruction tuning; Maitreyi Kunnavakkam Vinjimur, Megan Maher Welsh, Bhavika Devnani, and David Koski for their assistance with input pipelines and data processing; Guoli Yin, Tom Nickson, and Michael Tu for assistance with the AXLearn infrastructure and early LLM work; Ankur Jain and Varsha Mohan Paidi for assistance with dataset creation and filtering; Esteban Gonzalez, Ian Clark, Jack Bailin, David Koski, and in particular Venkata Yerneni for assistance with the internal Weights & Biases instance for tracking experiments and model evaluations.

Author information

Corresponding author

Correspondence to Alexander Toshev.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 13668 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

McKinzie, B. et al. (2025). MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15087. Springer, Cham. https://doi.org/10.1007/978-3-031-73397-0_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-73397-0_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73396-3

  • Online ISBN: 978-3-031-73397-0

  • eBook Packages: Computer Science, Computer Science (R0)
