Abstract
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identify several crucial design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published multimodal pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has a substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models, including both dense variants up to 30B and mixture-of-experts (MoE) variants up to 64B parameters, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
B. McKinzie, Z. Gan—First authors; J. P. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, P. Grasch—Core authors; A. Toshev—Senior author.
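To make the recipe described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of the architecture it outlines, not the authors' released implementation: a toy patch encoder stands in for a CLIP-style image encoder, a simple pooling connector fixes the image token count, and a small causally masked Transformer stands in for the decoder LLM. All module names, widths, and the 64-token image budget are illustrative assumptions.

```python
# Hypothetical sketch, not the MM1 implementation: image encoder -> vision-language
# connector -> decoder LLM. Sizes and the image-token budget are illustrative only.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Toy stand-in for a CLIP-style ViT: non-overlapping patches -> patch features."""
    def __init__(self, patch: int = 16, vision_dim: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(3, vision_dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, num_patches, vision_dim)
        return self.proj(x).flatten(2).transpose(1, 2)


class PoolingConnector(nn.Module):
    """Pools patch features down to a fixed image-token count, then projects to LLM width.

    The abstract reports that resolution and image token count matter far more than the
    connector design, which is why a simple pool + linear projection suffices here.
    """
    def __init__(self, vision_dim: int, llm_dim: int, num_image_tokens: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_image_tokens)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)  # (B, num_image_tokens, llm_dim)


class ToyMLLM(nn.Module):
    """Wires encoder -> connector -> a small causal Transformer standing in for the LLM."""
    def __init__(self, llm_dim: int = 128, vocab: int = 1000):
        super().__init__()
        self.encoder = PatchEmbed(vision_dim=64)
        self.connector = PoolingConnector(64, llm_dim, num_image_tokens=64)
        self.text_embed = nn.Embedding(vocab, llm_dim)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, images: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        img_tokens = self.connector(self.encoder(images))                # (B, 64, llm_dim)
        seq = torch.cat([img_tokens, self.text_embed(text_ids)], dim=1)  # image then text
        L = seq.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        return self.lm_head(self.blocks(seq, mask=causal))               # next-token logits


model = ToyMLLM()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 80, 1000])
```

Under this wiring, swapping the pooling connector for a more elaborate design would change comparatively little; raising the input resolution or the image-token count is the higher-leverage knob the ablations point to.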
Acknowledgements
The authors would like to thank Floris Weers for his assistance with multimodal evaluation infrastructure; Vaishaal Shankar, Alaa El-Nouby, Yang Zhao, Shuangfei Zhai, Russ Webb, Hadi Pouransari, Hong-You Chen, Yanghao Li, and David Mizrahi for valuable guidance, suggestions, and feedback; Chen Chen and Qibin Chen for help on instruction tuning; Maitreyi Kunnavakkam Vinjimur, Megan Maher Welsh, Bhavika Devnani, and David Koski for their assistance with input pipelines and data processing; Guoli Yin, Tom Nickson, and Michael Tu for assistance with the AXLearn infrastructure and early LLM work; Ankur Jain and Varsha Mohan Paidi for assistance with dataset creation and filtering; Esteban Gonzalez, Ian Clark, Jack Bailin, David Koski, and in particular Venkata Yerneni for assistance with the internal Weights & Biases instance for tracking experiments and model evaluations.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
McKinzie, B. et al. (2025). MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15087. Springer, Cham. https://doi.org/10.1007/978-3-031-73397-0_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73396-3
Online ISBN: 978-3-031-73397-0
eBook Packages: Computer Science, Computer Science (R0)