Abstract
Language-image pre-training largely relies on how precisely and thoroughly a text describes its paired image. In practice, however, the contents of an image can be so rich that describing them well requires lengthy captions (e.g., of 10 sentences), which are usually missing from existing datasets. Consequently, there is currently no clear evidence of whether and how language-image pre-training could benefit from long captions. To figure this out, we first re-caption 30M images with detailed descriptions using a pre-trained Multimodal Large Language Model (MLLM), and then study the usage of the resulting captions under a contrastive learning framework. We observe that each sentence within a long caption is very likely to describe the image only partially (e.g., a single object). Motivated by this, we propose to dynamically sample sub-captions from the text label to construct multiple positive pairs, and introduce a grouping loss to match the embeddings of each sub-caption with its corresponding local image patches in a self-supervised manner. Experimental results on a wide range of downstream tasks demonstrate the consistent superiority of our method, termed DreamLIP, over previous alternatives, highlighting its fine-grained representational capacity. Notably, on image-text retrieval and semantic segmentation, our model trained with 30M image-text pairs achieves performance on par with or even better than CLIP trained with 400M pairs. The project page is available at https://zyf0619sjtu.github.io/dream-lip.
Kecheng Zheng, Yifei Zhang: Equal contribution.
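To make the sub-caption idea in the abstract concrete, below is a minimal, hypothetical sketch (in PyTorch) of sampling sentence-level sub-captions from a long caption and contrasting them against images with a multi-positive loss. This is not the authors' released implementation: the sentence splitting, function names, and symmetric InfoNCE-style formulation are illustrative assumptions, and the grouping loss over local image patches is omitted.

```python
import random
import torch
import torch.nn.functional as F

def sample_subcaptions(long_caption, k=3):
    # Split the long caption into sentences and sample up to k as sub-captions.
    sentences = [s.strip() for s in long_caption.split(".") if s.strip()]
    return random.sample(sentences, min(k, len(sentences)))

def multi_positive_contrastive_loss(img_emb, txt_emb, img_ids, temperature=0.07):
    # img_emb: (B, D) image embeddings; txt_emb: (N, D) sub-caption embeddings;
    # img_ids: (N,) index of the paired image for each sub-caption.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = txt_emb @ img_emb.t() / temperature        # (N, B) similarities
    loss_t2i = F.cross_entropy(logits, img_ids)         # each sub-caption -> its image
    # Image-to-text direction: average the log-probabilities of the
    # multiple positive sub-captions belonging to each image.
    log_prob = F.log_softmax(logits.t(), dim=-1)        # (B, N)
    pos_mask = (img_ids.unsqueeze(0) ==
                torch.arange(img_emb.size(0)).unsqueeze(1)).float()
    loss_i2t = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return 0.5 * (loss_t2i + loss_i2t.mean())

# Toy usage with random features standing in for encoder outputs.
caption = "A dog runs on the beach. The sky is clear. A red ball lies in the sand."
print(sample_subcaptions(caption, k=2))
B, D = 2, 64
img_emb = torch.randn(B, D)
txt_emb = torch.randn(B * 3, D)                         # 3 sub-captions per image
img_ids = torch.arange(B).repeat_interleave(3)
print(multi_positive_contrastive_loss(img_emb, txt_emb, img_ids))
```

In this sketch each image contributes several positive text embeddings rather than one, which is the essence of the multiple-positive-pair construction described in the abstract.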
Acknowledgements
This work is supported in part by NSFC 62302246 and ZJNSFC under Grant LQ23F010008, and by the High Performance Computing Center at Eastern Institute of Technology, Ningbo, and the Ningbo Institute of Digital Twin. Wei Chen is supported by the National Natural Science Foundation of China (62132017) and the Zhejiang Provincial Natural Science Foundation of China (LD24F020011).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zheng, K. et al. (2025). DreamLIP: Language-Image Pre-training with Long Captions. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_5
DOI: https://doi.org/10.1007/978-3-031-72649-1_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72648-4
Online ISBN: 978-3-031-72649-1
eBook Packages: Computer Science, Computer Science (R0)