DOI: 10.1145/3463945.3468170
keynote

WenLan: Efficient Large-Scale Multi-Modal Pre-Training on Real World Data

Published: 27 August 2021

Abstract

Multi-modal pre-training models have been intensively explored in recent years to bridge vision and language. However, most of them explicitly model the cross-modal interaction between image-text pairs, assuming that a strong semantic correlation exists between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, under the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework [1]. We construct a large Chinese multi-source dataset of 650 million image-text pairs for pre-training our model. Extensive experiments demonstrate that WenLan performs well on various downstream tasks and makes it easy to build efficient applications based on retrieval between images and texts.
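To illustrate the two-tower contrastive objective mentioned above, the following is a minimal NumPy sketch of a symmetric InfoNCE loss over a batch of paired image and text embeddings. It is a simplification for illustration only: the actual BriVL model described in [1] uses learned encoder towers and a MoCo-style momentum mechanism with a negative-sample queue, none of which are reproduced here; the function and variable names are ours, not from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: each image should match its own text and vice versa.

    img_emb, txt_emb: arrays of shape (B, D), row i of each is a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # (B, B) pairwise similarity matrix
    labels = np.arange(len(logits))     # the diagonal holds the positive pairs

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()              # -log p(positive)

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
B, D = 4, 8
img = rng.normal(size=(B, D))
txt = img + 0.01 * rng.normal(size=(B, D))   # nearly-aligned positive pairs
loss_aligned = info_nce_loss(img, txt)
loss_random = info_nce_loss(img, rng.normal(size=(B, D)))
```

With well-aligned pairs the diagonal similarities dominate and the loss approaches zero, while random pairings give a loss near log(B); this gap is what drives the two towers to embed matched images and texts close together.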

Reference

[1]
Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, Shizhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training. CoRR abs/2103.06561 (2021).

Cited By

  • (2022) Zero-Shot Video Grounding for Automatic Video Understanding in Sustainable Smart Cities. Sustainability 15(1): 153. DOI: 10.3390/su15010153. Online publication date: 22-Dec-2022.


    Published In

    MMPT '21: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding
    August 2021
    60 pages
    ISBN:9781450385305
    DOI:10.1145/3463945
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. image and text pairs
    2. multi-modal
    3. pre-training models
    4. weak correlation assumption

    Qualifiers

    • Keynote

    Conference

ICMR '21

