DOI: 10.1145/3463945.3468170
keynote

WenLan: Efficient Large-Scale Multi-Modal Pre-Training on Real World Data

Published: 27 August 2021

Abstract

Multi-modal pre-training models have been intensively explored in recent years to bridge vision and language. However, most of them explicitly model the cross-modal interaction between image-text pairs, assuming that a strong semantic correlation exists between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, under the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework [1]. We construct a large Chinese multi-source dataset of 650 million image-text pairs for pre-training our model. Extensive experiments demonstrate that WenLan performs well on various downstream tasks and makes it easy to build efficient applications based on retrieval between images and texts.
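To illustrate the two-tower contrastive objective mentioned above, the following is a minimal NumPy sketch of a symmetric InfoNCE loss over a batch of paired image and text embeddings. It is a simplification for illustration only: the actual BriVL model described in [1] uses learned encoder towers and a MoCo-style momentum mechanism with a negative-sample queue, none of which are reproduced here; the function and variable names are ours, not from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: each image should match its own text and vice versa.

    img_emb, txt_emb: arrays of shape (B, D), row i of each is a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # (B, B) pairwise similarity matrix
    labels = np.arange(len(logits))     # the diagonal holds the positive pairs

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()              # -log p(positive)

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
B, D = 4, 8
img = rng.normal(size=(B, D))
txt = img + 0.01 * rng.normal(size=(B, D))   # nearly-aligned positive pairs
loss_aligned = info_nce_loss(img, txt)
loss_random = info_nce_loss(img, rng.normal(size=(B, D)))
```

With well-aligned pairs the diagonal similarities dominate and the loss approaches zero, while random pairings give a loss near log(B); this gap is what drives the two towers to embed matched images and texts close together.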

Reference

[1]
Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, Shizhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training. CoRR abs/2103.06561 (2021).

Cited By

  • (2022) Zero-Shot Video Grounding for Automatic Video Understanding in Sustainable Smart Cities. Sustainability 15(1): 153. DOI: 10.3390/su15010153. Online publication date: 22-Dec-2022.


    Published In

    MMPT '21: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding
    August 2021
    60 pages
    ISBN:9781450385305
    DOI:10.1145/3463945
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. image and text pairs
    2. multi-modal
    3. pre-training models
    4. weak correlation assumption

    Qualifiers

    • Keynote

    Conference

ICMR '21

