MMPT'21: International Joint Workshop on Multi-Modal Pre-Training for Multimedia Understanding

DOI: 10.1145/3460426.3470947

Abstract

Pre-training has emerged as a powerful way to learn strong representations in many fields (e.g., natural language processing, computer vision). In the last few years, we have witnessed many research works on multi-modal pre-training that have achieved state-of-the-art performance on a range of multimedia tasks (e.g., image-text retrieval, video localization, speech recognition). In this workshop, we aim to bring together researchers working on related topics for insightful discussion. We also intend to attract more researchers to explore and investigate opportunities for designing and applying innovative pre-training models to multimedia tasks.


Published In

ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
August 2021
715 pages
ISBN: 9781450384636
DOI: 10.1145/3460426
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 2021

Author Tags

  1. multi-modal
  2. multimedia
  3. pre-training
  4. understanding

Qualifiers

  • Abstract

Conference

ICMR '21

Acceptance Rates

Overall Acceptance Rate: 254 of 830 submissions (31%)
