It is our great pleasure to welcome you to the ICMR 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding - MMPT 2021.
The First International Joint Workshop on Multi-Modal Pre-Training for Multimedia Understanding aims to gather researchers working on related topics for more insightful discussion. Pre-training has emerged as a way to learn strong representations in many fields (e.g., natural language processing, computer vision), in both industry and research communities.
Proceeding Downloads
Cross-modal Pretraining and Matching for Video Understanding
Videos are generally accompanied by multi-modal information such as audio, text, and motion. This multi-modal information is becoming an important cue for understanding video content. How to model the correlation between multiple modalities in videos is ...
WenLan: Efficient Large-Scale Multi-Modal Pre-Training on Real World Data
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic ...
Be Specific, Be Clear: Bridging Machine and Human Captions by Scene-Guided Transformer
Automatically generating natural language descriptions for images, i.e., image captioning, is one of the primary goals for multimedia understanding. The recent success of deep neural networks in image captioning has been accompanied by region-based ...
Language-Conditioned Region Proposal and Retrieval Network for Referring Expression Comprehension
Referring expression comprehension (REC) is a multi-modal task that aims to localize target regions in images according to language descriptions. Existing methods can be grouped into two categories: proposal-based methods and proposal-free methods. ...
Residual Recurrent CRNN for End-to-End Optical Music Recognition on Monophonic Scores
One of the challenges of the Optical Music Recognition task is to transcribe the symbols in camera-captured images into digital music notation. The previous end-to-end model, which was developed as a Convolutional Recurrent Neural Network, does not ...
Style-Guided Image-to-Image Translation for Multiple Domains
Cross-domain image translation has drawn increasing attention. It aims to translate images from a source domain into target domains, so that images can appear in multiple styles. The most popular approaches use encoders to extract style ...
A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods
Opinion and sentiment analysis is a vital task for characterizing subjective information in social media posts. In this paper, we present a comprehensive experimental evaluation and comparison of six state-of-the-art methods, from which we have re-...
Unsupervised Training Data Generation of Handwritten Formulas using Generative Adversarial Networks with Self-Attention
The recognition of handwritten mathematical expressions in images and video frames is still a difficult and unsolved problem. Deep convolutional neural networks are in principle a promising approach, but typically require a large amount of labeled training ...
Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding