DOI: 10.1145/3463945
MMPT '21: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding
ACM 2021 Proceeding
Publisher:
  • Association for Computing Machinery
  • New York
  • NY
  • United States
Conference:
ICMR '21: International Conference on Multimedia Retrieval Taipei Taiwan November 16 - 19, 2021
ISBN:
978-1-4503-8530-5
Published:
27 August 2021
Abstract

It is our great pleasure to welcome you to the ICMR 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding - MMPT 2021.

The First International Joint Workshop on Multi-Modal Pre-Training for Multimedia Understanding aims to bring together researchers working on related topics for more insightful discussion. Pre-training is an emerging topic that provides a way to learn strong representations in many fields (e.g., natural language processing, computer vision), in both industry and research communities.

SESSION: Keynote Talks
keynote
Cross-modal Pretraining and Matching for Video Understanding

Videos are generally accompanied by multi-modal information such as audio, text, and motion. This multi-modal information is becoming an important cue for understanding video content. How to model the correlation between multiple modalities in videos is ...

keynote
WenLan: Efficient Large-Scale Multi-Modal Pre-Training on Real World Data

Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic ...

SESSION: MMPT 2021 Workshop Presentation
research-article
Be Specific, Be Clear: Bridging Machine and Human Captions by Scene-Guided Transformer

Automatically generating natural language descriptions for images, i.e., image captioning, is one of the primary goals for multimedia understanding. The recent success of deep neural networks in image captioning has been accompanied by region-based ...

research-article
Language-Conditioned Region Proposal and Retrieval Network for Referring Expression Comprehension

Referring expression comprehension (REC) is a multi-modal task that aims to localize target regions in images according to language descriptions. Existing methods can be grouped into two categories: proposal-based methods and proposal-free methods. ...

short-paper
Residual Recurrent CRNN for End-to-End Optical Music Recognition on Monophonic Scores

One of the challenges of the Optical Music Recognition task is to transcribe the symbols in camera-captured images into digital music notation. The previous end-to-end model, which was developed as a Convolutional Recurrent Neural Network, does not ...

research-article
Style-Guided Image-to-Image Translation for Multiple Domains

Cross-domain image translation has drawn increasing attention. It aims to translate images from a source domain into target domains, such that images can appear in multiple styles. The most popular approaches use encoders to extract style ...

research-article
A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods

Opinion and sentiment analysis is a vital task for characterizing subjective information in social media posts. In this paper, we present a comprehensive experimental evaluation and comparison of six state-of-the-art methods, from which we have re-...

research-article
Unsupervised Training Data Generation of Handwritten Formulas using Generative Adversarial Networks with Self-Attention

The recognition of handwritten mathematical expressions in images and video frames is a difficult and still unsolved problem. Deep convolutional neural networks are in principle a promising approach, but typically require a large amount of labeled training ...

Contributors
  • Microsoft Research
  • INRIA Institut National de Recherche en Informatique et en Automatique
  • Renmin University of China
  • Carnegie Mellon University