It is our great pleasure to welcome you to the ICMR 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding - MMPT 2021.
The First International Joint Workshop on Multi-Modal Pre-Training for Multimedia Understanding aims to gather researchers working on related topics for more insightful discussion. Pre-training has emerged as a way to learn strong representations in many fields (e.g., natural language processing, computer vision), in both industry and research communities.
Proceeding Downloads
Cross-modal Pretraining and Matching for Video Understanding
Videos are generally accompanied by multi-modal information such as audio, text, and motion. This multi-modal information is becoming an important cue for understanding video content. How to model the correlation between multiple modalities in videos is ...
WenLan: Efficient Large-Scale Multi-Modal Pre-Training on Real World Data
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic ...
Be Specific, Be Clear: Bridging Machine and Human Captions by Scene-Guided Transformer
Automatically generating natural language descriptions for images, i.e., image captioning, is one of the primary goals for multimedia understanding. The recent success of deep neural networks in image captioning has been accompanied by region-based ...
Language-Conditioned Region Proposal and Retrieval Network for Referring Expression Comprehension
Referring expression comprehension (REC) is a multi-modal task that aims to localize target regions in images according to language descriptions. Existing methods can be grouped into two categories: proposal-based methods and proposal-free methods. ...
Residual Recurrent CRNN for End-to-End Optical Music Recognition on Monophonic Scores
One of the challenges of the Optical Music Recognition task is to transcribe the symbols in camera-captured images into digital music notation. The previous end-to-end model, which was developed as a Convolutional Recurrent Neural Network, does not ...
Style-Guided Image-to-Image Translation for Multiple Domains
Cross-domain image translation has drawn increasing attention. It aims to translate images from a source domain into target domains, so that images can appear in multiple styles. The most popular approaches use encoders to extract style ...
A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods
Opinion and sentiment analysis is a vital task for characterizing subjective information in social media posts. In this paper, we present a comprehensive experimental evaluation and comparison of six state-of-the-art methods, from which we have re-...
Unsupervised Training Data Generation of Handwritten Formulas using Generative Adversarial Networks with Self-Attention
The recognition of handwritten mathematical expressions in images and video frames is still a difficult and unsolved problem. Deep convolutional neural networks are in principle a promising approach, but typically require a large amount of labeled training ...
Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding