DOI: 10.1145/3463945.3468169
keynote

Cross-modal Pretraining and Matching for Video Understanding

Published: 27 August 2021

Abstract

Videos are generally accompanied by multi-modal information such as audio, text, and motion, and this multi-modal information is becoming an important cue for understanding video content. How to model the correlation between the modalities of a video remains an unsolved problem in video understanding tasks such as action recognition, temporal grounding, and video description. In this talk, we focus on two specific video understanding tasks (i.e., cross-modal self-supervised pretraining and temporal grounding) that exploit video-text cross-modal information. In particular, we notice that videos are naturally accompanied by abundant text such as YouTube titles, Instagram captions, and movie scripts. This textual information can serve as general supervision for training a multi-modal network, which can then be used as a general video representation to be fine-tuned on downstream tasks, or as a cross-modal matching similarity for video segment retrieval.

Specifically, we first present a general cross-modal pair discrimination (CPD) framework to capture the correlation between a video and its associated text. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning, the learnt models obtain competitive results for action classification on Kinetics under the linear classification protocol. Moreover, our visual model provides an effective initialization for fine-tuning on downstream tasks, which yields a remarkable performance gain for action recognition on UCF101 and HMDB51. Our CPD demonstrates that pre-training on a relatively small dataset can yield performance comparable to methods that use an order of magnitude more data, which is meaningful and practical in scenarios with limited computational resources.

Second, we present a Contrastive and Compatible Matching Network (C2M-Net) that directly models the relations between language queries and video moments in a joint embedding space. This new metric-learning framework fully exploits negative samples from two new aspects: constructing negative pairs through a dual matching scheme and mining negative pairs across different videos. These new negative samples enhance the joint representation learning of the two modalities via contrastive learning that maximizes their mutual information. In addition, to precisely rank relatively positive pairs for accurate temporal grounding, we also learn the compatibility between queries and moments by directly regressing their IoU-based similarity. Our C2M-Net yields state-of-the-art performance on three benchmarks: Charades-STA, TACoS, and ActivityNet Captions.
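To make the pair-discrimination objective concrete, the following is a minimal PyTorch-style sketch of contrastive video-text pretraining in the spirit of CPD: the matched video-text pairs in a batch are treated as positives and all other pairings as negatives. The encoder architectures, embedding dimension, and temperature are illustrative assumptions, not the actual CPD configuration.

```python
# Minimal sketch of a video-text pair-discrimination (CPD-style) objective.
# The module choices below (toy 3D-conv video encoder, mean-pooled text
# encoder, 256-d joint space, temperature 0.07) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Placeholder backbone: a single 3D conv + pooling instead of a real
        # spatiotemporal network such as a 3D ResNet.
        self.conv = nn.Conv3d(3, 64, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.proj = nn.Linear(64, dim)

    def forward(self, clips):                 # clips: (B, 3, T, H, W)
        h = self.pool(F.relu(self.conv(clips))).flatten(1)
        return F.normalize(self.proj(h), dim=-1)

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        # Placeholder text model: mean-pooled word embeddings.
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):              # token_ids: (B, L)
        return F.normalize(self.proj(self.embed(token_ids)), dim=-1)

def pair_discrimination_loss(v, t, temperature=0.07):
    """Symmetric InfoNCE: matched video-text pairs are positives,
    all other pairs in the batch are negatives."""
    logits = v @ t.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random data.
video_enc, text_enc = VideoEncoder(), TextEncoder()
clips = torch.randn(4, 3, 8, 64, 64)
tokens = torch.randint(0, 30000, (4, 16))
loss = pair_discrimination_loss(video_enc(clips), text_enc(tokens))
loss.backward()
```

In the full framework, the visual encoder trained with such an objective is what gets evaluated under the linear classification protocol on Kinetics or fine-tuned on UCF101 and HMDB51.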

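The temporal grounding part can be sketched in the same spirit: a dual (query-to-moment and moment-to-query) contrastive term pulls matched query-moment pairs together while using in-batch pairs, including moments from other videos, as negatives, and a compatibility term regresses the query-moment similarity toward temporal IoU so that ranking follows overlap. The tensor shapes, temperature, and loss weight below are assumptions for illustration, not the actual C2M-Net implementation.

```python
# Sketch of a contrastive + compatibility (IoU-regression) matching loss in the
# spirit of C2M-Net. Embeddings, the temperature, and the loss weight are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_matching_loss(q, m, temperature=0.1):
    """Dual contrastive loss over a joint embedding space.
    q: (B, D) query embeddings; m: (B, D) embeddings of their matched moments.
    Other in-batch moments/queries (possibly from different videos) act as negatives."""
    q, m = F.normalize(q, dim=-1), F.normalize(m, dim=-1)
    logits = q @ m.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def compatibility_loss(q, candidates, gt_iou):
    """Regress the query-moment similarity toward the temporal IoU between each
    candidate moment and the ground-truth segment, so ranking follows IoU.
    q: (B, D), candidates: (B, N, D), gt_iou: (B, N) with values in [0, 1]."""
    q = F.normalize(q, dim=-1)
    candidates = F.normalize(candidates, dim=-1)
    sim = torch.einsum("bd,bnd->bn", q, candidates)   # cosine similarity in [-1, 1]
    pred_iou = (sim + 1) / 2                          # map to [0, 1]
    return F.mse_loss(pred_iou, gt_iou)

# Toy usage with random tensors.
B, N, D = 4, 6, 256
q = torch.randn(B, D, requires_grad=True)
matched = torch.randn(B, D)
cands = torch.randn(B, N, D)
iou = torch.rand(B, N)
total = contrastive_matching_loss(q, matched) + 1.0 * compatibility_loss(q, cands, iou)
total.backward()
```

At inference time, a compatibility score of this kind could be used to rank candidate moments for a given language query.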


Information

Published In

MMPT '21: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding
August 2021
60 pages
ISBN: 9781450385305
DOI: 10.1145/3463945
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 August 2021


Author Tags

  1. cross-modal pretraining
  2. temporal grounding
  3. video understanding
  4. video-text modeling

Qualifiers

  • Keynote

Funding Sources

  • National Natural Science Foundation of China

Conference

ICMR '21


