research-article

You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos

Authors:

Xin Sun,

Xuan Wang,

Jialin Gao,

Qiong Liu,

Xi ZhouAuthors Info & Claims

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 1022 - 1032

https://doi.org/10.1145/3477495.3532083

Published: 07 July 2022 Publication History

Get Access

Abstract

Moment retrieval in videos is a challenging task that aims to retrieve the most relevant video moment in an untrimmed video given a sentence description. Previous methods tend to perform self-modal learning and cross-modal interaction in a coarse manner, which neglect fine-grained clues contained in video content, query context, and their alignment. To this end, we propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at a multi-granularity level. Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework. A coarse-grained feature encoder and a co-attention mechanism are utilized to obtain a preliminary perception of intra-modality and inter-modality information. Then a fine-grained feature encoder and a conditioned interaction module are introduced to enhance the initial perception inspired by how humans address reading comprehension problems. Moreover, to alleviate the huge computation burden of some existing methods, we further design an efficient choice comparison module and reduce the hidden size with imperceptible quality loss. Extensive experiments on Charades-STA, TACoS, and ActivityNet Captions datasets demonstrate that our solution outperforms existing state-of-the-art methods.

Supplementary Material

MP4 File (SIGIR22-fp160.mp4)

We formulate moment retrieval task from the perspective of multi-choice reading comprehension and propose a novel Multi-Granularity Perception Network (MGPN) to tackle it. We integrate several human reading strategies (i.e. passage question reread, enhanced passage question alignment, choice comparison) into our framework and empower our model to perceive intra-modality and inter-modality information at a multi-granularity level. Extensive experiments demonstrate the effectiveness and efficiency of our proposed MGPN.

Download
16.35 MB

References

[1]

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision. 5803--5812.

Abstract

Supplementary Material

References

Cited By

Index Terms

Recommendations

Retrieve-and-Read: Multi-task Learning of Information Retrieval and Reading Comprehension

Why read if you can skim: towards enabling faster screen reading

Teach Machine How to Read: Reading Behavior Inspired Relevance Estimation

Comments

Information

Published In

Sponsors

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Login options

Full Access

View options

PDF

eReader

Share

Share this Publication link

Share on social media

Affiliations