DOI: 10.1145/3477495.3532083

You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos

Published: 07 July 2022

Abstract

Moment retrieval in videos is a challenging task that aims to retrieve the most relevant moment from an untrimmed video given a sentence description. Previous methods tend to perform self-modal learning and cross-modal interaction in a coarse manner, neglecting fine-grained clues contained in the video content, the query context, and their alignment. To this end, we propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at a multi-granularity level. Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework. A coarse-grained feature encoder and a co-attention mechanism are used to obtain a preliminary perception of intra-modality and inter-modality information. Then, inspired by how humans address reading comprehension problems, a fine-grained feature encoder and a conditioned interaction module are introduced to enhance this initial perception. Moreover, to alleviate the heavy computational burden of some existing methods, we design an efficient choice comparison module and reduce the hidden size with imperceptible quality loss. Extensive experiments on the Charades-STA, TACoS, and ActivityNet Captions datasets demonstrate that our solution outperforms existing state-of-the-art methods.
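The page carries no code, but the co-attention step the abstract mentions is a standard building block. The sketch below shows one generic way to realize bidirectional video-query co-attention in PyTorch; it is our own illustration under assumed shapes and names (clip features video, word features query), not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    # Illustrative sketch only: generic bidirectional co-attention between
    # clip-level video features and word-level query features. Shapes and
    # names are assumptions, not MGPN's actual module.
    def __init__(self, dim):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)  # project video clip features
        self.proj_q = nn.Linear(dim, dim)  # project query word features

    def forward(self, video, query):
        # video: (B, Nv, D) clip features; query: (B, Nq, D) word features
        affinity = self.proj_v(video) @ self.proj_q(query).transpose(1, 2)  # (B, Nv, Nq)
        v2q = F.softmax(affinity, dim=-1) @ query                       # clips attend to words: (B, Nv, D)
        q2v = F.softmax(affinity, dim=1).transpose(1, 2) @ video        # words attend to clips: (B, Nq, D)
        return v2q, q2v

# Toy usage: 64 clips, 12 query words, 256-d features.
v2q, q2v = CoAttention(256)(torch.randn(2, 64, 256), torch.randn(2, 12, 256))

The attended features would typically be fused back into each stream (e.g., by concatenation or gating) before the finer-grained interaction stages the abstract describes.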

Supplementary Material

MP4 File (SIGIR22-fp160.mp4)
We formulate the moment retrieval task from the perspective of multi-choice reading comprehension and propose a novel Multi-Granularity Perception Network (MGPN) to tackle it. We integrate several human reading strategies (i.e., passage-question rereading, enhanced passage-question alignment, and choice comparison) into our framework, empowering our model to perceive intra-modality and inter-modality information at a multi-granularity level. Extensive experiments demonstrate the effectiveness and efficiency of the proposed MGPN.
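As with the block above, the following is only one plausible reading of the "choice comparison" strategy named here: candidate moments (the "choices") attend to one another before being scored. The module name, shapes, and attention design are our assumptions, not the paper's actual architecture, which the abstract notes is specifically designed for efficiency.

import torch
import torch.nn as nn

class ChoiceComparison(nn.Module):
    # Illustrative sketch only: each candidate moment attends to its rival
    # candidates via self-attention, then receives a matching score.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # per-candidate matching score

    def forward(self, moments):
        # moments: (B, Nc, D) features of Nc candidate moments
        compared, _ = self.attn(moments, moments, moments)
        return self.score(compared + moments).squeeze(-1)  # (B, Nc) scores

# Toy usage: score 100 candidate moments per video.
scores = ChoiceComparison(256)(torch.randn(2, 100, 256))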




Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2022
3569 pages
ISBN: 9781450387323
DOI: 10.1145/3477495


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. human reading strategies
  2. moment retrieval in videos
  3. multi-granularity perception

Qualifiers

  • Research-article

Conference

SIGIR '22

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 29
  • Downloads (last 6 weeks): 2
Reflects downloads up to 28 Feb 2025

Cited By
  • (2025) Dual Semantic Reconstruction Network for Weakly Supervised Temporal Sentence Grounding. IEEE Transactions on Multimedia, Vol. 27, 95-107. DOI: 10.1109/TMM.2024.3521676
  • (2024) Towards Visual-Prompt Temporal Answer Grounding in Instructional Video. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 46, No. 12, 8836-8853. DOI: 10.1109/TPAMI.2024.3411045
  • (2024) Gist, Content, Target-Oriented: A 3-Level Human-Like Framework for Video Moment Retrieval. IEEE Transactions on Multimedia, Vol. 26, 11044-11056. DOI: 10.1109/TMM.2024.3443672
  • (2024) Learning Feature Semantic Matching for Spatio-Temporal Video Grounding. IEEE Transactions on Multimedia, Vol. 26, 9268-9279. DOI: 10.1109/TMM.2024.3387696
  • (2024) Dynamic Pathway for Query-Aware Feature Learning in Language-Driven Action Localization. IEEE Transactions on Multimedia, Vol. 26, 7451-7461. DOI: 10.1109/TMM.2024.3368919
  • (2024) Multi-Level Signal Fusion for Enhanced Weakly-Supervised Audio-Visual Video Parsing. IEEE Signal Processing Letters, Vol. 31, 1149-1153. DOI: 10.1109/LSP.2024.3388957
  • (2024) The Deep Learning-Based Semantic Cross-Modal Moving-Object Moment Retrieval System. In 2024 10th International Conference on Applied System Innovation (ICASI), 134-136. DOI: 10.1109/ICASI60819.2024.10547990
  • (2024) Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18709-18719. DOI: 10.1109/CVPR52733.2024.01770
  • (2024) Multi-granularity retrieval of mineral resource geological reports based on multi-feature association. Ore Geology Reviews, Vol. 165, 105889. DOI: 10.1016/j.oregeorev.2024.105889
  • (2024) High-compressed deepfake video detection with contrastive spatiotemporal distillation. Neurocomputing, Vol. 565, Issue C. DOI: 10.1016/j.neucom.2023.126872
