DOI: 10.1145/3394171.3413975

Dual Path Interaction Network for Video Moment Localization

Published: 12 October 2020

Abstract

Video moment localization aims to localize a specific moment in a video given a natural language query. Previous works either use alignment information to identify the best-matching candidate (i.e., the top-down approach) or use discrimination information to predict the temporal boundaries of the match (i.e., the bottom-up approach). Little research has considered candidate-level alignment information and frame-level boundary information together or exploited the complementarity between them. In this paper, we propose a unified top-down and bottom-up approach called the Dual Path Interaction Network (DPIN), in which alignment and discrimination information are closely connected to jointly make the prediction. Our model includes a boundary prediction pathway encoding the frame-level representation and an alignment pathway extracting the candidate-level representation. The two branches of our network produce two different but complementary representations for moment localization. To enforce consistency and strengthen the connection between the two representations, we propose a semantically conditioned interaction module. Experimental results on three popular benchmarks (i.e., TACoS, Charades-STA, and ActivityNet Captions) demonstrate that the proposed approach effectively localizes the relevant moment and outperforms state-of-the-art approaches.
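
To make the dual-path idea above concrete, the following is a minimal PyTorch sketch of such an architecture: a frame-level pathway producing per-frame start/end (boundary) logits, a candidate-level pathway scoring pooled segment proposals, and a query-conditioned gate standing in for the semantically conditioned interaction module. Every name, dimension, and design choice here (DualPathSketch, SemanticInteraction, average-pooled candidates, a gated residual mix) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of a dual-path moment localization network. All modules,
# shapes, and the fusion scheme are illustrative assumptions, not the
# implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticInteraction(nn.Module):
    """Query-conditioned gate that mixes the other pathway's features into this
    pathway's representation (a stand-in for the paper's semantically
    conditioned interaction module; the real module is not reproduced here)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, own, other, query):
        # query: (B, 1, dim) sentence embedding, broadcast along the sequence.
        g = torch.sigmoid(self.gate(torch.cat([other, query.expand_as(other)], dim=-1)))
        return own + g * other


class DualPathSketch(nn.Module):
    def __init__(self, video_dim=500, query_dim=300, dim=256, num_candidates=16):
        super().__init__()
        self.frame_enc = nn.GRU(video_dim, dim, batch_first=True)  # frame-level pathway
        self.query_enc = nn.GRU(query_dim, dim, batch_first=True)  # sentence encoder
        self.pool = nn.AdaptiveAvgPool1d(num_candidates)           # crude fixed-window candidates
        self.cand_proj = nn.Linear(dim, dim)                       # candidate-level pathway
        self.interact_frame = SemanticInteraction(dim)
        self.interact_cand = SemanticInteraction(dim)
        self.boundary_head = nn.Linear(dim, 2)  # per-frame start/end logits (bottom-up)
        self.align_head = nn.Linear(dim, 1)     # per-candidate matching score (top-down)

    def forward(self, frames, query_words):
        # frames: (B, T, video_dim); query_words: (B, L, query_dim)
        f, _ = self.frame_enc(frames)                          # (B, T, dim)
        _, q = self.query_enc(query_words)                     # (1, B, dim) final hidden state
        q = q.transpose(0, 1)                                  # (B, 1, dim)
        pooled = self.pool(f.transpose(1, 2)).transpose(1, 2)  # (B, N, dim)
        c = torch.relu(self.cand_proj(pooled))                 # candidate-level representation
        # Upsample candidate features to frame rate so the two paths can interact.
        c_up = F.interpolate(c.transpose(1, 2), size=f.size(1),
                             mode="linear", align_corners=False).transpose(1, 2)
        f2 = self.interact_frame(f, c_up, q)                   # frame path reads candidate path
        c2 = self.interact_cand(c, pooled, q)                  # candidate path reads frame path
        return self.boundary_head(f2), self.align_head(c2).squeeze(-1)


# Toy usage: 2 videos of 64 frame features, queries of 10 word embeddings.
model = DualPathSketch()
boundary_logits, align_scores = model(torch.randn(2, 64, 500), torch.randn(2, 10, 300))
print(boundary_logits.shape, align_scores.shape)  # (2, 64, 2) and (2, 16)
```

At inference one would typically combine the two outputs, e.g. rank candidates by alignment score and refine the winner's boundaries with the frame-level logits; the paper's actual fusion scheme and training losses are not reproduced in this sketch.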

Supplementary Material

MP4 File (3394171.3413975.mp4)
Presentation Video




Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020, 4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. moment localization

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • National Natural Science Foundation of China

Conference

MM '20

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 44
  • Downloads (last 6 weeks): 2

Reflects downloads up to 20 Feb 2025.

Cited By

  • Uncertain features exploration in temporal moment localization via language by utilizing customized temporal transformer. Knowledge-Based Systems, Vol. 309, Article 112667 (Jan 2025). DOI: 10.1016/j.knosys.2024.112667
  • Action-guided prompt tuning for video grounding. Information Fusion, Vol. 113, Article 102577 (Jan 2025). DOI: 10.1016/j.inffus.2024.102577
  • Reinforcement Learning with Multi-Policy Movement Strategy for Weakly Supervised Temporal Sentence Grounding. Applied Sciences, Vol. 14, No. 21, Article 9696 (23 Oct 2024). DOI: 10.3390/app14219696
  • Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 617-624 (1 Jan 2024). DOI: 10.1109/WACVW60836.2024.00071
  • Dynamic Pathway for Query-Aware Feature Learning in Language-Driven Action Localization. IEEE Transactions on Multimedia, Vol. 26, 7451-7461 (27 Feb 2024). DOI: 10.1109/TMM.2024.3368919
  • Leveraging the Video-Level Semantic Consistency of Event for Audio-Visual Event Localization. IEEE Transactions on Multimedia, Vol. 26, 4617-4627 (1 Jan 2024). DOI: 10.1109/TMM.2023.3324498
  • Momentum Cross-Modal Contrastive Learning for Video Moment Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, No. 7, 5977-5994 (Jul 2024). DOI: 10.1109/TCSVT.2023.3344097
  • Gazing After Glancing: Edge Information Guided Perception Network for Video Moment Retrieval. IEEE Signal Processing Letters, Vol. 31, 1535-1539 (2024). DOI: 10.1109/LSP.2024.3403533
  • Structural and Contrastive Guidance Mining for Weakly-Supervised Language Moment Localization. IEEE Access, Vol. 12, 129290-129301 (2024). DOI: 10.1109/ACCESS.2024.3450878
  • Context-aware relational reasoning for video chunks and frames overlapping in language-based moment localization. Neurocomputing, Vol. 601, Article 128224 (Oct 2024). DOI: 10.1016/j.neucom.2024.128224
