DOI: 10.1145/3394171.3413975

Dual Path Interaction Network for Video Moment Localization

Published: 12 October 2020

Abstract

Video moment localization aims to localize a specific moment in a video given a natural language query. Previous works either use alignment information to identify the best-matching candidate (i.e., the top-down approach) or use discrimination information to predict the temporal boundaries of the match (i.e., the bottom-up approach). Little research has considered candidate-level alignment information and frame-level boundary information together or exploited the complementarity between them. In this paper, we propose a unified top-down and bottom-up approach called the Dual Path Interaction Network (DPIN), in which alignment and discrimination information are closely connected to jointly make the prediction. Our model includes a boundary prediction pathway encoding the frame-level representation and an alignment pathway extracting the candidate-level representation. The two branches of our network produce two different but complementary representations for moment localization. To enforce consistency and strengthen the connection between the two representations, we propose a semantically conditioned interaction module. Experimental results on three popular benchmarks (i.e., TACoS, Charades-STA, and ActivityNet Captions) demonstrate that the proposed approach effectively localizes the relevant moment and outperforms state-of-the-art approaches.
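
To make the dual-path idea above concrete, the following is a minimal PyTorch sketch of such an architecture: a frame-level pathway producing per-frame start/end (boundary) logits, a candidate-level pathway scoring pooled segment proposals, and a query-conditioned gate standing in for the semantically conditioned interaction module. Every name, dimension, and design choice here (DualPathSketch, SemanticInteraction, average-pooled candidates, a gated residual mix) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of a dual-path moment localization network. All modules,
# shapes, and the fusion scheme are illustrative assumptions, not the
# implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticInteraction(nn.Module):
    """Query-conditioned gate that mixes the other pathway's features into this
    pathway's representation (a stand-in for the paper's semantically
    conditioned interaction module; the real module is not reproduced here)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, own, other, query):
        # query: (B, 1, dim) sentence embedding, broadcast along the sequence.
        g = torch.sigmoid(self.gate(torch.cat([other, query.expand_as(other)], dim=-1)))
        return own + g * other


class DualPathSketch(nn.Module):
    def __init__(self, video_dim=500, query_dim=300, dim=256, num_candidates=16):
        super().__init__()
        self.frame_enc = nn.GRU(video_dim, dim, batch_first=True)  # frame-level pathway
        self.query_enc = nn.GRU(query_dim, dim, batch_first=True)  # sentence encoder
        self.pool = nn.AdaptiveAvgPool1d(num_candidates)           # crude fixed-window candidates
        self.cand_proj = nn.Linear(dim, dim)                       # candidate-level pathway
        self.interact_frame = SemanticInteraction(dim)
        self.interact_cand = SemanticInteraction(dim)
        self.boundary_head = nn.Linear(dim, 2)  # per-frame start/end logits (bottom-up)
        self.align_head = nn.Linear(dim, 1)     # per-candidate matching score (top-down)

    def forward(self, frames, query_words):
        # frames: (B, T, video_dim); query_words: (B, L, query_dim)
        f, _ = self.frame_enc(frames)                          # (B, T, dim)
        _, q = self.query_enc(query_words)                     # (1, B, dim) final hidden state
        q = q.transpose(0, 1)                                  # (B, 1, dim)
        pooled = self.pool(f.transpose(1, 2)).transpose(1, 2)  # (B, N, dim)
        c = torch.relu(self.cand_proj(pooled))                 # candidate-level representation
        # Upsample candidate features to frame rate so the two paths can interact.
        c_up = F.interpolate(c.transpose(1, 2), size=f.size(1),
                             mode="linear", align_corners=False).transpose(1, 2)
        f2 = self.interact_frame(f, c_up, q)                   # frame path reads candidate path
        c2 = self.interact_cand(c, pooled, q)                  # candidate path reads frame path
        return self.boundary_head(f2), self.align_head(c2).squeeze(-1)


# Toy usage: 2 videos of 64 frame features, queries of 10 word embeddings.
model = DualPathSketch()
boundary_logits, align_scores = model(torch.randn(2, 64, 500), torch.randn(2, 10, 300))
print(boundary_logits.shape, align_scores.shape)  # (2, 64, 2) and (2, 16)
```

At inference one would typically combine the two outputs, e.g. rank candidates by alignment score and refine the winner's boundaries with the frame-level logits; the paper's actual fusion scheme and training losses are not reproduced in this sketch.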

Supplementary Material

MP4 File (3394171.3413975.mp4)
Presentation Video




Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020, 4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. moment localization

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • National Natural Science Foundation of China

Conference

MM '20

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 44
  • Downloads (last 6 weeks): 2

Reflects downloads up to 20 Feb 2025.

Cited By

  • Uncertain features exploration in temporal moment localization via language by utilizing customized temporal transformer. Knowledge-Based Systems, Vol. 309, Article 112667 (Jan 2025). DOI: 10.1016/j.knosys.2024.112667
  • Action-guided prompt tuning for video grounding. Information Fusion, Vol. 113, Article 102577 (Jan 2025). DOI: 10.1016/j.inffus.2024.102577
  • Reinforcement Learning with Multi-Policy Movement Strategy for Weakly Supervised Temporal Sentence Grounding. Applied Sciences, Vol. 14, No. 21, Article 9696 (23 Oct 2024). DOI: 10.3390/app14219696
  • Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 617-624 (1 Jan 2024). DOI: 10.1109/WACVW60836.2024.00071
  • Dynamic Pathway for Query-Aware Feature Learning in Language-Driven Action Localization. IEEE Transactions on Multimedia, Vol. 26, 7451-7461 (27 Feb 2024). DOI: 10.1109/TMM.2024.3368919
  • Leveraging the Video-Level Semantic Consistency of Event for Audio-Visual Event Localization. IEEE Transactions on Multimedia, Vol. 26, 4617-4627 (1 Jan 2024). DOI: 10.1109/TMM.2023.3324498
  • Momentum Cross-Modal Contrastive Learning for Video Moment Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, No. 7, 5977-5994 (Jul 2024). DOI: 10.1109/TCSVT.2023.3344097
  • Gazing After Glancing: Edge Information Guided Perception Network for Video Moment Retrieval. IEEE Signal Processing Letters, Vol. 31, 1535-1539 (2024). DOI: 10.1109/LSP.2024.3403533
  • Structural and Contrastive Guidance Mining for Weakly-Supervised Language Moment Localization. IEEE Access, Vol. 12, 129290-129301 (2024). DOI: 10.1109/ACCESS.2024.3450878
  • Context-aware relational reasoning for video chunks and frames overlapping in language-based moment localization. Neurocomputing, Vol. 601, Article 128224 (Oct 2024). DOI: 10.1016/j.neucom.2024.128224
