Single-shot Semantic Matching Network for Moment Localization in Videos

Published: 22 July 2021

Abstract

Moment localization in videos using natural language refers to finding the most relevant segment of a video given a natural language query. Most existing methods require video segment candidates that are then matched against the query, which incurs extra computational cost, and they may fail to locate relevant moments whose lengths are not covered by the candidates. To address these issues, we present a lightweight single-shot semantic matching network (SSMN) that avoids the expensive matching between the query and segment candidates and can, in theory, locate moments of any length. In the proposed SSMN, video features are first uniformly sampled to a fixed number, while the query sentence features are generated and enhanced by GloVe, long short-term memory (LSTM), and soft-attention modules. The video and sentence features are then fed to an enhanced cross-modal attention model to mine the semantic relationships between vision and language. Finally, a score predictor and a location predictor are designed to locate the start and end indexes of the queried moment. We evaluate the proposed method on two benchmark datasets, and the experimental results demonstrate that SSMN outperforms state-of-the-art methods in both precision and efficiency.
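
The pipeline described in the abstract can be summarized in a short sketch. The following is a minimal, illustrative PyTorch rendering of that description and not the authors' implementation: the feature dimensions, the number of sampled clips, the names uniform_sample, QueryEncoder, and SSMNSketch, the use of nn.MultiheadAttention as a stand-in for the enhanced cross-modal attention module, and the form of the score and location predictors are all assumptions made for illustration.

```python
# Minimal sketch of the SSMN pipeline from the abstract (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


def uniform_sample(video_feats: torch.Tensor, num_clips: int) -> torch.Tensor:
    """Uniformly sample a variable-length feature sequence (T, D) down to (num_clips, D)."""
    t = video_feats.size(0)
    idx = torch.linspace(0, t - 1, num_clips).long()
    return video_feats[idx]


class QueryEncoder(nn.Module):
    """Word embeddings (GloVe-style) -> LSTM -> soft attention pooling; sizes are assumed."""
    def __init__(self, vocab_size=10000, emb_dim=300, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # would be initialized from GloVe vectors
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.attn = nn.Linear(hid_dim, 1)               # soft attention over words

    def forward(self, tokens):                          # tokens: (B, L) word indices
        h, _ = self.lstm(self.embed(tokens))            # (B, L, H) per-word states
        a = F.softmax(self.attn(h), dim=1)              # (B, L, 1) attention weights
        sent = (a * h).sum(dim=1)                       # (B, H) attended sentence feature
        return h, sent


class SSMNSketch(nn.Module):
    """Single-shot localization: cross-modal attention plus score/location predictors."""
    def __init__(self, vid_dim=1024, hid_dim=256, num_clips=128):
        super().__init__()
        self.num_clips = num_clips
        self.vid_proj = nn.Linear(vid_dim, hid_dim)
        self.query_enc = QueryEncoder(hid_dim=hid_dim)
        # Stand-in for the paper's "enhanced cross-modal attention": clips attend to words.
        self.xattn = nn.MultiheadAttention(hid_dim, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(hid_dim, 1)         # per-clip relevance score
        self.loc_head = nn.Linear(hid_dim, 2)           # per-clip start/end logits

    def forward(self, video_feats, tokens):
        # video_feats: list of (T_i, vid_dim) clip features; tokens: (B, L) query word indices
        v = torch.stack([uniform_sample(f, self.num_clips) for f in video_feats])  # (B, N, vid_dim)
        v = self.vid_proj(v)                                                       # (B, N, H)
        words, sent = self.query_enc(tokens)                                       # (B, L, H), (B, H)
        fused, _ = self.xattn(query=v, key=words, value=words)                     # (B, N, H)
        fused = fused + sent.unsqueeze(1)                # inject the global sentence feature
        scores = self.score_head(fused).squeeze(-1)      # (B, N) clip relevance scores
        start_end = self.loc_head(fused)                 # (B, N, 2)
        start = F.softmax(start_end[..., 0], dim=-1)     # distribution over the start index
        end = F.softmax(start_end[..., 1], dim=-1)       # distribution over the end index
        return scores, start, end
```

For example, feeding two untrimmed videos and two tokenized queries through the sketch yields per-clip scores and start/end distributions in a single forward pass, without enumerating segment candidates:

```python
model = SSMNSketch()
videos = [torch.randn(200, 1024), torch.randn(87, 1024)]  # two variable-length feature sequences
queries = torch.randint(0, 10000, (2, 12))                # two tokenized queries, padded to 12 words
scores, start, end = model(videos, queries)               # each of shape (2, 128)
```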




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 3
    August 2021
    443 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3476118
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 July 2021
    Accepted: 01 December 2020
    Revised: 01 November 2020
    Received: 01 July 2020
    Published in TOMM Volume 17, Issue 3


    Author Tags

    1. Multimodal retrieval
    2. moment localization
    3. visual comprehension
    4. natural language understanding

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • National Key R&D Program of China
    • Taishan Scholar Project of Shandong Province

    Article Metrics

    • Downloads (Last 12 months): 18
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 17 Jan 2025


    Cited By

    • (2024) Syntactic analysis of SMOSS model combined with improved LSTM model: Taking English writing teaching as an example. PLOS ONE 19, 11 (e0312049). DOI: 10.1371/journal.pone.0312049. Online publication date: 15-Nov-2024.
    • (2024) Towards Long Form Audio-visual Video Understanding. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3672079. Online publication date: 7-Jun-2024.
    • (2024) Backdoor Two-Stream Video Models on Federated Learning. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3651307. Online publication date: 7-Mar-2024.
    • (2024) Learning Nighttime Semantic Segmentation the Hard Way. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7 (1-23). DOI: 10.1145/3650032. Online publication date: 16-May-2024.
    • (2024) Multimodal Visual-Semantic Representations Learning for Scene Text Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7 (1-18). DOI: 10.1145/3646551. Online publication date: 27-Mar-2024.
    • (2024) Multi-Content Interaction Network for Few-Shot Segmentation. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 6 (1-20). DOI: 10.1145/3643850. Online publication date: 8-Mar-2024.
    • (2024) SWRM: Similarity Window Reweighting and Margin for Long-Tailed Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 6 (1-18). DOI: 10.1145/3643816. Online publication date: 8-Mar-2024.
    • (2024) Nonlocal Hybrid Network for Long-tailed Image Classification. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 4 (1-22). DOI: 10.1145/3630256. Online publication date: 11-Jan-2024.
    • (2024) Key frame extraction algorithm for video summarization based on key frame extraction using sliding window. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-20461-y. Online publication date: 20-Nov-2024.
    • (2023) Proposal-free video grounding based on motion excitation. Journal of Image and Graphics 28, 10 (3077-3091). DOI: 10.11834/jig.220109. Online publication date: 2023.
