
Joint Searching and Grounding: Multi-Granularity Video Content Retrieval

Published: 27 October 2023

Abstract

Text-based video retrieval (TVR) is a well-studied task that aims to retrieve relevant videos from a large collection in response to a given text query. Most existing TVR works assume that videos are pre-trimmed and fully relevant to the query, ignoring the fact that most real-world videos are untrimmed and contain large amounts of query-irrelevant content. Moreover, since users' queries typically describe video events rather than complete videos, it is more practical to return specific video events than a list of untrimmed videos. In this paper, we introduce a challenging but more realistic task called Multi-Granularity Video Content Retrieval (MGVCR), which involves retrieving both video files and specific video content together with its temporal location. This task is particularly challenging because it requires identifying and ranking the partial relevance between long videos and text queries in the absence of temporal alignment supervision between queries and their relevant moments. To this end, we propose a novel unified framework, termed Joint Searching and Grounding (JSG), which consists of two branches: (1) a glance branch that coarsely aligns the query and moment proposals using inter-video contrastive learning, and (2) a gaze branch that finely aligns the two modalities using both inter- and intra-video contrastive learning. Based on this glance-to-gaze design, JSG learns two separate joint embedding spaces for moments and text queries using a hybrid synergistic contrastive learning strategy. Extensive experiments on three public benchmarks, i.e., Charades-STA, DiDeMo, and ActivityNet-Captions, demonstrate the superior performance of JSG on both the video-level and event-level retrieval subtasks. Our open-source implementation is available at https://github.com/CFM-MSG/Code_JSG.
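To make the two objectives concrete, below is a minimal PyTorch sketch of how an inter-video contrastive loss (glance branch) and an intra-video contrastive loss (gaze branch) can be formulated in the standard InfoNCE style. This is an illustration under our own assumptions, not the authors' implementation: the function names, tensor shapes, and temperature value are hypothetical, and the real code is in the repository linked above.

# Illustrative sketch only (assumed names, shapes, and temperature); the
# paper's actual implementation is at https://github.com/CFM-MSG/Code_JSG.
import torch
import torch.nn.functional as F

def inter_video_nce(query_emb, moment_emb, temperature=0.07):
    # Inter-video contrastive loss (glance-style): for each text query, the
    # matched video's moment embedding is the positive, and moments from the
    # other videos in the batch serve as negatives.
    #   query_emb:  (B, D) text-query embeddings
    #   moment_emb: (B, D) one pooled moment embedding per video
    q = F.normalize(query_emb, dim=-1)
    m = F.normalize(moment_emb, dim=-1)
    logits = q @ m.t() / temperature                   # (B, B) similarities
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric over both retrieval directions (text->video and video->text).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def intra_video_nce(query_emb, proposal_emb, pos_idx, temperature=0.07):
    # Intra-video contrastive loss (gaze-style): within one video, the
    # best-matching moment proposal is the positive and the remaining
    # proposals of the same video are negatives, pushing the model to
    # discriminate fine temporal locations.
    #   query_emb:    (D,)   one text-query embedding
    #   proposal_emb: (P, D) embeddings of P moment proposals from one video
    #   pos_idx:      index of the proposal treated as the positive
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(proposal_emb, dim=-1)
    logits = (p @ q) / temperature                     # (P,) similarities
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.tensor([pos_idx], device=q.device))

A hybrid objective in the spirit of the abstract would then sum the two terms, e.g. loss = inter_video_nce(...) + intra_video_nce(...). The symmetric inter-video form follows common practice in cross-modal contrastive learning; the exact weighting and proposal selection in JSG may differ.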


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. contrastive learning
  2. multi-granularity video content retrieval
  3. multimedia applications
  4. multimodal learning
  5. video understanding

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
