
Joint Searching and Grounding: Multi-Granularity Video Content Retrieval

Published: 27 October 2023

Abstract

Text-based video retrieval (TVR) is a well-studied task that aims to retrieve relevant videos from a large collection in response to a given text query. Most existing TVR works assume that videos are pre-trimmed and fully relevant to the query, ignoring the fact that most real-world videos are untrimmed and contain large amounts of query-irrelevant content. Moreover, since users' queries typically describe video events rather than complete videos, it is more practical to return specific video events than a list of untrimmed videos. In this paper, we introduce a challenging but more realistic task called Multi-Granularity Video Content Retrieval (MGVCR), which involves retrieving both video files and specific video content together with its temporal location. This task is particularly challenging because it requires identifying and ranking the partial relevance between long videos and text queries in the absence of temporal alignment supervision between queries and their relevant moments. To this end, we propose a novel unified framework, termed Joint Searching and Grounding (JSG), which consists of two branches: (1) a glance branch that coarsely aligns the query and moment proposals using inter-video contrastive learning, and (2) a gaze branch that finely aligns the two modalities using both inter- and intra-video contrastive learning. Based on this glance-to-gaze design, JSG learns two separate joint embedding spaces for moments and text queries using a hybrid synergistic contrastive learning strategy. Extensive experiments on three public benchmarks, i.e., Charades-STA, DiDeMo, and ActivityNet-Captions, demonstrate the superior performance of JSG on both the video-level and event-level retrieval subtasks. Our open-source implementation is available at https://github.com/CFM-MSG/Code_JSG.
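To make the two objectives concrete, below is a minimal PyTorch sketch of how an inter-video contrastive loss (glance branch) and an intra-video contrastive loss (gaze branch) can be formulated in the standard InfoNCE style. This is an illustration under our own assumptions, not the authors' implementation: the function names, tensor shapes, and temperature value are hypothetical, and the real code is in the repository linked above.

# Illustrative sketch only (assumed names, shapes, and temperature); the
# paper's actual implementation is at https://github.com/CFM-MSG/Code_JSG.
import torch
import torch.nn.functional as F

def inter_video_nce(query_emb, moment_emb, temperature=0.07):
    # Inter-video contrastive loss (glance-style): for each text query, the
    # matched video's moment embedding is the positive, and moments from the
    # other videos in the batch serve as negatives.
    #   query_emb:  (B, D) text-query embeddings
    #   moment_emb: (B, D) one pooled moment embedding per video
    q = F.normalize(query_emb, dim=-1)
    m = F.normalize(moment_emb, dim=-1)
    logits = q @ m.t() / temperature                   # (B, B) similarities
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric over both retrieval directions (text->video and video->text).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def intra_video_nce(query_emb, proposal_emb, pos_idx, temperature=0.07):
    # Intra-video contrastive loss (gaze-style): within one video, the
    # best-matching moment proposal is the positive and the remaining
    # proposals of the same video are negatives, pushing the model to
    # discriminate fine temporal locations.
    #   query_emb:    (D,)   one text-query embedding
    #   proposal_emb: (P, D) embeddings of P moment proposals from one video
    #   pos_idx:      index of the proposal treated as the positive
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(proposal_emb, dim=-1)
    logits = (p @ q) / temperature                     # (P,) similarities
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.tensor([pos_idx], device=q.device))

A hybrid objective in the spirit of the abstract would then sum the two terms, e.g. loss = inter_video_nce(...) + intra_video_nce(...). The symmetric inter-video form follows common practice in cross-modal contrastive learning; the exact weighting and proposal selection in JSG may differ.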


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. contrastive learning
  2. multi-granularity video content retrieval
  3. multimedia applications
  4. multimodal learning
  5. video understanding

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
