DOI: 10.1145/3581783.3612038

Filling the Information Gap between Video and Query for Language-Driven Moment Retrieval

Published: 27 October 2023

Abstract

This paper addresses the challenging task of language-driven moment retrieval. Previous methods are typically trained to localize the target moment corresponding to a single sentence query in a complicated video. However, this specific moment generally conveys richer content than the query, i.e., a single query may omit certain object details or actions present in the complex foreground-background visual content. Such an information imbalance between the two modalities makes it difficult to finely align their representations. To this end, instead of training with a single query, we propose to exploit the diversity and complementarity among different queries that describe the same video moment to enrich the textual semantics. Specifically, we develop a Teacher-Student Moment Retrieval (TSMR) framework to fill this cross-modal information gap. A teacher model is trained to encode not only a given query but also extra complementary queries, aggregating contextual semantics into more comprehensive moment-related query representations. Since these additional queries are inaccessible during inference, we further introduce an adaptive knowledge distillation mechanism that trains a student model with a single query input to selectively absorb knowledge from the teacher model. In this manner, the student model becomes more robust to the cross-modal information gap when retrieving moments guided by a single query. Experimental results on two benchmarks demonstrate the effectiveness of the proposed method.
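To make the teacher-student idea concrete, the following is a minimal, hypothetical PyTorch-style sketch of such adaptive distillation between a teacher that sees complementary queries and a student that sees only one. It is not the authors' implementation: the GRU query encoder, the mean-pooled aggregation of complementary queries, and the cosine-gated per-sample weighting are all illustrative assumptions made for exposition.

```python
# Hypothetical sketch of teacher-student query distillation (NOT the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryEncoder(nn.Module):
    """Encodes a batch of sentence queries from pre-extracted word embeddings."""
    def __init__(self, word_dim=300, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(word_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, words):                      # words: (B, L, word_dim)
        _, h = self.rnn(words)                     # h: (2, B, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1)        # (B, 2 * hidden_dim)
        return self.proj(h)                        # (B, hidden_dim)

def teacher_query_repr(encoder, main_words, extra_words_list):
    """Teacher view: enrich the main query with complementary queries of the same
    moment; simple averaging is an assumed aggregation scheme."""
    reps = [encoder(main_words)] + [encoder(w) for w in extra_words_list]
    return torch.stack(reps, dim=0).mean(dim=0)    # (B, hidden_dim)

def adaptive_distill_loss(student_q, teacher_q):
    """Selective transfer: weight each sample's distillation loss by how far the
    student is from the teacher (1 - cosine similarity), a hypothetical gate."""
    teacher_q = teacher_q.detach()                                    # teacher frozen
    gate = 1.0 - F.cosine_similarity(student_q, teacher_q, dim=-1)    # (B,)
    per_sample = F.mse_loss(student_q, teacher_q, reduction="none").mean(dim=-1)
    return (gate * per_sample).mean()

# Usage sketch: the student sees only the single query (as at inference time),
# while the teacher additionally sees complementary queries during training.
if __name__ == "__main__":
    B, L, D = 4, 12, 300
    student_enc, teacher_enc = QueryEncoder(), QueryEncoder()
    main_q = torch.randn(B, L, D)
    extra_qs = [torch.randn(B, L, D) for _ in range(2)]
    loss = adaptive_distill_loss(student_enc(main_q),
                                 teacher_query_repr(teacher_enc, main_q, extra_qs))
    loss.backward()
```

The gating term is only one plausible way to make the distillation "adaptive"; the key point illustrated here is that the extra queries influence training only through the teacher's representation, so nothing beyond the single query is required at inference.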

Supplemental Material

MP4 File


Cited By

  • (2024) Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 5260-5269. DOI: 10.1145/3664647.3680731
  • (2024) Rethinking Weakly-Supervised Video Temporal Grounding From a Game Perspective. In Computer Vision – ECCV 2024, 290-311. DOI: 10.1007/978-3-031-72995-9_17


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. information imbalance
  2. language-driven moment retrieval
  3. teacher-student model

Qualifiers

  • Research-article

Funding Sources

  • Open Research Fund of the State Key Laboratory of Multimodal Artificial Intelligence Systems
  • Young Elite Scientists Sponsorship Program by CAST
  • Fundamental Research Funds for the Provincial Universities of Zhejiang
  • National Natural Science Foundation of China

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
