Zero-Shot Video Moment Retrieval Using BLIP-Based Models

  • Conference paper
Advances in Visual Computing (ISVC 2023)

Abstract

Video Moment Retrieval (VMR) is a challenging task at the intersection of vision and language, whose goal is to retrieve moments from videos that correspond to natural language queries. State-of-the-art approaches for VMR often rely on large amounts of training data, including frame-level saliency annotations, weakly supervised pre-training on speech captions, and signals from additional modalities such as audio, which can be limiting in practical scenarios. Moreover, most of these approaches use pre-trained spatio-temporal backbones to aggregate temporal features across multiple frames, which incurs significant training and inference costs. To address these limitations, we propose a zero-shot approach with sparse frame-sampling strategies that does not rely on additional modalities and performs well using features extracted from individual frames alone. Our approach uses models based on Bootstrapping Language-Image Pre-training (BLIP/BLIP-2), which have been shown to be effective for various downstream vision-language tasks, even in zero-shot settings. We show that such models can be easily repurposed as effective, off-the-shelf feature extractors for VMR. On the QVHighlights benchmark, our approach outperforms both zero-shot approaches and supervised approaches trained without saliency score annotations by at least \(25\%\) and \(21\%\), respectively, on all metrics. We further show that our approach is comparable to state-of-the-art supervised approaches trained with saliency score annotations and additional modalities, with a gap of at most \(7\%\) across all metrics.
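
To make the zero-shot setup concrete, the following is a minimal sketch of how an off-the-shelf BLIP image-text matching (ITM) head can score sparsely sampled frames against a query and merge contiguous high-scoring frames into candidate moments. This is not the authors' released implementation; the Hugging Face checkpoint Salesforce/blip-itm-base-coco, the 2-second sampling stride, the score threshold, and the example video path and query are illustrative assumptions.

```python
# Sketch only: per-frame BLIP ITM scoring for zero-shot moment retrieval.
# Assumptions (not from the paper): checkpoint, stride, threshold, example inputs.
import cv2
import torch
from PIL import Image
from transformers import AutoProcessor, BlipForImageTextRetrieval

processor = AutoProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco").eval()

def sample_frames(video_path, stride_sec=2.0):
    """Sparsely sample one frame every `stride_sec` seconds (no temporal backbone)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames, timestamps, idx = [], [], 0
    while True:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx * stride_sec * fps))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        timestamps.append(idx * stride_sec)
        idx += 1
    cap.release()
    return frames, timestamps

@torch.no_grad()
def frame_query_scores(frames, query):
    """ITM match probability of each frame with respect to the query (per-frame, zero-shot)."""
    scores = []
    for img in frames:
        inputs = processor(images=img, text=query, return_tensors="pt")
        logits = model(**inputs).itm_score          # shape (1, 2): [no-match, match]
        scores.append(torch.softmax(logits, dim=1)[0, 1].item())
    return scores

def scores_to_moments(timestamps, scores, stride_sec=2.0, thresh=0.5):
    """Merge consecutive above-threshold frames into (start, end) moment proposals."""
    moments, start = [], None
    for t, s in zip(timestamps, scores):
        if s >= thresh and start is None:
            start = t
        elif s < thresh and start is not None:
            moments.append((start, t))
            start = None
    if start is not None:
        moments.append((start, timestamps[-1] + stride_sec))
    return moments

frames, ts = sample_frames("example_video.mp4")  # hypothetical input path
scores = frame_query_scores(frames, "a person surfing on a wave")  # hypothetical query
print(scores_to_moments(ts, scores))
```

Because each frame is scored independently against the query, no spatio-temporal backbone or cross-frame feature aggregation is required, which is what keeps the approach training-free and inexpensive at inference time.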

Acknowledgments

This work was partially funded by the Research School on “Service-Oriented Systems Engineering” of the Hasso Plattner Institute.

Author information

Corresponding author

Correspondence to Jobin Idiculla Wattasseril.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wattasseril, J.I., Shekhar, S., Döllner, J., Trapp, M. (2023). Zero-Shot Video Moment Retrieval Using BLIP-Based Models. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2023. Lecture Notes in Computer Science, vol 14361. Springer, Cham. https://doi.org/10.1007/978-3-031-47969-4_13

  • DOI: https://doi.org/10.1007/978-3-031-47969-4_13

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47968-7

  • Online ISBN: 978-3-031-47969-4

  • eBook Packages: Computer Science, Computer Science (R0)
