Abstract
Text-to-video retrieval aims to find relevant videos from text queries. The recently introduced Contrastive Language-Image Pretraining (CLIP), a vision-language model pretrained on large-scale image-caption pairs, has been used extensively in the literature. Existing studies have focused on applying CLIP directly while also learning temporal dependencies across frames. Although leveraging the dynamics of a video sounds intuitively reasonable, learning temporal dynamics has demonstrated no advantage, or only small improvements. When temporal dynamics are not incorporated, most studies instead focus on constructing representative images from a video. However, we found that these images tend to be noisy, degrading performance on the text-to-video retrieval task. Motivated by this observation, we introduce a novel tree-based frame division method that focuses learning on the most relevant images.
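The abstract only names the tree-based frame division method; the details appear in the body of the paper. As a loose illustration of the idea rather than the authors' actual algorithm, the sketch below assumes frames and the query have already been encoded into L2-normalized CLIP embeddings, recursively halves the frame sequence into a binary tree of segments, and keeps the single most query-relevant frame per leaf. The function name `tree_frame_division`, the binary-split rule, the `depth` parameter, and the cosine-similarity selection criterion are all assumptions.

```python
import numpy as np

def tree_frame_division(frame_embs, text_emb, depth=3):
    """Hypothetical sketch: recursively split the frame sequence into a
    binary tree of segments; at each leaf, keep the frame whose embedding
    is most similar (cosine) to the text query embedding.

    frame_embs: (num_frames, dim) L2-normalized frame embeddings.
    text_emb:   (dim,) L2-normalized text query embedding.
    Returns the indices of the selected frames, one per leaf.
    """
    def recurse(lo, hi, d):
        if d == 0 or hi - lo <= 1:
            # Dot product of unit vectors = cosine similarity.
            sims = frame_embs[lo:hi] @ text_emb
            return [lo + int(np.argmax(sims))]
        mid = (lo + hi) // 2
        return recurse(lo, mid, d - 1) + recurse(mid, hi, d - 1)

    return recurse(0, len(frame_embs), depth)

# Toy usage: random unit vectors stand in for real CLIP features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(32, 512))
frames /= np.linalg.norm(frames, axis=1, keepdims=True)
query = rng.normal(size=512)
query /= np.linalg.norm(query)
print(tree_frame_division(frames, query, depth=3))  # 8 frame indices, one per leaf
```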
Acknowledgement
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korean government (MSIT) (No. 2021-0-01341, Artificial Intelligence Graduate School Program of Chung-Ang Univ.) and (No. 2021-0-02067, Next Generation AI for Multi-purpose Video Search).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kang, SM., Jung, D., Cho, YS. (2023). Video Retrieval with Tree-Based Video Segmentation. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13945. Springer, Cham. https://doi.org/10.1007/978-3-031-30675-4_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30674-7
Online ISBN: 978-3-031-30675-4
eBook Packages: Computer Science, Computer Science (R0)