Abstract
We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous approaches designed for short videos (e.g., 5–15 s in duration), our approach aims to retrieve minute-long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost of processing hundreds of densely extracted frames from such long videos. To address this issue, we propose to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process. Our method, named EclipSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an audiovisual video setting by adding a unified audiovisual transformer block that captures complementary cues from the video and audio streams. In addition to being \(2.92\times\) faster and \(2.34\times\) more memory-efficient than long-range video-only approaches, our method also achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo, and Charades. Our code is available at https://github.com/GenjiB/ECLIPSE.
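The abstract's core architectural idea is a fusion block in which a few sparsely sampled (and expensive) frame features exchange information with many cheap audio features that summarize the rest of the clip. The PyTorch sketch below illustrates one plausible form of such an audiovisual block; it is a minimal sketch under our own assumptions (the module name, dimensions, and use of bidirectional cross-attention are illustrative), not the authors' implementation, which is available at the GitHub link above.

```python
# Minimal sketch of an audiovisual fusion block (illustrative, not the ECLIPSE release).
import torch
import torch.nn as nn


class AudioVisualBlock(nn.Module):
    """Cross-attends sparse video frame tokens and compact audio tokens."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (B, T_v, D) features of a few sparsely sampled frames
        # audio_tokens: (B, T_a, D) cheap audio features covering the full clip
        v, a = self.norm_v(video_tokens), self.norm_a(audio_tokens)
        # Each stream queries the other; residual connections preserve
        # the original per-stream information.
        video_tokens = video_tokens + self.v2a_attn(v, a, a, need_weights=False)[0]
        audio_tokens = audio_tokens + self.a2v_attn(a, v, v, need_weights=False)[0]
        return video_tokens, audio_tokens


if __name__ == "__main__":
    block = AudioVisualBlock()
    vid = torch.randn(2, 8, 512)   # 8 frame embeddings (hypothetical sizes)
    aud = torch.randn(2, 32, 512)  # 32 audio embeddings
    vid_out, aud_out = block(vid, aud)
    print(vid_out.shape, aud_out.shape)
```

Because the audio stream stands in for most of the densely sampled frames, the number of expensive visual tokens stays small, which is where the reported speed and memory savings would come from.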
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lin, Y.-B., Lei, J., Bansal, M., Bertasius, G. (2022). EclipSE: Efficient Long-Range Video Retrieval Using Sight and Sound. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13694. Springer, Cham. https://doi.org/10.1007/978-3-031-19830-4_24
DOI: https://doi.org/10.1007/978-3-031-19830-4_24
Print ISBN: 978-3-031-19829-8
Online ISBN: 978-3-031-19830-4