Abstract
We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous approaches designed for short videos (e.g., 5–15 s in duration), our approach aims to retrieve minute-long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost of processing hundreds of densely extracted frames from such long videos. To address this issue, we propose to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process. Our method, named EclipSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an audiovisual video setting by adding a unified audiovisual transformer block that captures complementary cues from the video and audio streams. In addition to being \(2.92\times\) faster and \(2.34\times\) more memory-efficient than long-range video-only approaches, our method also achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo, and Charades. Our code is available at https://github.com/GenjiB/ECLIPSE.
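The abstract's core architectural idea is a fusion block in which a few sparsely sampled (and expensive) frame features exchange information with many cheap audio features that summarize the rest of the clip. The PyTorch sketch below illustrates one plausible form of such an audiovisual block; it is a minimal sketch under our own assumptions (the module name, dimensions, and use of bidirectional cross-attention are illustrative), not the authors' implementation, which is available at the GitHub link above.

```python
# Minimal sketch of an audiovisual fusion block (illustrative, not the ECLIPSE release).
import torch
import torch.nn as nn


class AudioVisualBlock(nn.Module):
    """Cross-attends sparse video frame tokens and compact audio tokens."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (B, T_v, D) features of a few sparsely sampled frames
        # audio_tokens: (B, T_a, D) cheap audio features covering the full clip
        v, a = self.norm_v(video_tokens), self.norm_a(audio_tokens)
        # Each stream queries the other; residual connections preserve
        # the original per-stream information.
        video_tokens = video_tokens + self.v2a_attn(v, a, a, need_weights=False)[0]
        audio_tokens = audio_tokens + self.a2v_attn(a, v, v, need_weights=False)[0]
        return video_tokens, audio_tokens


if __name__ == "__main__":
    block = AudioVisualBlock()
    vid = torch.randn(2, 8, 512)   # 8 frame embeddings (hypothetical sizes)
    aud = torch.randn(2, 32, 512)  # 32 audio embeddings
    vid_out, aud_out = block(vid, aud)
    print(vid_out.shape, aud_out.shape)
```

Because the audio stream stands in for most of the densely sampled frames, the number of expensive visual tokens stays small, which is where the reported speed and memory savings would come from.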
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lin, Y.-B., Lei, J., Bansal, M., Bertasius, G. (2022). EclipSE: Efficient Long-Range Video Retrieval Using Sight and Sound. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13694. Springer, Cham. https://doi.org/10.1007/978-3-031-19830-4_24
DOI: https://doi.org/10.1007/978-3-031-19830-4_24
Print ISBN: 978-3-031-19829-8
Online ISBN: 978-3-031-19830-4