
EclipSE: Efficient Long-Range Video Retrieval Using Sight and Sound

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13694)


Abstract

We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous approaches designed for short-video retrieval (e.g., 5–15 s in duration), our approach aims to retrieve minute-long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost associated with processing hundreds of densely extracted frames from such long videos. To address this issue, we propose to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process. Our method, named EclipSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an audiovisual video setting by adding a unified audiovisual transformer block that captures complementary cues from the video and audio streams. In addition to being 2.92× faster and 2.34× more memory-efficient than long-range video-only approaches, our method also achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo, and Charades. Our code is available at https://github.com/GenjiB/ECLIPSE.
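To make the abstract's fusion idea concrete, below is a minimal sketch of a cross-attention block in which sparsely sampled video tokens and a small number of compact audio tokens exchange information. This is an illustrative sketch only, assuming a standard bidirectional cross-attention design; the class and variable names (e.g., AudioVisualBlock) are hypothetical, and the authors' actual implementation is in the linked repository.

import torch
import torch.nn as nn

class AudioVisualBlock(nn.Module):
    """Sketch of a bidirectional audiovisual cross-attention block (hypothetical)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Each stream attends to the other so that video picks up audio cues
        # and audio picks up visual context.
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (B, Tv, dim) frame features from a CLIP-style image encoder
        # audio_tokens: (B, Ta, dim) features from an audio encoder; Ta << Tv,
        # which is where the compute/memory savings over dense frames come from.
        v = self.norm_v(video_tokens)
        a = self.norm_a(audio_tokens)
        v_out, _ = self.audio_to_video(query=v, key=a, value=a)  # audio -> video
        a_out, _ = self.video_to_audio(query=a, key=v, value=v)  # video -> audio
        # Residual connections keep the original per-stream features.
        return video_tokens + v_out, audio_tokens + a_out

# Usage sketch: fuse a few sparsely sampled frames with cheap audio tokens.
block = AudioVisualBlock(dim=512)
video = torch.randn(2, 16, 512)   # 16 sparsely sampled frame tokens
audio = torch.randn(2, 4, 512)    # 4 compact audio tokens
fused_video, fused_audio = block(video, audio)
print(fused_video.shape, fused_audio.shape)

The key design point mirrored here is that audio tokens are far fewer than dense frame tokens, so attending over them is cheap while still injecting information about dynamic audio events.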



Author information

Corresponding author

Correspondence to Yan-Bo Lin.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2086 KB)

Supplementary material 2 (mp4 13618 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lin, Y.-B., Lei, J., Bansal, M., Bertasius, G. (2022). EclipSE: Efficient Long-Range Video Retrieval Using Sight and Sound. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13694. Springer, Cham. https://doi.org/10.1007/978-3-031-19830-4_24

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19830-4_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19829-8

  • Online ISBN: 978-3-031-19830-4

  • eBook Packages: Computer Science, Computer Science (R0)
