
E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15141)


Abstract

Spatio-temporal video grounding aims to localize the spatio-temporal tube in a video that corresponds to a given language query. To eliminate annotation costs, we make a first exploration of tackling spatio-temporal video grounding in a zero-shot manner. Our method dispenses with the need for any training videos or annotations; instead, it localizes the target object by leveraging pre-trained vision-language models and optimizing over the video and text query at test time. To enable spatio-temporal comprehension, we introduce a multimodal modulation that integrates spatio-temporal context into both the visual and the textual representations. On the visual side, we devise a context-based visual modulation that enhances the visual representation by propagating and aggregating contextual semantics. On the textual side, we propose a prototype-based textual modulation that refines the textual representations using visual prototypes, effectively mitigating the cross-modal discrepancy. In addition, to overcome the interleaved spatio-temporal dilemma, we propose an expectation-maximization (EM) framework that alternately optimizes temporal relevance estimation and spatial region identification. Comprehensive experiments validate that our zero-shot approach achieves superior performance compared to several state-of-the-art methods trained with stronger supervision. The code is available at https://github.com/baopj/E3M.
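To make the alternating optimization in the abstract concrete, the following is a minimal, self-contained sketch of an EM-style loop that alternates temporal relevance estimation with spatial region identification. It is not the authors' implementation: the query-region similarity matrix is a synthetic stand-in for scores that would come from a pre-trained vision-language model such as CLIP, and the coupling between the two steps is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

T, R = 12, 5  # number of frames and candidate regions per frame
# Hypothetical query-region similarities (e.g., CLIP cosine scores); synthetic here.
sim = rng.normal(size=(T, R))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Start from a uniform temporal relevance over frames.
temporal = np.full(T, 1.0 / T)

for _ in range(10):
    # Spatial step: distribute each frame's belief over its candidate regions;
    # frames currently judged more relevant commit more sharply (illustrative coupling only).
    temperature = 1.0 / (1.0 + temporal * T)
    spatial = softmax(sim / temperature[:, None], axis=1)  # (T, R)
    frame_score = (spatial * sim).sum(axis=1)              # expected region score per frame

    # Temporal step: re-estimate frame relevance from the expected region scores.
    temporal = softmax(frame_score)

# Decode a spatio-temporal tube: keep frames above average relevance and
# take the best-scoring region in each kept frame.
kept = temporal > temporal.mean()
tube = {t: int(sim[t].argmax()) for t in range(T) if kept[t]}
print("frame -> region index:", tube)
```

Per the abstract, the similarities in the actual method would be computed from representations already refined by the context-based visual modulation and the prototype-based textual modulation before entering this alternating loop.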



Acknowledgements

This work was carried out at Rapid-Rich Object Search (ROSE) Lab, School of Electrical & Electronic Engineering, Nanyang Technological University. This research is supported by the NTU-PKU Joint Research Institute (a collaboration between the Nanyang Technological University and Peking University that is sponsored by a donation from the Ng Teng Fong Charitable Foundation).

Author information


Corresponding author

Correspondence to Peijun Bao.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 3139 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bao, P., Shao, Z., Yang, W., Ng, B.P., Kot, A.C. (2025). E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15141. Springer, Cham. https://doi.org/10.1007/978-3-031-73010-8_14


  • DOI: https://doi.org/10.1007/978-3-031-73010-8_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73009-2

  • Online ISBN: 978-3-031-73010-8

  • eBook Packages: Computer Science, Computer Science (R0)
