Abstract
Spatio-temporal video grounding aims to localize the spatio-temporal tube in a video according to a given language query. To eliminate annotation costs, we make a first exploration into tackling spatio-temporal video grounding in a zero-shot manner. Our method dispenses with the need for any training videos or annotations; instead, it localizes the target object by leveraging pre-trained vision-language models and optimizing over the video and text query at test time. To enable spatio-temporal comprehension, we introduce a multimodal modulation that integrates spatio-temporal context into both the visual and textual representations. On the visual side, we devise a context-based visual modulation that enhances the visual representation by propagating and aggregating contextual semantics. On the textual side, we propose a prototype-based textual modulation that refines the textual representations using visual prototypes, effectively mitigating the cross-modal discrepancy. In addition, to overcome the interleaved spatio-temporal dilemma, we propose an expectation-maximization (EM) framework that alternates between temporal relevance estimation and spatial region identification. Comprehensive experiments validate that our zero-shot approach achieves superior performance compared to several state-of-the-art methods with stronger supervision. The code is available at https://github.com/baopj/E3M.
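To illustrate the alternating optimization described above, below is a minimal Python/NumPy sketch of an EM-style loop that alternates between temporal relevance estimation and spatial region identification over CLIP-style features. It is a hypothetical illustration under stated assumptions, not the authors' implementation: the function em_style_grounding, the candidate-region features, and the hyperparameters tau and alpha are introduced here purely for exposition.

import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def em_style_grounding(region_feats, text_feat, n_iters=5, tau=0.07, alpha=0.5):
    """Hypothetical EM-style alternation for zero-shot spatio-temporal grounding.

    region_feats: (T, R, D) L2-normalized features of R candidate regions per frame.
    text_feat:    (D,) L2-normalized text embedding of the query.
    Returns per-frame temporal relevance and the selected region index per frame.
    """
    T, R, _ = region_feats.shape
    text_sim = region_feats @ text_feat            # (T, R) region-text similarities
    scores = text_sim.copy()
    relevance = np.full(T, 1.0 / T)                # uniform temporal prior
    best = scores.argmax(axis=1)

    for _ in range(n_iters):
        # Spatial step: select the region in each frame that currently scores highest.
        best = scores.argmax(axis=1)                          # (T,)
        best_feats = region_feats[np.arange(T), best]         # (T, D)

        # Temporal step: re-estimate how relevant each frame is to the query.
        relevance = softmax((best_feats @ text_feat) / tau)   # (T,)

        # Build a visual prototype from relevant frames and fold it back into
        # the region scores, so the two estimates refine each other.
        prototype = relevance @ best_feats                    # (D,)
        prototype /= np.linalg.norm(prototype) + 1e-8
        scores = alpha * text_sim + (1 - alpha) * (region_feats @ prototype)

    return relevance, best

# Example with random features, just to exercise the sketch:
# T, R, D = 16, 8, 512
# feats = np.random.randn(T, R, D); feats /= np.linalg.norm(feats, axis=-1, keepdims=True)
# txt = np.random.randn(D); txt /= np.linalg.norm(txt)
# relevance, boxes = em_style_grounding(feats, txt)

In this sketch, the visual prototype built from relevant frames plays a role only loosely analogous to the prototype-based modulation described in the abstract; the actual E3M formulation is given in the paper.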
Acknowledgements
This work was carried out at the Rapid-Rich Object Search (ROSE) Lab, School of Electrical & Electronic Engineering, Nanyang Technological University. This research is supported by the NTU-PKU Joint Research Institute, a collaboration between Nanyang Technological University and Peking University sponsored by a donation from the Ng Teng Fong Charitable Foundation.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bao, P., Shao, Z., Yang, W., Ng, B.P., Kot, A.C. (2025). E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15141. Springer, Cham. https://doi.org/10.1007/978-3-031-73010-8_14
DOI: https://doi.org/10.1007/978-3-031-73010-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73009-2
Online ISBN: 978-3-031-73010-8
eBook Packages: Computer Science (R0)