Abstract
Spatio-temporal video grounding aims to localize the spatio-temporal tube in a video according to a given language query. To eliminate annotation costs, we make a first exploration into tackling spatio-temporal video grounding in a zero-shot manner. Our method dispenses with the need for any training videos or annotations; instead, it localizes the target object by leveraging pre-trained vision-language models and optimizing over the video and text query at test time. To enable spatio-temporal comprehension, we introduce a multimodal modulation that integrates spatio-temporal context into both the visual and textual representations. On the visual side, we devise a context-based visual modulation that enhances the visual representation by propagating and aggregating contextual semantics. On the textual side, we propose a prototype-based textual modulation that refines the textual representations using visual prototypes, effectively mitigating the cross-modal discrepancy. In addition, to overcome the interleaved spatio-temporal dilemma, we propose an expectation-maximization (EM) framework that alternates between temporal relevance estimation and spatial region identification. Comprehensive experiments validate that our zero-shot approach achieves superior performance compared to several state-of-the-art methods with stronger supervision. The code is available at https://github.com/baopj/E3M.
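To illustrate the alternating optimization described above, below is a minimal Python/NumPy sketch of an EM-style loop that alternates between temporal relevance estimation and spatial region identification over CLIP-style features. It is a hypothetical illustration under stated assumptions, not the authors' implementation: the function em_style_grounding, the candidate-region features, and the hyperparameters tau and alpha are introduced here purely for exposition.

import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def em_style_grounding(region_feats, text_feat, n_iters=5, tau=0.07, alpha=0.5):
    """Hypothetical EM-style alternation for zero-shot spatio-temporal grounding.

    region_feats: (T, R, D) L2-normalized features of R candidate regions per frame.
    text_feat:    (D,) L2-normalized text embedding of the query.
    Returns per-frame temporal relevance and the selected region index per frame.
    """
    T, R, _ = region_feats.shape
    text_sim = region_feats @ text_feat            # (T, R) region-text similarities
    scores = text_sim.copy()
    relevance = np.full(T, 1.0 / T)                # uniform temporal prior
    best = scores.argmax(axis=1)

    for _ in range(n_iters):
        # Spatial step: select the region in each frame that currently scores highest.
        best = scores.argmax(axis=1)                          # (T,)
        best_feats = region_feats[np.arange(T), best]         # (T, D)

        # Temporal step: re-estimate how relevant each frame is to the query.
        relevance = softmax((best_feats @ text_feat) / tau)   # (T,)

        # Build a visual prototype from relevant frames and fold it back into
        # the region scores, so the two estimates refine each other.
        prototype = relevance @ best_feats                    # (D,)
        prototype /= np.linalg.norm(prototype) + 1e-8
        scores = alpha * text_sim + (1 - alpha) * (region_feats @ prototype)

    return relevance, best

# Example with random features, just to exercise the sketch:
# T, R, D = 16, 8, 512
# feats = np.random.randn(T, R, D); feats /= np.linalg.norm(feats, axis=-1, keepdims=True)
# txt = np.random.randn(D); txt /= np.linalg.norm(txt)
# relevance, boxes = em_style_grounding(feats, txt)

In this sketch, the visual prototype built from relevant frames plays a role only loosely analogous to the prototype-based modulation described in the abstract; the actual E3M formulation is given in the paper.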
Acknowledgements
This work was carried out at the Rapid-Rich Object Search (ROSE) Lab, School of Electrical & Electronic Engineering, Nanyang Technological University. This research is supported by the NTU-PKU Joint Research Institute, a collaboration between Nanyang Technological University and Peking University sponsored by a donation from the Ng Teng Fong Charitable Foundation.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bao, P., Shao, Z., Yang, W., Ng, B.P., Kot, A.C. (2025). E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15141. Springer, Cham. https://doi.org/10.1007/978-3-031-73010-8_14
DOI: https://doi.org/10.1007/978-3-031-73010-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73009-2
Online ISBN: 978-3-031-73010-8
eBook Packages: Computer Science (R0)