Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Zhu, Zixin; Feng, Xuelu; Chen, Dongdong; Yuan, Junsong; Qiao, Chunming; Hua, Gang

doi:10.1007/978-3-031-73254-6_26

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15070)

Included in the following conference series:

European Conference on Computer Vision

Abstract

In this paper, we explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned by a pre-trained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding. We validate this hypothesis on the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed “VD-IT”, with purpose-built components on top of a fixed pre-trained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. In addition, instead of using standard Gaussian noise, we propose to predict video-specific noise with an extra noise prediction module, which helps preserve feature fidelity and improves segmentation quality. Through extensive experiments, we observe, somewhat surprisingly, that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pre-trained on discriminative image/video pre-training tasks, show greater potential for maintaining semantic alignment and temporal consistency. On existing standard benchmarks, VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods. The code is available at https://github.com/buxiangzhiren/VD-IT.
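
The noising step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows the standard forward-diffusion mixing rule, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, and contrasts noise drawn from a Gaussian with noise computed from the input itself. The function names, the toy latent, the value of `alpha_bar`, and in particular `toy_predicted_noise` (a hand-written stand-in for a learned noise prediction module) are all assumptions made for illustration.

```python
import math
import random


def add_noise(latent, noise, alpha_bar):
    """Standard forward-diffusion mixing: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    keep = math.sqrt(alpha_bar)
    mix = math.sqrt(1.0 - alpha_bar)
    return [keep * x + mix * e for x, e in zip(latent, noise)]


def gaussian_noise(latent, rng):
    """Baseline: noise sampled independently of the video content."""
    return [rng.gauss(0.0, 1.0) for _ in latent]


def toy_predicted_noise(latent):
    """Hypothetical stand-in for a learned noise predictor.

    It standardizes the latent, so the 'noise' is a deterministic function
    of the input. A real module would be a trained network.
    """
    mean = sum(latent) / len(latent)
    var = sum((x - mean) ** 2 for x in latent) / len(latent)
    std = math.sqrt(var) or 1.0  # avoid division by zero for constant input
    return [(x - mean) / std for x in latent]


rng = random.Random(0)
latent = [0.5, -1.0, 2.0, 0.0]  # flattened toy latent of one frame (assumed)
alpha_bar = 0.9                 # cumulative signal rate at one timestep (assumed)

noisy_gaussian = add_noise(latent, gaussian_noise(latent, rng), alpha_bar)
noisy_predicted = add_noise(latent, toy_predicted_noise(latent), alpha_bar)
```

In this toy setting the Gaussian variant changes with every random seed, while the input-dependent variant yields the same noised latent for the same frame, which is one intuition for why content-specific noise might disturb the extracted features less.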

Author information

Authors and Affiliations

University at Buffalo, Buffalo, USA
Zixin Zhu, Xuelu Feng, Junsong Yuan & Chunming Qiao
Microsoft GenAI, Redmond, USA
Dongdong Chen
Dolby Laboratories, San Francisco, USA
Gang Hua

Corresponding author

Correspondence to Zixin Zhu.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 614 KB)

Cite this paper

Zhu, Z., Feng, X., Chen, D., Yuan, J., Qiao, C., Hua, G. (2025). Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15070. Springer, Cham. https://doi.org/10.1007/978-3-031-73254-6_26

DOI: https://doi.org/10.1007/978-3-031-73254-6_26
Published: 28 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73253-9
Online ISBN: 978-3-031-73254-6
eBook Packages: Computer Science, Computer Science (R0)
