Abstract
Open-Vocabulary Video Instance Segmentation (VIS) is attracting increasing attention due to its ability to segment and track arbitrary objects. However, recent Open-Vocabulary VIS attempts have obtained unsatisfactory results, especially in their generalization to novel categories. We find that two central causes are the domain gap between VLM features (e.g., CLIP) and the instance queries, and the underutilization of temporal consistency. To mitigate these issues, we design and train a novel Open-Vocabulary VIS baseline called OVFormer. OVFormer uses a lightweight module for unified embedding alignment between query embeddings and CLIP image embeddings to remedy the domain gap. Unlike previous image-based training methods, we conduct video-based model training and deploy a semi-online inference scheme to fully exploit the temporal consistency in the video. Without bells and whistles, OVFormer achieves 21.9 mAP with a ResNet-50 backbone on LV-VIS, exceeding the previous state of the art by 7.7 mAP. Extensive experiments on Close-Vocabulary VIS datasets also demonstrate the strong zero-shot generalization ability of OVFormer (+7.6 mAP on YouTube-VIS 2019, +3.9 mAP on OVIS). Code is available at https://github.com/fanghaook/OVFormer.
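The abstract's core idea — mapping instance-query embeddings into the CLIP embedding space so they can be classified against text embeddings of arbitrary category names — can be sketched minimally as follows. This is a hypothetical NumPy illustration under stated assumptions (a single linear layer standing in for the paper's lightweight alignment module, random weights in place of trained ones); it is not OVFormer's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def align_queries(queries, W, b):
    """Map instance-query embeddings into the CLIP embedding space.
    A single linear layer stands in here for the paper's lightweight
    unified embedding alignment module (assumption for illustration)."""
    return queries @ W + b

def open_vocab_logits(aligned, text_embeds, temperature=0.07):
    """Classify aligned queries by cosine similarity against CLIP text
    embeddings, one per (possibly novel) category name."""
    a = aligned / np.linalg.norm(aligned, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return (a @ t.T) / temperature

# Toy dimensions: 10 instance queries of dim 256, CLIP space of dim 512,
# and 5 open-vocabulary category names encoded by the CLIP text encoder.
queries = rng.standard_normal((10, 256))
W = rng.standard_normal((256, 512)) * 0.02
b = np.zeros(512)
text_embeds = rng.standard_normal((5, 512))

logits = open_vocab_logits(align_queries(queries, W, b), text_embeds)
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(probs.shape)  # (10, 5): per-query distribution over category names
```

Because classification happens in CLIP's joint image-text space, the category list at inference time need not match the training vocabulary, which is what enables the open-vocabulary setting.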
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (No. 62106128, 62101309), the Natural Science Foundation of Shandong Province (No. ZR2021QF001, ZR2021QF109), the Shandong Province Science and Technology Small and Medium-sized Enterprise Innovation Capacity Enhancement Project (2023TSGC0115), and the Shandong Province Higher Education Institutions Youth Entrepreneurship and Technology Support Program (2023KJ027).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Fang, H., Wu, P., Li, Y., Zhang, X., Lu, X. (2025). Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_13
DOI: https://doi.org/10.1007/978-3-031-72897-6_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72896-9
Online ISBN: 978-3-031-72897-6