Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Open-Vocabulary Video Instance Segmentation (VIS) is attracting increasing attention due to its ability to segment and track arbitrary objects. However, recent Open-Vocabulary VIS attempts have obtained unsatisfactory results, especially in terms of generalization to novel categories. We find that two central causes are the domain gap between VLM features (e.g., CLIP) and instance queries, and the underutilization of temporal consistency. To mitigate these issues, we design and train a novel Open-Vocabulary VIS baseline called OVFormer. OVFormer utilizes a lightweight module for unified embedding alignment between query embeddings and CLIP image embeddings to remedy the domain gap. Unlike previous image-based training methods, we conduct video-based model training and deploy a semi-online inference scheme to fully mine the temporal consistency in the video. Without bells and whistles, OVFormer achieves 21.9 mAP with a ResNet-50 backbone on LV-VIS, exceeding the previous state-of-the-art performance by 7.7 mAP. Extensive experiments on Close-Vocabulary VIS datasets also demonstrate the strong zero-shot generalization ability of OVFormer (+7.6 mAP on YouTube-VIS 2019, +3.9 mAP on OVIS). Code is available at https://github.com/fanghaook/OVFormer.
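The abstract's core idea — a lightweight module that aligns instance query embeddings with CLIP image embeddings so queries can be classified by similarity against CLIP text embeddings — can be sketched as below. This is a minimal illustration only, not the paper's implementation: the module name, layer sizes, the additive fusion, and the cosine-similarity classifier are all assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedEmbeddingAlignment(nn.Module):
    """Hypothetical sketch of a lightweight alignment module.

    Projects instance query embeddings into the CLIP embedding space and
    fuses them with per-instance CLIP image embeddings, so the result can
    be scored against CLIP text embeddings for open-vocabulary
    classification. Dimensions and the fusion-by-addition choice are
    assumptions, not OVFormer's exact design.
    """

    def __init__(self, query_dim: int = 256, clip_dim: int = 512):
        super().__init__()
        # Lightweight two-layer projection from query space to CLIP space.
        self.proj = nn.Sequential(
            nn.Linear(query_dim, clip_dim),
            nn.ReLU(),
            nn.Linear(clip_dim, clip_dim),
        )

    def forward(self, queries: torch.Tensor, clip_image_embeds: torch.Tensor) -> torch.Tensor:
        # queries: (N, query_dim) instance queries from the decoder
        # clip_image_embeds: (N, clip_dim) per-instance CLIP image features
        aligned = self.proj(queries) + clip_image_embeds  # fuse to narrow the domain gap
        return F.normalize(aligned, dim=-1)

def classify(aligned: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity logits against CLIP text embeddings (one per category)."""
    text_embeds = F.normalize(text_embeds, dim=-1)
    return aligned @ text_embeds.T  # (N, num_categories)
```

Because classification happens in the shared CLIP space, the category list can be swapped at inference time without retraining, which is what makes the setting "open-vocabulary".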



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 62106128, 62101309), the Natural Science Foundation of Shandong Province (No. ZR2021QF001, ZR2021QF109), the Shandong Province Science and Technology Small and Medium-sized Enterprise Innovation Capacity Enhancement Project (2023TSGC0115), and the Shandong Province Higher Education Institutions Youth Entrepreneurship and Technology Support Program (2023KJ027).

Author information


Corresponding author

Correspondence to Xiankai Lu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 7411 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Fang, H., Wu, P., Li, Y., Zhang, X., Lu, X. (2025). Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_13

  • DOI: https://doi.org/10.1007/978-3-031-72897-6_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72896-9

  • Online ISBN: 978-3-031-72897-6

  • eBook Packages: Computer Science, Computer Science (R0)
