Expanding the Horizons: Exploring Further Steps in Open-Vocabulary Segmentation

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14434)


Abstract

The open-vocabulary segmentation (OVS) task has gained significant attention because it combines two challenges: segmentation and open-vocabulary classification, i.e., recognizing arbitrary categories. Recent studies have leveraged pretrained vision-language models (VLMs) as a new paradigm for this problem and achieved notable results. However, our analysis shows that these methods are not yet fully satisfactory. In this paper, we empirically analyze the key challenges in four main categories: segmentation, datasets, reasoning, and recognition. Surprisingly, we observe that current OVS research focuses primarily on recognition issues, while the others remain relatively unexplored. Motivated by these findings, we propose preliminary approaches to the top three identified issues by integrating advanced models and adjusting existing segmentation models. Experimental results demonstrate the promising performance gains achieved by our proposed methods on the OVS benchmark.
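The abstract refers to the now-common paradigm of pairing class-agnostic masks with a pretrained VLM that scores arbitrary text labels. The sketch below is a minimal illustration of that paradigm using the open-source OpenAI `clip` package; it is an assumption for illustration only, not the authors' implementation, and the helper `classify_masks` is hypothetical.

```python
# Minimal sketch (assumed, illustrative): classify class-agnostic masks
# against arbitrary text labels with a pretrained CLIP model.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_masks(image: Image.Image, masks, candidate_labels):
    """Assign each binary mask an open-vocabulary label via CLIP similarity.

    masks: list of HxW boolean numpy arrays from any class-agnostic mask
           generator (e.g. SAM); candidate_labels: arbitrary category strings.
    """
    # Encode the candidate vocabulary once.
    text = clip.tokenize([f"a photo of a {c}" for c in candidate_labels]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    img = np.array(image.convert("RGB"))
    results = []
    for m in masks:
        # Crop to the mask's bounding box and blank out background pixels,
        # so CLIP scores only the masked region.
        ys, xs = np.where(m)
        crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
        crop[~m[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0
        x = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(x)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ text_feat.T).squeeze(0)
        results.append(candidate_labels[int(sims.argmax())])
    return results
```

Because the vocabulary enters only through the text encoder, swapping in new category names requires no retraining, which is the property that makes VLMs attractive for OVS.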



Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (21XNLG28), the National Natural Science Foundation of China (No. 62276268), and Kuaishou. We thank the anonymous reviewers for their helpful comments.

Author information


Corresponding author

Correspondence to Ruihua Song.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, X., Ji, L., Yan, K., Sun, Y., Song, R. (2024). Expanding the Horizons: Exploring Further Steps in Open-Vocabulary Segmentation. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14434. Springer, Singapore. https://doi.org/10.1007/978-981-99-8549-4_34

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-8549-4_34


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8548-7

  • Online ISBN: 978-981-99-8549-4

  • eBook Packages: Computer Science, Computer Science (R0)
