Expanding the Horizons: Exploring Further Steps in Open-Vocabulary Segmentation

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14434)


Abstract

The open-vocabulary segmentation (OVS) task has gained significant attention because it combines two challenges: segmentation and open-vocabulary classification, i.e., recognizing arbitrary categories. Recent studies have leveraged pretrained vision-language models (VLMs) as a new paradigm for this problem and achieved notable results. However, our analysis shows that these methods are not yet fully satisfactory. In this paper, we empirically analyze the key challenges in four main categories: segmentation, datasets, reasoning, and recognition. Surprisingly, we observe that current OVS research focuses primarily on recognition issues, while the others remain relatively unexplored. Motivated by these findings, we propose preliminary approaches to the top three identified issues by integrating advanced models and adjusting existing segmentation models. Experimental results demonstrate the promising performance gains achieved by our proposed methods on the OVS benchmark.
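The abstract refers to the now-common paradigm of pairing class-agnostic masks with a pretrained VLM that scores arbitrary text labels. The sketch below is a minimal illustration of that paradigm using the open-source OpenAI `clip` package; it is an assumption for illustration only, not the authors' implementation, and the helper `classify_masks` is hypothetical.

```python
# Minimal sketch (assumed, illustrative): classify class-agnostic masks
# against arbitrary text labels with a pretrained CLIP model.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_masks(image: Image.Image, masks, candidate_labels):
    """Assign each binary mask an open-vocabulary label via CLIP similarity.

    masks: list of HxW boolean numpy arrays from any class-agnostic mask
           generator (e.g. SAM); candidate_labels: arbitrary category strings.
    """
    # Encode the candidate vocabulary once.
    text = clip.tokenize([f"a photo of a {c}" for c in candidate_labels]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    img = np.array(image.convert("RGB"))
    results = []
    for m in masks:
        # Crop to the mask's bounding box and blank out background pixels,
        # so CLIP scores only the masked region.
        ys, xs = np.where(m)
        crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
        crop[~m[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0
        x = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(x)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ text_feat.T).squeeze(0)
        results.append(candidate_labels[int(sims.argmax())])
    return results
```

Because the vocabulary enters only through the text encoder, swapping in new category names requires no retraining, which is the property that makes VLMs attractive for OVS.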



Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (21XNLG28), the National Natural Science Foundation of China (No. 62276268), and Kuaishou. We thank the anonymous reviewers for their helpful comments.

Author information


Corresponding author

Correspondence to Ruihua Song.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, X., Ji, L., Yan, K., Sun, Y., Song, R. (2024). Expanding the Horizons: Exploring Further Steps in Open-Vocabulary Segmentation. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14434. Springer, Singapore. https://doi.org/10.1007/978-981-99-8549-4_34

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-8549-4_34


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8548-7

  • Online ISBN: 978-981-99-8549-4

  • eBook Packages: Computer Science, Computer Science (R0)
