Training-Free Region Prediction with Stable Diffusion

Honbu, Yuma; Yanai, Keiji

doi:10.1007/978-3-031-53302-0_2

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14557)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

Semantic segmentation models require a large number of images with pixel-level annotations for training, and such annotations are costly to obtain. In this study, we propose StableSeg, a method that infers region masks for arbitrary classes without any additional training by using Stable Diffusion, an image synthesis foundation model pre-trained on five billion image-text pairs. We also propose StableSeg++, which uses the pseudo-masks generated by StableSeg to estimate the optimal weights of the attention maps and can thereby infer better region masks. We show the effectiveness of the proposed methods through experiments on five datasets.
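
The abstract gives no implementation details, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that several attention maps for a class are already available as a NumPy array, combines them with uniform weights into a pseudo-mask, and then re-estimates the weights against that pseudo-mask. The least-squares fit, the 0.5 threshold, the synthetic maps, and all function names (`combine_maps`, `to_mask`, `fit_weights`) are assumptions made for this example.

```python
import numpy as np


def combine_maps(attn_maps, weights):
    """Weighted sum of K maps with shape (K, H, W), min-max scaled to [0, 1]."""
    combined = np.tensordot(weights, attn_maps, axes=1)  # -> (H, W)
    lo, hi = combined.min(), combined.max()
    if hi - lo < 1e-12:  # constant map: nothing to segment
        return np.zeros_like(combined)
    return (combined - lo) / (hi - lo)


def to_mask(score_map, threshold=0.5):
    """Binarize a score map (the threshold value is an assumption)."""
    return score_map >= threshold


def fit_weights(attn_maps, pseudo_mask):
    """Hypothetical weight estimation: least-squares fit of the maps to the
    pseudo-mask, clipped to be non-negative and normalized to sum to 1."""
    k = attn_maps.shape[0]
    design = attn_maps.reshape(k, -1).T             # (H*W, K)
    target = pseudo_mask.reshape(-1).astype(float)  # (H*W,)
    w, *_ = np.linalg.lstsq(design, target, rcond=None)
    w = np.clip(w, 0.0, None)
    total = w.sum()
    return w / total if total > 0 else np.full(k, 1.0 / k)


# Synthetic stand-ins for attention maps: one informative, one pure noise.
rng = np.random.default_rng(0)
height = width = 16
region = np.zeros((height, width))
region[4:12, 4:12] = 1.0
informative = region + 0.05 * rng.random((height, width))
noise = rng.random((height, width))
attn_maps = np.stack([informative, noise])

# Stage 1 (StableSeg-like): uniform weights give a pseudo-mask.
uniform = np.full(attn_maps.shape[0], 1.0 / attn_maps.shape[0])
pseudo_mask = to_mask(combine_maps(attn_maps, uniform))

# Stage 2 (StableSeg++-like): re-estimate weights from the pseudo-mask.
weights = fit_weights(attn_maps, pseudo_mask)
refined_mask = to_mask(combine_maps(attn_maps, weights))
```

In this toy setting the fit should shift weight toward the map that agrees with the pseudo-mask. Which maps the paper actually weights, and how it optimizes them, is not stated in the abstract.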

Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 21H05812, 22H00540, 22H00548, and 22K19808.

Author information

Authors and Affiliations

The University of Electro-Communications, Tokyo, Japan
Yuma Honbu & Keiji Yanai

Corresponding author

Correspondence to Keiji Yanai.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
Delft University of Technology, Delft, The Netherlands
Alan Hanjalic
Delft University of Technology, Delft, The Netherlands
Cynthia Liem
University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Reykjavik University, Reykjavik, Iceland
Björn Þór Jónsson
Microsoft Research Lab – Asia, Beijing, China
Bei Liu
The University of Tokyo, Tokyo, Japan
Yoko Yamakata

About this paper

Cite this paper

Honbu, Y., Yanai, K. (2024). Training-Free Region Prediction with Stable Diffusion. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_2

DOI: https://doi.org/10.1007/978-3-031-53302-0_2
Published: 29 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0
eBook Packages: Computer Science, Computer Science (R0)
