
Training-Free Region Prediction with Stable Diffusion

  • Conference paper
  • MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14557)


Abstract

Semantic segmentation models require a large number of images with pixel-level annotations for training, which are costly to collect. In this study, we propose StableSeg, a method that infers region masks for arbitrary classes without any additional training by leveraging Stable Diffusion, an image synthesis foundation model pre-trained on five billion image-text pairs. We also propose StableSeg++, which uses the pseudo-masks generated by StableSeg to estimate optimal weights for the attention maps, yielding better region masks. We demonstrate the effectiveness of the proposed methods through experiments on five datasets.
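The abstract describes two steps: fusing Stable Diffusion's internal attention maps into a training-free region mask (StableSeg), and re-estimating per-layer fusion weights from the resulting pseudo-masks (StableSeg++). The sketch below illustrates one plausible reading of that pipeline in PyTorch. It is not the authors' implementation: the function names, the uniform-weight default, and the soft-IoU weight estimator are assumptions, and it presumes the per-layer cross-attention maps for a class token have already been extracted from the diffusion U-Net (e.g., via attention hooks).

```python
# Minimal sketch of the training-free masking idea from the abstract,
# NOT the authors' released code. Assumes cross-attention maps for one
# text (class) token were already extracted from Stable Diffusion's
# U-Net as a list of low-resolution 2D tensors, one per attention layer.
import torch
import torch.nn.functional as F


def stableseg_mask(attn_maps, weights=None, out_size=(512, 512), thresh=0.5):
    """Fuse per-layer cross-attention maps into a binary region mask.

    attn_maps: list of (h_i, w_i) tensors, attention of one class token.
    weights:   optional per-layer weights; uniform if None (StableSeg).
    """
    if weights is None:
        weights = torch.ones(len(attn_maps))
    weights = torch.as_tensor(weights, dtype=torch.float32)
    weights = weights / weights.sum()

    fused = torch.zeros(out_size)
    for w, a in zip(weights, attn_maps):
        # Upsample each low-resolution map to the output resolution.
        a = F.interpolate(a[None, None], size=out_size,
                          mode="bilinear", align_corners=False)[0, 0]
        # Normalize to [0, 1] so layers are comparable before fusion.
        a = (a - a.min()) / (a.max() - a.min() + 1e-8)
        fused += w * a
    return (fused > thresh).float()


def estimate_weights(attn_maps_per_image, pseudo_masks, iters=200, lr=0.1):
    """StableSeg++-style step (sketch): fit per-layer weights so the
    fused soft masks agree with the pseudo-masks from the uniform-weight
    run. Uses a soft-IoU loss; the paper's exact estimator may differ."""
    n_layers = len(attn_maps_per_image[0])
    logits = torch.zeros(n_layers, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        w = torch.softmax(logits, dim=0)
        loss = 0.0
        for maps, pm in zip(attn_maps_per_image, pseudo_masks):
            soft = torch.zeros_like(pm)
            for wi, a in zip(w, maps):
                a = F.interpolate(a[None, None], size=pm.shape,
                                  mode="bilinear", align_corners=False)[0, 0]
                a = (a - a.min()) / (a.max() - a.min() + 1e-8)
                soft = soft + wi * a
            # Soft IoU between the fused map and the pseudo-mask.
            inter = (soft * pm).sum()
            union = soft.sum() + pm.sum() - inter
            loss = loss + 1.0 - inter / (union + 1e-8)
        loss.backward()
        opt.step()
    return torch.softmax(logits.detach(), dim=0)
```

Under these assumptions, one would first run stableseg_mask with uniform weights to obtain pseudo-masks, then call estimate_weights on those pseudo-masks and re-run stableseg_mask with the learned per-layer weights to get the improved StableSeg++ masks.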



Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 21H05812, 22H00540, 22H00548, and 22K19808.

Author information

Corresponding author

Correspondence to Keiji Yanai.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Honbu, Y., Yanai, K. (2024). Training-Free Region Prediction with Stable Diffusion. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_2

  • DOI: https://doi.org/10.1007/978-3-031-53302-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53301-3

  • Online ISBN: 978-3-031-53302-0

  • eBook Packages: Computer Science (R0)
