Abstract
Semantic segmentation models require a large number of images with pixel-level annotations for training, and such annotations are costly to obtain. In this study, we propose StableSeg, a method that infers region masks for arbitrary classes without any additional training by using Stable Diffusion, an image synthesis foundation model pre-trained on five billion image-text pairs. We also propose StableSeg++, which uses the pseudo-masks generated by StableSeg to estimate the optimal weights of the attention maps and can thereby infer better region masks. We demonstrate the effectiveness of the proposed methods through experiments on five datasets.
Acknowledgements
This work was supported by JSPS KAKENHI Grant Numbers 21H05812, 22H00540, 22H00548, and 22K19808.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Honbu, Y., Yanai, K. (2024). Training-Free Region Prediction with Stable Diffusion. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_2
DOI: https://doi.org/10.1007/978-3-031-53302-0_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0
eBook Packages: Computer Science, Computer Science (R0)