Abstract
Image inpainting, the process of restoring corrupted images, has seen significant advances with the advent of diffusion models (DMs). Despite these advances, current DM adaptations for inpainting, which either modify the sampling strategy or develop inpainting-specific DMs, frequently suffer from semantic inconsistencies and reduced image quality. To address these challenges, our work introduces a novel paradigm: dividing the masked image features and the noisy latent into separate branches. This division dramatically reduces the model's learning load, enabling essential masked image information to be incorporated in a nuanced, hierarchical fashion. We present BrushNet, a novel plug-and-play dual-branch model engineered to embed pixel-level masked image features into any pre-trained DM, ensuring coherent and enhanced image inpainting results. Additionally, we introduce BrushData and BrushBench to facilitate segmentation-based inpainting training and performance assessment. Our extensive experimental analysis demonstrates BrushNet's superior performance over existing models across seven key metrics, including image quality, mask region preservation, and textual coherence.
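To make the dual-branch idea concrete, the sketch below illustrates, in PyTorch, one plausible way an additional trainable branch can inject masked-image features hierarchically into a frozen pre-trained denoising network. All names (DualBranchInpainter, Block, the stand-in conv stages) are illustrative assumptions for exposition only; the real BrushNet operates on a full diffusion UNet and its released implementation may differ in detail.

```python
# Conceptual sketch of a dual-branch inpainting design (not the official BrushNet code).
import torch
import torch.nn as nn


class Block(nn.Module):
    """One downsampling conv stage, used by both the frozen path and the extra branch."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_out, c_out, 3, stride=2, padding=1), nn.SiLU(),
        )

    def forward(self, x):
        return self.net(x)


class DualBranchInpainter(nn.Module):
    def __init__(self, latent_ch=4, widths=(64, 128, 256)):
        super().__init__()
        # Frozen, pre-trained denoising branch (a toy stand-in for a real UNet encoder).
        chans = (latent_ch, *widths)
        self.base_down = nn.ModuleList(Block(chans[i], chans[i + 1]) for i in range(len(widths)))
        for p in self.base_down.parameters():
            p.requires_grad_(False)

        # Trainable branch: consumes noisy latent + masked-image latent + downsampled mask.
        branch_in = latent_ch + latent_ch + 1
        bchans = (branch_in, *widths)
        self.branch_down = nn.ModuleList(Block(bchans[i], bchans[i + 1]) for i in range(len(widths)))

        # Zero-initialized 1x1 convs so the added branch starts as a no-op.
        self.zero_convs = nn.ModuleList(nn.Conv2d(w, w, 1) for w in widths)
        for conv in self.zero_convs:
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, noisy_latent, masked_latent, mask):
        h_base = noisy_latent
        h_branch = torch.cat([noisy_latent, masked_latent, mask], dim=1)
        feats = []
        for base, branch, zero in zip(self.base_down, self.branch_down, self.zero_convs):
            h_base = base(h_base)
            h_branch = branch(h_branch)
            # Hierarchical injection: add branch features into the frozen path at each scale.
            h_base = h_base + zero(h_branch)
            feats.append(h_base)
        return feats  # in a full model these would feed the frozen mid/up blocks


# Usage with dummy tensors at a 64x64 latent resolution.
model = DualBranchInpainter()
z_t = torch.randn(1, 4, 64, 64)       # noisy latent
z_masked = torch.randn(1, 4, 64, 64)  # latent of the masked (background-only) image
m = torch.zeros(1, 1, 64, 64)         # binary inpainting mask, downsampled to latent size
print([f.shape for f in model(z_t, z_masked, m)])
```

The zero-initialized projections keep the pre-trained model's behavior unchanged at the start of training, so only the branch must learn to contribute masked-image information; this is the sense in which such a branch can be attached to any pre-trained DM in a plug-and-play manner.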
Notes
1. The proposed BrushData and BrushBench will be released along with the code.
Acknowledgment
This work is supported in part by Research Matching Grant (CSE-7-2022)-RMG01.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q. (2025). BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15078. Springer, Cham. https://doi.org/10.1007/978-3-031-72661-3_9
DOI: https://doi.org/10.1007/978-3-031-72661-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72660-6
Online ISBN: 978-3-031-72661-3
eBook Packages: Computer Science, Computer Science (R0)