Abstract
Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The pursuit of further enhancing image editability has sparked significant interest in the downstream task of inpainting a novel object, described by a text prompt, within a designated region of an image. Nevertheless, the problem is not trivial for two reasons: 1) relying on a single U-Net to align the text prompt with the visual object across all denoising timesteps is insufficient to generate the desired objects; 2) the controllability of object generation is not guaranteed in the intricate sampling space of the diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting, which infers the semantic features of the desired object in a multi-modal feature space; 2) high-fidelity object generation in the diffusion latent space, which pivots on these inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts that guide high-fidelity object generation through a reference adapter layer, enabling controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion over state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.
This work was performed when Yifu Chen was visiting HiDream.ai as a research intern.
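To make the cascaded design above concrete, the following is a minimal sketch of the two stages in PyTorch. The module names (SemanticInpainter, ReferenceAdapter), feature dimensions, and token layout are illustrative assumptions for exposition only, not the authors' released implementation; the official code is available at the repository linked in the abstract.

```python
# Hedged sketch of the CAT-Diffusion cascade described in the abstract.
# Module names, dimensions, and token layout are hypothetical placeholders.
import torch
import torch.nn as nn


class SemanticInpainter(nn.Module):
    """Stage 1: Transformer that predicts semantic features of the masked
    object conditioned on unmasked context tokens and the text prompt."""

    def __init__(self, dim=768, depth=6, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, context_tokens, text_tokens, num_masked):
        # Fill the masked region with learnable mask tokens, then let the
        # Transformer infer them from context + prompt in a shared feature space.
        b = context_tokens.size(0)
        masked = self.mask_token.expand(b, num_masked, -1)
        tokens = torch.cat([context_tokens, masked, text_tokens], dim=1)
        out = self.encoder(tokens)
        # Return only the inpainted (previously masked) semantic features.
        start = context_tokens.size(1)
        return out[:, start:start + num_masked]


class ReferenceAdapter(nn.Module):
    """Stage 2 hook: cross-attention layer that injects the pre-inpainted
    semantic features into a diffusion U-Net block as a visual prompt."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, unet_hidden, semantic_features):
        # Query: U-Net hidden states; Key/Value: inpainted semantic features.
        attended, _ = self.attn(self.norm(unet_hidden),
                                semantic_features, semantic_features)
        return unet_hidden + attended  # residual injection keeps the backbone intact


# Hypothetical usage: 256 context patches, 64 masked patches, 77 text tokens.
ctx = torch.randn(2, 256, 768)
txt = torch.randn(2, 77, 768)
sem = SemanticInpainter()(ctx, txt, num_masked=64)   # (2, 64, 768)
hid = torch.randn(2, 1024, 768)                      # U-Net hidden states
out = ReferenceAdapter()(hid, sem)                   # visually prompted features
```

The point the abstract emphasizes is reflected in the residual cross-attention: the pre-inpainted semantic features serve as an additional visual prompt to the diffusion U-Net rather than replacing the text conditioning, which is what makes the object generation controllable.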
Acknowledgement
This work was supported by the National Natural Science Foundation of China (No. 62172103, 32341012).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, Y. et al. (2025). Improving Text-Guided Object Inpainting with Semantic Pre-inpainting. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15104. Springer, Cham. https://doi.org/10.1007/978-3-031-72952-2_7
DOI: https://doi.org/10.1007/978-3-031-72952-2_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72951-5
Online ISBN: 978-3-031-72952-2