Improving Text-Guided Object Inpainting with Semantic Pre-inpainting

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15104)

Abstract

Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is non-trivial in two respects: 1) relying solely on a single U-Net to align the text prompt and the visual object across all denoising timesteps is insufficient to generate the desired objects; 2) the controllability of object generation is not guaranteed in the intricate sampling space of diffusion models. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in the diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts that guide high-fidelity object generation through a reference adapter layer, enabling controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion over state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.

This work was performed when Yifu Chen was visiting HiDream.ai as a research intern.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 62172103, 32341012).

Author information

Corresponding author

Correspondence to Zhineng Chen.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 815 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, Y. et al. (2025). Improving Text-Guided Object Inpainting with Semantic Pre-inpainting. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15104. Springer, Cham. https://doi.org/10.1007/978-3-031-72952-2_7

  • DOI: https://doi.org/10.1007/978-3-031-72952-2_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72951-5

  • Online ISBN: 978-3-031-72952-2

  • eBook Packages: Computer Science, Computer Science (R0)
