Fine-Grained Image Editing Using ControlNet: Expanding Possibilities in Visual Manipulation

  • Conference paper
  • In: Advanced Intelligent Computing Technology and Applications (ICIC 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14867)

Abstract

In recent years, diffusion probabilistic models have become a central topic in computer vision. Image generation systems such as Imagen, Latent Diffusion Models, and Stable Diffusion have demonstrated outstanding generative power and sparked considerable discussion in the community. However, they often lack the ability to precisely modify real-world images. In this paper, we propose a novel ControlNet-based image editing framework that alters real images according to pose maps, scribble maps, and other conditioning features, without any training or fine-tuning. Given a guiding image as input, we edit the initial noise derived from the guiding image to influence the generation process. Features extracted from the guiding image are then injected directly into the generation process of the translated image. We also construct a classifier guidance signal based on the strong correspondences between intermediate features of the ControlNet branches: the editing signals are converted into gradients that steer the sampling direction. Finally, we demonstrate high-quality results of the proposed model on image editing tasks.
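
To make the guidance mechanism concrete, the sketch below illustrates the generic classifier-guidance recipe the abstract alludes to: a loss measuring correspondence between intermediate ControlNet features of the current sample and those of the guiding image is differentiated with respect to the noisy latent, and the resulting gradient shifts the noise prediction at each DDIM step. This is a minimal illustration under assumed stand-ins, not the authors' implementation: unet, controlnet, edit_loss, and the toy alpha schedule are hypothetical placeholders for Stable Diffusion's UNet, a ControlNet branch, the paper's correspondence loss, and a real noise schedule.

# Minimal sketch of gradient-guided DDIM sampling; all networks below are
# toy stand-ins so the example runs end to end on its own.
import torch

torch.manual_seed(0)

unet = torch.nn.Conv2d(4, 4, 3, padding=1)        # stands in for eps(x_t, t)
controlnet = torch.nn.Conv2d(4, 8, 3, padding=1)  # intermediate ControlNet features

def edit_loss(feats, target_feats):
    # Penalize mismatch between ControlNet features of the current sample
    # and those of the guiding image; its gradient is the editing signal.
    return ((feats - target_feats) ** 2).mean()

def guided_ddim_step(x_t, target_feats, a_t, a_prev, scale=1.0):
    # One deterministic DDIM step with classifier-style guidance: the
    # editing loss is differentiated w.r.t. the noisy latent and the
    # gradient shifts the noise prediction (classifier-guidance recipe).
    x_t = x_t.detach().requires_grad_(True)
    eps = unet(x_t)
    grad = torch.autograd.grad(edit_loss(controlnet(x_t), target_feats), x_t)[0]
    eps = eps + scale * (1 - a_t).sqrt() * grad
    x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # predicted clean latent
    return a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps

# Usage: start from noise derived from a guiding image and denoise while
# steering toward that image's ControlNet features.
guide = torch.randn(1, 4, 8, 8)        # stands in for the inverted initial noise
target = controlnet(guide).detach()
x = guide.clone()
a_bar = torch.linspace(0.9, 0.1, 10)   # toy cumulative-alpha schedule
for i in range(len(a_bar) - 1):
    x = guided_ddim_step(x, target, a_bar[i], a_bar[i + 1]).detach()

In a real system, edit_loss would compare features pooled across several ControlNet blocks, and the guidance scale would trade edit fidelity against image quality.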

Acknowledgments

This work was supported by the National Natural Science Foundation of China (62376286).

Author information

Correspondence to Hongbo Huang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Xu, L. et al. (2024). Fine-Grained Image Editing Using ControlNet: Expanding Possibilities in Visual Manipulation. In: Huang, DS., Si, Z., Guo, J. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14867. Springer, Singapore. https://doi.org/10.1007/978-981-97-5597-4_3

  • DOI: https://doi.org/10.1007/978-981-97-5597-4_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5596-7

  • Online ISBN: 978-981-97-5597-4

  • eBook Packages: Computer Science, Computer Science (R0)
