Abstract
In recent years, diffusion probabilistic models have become a central topic in computer vision. Text-to-image models such as Imagen, Latent Diffusion Models, and Stable Diffusion have demonstrated outstanding generative ability and sparked considerable discussion in the community. However, they often lack the ability to precisely edit real-world images. In this paper, we propose a novel ControlNet-based image editing framework that alters real images according to pose maps, scribble maps, and other conditions without any training or fine-tuning. Given a guiding image as input, we first edit the initial noise obtained from the guiding image to steer the generation process. Features extracted from the guiding image are then injected directly into the generation process of the translated image. We also construct a classifier guidance term based on the strong correspondences between intermediate features of the ControlNet branches: the editing signals are converted into gradients that guide the sampling direction. Finally, we demonstrate high-quality results of the proposed framework on image editing tasks.
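The sampling procedure summarized above (noise initialization from the guiding image, injection of guiding features, and classifier guidance driven by a feature-correspondence energy) can be illustrated with a short sketch. The following is a minimal, self-contained PyTorch example under stated assumptions, not the authors' released implementation: Denoiser, ddim_step, guided_sample, and the MSE feature energy are hypothetical placeholders standing in for the ControlNet-conditioned noise predictor and the correspondence term described in the abstract.

```python
import torch

class Denoiser(torch.nn.Module):
    """Toy stand-in for a ControlNet-conditioned noise predictor eps_theta(x_t, t, c)."""
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.backbone = torch.nn.Sequential(torch.nn.Linear(dim + 1, hidden), torch.nn.SiLU())
        self.head = torch.nn.Linear(hidden, dim)

    def forward(self, x_t, t, cond_feat):
        t_emb = torch.full_like(x_t[:, :1], float(t))
        feats = self.backbone(torch.cat([x_t + cond_feat, t_emb], dim=-1))  # intermediate features
        return self.head(feats), feats

def ddim_step(x_t, eps, alpha_t, alpha_prev):
    """Deterministic DDIM update from x_t to x_{t-1} (eta = 0)."""
    x0_pred = (x_t - torch.sqrt(1 - alpha_t) * eps) / torch.sqrt(alpha_t)
    return torch.sqrt(alpha_prev) * x0_pred + torch.sqrt(1 - alpha_prev) * eps

def guided_sample(model, x_T, guide_feats, cond_feat, alphas, scale=1.0):
    """Denoise x_T while pulling the edited branch's features toward the guiding branch's."""
    x_t = x_T
    for t in reversed(range(1, len(alphas))):
        x_t = x_t.detach().requires_grad_(True)
        eps, feats = model(x_t, t, cond_feat)
        # Correspondence-style energy between intermediate features of the two branches;
        # its gradient plays the role of the classifier-guidance signal.
        energy = torch.nn.functional.mse_loss(feats, guide_feats[t])
        grad = torch.autograd.grad(energy, x_t)[0]
        eps_hat = eps + scale * torch.sqrt(1 - alphas[t]) * grad
        x_t = ddim_step(x_t.detach(), eps_hat.detach(), alphas[t], alphas[t - 1])
    return x_t

# Tiny usage example with random tensors standing in for latents and guiding features.
if __name__ == "__main__":
    torch.manual_seed(0)
    T, dim = 10, 64
    model = Denoiser(dim)
    alphas = torch.linspace(0.99, 0.1, T)            # cumulative alpha_bar, decreasing with t
    cond = torch.randn(1, dim)                       # ControlNet-style condition (e.g. pose features)
    guide = [torch.randn(1, 128) for _ in range(T)]  # guiding-branch features per step
    x0 = guided_sample(model, torch.randn(1, dim), guide, cond, alphas)
    print(x0.shape)
```

In the actual framework the placeholders would correspond to a Stable Diffusion UNet with ControlNet branches and diffusion-feature correspondences; the sketch only illustrates how an editing signal becomes a gradient that biases each DDIM step.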
References
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 35, 36479–36494 (2022)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 (2022)
Liu, V., Chilton, L.B.: Design guidelines for prompt engineering text-to-image generative models. In: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–23 (2022)
Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T.: Text2LIVE: Text-driven layered image and video editing. In: European Conference on Computer Vision, pp. 707–723. Springer (2022). https://doi.org/10.1007/978-3-031-19784-0_41
Bansal, A., Chu, H.M., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping, J., et al.: Universal guidance for diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 843–852 (2023)
Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., et al.: Palette: Image-to-image diffusion models. In: ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10 (2022)
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
Epstein, D., Jabri, A., Poole, B., Efros, A., Holynski, A.: Diffusion self-guidance for controllable image generation. Adv. Neural Inf. Process. Syst. 36, 16222–16239 (2023)
Tang, L., Jia, M., Wang, Q., Phoo, C.P., Hariharan, B.: Emergent correspondence from image diffusion. Adv. Neural Inf. Process. Syst. 36, 1363–1389 (2023)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)
Mousakhan, A., Brox, T., Tayyub, J.: Anomaly detection with conditioned denoising diffusion models. arXiv preprint arXiv:2305.15956 (2023)
Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., et al.: Cogview: Mastering text-to-image generation via transformers. Adv. Neural Inf. Process. Syst. 34, 19822–19835 (2021)
Hinz, T., Heinrich, S., Wermter, S.: Semantic object accuracy for generative text-to-image synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 44, 1552–1565 (2020)
Avrahami, O., Fried, O., Lischinski, D.: Blended latent diffusion. ACM Trans. Graph. (TOG) 42, 1–11 (2023)
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., et al.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., et al.: T2i-adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4296–4304 (2024)
Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., et al.: Gligen: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22511–22521 (2023)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Mao, J., Wang, X., Aizawa, K.: Guided image synthesis via initial image editing in diffusion model. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5321–5329 (2023)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 34, 8780–8794 (2021)
Pantic, M., Valstar, M., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. In: 2005 IEEE International Conference on Multimedia and Expo, 5 pp. IEEE (2005)
Bhunia, A.K., Khan, S., Cholakkal, H., Anwer, R.M., Laaksonen, J., Shah, M., et al.: Person image synthesis via denoising diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5968–5976 (2023)
Liu, K., Li, Q., Qiu, G.: PoseGAN: a pose-to-image translation framework for camera localization. ISPRS J. Photogrammetry Remote Sens. 166, 308–315 (2020)
Zhang, J., Li, K., Lai, Y.-K., Yang, J.: PISE: Person image synthesis and editing with decoupled GAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7982–7990 (2021)
Acknowledgments
This work was supported by National Natural Science Foundation of China (62376286).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Xu, L. et al. (2024). Fine-Grained Image Editing Using ControlNet: Expanding Possibilities in Visual Manipulation. In: Huang, DS., Si, Z., Guo, J. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14867. Springer, Singapore. https://doi.org/10.1007/978-981-97-5597-4_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5596-7
Online ISBN: 978-981-97-5597-4