Abstract
In recent years, diffusion probabilistic models have become a central topic in computer vision. Text-to-image models such as Imagen, Latent Diffusion Models, and Stable Diffusion have demonstrated outstanding generative ability and sparked considerable discussion in the community. However, they often lack the ability to precisely edit real-world images. In this paper, we propose a novel ControlNet-based image editing framework that alters real images according to pose maps, scribble maps, and other conditions without any training or fine-tuning. Given a guiding image as input, we first edit the initial noise obtained from the guiding image to steer the generation process. Features extracted from the guiding image are then injected directly into the generation process of the translated image. We also construct a classifier guidance term based on the strong correspondences between intermediate features of the ControlNet branches: the editing signals are converted into gradients that guide the sampling direction. Finally, we demonstrate high-quality results of the proposed framework on image editing tasks.
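The sampling procedure summarized above (noise initialization from the guiding image, injection of guiding features, and classifier guidance driven by a feature-correspondence energy) can be illustrated with a short sketch. The following is a minimal, self-contained PyTorch example under stated assumptions, not the authors' released implementation: Denoiser, ddim_step, guided_sample, and the MSE feature energy are hypothetical placeholders standing in for the ControlNet-conditioned noise predictor and the correspondence term described in the abstract.

```python
import torch

class Denoiser(torch.nn.Module):
    """Toy stand-in for a ControlNet-conditioned noise predictor eps_theta(x_t, t, c)."""
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.backbone = torch.nn.Sequential(torch.nn.Linear(dim + 1, hidden), torch.nn.SiLU())
        self.head = torch.nn.Linear(hidden, dim)

    def forward(self, x_t, t, cond_feat):
        t_emb = torch.full_like(x_t[:, :1], float(t))
        feats = self.backbone(torch.cat([x_t + cond_feat, t_emb], dim=-1))  # intermediate features
        return self.head(feats), feats

def ddim_step(x_t, eps, alpha_t, alpha_prev):
    """Deterministic DDIM update from x_t to x_{t-1} (eta = 0)."""
    x0_pred = (x_t - torch.sqrt(1 - alpha_t) * eps) / torch.sqrt(alpha_t)
    return torch.sqrt(alpha_prev) * x0_pred + torch.sqrt(1 - alpha_prev) * eps

def guided_sample(model, x_T, guide_feats, cond_feat, alphas, scale=1.0):
    """Denoise x_T while pulling the edited branch's features toward the guiding branch's."""
    x_t = x_T
    for t in reversed(range(1, len(alphas))):
        x_t = x_t.detach().requires_grad_(True)
        eps, feats = model(x_t, t, cond_feat)
        # Correspondence-style energy between intermediate features of the two branches;
        # its gradient plays the role of the classifier-guidance signal.
        energy = torch.nn.functional.mse_loss(feats, guide_feats[t])
        grad = torch.autograd.grad(energy, x_t)[0]
        eps_hat = eps + scale * torch.sqrt(1 - alphas[t]) * grad
        x_t = ddim_step(x_t.detach(), eps_hat.detach(), alphas[t], alphas[t - 1])
    return x_t

# Tiny usage example with random tensors standing in for latents and guiding features.
if __name__ == "__main__":
    torch.manual_seed(0)
    T, dim = 10, 64
    model = Denoiser(dim)
    alphas = torch.linspace(0.99, 0.1, T)            # cumulative alpha_bar, decreasing with t
    cond = torch.randn(1, dim)                       # ControlNet-style condition (e.g. pose features)
    guide = [torch.randn(1, 128) for _ in range(T)]  # guiding-branch features per step
    x0 = guided_sample(model, torch.randn(1, dim), guide, cond, alphas)
    print(x0.shape)
```

In the actual framework the placeholders would correspond to a Stable Diffusion UNet with ControlNet branches and diffusion-feature correspondences; the sketch only illustrates how an editing signal becomes a gradient that biases each DDIM step.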
References
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 35, 36479–36494 (2022)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 (2022)
Liu, V., Chilton, L.B.: Design guidelines for prompt engineering text-to-image generative models. In: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–23 (2022)
Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T.: Text2LIVE: Text-driven layered image and video editing. In: European Conference on Computer Vision, pp. 707–723. Springer (2022). https://doi.org/10.1007/978-3-031-19784-0_41
Bansal, A., Chu, H.M., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping, J., et al.: Universal guidance for diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 843–852 (2023)
Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., et al.: Palette: Image-to-image diffusion models. In: ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10 (2022)
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
Epstein, D., Jabri, A., Poole, B., Efros, A., Holynski, A.: Diffusion self-guidance for controllable image generation. Adv. Neural Inf. Process. Syst. 36, 16222–16239 (2023)
Tang, L., Jia, M., Wang, Q., Phoo, C.P., Hariharan, B.: Emergent correspondence from image diffusion. Adv. Neural Inf. Process. Syst. 36, 1363–1389 (2023)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)
Mousakhan, A., Brox, T., Tayyub, J.: Anomaly detection with conditioned denoising diffusion models. arXiv preprint arXiv:2305.15956 (2023)
Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., et al.: Cogview: Mastering text-to-image generation via transformers. Adv. Neural Inf. Process. Syst. 34, 19822–19835 (2021)
Hinz, T., Heinrich, S., Wermter, S.: Semantic object accuracy for generative text-to-image synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 44, 1552–1565 (2020)
Avrahami, O., Fried, O., Lischinski, D.: Blended latent diffusion. ACM Trans. Graph. (TOG) 42, 1–11 (2023)
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., et al.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., et al.: T2i-adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4296–4304 (2024)
Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., et al.: Gligen: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22511–22521 (2023)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Mao, J., Wang, X., Aizawa, K.: Guided image synthesis via initial image editing in diffusion model. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5321–5329 (2023)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 34, 8780–8794 (2021)
Pantic, M., Valstar, M., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. In: 2005 IEEE International Conference on Multimedia and Expo, 5 pp. IEEE (2005)
Bhunia, A.K., Khan, S., Cholakkal, H., Anwer, R.M., Laaksonen, J., Shah, M., et al.: Person image synthesis via denoising diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5968–5976 (2023)
Liu, K., Li, Q., Qiu, G.: PoseGAN: a pose-to-image translation framework for camera localization. ISPRS J. Photogrammetry Remote Sens. 166, 308–315 (2020)
Zhang, J., Li, K., Lai, Y.-K., Yang, J.: PISE: Person image synthesis and editing with decoupled GAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7982–7990 (2021)
Acknowledgments
This work was supported by National Natural Science Foundation of China (62376286).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Xu, L. et al. (2024). Fine-Grained Image Editing Using ControlNet: Expanding Possibilities in Visual Manipulation. In: Huang, DS., Si, Z., Guo, J. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14867. Springer, Singapore. https://doi.org/10.1007/978-981-97-5597-4_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5596-7
Online ISBN: 978-981-97-5597-4