Abstract
Recent advances in diffusion-based generative image editing have sparked a profound revolution, reshaping the landscape of image outpainting and inpainting tasks. Despite these strides, the field grapples with inherent challenges, including: i) inferior quality; ii) poor consistency; iii) insufficient instruction adherence; and iv) suboptimal generation efficiency. To address these obstacles, we present ByteEdit, an innovative feedback learning framework meticulously designed to Boost, Comply, and Accelerate Generative Image Editing tasks. ByteEdit seamlessly integrates image reward models dedicated to enhancing aesthetics and image-text alignment, while also introducing a dense, pixel-level reward model tailored to foster coherence in the output. Furthermore, we propose a pioneering adversarial and progressive feedback learning strategy to expedite the model's inference speed. Through extensive large-scale user evaluations, we demonstrate that ByteEdit surpasses leading generative image editing products, including Adobe, Canva, and MeiTu, in both generation quality and consistency. ByteEdit-Outpainting achieves remarkable improvements of 388% and 135% in quality and consistency, respectively, compared to the baseline model. Experiments also verify that our accelerated models maintain excellent quality and consistency.
Y. Ren, J. Wu and Y. Lu—Equal contribution.
ByteDance Project Page: https://byte-edit.github.io.
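To make the feedback-learning idea in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of how an image-level reward (aesthetics / image-text alignment) and a dense, pixel-level coherence reward could be combined into a single training signal for an editing model. The reward networks, the masking convention, and the loss weights shown here are hypothetical placeholders for exposition, not the paper's actual architecture or implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two kinds of reward models described in the
# abstract: an image-level reward (aesthetics / image-text alignment) and a
# dense, pixel-level reward that scores coherence inside the edited region.
class ImageReward(nn.Module):
    def __init__(self):
        super().__init__()
        # Global pooling followed by a linear head -> one scalar per image.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 1)
        )

    def forward(self, img):
        return self.head(img)  # (B, 1) scalar reward per image


class DensePixelReward(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size=3, padding=1)

    def forward(self, img):
        return self.conv(img)  # (B, 1, H, W) per-pixel reward map


def feedback_loss(edited, mask, image_rm, pixel_rm, w_img=1.0, w_pix=1.0):
    """Reward-feedback loss: reward the whole image at the image level and
    the edited (masked) region at the pixel level; minimize the negative."""
    r_img = image_rm(edited).mean()
    r_pix = (pixel_rm(edited) * mask).sum() / mask.sum().clamp(min=1.0)
    return -(w_img * r_img + w_pix * r_pix)


# Usage: differentiate the combined reward w.r.t. the editor's output.
edited = torch.rand(2, 3, 64, 64, requires_grad=True)   # generated edit
mask = (torch.rand(2, 1, 64, 64) > 0.5).float()          # edited region
loss = feedback_loss(edited, mask, ImageReward(), DensePixelReward())
loss.backward()
```

In practice the reward models would be pretrained preference and coherence predictors that are kept frozen while the diffusion editing model is fine-tuned against the combined reward signal.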