
Leveraging LLMs for On-the-Fly Instruction Guided Image Editing

  • Conference paper
Progress in Artificial Intelligence (EPIA 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14967)


Abstract

The combination of language processing and image processing continues to attract growing interest, given recent impressive advances that leverage the combined strengths of both research domains. Among these advances, the task of editing an image solely on the basis of a natural language instruction stands out as a most challenging endeavour. While recent approaches to this task resort to training or fine-tuning, this paper explores a novel, unsupervised method that permits instruction-guided image editing on the fly. The approach is organized into three steps: image captioning and DDIM inversion, followed by obtaining the edit direction embedding, and finally generating the edited image. While dispensing with any form of training, our approach is shown to be effective and competitive, outperforming recent state-of-the-art models for this task on the MAGICBRUSH dataset.
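
To make the second step more concrete, the sketch below illustrates one way an edit direction embedding can be obtained, namely as the difference between the mean CLIP text embeddings of before-edit and after-edit captions (cf. note 2 below). It is a minimal illustration under assumed names: the CLIP checkpoint, the example captions and the mean_text_embedding helper are ours for illustration, not the authors' code.

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    # Illustrative sketch only (assumed setup, not the paper's implementation):
    # the edit direction is the difference between the mean CLIP text embeddings
    # of the after-edit and before-edit captions.
    model_name = "openai/clip-vit-large-patch14"  # assumed checkpoint
    tokenizer = CLIPTokenizer.from_pretrained(model_name)
    text_encoder = CLIPTextModel.from_pretrained(model_name).eval()

    def mean_text_embedding(captions):
        # Encode a list of captions and average their pooled embeddings.
        inputs = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = text_encoder(**inputs)
        return outputs.pooler_output.mean(dim=0)

    # Hypothetical captions for the instruction "add sunglasses to the cat".
    before_captions = ["a photo of a cat", "a cat sitting on a sofa"]
    after_captions = ["a photo of a cat with sunglasses",
                      "a cat with sunglasses sitting on a sofa"]

    edit_direction = mean_text_embedding(after_captions) - mean_text_embedding(before_captions)

In a pipeline of the kind described above, such a direction vector would then steer the generation of the edited image from the DDIM-inverted latents.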


Notes

  1. Code is available at https://github.com/nlx-group/pix2pix-onthefly.

  2. For example, if the alteration request asked to add sunglasses to a cat, GPT-3 would be asked to generate thousands of before-edit captions about “cats” and thousands of after-edit captions about “cats with sunglasses”.

  3. According to the HuggingFace leaderboard, Phi-2 has an average score across tested tasks of 61.33, Falcon 44.17, LLaMA2-7b 50.97, LLaMA-13b 55.69, and Mistral 60.97.

  4. https://github.com/pix2pixzero/pix2pix-zero as of 04/03/2024.

  5. We use CLIP ViT-B/32 to obtain these embeddings. This is the model used in the MAGICBRUSH paper, so as to ensure comparability with the scores reported there. A minimal sketch of such a similarity computation is given after these notes.
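
Complementing note 5, the following minimal sketch shows how a CLIP ViT-B/32 similarity score between two images can be computed with the HuggingFace transformers library; the file names and the clip_image_similarity helper are illustrative assumptions, not the authors' evaluation script.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Minimal sketch (assumed, not the paper's evaluation code): cosine similarity
    # between CLIP ViT-B/32 image embeddings, in the spirit of the CLIP-based
    # scores reported for MAGICBRUSH.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_image_similarity(path_a, path_b):
        # Embed both images and return the cosine similarity of their embeddings.
        images = [Image.open(path_a).convert("RGB"), Image.open(path_b).convert("RGB")]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return (feats[0] @ feats[1]).item()

    # Hypothetical usage: compare an edited image against the ground-truth target.
    # score = clip_image_similarity("edited.png", "target.png")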

References

  1. Almazrouei, E., Alobeidli, H., Alshamsi, A., et al.: The Falcon series of language models: towards open frontier models (2023)

  2. Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: Proceedings of the IEEE/CVF, pp. 18392–18402 (2023)

  3. Brown, T., Mann, B., et al.: Language models are few-shot learners. arXiv:2005.14165 (2020)

  4. Cheng, Y., Gan, Z., Li, Y., et al.: Sequential attention GAN for interactive image editing. In: Proceedings of the 28th ACM, pp. 4383–4391 (2020)

  5. Gunasekar, S., et al.: Textbooks are all you need. arXiv:2306.11644 (2023)

  6. Hertz, A., Mokady, R., Tenenbaum, J., et al.: Prompt-to-prompt image editing with cross attention control. arXiv:2208.01626 (2022)

  7. Jiang, A.Q., Sablayrolles, A., Mensch, A., et al.: Mistral 7B (2023)

  8. Jiang, W., Xu, N., et al.: Language-guided global image editing via cross-modal cyclic mechanism. In: Proceedings of the IEEE/CVF, pp. 2115–2124 (2021)

  9. Li, J., Li, D., et al.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML, pp. 12888–12900 (2022)

  10. Mokady, R., Hertz, A., et al.: Null-text inversion for editing real images using guided diffusion models. In: Proceedings of the IEEE/CVF, pp. 6038–6047 (2023)

  11. Osório et al.: PORTULAN ExtraGLUE datasets and models: kick-starting a benchmark for the neural processing of Portuguese. In: BUCC Workshop, pp. 24–34 (2024)

  12. Parmar, G., Kumar Singh, K., Zhang, R., et al.: Zero-shot image-to-image translation. In: ACM SIGGRAPH 2023 Conference Proceedings, pp. 1–11 (2023)

  13. Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. arXiv:2103.00020 (2021)

  14. Ramesh, A., Dhariwal, P., Nichol, A., et al.: Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125 (2022)

  15. Rodrigues, J., Gomes, L., Silva, J., Branco, A., Santos, R., et al.: Advancing neural encoding of Portuguese with transformer Albertina PT-*. In: EPIA, pp. 441–453 (2023)

  16. Rombach, R., Blattmann, A., et al.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF, pp. 10684–10695 (2022)

  17. Santos, R., Branco, A., Silva, J.: Language driven image editing via transformers. In: 2022 IEEE 34th ICTAI, pp. 909–914 (2022)

  18. Santos, R., Branco, A., Silva, J.R.: Cost-effective language driven image editing with LX-DRIM. In: Proceedings of the 1st MMMPIE Workshop, pp. 31–43 (2022)

  19. Santos, R., Rodrigues, J., et al.: Fostering the ecosystem of open neural encoders for Portuguese with Albertina PT* family. In: SIGUL Workshop, pp. 105–114 (2024)

  20. Santos, R., Silva, J., et al.: Advancing generative AI for Portuguese with open decoder Gervásio PT*. In: SIGUL Workshop, pp. 16–26 (2024)

  21. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv:2010.02502 (2020)

  22. Touvron, H., Martin, L., Stone, K., et al.: LLaMA 2: open foundation and fine-tuned chat models. arXiv:2307.09288 (2023)

  23. Yang, C., Wang, X., Lu, Y., et al.: Large language models as optimizers. arXiv:2309.03409 (2023)

  24. Zhang, K., Mo, L., Chen, W., et al.: MagicBrush: a manually annotated dataset for instruction-guided image editing. NeurIPS 36 (2024)

  25. Zhang, S., Yang, X., Feng, Y., et al.: HIVE: harnessing human feedback for instructional visual editing. arXiv:2303.09618 (2023)

  26. Zhuang, P., Koyejo, O., Schwing, A.G.: Enjoy your editing: controllable GANs for image editing via latent space navigation. arXiv:2102.01187 (2021)


Acknowledgments

PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT (PINFRA/22117/2016); ACCELERAT.AI—Multilingual Intelligent Contact Centers, funded by IAPMEI (C625734525-00462629); Language Driven Image Design with Diffusion, funded by FCT (2022.15880.CPCA.A1); and IMPROMPT—Image Alteration with Language Prompts, funded by FCT (CPCA-IAC/AV/590897/2023).

Author information


Corresponding author

Correspondence to Rodrigo Santos.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Santos, R., Silva, J., Branco, A. (2025). Leveraging LLMs for On-the-Fly Instruction Guided Image Editing. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science, vol 14967. Springer, Cham. https://doi.org/10.1007/978-3-031-73497-7_3


  • DOI: https://doi.org/10.1007/978-3-031-73497-7_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73496-0

  • Online ISBN: 978-3-031-73497-7

  • eBook Packages: Computer Science, Computer Science (R0)
