Abstract
The combination of language processing and image processing continues to attract growing interest, given recent impressive advances that leverage the combined strengths of both research domains. Among these advances, the task of editing an image solely on the basis of a natural language instruction stands out as a particularly challenging endeavour. While recent approaches to this task resort to training or fine-tuning, this paper explores a novel, unsupervised method that performs instruction-guided image editing on the fly. The approach proceeds in three steps: image captioning and DDIM inversion, followed by obtaining the edit direction embedding, and finally generating the edited image. While dispensing with any form of training, our approach is shown to be effective and competitive, outperforming recent state-of-the-art models for this task on the MAGICBRUSH dataset.
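The three steps can be pictured with the following minimal sketch, assuming the pix2pix-zero pipeline shipped with the diffusers library and a BLIP captioner; the model checkpoints, file names, and hand-written caption sets are illustrative stand-ins (not the authors' exact implementation), and the pipeline API may vary across diffusers versions.

```python
# A minimal sketch of the three-step pipeline: BLIP captioning + DDIM
# inversion, an edit direction from before/after caption embeddings,
# then guided generation from the inverted latents.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionPix2PixZeroPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 1a: caption the input image with BLIP.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)
image = Image.open("input.png").convert("RGB").resize((512, 512))
inputs = blip_proc(image, return_tensors="pt").to(device)
caption = blip_proc.decode(blip.generate(**inputs)[0], skip_special_tokens=True)

# Step 1b: DDIM-invert the image into a noise latent, conditioned on the caption.
pipe = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", safety_checker=None
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
inv_latents = pipe.invert(caption, image=image).latents

# Step 2: embed before/after caption sets (in the paper these are produced
# from the edit instruction; two hand-written examples stand in for them).
def embed(captions):
    ids = pipe.tokenizer(captions, padding="max_length", truncation=True,
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        return pipe.text_encoder(ids)[0]

source_embeds = embed(["a photo of a cat", "a cat on a couch"])
target_embeds = embed(["a photo of a cat with sunglasses", "a cat with sunglasses on a couch"])

# Step 3: generate the edited image, steering denoising along the direction
# between the two embedding sets.
edited = pipe(caption, source_embeds=source_embeds, target_embeds=target_embeds,
              latents=inv_latents, num_inference_steps=50).images[0]
edited.save("edited.png")
```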
Notes
- 1.
Code is available at https://github.com/nlx-group/pix2pix-onthefly.
- 2.
For example, if the alteration request asked to add sunglasses to a cat, GPT-3 would be asked to generate thousands of before-edit captions about “cats” and thousands of after-edit captions about “cats with sunglasses”; a sketch of how such caption sets yield an edit direction follows these notes.
- 3.
According to the Hugging Face leaderboard, Phi-2 has an average score of 61.33 across the tested tasks, Falcon 44.17, LLaMA2-7b 50.97, LLaMA-13b 55.69, and Mistral 60.97.
- 4.
https://github.com/pix2pixzero/pix2pix-zero as of 04/03/2024.
- 5.
We use CLIP ViT-B/32 to obtain these embeddings. This is the model used in the MAGICBRUSH paper, so as to ensure comparability with the scores reported there; a sketch of this embedding computation follows these notes.
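As referenced in note 2, the edit direction in the pix2pix-zero approach is the difference between the mean text embeddings of the two caption sets. Below is a minimal sketch of that computation, assuming the openai/clip-vit-base-patch32 checkpoint and hand-written captions standing in for the thousands GPT-3 would generate.

```python
# A sketch of the edit-direction computation behind note 2: the direction is
# the difference between the mean embeddings of after- and before-edit captions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def mean_embedding(captions):
    inputs = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs)
    # One pooled embedding per caption, averaged over the whole set.
    return out.pooler_output.mean(dim=0)

# Two hand-written captions per set stand in for GPT-3's generated ones.
before = ["a photo of a cat", "a cat sitting on a couch"]
after = ["a photo of a cat with sunglasses", "a cat with sunglasses sitting on a couch"]
edit_direction = mean_embedding(after) - mean_embedding(before)
```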
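And as referenced in note 5, the evaluation embeddings come from CLIP ViT-B/32. The following sketch shows an illustrative cosine-similarity score between an edited image and its ground-truth target; it is not the exact MAGICBRUSH evaluation harness, and the file names are placeholders.

```python
# A sketch of the CLIP ViT-B/32 image embeddings mentioned in note 5, used
# here for a cosine-similarity score between an edited and a target image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embedding(path):
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)  # L2-normalise for cosine similarity

score = (image_embedding("edited.png") @ image_embedding("target.png").T).item()
print(f"CLIP image similarity: {score:.4f}")
```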
References
Almazrouei, E., Alobeidli, H., Alshamsi, A., et al.: The Falcon series of language models: towards open frontier models (2023)
Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18392–18402 (2023)
Brown, T., Mann, B., et al.: Language models are few-shot learners. arXiv:2005.14165 (2020)
Cheng, Y., Gan, Z., Li, Y., et al.: Sequential attention GAN for interactive image editing. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4383–4391 (2020)
Gunasekar, S., et al.: Textbooks are all you need. arXiv:2306.11644 (2023)
Hertz, A., Mokady, R., Tenenbaum, J., et al.: Prompt-to-prompt image editing with cross attention control. arXiv:2208.01626 (2022)
Jiang, A.Q., Sablayrolles, A., Mensch, A., et al.: Mistral 7B (2023)
Jiang, W., Xu, N., et al.: Language-guided global image editing via cross-modal cyclic mechanism. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2115–2124 (2021)
Li, J., Li, D., et al.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML, pp. 12888–12900 (2022)
Mokady, R., Hertz, A., et al.: Null-text inversion for editing real images using guided diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6038–6047 (2023)
Osório, et al.: PORTULAN ExtraGLUE datasets and models: kick-starting a benchmark for the neural processing of Portuguese. In: BUCC Workshop, pp. 24–34 (2024)
Parmar, G., Kumar Singh, K., Zhang, R., et al.: Zero-shot image-to-image translation. In: ACM SIGGRAPH 2023 Conference Proceedings, pp. 1–11 (2023)
Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. arXiv:2103.00020 (2021)
Ramesh, A., Dhariwal, P., Nichol, A., et al.: Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125 (2022)
Rodrigues, J., Gomes, L., Silva, J., Branco, A., Santos, R., et al.: Advancing neural encoding of Portuguese with transformer Albertina PT-*. In: EPIA, pp. 441–453 (2023)
Rombach, R., Blattmann, A., et al.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695 (2022)
Santos, R., Branco, A., Silva, J.: Language driven image editing via transformers. In: 2022 IEEE 34th ICTAI, pp. 909–914 (2022)
Santos, R., Branco, A., Silva, J.R.: Cost-effective language driven image editing with LX-DRIM. In: Proceedings of the 1st MMMPIE Workshop, pp. 31–43 (2022)
Santos, R., Rodrigues, J., et al.: Fostering the ecosystem of open neural encoders for Portuguese with Albertina PT* family. In: SIGUL workshop, pp. 105–114 (2024)
Santos, R., Silva, J., et al.: Advancing generative AI for Portuguese with open decoder Gervásio PT*. In: SIGUL Workshop, pp. 16–26 (2024)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv:2010.02502 (2020)
Touvron, H., Martin, L., Stone, K., et al.: LLaMA 2: open foundation and fine-tuned chat models. arXiv:2307.09288 (2023)
Yang, C., Wang, X., Lu, Y., et al.: Large language models as optimizers. arXiv:2309.03409 (2023)
Zhang, K., Mo, L., Chen, W., et al.: MagicBrush: a manually annotated dataset for instruction-guided image editing. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
Zhang, S., Yang, X., Feng, Y., et al.: HIVE: harnessing human feedback for instructional visual editing. arXiv:2303.09618 (2023)
Zhuang, P., Koyejo, O., Schwing, A.G.: Enjoy your editing: controllable GANs for image editing via latent space navigation. arXiv:2102.01187 (2021)
Acknowledgments
PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT (PINFRA/22117/2016); ACCELERAT.AI—Multilingual Intelligent Contact Centers, funded by IAPMEI (C625734525-00462629); Language Driven Image Design with Diffusion, funded by FCT (2022.15880.CPCA.A1); and IMPROMPT—Image Alteration with Language Prompts, funded by FCT (CPCA-IAC/AV/590897/2023).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Santos, R., Silva, J., Branco, A. (2025). Leveraging LLMs for On-the-Fly Instruction Guided Image Editing. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science, vol. 14967. Springer, Cham. https://doi.org/10.1007/978-3-031-73497-7_3
DOI: https://doi.org/10.1007/978-3-031-73497-7_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73496-0
Online ISBN: 978-3-031-73497-7