
Leveraging LLMs for On-the-Fly Instruction Guided Image Editing

  • Conference paper
Progress in Artificial Intelligence (EPIA 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14967)


Abstract

The combination of language processing and image processing continues to attract growing interest, given recent impressive advances that leverage the combined strengths of both research domains. Among these advances, the task of editing an image solely on the basis of a natural language instruction stands out as a most challenging endeavour. While recent approaches to this task resort to training or fine-tuning, this paper explores a novel, unsupervised method that permits instruction-guided image editing on the fly. The approach is organized into three steps: image captioning and DDIM inversion, followed by obtaining the edit direction embedding, and finally generating the edited image. While dispensing with any form of training, our approach is shown to be effective and competitive, outperforming recent state-of-the-art models for this task on the MAGICBRUSH dataset.
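
To make the second step more concrete, the sketch below illustrates one way an edit direction embedding can be obtained, namely as the difference between the mean CLIP text embeddings of before-edit and after-edit captions (cf. note 2 below). It is a minimal illustration under assumed names: the CLIP checkpoint, the example captions and the mean_text_embedding helper are ours for illustration, not the authors' code.

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    # Illustrative sketch only (assumed setup, not the paper's implementation):
    # the edit direction is the difference between the mean CLIP text embeddings
    # of the after-edit and before-edit captions.
    model_name = "openai/clip-vit-large-patch14"  # assumed checkpoint
    tokenizer = CLIPTokenizer.from_pretrained(model_name)
    text_encoder = CLIPTextModel.from_pretrained(model_name).eval()

    def mean_text_embedding(captions):
        # Encode a list of captions and average their pooled embeddings.
        inputs = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = text_encoder(**inputs)
        return outputs.pooler_output.mean(dim=0)

    # Hypothetical captions for the instruction "add sunglasses to the cat".
    before_captions = ["a photo of a cat", "a cat sitting on a sofa"]
    after_captions = ["a photo of a cat with sunglasses",
                      "a cat with sunglasses sitting on a sofa"]

    edit_direction = mean_text_embedding(after_captions) - mean_text_embedding(before_captions)

In a pipeline of the kind described above, such a direction vector would then steer the generation of the edited image from the DDIM-inverted latents.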


Notes

  1. Code is available at https://github.com/nlx-group/pix2pix-onthefly.

  2. For example, if the alteration request asked to add sunglasses to a cat, GPT-3 would be asked to generate thousands of before-edit captions about “cats” and thousands of after-edit captions about “cats with sunglasses”.

  3. According to the HuggingFace leaderboard, Phi-2 has an average score across tested tasks of 61.33, Falcon 44.17, LLaMA2-7b 50.97, LLaMA-13b 55.69, and Mistral 60.97.

  4. https://github.com/pix2pixzero/pix2pix-zero as of 04/03/2024.

  5. We use CLIP ViT-B/32 to obtain these embeddings. This is the model used in the MAGICBRUSH paper, so as to ensure comparability with the scores reported there. A minimal sketch of such a similarity computation is given after these notes.
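
Complementing note 5, the following minimal sketch shows how a CLIP ViT-B/32 similarity score between two images can be computed with the HuggingFace transformers library; the file names and the clip_image_similarity helper are illustrative assumptions, not the authors' evaluation script.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Minimal sketch (assumed, not the paper's evaluation code): cosine similarity
    # between CLIP ViT-B/32 image embeddings, in the spirit of the CLIP-based
    # scores reported for MAGICBRUSH.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_image_similarity(path_a, path_b):
        # Embed both images and return the cosine similarity of their embeddings.
        images = [Image.open(path_a).convert("RGB"), Image.open(path_b).convert("RGB")]
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return (feats[0] @ feats[1]).item()

    # Hypothetical usage: compare an edited image against the ground-truth target.
    # score = clip_image_similarity("edited.png", "target.png")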

References

  1. Almazrouei, E., Alobeidli, H., Alshamsi, A., et al.: The Falcon series of language models: towards open frontier models (2023)

  2. Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: Proceedings of the IEEE/CVF, pp. 18392–18402 (2023)

  3. Brown, T., Mann, B., et al.: Language models are few-shot learners. arXiv:2005.14165 (2020)

  4. Cheng, Y., Gan, Z., Li, Y., et al.: Sequential attention GAN for interactive image editing. In: Proceedings of the 28th ACM, pp. 4383–4391 (2020)

  5. Gunasekar, S., et al.: Textbooks are all you need. arXiv:2306.11644 (2023)

  6. Hertz, A., Mokady, R., Tenenbaum, J., et al.: Prompt-to-prompt image editing with cross attention control. arXiv:2208.01626 (2022)

  7. Jiang, A.Q., Sablayrolles, A., Mensch, A., et al.: Mistral 7B (2023)

  8. Jiang, W., Xu, N., et al.: Language-guided global image editing via cross-modal cyclic mechanism. In: Proceedings of the IEEE/CVF, pp. 2115–2124 (2021)

  9. Li, J., Li, D., et al.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML, pp. 12888–12900 (2022)

  10. Mokady, R., Hertz, A., et al.: Null-text inversion for editing real images using guided diffusion models. In: Proceedings of the IEEE/CVF, pp. 6038–6047 (2023)

  11. Osório et al.: PORTULAN ExtraGLUE datasets and models: kick-starting a benchmark for the neural processing of Portuguese. In: BUCC Workshop, pp. 24–34 (2024)

  12. Parmar, G., Kumar Singh, K., Zhang, R., et al.: Zero-shot image-to-image translation. In: ACM SIGGRAPH 2023 Conference Proceedings, pp. 1–11 (2023)

  13. Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. arXiv:2103.00020 (2021)

  14. Ramesh, A., Dhariwal, P., Nichol, A., et al.: Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125 (2022)

  15. Rodrigues, J., Gomes, L., Silva, J., Branco, A., Santos, R., et al.: Advancing neural encoding of Portuguese with transformer Albertina PT-*. In: EPIA, pp. 441–453 (2023)

  16. Rombach, R., Blattmann, A., et al.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF, pp. 10684–10695 (2022)

  17. Santos, R., Branco, A., Silva, J.: Language driven image editing via transformers. In: 2022 IEEE 34th ICTAI, pp. 909–914 (2022)

  18. Santos, R., Branco, A., Silva, J.R.: Cost-effective language driven image editing with LX-DRIM. In: Proceedings of the 1st MMMPIE Workshop, pp. 31–43 (2022)

  19. Santos, R., Rodrigues, J., et al.: Fostering the ecosystem of open neural encoders for Portuguese with Albertina PT* family. In: SIGUL Workshop, pp. 105–114 (2024)

  20. Santos, R., Silva, J., et al.: Advancing generative AI for Portuguese with open decoder Gervásio PT*. In: SIGUL Workshop, pp. 16–26 (2024)

  21. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv:2010.02502 (2020)

  22. Touvron, H., Martin, L., Stone, K., et al.: LLaMA 2: open foundation and fine-tuned chat models. arXiv:2307.09288 (2023)

  23. Yang, C., Wang, X., Lu, Y., et al.: Large language models as optimizers. arXiv:2309.03409 (2023)

  24. Zhang, K., Mo, L., Chen, W., et al.: MagicBrush: a manually annotated dataset for instruction-guided image editing. NeurIPS 36 (2024)

  25. Zhang, S., Yang, X., Feng, Y., et al.: HIVE: harnessing human feedback for instructional visual editing. arXiv:2303.09618 (2023)

  26. Zhuang, P., Koyejo, O., Schwing, A.G.: Enjoy your editing: controllable GANs for image editing via latent space navigation. arXiv:2102.01187 (2021)


Acknowledgments

PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT (PINFRA/22117/2016); ACCELERAT.AI—Multilingual Intelligent Contact Centers, funded by IAPMEI (C625734525-00462629); Language Driven Image Design with Diffusion, funded by FCT (2022.15880.CPCA.A1); and IMPROMPT—Image Alteration with Language Prompts, funded by FCT (CPCA-IAC/AV/590897/2023).

Author information


Corresponding author

Correspondence to Rodrigo Santos.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Santos, R., Silva, J., Branco, A. (2025). Leveraging LLMs for On-the-Fly Instruction Guided Image Editing. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science, vol 14967. Springer, Cham. https://doi.org/10.1007/978-3-031-73497-7_3


  • DOI: https://doi.org/10.1007/978-3-031-73497-7_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73496-0

  • Online ISBN: 978-3-031-73497-7

  • eBook Packages: Computer Science, Computer Science (R0)
