
The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

In spite of recent advancements in text-to-image generation, limitations persist in handling complex and imaginative prompts due to the restricted diversity and complexity of training data. This work explores how diffusion models can generate images from prompts requiring artistic creativity or specialized knowledge. We introduce the Realistic-Fantasy Benchmark (RFBench), a novel evaluation framework blending realistic and fantastical scenarios. To address these challenges, we propose the Realistic-Fantasy Network (RFNet), a training-free approach integrating diffusion models with LLMs. Extensive human evaluations and GPT-based compositional assessments demonstrate our approach’s superiority over state-of-the-art methods. Our code and dataset are available at https://leo81005.github.io/Reality-and-Fantasy/.
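To make the general idea concrete, the sketch below illustrates the training-free pattern the abstract describes: a large language model first interprets an imaginative prompt into an explicit scene description, which an off-the-shelf diffusion model then renders. This is not the authors' RFNet; the OpenAI chat API, the "gpt-4o-mini" model name, the "runwayml/stable-diffusion-v1-5" checkpoint, and the system prompt are all assumptions made purely for illustration.

```python
# Illustrative sketch only: a generic LLM-assisted prompt-expansion step feeding a
# text-to-image diffusion model. This is NOT the paper's RFNet; model IDs and the
# system prompt below are assumptions chosen for the example.
import torch
from diffusers import StableDiffusionPipeline
from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def interpret_prompt(prompt: str) -> str:
    """Ask an LLM to unpack an imaginative prompt into an explicit scene description."""
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of LLM
        messages=[
            {"role": "system",
             "content": "Rewrite the user's prompt as a detailed, literal scene "
                        "description suitable for a text-to-image model."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

# Any text-to-image diffusion pipeline could be substituted here.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

detailed = interpret_prompt("a jellyfish umbrella glowing over a rainy neon street")
image = pipe(detailed, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("fantasy_scene.png")
```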

Y. Yao and C.-F. Hsu contributed equally to this work.

Notes

  1. One detailed sample can be found in our supplementary material.

  2. We adopt GPT4-CLIP due to BLIP-CLIP's [3] limitations in accurately capturing image meanings through generated captions.

  3. The widely recognized metric CLIPScore [10, 29] exhibits limitations in evaluating our task; a minimal sketch of how the metric is computed appears after these notes. For detailed examples, please see the supplementary materials.

  4. Details of survey samples can be found in our supplementary material.
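For context, the sketch below shows how CLIPScore [10] is typically computed from CLIP [29] embeddings: a scaled, clipped cosine similarity between the image and caption embeddings, which rewards coarse image-text agreement rather than fine-grained compositional or fantastical fidelity. This is a minimal illustration assuming the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint, not the evaluation code used in the paper.

```python
# Minimal CLIPScore sketch (Hessel et al. [10]): w * max(cos(image_emb, text_emb), 0), w = 2.5.
# Checkpoint choice is an assumption; any CLIP variant could be substituted.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """Reference-free CLIPScore between one image and one caption."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Normalize projected embeddings, then take their cosine similarity.
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(dim=-1).item()
    return w * max(cos, 0.0)

# Example usage:
# score = clip_score(Image.open("fantasy_scene.png"), "a jellyfish umbrella over a neon street")
```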

References

  1. Anciukevičius, T., et al.: RenderDiffusion: image diffusion for 3d reconstruction, inpainting and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12608–12618 (2023)

  2. Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: MultiDiffusion: fusing diffusion paths for controlled image generation. In: International Conference on Machine Learning (2023)

  3. Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: attention-based semantic guidance for text-to-image diffusion models. ACM Trans. Graph. (TOG) 42(4), 1–10 (2023)

  4. Feng, W., et al.: Training-free structured diffusion guidance for compositional text-to-image synthesis. In: International Conference on Learning Representations (2023). https://openreview.net/forum?id=PUIqjT4rzq7

  5. Feng, W., et al.: LayoutGPT: compositional visual planning and generation with large language models. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  6. Friedrich, F., et al.: Fair diffusion: instructing text-to-image generation models on fairness. arXiv preprint arXiv:2302.10893 (2023)

  7. Gani, H., Bhat, S.F., Naseer, M., Khan, S., Wonka, P.: LLM blueprint: enabling text-to-image generation with complex and detailed prompts. In: International Conference on Learning Representations (2024)

  8. Golnari, P.A.: LoRA-enhanced distillation on guided diffusion models. arXiv preprint arXiv:2312.06899 (2023)

  9. Gong, J., Foo, L.G., Fan, Z., Ke, Q., Rahmani, H., Liu, J.: DiffPose: toward more reliable 3D pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13041–13051 (2023)

  10. Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)

  11. Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: Advances in Neural Information Processing Systems Workshop (2021). https://openreview.net/forum?id=qw8AKxfYbI

  12. Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

  13. Huang, K., Sun, K., Xie, E., Li, Z., Liu, X.: T2I-CompBench: a comprehensive benchmark for open-world compositional text-to-image generation. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  14. Kemker, R., McClure, M., Abitino, A., Hayes, T., Kanan, C.: Measuring catastrophic forgetting in neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

  15. Qu, L., Wu, S., Fei, H., Nie, L., Chua, T.S.: LayoutLLM-T2I: eliciting layout guidance from LLM for text-to-image generation. In: Proceedings of the ACM International Conference on Multimedia (2023)

  16. Lian, L., Li, B., Yala, A., Darrell, T.: LLM-grounded diffusion: enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655 (2023)

  17. Lian, L., Shi, B., Yala, A., Darrell, T., Li, B.: LLM-grounded video diffusion models. arXiv preprint arXiv:2309.17444 (2023)

  18. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (2023)

  19. Luccioni, S., Akiki, C., Mitchell, M., Jernite, Y.: Stable bias: evaluating societal representations in diffusion models. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  20. Mantri, K.S.I., Sasikumar, N.: Interactive fashion content generation using LLMs and latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (2023)

  21. Naik, R., Nushi, B.: Social biases through the text-to-image generation lens. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 786–808. AIES 2023, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3600211.3604711

  22. Nair, N.G., et al.: Steered diffusion: a generalized framework for plug-and-play conditional image synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20850–20860 (2023)

  23. Nichol, A., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In: Proceedings of Machine Learning Research, pp. 16784–16804 (2022)

  24. Orgad, H., Kawar, B., Belinkov, Y.: Editing implicit assumptions in text-to-image diffusion models. In: 2023 IEEE/CVF International Conference on Computer Vision, pp. 7030–7038. IEEE Computer Society, Los Alamitos, CA, USA (2023). https://doi.org/10.1109/ICCV51070.2023.00649

  25. Perera, M.V., Patel, V.M.: Analyzing bias in diffusion-based face generation models. arXiv preprint arXiv:2305.06402 (2023)

  26. Phung, Q., Ge, S., Huang, J.B.: Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427 (2023)

  27. Podell, D., et al.: SDXL: improving latent diffusion models for high-resolution image synthesis. In: International Conference on Learning Representations (2024). https://openreview.net/forum?id=di52zR8xgf

  28. Qin, J., et al.: DiffusionGPT: LLM-driven text-to-image generation system. arXiv preprint arXiv:2401.10061 (2024)

  29. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)

  30. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)

  31. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)

  32. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: Advances in Neural Information Processing Systems, vol. 35, pp. 36479–36494 (2022)

  33. Smith, J.S., et al.: Continual diffusion: continual customization of text-to-image diffusion with C-LoRA. arXiv preprint arXiv:2304.06027 (2023)

  34. Su, X., et al.: Unbiased image synthesis via manifold-driven sampling in diffusion models. arXiv preprint arXiv:2307.08199 (2023)

  35. Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837 (2022)

  36. Wu, T.H., Lian, L., Gonzalez, J.E., Li, B., Darrell, T.: Self-correcting LLM-controlled diffusion models. arXiv preprint arXiv:2311.16090 (2023)

  37. Xie, J., et al.: BoxDiff: text-to-image synthesis with training-free box-constrained diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7452–7461 (2023)

  38. Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., Cui, B.: Mastering text-to-image diffusion: recaptioning, planning, and generating with multimodal LLMs. arXiv preprint arXiv:2401.11708 (2024)

  39. Yang, L., et al.: Diffusion models: a comprehensive survey of methods and applications. ACM Comput. Surv. 56(4) (2023). https://doi.org/10.1145/3626235

  40. Yang, Z., et al.: ReCo: region-controlled text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14246–14255 (2023)

  41. Zhang, C., Zhang, C., Zhang, M., Kweon, I.S.: Text-to-image diffusion model in generative AI: a survey. arXiv preprint arXiv:2303.07909 (2023)

  42. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)

  43. Zhang, T., Wang, Z., Huang, J., Tasnim, M.M., Shi, W.: A survey of diffusion based image generation models: issues and their solutions. arXiv preprint arXiv:2308.13142 (2023)


Acknowledgment

This work is partially supported by the National Science and Technology Council, Taiwan, under Grants NSTC-112-2221-E-A49-059-MY3 and NSTC-112-2221-E-A49-094-MY3, and by the Ministry of Science and Technology of Taiwan under Grants MOST-109-2221-E-009-114-MY3, MOST-110-2221-E-A49-164, MOST-109-2223-E-009-002-MY3, MOST-110-2218-E-A49-018, and MOST-111-2634-F-007-002.

Author information

Corresponding author

Correspondence to Yi Yao.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 16217 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yao, Y. et al. (2025). The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15080. Springer, Cham. https://doi.org/10.1007/978-3-031-72670-5_24

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72670-5_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72669-9

  • Online ISBN: 978-3-031-72670-5

  • eBook Packages: Computer Science, Computer Science (R0)
