Abstract
In spite of recent advancements in text-to-image generation, limitations persist in handling complex and imaginative prompts due to the restricted diversity and complexity of training data. This work explores how diffusion models can generate images from prompts requiring artistic creativity or specialized knowledge. We introduce the Realistic-Fantasy Benchmark (RFBench), a novel evaluation framework blending realistic and fantastical scenarios. To address these challenges, we propose the Realistic-Fantasy Network (RFNet), a training-free approach integrating diffusion models with LLMs. Extensive human evaluations and GPT-based compositional assessments demonstrate our approach’s superiority over state-of-the-art methods. Our code and dataset are available at https://leo81005.github.io/Reality-and-Fantasy/.
Y. Yao, C.-F. Hsu—These authors contributed equally to this work.
Notes
- 1. One detailed sample can be found in our supplementary material.
- 2. We adopt GPT4-CLIP due to BLIP-CLIP’s [3] limitations in accurately capturing image meanings through generated captions.
- 3.
- 4. Details of survey samples can be found in our supplementary material.
References
Anciukevičius, T., et al.: RenderDiffusion: image diffusion for 3D reconstruction, inpainting and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12608–12618 (2023)
Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: MultiDiffusion: fusing diffusion paths for controlled image generation. In: International Conference on Machine Learning (2023)
Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: attention-based semantic guidance for text-to-image diffusion models. ACM Trans. Graph. (TOG) 42(4), 1–10 (2023)
Feng, W., et al.: Training-free structured diffusion guidance for compositional text-to-image synthesis. In: International Conference on Learning Representations (2023). https://openreview.net/forum?id=PUIqjT4rzq7
Feng, W., et al.: LayoutGPT: compositional visual planning and generation with large language models. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
Friedrich, F., et al.: Fair diffusion: instructing text-to-image generation models on fairness. arXiv preprint arXiv:2302.10893 (2023)
Gani, H., Bhat, S.F., Naseer, M., Khan, S., Wonka, P.: LLM blueprint: enabling text-to-image generation with complex and detailed prompts. In: International Conference on Learning Representations (2024)
Golnari, P.A.: LoRA-enhanced distillation on guided diffusion models. arXiv preprint arXiv:2312.06899 (2023)
Gong, J., Foo, L.G., Fan, Z., Ke, Q., Rahmani, H., Liu, J.: DiffPose: toward more reliable 3D pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13041–13051 (2023)
Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)
Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: Advances in Neural Information Processing Systems Workshop (2021). https://openreview.net/forum?id=qw8AKxfYbI
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)
Huang, K., Sun, K., Xie, E., Li, Z., Liu, X.: T2I-CompBench: a comprehensive benchmark for open-world compositional text-to-image generation. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
Kemker, R., McClure, M., Abitino, A., Hayes, T., Kanan, C.: Measuring catastrophic forgetting in neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Qu, L., Wu, S., Fei, H., Nie, L., Chua, T.S.: LayoutLLM-T2I: eliciting layout guidance from LLM for text-to-image generation. In: Proceedings of the ACM International Conference on Multimedia (2023)
Lian, L., Li, B., Yala, A., Darrell, T.: LLM-grounded diffusion: enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655 (2023)
Lian, L., Shi, B., Yala, A., Darrell, T., Li, B.: LLM-grounded video diffusion models. arXiv preprint arXiv:2309.17444 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (2023)
Luccioni, S., Akiki, C., Mitchell, M., Jernite, Y.: Stable bias: evaluating societal representations in diffusion models. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
Mantri, K.S.I., Sasikumar, N.: Interactive fashion content generation using LLMs and latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (2023)
Naik, R., Nushi, B.: Social biases through the text-to-image generation lens. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 786–808. AIES 2023, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3600211.3604711
Nair, N.G., et al.: Steered diffusion: a generalized framework for plug-and-play conditional image synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20850–20860 (2023)
Nichol, A., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In: Proceedings of Machine Learning Research, pp. 16784–16804 (2022)
Orgad, H., Kawar, B., Belinkov, Y.: Editing implicit assumptions in text-to-image diffusion models. In: 2023 IEEE/CVF International Conference on Computer Vision, pp. 7030–7038. IEEE Computer Society, Los Alamitos, CA, USA (2023). https://doi.org/10.1109/ICCV51070.2023.00649
Perera, M.V., Patel, V.M.: Analyzing bias in diffusion-based face generation models. arXiv preprint arXiv:2305.06402 (2023)
Phung, Q., Ge, S., Huang, J.B.: Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427 (2023)
Podell, D., et al.: SDXL: improving latent diffusion models for high-resolution image synthesis. In: International Conference on Learning Representations (2024). https://openreview.net/forum?id=di52zR8xgf
Qin, J., et al.: DiffusionGPT: LLM-driven text-to-image generation system. arXiv preprint arXiv:2401.10061 (2024)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: Advances in Neural Information Processing Systems, vol. 35, pp. 36479–36494 (2022)
Smith, J.S., et al.: Continual diffusion: continual customization of text-to-image diffusion with C-LoRA. arXiv preprint arXiv:2304.06027 (2023)
Su, X., et al.: Unbiased image synthesis via manifold-driven sampling in diffusion models. arXiv preprint arXiv:2307.08199 (2023)
Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837 (2022)
Wu, T.H., Lian, L., Gonzalez, J.E., Li, B., Darrell, T.: Self-correcting LLM-controlled diffusion models. arXiv preprint arXiv:2311.16090 (2023)
Xie, J., et al.: BoxDiff: text-to-image synthesis with training-free box-constrained diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7452–7461 (2023)
Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., Cui, B.: Mastering text-to-image diffusion: recaptioning, planning, and generating with multimodal LLMs. arXiv preprint arXiv:2401.11708 (2024)
Yang, L., et al.: Diffusion models: a comprehensive survey of methods and applications. ACM Comput. Surv. 56(4) (2023). https://doi.org/10.1145/3626235
Yang, Z., et al.: ReCo: region-controlled text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14246–14255 (2023)
Zhang, C., Zhang, C., Zhang, M., Kweon, I.S.: Text-to-image diffusion model in generative AI: a survey. arXiv preprint arXiv:2303.07909 (2023)
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)
Zhang, T., Wang, Z., Huang, J., Tasnim, M.M., Shi, W.: A survey of diffusion based image generation models: issues and their solutions. arXiv preprint arXiv:2308.13142 (2023)
Acknowledgment
This work is partially supported by the National Science and Technology Council, Taiwan, under Grants NSTC-112-2221-E-A49-059-MY3 and NSTC-112-2221-E-A49-094-MY3, and partially by the Ministry of Science and Technology of Taiwan under grant numbers MOST-109-2221-E-009-114-MY3, MOST-110-2221-E-A49-164, MOST-109-2223-E-009-002-MY3, MOST-110-2218-E-A49-018, and MOST-111-2634-F-007-002.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yao, Y. et al. (2025). The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15080. Springer, Cham. https://doi.org/10.1007/978-3-031-72670-5_24
DOI: https://doi.org/10.1007/978-3-031-72670-5_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72669-9
Online ISBN: 978-3-031-72670-5
eBook Packages: Computer Science (R0)