
The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

In spite of recent advancements in text-to-image generation, limitations persist in handling complex and imaginative prompts due to the restricted diversity and complexity of training data. This work explores how diffusion models can generate images from prompts requiring artistic creativity or specialized knowledge. We introduce the Realistic-Fantasy Benchmark (RFBench), a novel evaluation framework blending realistic and fantastical scenarios. To address these challenges, we propose the Realistic-Fantasy Network (RFNet), a training-free approach integrating diffusion models with LLMs. Extensive human evaluations and GPT-based compositional assessments demonstrate our approach’s superiority over state-of-the-art methods. Our code and dataset are available at https://leo81005.github.io/Reality-and-Fantasy/.
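To make the general idea concrete, the sketch below illustrates the training-free pattern the abstract describes: a large language model first interprets an imaginative prompt into an explicit scene description, which an off-the-shelf diffusion model then renders. This is not the authors' RFNet; the OpenAI chat API, the "gpt-4o-mini" model name, the "runwayml/stable-diffusion-v1-5" checkpoint, and the system prompt are all assumptions made purely for illustration.

```python
# Illustrative sketch only: a generic LLM-assisted prompt-expansion step feeding a
# text-to-image diffusion model. This is NOT the paper's RFNet; model IDs and the
# system prompt below are assumptions chosen for the example.
import torch
from diffusers import StableDiffusionPipeline
from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def interpret_prompt(prompt: str) -> str:
    """Ask an LLM to unpack an imaginative prompt into an explicit scene description."""
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of LLM
        messages=[
            {"role": "system",
             "content": "Rewrite the user's prompt as a detailed, literal scene "
                        "description suitable for a text-to-image model."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

# Any text-to-image diffusion pipeline could be substituted here.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

detailed = interpret_prompt("a jellyfish umbrella glowing over a rainy neon street")
image = pipe(detailed, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("fantasy_scene.png")
```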

Y. Yao and C.-F. Hsu contributed equally to this work.

Notes

  1. One detailed sample can be found in our supplementary material.

  2. We adopt GPT4-CLIP due to BLIP-CLIP's [3] limitations in accurately capturing image meanings through generated captions.

  3. The widely recognized metric CLIPScore [10, 29] exhibits limitations in evaluating our task; a minimal sketch of how the metric is computed appears after these notes. For detailed examples, please see the supplementary materials.

  4. Details of survey samples can be found in our supplementary material.
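For context, the sketch below shows how CLIPScore [10] is typically computed from CLIP [29] embeddings: a scaled, clipped cosine similarity between the image and caption embeddings, which rewards coarse image-text agreement rather than fine-grained compositional or fantastical fidelity. This is a minimal illustration assuming the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint, not the evaluation code used in the paper.

```python
# Minimal CLIPScore sketch (Hessel et al. [10]): w * max(cos(image_emb, text_emb), 0), w = 2.5.
# Checkpoint choice is an assumption; any CLIP variant could be substituted.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """Reference-free CLIPScore between one image and one caption."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Normalize projected embeddings, then take their cosine similarity.
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(dim=-1).item()
    return w * max(cos, 0.0)

# Example usage:
# score = clip_score(Image.open("fantasy_scene.png"), "a jellyfish umbrella over a neon street")
```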

References

  1. Anciukevičius, T., et al.: RenderDiffusion: image diffusion for 3d reconstruction, inpainting and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12608–12618 (2023)

  2. Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: MultiDiffusion: fusing diffusion paths for controlled image generation. In: International Conference on Machine Learning (2023)

  3. Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: attention-based semantic guidance for text-to-image diffusion models. ACM Trans. Graph. (TOG) 42(4), 1–10 (2023)

  4. Feng, W., et al.: Training-free structured diffusion guidance for compositional text-to-image synthesis. In: International Conference on Learning Representations (2023). https://openreview.net/forum?id=PUIqjT4rzq7

  5. Feng, W., et al.: LayoutGPT: compositional visual planning and generation with large language models. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  6. Friedrich, F., et al.: Fair diffusion: instructing text-to-image generation models on fairness. arXiv preprint arXiv:2302.10893 (2023)

  7. Gani, H., Bhat, S.F., Naseer, M., Khan, S., Wonka, P.: LLM blueprint: enabling text-to-image generation with complex and detailed prompts. In: International Conference on Learning Representations (2024)

  8. Golnari, P.A.: LoRA-enhanced distillation on guided diffusion models. arXiv preprint arXiv:2312.06899 (2023)

  9. Gong, J., Foo, L.G., Fan, Z., Ke, Q., Rahmani, H., Liu, J.: DiffPose: toward more reliable 3D pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13041–13051 (2023)

  10. Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)

  11. Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: Advances in Neural Information Processing Systems Workshop (2021). https://openreview.net/forum?id=qw8AKxfYbI

  12. Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

  13. Huang, K., Sun, K., Xie, E., Li, Z., Liu, X.: T2I-CompBench: a comprehensive benchmark for open-world compositional text-to-image generation. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  14. Kemker, R., McClure, M., Abitino, A., Hayes, T., Kanan, C.: Measuring catastrophic forgetting in neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

  15. Qu, L., Wu, S., Fei, H., Nie, L., Chua, T.S.: LayoutLLM-T2I: eliciting layout guidance from LLM for text-to-image generation. In: Proceedings of the ACM International Conference on Multimedia (2023)

  16. Lian, L., Li, B., Yala, A., Darrell, T.: LLM-grounded diffusion: enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655 (2023)

  17. Lian, L., Shi, B., Yala, A., Darrell, T., Li, B.: LLM-grounded video diffusion models. arXiv preprint arXiv:2309.17444 (2023)

  18. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (2023)

  19. Luccioni, S., Akiki, C., Mitchell, M., Jernite, Y.: Stable bias: evaluating societal representations in diffusion models. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  20. Mantri, K.S.I., Sasikumar, N.: Interactive fashion content generation using LLMs and latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (2023)

  21. Naik, R., Nushi, B.: Social biases through the text-to-image generation lens. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 786–808. AIES 2023, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3600211.3604711

  22. Nair, N.G., et al.: Steered diffusion: a generalized framework for plug-and-play conditional image synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20850–20860 (2023)

  23. Nichol, A., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In: Proceedings of Machine Learning Research, pp. 16784–16804 (2022)

  24. Orgad, H., Kawar, B., Belinkov, Y.: Editing implicit assumptions in text-to-image diffusion models. In: 2023 IEEE/CVF International Conference on Computer Vision, pp. 7030–7038. IEEE Computer Society, Los Alamitos, CA, USA (2023). https://doi.org/10.1109/ICCV51070.2023.00649

  25. Perera, M.V., Patel, V.M.: Analyzing bias in diffusion-based face generation models. arXiv preprint arXiv:2305.06402 (2023)

  26. Phung, Q., Ge, S., Huang, J.B.: Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427 (2023)

  27. Podell, D., et al.: SDXL: improving latent diffusion models for high-resolution image synthesis. In: International Conference on Learning Representations (2024). https://openreview.net/forum?id=di52zR8xgf

  28. Qin, J., et al.: DiffusionGPT: LLM-driven text-to-image generation system. arXiv preprint arXiv:2401.10061 (2024)

  29. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)

  30. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)

  31. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)

  32. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: Advances in Neural Information Processing Systems, vol. 35, pp. 36479–36494 (2022)

  33. Smith, J.S., et al.: Continual diffusion: continual customization of text-to-image diffusion with C-LoRA. arXiv preprint arXiv:2304.06027 (2023)

  34. Su, X., et al.: Unbiased image synthesis via manifold-driven sampling in diffusion models. arXiv preprint arXiv:2307.08199 (2023)

  35. Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837 (2022)

  36. Wu, T.H., Lian, L., Gonzalez, J.E., Li, B., Darrell, T.: Self-correcting LLM-controlled diffusion models. arXiv preprint arXiv:2311.16090 (2023)

  37. Xie, J., et al.: BoxDiff: text-to-image synthesis with training-free box-constrained diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7452–7461 (2023)

  38. Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., Cui, B.: Mastering text-to-image diffusion: recaptioning, planning, and generating with multimodal LLMs. arXiv preprint arXiv:2401.11708 (2024)

  39. Yang, L., et al.: Diffusion models: a comprehensive survey of methods and applications. ACM Comput. Surv. 56(4) (2023). https://doi.org/10.1145/3626235

  40. Yang, Z., et al.: ReCo: region-controlled text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14246–14255 (2023)

  41. Zhang, C., Zhang, C., Zhang, M., Kweon, I.S.: Text-to-image diffusion model in generative AI: a survey. arXiv preprint arXiv:2303.07909 (2023)

  42. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)

  43. Zhang, T., Wang, Z., Huang, J., Tasnim, M.M., Shi, W.: A survey of diffusion based image generation models: issues and their solutions. arXiv preprint arXiv:2308.13142 (2023)


Acknowledgment

This work is partially supported by the National Science and Technology Council, Taiwan, under Grants NSTC-112-2221-E-A49-059-MY3 and NSTC-112-2221-E-A49-094-MY3, and by the Ministry of Science and Technology of Taiwan under Grants MOST-109-2221-E-009-114-MY3, MOST-110-2221-E-A49-164, MOST-109-2223-E-009-002-MY3, MOST-110-2218-E-A49-018, and MOST-111-2634-F-007-002.

Author information

Corresponding author

Correspondence to Yi Yao.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 16217 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yao, Y. et al. (2025). The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15080. Springer, Cham. https://doi.org/10.1007/978-3-031-72670-5_24

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72670-5_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72669-9

  • Online ISBN: 978-3-031-72670-5

  • eBook Packages: Computer Science, Computer Science (R0)
