BlenderAlchemy: Editing 3D Graphics with Vision-Language Models

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Graphics design is important for various applications, including movie production and game design. To create a high-quality scene, designers usually need to spend hours in software like Blender, in which they might need to interleave and repeat operations, such as connecting material nodes, hundreds of times. Moreover, slightly different design goals may require completely different sequences, making automation difficult. In this paper, we propose a system that leverages Vision-Language Models (VLMs), like GPT-4V, to intelligently search the design action space to arrive at an answer that can satisfy a user’s intent. Specifically, we design a vision-based edit generator and state evaluator to work together to find the correct sequence of actions to achieve the goal. Inspired by the role of visual imagination in the human design process, we supplement the visual reasoning capabilities of VLMs with “imagined” reference images from image-generation models, providing visual grounding of abstract language descriptions. In this paper, we provide empirical evidence suggesting our system can produce simple but tedious Blender editing sequences for tasks such as editing procedural materials and geometry from text and/or reference images, as well as adjusting lighting configurations for product renderings in complex scenes (For project website and code, please go to: https://ianhuang0630.github.io/BlenderAlchemyWeb/).
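To make the search loop concrete, below is a minimal Python sketch of the generator/evaluator interplay the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: every helper (propose_edits, render_state, choose_best) is a hypothetical stand-in for a VLM prompt or a headless Blender render.

    from typing import List

    def propose_edits(script: str, render: bytes, intent: str, n: int) -> List[str]:
        """Hypothetical stand-in: prompt a VLM (e.g., GPT-4V) with the current
        Blender Python script, its render, and the user's intent; return n
        candidate edited scripts."""
        raise NotImplementedError

    def render_state(script: str) -> bytes:
        """Hypothetical stand-in: execute the script in headless Blender and
        return the rendered image."""
        raise NotImplementedError

    def choose_best(scripts: List[str], renders: List[bytes], intent: str) -> str:
        """Hypothetical stand-in: ask a VLM which render best matches the
        intent (optionally grounded by an "imagined" reference image) and
        return the corresponding script."""
        raise NotImplementedError

    def refine(initial_script: str, intent: str, depth: int = 3, breadth: int = 4) -> str:
        """Iteratively search the space of script edits toward the user's intent."""
        best = initial_script
        for _ in range(depth):
            # Edit generator: propose several candidate variants of the current script.
            candidates = propose_edits(best, render_state(best), intent, n=breadth)
            # State evaluator: render every candidate (plus the incumbent, so a
            # round can never regress) and keep the visually closest match.
            pool = candidates + [best]
            best = choose_best(pool, [render_state(s) for s in pool], intent)
        return best

Keeping the incumbent script in the candidate pool makes each round monotone under the evaluator's judgment, mirroring the tweak-and-compare workflow of a human designer.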

Notes

  1. At the time of publishing, works like 3D-GPT [42] and L3GO [53] had not yet open-sourced their code.

References

  1. BlenderGPT. https://github.com/gd3kr/BlenderGPT

  2. How long does it take to create a 3D model? https://3d-ace.com/blog/how-long-does-it-take-to-create-a-3d-model/

  3. How long does it take to make a 3D model? https://pixune.com/blog/how-long-does-it-take-to-create-a-3d-model/

  4. Ahn, M., et al.: Do as i can, not as i say: grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)

  5. Austin, J., et al.: Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)

  6. Baumli, K., et al.: Vision-language models as a source of rewards. arXiv preprint arXiv:2312.09187 (2023)

  7. Betker, J., et al.: Improving image generation with better captions. Comput. Sci. 2(3), 8 (2023). https://cdn.openai.com/papers/dall-e-3.pdf

  8. Chen, D.Z., Siddiqui, Y., Lee, H.Y., Tulyakov, S., Nießner, M.: Text2tex: text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396 (2023)

  9. Chen, M., et al.: Evaluating large language models trained on code (2021)

  10. Chen, Y., Chen, R., Lei, J., Zhang, Y., Jia, K.: Tango: text-driven photorealistic and robust 3D stylization via lighting decomposition. Adv. Neural. Inf. Process. Syst. 35, 30923–30936 (2022)

  11. De La Torre, F., Fang, C.M., Huang, H., Banburski-Fahey, A., Fernandez, J.A., Lanier, J.: LLMR: real-time prompting of interactive worlds using large language models. arXiv preprint arXiv:2309.12276 (2023)

  12. Firoozi, R., et al.: Foundation models in robotics: applications, challenges, and the future. arXiv preprint arXiv:2312.07843 (2023)

  13. Fu, C., et al.: MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)

  14. Fu, C., et al.: A challenger to GPT-4V? Early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436 (2023)

  15. Goel, P., Wang, K.C., Liu, C.K., Fatahalian, K.: Iterative motion editing with natural language. arXiv preprint arXiv:2312.11538 (2023)

  16. Guerrero, P., Hašan, M., Sunkavalli, K., Měch, R., Boubekeur, T., Mitra, N.J.: Matformer: a generative model for procedural materials. arXiv preprint arXiv:2207.01044 (2022)

  17. Henzler, P., Deschaintre, V., Mitra, N.J., Ritschel, T.: Generative modelling of BRDF textures from flash images. arXiv preprint arXiv:2102.11861 (2021)

  18. Hu, Y., et al.: Toward general-purpose robots via foundation models: a survey and meta-analysis. arXiv preprint arXiv:2312.08782 (2023)

  19. Hu, Y., Guerrero, P., Hasan, M., Rushmeier, H., Deschaintre, V.: Node graph optimization using differentiable proxies. In: ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–9 (2022)

  20. Hu, Y., He, C., Deschaintre, V., Dorsey, J., Rushmeier, H.: An inverse procedural modeling pipeline for SVBRDF maps. ACM Trans. Graph. (TOG) 41(2), 1–17 (2022)

  21. Huang, I., Krishna, V., Atekha, O., Guibas, L.: Aladdin: zero-shot hallucination of stylized 3D assets from abstract scene descriptions. arXiv preprint arXiv:2306.06212 (2023)

  22. Jiang, A.Q., et al.: Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)

  23. Li, C., et al.: LLaVA-MED: training a large language-and-vision assistant for biomedicine in one day. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  24. Liang, J., et al.: Code as policies: language model programs for embodied control. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493–9500. IEEE (2023)

  25. Liu, J., et al.: Perception-driven procedural texture generation from examples. Neurocomputing 291, 21–34 (2018)

  26. Olausson, T.X., Inala, J.P., Wang, C., Gao, J., Solar-Lezama, A.: Is self-repair a silver bullet for code generation? In: The Twelfth International Conference on Learning Representations (2023)

  27. OpenAI: GPT-4 system card. OpenAI (2023). https://cdn.openai.com/papers/gpt-4-system-card.pdf

  28. OpenAI: GPT-4V(ision) system card. OpenAI (2023). https://api.semanticscholar.org/CorpusID:263218031

  29. Park, J.S., O’Brien, J., Cai, C.J., Morris, M.R., Liang, P., Bernstein, M.S.: Generative agents: interactive simulacra of human behavior. In: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22 (2023)

  30. Patil, S.G., Zhang, T., Wang, X., Gonzalez, J.E.: Gorilla: large language model connected with massive APIs. arXiv preprint arXiv:2305.15334 (2023)

  31. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)

  32. Raistrick, A., et al.: Infinite photorealistic worlds using procedural generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12630–12641 (2023)

  33. Richardson, E., Metzer, G., Alaluf, Y., Giryes, R., Cohen-Or, D.: Texture: text-guided texturing of 3D shapes. arXiv preprint arXiv:2302.01721 (2023)

  34. Ritchie, D., et al.: Neurosymbolic models for computer graphics. In: Computer Graphics Forum, vol. 42, pp. 545–568. Wiley Online Library (2023)

  35. Romera-Paredes, B., et al.: Mathematical discoveries from program search with large language models. Nature 625(7995), 468–475 (2024)

  36. Schick, T., et al.: Toolformer: language models can teach themselves to use tools. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  37. Sharma, P., et al.: Alchemist: parametric control of material properties with diffusion models. arXiv preprint arXiv:2312.02970 (2023)

  38. Shi, L., et al.: Match: differentiable material graphs for procedural material capture. ACM Trans. Graph. (TOG) 39(6), 1–15 (2020)

  39. Shimizu, E., Fisher, M., Paris, S., McCann, J., Fatahalian, K.: Design adjectives: a framework for interactive model-guided exploration of parameterized design spaces. In: Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pp. 261–278 (2020)

  40. Shinn, N., Cassano, F., Labash, B., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366 (2023)

  41. Singh, I., et al.: Progprompt: generating situated robot task plans using large language models. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11523–11530. IEEE (2023)

  42. Sun, C., Han, J., Deng, W., Wang, X., Qin, Z., Gould, S.: 3D-GPT: procedural 3D modeling with large language models. arXiv preprint arXiv:2310.12945 (2023)

  43. Tchapmi, L.P., Ray, T., Tchapmi, M., Shen, B., Martin-Martin, R., Savarese, S.: Generating procedural 3D materials from images using neural networks. In: 2022 4th International Conference on Image, Video and Signal Processing, pp. 32–40 (2022)

  44. Team, G., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  45. Touvron, H., et al.: Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  46. Vecchio, G., et al.: Controlmat: a controlled generative approach to material capture. arXiv preprint arXiv:2309.01700 (2023)

  47. Vecchio, G., Sortino, R., Palazzo, S., Spampinato, C.: Matfuse: controllable material generation with diffusion models. arXiv preprint arXiv:2308.11408 (2023)

  48. Wang, G., et al.: Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 (2023)

  49. Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural. Inf. Process. Syst. 35, 24824–24837 (2022)

  50. Wen, Z., Liu, Z., Sridhar, S., Fu, R.: Anyhome: open-vocabulary generation of structured and textured 3D homes. arXiv preprint arXiv:2312.06644 (2023)

  51. Wu, T., et al.: GPT-4V(ision) is a human-aligned evaluator for text-to-3D generation. arXiv preprint arXiv:2401.04092 (2024)

  52. Xiao, X., et al.: Robot learning in the era of foundation models: a survey. arXiv preprint arXiv:2311.14379 (2023)

  53. Yamada, Y., Chandu, K., Lin, Y., Hessel, J., Yildirim, I., Choi, Y.: L3GO: language agents with chain-of-3D-thoughts for generating unconventional objects. arXiv preprint arXiv:2402.09052 (2024)

  54. Yang, H., Chen, Y., Pan, Y., Yao, T., Chen, Z., Mei, T.: 3dstyle-diffusion: pursuing fine-grained text-driven 3D stylization with 2D diffusion models. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 6860–6868 (2023)

  55. Yang, Y., et al.: Holodeck: language guided generation of 3D embodied AI environments. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), vol. 30, pp. 20–25. IEEE/CVF (2024)

  56. Yang, Z., et al.: The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421 (2023)

  57. Yao, S., et al.: Tree of thoughts: deliberate problem solving with large language models. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  58. Yin, S., et al.: A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023)

  59. Yin, S., et al.: Woodpecker: hallucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045 (2023)

  60. Zeng, X.: Paint3d: paint anything 3D with lighting-less texture diffusion models. arXiv preprint arXiv:2312.13913 (2023)

  61. Zhou, H., et al.: Language-conditioned learning for robotic manipulation: a survey. arXiv preprint arXiv:2312.10807 (2023)

  62. Zsolnai-Fehér, K., Wonka, P., Wimmer, M.: Gaussian material synthesis. arXiv preprint arXiv:1804.08369 (2018)

Acknowledgements

We acknowledge the support of ARL grant W911NF-21-2-0104 and a Vannevar Bush Faculty Fellowship. We’d additionally like to thank Maneesh Agrawala for general discussions, and Purvi Goel, Mika Uy, Vishnu Sarukkai, Fan-yun Sun and Sharon Lee for feedback on paper revisions.

Author information

Correspondence to Ian Huang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 67,015 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Huang, I., Yang, G., Guibas, L. (2025). BlenderAlchemy: Editing 3D Graphics with Vision-Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15147. Springer, Cham. https://doi.org/10.1007/978-3-031-73024-5_18

  • DOI: https://doi.org/10.1007/978-3-031-73024-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73023-8

  • Online ISBN: 978-3-031-73024-5

  • eBook Packages: Computer Science, Computer Science (R0)
