BlenderAlchemy: Editing 3D Graphics with Vision-Language Models

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Graphics design is important for various applications, including movie production and game design. To create a high-quality scene, designers usually need to spend hours in software like Blender, in which they might need to interleave and repeat operations, such as connecting material nodes, hundreds of times. Moreover, slightly different design goals may require completely different sequences, making automation difficult. In this paper, we propose a system that leverages Vision-Language Models (VLMs), like GPT-4V, to intelligently search the design action space to arrive at an answer that can satisfy a user’s intent. Specifically, we design a vision-based edit generator and state evaluator to work together to find the correct sequence of actions to achieve the goal. Inspired by the role of visual imagination in the human design process, we supplement the visual reasoning capabilities of VLMs with “imagined” reference images from image-generation models, providing visual grounding of abstract language descriptions. In this paper, we provide empirical evidence suggesting our system can produce simple but tedious Blender editing sequences for tasks such as editing procedural materials and geometry from text and/or reference images, as well as adjusting lighting configurations for product renderings in complex scenes (For project website and code, please go to: https://ianhuang0630.github.io/BlenderAlchemyWeb/).
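To make the search loop concrete, below is a minimal Python sketch of the generator/evaluator interplay the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: every helper (propose_edits, render_state, choose_best) is a hypothetical stand-in for a VLM prompt or a headless Blender render.

    from typing import List

    def propose_edits(script: str, render: bytes, intent: str, n: int) -> List[str]:
        """Hypothetical stand-in: prompt a VLM (e.g., GPT-4V) with the current
        Blender Python script, its render, and the user's intent; return n
        candidate edited scripts."""
        raise NotImplementedError

    def render_state(script: str) -> bytes:
        """Hypothetical stand-in: execute the script in headless Blender and
        return the rendered image."""
        raise NotImplementedError

    def choose_best(scripts: List[str], renders: List[bytes], intent: str) -> str:
        """Hypothetical stand-in: ask a VLM which render best matches the
        intent (optionally grounded by an "imagined" reference image) and
        return the corresponding script."""
        raise NotImplementedError

    def refine(initial_script: str, intent: str, depth: int = 3, breadth: int = 4) -> str:
        """Iteratively search the space of script edits toward the user's intent."""
        best = initial_script
        for _ in range(depth):
            # Edit generator: propose several candidate variants of the current script.
            candidates = propose_edits(best, render_state(best), intent, n=breadth)
            # State evaluator: render every candidate (plus the incumbent, so a
            # round can never regress) and keep the visually closest match.
            pool = candidates + [best]
            best = choose_best(pool, [render_state(s) for s in pool], intent)
        return best

Keeping the incumbent script in the candidate pool makes each round monotone under the evaluator's judgment, mirroring the tweak-and-compare workflow of a human designer.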

Notes

  1. At the time of publishing, works like 3D-GPT [42] and L3GO [53] had not yet open-sourced their code.

References

  1. BlenderGPT. https://github.com/gd3kr/BlenderGPT

  2. How long does it take to create a 3D model? https://3d-ace.com/blog/how-long-does-it-take-to-create-a-3d-model/

  3. How long does it take to make a 3D model? https://pixune.com/blog/how-long-does-it-take-to-create-a-3d-model/

  4. Ahn, M., et al.: Do as i can, not as i say: grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)

  5. Austin, J., et al.: Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)

  6. Baumli, K., et al.: Vision-language models as a source of rewards. arXiv preprint arXiv:2312.09187 (2023)

  7. Betker, J., et al.: Improving image generation with better captions. Comput. Sci. 2(3), 8 (2023). https://cdn.openai.com/papers/dall-e-3.pdf

  8. Chen, D.Z., Siddiqui, Y., Lee, H.Y., Tulyakov, S., Nießner, M.: Text2tex: text-driven texture synthesis via diffusion models. arXiv preprint arXiv:2303.11396 (2023)

  9. Chen, M., et al.: Evaluating large language models trained on code (2021)

  10. Chen, Y., Chen, R., Lei, J., Zhang, Y., Jia, K.: Tango: text-driven photorealistic and robust 3D stylization via lighting decomposition. Adv. Neural. Inf. Process. Syst. 35, 30923–30936 (2022)

  11. De La Torre, F., Fang, C.M., Huang, H., Banburski-Fahey, A., Fernandez, J.A., Lanier, J.: LLMR: real-time prompting of interactive worlds using large language models. arXiv preprint arXiv:2309.12276 (2023)

  12. Firoozi, R., et al.: Foundation models in robotics: applications, challenges, and the future. arXiv preprint arXiv:2312.07843 (2023)

  13. Fu, C., et al.: MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)

  14. Fu, C., et al.: A challenger to GPT-4V? Early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436 (2023)

  15. Goel, P., Wang, K.C., Liu, C.K., Fatahalian, K.: Iterative motion editing with natural language. arXiv preprint arXiv:2312.11538 (2023)

  16. Guerrero, P., Hašan, M., Sunkavalli, K., Měch, R., Boubekeur, T., Mitra, N.J.: Matformer: a generative model for procedural materials. arXiv preprint arXiv:2207.01044 (2022)

  17. Henzler, P., Deschaintre, V., Mitra, N.J., Ritschel, T.: Generative modelling of BRDF textures from flash images. arXiv preprint arXiv:2102.11861 (2021)

  18. Hu, Y., et al.: Toward general-purpose robots via foundation models: a survey and meta-analysis. arXiv preprint arXiv:2312.08782 (2023)

  19. Hu, Y., Guerrero, P., Hasan, M., Rushmeier, H., Deschaintre, V.: Node graph optimization using differentiable proxies. In: ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–9 (2022)

  20. Hu, Y., He, C., Deschaintre, V., Dorsey, J., Rushmeier, H.: An inverse procedural modeling pipeline for SVBRDF maps. ACM Trans. Graph. (TOG) 41(2), 1–17 (2022)

  21. Huang, I., Krishna, V., Atekha, O., Guibas, L.: Aladdin: zero-shot hallucination of stylized 3D assets from abstract scene descriptions. arXiv preprint arXiv:2306.06212 (2023)

  22. Jiang, A.Q., et al.: Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)

  23. Li, C., et al.: LLaVA-MED: training a large language-and-vision assistant for biomedicine in one day. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  24. Liang, J., et al.: Code as policies: language model programs for embodied control. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493–9500. IEEE (2023)

  25. Liu, J., et al.: Perception-driven procedural texture generation from examples. Neurocomputing 291, 21–34 (2018)

  26. Olausson, T.X., Inala, J.P., Wang, C., Gao, J., Solar-Lezama, A.: Is self-repair a silver bullet for code generation? In: The Twelfth International Conference on Learning Representations (2023)

  27. OpenAI: GPT-4 system card. OpenAI (2023). https://cdn.openai.com/papers/gpt-4-system-card.pdf

  28. OpenAI: GPT-4V(ision) system card. OpenAI (2023). https://api.semanticscholar.org/CorpusID:263218031

  29. Park, J.S., O’Brien, J., Cai, C.J., Morris, M.R., Liang, P., Bernstein, M.S.: Generative agents: interactive simulacra of human behavior. In: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22 (2023)

  30. Patil, S.G., Zhang, T., Wang, X., Gonzalez, J.E.: Gorilla: large language model connected with massive APIs. arXiv preprint arXiv:2305.15334 (2023)

  31. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)

  32. Raistrick, A., et al.: Infinite photorealistic worlds using procedural generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12630–12641 (2023)

  33. Richardson, E., Metzer, G., Alaluf, Y., Giryes, R., Cohen-Or, D.: Texture: text-guided texturing of 3D shapes. arXiv preprint arXiv:2302.01721 (2023)

  34. Ritchie, D., et al.: Neurosymbolic models for computer graphics. In: Computer Graphics Forum, vol. 42, pp. 545–568. Wiley Online Library (2023)

  35. Romera-Paredes, B., et al.: Mathematical discoveries from program search with large language models. Nature 625(7995), 468–475 (2024)

  36. Schick, T., et al.: Toolformer: language models can teach themselves to use tools. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  37. Sharma, P., et al.: Alchemist: parametric control of material properties with diffusion models. arXiv preprint arXiv:2312.02970 (2023)

  38. Shi, L., et al.: Match: differentiable material graphs for procedural material capture. ACM Trans. Graph. (TOG) 39(6), 1–15 (2020)

  39. Shimizu, E., Fisher, M., Paris, S., McCann, J., Fatahalian, K.: Design adjectives: a framework for interactive model-guided exploration of parameterized design spaces. In: Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pp. 261–278 (2020)

  40. Shinn, N., Cassano, F., Labash, B., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366 (2023)

  41. Singh, I., et al.: Progprompt: generating situated robot task plans using large language models. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11523–11530. IEEE (2023)

  42. Sun, C., Han, J., Deng, W., Wang, X., Qin, Z., Gould, S.: 3D-GPT: procedural 3D modeling with large language models. arXiv preprint arXiv:2310.12945 (2023)

  43. Tchapmi, L.P., Ray, T., Tchapmi, M., Shen, B., Martin-Martin, R., Savarese, S.: Generating procedural 3D materials from images using neural networks. In: 2022 4th International Conference on Image, Video and Signal Processing, pp. 32–40 (2022)

  44. Team, G., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  45. Touvron, H., et al.: Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  46. Vecchio, G., et al.: Controlmat: a controlled generative approach to material capture. arXiv preprint arXiv:2309.01700 (2023)

  47. Vecchio, G., Sortino, R., Palazzo, S., Spampinato, C.: Matfuse: controllable material generation with diffusion models. arXiv preprint arXiv:2308.11408 (2023)

  48. Wang, G., et al.: Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 (2023)

  49. Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural. Inf. Process. Syst. 35, 24824–24837 (2022)

  50. Wen, Z., Liu, Z., Sridhar, S., Fu, R.: Anyhome: open-vocabulary generation of structured and textured 3D homes. arXiv preprint arXiv:2312.06644 (2023)

  51. Wu, T., et al.: GPT-4V(ision) is a human-aligned evaluator for text-to-3D generation. arXiv preprint arXiv:2401.04092 (2024)

  52. Xiao, X., et al.: Robot learning in the era of foundation models: a survey. arXiv preprint arXiv:2311.14379 (2023)

  53. Yamada, Y., Chandu, K., Lin, Y., Hessel, J., Yildirim, I., Choi, Y.: L3GO: language agents with chain-of-3D-thoughts for generating unconventional objects. arXiv preprint arXiv:2402.09052 (2024)

  54. Yang, H., Chen, Y., Pan, Y., Yao, T., Chen, Z., Mei, T.: 3dstyle-diffusion: pursuing fine-grained text-driven 3D stylization with 2D diffusion models. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 6860–6868 (2023)

  55. Yang, Y., et al.: Holodeck: language guided generation of 3D embodied AI environments. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), vol. 30, pp. 20–25. IEEE/CVF (2024)

  56. Yang, Z., et al.: The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421 (2023)

  57. Yao, S., et al.: Tree of thoughts: deliberate problem solving with large language models. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  58. Yin, S., et al.: A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023)

  59. Yin, S., et al.: Woodpecker: hallucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045 (2023)

  60. Zeng, X.: Paint3d: paint anything 3D with lighting-less texture diffusion models. arXiv preprint arXiv:2312.13913 (2023)

  61. Zhou, H., et al.: Language-conditioned learning for robotic manipulation: a survey. arXiv preprint arXiv:2312.10807 (2023)

  62. Zsolnai-Fehér, K., Wonka, P., Wimmer, M.: Gaussian material synthesis. arXiv preprint arXiv:1804.08369 (2018)

Acknowledgements

We acknowledge the support of ARL grant W911NF-21-2-0104 and a Vannevar Bush Faculty Fellowship. We’d additionally like to thank Maneesh Agrawala for general discussions, and Purvi Goel, Mika Uy, Vishnu Sarukkai, Fan-yun Sun and Sharon Lee for feedback on paper revisions.

Author information

Correspondence to Ian Huang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 67,015 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Huang, I., Yang, G., Guibas, L. (2025). BlenderAlchemy: Editing 3D Graphics with Vision-Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15147. Springer, Cham. https://doi.org/10.1007/978-3-031-73024-5_18

  • DOI: https://doi.org/10.1007/978-3-031-73024-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73023-8

  • Online ISBN: 978-3-031-73024-5

  • eBook Packages: Computer Science, Computer Science (R0)
