
Scene Graph Masked Variational Autoencoders for 3D Scene Generation

Published: 27 October 2023

ABSTRACT

Generating realistic 3D indoor scenes requires a deep understanding of objects and their spatial relationships. However, existing methods often fail to generate realistic 3D scenes owing to a limited understanding of object relationships. To tackle this problem, we propose a Scene Graph Masked Variational Auto-Encoder (SG-MVAE) framework that fully captures the relationships between objects to generate more realistic 3D scenes. Specifically, we first introduce a relationship completion module that adaptively learns the missing relationships between objects in the scene graph. To predict the missing relationships accurately, we employ multi-group attention to capture the correlations between objects with missing relationships and the other objects in the scene. After obtaining the complete scene relationships, we mask the relationships between objects and use a decoder to reconstruct the scene. This reconstruction process deepens the model's understanding of relationships, enabling it to generate more realistic scenes. Extensive experiments on benchmark datasets show that our model outperforms state-of-the-art methods.
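As a rough illustration of the masking-and-reconstruction idea described in the abstract, the sketch below masks a fraction of relationship (edge) embeddings in a scene graph and reconstructs the masked predicates from object context. This is not the authors' implementation: the module name, the use of standard multi-head attention (standing in for the paper's multi-group attention), the feature dimensions, and the cross-entropy objective over predicate classes are all illustrative assumptions.

```python
# Minimal sketch of relationship masking + reconstruction, assuming a simple
# scene-graph encoding: per-object features and per-edge relation embeddings.
# Hypothetical module; not the SG-MVAE implementation from the paper.
import torch
import torch.nn as nn


class MaskedRelationAutoencoder(nn.Module):
    """Masks a fraction of edge (relationship) embeddings and reconstructs the
    masked predicates from object context via attention."""

    def __init__(self, dim=128, num_heads=4, num_rel_classes=16):
        super().__init__()
        # Learnable token substituted for each masked relationship embedding.
        self.mask_token = nn.Parameter(torch.zeros(dim))
        # Each (possibly masked) edge attends to all object features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_rel_classes)
        )

    def forward(self, obj_feats, rel_feats, mask_ratio=0.4):
        # obj_feats: (B, N_obj, dim); rel_feats: (B, N_rel, dim)
        B, R, D = rel_feats.shape
        mask = torch.rand(B, R, device=rel_feats.device) < mask_ratio
        masked = torch.where(
            mask.unsqueeze(-1), self.mask_token.expand(B, R, D), rel_feats
        )
        # query = edges, key/value = objects: edges recover their relationship
        # from the surrounding object context.
        ctx, _ = self.attn(masked, obj_feats, obj_feats)
        return self.decoder(ctx), mask  # logits: (B, N_rel, num_rel_classes)


# Training step: cross-entropy on the masked edges only, mirroring common
# masked-autoencoder practice (shapes and labels here are hypothetical).
model = MaskedRelationAutoencoder()
obj = torch.randn(2, 8, 128)            # 2 scenes, 8 objects each
rel = torch.randn(2, 12, 128)           # 12 relationship edges per scene
labels = torch.randint(0, 16, (2, 12))  # ground-truth predicate classes
logits, mask = model(obj, rel)
loss = nn.functional.cross_entropy(logits[mask], labels[mask])
```

In the paper's full pipeline the decoder reconstructs the scene itself rather than predicate labels alone; the sketch isolates only the relationship-masking mechanism under the stated assumptions.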


Published in
      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

      Copyright © 2023 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions (24%)
