DOI: 10.1145/3581783.3612262

Scene Graph Masked Variational Autoencoders for 3D Scene Generation

Published: 27 October 2023

Abstract

Generating realistic 3D indoor scenes requires a deep understanding of objects and their spatial relationships. However, existing methods often fail to generate realistic 3D scenes because they capture object relationships only partially. To tackle this problem, we propose a Scene Graph Masked Variational Auto-Encoder (SG-MVAE) framework that fully captures the relationships between objects to generate more realistic 3D scenes. Specifically, we first introduce a relationship completion module that adaptively learns the missing relationships between objects in the scene graph. To predict these missing relationships accurately, we employ multi-group attention to capture the correlations between objects with missing relationships and the other objects in the scene. After obtaining the complete scene relationships, we mask the relationships between objects and use a decoder to reconstruct the scene. This reconstruction process deepens the model's understanding of object relationships, yielding more realistic scenes. Extensive experiments on benchmark datasets show that our model outperforms state-of-the-art methods.
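
To make the mask-and-reconstruct idea from the abstract concrete, below is a minimal sketch of a masked relationship autoencoder over a scene graph: object nodes are contextualized with attention, a fraction of relationship edges is replaced by a [MASK] token, and a decoder reconstructs the masked relationship labels. This is an illustrative simplification, not the paper's implementation: the vocabulary sizes, dimensions, and the single attention layer (standing in for the paper's multi-group attention) are assumptions, and the variational latent of the full SG-MVAE is omitted.

```python
# Illustrative sketch only (not the paper's code): a minimal masked
# relationship autoencoder over a scene graph.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedRelationAutoencoder(nn.Module):
    def __init__(self, num_obj_classes=32, num_rel_classes=16, dim=64, heads=4):
        super().__init__()
        self.obj_emb = nn.Embedding(num_obj_classes, dim)
        # One extra embedding row serves as the [MASK] token for relations.
        self.rel_emb = nn.Embedding(num_rel_classes + 1, dim)
        self.mask_id = num_rel_classes
        # Plain self-attention over objects stands in for the paper's
        # multi-group attention (an assumption made for brevity).
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.decode_rel = nn.Linear(3 * dim, num_rel_classes)

    def forward(self, obj_labels, edges, rel_labels, mask_ratio=0.4):
        # obj_labels: (B, N) object category ids
        # edges:      (E, 2) subject/object node indices, shared across batch
        # rel_labels: (B, E) relationship category id per edge
        x = self.obj_emb(obj_labels)                 # (B, N, D)
        ctx, _ = self.attn(x, x, x)                  # contextualized objects
        # Randomly replace ~mask_ratio of relation labels with [MASK].
        keep = torch.rand_like(rel_labels.float()) > mask_ratio
        masked = torch.where(keep, rel_labels,
                             torch.full_like(rel_labels, self.mask_id))
        r = self.rel_emb(masked)                     # (B, E, D)
        subj, obj = ctx[:, edges[:, 0]], ctx[:, edges[:, 1]]
        logits = self.decode_rel(torch.cat([subj, r, obj], dim=-1))
        # Reconstruct only the masked relations, as in masked modeling.
        if (~keep).any():
            loss = F.cross_entropy(logits[~keep], rel_labels[~keep])
        else:
            loss = logits.sum() * 0.0
        return logits, loss


# Toy usage: 2 scenes, 5 objects each, 6 relationship edges.
model = MaskedRelationAutoencoder()
objs = torch.randint(0, 32, (2, 5))
edges = torch.randint(0, 5, (6, 2))
rels = torch.randint(0, 16, (2, 6))
logits, loss = model(objs, edges, rels)
loss.backward()
```

In the full framework, masking is preceded by the relationship completion module and the decoder reconstructs the scene itself rather than relation labels alone; the sketch only isolates the masked-reconstruction training signal that the abstract credits with improving relationship understanding.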


Cited By

  • (2024) Context-Aware Indoor Point Cloud Object Generation through User Instructions. In Proceedings of the 32nd ACM International Conference on Multimedia, 10182-10190. DOI: 10.1145/3664647.3681699. Online publication date: 28 October 2024.

    Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. 3d scene
    2. deep learning
    3. generative model
    4. scene graph

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall acceptance rate: 2,145 of 8,556 submissions (25%)

    Article Metrics

    • Downloads (last 12 months): 163
    • Downloads (last 6 weeks): 13

    Reflects downloads up to 20 January 2025.
