ABSTRACT
Generating realistic 3D indoor scenes requires a deep understanding of objects and their spatial relationships. However, existing methods often fail to produce realistic 3D scenes because they model object relationships only partially. To address this problem, we propose a Scene Graph Masked Variational Auto-Encoder (SG-MVAE) framework that fully captures the relationships between objects to generate more realistic 3D scenes. Specifically, we first introduce a relationship completion module that adaptively learns the missing relationships between objects in the scene graph. To predict these missing relationships accurately, we employ multi-group attention to capture the correlations between objects with missing relationships and the other objects in the scene. After obtaining the completed scene relationships, we mask the relationships between objects and use a decoder to reconstruct the scene. This reconstruction process strengthens the model's understanding of object relationships and yields more realistic scenes. Extensive experiments on benchmark datasets show that our model outperforms state-of-the-art methods.
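The masking step described above (hide relationship labels, then train a decoder to reconstruct them) can be sketched in a few lines. This is a minimal, framework-free illustration, not the paper's implementation: the function name, the `[MASK]` token, and the toy bedroom scene graph are assumptions made for the example.

```python
import random

def mask_relationships(edges, mask_ratio=0.4, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of scene-graph relationship labels with a
    mask token. A decoder would then be trained to reconstruct the hidden
    labels, forcing the model to learn inter-object relationships."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(edges) * mask_ratio))
    masked_idx = set(rng.sample(range(len(edges)), n_mask))
    masked, targets = [], {}
    for i, (subj, rel, obj) in enumerate(edges):
        if i in masked_idx:
            masked.append((subj, mask_token, obj))  # hide the relationship
            targets[i] = rel                        # keep ground truth for the loss
        else:
            masked.append((subj, rel, obj))
    return masked, targets

# Toy scene graph: (subject, relationship, object) triples.
edges = [("bed", "left of", "nightstand"),
         ("lamp", "on top of", "nightstand"),
         ("chair", "in front of", "desk"),
         ("rug", "under", "bed")]
masked, targets = mask_relationships(edges, mask_ratio=0.5)
```

Only the relationship labels are masked; the object nodes stay visible, so reconstruction must rely on context from the rest of the scene graph.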
Index Terms
- Scene Graph Masked Variational Autoencoders for 3D Scene Generation