ABSTRACT
Generating realistic 3D indoor scenes requires a deep understanding of objects and their spatial relationships. However, existing methods often fail to produce realistic 3D scenes because they model object relationships only partially. To address this problem, we propose a Scene Graph Masked Variational Auto-Encoder (SG-MVAE) framework that fully captures the relationships between objects to generate more realistic 3D scenes. Specifically, we first introduce a relationship completion module that adaptively learns the missing relationships between objects in the scene graph. To predict these missing relationships accurately, we employ multi-group attention to capture the correlations between objects with missing relationships and the other objects in the scene. After obtaining the completed scene relationships, we mask the relationships between objects and use a decoder to reconstruct the scene. This reconstruction process strengthens the model's understanding of object relationships and yields more realistic scenes. Extensive experiments on benchmark datasets show that our model outperforms state-of-the-art methods.
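The masking step described above (hide relationship labels, then train a decoder to reconstruct them) can be sketched in a few lines. This is a minimal, framework-free illustration, not the paper's implementation: the function name, the `[MASK]` token, and the toy bedroom scene graph are assumptions made for the example.

```python
import random

def mask_relationships(edges, mask_ratio=0.4, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of scene-graph relationship labels with a
    mask token. A decoder would then be trained to reconstruct the hidden
    labels, forcing the model to learn inter-object relationships."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(edges) * mask_ratio))
    masked_idx = set(rng.sample(range(len(edges)), n_mask))
    masked, targets = [], {}
    for i, (subj, rel, obj) in enumerate(edges):
        if i in masked_idx:
            masked.append((subj, mask_token, obj))  # hide the relationship
            targets[i] = rel                        # keep ground truth for the loss
        else:
            masked.append((subj, rel, obj))
    return masked, targets

# Toy scene graph: (subject, relationship, object) triples.
edges = [("bed", "left of", "nightstand"),
         ("lamp", "on top of", "nightstand"),
         ("chair", "in front of", "desk"),
         ("rug", "under", "bed")]
masked, targets = mask_relationships(edges, mask_ratio=0.5)
```

Only the relationship labels are masked; the object nodes stay visible, so reconstruction must rely on context from the rest of the scene graph.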
Index Terms
- Scene Graph Masked Variational Autoencoders for 3D Scene Generation