DOI: 10.1145/3664647.3681022

3D Question Answering for City Scene Understanding

Published: 28 October 2024

Abstract

3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. Existing research has focused primarily on indoor household tasks and outdoor roadside autonomous-driving tasks, leaving city-level scene understanding largely unexplored. Moreover, existing methods struggle to understand city scenes because spatial semantic information and human-environment interaction information are absent at the city level. To address these challenges, we investigate 3D MQA from both the dataset and the method perspective. From the dataset perspective, we introduce City-3DQA, a novel 3D MQA dataset for city-level scene understanding and the first to incorporate scene-semantic and human-environment interactive tasks within a city. From the method perspective, we propose Sg-CityU, a Scene graph enhanced City-level Understanding method that uses a scene graph to introduce spatial semantics. We report a new benchmark, on which Sg-CityU achieves accuracies of 63.94% and 63.76% in the two settings of City-3DQA. Compared with indoor 3D MQA methods and zero-shot baselines built on advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) robustness and generalization.
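To make the scene-graph idea concrete, the following is a minimal, hypothetical PyTorch sketch of how a scene graph can inject spatial semantics into a QA model: object node features are aggregated over the graph's adjacency matrix (a one-step graph convolution), then fused with a pooled question embedding via question-guided attention before answer classification. All class names, dimensions, and the architecture itself are illustrative assumptions, not the paper's actual Sg-CityU implementation.

```python
# Hypothetical sketch of a scene-graph-enhanced 3D QA pipeline.
# NOT the authors' Sg-CityU; names and dimensions are assumptions.
import torch
import torch.nn as nn

class SceneGraphQA(nn.Module):
    def __init__(self, node_dim=256, question_dim=256, hidden=256, num_answers=1000):
        super().__init__()
        # One graph-convolution step: neighbor features are mixed via
        # the scene graph's adjacency matrix before this projection.
        self.gcn = nn.Linear(node_dim, hidden)
        self.q_proj = nn.Linear(question_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, node_feats, adj, q_emb):
        # node_feats: (N, node_dim) object features from the city scene
        # adj:        (N, N) scene-graph adjacency encoding spatial relations
        # q_emb:      (question_dim,) pooled question embedding
        h = torch.relu(self.gcn(adj @ node_feats))   # spatially aware node states
        q = self.q_proj(q_emb)                       # project question into same space
        attn = torch.softmax(h @ q, dim=0)           # question-guided attention over nodes
        fused = (attn.unsqueeze(-1) * h).sum(dim=0)  # attended scene summary
        return self.classifier(fused)                # answer logits

# Toy usage: 5 scene-graph nodes with random features and a trivial graph.
model = SceneGraphQA()
logits = model(torch.randn(5, 256), torch.eye(5), torch.randn(256))
```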



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. 3d
  2. multimodal question answering
  3. scene understanding

Qualifiers

  • Research-article

Funding Sources

  • Hong Kong RIF
  • the Postdoctoral Fellowship Program of CPSF
  • Hong Kong CRF

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
