Multi-agent Embodied Question Answering in Interactive Environments

Tan, Sinan; Xiang, Weilai; Liu, Huaping; Guo, Di; Sun, Fuchun

doi:10.1007/978-3-030-58601-0_39

Sinan Tan^12,13,
Weilai Xiang¹⁴,
Huaping Liu^12,13,
Di Guo^12,13 &
…
Fuchun Sun^12,13

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12358))

Included in the following conference series:

European Conference on Computer Vision

3160 Accesses
10 Citations
3 Altmetric

Abstract

We investigate a new AI task—Multi-Agent Interactive Question Answering—where several agents explore the scene jointly in interactive environments to answer a question. To cooperate efficiently and answer accurately, agents must be well-organized to have balanced work division and share knowledge about the objects involved. We address this new problem in two stages: Multi-Agent 3D Reconstruction in Interactive Environments and Question Answering. Our proposed framework features multi-layer structural and semantic memories shared by all agents, as well as a question answering model built upon a 3D-CNN network to encode the scene memories. During the reconstruction, agents simultaneously explore and scan the scene with a clear division of work, organized by next viewpoints planning. We evaluate our framework on the IQuADv1 dataset and outperform the IQA baseline in a single-agent scenario. In multi-agent scenarios, our framework shows favorable speedups while remaining high accuracy.

S. Tan and W. Xiang—Equal contribution.

W. Xiang—This work was completed while Weilai Xiang was visiting Tsinghua University, Beijing.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Antol, S., et al.: VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)
Google Scholar
Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158 (2017)
Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 303–312 (1996)
Google Scholar
Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2054–2063 (2018)
Google Scholar
Dong, S., et al.: Multi-robot collaborative dense scene reconstruction. ACM Trans. Graph. (TOG) 38(4), 1–16 (2019)
Article Google Scholar
Foerster, J., Assael, I.A., De Freitas, N., Whiteson, S.: Learning to communicate with deep multi-agent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 2137–2145 (2016)
Google Scholar
Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., Farhadi, A.: IQA: visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4089–4098 (2018)
Google Scholar
Graham, B., van der Maaten, L.: Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307 (2017)
He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: The IEEE International Conference on Computer Vision (ICCV) October 2017
Google Scholar
Hou, J., Dai, A., Niessner, M.: 3D-sis: 3D semantic instance segmentation of RGB-D scans. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) June 2019
Google Scholar
Huang, P.H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.B.: Deepmvs: learning multi-view stereopsis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2821–2830 (2018)
Google Scholar
Izadi, S., et al.: Kinectfusion: real-time 3D reconstruction and interaction using a moving depth camera. In: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pp. 559–568 (2011)
Google Scholar
Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G.: TGIF-QA: toward spatio-temporal reasoning in visual question answering. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) July 2017
Google Scholar
Kolve, E., et al.: Ai2-thor: an interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474 (2017)
Malinowski, M., Rohrbach, M., Fritz, M.: Ask your neurons: a neural-based approach to answering questions about images. In: The IEEE International Conference on Computer Vision (ICCV) December 2015
Google Scholar
Mousavi, H.K., Nazari, M., Takáč, M., Motee, N.: Multi-agent image classification via reinforcement learning. arXiv preprint arXiv:1905.04835 (2019)
Savva, M., et al.: Habitat: a platform for embodied AI research. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9339–9347 (2019)
Google Scholar
Stone, P., Veloso, M.: Multiagent systems: a survey from a machine learning perspective. Autonom. Robot. 8(3), 345–383 (2000)
Article Google Scholar
Sukhbaatar, S., Fergus, R., et al.: Learning multiagent communication with backpropagation. In: Advances in Neural Information Processing Systems, pp. 2244–2252 (2016)
Google Scholar
Wu, Y., Wu, Y., Gkioxari, G., Tian, Y.: Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209 (2018)
Xia, F., et al.: Gibson Env V2: Embodied simulation environments for interactive navigation (2019)
Google Scholar
Yang, W., Wang, X., Farhadi, A., Gupta, A., Mottaghi, R.: Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543 (2018)
Zhao, Z., et al.: Video question answering via hierarchical spatio-temporal attention networks. In: IJCAI, pp. 3518–3524 (2017)
Google Scholar
Zheng, L., et al.: Active scene understanding via online semantic reconstruction. In: Computer Graphics Forum. vol. 38, pp. 103–114. Wiley Online Library (2019)
Google Scholar
Zhu, Y., et al.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364. IEEE (2017)
Google Scholar

Download references

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants U1613212 and 61703284.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Sinan Tan, Huaping Liu, Di Guo & Fuchun Sun
Beijing National Research Center for Information Science and Technology, Beijing, 100084, China
Sinan Tan, Huaping Liu, Di Guo & Fuchun Sun
Shenyuan Honors College, Beihang University, Beijing, 100191, China
Weilai Xiang

Authors

Sinan Tan
View author publications
You can also search for this author in PubMed Google Scholar
Weilai Xiang
View author publications
You can also search for this author in PubMed Google Scholar
Huaping Liu
View author publications
You can also search for this author in PubMed Google Scholar
Di Guo
View author publications
You can also search for this author in PubMed Google Scholar
Fuchun Sun
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Huaping Liu .

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Tan, S., Xiang, W., Liu, H., Guo, D., Sun, F. (2020). Multi-agent Embodied Question Answering in Interactive Environments. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12358. Springer, Cham. https://doi.org/10.1007/978-3-030-58601-0_39

Download citation

DOI: https://doi.org/10.1007/978-3-030-58601-0_39
Published: 28 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58600-3
Online ISBN: 978-3-030-58601-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics