Multi-agent Embodied Question Answering in Interactive Environments

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12358)

Abstract

We investigate a new AI task, Multi-Agent Interactive Question Answering, in which several agents jointly explore a scene in an interactive environment to answer a question. To cooperate efficiently and answer accurately, the agents must be well organized, with a balanced division of work, and must share knowledge about the objects involved. We address this new problem in two stages: Multi-Agent 3D Reconstruction in Interactive Environments and Question Answering. Our proposed framework features multi-layer structural and semantic memories shared by all agents, as well as a question answering model built upon a 3D-CNN network that encodes the scene memories. During reconstruction, the agents simultaneously explore and scan the scene with a clear division of work, organized by next-viewpoint planning. We evaluate our framework on the IQuADv1 dataset and outperform the IQA baseline in the single-agent scenario. In multi-agent scenarios, our framework shows favorable speedups while maintaining high accuracy.
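
As a rough illustration of what a balanced division of work over candidate viewpoints could look like, the sketch below greedily hands the nearest unvisited viewpoint to whichever agent has travelled the least so far. It is not the planner used in the paper: the function name `assign_viewpoints`, the 2D coordinates, and the greedy rule are all assumptions made for illustration.

```python
import math


def assign_viewpoints(agent_positions, viewpoints):
    """Hypothetical greedy work division (not the paper's planner).

    Repeatedly picks the agent with the smallest travelled distance and
    assigns it the remaining viewpoint closest to its current position.
    """
    assignments = {i: [] for i in range(len(agent_positions))}
    current = list(agent_positions)          # where each agent stands now
    travelled = [0.0] * len(agent_positions)  # distance covered per agent
    remaining = list(viewpoints)

    while remaining:
        # Least-loaded agent goes next; ties resolve to the lowest index.
        agent = min(range(len(current)), key=lambda i: travelled[i])
        # Nearest remaining viewpoint to that agent (Euclidean distance).
        target = min(remaining, key=lambda v: math.dist(current[agent], v))
        travelled[agent] += math.dist(current[agent], target)
        current[agent] = target
        assignments[agent].append(target)
        remaining.remove(target)

    return assignments, travelled


if __name__ == "__main__":
    agents = [(0.0, 0.0), (10.0, 0.0)]
    views = [(1.0, 0.0), (9.0, 0.0), (2.0, 0.0), (8.0, 0.0)]
    plan, cost = assign_viewpoints(agents, views)
    print(plan, cost)
```

With two agents at opposite ends of a corridor, each agent ends up scanning the viewpoints on its own side, so the travelled distances stay equal. A real planner would also need to account for occlusion, information gain, and the shared scene memory, none of which this sketch models.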

S. Tan and W. Xiang—Equal contribution.

W. Xiang—This work was completed while Weilai Xiang was visiting Tsinghua University, Beijing.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants U1613212 and 61703284.

Author information

Corresponding author

Correspondence to Huaping Liu.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Tan, S., Xiang, W., Liu, H., Guo, D., Sun, F. (2020). Multi-agent Embodied Question Answering in Interactive Environments. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12358. Springer, Cham. https://doi.org/10.1007/978-3-030-58601-0_39

  • DOI: https://doi.org/10.1007/978-3-030-58601-0_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58600-3

  • Online ISBN: 978-3-030-58601-0

  • eBook Packages: Computer Science, Computer Science (R0)
