Abstract
In this paper, we propose an HISNav VQA dataset – a challenging dataset for a Visual Question Answering task that is aimed at the needs of Visual Navigation in human-centered environments. The dataset consists of images of various room scenes that were captured using the Habitat virtual environment and of questions important for navigation tasks using only visual information. We also propose a baseline for a HISNav VQA dataset, a Vector Semiotic Architecture, and demonstrate its performance. The Vector Semiotic Architecture is a combination of a Sign-Based World Model and Vector Symbolic Architectures. The Sign-Based World Model allows representing various aspects of an agent’s knowledge, and Vector Symbolic Architectures serve on a low computational level. The Vector Semiotic Architecture addresses the symbol grounding problem that plays an important role in the Visual Question Answering Task.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. arXiv e-prints arXiv:1707.07998, July 2017
Anderson, P., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments (2018)
Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, pp. 1–15 (2015)
Das, A., et al.: Visual dialog. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Eliasmith, C.: How to Build a Brain: A Neural Architecture for Biological Cognition. Oxford University Press, New York (2013)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Gupta, A., Dollar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2019, pp. 5351–5359 (2019)
Gurari, D., et al.: VizWiz grand challenge: answering visual questions from blind people. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3608–3617 (2018)
Harnad, S.: The symbol grounding problem. Physica D 42(1), 335–346 (1990)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)
Kanerva, P.: Hyperdimensional computing: an introduction to computing in distributed representation with high-dimensional random vectors. Cogn. Comput. 1(2), 139–159 (2009)
Kiselev, G., Kovalev, A., Panov, A.I.: Spatial reasoning and planning in sign-based world model. In: Kuznetsov, S.O., Osipov, G.S., Stefanuk, V.L. (eds.) RCAI 2018. CCIS, vol. 934, pp. 1–10. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00617-4_1
Kiselev, G., Panov, A.: Hierarchical psychologically inspired planning for human-robot interaction tasks. In: Ronzhin, A., Rigoll, G., Meshcheryakov, R. (eds.) ICR 2019. LNCS (LNAI), vol. 11659, pp. 150–160. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26118-4_15
Komer, B., Stewart, T.C., Voelker, A.R., Eliasmith, C.: A neural representation of continuous space using fractional binding. In: 41st Annual Meeting of the Cognitive Science Society. Cognitive Science Society, QC (2019)
Kovalev, A.K., Panov, A.I.: Mental actions and modelling of reasoning in semiotic approach to AGI. In: Hammer, P., Agrawal, P., Goertzel, B., Iklé, M. (eds.) AGI 2019. LNCS (LNAI), vol. 11654, pp. 121–131. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27005-6_12
Kovalev, A.K., Panov, A.I., Osipov, E.: Hyperdimensional representations in semiotic approach to AGI. In: Goertzel, B., Panov, A.I., Potapov, A., Yampolskiy, R. (eds.) AGI 2020. LNCS (LNAI), vol. 12177, pp. 231–241. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52152-3_24
Ku, A., Anderson, P., Patel, R., Ie, E., Baldridge, J.: Room-across-room: multilingual vision-and-language navigation with dense spatiotemporal grounding. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4392–4412, November 2020
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Linda Smith, M.G.: The development of embodied cognition: six lessons from babies. Artif. Life 11, 13–29 (2005)
Savva, M., et al.: Habitat: a platform for embodied AI research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
Osipov, G.S., Panov, A.I., Chudova, N.V.: Behavior control as a function of consciousness. I. world model and goal setting. J. Comput. Syst. Sci. Int. 53(4), 517–529 (2014)
Osipov, G.S., Panov, A.I.: Relationships and operations in a sign-based world model of the actor. Sci. Tech. Inf. Process. 45(5), 317–330 (2018). https://doi.org/10.3103/S0147688218050040
Osipov, G.S., Panov, A.I.: Rational behaviour planning of cognitive semiotic agent in dynamic environment. Sci. Tech. Inf. Process. 48(6) (2021)
Panov, A.I.: Goal setting and behavior planning for cognitive agents. Sci. Tech. Inf. Process. 46(6), 404–415 (2019)
Panov, A.I.: Behavior planning of intelligent agent with sign world model. Biol. Inspired Cogn. Archit. 19, 21–31 (2017)
Plate, T.A.: Holographic reduced representations. IEEE Trans. Neural Networks 6(3), 623–641 (1995). https://doi.org/10.1109/72.377968
Staroverov, A., Yudin, D.A., Belkin, I., Adeshkin, V., Solomentsev, Y.K., Panov, A.I.: Real-time object navigation with deep neural networks and hierarchical reinforcement learning. IEEE Access 8, 195608–195621 (2020)
Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256 (2004)
Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., Tenenbaum, J.B.: Neural-symbolic VQA: disentangling reasoning from vision and language understanding. arXiv e-prints arXiv:1810.02338, October 2018
Yu, J., Zhu, Z., Wang, Y., Zhang, W., Hu, Y., Tan, J.: Cross-modal knowledge reasoning for knowledge-based visual question answering. Pattern Recogn. 108, 107563 (2020)
Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: visual commonsense reasoning. CoRR abs/1811.10830 (2018)
Acknowledgements
The reported study was supported by RFBR, research Project No. 19-37-90164.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Appendices
A Appendix: Data Labeling
B Appendix: Examples
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kirilenko, D.E., Kovalev, A.K., Osipov, E., Panov, A.I. (2021). Question Answering for Visual Navigation in Human-Centered Environments. In: Batyrshin, I., Gelbukh, A., Sidorov, G. (eds) Advances in Soft Computing. MICAI 2021. Lecture Notes in Computer Science(), vol 13068. Springer, Cham. https://doi.org/10.1007/978-3-030-89820-5_3
Download citation
DOI: https://doi.org/10.1007/978-3-030-89820-5_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89819-9
Online ISBN: 978-3-030-89820-5
eBook Packages: Computer ScienceComputer Science (R0)