Abstract
We introduce a novel vision-and-language navigation (VLN) task of learning to provide real-time guidance to a blind follower situated in complex, dynamic navigation scenarios. To explore the real-time information needs and fundamental challenges of this task, we first collect a multi-modal real-world benchmark with in-situ Orientation and Mobility (O&M) instructional guidance. We then leverage this real-world study to inform the design of a larger-scale simulation benchmark, enabling comprehensive analysis of the limitations of current VLN models. Motivated by how sighted O&M guides seamlessly and safely support the awareness of individuals with visual impairments when collaborating on navigation tasks, we present ASSISTER, an imitation-learned agent that can embody such effective guidance. The proposed assistive VLN agent is conditioned on navigational goals and commands to generate instructional sentences that are coherent with the surrounding visual scene while carefully accounting for the immediate assistive navigation task. Altogether, our evaluation and training framework takes a step towards scalable development of the next generation of seamless, human-like assistive agents.
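To make the conditioning described above concrete, the following is a minimal, hypothetical sketch of a conditional instruction generator: pooled visual features, a relative navigational goal, and a discrete high-level command form the conditioning context, and a small decoder emits instruction tokens trained by imitation on guide-provided sentences. All module names, dimensions, and the command/goal encodings are illustrative assumptions, not the authors' actual ASSISTER architecture.

```python
# Hypothetical sketch (PyTorch): generate an instruction conditioned on
# visual features, a navigational goal, and a high-level command.
import torch
import torch.nn as nn


class ConditionalInstructionGenerator(nn.Module):
    def __init__(self, vocab_size=8000, d_model=256, n_commands=4, max_len=32):
        super().__init__()
        self.visual_proj = nn.Linear(2048, d_model)            # pooled CNN features (assumed)
        self.goal_proj = nn.Linear(2, d_model)                 # relative (x, y) goal offset (assumed)
        self.command_emb = nn.Embedding(n_commands, d_model)   # e.g. turn-left / go-straight ids (assumed)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feat, goal, command, tokens):
        # Three-token "memory" built from the visual, goal, and command conditions.
        memory = torch.stack(
            [self.visual_proj(visual_feat),
             self.goal_proj(goal),
             self.command_emb(command)], dim=1)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        tgt = self.token_emb(tokens) + self.pos_emb(positions)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)  # per-token vocabulary logits


# Imitation-style training step on guide-annotated instructions (toy shapes).
model = ConditionalInstructionGenerator()
visual = torch.randn(2, 2048)               # pooled image features
goal = torch.randn(2, 2)                    # relative goal offset
command = torch.randint(0, 4, (2,))         # high-level command id
tokens = torch.randint(0, 8000, (2, 16))    # ground-truth instruction tokens
logits = model(visual, goal, command, tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 8000), tokens[:, 1:].reshape(-1))
loss.backward()
```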
Z. Huang and Z. Shangguan contributed equally.
Acknowledgments
We thank our study participants and the support of the Department of Transportation Inclusive Design Challenge, NSF (IIS-2152077), and a Boston University CISE grant.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, Z., Shangguan, Z., Zhang, J., Bar, G., Boyd, M., Ohn-Bar, E. (2022). ASSISTER: Assistive Navigation via Conditional Instruction Generation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_16
DOI: https://doi.org/10.1007/978-3-031-20059-5_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20058-8
Online ISBN: 978-3-031-20059-5
eBook Packages: Computer Science, Computer Science (R0)