Abstract
We introduce a novel vision-and-language navigation (VLN) task of learning to provide real-time guidance to a blind follower situated in complex, dynamic navigation scenarios. To explore the real-time information needs and fundamental challenges of this task, we first collect a multi-modal real-world benchmark with in-situ Orientation and Mobility (O&M) instructional guidance. We then leverage this real-world study to inform the design of a larger-scale simulation benchmark, enabling comprehensive analysis of the limitations of current VLN models. Motivated by how sighted O&M guides seamlessly and safely support the awareness of individuals with visual impairments when collaborating on navigation tasks, we present ASSISTER, an imitation-learned agent that can embody such effective guidance. The proposed assistive VLN agent is conditioned on navigational goals and commands to generate instructional sentences that are coherent with the surrounding visual scene while carefully accounting for the immediate assistive navigation task. Altogether, our evaluation and training framework takes a step towards scalable development of the next generation of seamless, human-like assistive agents.
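To make the conditioning described above concrete, the following is a minimal, hypothetical sketch of a conditional instruction generator: pooled visual features, a relative navigational goal, and a discrete high-level command form the conditioning context, and a small decoder emits instruction tokens trained by imitation on guide-provided sentences. All module names, dimensions, and the command/goal encodings are illustrative assumptions, not the authors' actual ASSISTER architecture.

```python
# Hypothetical sketch (PyTorch): generate an instruction conditioned on
# visual features, a navigational goal, and a high-level command.
import torch
import torch.nn as nn


class ConditionalInstructionGenerator(nn.Module):
    def __init__(self, vocab_size=8000, d_model=256, n_commands=4, max_len=32):
        super().__init__()
        self.visual_proj = nn.Linear(2048, d_model)            # pooled CNN features (assumed)
        self.goal_proj = nn.Linear(2, d_model)                 # relative (x, y) goal offset (assumed)
        self.command_emb = nn.Embedding(n_commands, d_model)   # e.g. turn-left / go-straight ids (assumed)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feat, goal, command, tokens):
        # Three-token "memory" built from the visual, goal, and command conditions.
        memory = torch.stack(
            [self.visual_proj(visual_feat),
             self.goal_proj(goal),
             self.command_emb(command)], dim=1)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        tgt = self.token_emb(tokens) + self.pos_emb(positions)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)  # per-token vocabulary logits


# Imitation-style training step on guide-annotated instructions (toy shapes).
model = ConditionalInstructionGenerator()
visual = torch.randn(2, 2048)               # pooled image features
goal = torch.randn(2, 2)                    # relative goal offset
command = torch.randint(0, 4, (2,))         # high-level command id
tokens = torch.randint(0, 8000, (2, 16))    # ground-truth instruction tokens
logits = model(visual, goal, command, tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 8000), tokens[:, 1:].reshape(-1))
loss.backward()
```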
Z. Huang and Z. Shangguan contributed equally.
Acknowledgments
We thank our study participants and the support of the Department of Transportation Inclusive Design Challenge, NSF (IIS-2152077), and a Boston University CISE grant.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, Z., Shangguan, Z., Zhang, J., Bar, G., Boyd, M., Ohn-Bar, E. (2022). ASSISTER: Assistive Navigation via Conditional Instruction Generation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_16
DOI: https://doi.org/10.1007/978-3-031-20059-5_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20058-8
Online ISBN: 978-3-031-20059-5
eBook Packages: Computer Science, Computer Science (R0)