Abstract
Vision-and-Language Navigation (VLN) is a task in which agents must decide how to move through a 3D environment to reach a goal by grounding natural language instructions in their visual surroundings. A key challenge of VLN is data scarcity, since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. In this paper, we explore counterfactual thinking as a human-inspired data augmentation method that yields robust models. Counterfactual thinking describes the human propensity to imagine possible alternatives to events that have already occurred. We propose an adversarial-driven counterfactual reasoning model that considers effective conditions rather than low-quality augmented data. In particular, we present a model-agnostic adversarial path sampler (APS) that learns, based on navigation performance, to sample challenging paths that force the navigator to improve. APS also performs pre-exploration of unseen environments to strengthen the model's ability to generalize. We evaluate the influence of APS on the performance of different VLN baseline models on the Room-to-Room (R2R) dataset. The results show that adversarial training with our proposed APS benefits VLN models in both seen and unseen environments, and that the pre-exploration process yields further improvements in unseen environments.
Notes
- 1. Transforming the sampled paths into shortest paths can only be done in seen environments. For pre-exploration in unseen environments, we use the sampled paths directly, since the shortest-path planner should not be exploited there.
- 2. The shortest-path information is not used during pre-exploration.
- 3. We tried updating APS simultaneously with NAV during pre-exploration, but in a previously unseen environment, without any regularization from human-annotated paths, APS tends to sample paths that are too difficult to accomplish, e.g., back-and-forth movements or cycles. Such paths do not improve NAV and may even hurt performance. To avoid this dilemma, we keep APS fixed during pre-exploration.
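The design choice in Note 3 can be sketched as follows: during pre-exploration the frozen APS only samples paths, and only the navigator's parameters change. All names and numbers here are hypothetical stand-ins, not the paper's code.

```python
import math
import random

random.seed(1)

# Frozen APS path-sampling policy over three toy paths (hypothetical values).
aps_logits = [0.1, 0.3, 0.6]
nav_skill = [0.0, 0.0, 0.0]     # toy per-path navigator competence
weights = [math.exp(l) for l in aps_logits]
snapshot = list(aps_logits)

for _ in range(200):
    # APS only samples; it receives no gradient update in this phase.
    path = random.choices(range(3), weights=weights)[0]
    nav_skill[path] += 0.05     # the navigator trains on the sampled path

# The sampler is unchanged after pre-exploration, so it cannot drift
# toward degenerate back-and-forth or cyclic paths.
```

Keeping APS fixed trades adaptivity for stability: without human-annotated paths to regularize it, an adversarially updated sampler in an unseen environment would chase unaccomplishable paths rather than useful ones.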
Acknowledgments
Research was sponsored by the U.S. Army Research Office and was accomplished under Contract Number W911NF-19-D-0001 for the Institute for Collaborative Biotechnologies. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Fu, TJ., Wang, X.E., Peterson, M.F., Grafton, S.T., Eckstein, M.P., Wang, W.Y. (2020). Counterfactual Vision-and-Language Navigation via Adversarial Path Sampler. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12351. Springer, Cham. https://doi.org/10.1007/978-3-030-58539-6_5
DOI: https://doi.org/10.1007/978-3-030-58539-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58538-9
Online ISBN: 978-3-030-58539-6
eBook Packages: Computer Science, Computer Science (R0)