Counterfactual Vision-and-Language Navigation via Adversarial Path Sampler

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12351)


Abstract

Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal by grounding natural language instructions in the visual surroundings. One of the problems of the VLN task is data scarcity, since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. In this paper, we explore the use of counterfactual thinking as a human-inspired data augmentation method that results in robust models. Counterfactual thinking is a concept that describes the human propensity to create possible alternatives to life events that have already occurred. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data. In particular, we present a model-agnostic adversarial path sampler (APS) that learns to sample challenging paths that force the navigator to improve based on the navigation performance. APS also serves to perform pre-exploration of unseen environments to strengthen the model’s ability to generalize. We evaluate the influence of APS on the performance of different VLN baseline models using the Room-to-Room (R2R) dataset. The results show that the adversarial training process with our proposed APS benefits VLN models under both seen and unseen environments, and that the pre-exploration process yields further improvements under unseen environments.
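
To make the adversarial objective concrete, below is a minimal toy sketch (in PyTorch) of the training dynamic the abstract describes: a path-sampler policy is rewarded, via REINFORCE, for proposing paths on which the navigator incurs high loss, while the navigator trains on those same paths. The PathSampler and Navigator modules here are hypothetical stand-ins for illustration only, not the paper's architecture, and instruction generation for the sampled paths is omitted.

```python
# Toy sketch (not the authors' code) of the adversarial loop described above:
# the path sampler (APS) is rewarded via REINFORCE for paths on which the
# navigator incurs high loss, while the navigator trains on those same paths.
import torch
import torch.nn as nn

NUM_NODES = 10  # toy navigation graph: a path is a sequence of node ids

class PathSampler(nn.Module):
    """Stand-in APS policy over next nodes (hypothetical, for illustration)."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(NUM_NODES))

    def sample(self, batch, length=4):
        dist = torch.distributions.Categorical(logits=self.logits)
        paths = dist.sample((batch, length))        # (B, T) node ids
        log_prob = dist.log_prob(paths).sum(dim=1)  # (B,) path log-probability
        return paths, log_prob

class Navigator(nn.Module):
    """Stand-in VLN baseline; APS is model-agnostic, so any navigator fits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_NODES, 16)
        self.head = nn.Linear(16, NUM_NODES)

    def rollout_loss(self, paths):
        # Next-node prediction loss as a toy proxy for navigation failure.
        logits = self.head(self.embed(paths[:, :-1]))
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, NUM_NODES), paths[:, 1:].reshape(-1),
            reduction="none")
        return loss.view(paths.size(0), -1).mean(dim=1)  # per-path loss (B,)

aps, nav = PathSampler(), Navigator()
aps_opt = torch.optim.Adam(aps.parameters(), lr=1e-2)
nav_opt = torch.optim.Adam(nav.parameters(), lr=1e-2)

for step in range(100):
    paths, log_prob = aps.sample(batch=32)
    nav_loss = nav.rollout_loss(paths)
    # REINFORCE: APS is rewarded when the navigator's loss is high, so it
    # learns to propose increasingly challenging paths ...
    aps_loss = -(log_prob * nav_loss.detach()).mean()
    aps_opt.zero_grad(); aps_loss.backward(); aps_opt.step()
    # ... while the navigator trains on the sampled paths to close the gap.
    nav_opt.zero_grad(); nav_loss.mean().backward(); nav_opt.step()
```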


Notes

  1. Note that transforming the sampled paths into shortest paths can only be done under seen environments. For pre-exploration under unseen environments, we directly use the sampled paths, because the shortest-path planner should not be exploited in unseen environments.

  2. Note that the shortest-path information is not used during pre-exploration.

  3. We tried updating APS simultaneously with NAV during pre-exploration, but it turns out that, under a previously unseen environment without any regularization from human-annotated paths, APS tends to sample paths that are too difficult to accomplish, e.g., back-and-forth movements or cycles. Such paths do not improve NAV and may even hurt its performance. To avoid this dilemma, we keep APS fixed during pre-exploration (see the sketch after these notes).
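
As a companion sketch, reusing the hypothetical PathSampler/Navigator stand-ins from the sketch above, the pre-exploration phase described in these notes keeps APS frozen and adapts only the navigator on the raw sampled paths; no shortest-path planner is involved, per footnotes 1 and 2:

```python
# Toy sketch of pre-exploration in an unseen environment: APS is frozen and
# only the navigator adapts; sampled paths are used as-is (no shortest-path
# planner). Reuses the hypothetical PathSampler/Navigator stand-ins above.
for p in aps.parameters():
    p.requires_grad_(False)  # keep APS fixed during pre-exploration

pre_opt = torch.optim.Adam(nav.parameters(), lr=1e-3)
for step in range(50):
    with torch.no_grad():
        paths, _ = aps.sample(batch=32)  # raw sampled paths, not shortest paths
    loss = nav.rollout_loss(paths).mean()
    pre_opt.zero_grad(); loss.backward(); pre_opt.step()
```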


Acknowledgments

Research was sponsored by the U.S. Army Research Office and was accomplished under Contract Number W911NF-19-D-0001 for the Institute for Collaborative Biotechnologies. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Author information

Corresponding author

Correspondence to Tsu-Jui Fu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 95 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Fu, TJ., Wang, X.E., Peterson, M.F., Grafton, S.T., Eckstein, M.P., Wang, W.Y. (2020). Counterfactual Vision-and-Language Navigation via Adversarial Path Sampler. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12351. Springer, Cham. https://doi.org/10.1007/978-3-030-58539-6_5


  • DOI: https://doi.org/10.1007/978-3-030-58539-6_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58538-9

  • Online ISBN: 978-3-030-58539-6

  • eBook Packages: Computer Science, Computer Science (R0)
