Abstract
Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf, restricted solely to their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to (1) discover elements of the geometry of the physical space indicated by the reverberating audio and (2) detect and follow sound-emitting targets. We further introduce SoundSpaces: a first-of-its-kind dataset of audio renderings based on geometrical acoustic simulations for two sets of publicly available 3D environments (Matterport3D and Replica), and we instrument Habitat to support the new sensor, making it possible to insert arbitrary sound sources in an array of real-world scanned environments. Our results show that audio greatly benefits embodied visual navigation in 3D spaces, and our work lays groundwork for new research in embodied AI with audio-visual perception. Project: http://vision.cs.utexas.edu/projects/audio_visual_navigation.
C. Chen and U. Jain contributed equally.
U. Jain: work done as an intern at Facebook AI Research.
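To make the task setup concrete: the policy maps each egocentric RGB frame, together with the binaural audio heard at the agent's current pose, to a navigation action, and is trained end-to-end with deep reinforcement learning. The sketch below is a minimal PyTorch illustration of one such audio-visual actor-critic; the module names and sizes (e.g., `AudioVisualPolicy`, `hidden_size=512`) are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class AudioVisualPolicy(nn.Module):
    """Minimal sketch of an audio-visual navigation policy.

    Hypothetical module names and sizes -- not the paper's exact architecture.
    """

    def __init__(self, num_actions: int = 4, hidden_size: int = 512):
        super().__init__()
        # Encoder for the egocentric RGB frame.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden_size), nn.ReLU(),
        )
        # Encoder for the 2-channel (left/right) binaural log-spectrogram.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden_size), nn.ReLU(),
        )
        # Recurrent core over the fused audio-visual features.
        self.rnn = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)
        # Actor (action logits) and critic (state-value) heads.
        self.actor = nn.Linear(hidden_size, num_actions)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, rgb, spectrogram, hidden=None):
        # rgb: (B, 3, H, W); spectrogram: (B, 2, F, T)
        fused = torch.cat(
            [self.visual_encoder(rgb), self.audio_encoder(spectrogram)], dim=-1
        )
        out, hidden = self.rnn(fused.unsqueeze(1), hidden)  # one step at a time
        out = out.squeeze(1)
        return self.actor(out), self.critic(out), hidden
```

In practice such a policy would be trained with an on-policy algorithm (e.g., PPO) inside the Habitat simulator; the sketch only fixes the observation-to-action interface.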
Notes
- 1. While the algorithms could also run with ambisonic inputs, using binaural sound has the advantage of allowing human listeners to interpret our video results (see Supp. video); a minimal sketch of featurizing such binaural input follows this list.
- 2. Replica has more multi-room trajectories, where audio gives clear cues of room entrances/exits (vs. the open floor plans in Matterport3D). This may be why the AudioGoal agent (AG) outperforms the PointGoal (PG) and AudioPointGoal (APG) agents on Replica.
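As a concrete illustration of note 1, the following is one way a two-channel binaural waveform could be converted into the log-spectrogram features an agent might consume. The STFT parameters and the function name `binaural_log_spectrogram` are assumptions for the sketch, not the paper's exact preprocessing.

```python
import numpy as np
from scipy import signal


def binaural_log_spectrogram(waveform: np.ndarray, sample_rate: int = 44100,
                             nperseg: int = 512, noverlap: int = 256) -> np.ndarray:
    """Convert a (2, num_samples) left/right waveform into a (2, F, T)
    log-magnitude spectrogram. Hypothetical preprocessing, not the paper's recipe."""
    assert waveform.ndim == 2 and waveform.shape[0] == 2, "expected shape (2, num_samples)"
    channels = []
    for channel in waveform:  # left ear, then right ear
        _, _, stft = signal.stft(channel, fs=sample_rate,
                                 nperseg=nperseg, noverlap=noverlap)
        channels.append(np.log1p(np.abs(stft)))  # compress dynamic range
    return np.stack(channels, axis=0).astype(np.float32)


# Example: featurize one second of (silent) binaural audio.
spec = binaural_log_spectrogram(np.zeros((2, 44100), dtype=np.float32))
print(spec.shape)  # (2, 257, ...) -- two channels of frequency-by-time bins
```

The level and time differences between the left and right channels are what carry the directional cues toward the sounding target; such a two-channel feature map could feed the audio encoder sketched earlier.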
Acknowledgements
UT Austin is supported in part by DARPA Lifelong Learning Machines. We thank Alexander Schwing, Dhruv Batra, Erik Wijmans, Oleksandr Maksymets, Ruohan Gao, and Svetlana Lazebnik for valuable discussions and support with the AI-Habitat platform.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Chen, C. et al. (2020). SoundSpaces: Audio-Visual Navigation in 3D Environments. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12351. Springer, Cham. https://doi.org/10.1007/978-3-030-58539-6_2
Print ISBN: 978-3-030-58538-9
Online ISBN: 978-3-030-58539-6