Abstract
We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it must extract the target sound accurately at every step using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, using self-attention to make high-quality estimates for the current timestep while simultaneously improving its past estimates. Using highly realistic acoustic SoundSpaces [13] simulations in real-world scanned Matterport3D [11] environments, we show that our model learns efficient behavior for continuous separation of a dynamic audio target. Project: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation/.
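To make the transformer memory described above concrete, here is a minimal PyTorch sketch of the core idea: a self-attention module that, at every step, re-estimates separation outputs for all timesteps observed so far, so past estimates improve as new observations arrive. The module names, dimensions, and mask-based output head are illustrative assumptions for this sketch, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch) of a transformer memory that self-attends over the
# agent's per-step audio-visual embeddings so the current separation estimate
# can also refine estimates from earlier steps. Names and sizes are assumptions.
import torch
import torch.nn as nn

class TransformerSeparatorMemory(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, num_layers=4, mask_bins=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Hypothetical head mapping each refined embedding to a separation mask
        # over spectrogram frequency bins for that timestep.
        self.mask_head = nn.Sequential(nn.Linear(embed_dim, mask_bins), nn.Sigmoid())

    def forward(self, step_embeddings):
        # step_embeddings: (batch, T, embed_dim) fused audio-visual features,
        # one per timestep of the episode so far.
        refined = self.encoder(step_embeddings)   # self-attention across all steps
        masks = self.mask_head(refined)           # (batch, T, mask_bins)
        return masks                              # past estimates are updated too

# Usage: at step T the agent re-estimates masks for steps 1..T in one pass.
memory = TransformerSeparatorMemory()
feats = torch.randn(1, 10, 512)                  # 10 steps of fused features
all_masks = memory(feats)                        # refined masks for every step
```

Because the encoder attends over the full episode history, the estimate for step 1 is recomputed at step T using everything heard and seen since.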
Notes
- 1. e.g., a human speaker or instrument the agent wishes to listen to.
- 2. The monaural source is the ground truth target, since it is devoid of all material and spatial effects due to the environment and the mutual positioning of the agent and the sources. Using spatialized (e.g., binaural) audio as ground truth would permit undesirable shortcut solutions: the agent could move somewhere in the environment where the target is inaudible and technically return the right answer of silence [38].
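As an illustration of why scoring against the clean monaural signal blocks the "go somewhere silent" shortcut, the sketch below computes a standard scale-invariant SDR [64] between a predicted waveform and the monaural ground truth; a silent prediction cannot match the never-silent reference and therefore scores poorly. This is a generic metric implementation for illustration, not necessarily the paper's evaluation metric.

```python
# Generic SI-SDR computation against the monaural ground truth (illustrative).
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to isolate the target component.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# A silent estimate cannot match the (never-silent) monaural ground truth:
rng = np.random.default_rng(0)
gt_mono = rng.standard_normal(16000)          # stand-in for 1 s of clean audio
print(si_sdr(gt_mono + 0.1 * rng.standard_normal(16000), gt_mono))  # high (good)
print(si_sdr(np.zeros(16000), gt_mono))       # ~0 dB with eps guard (poor)
```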
References
Afouras, T., Chung, J.S., Zisserman, A.: The conversation: deep audio-visual speech enhancement. arXiv preprint arXiv:1804.04121 (2018)
Afouras, T., Chung, J.S., Zisserman, A.: My lips are concealed: audio-visual speech enhancement through obstructions. arXiv preprint arXiv:1907.04975 (2019)
Alameda-Pineda, X., Horaud, R.: Vision-guided robot hearing. Int. J. Robot. Res. 34(4–5), 437–456 (2015)
Yu, Y., Huang, W., Sun, F., Chen, C., Wang, Y., Liu, X.: Sound adversarial audio-visual navigation. In: Submitted to The Tenth International Conference on Learning Representations (2022). https://openreview.net/forum?id=NkZq4OEYN-
Asano, F., Goto, M., Itou, K., Asoh, H.: Real-time sound source localization and separation system and its application to automatic speech recognition. In: Eurospeech (2001)
Ban, Y., Li, X., Alameda-Pineda, X., Girin, L., Horaud, R.: Accounting for room acoustics in audio-visual multi-speaker tracking. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018)
Barzelay, Z., Schechner, Y.Y.: Harmony in motion. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007). https://doi.org/10.1109/CVPR.2007.383344
Bellemare, M.G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. arXiv preprint arXiv:1606.01868 (2016)
Bustamante, G., Danès, P., Forgue, T., Podlubne, A., Manhès, J.: An information based feedback control for audio-motor binaural localization. Auton. Robots 42(2), 477–490 (2017). https://doi.org/10.1007/s10514-017-9639-8
Campari, T., Eccher, P., Serafini, L., Ballan, L.: Exploiting scene-specific features for object goal navigation. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12538, pp. 406–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66823-5_24
Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: International Conference on 3D Vision (3DV) (2017). Matterport3D dataset license: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf
Chen, C., Al-Halah, Z., Grauman, K.: Semantic audio-visual navigation. In: CVPR (2021)
Chen, C., et al.: SoundSpaces: audio-visual navigation in 3D environments. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 17–36. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_2
Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S.K., Grauman, K.: Learning to set waypoints for audio-visual navigation. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=cR91FAodFMe
Chen, J., Mao, Q., Liu, D.: Dual-path transformer network: direct context-aware modeling for end-to-end monaural speech separation. arXiv preprint arXiv:2007.13975 (2020)
Chen, K., Chen, J.K., Chuang, J., Vázquez, M., Savarese, S.: Topological planning with transformers for vision-and-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11276–11286 (2021)
Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A.C., Bengio, Y.: A recurrent latent variable model for sequential data. In: NeurIPS (2015)
Chung, S.W., Choe, S., Chung, J.S., Kang, H.G.: Facefilter: audio-visual speech separation using still images. arXiv preprint arXiv:2005.07074 (2020)
Deleforge, A., Horaud, R.: The cocktail party robot: sound source separation and localisation with an active binaural head. In: HRI 2012–7th ACM/IEEE International Conference on Human Robot Interaction, pp. 431–438. ACM, Boston, United States, March 2012. https://doi.org/10.1145/2157689.2157834, https://hal.inria.fr/hal-00768668
Duong, N.Q., Vincent, E., Gribonval, R.: Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010)
Ephrat, A., et al.: Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619 (2018)
Fang, K., Toshev, A., Fei-Fei, L., Savarese, S.: Scene memory transformer for embodied agents in long-horizon tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 538–547 (2019)
Fisher III, J.W., Darrell, T., Freeman, W., Viola, P.: Learning joint statistical models for audio-visual fusion and segregation. In: Leen, T., Dietterich, T., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 772–778. MIT Press (2001). https://proceedings.neurips.cc/paper/2000/file/11f524c3fbfeeca4aa916edcb6b6392e-Paper.pdf
Gabbay, A., Shamir, A., Peleg, S.: Visual speech enhancement. arXiv preprint arXiv:1711.08789 (2017)
Gan, C., Huang, D., Zhao, H., Tenenbaum, J.B., Torralba, A.: Music gesture for visual sound separation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10478–10487 (2020)
Gan, C., Zhang, Y., Wu, J., Gong, B., Tenenbaum, J.B.: Look, listen, and act: towards audio-visual embodied navigation. In: ICRA (2020)
Gao, R., Feris, R., Grauman, K.: Learning to separate object sounds by watching unlabeled video. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–53 (2018)
Gao, R., Grauman, K.: 2.5D visual sound. In: CVPR (2019)
Gao, R., Grauman, K.: Co-separating sounds of visual objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3879–3888 (2019)
Gao, R., Grauman, K.: VisualVoice: audio-visual speech separation with cross-modal consistency. arXiv preprint arXiv:2101.03149 (2021)
Gu, R., et al.: Neural spatial filter: target speaker speech separation assisted with directional information. In: Kubin, G., Kacic, Z. (eds.) Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019, pp. 4290–4294. ISCA (2019). https://doi.org/10.21437/Interspeech.2019-2266
Gu, R., Zou, Y.: Temporal-spatial neural filter: direction informed end-to-end multi-channel target speech separation. arXiv preprint arXiv:2001.00391 (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hershey, J.R., Movellan, J.R.: Audio vision: Using audio-visual synchrony to locate sounds. In: NeurIPS (2000)
Huang, P., Kim, M., Hasegawa-Johnson, M., Smaragdis, P.: Deep learning for monaural speech separation. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1562–1566 (2014). https://doi.org/10.1109/ICASSP.2014.6853860
Li, B., Dinesh, K., Duan, Z., Sharma, G.: See and listen: score-informed association of sound tracks to players in chamber music performance videos. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2906–2910 (2017). https://doi.org/10.1109/ICASSP.2017.7952688
Lu, W.T., Wang, J.C., Won, M., Choi, K., Song, X.: SpecTNT: a time-frequency transformer for music audio. arXiv preprint arXiv:2110.09127 (2021)
Majumder, S., Al-Halah, Z., Grauman, K.: Move2Hear: active audio-visual source separation. In: ICCV (2021)
Mezghani, L., et al.: Memory-augmented reinforcement learning for image-goal navigation. arXiv preprint arXiv:2101.05181 (2021)
Mezghani, L., Sukhbaatar, S., Szlam, A., Joulin, A., Bojanowski, P.: Learning to visually navigate in photorealistic environments without any supervision. arXiv preprint arXiv:2004.04954 (2020)
Žmolíková, K., et al.: Speakerbeam: speaker aware neural network for target speaker extraction in speech mixtures. IEEE J. Sel. Top. Sign. Proces. 13(4), 800–814 (2019). https://doi.org/10.1109/JSTSP.2019.2922820
Nakadai, K., Hidai, K.i., Okuno, H.G., Kitano, H.: Real-time speaker localization and speech separation by audio-visual integration. In: Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), vol. 1, pp. 1043–1049. IEEE (2002)
Nakadai, K., Lourens, T., Okuno, H.G., Kitano, H.: Active audition for humanoid. In: AAAI (2000)
Ochiai, T., et al.: Listen to what you want: neural network-based universal sound selector. In: Meng, H., Xu, B., Zheng, T.F. (eds.) Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020, pp. 1441–1445. ISCA (2020). https://doi.org/10.21437/Interspeech.2020-2210
Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 631–648 (2018)
Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015). https://doi.org/10.1109/ICASSP.2015.7178964
Parekh, S., Essid, S., Ozerov, A., Duong, N.Q.K., Pérez, P., Richard, G.: Motion informed audio source separation. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6–10 (2017). https://doi.org/10.1109/ICASSP.2017.7951787
Piczak, K.J.: ESC: dataset for environmental sound classification. In: Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018. ACM Press (2015). https://doi.org/10.1145/2733373.2806390, http://dl.acm.org/citation.cfm?doid=2733373.2806390
Pu, J., Panagakis, Y., Petridis, S., Pantic, M.: Audio-visual object localization and separation using low-rank and sparsity. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2901–2905. IEEE (2017)
Ramakrishnan, S.K., Nagarajan, T., Al-Halah, Z., Grauman, K.: Environment predictive coding for embodied agents. arXiv preprint arXiv:2102.02337 (2021)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Roux, J.L., Wisdom, S., Erdogan, H., Hershey, J.R.: SDR - half-baked or well done? In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630 (2019). https://doi.org/10.1109/ICASSP.2019.8683855
Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV (2019)
Sedighin, F., Babaie-Zadeh, M., Rivet, B., Jutten, C.: Two multimodal approaches for single microphone source separation. In: 2016 24th European Signal Processing Conference (EUSIPCO), pp. 110–114 (2016). https://doi.org/10.1109/EUSIPCO.2016.7760220
Smaragdis, P., Raj, B., Shashanka, M.: Supervised and semi-supervised separation of sounds from single-channel mixtures. In: Davies, M.E., James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007. LNCS, vol. 4666, pp. 414–421. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74494-8_52
Smaragdis, P., Casey, M.: Audio/visual independent components. In: Proceedings of the International Symposium on Independent Component Analysis and Blind Source Separation (2003)
Spiertz, M., Gnann, V.: Source-filter based clustering for monaural blind source separation. In: Proceedings of International Conference on Digital Audio Effects DAFx’09 (2009)
Subakan, C., Ravanelli, M., Cornell, S., Bronzi, M., Zhong, J.: Attention is all you need in speech separation. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 21–25. IEEE (2021)
Tzinis, E., et al.: Into the wild with audioscope: unsupervised audio-visual separation of on-screen sounds. arXiv preprint arXiv:2011.01143 (2020)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Viciana-Abad, R., Marfil, R., Perez-Lorenzo, J., Bandera, J., Romero-Garces, A., Reche-Lopez, P.: Audio-visual perception system for a humanoid robotic head. Sensors 14(6), 9522–9545 (2014)
Virtanen, T.: Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 5(3), 1066–1074 (2007)
Weiss, R.J., Mandel, M.I., Ellis, D.P.: Source separation based on binaural cues and source model constraints. In: Ninth Annual Conference of the International Speech Communication Association, vol. 2008 (2009)
Wijmans, E., et al.: DD-PPO: learning near-perfect pointgoal navigators from 2.5 billion frames. arXiv preprint arXiv:1911.00357 (2019)
Xu, X., Dai, B., Lin, D.: Recursive visual sound separation using minus-plus net. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 882–891 (2019)
Yılmaz, Ö., Rickard, S.: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process. (2004)
Zadeh, A., Ma, T., Poria, S., Morency, L.P.: Wildmix dataset and spectro-temporal transformer model for monoaural audio source separation. arXiv preprint arXiv:1911.09783 (2019)
Zhang, X., Wang, D.: Deep learning based binaural speech separation in reverberant environments. IEEE/ACM Trans. Audio Speech Lang. Process. 25(5), 1075–1084 (2017)
Zhang, Z., He, B., Zhang, Z.: Transmask: a compact and fast speech separation model based on transformer. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5764–5768. IEEE (2021)
Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., Torralba, A.: The sound of pixels. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 570–586 (2018)
Acknowledgements
Thank you to Ziad Al-Halah for very valuable discussions. Thanks to Tushar Nagarajan, Kumar Ashutosh, and David Harwath for feedback on paper drafts. UT Austin is supported in part by DARPA L2M, NSF CCRI, and the IFML NSF AI Institute. K.G. is paid as a research scientist by Meta.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Majumder, S., Grauman, K. (2022). Active Audio-Visual Separation of Dynamic Sound Sources. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_32
DOI: https://doi.org/10.1007/978-3-031-19842-7_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19841-0
Online ISBN: 978-3-031-19842-7
eBook Packages: Computer Science, Computer Science (R0)