Active Audio-Visual Separation of Dynamic Sound Sources

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13699)


Abstract

We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound accurately at every step using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, using self-attention to make high-quality estimates for current timesteps and also simultaneously improve its past estimates. Using highly realistic acoustic SoundSpaces [13] simulations in real-world scanned Matterport3D [11] environments, we show that our model is able to learn efficient behavior to carry out continuous separation of a dynamic audio target. Project: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation/.
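The abstract only names the ingredients (egocentric audio-visual encoding, a transformer memory with self-attention over past steps, a learned motion policy), so a rough sketch may help make the data flow concrete. The code below is a minimal illustration under stated assumptions, not the authors' architecture: the encoder layouts, feature sizes, per-frequency mask simplification, and four-action space are placeholders chosen for brevity.

```python
import torch
import torch.nn as nn

class ActiveSeparatorSketch(nn.Module):
    """Toy version of the pipeline described in the abstract: encode egocentric
    audio-visual observations, self-attend over a memory of all steps so far, and
    emit (i) separation masks for every step and (ii) a motion action."""

    def __init__(self, d_model=256, n_freq=257, n_actions=4):
        super().__init__()
        # Per-step encoders for the mixed-audio spectrogram and the RGB frame
        # (LazyLinear infers the flattened input size on first use).
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_model), nn.ReLU())
        self.vision_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_model), nn.ReLU())
        self.fuse = nn.Linear(2 * d_model, d_model)
        # Transformer memory: self-attention across all timesteps collected so far,
        # which is what lets new observations also revise older estimates.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.memory = nn.TransformerEncoder(layer, num_layers=2)
        # Heads: a (simplified, per-frequency) separation mask for each step,
        # and a policy over camera/microphone motion actions for the current step.
        self.mask_head = nn.Sequential(nn.Linear(d_model, n_freq), nn.Sigmoid())
        self.policy_head = nn.Linear(d_model, n_actions)

    def forward(self, mix_specs, rgb_frames):
        # mix_specs: (T, n_freq, n_frames) mixture spectrograms for the T steps so far
        # rgb_frames: (T, 3, H, W) egocentric views for the same steps
        a = self.audio_enc(mix_specs)
        v = self.vision_enc(rgb_frames)
        x = self.fuse(torch.cat([a, v], dim=-1)).unsqueeze(0)  # (1, T, d_model)
        h = self.memory(x).squeeze(0)                          # attend across time
        masks = self.mask_head(h)                 # estimates for all steps, past ones included
        action_logits = self.policy_head(h[-1])   # move/turn decision from the latest state
        return masks, action_logits


# One step of an (assumed) episode loop: separate at every step, then move.
model = ActiveSeparatorSketch()
mix_specs, rgb_frames = torch.rand(3, 257, 16), torch.rand(3, 3, 64, 64)  # 3 steps so far
masks, action_logits = model(mix_specs, rgb_frames)
action = action_logits.argmax().item()  # in practice, sampled by a policy-gradient method
```

In the paper the motion policy is trained with reinforcement learning; in this sketch, `action_logits` is where such a policy update would attach, while the per-step masks are what the shared self-attention memory lets the agent keep revising as it moves.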

Notes

  1. e.g., a human speaker or instrument the agent wishes to listen to.

  2. The monaural source is the ground truth target, since it is devoid of all material and spatial effects due to the environment and the mutual positioning of the agent and the sources. Using spatialized (e.g., binaural) audio as ground truth would permit undesirable shortcut solutions: the agent could move somewhere in the environment where the target is inaudible and technically return the right answer of silence [38].
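The shortcut described in note 2 is easy to see with a toy calculation (an assumed setup, purely illustrative): against a spatialized ground truth rendered at a distant pose, predicting pure silence already looks near-perfect, whereas against the monaural ground truth the same silent prediction is heavily penalized.

```python
import numpy as np

rng = np.random.default_rng(0)
monaural_gt = rng.standard_normal(16000)   # dry target waveform (no spatial/material effects)
binaural_gt_far = 0.01 * monaural_gt       # spatialized target as heard from a distant pose
silent_prediction = np.zeros(16000)        # degenerate "answer": output nothing at all

def mse(pred, gt):
    return float(np.mean((pred - gt) ** 2))

print(mse(silent_prediction, binaural_gt_far))  # ~1e-4: silence looks near-perfect
print(mse(silent_prediction, monaural_gt))      # ~1.0: silence is clearly penalized
```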

References

  1. Afouras, T., Chung, J.S., Zisserman, A.: The conversation: deep audio-visual speech enhancement. arXiv preprint arXiv:1804.04121 (2018)

  2. Afouras, T., Chung, J.S., Zisserman, A.: My lips are concealed: audio-visual speech enhancement through obstructions. arXiv preprint arXiv:1907.04975 (2019)

  3. Alameda-Pineda, X., Horaud, R.: Vision-guided robot hearing. Int. J. Robot. Res. 34(4–5), 437–456 (2015)

  4. Yu, Y., Huang, W., Sun, F., Chen, C., Wang, Y., Liu, X.: Sound adversarial audio-visual navigation. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=NkZq4OEYN-

  5. Asano, F., Goto, M., Itou, K., Asoh, H.: Real-time sound source localization and separation system and its application to automatic speech recognition. In: Eurospeech (2001)

  6. Ban, Y., Li, X., Alameda-Pineda, X., Girin, L., Horaud, R.: Accounting for room acoustics in audio-visual multi-speaker tracking. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018)

  7. Barzelay, Z., Schechner, Y.Y.: Harmony in motion. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007). https://doi.org/10.1109/CVPR.2007.383344

  8. Bellemare, M.G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. arXiv preprint arXiv:1606.01868 (2016)

  9. Bustamante, G., Danès, P., Forgue, T., Podlubne, A., Manhès, J.: An information based feedback control for audio-motor binaural localization. Auton. Robots 42(2), 477–490 (2017). https://doi.org/10.1007/s10514-017-9639-8

  10. Campari, T., Eccher, P., Serafini, L., Ballan, L.: Exploiting scene-specific features for object goal navigation. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12538, pp. 406–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66823-5_24

  11. Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: International Conference on 3D Vision (3DV) (2017). Matterport3D dataset license: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf

  12. Chen, C., Al-Halah, Z., Grauman, K.: Semantic audio-visual navigation. In: CVPR (2021)

  13. Chen, C., et al.: SoundSpaces: audio-visual navigation in 3D environments. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 17–36. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_2

  14. Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S.K., Grauman, K.: Learning to set waypoints for audio-visual navigation. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=cR91FAodFMe

  15. Chen, J., Mao, Q., Liu, D.: Dual-path transformer network: direct context-aware modeling for end-to-end monaural speech separation. arXiv preprint arXiv:2007.13975 (2020)

  16. Chen, K., Chen, J.K., Chuang, J., Vázquez, M., Savarese, S.: Topological planning with transformers for vision-and-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11276–11286 (2021)

  17. Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A.C., Bengio, Y.: A recurrent latent variable model for sequential data. In: NeurIPS (2015)

  18. Chung, S.W., Choe, S., Chung, J.S., Kang, H.G.: FaceFilter: audio-visual speech separation using still images. arXiv preprint arXiv:2005.07074 (2020)

  19. Deleforge, A., Horaud, R.: The cocktail party robot: sound source separation and localisation with an active binaural head. In: HRI 2012–7th ACM/IEEE International Conference on Human Robot Interaction, pp. 431–438. ACM, Boston, United States, March 2012. https://doi.org/10.1145/2157689.2157834, https://hal.inria.fr/hal-00768668

  20. Duong, N.Q., Vincent, E., Gribonval, R.: Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010)

  21. Ephrat, A., et al.: Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619 (2018)

  22. Fang, K., Toshev, A., Fei-Fei, L., Savarese, S.: Scene memory transformer for embodied agents in long-horizon tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 538–547 (2019)

  23. Fisher III, J.W., Darrell, T., Freeman, W., Viola, P.: Learning joint statistical models for audio-visual fusion and segregation. In: Leen, T., Dietterich, T., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 772–778. MIT Press (2001). https://proceedings.neurips.cc/paper/2000/file/11f524c3fbfeeca4aa916edcb6b6392e-Paper.pdf

  24. Gabbay, A., Shamir, A., Peleg, S.: Visual speech enhancement. arXiv preprint arXiv:1711.08789 (2017)

  25. Gan, C., Huang, D., Zhao, H., Tenenbaum, J.B., Torralba, A.: Music gesture for visual sound separation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10478–10487 (2020)

  26. Gan, C., Zhang, Y., Wu, J., Gong, B., Tenenbaum, J.B.: Look, listen, and act: towards audio-visual embodied navigation. In: ICRA (2020)

  27. Gao, R., Feris, R., Grauman, K.: Learning to separate object sounds by watching unlabeled video. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–53 (2018)

  28. Gao, R., Grauman, K.: 2.5D visual sound. In: CVPR (2019)

  29. Gao, R., Grauman, K.: Co-separating sounds of visual objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3879–3888 (2019)

  30. Gao, R., Grauman, K.: VisualVoice: audio-visual speech separation with cross-modal consistency. arXiv preprint arXiv:2101.03149 (2021)

  31. Gu, R., et al.: Neural spatial filter: target speaker speech separation assisted with directional information. In: Kubin, G., Kacic, Z. (eds.) Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019, pp. 4290–4294. ISCA (2019). https://doi.org/10.21437/Interspeech.2019-2266

  32. Gu, R., Zou, Y.: Temporal-spatial neural filter: direction informed end-to-end multi-channel target speech separation. arXiv preprint arXiv:2001.00391 (2020)

  33. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  34. Hershey, J.R., Movellan, J.R.: Audio vision: using audio-visual synchrony to locate sounds. In: NeurIPS (2000)

  35. Huang, P., Kim, M., Hasegawa-Johnson, M., Smaragdis, P.: Deep learning for monaural speech separation. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1562–1566 (2014). https://doi.org/10.1109/ICASSP.2014.6853860

  36. Li, B., Dinesh, K., Duan, Z., Sharma, G.: See and listen: score-informed association of sound tracks to players in chamber music performance videos. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2906–2910 (2017). https://doi.org/10.1109/ICASSP.2017.7952688

  37. Lu, W.T., Wang, J.C., Won, M., Choi, K., Song, X.: SpecTNT: a time-frequency transformer for music audio. arXiv preprint arXiv:2110.09127 (2021)

  38. Majumder, S., Al-Halah, Z., Grauman, K.: Move2Hear: active audio-visual source separation. In: ICCV (2021)

  39. Mezghani, L., et al.: Memory-augmented reinforcement learning for image-goal navigation. arXiv preprint arXiv:2101.05181 (2021)

  40. Mezghani, L., Sukhbaatar, S., Szlam, A., Joulin, A., Bojanowski, P.: Learning to visually navigate in photorealistic environments without any supervision. arXiv preprint arXiv:2004.04954 (2020)

  41. Žmolíková, K., et al.: SpeakerBeam: speaker aware neural network for target speaker extraction in speech mixtures. IEEE J. Sel. Top. Sign. Proces. 13(4), 800–814 (2019). https://doi.org/10.1109/JSTSP.2019.2922820

  42. Nakadai, K., Hidai, K.i., Okuno, H.G., Kitano, H.: Real-time speaker localization and speech separation by audio-visual integration. In: Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), vol. 1, pp. 1043–1049. IEEE (2002)

  43. Nakadai, K., Lourens, T., Okuno, H.G., Kitano, H.: Active audition for humanoid. In: AAAI (2000)

  44. Ochiai, T., et al.: Listen to what you want: neural network-based universal sound selector. In: Meng, H., Xu, B., Zheng, T.F. (eds.) Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020, pp. 1441–1445. ISCA (2020). https://doi.org/10.21437/Interspeech.2020-2210

  45. Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 631–648 (2018)

  46. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015). https://doi.org/10.1109/ICASSP.2015.7178964

  47. Parekh, S., Essid, S., Ozerov, A., Duong, N.Q.K., Pérez, P., Richard, G.: Motion informed audio source separation. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6–10 (2017). https://doi.org/10.1109/ICASSP.2017.7951787

  48. Piczak, K.J.: ESC: dataset for environmental sound classification. In: Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018. ACM Press (2015). https://doi.org/10.1145/2733373.2806390, http://dl.acm.org/citation.cfm?doid=2733373.2806390

  49. Pu, J., Panagakis, Y., Petridis, S., Pantic, M.: Audio-visual object localization and separation using low-rank and sparsity. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2901–2905. IEEE (2017)

  50. Ramakrishnan, S.K., Nagarajan, T., Al-Halah, Z., Grauman, K.: Environment predictive coding for embodied agents. arXiv preprint arXiv:2102.02337 (2021)

  51. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28

  52. Roux, J.L., Wisdom, S., Erdogan, H., Hershey, J.R.: SDR - half-baked or well done? In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630 (2019). https://doi.org/10.1109/ICASSP.2019.8683855

  53. Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV (2019)

  54. Sedighin, F., Babaie-Zadeh, M., Rivet, B., Jutten, C.: Two multimodal approaches for single microphone source separation. In: 2016 24th European Signal Processing Conference (EUSIPCO), pp. 110–114 (2016). https://doi.org/10.1109/EUSIPCO.2016.7760220

  55. Smaragdis, P., Raj, B., Shashanka, M.: Supervised and semi-supervised separation of sounds from single-channel mixtures. In: Davies, M.E., James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007. LNCS, vol. 4666, pp. 414–421. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74494-8_52

  56. Smaragdis, P., Casey, M.: Audio/visual independent components. In: Proceedings of the International Symposium on Independent Component Analysis and Blind Source Separation (2003)

  57. Spiertz, M., Gnann, V.: Source-filter based clustering for monaural blind source separation. In: Proceedings of International Conference on Digital Audio Effects DAFx’09 (2009)

  58. Subakan, C., Ravanelli, M., Cornell, S., Bronzi, M., Zhong, J.: Attention is all you need in speech separation. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 21–25. IEEE (2021)

  59. Tzinis, E., et al.: Into the wild with AudioScope: unsupervised audio-visual separation of on-screen sounds. arXiv preprint arXiv:2011.01143 (2020)

  60. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  61. Viciana-Abad, R., Marfil, R., Perez-Lorenzo, J., Bandera, J., Romero-Garces, A., Reche-Lopez, P.: Audio-visual perception system for a humanoid robotic head. Sensors 14(6), 9522–9545 (2014)

  62. Virtanen, T.: Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 15(3), 1066–1074 (2007)

  63. Weiss, R.J., Mandel, M.I., Ellis, D.P.: Source separation based on binaural cues and source model constraints. In: Ninth Annual Conference of the International Speech Communication Association (Interspeech) (2008)

  64. Wijmans, E., et al.: DD-PPO: learning near-perfect pointgoal navigators from 2.5 billion frames. arXiv preprint arXiv:1911.00357 (2019)

  65. Xu, X., Dai, B., Lin, D.: Recursive visual sound separation using minus-plus net. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 882–891 (2019)

  66. Yılmaz, Ö., Rickard, S.: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process. 52(7), 1830–1847 (2004)

  67. Zadeh, A., Ma, T., Poria, S., Morency, L.P.: WildMix dataset and spectro-temporal transformer model for monoaural audio source separation. arXiv preprint arXiv:1911.09783 (2019)

  68. Zhang, X., Wang, D.: Deep learning based binaural speech separation in reverberant environments. IEEE/ACM Trans. Audio Speech Lang. Process. 25(5), 1075–1084 (2017)

  69. Zhang, Z., He, B., Zhang, Z.: TransMask: a compact and fast speech separation model based on transformer. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5764–5768. IEEE (2021)

  70. Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., Torralba, A.: The sound of pixels. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 570–586 (2018)

Acknowledgements

Thank you to Ziad Al-Halah for very valuable discussions. Thanks to Tushar Nagarajan, Kumar Ashutosh, and David Harwath for feedback on paper drafts. UT Austin is supported in part by DARPA L2M, NSF CCRI, and the IFML NSF AI Institute. K.G. is paid as a research scientist by Meta.

Author information

Corresponding author

Correspondence to Sagnik Majumder.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 561 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Majumder, S., Grauman, K. (2022). Active Audio-Visual Separation of Dynamic Sound Sources. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_32

  • DOI: https://doi.org/10.1007/978-3-031-19842-7_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19841-0

  • Online ISBN: 978-3-031-19842-7

  • eBook Packages: Computer Science, Computer Science (R0)
