Video-Guided Sound Source Separation

Zhou, Junfeng; Wang, Feng; Guo, Di; Liu, Huaping; Sun, Fuchun

doi:10.1007/978-3-030-27526-6_36

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11740)

Included in the following conference series:

International Conference on Intelligent Robotics and Applications

Abstract

A major aim of sound source separation is to extract a sound of interest from a mixture, such as the sound of objects visible on screen. In this paper we put forward a method that incorporates sound-indicated object detection and uses the detection result to separate on-screen sounds from off-screen ones. After training, the object detection network can recognize which object is sounding, much as a human learns which object makes which sound. Then, using the temporal information of sounds in a video segment, we separate out the sound of objects that are not shown in the video. Finally, experiments are carried out on data from AudioSet, and we demonstrate that the method works well in the given scenarios.
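The idea of using temporal information to split on-screen from off-screen sound can be illustrated with a small sketch. This is not the authors' network-based method, which the abstract does not specify in detail. It swaps in classical nonnegative matrix factorization (NMF) and assumes a hypothetical per-frame score, `onscreen_activity`, standing in for the output of an object detector. The function names, the correlation threshold, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (freq x time) as W @ H using
    multiplicative updates for the Euclidean cost."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def split_by_visual_activity(V, onscreen_activity, rank=4, threshold=0.5):
    """Return soft-masked (on_screen, off_screen) magnitude spectrograms.

    onscreen_activity is a hypothetical per-frame score (assumption):
    high when a detector sees the sounding object in that frame.
    """
    eps = 1e-9
    W, H = nmf(V, rank)
    on = np.zeros(rank, dtype=bool)
    for k in range(rank):
        # Skip constant signals, for which correlation is undefined.
        if H[k].std() > 0 and onscreen_activity.std() > 0:
            corr = np.corrcoef(H[k], onscreen_activity)[0, 1]
            on[k] = corr > threshold
    # Components whose activation follows the visual activity count as
    # on-screen; all remaining components count as off-screen.
    est_on = W[:, on] @ H[on]
    est_off = W[:, ~on] @ H[~on]
    mask_on = est_on / (est_on + est_off + eps)
    return mask_on * V, (1.0 - mask_on) * V


# Synthetic mixture (assumption): a low-band source that is visible in the
# first 40 frames and a high-band source that is audible only later.
rng = np.random.default_rng(1)
n_freq, n_time = 16, 100
spec_a = np.r_[rng.random(8) + 0.5, np.zeros(8)]
spec_b = np.r_[np.zeros(8), rng.random(8) + 0.5]
act_a = (np.arange(n_time) < 40).astype(float)
act_b = (np.arange(n_time) >= 60).astype(float)
V = np.outer(spec_a, act_a) + np.outer(spec_b, act_b)

V_on, V_off = split_by_visual_activity(V, act_a, rank=2)
```

Because the two outputs come from complementary soft masks applied to the same mixture, they always sum back to the input spectrogram. How cleanly the sources split depends on the factorization and on the chosen threshold.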

This work was supported in part by the National Natural Science Foundation of China under Grant U1613212.

Author information

Authors and Affiliations

State Key Laboratory of Intelligent Technology and Systems, TNLIST, Department of Computer Science and Technology, Tsinghua University, Beijing, People’s Republic of China
Junfeng Zhou, Feng Wang, Di Guo, Huaping Liu & Fuchun Sun

Corresponding author

Correspondence to Huaping Liu.

Editor information

Editors and Affiliations

Shenyang Institute of Automation, Shenyang, China
Haibin Yu
Shenyang Institute of Automation, Shenyang, China
Jinguo Liu
Shenyang Institute of Automation, Shenyang, China
Lianqing Liu
University of Portsmouth, Portsmouth, UK
Zhaojie Ju
Shenyang Institute of Automation, Shenyang, China
Yuwang Liu
University of Portsmouth, Portsmouth, UK
Dalin Zhou

About this paper

Cite this paper

Zhou, J., Wang, F., Guo, D., Liu, H., Sun, F. (2019). Video-Guided Sound Source Separation. In: Yu, H., Liu, J., Liu, L., Ju, Z., Liu, Y., Zhou, D. (eds) Intelligent Robotics and Applications. ICIRA 2019. Lecture Notes in Computer Science, vol 11740. Springer, Cham. https://doi.org/10.1007/978-3-030-27526-6_36

DOI: https://doi.org/10.1007/978-3-030-27526-6_36
Published: 02 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27525-9
Online ISBN: 978-3-030-27526-6
eBook Packages: Computer Science, Computer Science (R0)
