Abstract
This paper presents an audio-visual approach for voice separation that produces state-of-the-art results at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer, which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present several ablation studies and comparisons with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights are available at https://ipcv.github.io/VoViT/.
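To make the two-stage flow described above concrete, here is a minimal PyTorch sketch of the data path: a lightweight motion encoder over face landmarks, an audio-visual transformer that predicts a coarse estimate of the target voice, and an audio-only refinement stage. All module names, dimensions, and layer choices (including the plain MLP used in place of the graph convolutions and the additive fusion) are illustrative assumptions and not the authors' implementation; the official code and weights are at https://ipcv.github.io/VoViT/.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# Module names, shapes, and layers are illustrative assumptions, not the
# authors' implementation (see https://ipcv.github.io/VoViT/ for the real code).
import torch
import torch.nn as nn


class LandmarkMotionEncoder(nn.Module):
    """Lightweight stand-in for the graph convolutional network over face landmarks."""

    def __init__(self, num_landmarks=68, coord_dim=2, feat_dim=128):
        super().__init__()
        # A per-frame MLP over flattened landmarks approximates the role of the
        # graph convolutions: turning landmark motion into a feature sequence.
        self.proj = nn.Sequential(
            nn.Linear(num_landmarks * coord_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, landmarks):  # (batch, time, num_landmarks, coord_dim)
        b, t, n, c = landmarks.shape
        return self.proj(landmarks.reshape(b, t, n * c))  # (batch, time, feat_dim)


class AudioVisualSeparator(nn.Module):
    """Stage 1: transformer over fused audio and motion features, predicting a mask."""

    def __init__(self, audio_dim=256, feat_dim=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.mask_head = nn.Linear(feat_dim, audio_dim)

    def forward(self, mixture_spec, motion_feats):
        # mixture_spec: (batch, time, audio_dim); motion_feats: (batch, time, feat_dim)
        fused = self.audio_proj(mixture_spec) + motion_feats  # simple additive fusion
        mask = torch.sigmoid(self.mask_head(self.transformer(fused)))
        return mask * mixture_spec  # coarse estimate of the target voice


class AudioRefiner(nn.Module):
    """Stage 2: audio-only enhancement of the predominant voice."""

    def __init__(self, audio_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, audio_dim), nn.ReLU(), nn.Linear(audio_dim, audio_dim)
        )

    def forward(self, coarse_estimate):
        return self.net(coarse_estimate)


if __name__ == "__main__":
    landmarks = torch.randn(1, 100, 68, 2)  # 100 video frames of 68 2-D landmarks
    mixture = torch.randn(1, 100, 256)      # spectrogram-like mixture features
    motion = LandmarkMotionEncoder()(landmarks)
    coarse = AudioVisualSeparator()(mixture, motion)
    refined = AudioRefiner()(coarse)
    print(refined.shape)  # torch.Size([1, 100, 256])
```

The split into a coarse audio-visual stage followed by an audio-only refiner mirrors the abstract's description; in practice the visual stream and the spectrogram representation would be replaced by the paper's graph convolutional and transformer blocks.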
Acknowledgments
We acknowledge support from MICINN/FEDER UE project PID2021-127643NB-I00 and H2020-MSCA-RISE-2017 project 777826 NoMADS.
J.F.M. acknowledges support from FPI scholarship PRE2018-083920. We thank NVIDIA Corporation for the donation of GPUs used for the experiments.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Montesinos, J.F., Kadandale, V.S., Haro, G. (2022). VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_18
DOI: https://doi.org/10.1007/978-3-031-19836-6_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19835-9
Online ISBN: 978-3-031-19836-6
eBook Packages: Computer Science, Computer Science (R0)