
VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer

  • Conference paper
  • In: Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

This paper presents an audio-visual approach for voice separation which produces state-of-the-art results at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produces a fairly good estimate of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and comparisons with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights are available at https://ipcv.github.io/VoViT/.
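To make the two-stage pipeline described above concrete, here is a minimal PyTorch sketch of the data flow: a lightweight landmark encoder produces motion features, an audio-visual transformer fuses them with the mixture spectrogram to obtain a first voice estimate, and an audio-only stage refines it. All module names, dimensions, and the additive fusion scheme are illustrative assumptions rather than the authors' implementation; the official code and weights are available at https://ipcv.github.io/VoViT/.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# Module names, dimensions, and fusion choices are illustrative assumptions;
# the official implementation is at https://ipcv.github.io/VoViT/.
import torch
import torch.nn as nn


class LandmarkGraphEncoder(nn.Module):
    """Stand-in for the lightweight graph convolutional network that
    extracts motion cues from face-landmark sequences."""

    def __init__(self, num_landmarks: int = 68, feat_dim: int = 256):
        super().__init__()
        # A plain temporal convolution stands in for spatio-temporal graph
        # convolutions over the landmark graph (ST-GCN-style models).
        self.net = nn.Sequential(
            nn.Conv1d(num_landmarks * 2, feat_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2),
        )

    def forward(self, landmarks: torch.Tensor) -> torch.Tensor:
        # landmarks: (batch, time, num_landmarks, 2) -> (batch, time, feat_dim)
        b, t, n, c = landmarks.shape
        x = landmarks.reshape(b, t, n * c).transpose(1, 2)
        return self.net(x).transpose(1, 2)


class AudioVisualSeparator(nn.Module):
    """Stage 1: audio-visual transformer that fuses spectrogram and motion
    features and predicts a first estimate of the target voice."""

    def __init__(self, audio_dim: int = 256, feat_dim: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.mask_head = nn.Linear(feat_dim, audio_dim)

    def forward(self, mix_spec: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # mix_spec: (batch, time, audio_dim); motion: (batch, time, feat_dim)
        x = self.audio_proj(mix_spec) + motion  # additive fusion (assumption)
        x = self.transformer(x)
        mask = torch.sigmoid(self.mask_head(x))
        return mix_spec * mask  # coarse estimate of the isolated voice


class VoiceEnhancer(nn.Module):
    """Stage 2: audio-only network that refines the predominant voice."""

    def __init__(self, audio_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, audio_dim),
            nn.ReLU(),
            nn.Linear(audio_dim, audio_dim),
        )

    def forward(self, coarse_spec: torch.Tensor) -> torch.Tensor:
        return self.net(coarse_spec)


# Usage: separate one target speaker from a mixture spectrogram.
if __name__ == "__main__":
    mix = torch.randn(1, 100, 256)    # (batch, frames, frequency bins)
    lmk = torch.randn(1, 100, 68, 2)  # (batch, frames, landmarks, xy)
    motion = LandmarkGraphEncoder()(lmk)
    coarse = AudioVisualSeparator()(mix, motion)
    refined = VoiceEnhancer()(coarse)
    print(refined.shape)  # torch.Size([1, 100, 256])
```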



Acknowledgments

We acknowledge support by MICINN/FEDER UE project PID2021-127643NB-I00; H2020-MSCA-RISE-2017 project 777826 NoMADS.

J.F.M. acknowledges support by FPI scholarship PRE2018-083920. We acknowledge NVIDIA Corporation for the donation of GPUs used for the experiments.

Author information

Corresponding author

Correspondence to Juan F. Montesinos.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 6105 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Montesinos, J.F., Kadandale, V.S., Haro, G. (2022). VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_18

  • DOI: https://doi.org/10.1007/978-3-031-19836-6_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19835-9

  • Online ISBN: 978-3-031-19836-6

  • eBook Packages: Computer Science, Computer Science (R0)
