Abstract
This paper presents an audio-visual approach for voice separation that produces state-of-the-art results at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer, which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present several ablation studies and comparisons with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights are available at https://ipcv.github.io/VoViT/.
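To make the two-stage flow described above concrete, here is a minimal PyTorch sketch of the data path: a lightweight motion encoder over face landmarks, an audio-visual transformer that predicts a coarse estimate of the target voice, and an audio-only refinement stage. All module names, dimensions, and layer choices (including the plain MLP used in place of the graph convolutions and the additive fusion) are illustrative assumptions and not the authors' implementation; the official code and weights are at https://ipcv.github.io/VoViT/.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# Module names, shapes, and layers are illustrative assumptions, not the
# authors' implementation (see https://ipcv.github.io/VoViT/ for the real code).
import torch
import torch.nn as nn


class LandmarkMotionEncoder(nn.Module):
    """Lightweight stand-in for the graph convolutional network over face landmarks."""

    def __init__(self, num_landmarks=68, coord_dim=2, feat_dim=128):
        super().__init__()
        # A per-frame MLP over flattened landmarks approximates the role of the
        # graph convolutions: turning landmark motion into a feature sequence.
        self.proj = nn.Sequential(
            nn.Linear(num_landmarks * coord_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, landmarks):  # (batch, time, num_landmarks, coord_dim)
        b, t, n, c = landmarks.shape
        return self.proj(landmarks.reshape(b, t, n * c))  # (batch, time, feat_dim)


class AudioVisualSeparator(nn.Module):
    """Stage 1: transformer over fused audio and motion features, predicting a mask."""

    def __init__(self, audio_dim=256, feat_dim=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.mask_head = nn.Linear(feat_dim, audio_dim)

    def forward(self, mixture_spec, motion_feats):
        # mixture_spec: (batch, time, audio_dim); motion_feats: (batch, time, feat_dim)
        fused = self.audio_proj(mixture_spec) + motion_feats  # simple additive fusion
        mask = torch.sigmoid(self.mask_head(self.transformer(fused)))
        return mask * mixture_spec  # coarse estimate of the target voice


class AudioRefiner(nn.Module):
    """Stage 2: audio-only enhancement of the predominant voice."""

    def __init__(self, audio_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, audio_dim), nn.ReLU(), nn.Linear(audio_dim, audio_dim)
        )

    def forward(self, coarse_estimate):
        return self.net(coarse_estimate)


if __name__ == "__main__":
    landmarks = torch.randn(1, 100, 68, 2)  # 100 video frames of 68 2-D landmarks
    mixture = torch.randn(1, 100, 256)      # spectrogram-like mixture features
    motion = LandmarkMotionEncoder()(landmarks)
    coarse = AudioVisualSeparator()(mixture, motion)
    refined = AudioRefiner()(coarse)
    print(refined.shape)  # torch.Size([1, 100, 256])
```

The split into a coarse audio-visual stage followed by an audio-only refiner mirrors the abstract's description; in practice the visual stream and the spectrogram representation would be replaced by the paper's graph convolutional and transformer blocks.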
Acknowledgments
We acknowledge support from MICINN/FEDER UE project PID2021-127643NB-I00 and H2020-MSCA-RISE-2017 project 777826 NoMADS.
J.F.M. acknowledges support from FPI scholarship PRE2018-083920. We thank NVIDIA Corporation for the donation of GPUs used for the experiments.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Montesinos, J.F., Kadandale, V.S., Haro, G. (2022). VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_18
DOI: https://doi.org/10.1007/978-3-031-19836-6_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19835-9
Online ISBN: 978-3-031-19836-6
eBook Packages: Computer Science, Computer Science (R0)