Abstract
Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached by utilizing audiovisual data while relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), and by introducing a contrastive loss.
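To give a concrete picture of the weakly-supervised objective described above, the sketch below pairs an audio embedding with the embeddings of the candidate faces in the same clip and applies an InfoNCE-style contrastive loss, pulling the audio towards the weakly-labelled sound source and pushing it away from the remaining candidates. The tensor names, the temperature, and the InfoNCE form are assumptions made for illustration only; the paper's exact formulation may differ.

    import torch
    import torch.nn.functional as F

    def contrastive_av_loss(audio_emb, face_embs, active_idx, temperature=0.1):
        """Illustrative contrastive objective (hypothetical names, not the
        paper's exact loss): given one audio embedding and the embeddings of
        the candidate faces in the same clip, treat the weakly-labelled sound
        source as the positive and the other candidates as negatives.

        audio_emb : (D,)   audio feature of the clip
        face_embs : (N, D) features of the N candidate speakers
        active_idx: int    index of the face treated as the sound source
        """
        audio_emb = F.normalize(audio_emb, dim=-1)
        face_embs = F.normalize(face_embs, dim=-1)
        # Cosine similarity between the audio and every candidate face.
        sims = face_embs @ audio_emb / temperature          # shape (N,)
        target = torch.tensor(active_idx)
        # Cross-entropy over candidates = InfoNCE with one positive.
        return F.cross_entropy(sims.unsqueeze(0), target.unsqueeze(0))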
Notes
1. Since tracklets include a single face and were manually curated, it is guaranteed that two tracklets that overlap in time belong to different identities. If there is a single person in the scene, we sample additional visual data from another tracklet in a different movie where no speech event is detected.
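A minimal sketch of this sampling rule is given below, assuming a simple tracklet record with a movie identifier, a time span, and a speech flag; all field and function names are hypothetical and only illustrate the selection logic described in the note.

    import random

    def sample_negative_tracklet(current, tracklets):
        """Illustrative sketch of the note's sampling rule (names hypothetical).

        Prefer tracklets that overlap the current one in time within the same
        movie: since each manually curated tracklet contains a single face,
        such tracklets are guaranteed to show a different identity. If the
        current tracklet is the only person in the scene, fall back to a
        tracklet from a different movie with no detected speech event.
        """
        overlapping = [t for t in tracklets
                       if t["movie"] == current["movie"] and t["id"] != current["id"]
                       and t["start"] < current["end"] and t["end"] > current["start"]]
        if overlapping:
            return random.choice(overlapping)
        silent_elsewhere = [t for t in tracklets
                            if t["movie"] != current["movie"] and not t["has_speech"]]
        return random.choice(silent_elsewhere) if silent_elsewhere else None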
Acknowledgements
This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Alcázar, J.L., Cordes, M., Zhao, C., Ghanem, B. (2022). End-to-End Active Speaker Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_8
DOI: https://doi.org/10.1007/978-3-031-19836-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19835-9
Online ISBN: 978-3-031-19836-6