End-to-End Active Speaker Detection

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy which demonstrates that the ASD problem can also be approached using audiovisual data while relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), and by introducing a contrastive loss.
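The paper itself includes no code; as a rough illustration of the weakly-supervised idea, the sketch below implements a margin-based audio-visual contrastive loss in PyTorch. All tensor names, shapes, and the margin formulation are assumptions made for illustration, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb, visual_embs, labels, margin=1.0):
    """Contrastive loss between one clip-level audio embedding and the
    visual embeddings of every candidate speaker in that clip.

    audio_emb:   (D,)   audio feature for the clip
    visual_embs: (N, D) one visual feature per candidate face
    labels:      (N,)   1.0 if that face is speaking, else 0.0

    Hypothetical shapes and names; the paper's formulation may differ.
    """
    # Distance between the audio embedding and each candidate face
    dists = torch.norm(visual_embs - audio_emb.unsqueeze(0), dim=1)

    # Pull (audio, speaking-face) pairs together; push non-speaking
    # faces at least `margin` away from the audio embedding.
    pos = labels * dists.pow(2)
    neg = (1.0 - labels) * F.relu(margin - dists).pow(2)
    return (pos + neg).mean()
```

Under audio-only supervision, the positive and negative pairs would have to come from the audio labels together with the tracklet structure (cf. the sampling rule in the Notes below).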

Notes

  1. Since tracklets include a single face and were manually curated, it is guaranteed that two tracklets that overlap in time belong to different identities. If there is a single person in the scene, we sample additional visual data from another tracklet in a different movie where no speech event is detected (a sketch of this rule follows).
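As a rough illustration of this sampling rule, the snippet below picks negatives from tracklets that overlap the anchor in time, falling back to a silent tracklet from another movie when the scene contains a single person. The helper name and data fields are hypothetical, not the paper's actual data API.

```python
import random

def sample_negative_tracklet(anchor, movie_tracklets, silent_pool):
    """Pick a visual negative for `anchor` following the rule above.

    anchor:          dict with 'movie', 'start', 'end' keys
    movie_tracklets: tracklets from the same movie as the anchor
    silent_pool:     tracklets from movies with no detected speech event

    All field names are illustrative assumptions.
    """
    # Tracklets that overlap the anchor in time are guaranteed to show
    # a different identity (each curated tracklet holds a single face).
    overlapping = [t for t in movie_tracklets
                   if t is not anchor
                   and t['start'] < anchor['end']
                   and t['end'] > anchor['start']]
    if overlapping:
        return random.choice(overlapping)
    # Single person in the scene: fall back to a tracklet from a
    # different movie where no speech event is detected.
    candidates = [t for t in silent_pool if t['movie'] != anchor['movie']]
    return random.choice(candidates)
```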


Acknowledgements

This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.

Author information

Corresponding author

Correspondence to Juan León Alcázar.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 187 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Alcázar, J.L., Cordes, M., Zhao, C., Ghanem, B. (2022). End-to-End Active Speaker Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_8

  • DOI: https://doi.org/10.1007/978-3-031-19836-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19835-9

  • Online ISBN: 978-3-031-19836-6

  • eBook Packages: Computer Science, Computer Science (R0)
