Deep Speaker Embeddings Based Online Diarization

  • Conference paper
Speech and Computer (SPECOM 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13721)

Abstract

This paper describes our experiments with the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) model for the development of an online diarization system. For this task, several UIS-RNN models based on different speaker embedding extractors were trained. These systems were evaluated in terms of the Diarization Error Rate (DER) metric on public and private test datasets, and were additionally tested on real dialogue data recorded in a bank office. The proposed online models outperform the standard offline Agglomerative Hierarchical Clustering (AHC) approach and are comparable with the state-of-the-art offline Bayesian HMM (VBx) method.
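
The systems above are scored with Diarization Error Rate. As a rough illustration of how that metric is computed, the following is a minimal frame-level sketch in Python; the helper frame_level_der, the toy label arrays, and the use of SciPy's Hungarian solver are illustrative assumptions rather than the authors' scoring code, and standard tooling (e.g. the NIST md-eval script or pyannote.metrics) additionally applies a forgiveness collar and handles overlapping speech.

```python
# Minimal frame-level DER sketch (illustrative only, not the scoring tool used
# in the paper). Reference and hypothesis are integer speaker labels per frame;
# -1 marks non-speech. A real evaluation would also apply a collar and treat
# overlapped speech explicitly.
import numpy as np
from scipy.optimize import linear_sum_assignment


def frame_level_der(ref: np.ndarray, hyp: np.ndarray) -> float:
    """DER = (missed speech + false alarm + speaker confusion) / total reference speech."""
    assert ref.shape == hyp.shape
    ref_speech = ref >= 0
    hyp_speech = hyp >= 0

    missed = np.sum(ref_speech & ~hyp_speech)       # speech frames the system left silent
    false_alarm = np.sum(~ref_speech & hyp_speech)  # non-speech frames labeled as speech

    # Optimal one-to-one mapping between reference and hypothesis speakers over
    # frames where both streams contain speech (Hungarian algorithm).
    both = ref_speech & hyp_speech
    ref_ids = np.unique(ref[both])
    hyp_ids = np.unique(hyp[both])
    overlap = np.zeros((len(ref_ids), len(hyp_ids)), dtype=int)
    for i, r in enumerate(ref_ids):
        for j, h in enumerate(hyp_ids):
            overlap[i, j] = np.sum(both & (ref == r) & (hyp == h))
    rows, cols = linear_sum_assignment(-overlap)    # maximize matched frames
    confusion = np.sum(both) - overlap[rows, cols].sum()

    total_speech = np.sum(ref_speech)
    return float(missed + false_alarm + confusion) / max(int(total_speech), 1)


if __name__ == "__main__":
    ref = np.array([0, 0, 0, 1, 1, 1, -1, -1, 1, 1])  # ground-truth speaker per frame
    hyp = np.array([5, 5, 5, 7, 7, 5, 7, -1, 7, 7])   # system output per frame
    print(f"DER = {frame_level_der(ref, hyp):.2%}")   # 25.00% for this toy pair
```

Because DER accumulates missed speech, false alarms, and speaker confusion over the whole recording, an online system pays for every early labeling decision it cannot later revise, which makes a comparison against offline AHC and VBx baselines on equal footing informative.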

Author information

Corresponding author

Correspondence to Anastasia Avdeeva.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Avdeeva, A., Novoselov, S. (2022). Deep Speaker Embeddings Based Online Diarization. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science (LNAI), vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_3

  • DOI: https://doi.org/10.1007/978-3-031-20980-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20979-6

  • Online ISBN: 978-3-031-20980-2

  • eBook Packages: Computer Science; Computer Science (R0)
