Beyond Audio Signals: Generative Model-Based Speaker Diarization in Portuguese

  • Conference paper
  • In: Intelligent Systems (BRACIS 2024)

Abstract

Speaker diarization, the task of automatically identifying the different speakers in audio or video, is typically performed with probabilistic models and deep learning techniques. However, existing methods usually rely on direct analysis of the audio signal, which poses challenges for languages without established diarization methodologies, such as Portuguese. In this article, we propose a new approach to speaker diarization that leverages generative models for automatic speaker identification in Portuguese. Our pipeline employs three models: one to transcribe the audio, one to refine the transcription, and one to perform the diarization itself. By capturing and analyzing speaker style patterns in the transcribed text, our method simplifies the diarization process and achieves high accuracy without depending on direct signal analysis. This approach not only increases the effectiveness of speaker identification but also extends the usefulness of generative models to new domains, opening a new perspective for diarization research, especially for building accurate systems for under-researched languages in audio and video applications.
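The three-stage flow described in the abstract (transcription, refinement, diarization) can be sketched as below. This is a minimal illustration, not the authors' implementation: every function body is a placeholder standing in for a real model call (e.g. an ASR system for stage 1 and a generative language model for stages 2 and 3), and all names, prompts, and outputs are assumptions.

```python
def transcribe(audio_path: str) -> str:
    """Stage 1: speech-to-text. A real system would invoke an ASR model here."""
    # Placeholder transcript: raw, unpunctuated text with both speakers mixed.
    return "bom dia a todos obrigado presidente gostaria de comentar"


def refine(transcript: str) -> str:
    """Stage 2: a generative model restores punctuation and sentence breaks."""
    # Placeholder for a language-model call that cleans the raw transcript.
    return "Bom dia a todos. Obrigado, presidente. Gostaria de comentar."


def diarize(refined: str) -> list[tuple[str, str]]:
    """Stage 3: a generative model assigns a speaker label to each utterance,
    judging by stylistic patterns in the text rather than the audio signal."""
    # Placeholder heuristic: alternate speakers per sentence for illustration.
    sentences = [s.strip() for s in refined.split(".") if s.strip()]
    return [(f"SPEAKER_{i % 2}", s) for i, s in enumerate(sentences)]


def pipeline(audio_path: str) -> list[tuple[str, str]]:
    """Chain the three stages: transcribe, refine, then diarize."""
    return diarize(refine(transcribe(audio_path)))


if __name__ == "__main__":
    for speaker, utterance in pipeline("session.wav"):
        print(f"{speaker}: {utterance}")
```

The key design point is that the diarization stage never touches the waveform: once the transcript is cleaned, speaker turns are inferred purely from textual style, which is what makes the approach transferable to languages lacking audio-based diarization resources.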


Notes

  1.

    In this article, we focus on diarizing videos; however, the process can be equally applied to audio inputs.


Author information

Correspondence to Antônio Oss Boll.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Boll, A.O., Puttlitz, L.M., Boll, H.O., Malossi, R.M. (2025). Beyond Audio Signals: Generative Model-Based Speaker Diarization in Portuguese. In: Paes, A., Verri, F.A.N. (eds) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science(), vol 15412. Springer, Cham. https://doi.org/10.1007/978-3-031-79029-4_17

  • DOI: https://doi.org/10.1007/978-3-031-79029-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-79028-7

  • Online ISBN: 978-3-031-79029-4

  • eBook Packages: Computer Science (R0)
