Abstract
Speaker diarization, the task of automatically identifying different speakers in audio and video, is frequently performed using probabilistic models and deep learning techniques. However, existing methods usually rely on direct analysis of the audio signal, which presents challenges for languages that lack established diarization methodologies, such as Portuguese. In this article, we propose a new approach to speaker diarization that leverages generative models for automatic speaker identification in Portuguese. We employed three models: one for the initial transcription of the audio, a generative model for refining the transcript, and another generative model for performing the diarization itself. Our method simplifies the diarization process by capturing and analyzing speaker style patterns from transcribed audio and achieves high accuracy without depending on direct signal analysis. This approach not only increases the effectiveness of speaker identification but also extends the usefulness of generative models to new domains. It opens a new perspective for diarization research, especially for the development of accurate systems for under-researched languages in audio and video applications.
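The abstract describes a three-stage pipeline: transcribe the audio, refine the transcript with a generative model, then diarize from the text alone by analyzing speaker style. The sketch below illustrates only that structure; the placeholder functions, the sample transcript, and the alternating-speaker rule are illustrative assumptions, not the authors' actual models or prompts (a real system would call an ASR model such as Whisper in stage 1 and a large language model in stages 2 and 3).

```python
# A minimal, hypothetical sketch of the text-based diarization pipeline.
# Each stage is a stand-in function; only the three-stage flow and the
# (speaker, utterance) output format reflect the approach in the abstract.

def transcribe(audio_path: str) -> str:
    """Stage 1: ASR. A real system would run a speech model on audio_path."""
    # Placeholder: pretend the ASR produced this raw Portuguese text.
    return "bom dia a todos obrigado presidente quero falar sobre educacao"

def refine(raw_transcript: str) -> str:
    """Stage 2: a generative model would restore punctuation and casing.
    Here a trivial string fix stands in for that model call."""
    text = raw_transcript.capitalize()
    return text if text.endswith(".") else text + "."

def diarize(transcript: str) -> list:
    """Stage 3: a generative model would segment the transcript by speaker
    style. A trivial rule (split sentences, alternate speaker labels)
    stands in, just to show the output format."""
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    return [(f"SPEAKER_{i % 2}", s) for i, s in enumerate(sentences)]

if __name__ == "__main__":
    raw = transcribe("session.wav")  # "session.wav" is an illustrative path
    clean = refine(raw)
    for speaker, utterance in diarize(clean):
        print(f"{speaker}: {utterance}")
```

Because stage 3 operates on text rather than the waveform, swapping in a different language (or a different ASR front end) changes only stage 1, which is the portability argument the abstract makes for Portuguese.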
Notes
1. In this article, we focus on diarizing videos; however, the process can be equally applied to audio inputs.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Boll, A.O., Puttlitz, L.M., Boll, H.O., Malossi, R.M. (2025). Beyond Audio Signals: Generative Model-Based Speaker Diarization in Portuguese. In: Paes, A., Verri, F.A.N. (eds) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science(), vol 15412. Springer, Cham. https://doi.org/10.1007/978-3-031-79029-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-79028-7
Online ISBN: 978-3-031-79029-4