A Probabilistically-Oriented Analysis of the Performance of ASR Systems for Brazilian Radios and TVs

de Azevedo, Diego Marques; Rodrigues, Guilherme Souza; Ladeira, Marcelo

doi:10.1007/978-3-031-21689-3_13


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13654)

Included in the following conference series:

Brazilian Conference on Intelligent Systems

Abstract

With the use of neural network-based technologies, Automatic Speech Recognition (ASR) systems for Brazilian Portuguese (BP) have shown great progress in the last few years. Several state-of-the-art results were achieved by open-source end-to-end models, such as the Kaldi toolkit and Wav2vec 2.0. Alternative commercial tools are also available, including the Google and Microsoft speech-to-text APIs and the Audimus System of VoiceInteraction. We analyse the relative performance of these tools, measured by the Word Error Rate (WER), when transcribing audio recordings from Brazilian radio and TV channels. A generalized linear model (GLM) is designed to stochastically describe the relationship between some of the audio's properties (e.g. file format and audio duration) and the resulting WER, for each method under consideration. Among other uses, such a strategy enables the analysis of local performance, indicating not only which tool performs better, but also under which conditions it is expected to do so. This, in turn, could be used to design an optimized system composed of several transcribers. The data generated for this experiment and the scripts used to produce the stochastic model are publicly available.
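For readers unfamiliar with the metric, WER is commonly defined as the word-level edit distance between a reference transcript and a system's hypothesis (substitutions, deletions and insertions), divided by the number of words in the reference. The sketch below illustrates that standard definition only. The function name `word_error_rate`, the whitespace tokenization and the example sentences are assumptions made for illustration, not the authors' evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split, with no
    text normalization (casing, punctuation, numerals) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: the hypothesis drops one of five words.
    print(word_error_rate("o tempo hoje está bom", "o tempo está bom"))
```

Because insertions are counted, WER can exceed 1 when the hypothesis is much longer than the reference, which is worth keeping in mind when choosing a response distribution for a model of WER.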

Author information

Authors and Affiliations

Programa de Pós-Graduação em Computação Aplicada, University of Brasília, Brasília, Brazil
Diego Marques de Azevedo, Guilherme Souza Rodrigues & Marcelo Ladeira

Corresponding author

Correspondence to Diego Marques de Azevedo.

Editor information

Editors and Affiliations

Federal University of Rio Grande do Norte, Natal, Brazil
João Carlos Xavier-Junior
Federal University of Bahia, Salvador, Brazil
Ricardo Araújo Rios

About this paper

Cite this paper

de Azevedo, D.M., Rodrigues, G.S., Ladeira, M. (2022). A Probabilistically-Oriented Analysis of the Performance of ASR Systems for Brazilian Radios and TVs. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol 13654. Springer, Cham. https://doi.org/10.1007/978-3-031-21689-3_13

DOI: https://doi.org/10.1007/978-3-031-21689-3_13
Published: 19 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21688-6
Online ISBN: 978-3-031-21689-3
eBook Packages: Computer Science, Computer Science (R0)
