Abstract
With the use of neural network-based technologies, Automatic Speech Recognition (ASR) systems for Brazilian Portuguese (BP) have shown great progress in the last few years. Several state-of-the-art results were achieved by open-source end-to-end models, such as the Kaldi toolkit and Wav2vec 2.0. Alternative commercial tools are also available, including the Google and Microsoft speech-to-text APIs and the Audimus System of VoiceInteraction. We analyse the relative performance of such tools, in terms of the so-called Word Error Rate (WER), when transcribing audio recordings from Brazilian radio and TV channels. A generalized linear model (GLM) is designed to stochastically describe the relationship between some of the audio’s properties (e.g. file format and audio duration) and the resulting WER, for each method under consideration. Among other uses, such a strategy enables the analysis of local performance, indicating not only which tool performs better, but also when exactly it is expected to do so. This, in turn, could be used to design an optimized system composed of several transcribers. The data generated for conducting this experiment and the scripts used to produce the stochastic model are publicly available.
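For readers unfamiliar with the metric, the WER is commonly defined as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the transcriber's output, divided by the number of words in the reference. The sketch below illustrates that standard definition only; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration, and the text normalisation actually applied in the experiment is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokens are obtained by splitting on whitespace,
    with no casing or punctuation normalisation (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of five reference words.
    print(word_error_rate("o tempo hoje está bom", "o tempo está bom"))
```

Because insertions are counted, the WER can exceed 1 when the hypothesis is much longer than the reference, which is worth keeping in mind when choosing a distribution for a model of this quantity.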
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
de Azevedo, D.M., Rodrigues, G.S., Ladeira, M. (2022). A Probabilistically-Oriented Analysis of the Performance of ASR Systems for Brazilian Radios and TVs. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol 13654. Springer, Cham. https://doi.org/10.1007/978-3-031-21689-3_13
DOI: https://doi.org/10.1007/978-3-031-21689-3_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21688-6
Online ISBN: 978-3-031-21689-3
eBook Packages: Computer Science, Computer Science (R0)