A Probabilistically-Oriented Analysis of the Performance of ASR Systems for Brazilian Radios and TVs

  • Conference paper
Intelligent Systems (BRACIS 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13654)

Abstract

With the use of neural network-based technologies, Automatic Speech Recognition (ASR) systems for Brazilian Portuguese (BP) have shown great progress in the last few years. Several state-of-the-art results were achieved by open-source end-to-end models, such as the Kaldi toolkit and Wav2vec 2.0. Alternative commercial tools are also available, including the Google and Microsoft speech-to-text APIs and the Audimus System of VoiceInteraction. We analyse the relative performance of these tools, measured by Word Error Rate (WER), when transcribing audio recordings from Brazilian radio and TV channels. A generalized linear model (GLM) is designed to stochastically describe the relationship between some of the audio's properties (e.g. file format and audio duration) and the resulting WER, for each method under consideration. Among other uses, this strategy enables the analysis of local performance, indicating not only which tool performs better, but also under which conditions it is expected to do so. This, in turn, could be used to design an optimized system composed of several transcribers. The data generated for this experiment and the scripts used to produce the stochastic model are publicly available.
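As a reminder of what the WER metric measures, the following is a minimal sketch in plain Python, not the authors' evaluation code. It assumes the usual definition, WER = (substitutions + deletions + insertions) / number of reference words, computed with a word-level edit distance. The function name `word_error_rate` and the example sentences are illustrative assumptions, and any text normalization the paper applies before scoring is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); it splits on whitespace only
    and performs no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of five reference words is missing from the hypothesis.
    print(word_error_rate("o tempo hoje está bom", "o tempo está bom"))
```

Because insertions are counted, the WER can exceed 1 when the hypothesis is much longer than the reference, which is worth keeping in mind when choosing a response distribution for a model of WER.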

Notes

  1. https://commonvoice.mozilla.org/pt/datasets

  2. https://doi.org/10.17771/PUCRio.acad.8372

  3. http://www.voxforge.org/pt/downloads

  4. https://laps.ufpa.brfalabrasil/

  5. https://commonvoice.mozilla.org/pt/datasets

  6. https://cloud.google.com/speech-to-text?hl=pt-br

  7. https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/

  8. https://www.voice-interaction.com/br/audimus-media-legendagem-automatica-em-tempo-real/

  9. https://github.com/jitsi/jiwer

  10. https://gitlab.com/fb-asr/

  11. https://voiceinteraction.ai/platforms/audimus_media.html

  12. https://github.com/diegomarq/BRTVRAD

  13. https://github.com/savoirfairelinux/num2words

  14. http://ffmpeg.org/ffmpeg-all.html#loudnorm

  15. https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice

  16. https://github.com/diegomarq/docker-kaldi-coraa-pt

  17. https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese

  18. https://github.com/diegomarq/glm-asr-brtvrad

Author information

Corresponding author

Correspondence to Diego Marques de Azevedo.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

de Azevedo, D.M., Rodrigues, G.S., Ladeira, M. (2022). A Probabilistically-Oriented Analysis of the Performance of ASR Systems for Brazilian Radios and TVs. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol 13654. Springer, Cham. https://doi.org/10.1007/978-3-031-21689-3_13

  • DOI: https://doi.org/10.1007/978-3-031-21689-3_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21688-6

  • Online ISBN: 978-3-031-21689-3

  • eBook Packages: Computer Science, Computer Science (R0)
