
A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation

Conference paper, published in Intelligent Systems (BRACIS 2024)

Abstract

We present a freely available spontaneous speech corpus for Brazilian Portuguese and report preliminary automatic speech recognition (ASR) results, using both the Wav2Vec2-XLSR-53 and Distil-Whisper models fine-tuned and trained on our corpus. The NURC-SP Audio Corpus comprises 401 different speakers (204 female, 197 male) with a total of 239.30 h of transcribed audio recordings. To the best of our knowledge, this is the first large spontaneous speech corpus with the Paulistano accent dedicated to the ASR task in Portuguese. We first present the design and development procedures of the NURC-SP Audio Corpus, and then describe four ASR experiments in detail. The experiments demonstrated promising results for the applicability of the corpus to ASR. Specifically, we fine-tuned two versions of the Wav2Vec2-XLSR-53 model, trained a Distil-Whisper model on our dataset with labels produced by the Whisper Large-V3 model, and fine-tuned this Distil-Whisper model on our corpus. Our best result was the Distil-Whisper model fine-tuned on the NURC-SP Audio Corpus, with a WER of 24.22%, followed by a fine-tuned version of the Wav2Vec2-XLSR-53 model with a WER of 33.73%, almost 10 percentage points worse than Distil-Whisper's. To enable experiment reproducibility, we share the NURC-SP Audio Corpus dataset, pre-trained models, and training recipes in Hugging Face and GitHub repositories.
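The abstract reports results as word error rate (WER), the standard ASR metric: the word-level edit distance (substitutions + deletions + insertions) between reference and hypothesis transcripts, divided by the number of reference words. The sketch below is a minimal, generic implementation of that definition; it is illustrative only and is not the evaluation pipeline the paper used to obtain its 24.22% and 33.73% figures.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution plus one deletion against a four-word reference. Note that WER can exceed 1.0 when the hypothesis contains many insertions.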


Notes

  1. https://www.ted.com/talks.

  2. For example, [5] comments on vowel elision between words that, in the São Paulo dialect, affects the final post-tonic vowels /a/, /o/ and /u/. For instance, in the example (a) me’ren[da es]co’lar (school lunch) → me’ren[des]co’lar, the vowel /a/ is deleted and a new syllable ([des]) is created.

  3. https://openslr.org/150/.

  4. https://groups.inf.ed.ac.uk/ami/corpus/overview.shtml.

  5. https://commonvoice.mozilla.org/en/datasets.

  6. https://www.openslr.org/51/.

  7. https://github.com/nilc-nlp/CORAA.

  8. github.com/nilc-nlp/nurc-sp-audio-corpus.

  9. https://librivox.org/pages/about-librivox/.

  10. https://www.openslr.org/12.

  11. https://www.openslr.org/60/.

  12. https://arxiv.org/abs/2209.11871.

  13. https://sites.google.com/view/tarsila-c4ai/home.

  14. https://github.com/Edresson/Wav2Vec-Wrapper.

  15. https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese.

  16. https://github.com/huggingface/distil-whisper/tree/main/training.

  17. https://huggingface.co/blog/fine-tune-whisper.

  18. Here, we also focus our analysis on the CER metric because, for shorter audio segments with just a few words, it tends to be more reliable.
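Note 18's preference for character error rate (CER) on short segments can be made concrete: CER uses the same edit-distance formula as WER but over characters, so a single garbled word in a two-word utterance inflates WER to 50% while CER stays proportional to the number of wrong characters. The sketch below uses a hypothetical two-word utterance ("bom dia") with one extra letter; it is an illustration of the metrics' behavior, not the paper's evaluation code.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(a)][len(b)]

# Hypothetical short utterance: one extra letter in the hypothesis.
ref, hyp = "bom dia", "bom diaa"
wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())  # 1/2 = 0.50
cer = edit_distance(list(ref), list(hyp)) / len(list(ref))        # 1/7 ≈ 0.14
```

One substituted word yields a WER of 0.50, while only one of seven reference characters is wrong, giving a CER of about 0.14, which is why CER is the steadier signal on very short audio.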

References

  1. Ardila, R., et al.: Common voice: a massively-multilingual speech corpus. In: Calzolari, N., et al. (eds.) Proceedings of the Twelfth Language Resources and Evaluation Conference, May 2020, Marseille, France, pp. 4218–4222. European Language Resources Association (2020). https://aclanthology.org/2020.lrec-1.520

  2. Baevski, A., Zhou, H., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020). https://arxiv.org/abs/2006.11477

  3. Bain, M., Huh, J., Han, T., Zisserman, A.: WhisperX: time-accurate speech transcription of long-form audio. In: INTERSPEECH 2023, pp. 4489–4493 (2023). https://doi.org/10.21437/Interspeech.2023-78

  4. Beckman, M.E.: A typology of spontaneous speech. In: Computing Prosody: Computational Models for Processing Spontaneous Speech, pp. 7–26. Springer, New York (1997). https://doi.org/10.1007/978-1-4612-2258-3_2

  5. Bohn, G.P.: Processos e representações lexicais: o caso das vogais posteriores do dialeto paulista. DELTA: Documentação e Estudos em Linguística Teórica e Aplicada 33(2), September 2017. https://revistas.pucsp.br/index.php/delta/article/view/34370

  6. Candido Junior, A., et al.: CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese. Lang. Resour. Eval. 57, 1139–1171 (2023). https://doi.org/10.1007/s10579-022-09621-4

  7. Clifton, A., et al.: 100,000 podcasts: a spoken English document corpus. In: Scott, D., Bel, N., Zong, C. (eds.) Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), December 2020, pp. 5903–5917. International Committee on Computational Linguistics (2020). https://aclanthology.org/2020.coling-main.519

  8. Conneau, A., Baevski, A., Collobert, R., Mohamed, A., Auli, M.: Unsupervised cross-lingual representation learning for speech recognition. In: Proceedings of the INTERSPEECH 2021, pp. 2426–2430 (2021). https://doi.org/10.21437/Interspeech.2021-329

  9. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451 (2020)

  10. Gabler, P., Geiger, B.C., Schuppler, B., Kern, R.: Reconsidering read and spontaneous speech: causal perspectives on the generation of training data for automatic speech recognition. Information 14(2) (2023)

  11. Gandhi, S., von Platen, P., Rush, A.M.: Distil-whisper: robust knowledge distillation via large-scale pseudo labelling (2023)

  12. Garmash, E., et al.: Cem mil podcasts: a spoken Portuguese document corpus for multi-modal, multi-lingual and multi-dialect information access research. In: Arampatzis, A., et al. (eds.) Experimental IR Meets Multilinguality. Multimodality, and Interaction: 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, 18–21 September 2023, Proceedings, pp. 48–59. Springer, Heidelberg (2023). https://doi.org/10.1007/978-3-031-42448-9_5

  13. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006)

  14. Gris, L.R.S., Casanova, E., de Oliveira, F.S., da Silva Soares, A., Candido Junior, A.: Brazilian Portuguese speech recognition using Wav2vec 2.0. In: Pinheiro, V., et al. (eds.) Computational Processing of the Portuguese Language, pp. 333–343. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98305-5_31

  15. Gris, L.R.S., Marcacini, R., Junior, A.C., Casanova, E., Soares, A., Aluísio, S.M.: Evaluating OpenAI’s whisper ASR for punctuation prediction and topic modeling of life histories of the museum of the person (2023)

  16. Liu, Y., Shriberg, E., Stolcke, A., Hillard, D., Ostendorf, M., Harper, M.: Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Trans. Audio Speech Lang. Process., 1526–1540 (2006)

  17. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015)

  18. Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., Collobert, R.: MLS: a large-scale multilingual dataset for speech research. In: Proceedings of the INTERSPEECH 2020, pp. 2757–2761 (2020). https://doi.org/10.21437/Interspeech.2020-2826

  19. Radford, A., Kim, J.W., Xu, T., Brockman, G., Mcleavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, 23–29 July 2023, vol. 202, pp. 28492–28518. PMLR (2023)

  20. Rodrigues, A.C., et al.: Portal NURC-SP: design, development, and speech processing corpora resources to support the public dissemination of Portuguese spoken language. In: Gamallo, P., et al. (eds.) Proceedings of the 16th International Conference on Computational Processing of Portuguese, pp. 187–195. Association for Computational Linguistics (2024)

  21. Salesky, E., et al.: The Multilingual TEDx corpus for speech recognition and translation. In: Proceedings of the INTERSPEECH 2021, pp. 3655–3659 (2021)

  22. Éva Székely, Henter, G.E., Beskow, J., Gustafson, J.: Spontaneous conversational speech synthesis from found data. In: Proceedings of the INTERSPEECH 2019, pp. 4435–4439 (2019). https://doi.org/10.21437/Interspeech.2019-2836

  23. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  24. Zen, H., et al.: LibriTTS: a corpus derived from LibriSpeech for text-to-speech (2019)

Acknowledgements

This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and the IBM Corporation. This project was also supported by the Ministry of Science, Technology and Innovation, with resources of Law No. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published in Residence in TIC 13, DOU 01245.010222/2022-44.

Author information


Corresponding author

Correspondence to Sidney E. Leal.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lima, R., Leal, S.E., Junior, A.C., Aluísio, S.M. (2025). A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation. In: Paes, A., Verri, F.A.N. (eds.) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science, vol. 15412. Springer, Cham. https://doi.org/10.1007/978-3-031-79029-4_3

  • DOI: https://doi.org/10.1007/978-3-031-79029-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-79028-7

  • Online ISBN: 978-3-031-79029-4

  • eBook Packages: Computer Science (R0)
