Abstract
We present a freely available spontaneous speech corpus for Brazilian Portuguese and report preliminary automatic speech recognition (ASR) results, using both the Wav2Vec2-XLSR-53 and Distil-Whisper models fine-tuned and trained on our corpus. The NURC-SP Audio Corpus comprises 401 different speakers (204 female, 197 male) with a total of 239.30 h of transcribed audio recordings. To the best of our knowledge, this is the first large Paulistano-accented spontaneous speech corpus dedicated to the ASR task in Portuguese. We first present the design and development procedures of the NURC-SP Audio Corpus, and then describe four ASR experiments in detail. The experiments demonstrate promising results for the applicability of the corpus to ASR. Specifically, we fine-tuned two versions of the Wav2Vec2-XLSR-53 model, trained a Distil-Whisper model on our dataset with labels produced by the Whisper Large-V3 model, and then fine-tuned this Distil-Whisper model on our corpus. Our best result was the Distil-Whisper model fine-tuned on the NURC-SP Audio Corpus, with a WER of 24.22%, followed by a fine-tuned version of the Wav2Vec2-XLSR-53 model with a WER of 33.73%, almost 10 percentage points worse than Distil-Whisper's. To enable experiment reproducibility, we share the NURC-SP Audio Corpus dataset, pre-trained models, and training recipes in Hugging Face and GitHub repositories.
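The WER and CER figures reported above are edit-distance metrics: the Levenshtein distance between the hypothesis and the reference, computed over words (WER) or characters (CER), divided by the reference length. This is not the authors' evaluation code, only a minimal pure-Python sketch of both metrics (in practice a library such as jiwer is typically used); it assumes a non-empty reference.

```python
def edit_distance(ref, hyp):
    # Classic single-row dynamic-programming Levenshtein distance
    # over two token sequences (lists of words or characters).
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal cell
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (r != h),    # substitution (free on a match)
            )
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("a b c", "a x c")` is 1/3 (one substitution over three reference words), while CER on the same pair would be much lower, which is why CER is often preferred for very short utterances.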
Notes
- 2. For example, [5] comments on vowel elision between words that, in the São Paulo dialect, affects the final post-tonic vowels /a/, /o/ and /u/. For instance, in me’ren[da es]co’lar (school lunch) → me’ren[des]co’lar, the vowel /a/ is deleted and a new syllable ([des]) is created.
- 8. github.com/nilc-nlp/nurc-sp-audio-corpus.
- 18. Here, we also focus our analysis on the CER metric because, for short audio segments containing only a few words, it tends to be more reliable.
References
Ardila, R., et al.: Common voice: a massively-multilingual speech corpus. In: Calzolari, N., et al. (eds.) Proceedings of the Twelfth Language Resources and Evaluation Conference, May 2020, Marseille, France, pp. 4218–4222. European Language Resources Association (2020). https://aclanthology.org/2020.lrec-1.520
Baevski, A., Zhou, H., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020). https://arxiv.org/abs/2006.11477
Bain, M., Huh, J., Han, T., Zisserman, A.: WhisperX: time-accurate speech transcription of long-form audio. In: INTERSPEECH 2023, pp. 4489–4493 (2023). https://doi.org/10.21437/Interspeech.2023-78
Beckman, M.E.: A typology of spontaneous speech. In: Computing Prosody: Computational Models for Processing Spontaneous Speech, pp. 7–26. Springer, New York (1997). https://doi.org/10.1007/978-1-4612-2258-3_2
Bohn, G.P.: Processos e representações lexicais: o caso das vogais posteriores do dialeto paulista. DELTA: Documentação e Estudos em Linguística Teórica e Aplicada 33(2), September 2017. https://revistas.pucsp.br/index.php/delta/article/view/34370
Candido Junior, A., et al.: CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese. Lang. Resour. Eval. 57, 1139–1171 (2023). https://doi.org/10.1007/s10579-022-09621-4
Clifton, A., et al.: 100,000 podcasts: a spoken English document corpus. In: Scott, D., Bel, N., Zong, C. (eds.) Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), December 2020, pp. 5903–5917. International Committee on Computational Linguistics (2020). https://aclanthology.org/2020.coling-main.519
Conneau, A., Baevski, A., Collobert, R., Mohamed, A., Auli, M.: Unsupervised cross-lingual representation learning for speech recognition. In: Proceedings of the INTERSPEECH 2021, pp. 2426–2430 (2021). https://doi.org/10.21437/Interspeech.2021-329
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451 (2020)
Gabler, P., Geiger, B.C., Schuppler, B., Kern, R.: Reconsidering read and spontaneous speech: causal perspectives on the generation of training data for automatic speech recognition. Information 14(2) (2023)
Gandhi, S., von Platen, P., Rush, A.M.: Distil-Whisper: robust knowledge distillation via large-scale pseudo labelling (2023)
Garmash, E., et al.: Cem mil podcasts: a spoken Portuguese document corpus for multi-modal, multi-lingual and multi-dialect information access research. In: Arampatzis, A., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, 18–21 September 2023, Proceedings, pp. 48–59. Springer, Heidelberg (2023). https://doi.org/10.1007/978-3-031-42448-9_5
Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006)
Gris, L.R.S., Casanova, E., de Oliveira, F.S., da Silva Soares, A., Candido Junior, A.: Brazilian Portuguese speech recognition using Wav2vec 2.0. In: Pinheiro, V., et al. (eds.) Computational Processing of the Portuguese Language, pp. 333–343. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98305-5_31
Gris, L.R.S., Marcacini, R., Junior, A.C., Casanova, E., Soares, A., Aluísio, S.M.: Evaluating OpenAI’s whisper ASR for punctuation prediction and topic modeling of life histories of the museum of the person (2023)
Liu, Y., Shriberg, E., Stolcke, A., Hillard, D., Ostendorf, M., Harper, M.: Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Trans. Audio Speech Lang. Process., 1526–1540 (2006)
Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015)
Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., Collobert, R.: MLS: a large-scale multilingual dataset for speech research. In: Proceedings of the INTERSPEECH 2020, pp. 2757–2761 (2020). https://doi.org/10.21437/Interspeech.2020-2826
Radford, A., Kim, J.W., Xu, T., Brockman, G., Mcleavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, 23–29 July 2023, vol. 202, pp. 28492–28518. PMLR (2023)
Rodrigues, A.C., et al.: Portal NURC-SP: design, development, and speech processing corpora resources to support the public dissemination of Portuguese spoken language. In: Gamallo, P., et al. (eds.) Proceedings of the 16th International Conference on Computational Processing of Portuguese, pp. 187–195. Association for Computational Linguistics (2024)
Salesky, E., et al.: The Multilingual TEDx corpus for speech recognition and translation. In: Proceedings of the INTERSPEECH 2021, pp. 3655–3659 (2021)
Székely, É., Henter, G.E., Beskow, J., Gustafson, J.: Spontaneous conversational speech synthesis from found data. In: Proceedings of the INTERSPEECH 2019, pp. 4435–4439 (2019). https://doi.org/10.21437/Interspeech.2019-2836
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Zen, H., et al.: LibriTTS: a corpus derived from LibriSpeech for text-to-speech (2019)
Acknowledgements
This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and from IBM Corporation. This project was also supported by the Ministry of Science, Technology and Innovation, with resources of Law No. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published Residence in TIC 13, DOU 01245.010222/2022-44.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lima, R., Leal, S.E., Junior, A.C., Aluísio, S.M. (2025). A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation. In: Paes, A., Verri, F.A.N. (eds) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science(), vol 15412. Springer, Cham. https://doi.org/10.1007/978-3-031-79029-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-79028-7
Online ISBN: 978-3-031-79029-4
eBook Packages: Computer Science (R0)