
A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation

Conference paper, published in Intelligent Systems (BRACIS 2024)

Abstract

We present a freely available spontaneous speech corpus for Brazilian Portuguese and report preliminary automatic speech recognition (ASR) results, using both the Wav2Vec2-XLSR-53 and Distil-Whisper models fine-tuned and trained on our corpus. The NURC-SP Audio Corpus comprises 401 different speakers (204 female, 197 male) with a total of 239.30 h of transcribed audio recordings. To the best of our knowledge, this is the first large spontaneous speech corpus with the Paulistano accent dedicated to the ASR task in Portuguese. We first present the design and development procedures of the NURC-SP Audio Corpus, and then describe four ASR experiments in detail. The experiments demonstrated promising results for the applicability of the corpus to ASR. Specifically, we fine-tuned two versions of the Wav2Vec2-XLSR-53 model, trained a Distil-Whisper model on our dataset with labels produced by the Whisper Large-V3 model, and fine-tuned this Distil-Whisper model on our corpus. Our best result was the Distil-Whisper model fine-tuned on the NURC-SP Audio Corpus, with a WER of 24.22%, followed by a fine-tuned version of the Wav2Vec2-XLSR-53 model with a WER of 33.73%, almost 10 percentage points worse than Distil-Whisper's. To enable experiment reproducibility, we share the NURC-SP Audio Corpus dataset, pre-trained models, and training recipes in Hugging Face and GitHub repositories.
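The abstract reports results as word error rate (WER), the standard ASR metric: the word-level edit distance (substitutions + deletions + insertions) between reference and hypothesis transcripts, divided by the number of reference words. The sketch below is a minimal, generic implementation of that definition; it is illustrative only and is not the evaluation pipeline the paper used to obtain its 24.22% and 33.73% figures.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution plus one deletion against a four-word reference. Note that WER can exceed 1.0 when the hypothesis contains many insertions.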


Notes

  1. https://www.ted.com/talks.

  2. For example, [5] comments on vowel elision between words that, in the São Paulo dialect, affects the final post-tonic vowels /a/, /o/ and /u/. For instance, in the example (a) me’ren[da es]co’lar (school lunch) → me’ren[des]co’lar, the vowel /a/ is deleted and a new syllable ([des]) is created.

  3. https://openslr.org/150/.

  4. https://groups.inf.ed.ac.uk/ami/corpus/overview.shtml.

  5. https://commonvoice.mozilla.org/en/datasets.

  6. https://www.openslr.org/51/.

  7. https://github.com/nilc-nlp/CORAA.

  8. github.com/nilc-nlp/nurc-sp-audio-corpus.

  9. https://librivox.org/pages/about-librivox/.

  10. https://www.openslr.org/12.

  11. https://www.openslr.org/60/.

  12. https://arxiv.org/abs/2209.11871.

  13. https://sites.google.com/view/tarsila-c4ai/home.

  14. https://github.com/Edresson/Wav2Vec-Wrapper.

  15. https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese.

  16. https://github.com/huggingface/distil-whisper/tree/main/training.

  17. https://huggingface.co/blog/fine-tune-whisper.

  18. Here, we also focus our analysis on the CER metric because, for shorter audio segments with just a few words, it tends to be more reliable.
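Note 18's preference for character error rate (CER) on short segments can be made concrete: CER uses the same edit-distance formula as WER but over characters, so a single garbled word in a two-word utterance inflates WER to 50% while CER stays proportional to the number of wrong characters. The sketch below uses a hypothetical two-word utterance ("bom dia") with one extra letter; it is an illustration of the metrics' behavior, not the paper's evaluation code.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(a)][len(b)]

# Hypothetical short utterance: one extra letter in the hypothesis.
ref, hyp = "bom dia", "bom diaa"
wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())  # 1/2 = 0.50
cer = edit_distance(list(ref), list(hyp)) / len(list(ref))        # 1/7 ≈ 0.14
```

One substituted word yields a WER of 0.50, while only one of seven reference characters is wrong, giving a CER of about 0.14, which is why CER is the steadier signal on very short audio.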

References

  1. Ardila, R., et al.: Common voice: a massively-multilingual speech corpus. In: Calzolari, N., et al. (eds.) Proceedings of the Twelfth Language Resources and Evaluation Conference, May 2020, Marseille, France, pp. 4218–4222. European Language Resources Association (2020). https://aclanthology.org/2020.lrec-1.520

  2. Baevski, A., Zhou, H., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020). https://arxiv.org/abs/2006.11477

  3. Bain, M., Huh, J., Han, T., Zisserman, A.: WhisperX: time-accurate speech transcription of long-form audio. In: INTERSPEECH 2023, pp. 4489–4493 (2023). https://doi.org/10.21437/Interspeech.2023-78

  4. Beckman, M.E.: A typology of spontaneous speech. In: Computing Prosody: Computational Models for Processing Spontaneous Speech, pp. 7–26. Springer, New York (1997). https://doi.org/10.1007/978-1-4612-2258-3_2

  5. Bohn, G.P.: Processos e representações lexicais: o caso das vogais posteriores do dialeto paulista. DELTA: Documentação e Estudos em Linguística Teórica e Aplicada 33(2), September 2017. https://revistas.pucsp.br/index.php/delta/article/view/34370

  6. Candido Junior, A., et al.: CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese. Lang. Resour. Eval. 57, 1139–1171 (2023). https://doi.org/10.1007/s10579-022-09621-4

  7. Clifton, A., et al.: 100,000 podcasts: a spoken English document corpus. In: Scott, D., Bel, N., Zong, C. (eds.) Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), December 2020, pp. 5903–5917. International Committee on Computational Linguistics (2020). https://aclanthology.org/2020.coling-main.519

  8. Conneau, A., Baevski, A., Collobert, R., Mohamed, A., Auli, M.: Unsupervised cross-lingual representation learning for speech recognition. In: Proceedings of the INTERSPEECH 2021, pp. 2426–2430 (2021). https://doi.org/10.21437/Interspeech.2021-329

  9. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451 (2020)

  10. Gabler, P., Geiger, B.C., Schuppler, B., Kern, R.: Reconsidering read and spontaneous speech: causal perspectives on the generation of training data for automatic speech recognition. Information 14(2) (2023)

  11. Gandhi, S., von Platen, P., Rush, A.M.: Distil-whisper: robust knowledge distillation via large-scale pseudo labelling (2023)

  12. Garmash, E., et al.: Cem mil podcasts: a spoken Portuguese document corpus for multi-modal, multi-lingual and multi-dialect information access research. In: Arampatzis, A., et al. (eds.) Experimental IR Meets Multilinguality. Multimodality, and Interaction: 14th International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, 18–21 September 2023, Proceedings, pp. 48–59. Springer, Heidelberg (2023). https://doi.org/10.1007/978-3-031-42448-9_5

  13. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006)

  14. Gris, L.R.S., Casanova, E., de Oliveira, F.S., da Silva Soares, A., Candido Junior, A.: Brazilian Portuguese speech recognition using Wav2vec 2.0. In: Pinheiro, V., et al. (eds.) Computational Processing of the Portuguese Language, pp. 333–343. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98305-5_31

  15. Gris, L.R.S., Marcacini, R., Junior, A.C., Casanova, E., Soares, A., Aluísio, S.M.: Evaluating OpenAI’s whisper ASR for punctuation prediction and topic modeling of life histories of the museum of the person (2023)

  16. Liu, Y., Shriberg, E., Stolcke, A., Hillard, D., Ostendorf, M., Harper, M.: Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Trans. Audio Speech Lang. Process., 1526–1540 (2006)

  17. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015)

  18. Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., Collobert, R.: MLS: a large-scale multilingual dataset for speech research. In: Proceedings of the INTERSPEECH 2020, pp. 2757–2761 (2020). https://doi.org/10.21437/Interspeech.2020-2826

  19. Radford, A., Kim, J.W., Xu, T., Brockman, G., Mcleavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, 23–29 July 2023, vol. 202, pp. 28492–28518. PMLR (2023)

  20. Rodrigues, A.C., et al.: Portal NURC-SP: design, development, and speech processing corpora resources to support the public dissemination of Portuguese spoken language. In: Gamallo, P., et al. (eds.) Proceedings of the 16th International Conference on Computational Processing of Portuguese, pp. 187–195. Association for Computational Linguistics (2024)

  21. Salesky, E., et al.: The Multilingual TEDx corpus for speech recognition and translation. In: Proceedings of the INTERSPEECH 2021, pp. 3655–3659 (2021)

  22. Éva Székely, Henter, G.E., Beskow, J., Gustafson, J.: Spontaneous conversational speech synthesis from found data. In: Proceedings of the INTERSPEECH 2019, pp. 4435–4439 (2019). https://doi.org/10.21437/Interspeech.2019-2836

  23. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  24. Zen, H., et al.: LibriTTS: a corpus derived from LibriSpeech for text-to-speech (2019)

Acknowledgements

This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and the IBM Corporation. This project was also supported by the Ministry of Science, Technology and Innovation, with resources of Law No. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published in Residence in TIC 13, DOU 01245.010222/2022-44.

Author information


Corresponding author

Correspondence to Sidney E. Leal.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lima, R., Leal, S.E., Junior, A.C., Aluísio, S.M. (2025). A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation. In: Paes, A., Verri, F.A.N. (eds.) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science, vol. 15412. Springer, Cham. https://doi.org/10.1007/978-3-031-79029-4_3

  • DOI: https://doi.org/10.1007/978-3-031-79029-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-79028-7

  • Online ISBN: 978-3-031-79029-4

  • eBook Packages: Computer Science (R0)
