Two Sepedi-English code-switched speech corpora

Modipa, Thipe I.; Davel, Marelie H.

doi:10.1007/s10579-022-09592-6

Original Paper
Published: 06 June 2022

Volume 56, pages 703–727, (2022)

Language Resources and Evaluation

A Correction to this article was published on 09 September 2022

This article has been updated

Abstract

We report on the development of two reference corpora for the analysis of Sepedi-English code-switched speech in the context of automatic speech recognition. For the first corpus, possible English events were obtained from an existing corpus of transcribed Sepedi-English speech. The second corpus is based on the analysis of radio broadcasts: actual instances of code switching were transcribed and reproduced by a number of native Sepedi speakers. We describe the process used to develop and verify both corpora and perform an initial analysis of the newly produced data sets. We find that, in naturally occurring speech, the frequency of code switching is unexpectedly high for this language pair, and that the continuum of code switching (from unmodified embedded words to loanwords absorbed into the matrix language) makes this a particularly challenging task for speech recognition systems.

Change history

09 September 2022
A Correction to this paper has been published: https://doi.org/10.1007/s10579-022-09607-2

Notes

https://repo.sadilar.org/handle/20.500.12185/530
https://tla.mpi.nl/tools/tla-tools/elan
https://repo.sadilar.org/handle/20.500.12185/270

Acknowledgements

We would like to thank the Department of Arts and Culture of South Africa for their financial support of the data collection effort. We would also like to thank Dr Febe de Wet, who provided welcome assistance and advice, especially during the initial phase of this project.

Author information

Authors and Affiliations

Department of Computer Science, University of Limpopo, University Road, Sovenga, Polokwane, South Africa
Thipe I. Modipa
Centre for Artificial Intelligence Research (CAIR), National Institute for Theoretical and Computational Sciences (NITheCS), Pretoria, South Africa
Thipe I. Modipa & Marelie H. Davel
Faculty of Engineering, North-West University, Potchefstroom, South Africa
Marelie H. Davel

Corresponding author

Correspondence to Thipe I. Modipa.

Ethics declarations

Conflict of interest

No conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Modipa, T.I., Davel, M.H. Two Sepedi-English code-switched speech corpora. Lang Resources & Evaluation 56, 703–727 (2022). https://doi.org/10.1007/s10579-022-09592-6

Accepted: 13 April 2022
Published: 06 June 2022
Issue Date: September 2022
DOI: https://doi.org/10.1007/s10579-022-09592-6
