Two Sepedi-English code-switched speech corpora

  • Original Paper
  • Published in: Language Resources and Evaluation

A Correction to this article was published on 09 September 2022

This article has been updated

Abstract

We report on the development of two reference corpora for the analysis of Sepedi-English code-switched speech in the context of automatic speech recognition. For the first corpus, possible English events were obtained from an existing corpus of transcribed Sepedi-English speech. The second corpus is based on the analysis of radio broadcasts: actual instances of code switching were transcribed and reproduced by a number of native Sepedi speakers. We describe the process to develop and verify both corpora and perform an initial analysis of the newly produced data sets. We find that, in naturally occurring speech, the frequency of code switching is unexpectedly high for this language pair, and that the continuum of code switching (from unmodified embedded words to loanwords absorbed into the matrix language) makes this a particularly challenging task for speech recognition systems.

Notes

  1. https://repo.sadilar.org/handle/20.500.12185/530

  2. https://tla.mpi.nl/tools/tla-tools/elan

  3. https://repo.sadilar.org/handle/20.500.12185/270

Acknowledgements

We would like to thank the Department of Arts and Culture of South Africa for their financial support of the data collection effort. We would also like to thank Dr Febe de Wet, who provided welcome assistance and advice, especially during the initial phase of this project.

Author information

Corresponding author

Correspondence to Thipe I. Modipa.

Ethics declarations

Conflict of interest

No conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Modipa, T.I., Davel, M.H. Two Sepedi-English code-switched speech corpora. Lang Resources & Evaluation 56, 703–727 (2022). https://doi.org/10.1007/s10579-022-09592-6

  • DOI: https://doi.org/10.1007/s10579-022-09592-6
