Abstract
We report on the development of two reference corpora for the analysis of Sepedi-English code-switched speech in the context of automatic speech recognition. For the first corpus, possible English events were obtained from an existing corpus of transcribed Sepedi-English speech. The second corpus is based on the analysis of radio broadcasts: actual instances of code switching were transcribed and reproduced by a number of native Sepedi speakers. We describe the process used to develop and verify both corpora, and perform an initial analysis of the newly produced data sets. We find that, in naturally occurring speech, the frequency of code switching is unexpectedly high for this language pair, and that the continuum of code switching (from unmodified embedded words to loanwords absorbed into the matrix language) makes this a particularly challenging task for speech recognition systems.
Change history
09 September 2022
A Correction to this paper has been published: https://doi.org/10.1007/s10579-022-09607-2
Notes
https://repo.sadilar.org/handle/20.500.12185/530
https://tla.mpi.nl/tools/tla-tools/elan
https://repo.sadilar.org/handle/20.500.12185/270
Acknowledgements
We would like to thank the Department of Arts and Culture of South Africa for their financial support of the data collection effort. We would also like to thank Dr Febe de Wet, who provided welcome assistance and advice, especially during the initial phase of this project.
Ethics declarations
Conflict of interest
No conflict of interest to declare.
About this article
Cite this article
Modipa, T.I., Davel, M.H. Two Sepedi-English code-switched speech corpora. Lang Resources & Evaluation 56, 703–727 (2022). https://doi.org/10.1007/s10579-022-09592-6
DOI: https://doi.org/10.1007/s10579-022-09592-6