Code-switched automatic speech recognition in five South African languages
Introduction
South Africa is a multilingual country whose citizens are often fluent in more than one of the 11 constitutionally recognised languages. While English is widely used in the media, law and commerce, only a small fraction of the population speak English as a first language. As a consequence, code-switching is a common phenomenon in everyday South African conversation (Myers-Scotton, 1989; Auer, 2013; Muysken et al., 2000; van Dulm, 2007).
Code-switching is defined as the alternation between two or more languages during discourse. Because this phenomenon is restricted to spontaneous conversation between multilingual speakers, code-switched speech is typically fast and accented. Language switches are known not to occur randomly, but to be constrained by linguistic structure (Poplack, 1980; Koban, 2013). However, code-switching is also flexible and dynamic by nature and its comprehensive characterisation has remained elusive (Winkler, 2005).
Scholars distinguish between two types of code-switching. Intersentential code-switching occurs when the language changes at a sentence boundary. Intrasentential code-switching, on the other hand, occurs when the languages alternate within the same sentence. This second type of code-switching exhibits hybrid structures between the matrix (dominant) and the embedded (inserted) languages that can be further subdivided into the following three categories (Hamers and Blanc, 1989).
- Alternation: Two structurally independent language stretches.
- Insertion: An embedded language element is incorporated into the structure of the matrix language.
- Intraword: In this case matrix language affixes are applied to elements of the embedded language to form words.
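The distinction between intersentential and intrasentential switching can be made concrete with a minimal sketch. It assumes, purely for illustration, that every word has already been labelled with a language tag; the tags `eng` and `zul`, the placeholder words and the function name `count_switches` are hypothetical and are not taken from the corpus annotation.

```python
from typing import List, Tuple

# (word, language tag); the tag names are hypothetical labels for illustration
Token = Tuple[str, str]


def count_switches(sentences: List[List[Token]]) -> Tuple[int, int]:
    """Return (intersentential, intrasentential) switch counts."""
    inter = 0
    intra = 0
    prev_lang = None  # language of the last word of the previous sentence
    for sentence in sentences:
        if not sentence:
            continue  # skip empty sentences
        # Intersentential: the language changes across a sentence boundary.
        if prev_lang is not None and sentence[0][1] != prev_lang:
            inter += 1
        # Intrasentential: adjacent words in one sentence carry different tags.
        for (_, lang_a), (_, lang_b) in zip(sentence, sentence[1:]):
            if lang_a != lang_b:
                intra += 1
        prev_lang = sentence[-1][1]
    return inter, intra


# Placeholder words w1..w5 stand in for real tokens.
example = [
    [("w1", "eng"), ("w2", "eng")],
    [("w3", "zul"), ("w4", "eng"), ("w5", "zul")],
]
print(count_switches(example))  # (1, 2)
```

Word-level tags of this kind can separate the two main types, but they cannot represent intraword switching, where the language changes inside a single word.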
Intrasentential code-switching can take various forms, including phonological, morphological, lexical and syntactic changes that result in new linguistic properties. Due to its inherent structural complexity, intrasentential code-switching poses the biggest challenge to the development of language and acoustic models for automatic speech recognition (ASR).
This paper reports on various strategies that we have evaluated in developing code-switching ASR systems for five South African languages. Our investigations were conducted using a corpus that we have compiled from South African soap opera speech and that includes examples of all the code-switching phenomena described in the preceding paragraphs (van der Westhuizen and Niesler, 2018).
Although we have invested several years in developing this corpus, it remains small and under-resourced. This has presented major challenges throughout and, as a result, a major focus has been to determine how best to take advantage of additional sources of speech and text data. With this in mind, two additional speech and text data sources were included in our investigations: (1) multilingual data from the same domain as the code-switched corpus and (2) monolingual data from a different domain in each of the five considered languages. The main contributions of this paper can therefore be summarised as follows.
1. The comprehensive development and comparative evaluation of both bilingual (Bantu–English language pairs) and pentalingual code-switched ASR systems across four language pairs and five languages overall. The pentalingual system is able to process speech containing code-switching between any and all of the five languages in our corpus.
2. An analysis of the relative benefits of using in-domain and out-of-domain speech data in order to enhance acoustic models of code-switched speech in both bilingual and pentalingual scenarios, also across all considered languages.
3. An evaluation and analysis of various code-switched language modelling strategies for the four bilingual scenarios as well as the pentalingual scenario.
Our paper is organised as follows. Section 2 summarises related work in ASR of code-switched speech. Sections 3 and 4 introduce the speech and text corpora used in our experiments. Section 5 describes our experimental method, while Sections 6 and 7 present experimental results for bilingual and pentalingual code-switching ASR systems respectively. Section 8 reflects on the experimental findings and Section 9 concludes.
Related work
Over the last decade, ASR of code-switched speech has attracted increasing attention among researchers. Mandarin–English code-switching has been most extensively studied, for example by Vu et al. (2012), Li and Fung (2012), Adel et al. (2013, 2015) and Lyu et al. (2015). Other authors have considered code-switching between Hindi and English (Sreeram et al., 2018; Pandey et al., 2018; Ganji et al., 2019), English and Malay (Ahmed and Tan, 2012; Singh and Tan, 2018), Russian and
The South African corpus of code-switched soap opera speech
Because code-switching is spontaneous, it does not occur in news or similar broadcast programmes that are often a source of speech data. For the same reason, code-switching is also not found in written or printed language. Furthermore, the mechanisms underlying language switching are still poorly understood, complicating the development of prompts with which to elicit natural code-switched utterances. All these factors contribute to the challenge of collecting authentic code-switched data.
Even
Other sources of data
The previous section shows that our soap opera corpus of code-switched speech remains very small, even when all the data are pooled. We therefore considered the inclusion of speech and text from other sources. These resources are described in the following subsections.
Automatic speech recognition systems
Our experiments evaluated two approaches to the automatic speech recognition of code-switched soap opera speech. In the first, four independent bilingual systems, for the EZ, EX, ET and ES language pairs respectively, were developed. The second approach involved a single pentalingual system that permits code-switching between all five languages. Both strategies are illustrated in Fig. 1. To allow direct comparison, all systems were evaluated on the test sets presented in Table 2. The following
Results: Bilingual systems
This section reports on the results that were obtained for the four bilingual recognition systems. All results are for the test sets, but similar trends were consistently observed for the development sets.
Results: Pentalingual system
This section reports on the results that were obtained for the pentalingual ASR system. As for the bilingual systems, similar trends were observed for the development and test sets, and hence only test set results are reported.
In addition to word error rate (WER), language recognition accuracy is presented as a measure of system performance.
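For reference, word error rate is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the recognised word sequence into the reference transcription, divided by the number of reference words. The sketch below computes it with a standard Levenshtein alignment over words. It is an illustration of the metric only and is not the scoring tool used for the reported results; the function name `word_error_rate` and the placeholder word sequences are assumptions.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(reference)


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d".split(), "a x c".split()))  # 0.5
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.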
Discussion
The ultimate aim of the research described in this paper is to improve the accuracy of automatic speech recognition for code-switched South African speech. Towards this aim we considered two system configurations and used a corpus of South African speech that contains examples of code-switching between South African English and four indigenous languages. First, we reduced this to four sub-problems by treating each language pair separately. Second, we developed a pentalingual system that is able
Conclusions
Despite the improvements we have achieved, error rates remain high. In particular, the gap in ASR performance between the well-resourced English and the four low-resourced Bantu languages remains large. Therefore, more effort is required in extending the in-domain data for these languages. In ongoing work, we are attempting to achieve this by means of automatic transcription and subsequent semi-supervised training, as well as by further text and acoustic data augmentation from related languages.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We would like to thank the Department of Arts and Culture (DAC) of the South African government for funding this research. We are grateful to e.tv and Yula Quinn at Rhythm City, as well as the SABC and Human Stark at Generations: The Legacy, for assistance with data compilation. We also gratefully acknowledge the support of the South African Centre for High Performance Computing (CHPC) for providing computational resources on their Lengau cluster for this research, and the support of Telkom
References (64)
- et al. IITG-HingCoS corpus: A Hinglish code-switching database for automatic speech recognition. Speech Commun. (2019)
- et al. Building a first language model for code-switch Arabic-English. Procedia Comput. Sci. (2017)
- et al. Capitalising on North American speech resources for the development of a South African English large vocabulary speech recognition system. Comput. Speech Lang. (2014)
- Intra-sentential and inter-sentential code-switching in Turkish-English bilinguals in New York City, USA. Procedia-Soc. Behav. Sci. (2013)
- et al. Automatic speech recognition of English-isiZulu code-switched speech from South African soap operas. Procedia Comput. Sci. (2016)
- et al. Synthesised bigrams using word embeddings for code-switched ASR of four South African language pairs. Comput. Speech Lang. (2019)
- et al. Investigating bilingual deep neural networks for automatic recognition of code-switching Frisian speech. Procedia Comput. Sci. (2016)
- et al. Semi-supervised acoustic model training for speech with code-switching. Speech Commun. (2018)
- et al. Syntactic and semantic features for code-switching factored language models. IEEE Trans. Audio Speech Lang. Process. (2015)
- et al. Recurrent neural network language modeling for code switching conversational speech