Code-switched automatic speech recognition in five South African languages

https://doi.org/10.1016/j.csl.2021.101262

Highlights

  • Addressed different aspects of ASR for South African code-switched speech.

  • Four different code-switched language pairs were studied.

  • Bilingual and five-lingual code-switched ASR was implemented.

  • Explored several ways of addressing severe data sparsity.

  • Analysed the relative benefits of using in-domain and out-of-domain speech.

Abstract

Most automatic speech recognition (ASR) systems are optimised for one specific language, and their performance consequently deteriorates drastically when confronted with multilingual or code-switched speech. We describe our efforts to improve an ASR system that can process code-switched South African speech containing English and four indigenous languages: isiZulu, isiXhosa, Sesotho and Setswana. We begin with a newly developed, language-balanced corpus of code-switched speech compiled from South African soap operas, which are rich in spontaneous code-switching. The small size of this corpus makes the scenario under-resourced, and we therefore explore several ways of addressing the resulting data sparsity. We consider augmenting the acoustic training sets with in-domain data, at the expense of making them unbalanced and dominated by English. We further explore the inclusion of monolingual out-of-domain data in the constituent languages. For language modelling, we investigate the inclusion of out-of-domain text data sources as well as synthetically generated code-switch bigrams. Our experiments consider two system architectures. The first comprises four bilingual speech recognisers, each allowing code-switching between English and one of the indigenous languages. The second is a single pentalingual speech recogniser able to process switching between all five languages. We find that each additional acoustic and text data source leads to some improvement. While in-domain data is substantially more effective, performance gains were also achieved using out-of-domain data, which is often much easier to obtain. Improvements were achieved in all five languages, even when the training set became unbalanced and heavily skewed in favour of English. Finally, we find that TDNN-F acoustic models consistently outperform TDNN–BLSTM models in our data-sparse scenario.

Introduction

South Africa is a multilingual country whose citizens are often fluent in more than one of the 11 constitutionally recognised languages. While English is widely used in the media, law and commerce, only a small fraction of the population speaks English as a first language. As a consequence, code-switching is a common phenomenon in everyday South African conversation (Myers-Scotton, 1989, Auer, 2013, Muysken et al., 2000, van Dulm, 2007).

Code-switching is defined as the alternation between two or more languages during discourse. Because this phenomenon is restricted to spontaneous conversation between multilingual speakers, code-switched speech is typically fast and accented. It is known that the language switches do not occur randomly, but are constrained by linguistic structure (Poplack, 1980, Koban, 2013). However, code-switching is also flexible and dynamic by nature and its comprehensive characterisation has remained elusive (Winkler, 2005).

Scholars distinguish between two types of code-switching. Intersentential code-switching occurs when language changes occur at sentence boundaries. Intrasentential code-switching, on the other hand, occurs when the languages alternate within the same sentence. This second type of code-switching exhibits hybrid structures between the matrix (dominant) and the embedded (inserted) languages that can be further subdivided into the following three categories (Hamers and Blanc, 1989).

  • Alternation: Two structurally independent language stretches.

  • Insertion: An embedded language element is incorporated into the structure of the matrix language.

  • Intraword: Matrix-language affixes are applied to elements of the embedded language to form words.

Intrasentential code-switching can take various forms, including phonological, morphological, lexical and syntactic changes that result in new linguistic properties. Due to its inherent structural complexity, intrasentential code-switching poses the biggest challenge to the development of language and acoustic models for automatic speech recognition (ASR).
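The distinction between these switching types can be made concrete with a toy sketch: given word-level language tags, an utterance can be coarsely classified as monolingual, intersentential or intrasentential. The tags and example words below are hypothetical illustrations, not drawn from the corpus.

```python
# Toy sketch (not the paper's method): classify the code-switching type of an
# utterance from word-level language tags. Tags and words are hypothetical.

def switch_type(sentences):
    """Each sentence is a list of (word, language_tag) pairs."""
    langs_per_sentence = [{lang for _, lang in s} for s in sentences]
    if any(len(langs) > 1 for langs in langs_per_sentence):
        # languages alternate within a single sentence
        return "intrasentential"
    if len(set().union(*langs_per_sentence)) > 1:
        # each sentence is monolingual, but the language changes between them
        return "intersentential"
    return "monolingual"

utt = [[("awuboni", "zul"), ("i-problem", "zul"), ("here", "eng")]]
print(switch_type(utt))  # intrasentential: English embedded in an isiZulu matrix
```

Note that word-level tags cannot capture the intraword category ("i-problem" above, an English stem with an isiZulu prefix, carries a single tag); detecting intraword switches requires subword analysis.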

This paper reports on various strategies that we have evaluated in developing code-switching ASR systems for five South African languages. Our investigations were conducted using a corpus that we have compiled from South African soap opera speech and that includes examples of all the code-switching phenomena described in the preceding paragraphs (van der Westhuizen and Niesler, 2018).

Despite having invested several years in the development of this corpus, it remains small and under-resourced. This has presented major challenges throughout, and a key focus has therefore been to determine how best to take advantage of additional sources of speech and text data. With this in mind, two additional speech and text data sources were included in our investigations: (1) multilingual data from the same domain as the code-switched corpus and (2) monolingual data from a different domain in each of the five considered languages. The main contributions of this paper can therefore be summarised as follows.

  • 1.

    The comprehensive development and comparative evaluation of both bilingual (Bantu–English language pairs) and pentalingual code-switched ASR systems, across four language pairs and five languages overall. The pentalingual system is able to process speech containing code-switching between any and all of the five languages in our corpus.

  • 2.

    An analysis of the relative benefits of using in-domain and out-of-domain speech data in order to enhance acoustic models of code-switched speech in both bilingual and pentalingual scenarios, also across all considered languages.

  • 3.

    An evaluation and analysis of various code-switched language modelling strategies for the four bilingual as well as the pentalingual scenario.
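One of the language modelling strategies referred to above is the augmentation of the training text with synthetically generated code-switch bigrams. As a minimal illustrative sketch, assuming hypothetical word lists and a simple cross-product pairing (the generation and weighting procedure actually used in the experiments may differ):

```python
from itertools import product

# Hypothetical, tiny word lists chosen for illustration only; in practice the
# candidate words would be drawn from the corpus vocabularies.
english_words = ["problem", "school", "money"]
isizulu_words = ["lapho", "manje"]

# Generate artificial English->isiZulu and isiZulu->English switch bigrams.
# These can be added, with small counts, to the bigram LM training data so
# that cross-language transitions unseen in the corpus receive probability mass.
cs_bigrams = [(e, z) for e, z in product(english_words, isizulu_words)]
cs_bigrams += [(z, e) for z, e in product(isizulu_words, english_words)]

print(len(cs_bigrams))  # 3*2 + 2*3 = 12 synthetic code-switch bigrams
```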

Our paper is organised as follows. Section 2 summarises related work in ASR of code-switched speech. Sections 3 and 4 introduce the speech and text corpora used in our experiments. Section 5 describes our experimental method, while Sections 6 and 7 present experimental results for the bilingual and pentalingual code-switching ASR systems, respectively. Section 8 reflects on the experimental findings and Section 9 concludes.


Related work

Over the last decade, ASR of code-switched speech has attracted increasing attention among researchers. Mandarin–English code-switching has been most extensively studied, for example in Vu et al., 2012, Li and Fung, 2012, Adel et al., 2013, Adel et al., 2015 and Lyu et al. (2015). Other authors have considered code-switching between Hindi and English (Sreeram et al., 2018, Pandey et al., 2018, Ganji et al., 2019), English and Malay (Ahmed and Tan, 2012, Singh and Tan, 2018), Russian and

The South African corpus of code-switched soap opera speech

Because code-switching is spontaneous, it does not occur in news or similar broadcast programmes that are often a source of speech data. For the same reason, code-switching is also not found in written or printed language. Furthermore, the mechanisms underlying language switching are still poorly understood, complicating the development of prompts with which to elicit natural code-switched utterances. All these factors contribute to the challenge of collecting authentic code-switched data.

Even

Other sources of data

The previous section shows that our soap opera corpus of code-switched speech remains very small, even if all the data is pooled together. We therefore considered the inclusion of speech and text from other sources. These resources are described in the following subsections.

Automatic speech recognition systems

Our experiments evaluated two approaches to the automatic speech recognition of code-switched soap opera speech. In the first, four independent bilingual systems, for the EZ, EX, ET and ES language pairs respectively, were developed. The second approach involved a single pentalingual system that permits code-switching between all five languages. Both strategies are illustrated in Fig. 1. To allow direct comparison, all systems were evaluated on the test sets presented in Table 2. The following

Results: Bilingual systems

This section reports on the results that were obtained for the four bilingual recognition systems. All results are for the test sets, but similar trends were consistently observed for the development sets.

Results: Pentalingual system

This section reports on the results that were obtained for the pentalingual ASR system. As for the bilingual systems, similar trends were observed for the development and test sets, and hence only test set results are reported.

In addition to word error rate (WER), language recognition accuracy is presented as a measure of system performance.
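Word error rate follows its standard definition: the word-level edit (Levenshtein) distance between reference and hypothesis, normalised by the number of reference words. A minimal sketch of this computation:

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over the reference word count."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat"))      # one substitution + one deletion: 2/3
```

Language recognition accuracy is scored separately from WER; its exact scoring procedure is described with the pentalingual results.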

Discussion

The ultimate aim of the research described in this paper is to improve the accuracy of automatic speech recognition for code-switched South African speech. Towards this aim we considered two system configurations and used a corpus of South African speech that contains examples of code-switching between South African English and four indigenous languages. First, we reduced this to four sub-problems by treating each language pair separately. Second, we developed a pentalingual system that is able

Conclusions

Despite the improvements we have achieved, error rates remain high. In particular, the gap in ASR performance between the well-resourced English and the four low-resourced Bantu languages remains large. Therefore, more effort is required in extending the in-domain data for these languages. In ongoing work, we are attempting to achieve this by means of automatic transcription and subsequent semi-supervised training, as well as by further text and acoustic data augmentation from related languages.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We would like to thank the Department of Arts and Culture (DAC) of the South African government for funding this research. We are grateful to e.tv and Yula Quinn at Rhythm City, as well as the SABC and Human Stark at Generations: The Legacy, for assistance with data compilation. We also gratefully acknowledge the support of the South African Centre for High Performance Computing (CHPC) for providing computational resources on their Lengau cluster for this research, and the support of Telkom

References (64)

  • Ahmed, B.H., et al. Automatic speech recognition of code switching speech using 1-best rescoring.
  • Amazouz, D., Adda-Decker, M., Lamel, L., 2017. Addressing code-switching in French/Algerian Arabic speech. In: Proc....
  • Amazouz, D., Adda-Decker, M., Lamel, L., 2018. The French-Algerian code-switching triggered audio corpus (FACST). In:...
  • Auer, P., 2013. Code-Switching in Conversation: Language, Interaction and Identity.
  • Barnard, E., Davel, M.H., van Heerden, C., de Wet, F., Badenhorst, J., 2014. The NCHLT speech corpus of the South...
  • Bengio, Y., et al., 2003. A neural probabilistic language model. J. Mach. Learn. Res.
  • Biswas, A., van der Westhuizen, E., Niesler, T.R., de Wet, F., 2018a. Improving ASR for code-switched speech in...
  • Biswas, A., de Wet, F., van der Westhuizen, E., Yılmaz, E., Niesler, T.R., 2018b. Multilingual neural network acoustic...
  • Biswas, A., Yılmaz, E., de Wet, F., van der Westhuizen, E., Niesler, T.R., 2019. Semi-supervised acoustic model...
  • Bowerman, S., et al. White South African English: phonology.
  • Brown, P.F., et al., 1992. Class-based n-gram models of natural language. Comput. Linguist.
  • Cotterell, R., Renduchintala, A., Saphra, N., Callison-Burch, C., 2014. An Algerian Arabic-French code-switched corpus....
  • Eiselen, R., Puttkammer, M.J., 2014. Developing text resources for ten South African languages. In: Proc. LREC. pp....
  • Ghoshal, A., et al. Multilingual training of deep neural networks.
  • Goldhahn, D., Eckart, T., Quasthoff, U., 2012. Building large monolingual dictionaries at the Leipzig Corpora...
  • Hamers, E., et al., 1989. Bilinguality & Bilingualism.
  • IARPA Babel project site, 2020.
  • IARPA Babel Zulu language pack IARPA-babel206b-v0.1e, 2020.
  • Ko, T., Peddinti, V., Povey, D., Khudanpur, S., 2015. Audio augmentation for speech recognition. In: Proc. Interspeech....
  • Li, Y., Fung, P., 2012. Code-switch language model with inversion constraints for mixed language speech recognition....
  • Li, Y., Fung, P., 2013. Language modeling for mixed language speech recognition using weighted phrase extraction. In:...
  • Li, K., Li, J., Ye, G., Zhao, R., Gong, Y., 2019. Towards code-switching ASR for end-to-end CTC models. In: Proc....