One of the main challenges in building code-mixed ASR systems is the lack of annotated speech data. Often, however, monolingual speech corpora are available in abundance for the languages in the code-mixed speech. In this paper, we explore different techniques that use monolingual speech to create synthetic code-mixed speech and examine their effect on training models for code-mixed ASR. We assume access to a small amount of real code-mixed text, from which we extract probability distributions that govern the transition of phones across languages at code-switch boundaries and the span lengths corresponding to a particular language. We extract segments from monolingual data and concatenate them to form code-mixed utterances such that these probability distributions are preserved. Using this synthetic speech, we show significant improvements in Hindi-English code-mixed ASR performance compared to using synthetic speech naively constructed from complete utterances in different languages. We also present language modelling experiments that use synthetically constructed code-mixed text and discuss their benefits.
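As a rough illustration of the span-length idea described above, the following Python sketch counts per-language span lengths in language-tagged code-mixed text and then samples a sequence of (language, span length) pairs from those counts. The function names, the toy tag sequences, and the strict alternation between two languages are assumptions made for illustration and are not taken from the paper. The sketch does not model the phone-transition distributions at code-switch boundaries, nor the extraction and concatenation of audio segments.

```python
import random
from collections import Counter, defaultdict
from itertools import groupby


def span_length_counts(tagged_sentences):
    """Count how often each span length (in words) occurs per language.

    Each sentence is a list of per-word language tags, e.g. ["hi", "hi", "en"].
    A span is a maximal run of consecutive words with the same tag.
    """
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        for lang, run in groupby(tags):
            counts[lang][len(list(run))] += 1
    return counts


def sample_span_plan(counts, n_spans, start_lang, rng):
    """Sample (language, span length) pairs, alternating between two languages.

    Each span length is drawn from the empirical counts for its language.
    """
    langs = sorted(counts)
    if len(langs) != 2:
        raise ValueError("this sketch assumes exactly two languages")
    plan = []
    lang = start_lang
    for _ in range(n_spans):
        lengths = list(counts[lang].keys())
        weights = list(counts[lang].values())
        length = rng.choices(lengths, weights=weights, k=1)[0]
        plan.append((lang, length))
        # Switch to the other language for the next span.
        lang = langs[0] if lang == langs[1] else langs[1]
    return plan


# Toy language-tagged code-mixed sentences (hypothetical data).
tagged_sentences = [
    ["hi", "hi", "en", "hi"],
    ["en", "en", "en", "hi", "hi"],
]

counts = span_length_counts(tagged_sentences)
plan = sample_span_plan(counts, n_spans=4, start_lang="hi", rng=random.Random(0))
print(dict(counts))
print(plan)
```

In a fuller pipeline, each sampled pair would be filled with a segment of matching length cut from monolingual speech in that language, so that the synthetic utterances follow the span-length statistics of the real code-mixed text.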
Cite as: Taneja, K., Guha, S., Jyothi, P., Abraham, B. (2019) Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition. Proc. Interspeech 2019, 2150-2154, doi: 10.21437/Interspeech.2019-1959
@inproceedings{taneja19_interspeech,
  author={Karan Taneja and Satarupa Guha and Preethi Jyothi and Basil Abraham},
  title={{Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2150--2154},
  doi={10.21437/Interspeech.2019-1959}
}