Code-switch language modeling is challenging due to data scarcity and an expanded vocabulary that spans two languages. We present a novel computational method to generate synthetic code-switch data using the Matrix Language Frame theory to alleviate the issue of data scarcity. The proposed method makes use of augmented parallel data to supplement the real code-switch data. We use the synthetic data to pre-train the language model, and show that the pre-trained model matches the performance of vanilla models when fine-tuned with 2.5 times less real code-switch data. We also show that the perplexity of an RNN-based language model pre-trained on synthetic code-switch data and fine-tuned on real code-switch data is significantly lower than that of a model trained on real code-switch data alone, and that this perplexity reduction translates into a 1.45% absolute WER reduction in a speech recognition experiment.
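The core augmentation idea can be illustrated with a minimal sketch. This is not the paper's exact algorithm; it is a simplified illustration, under the assumption that we have a word-aligned parallel sentence pair and, loosely following the Matrix Language Frame idea, keep the matrix language as the grammatical frame while swapping selected aligned words into the embedded language. The function name, the example sentence pair, and the alignment are all hypothetical.

```python
import random

def synthesize_code_switch(matrix_tokens, embedded_tokens, alignment,
                           switch_prob=0.3, seed=0):
    """Generate one synthetic code-switched sentence.

    alignment: list of (i, j) pairs mapping matrix-language token i
    to embedded-language token j (e.g. from a word aligner).
    """
    rng = random.Random(seed)
    # Map each matrix position to its aligned embedded positions.
    align_map = {}
    for i, j in alignment:
        align_map.setdefault(i, []).append(j)
    out = []
    for i, tok in enumerate(matrix_tokens):
        # With some probability, replace an aligned word with its
        # embedded-language translation; unaligned (function) words
        # stay in the matrix language, preserving its frame.
        if i in align_map and rng.random() < switch_prob:
            out.extend(embedded_tokens[j] for j in sorted(align_map[i]))
        else:
            out.append(tok)
    return out

# Hypothetical English (matrix) / Mandarin-pinyin (embedded) pair:
en = "i want to eat noodles".split()
zh = "wo yao chi mian".split()
align = [(0, 0), (1, 1), (3, 2), (4, 3)]
print(" ".join(synthesize_code_switch(en, zh, align, switch_prob=0.5, seed=1)))
```

Running the generator many times with different seeds over a large parallel corpus yields the kind of synthetic code-switch text that can be used for language-model pre-training before fine-tuning on real code-switch data.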
Cite as: Lee, G., Yue, X., Li, H. (2019) Linguistically Motivated Parallel Data Augmentation for Code-Switch Language Modeling. Proc. Interspeech 2019, 3730-3734, doi: 10.21437/Interspeech.2019-1382
@inproceedings{lee19d_interspeech,
  author={Grandee Lee and Xianghu Yue and Haizhou Li},
  title={{Linguistically Motivated Parallel Data Augmentation for Code-Switch Language Modeling}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={3730--3734},
  doi={10.21437/Interspeech.2019-1382}
}