Abstract
This paper introduces a statistical augmentation approach for generating code-switched sentences for code-switched language modeling. The proposed technique converts monolingual sentences from a target domain into their code-switched counterparts using pretrained monolingual part-of-speech (POS) tagging models. The work also shows that adding 150 handcrafted formal-to-informal word replacements further improves the naturalness of the augmented sentences. On an English-Malay code-switching corpus, an n-gram language model interpolated with a language model trained on the augmented texts and other monolingual texts achieves a 9.7% relative reduction in perplexity, and RNN language models achieve a 5.9% reduction.
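To illustrate the augmentation idea, the following sketch replaces switchable content words in a monolingual English sentence with Malay equivalents. The POS lookup and the English-to-Malay dictionary here are toy assumptions standing in for the pretrained monolingual POS taggers and replacement lists described in the paper, not the authors' actual implementation:

```python
# Toy stand-in for a pretrained English POS tagger (assumption: the real
# system uses pretrained monolingual POS tagging models).
TOY_POS = {"i": "PRON", "want": "VERB", "to": "PART", "eat": "VERB",
           "rice": "NOUN", "now": "ADV"}

# Hypothetical English -> Malay replacement dictionary for switchable words.
EN_MS = {"eat": "makan", "rice": "nasi", "now": "sekarang"}

# POS classes allowed to switch; restricting switches to content-word
# classes is a common assumption in code-switching theory.
SWITCHABLE = {"NOUN", "VERB", "ADV"}

def augment(sentence):
    """Convert a monolingual English sentence into a code-switched version
    by substituting switchable content words with Malay equivalents."""
    out = []
    for tok in sentence.lower().split():
        pos = TOY_POS.get(tok)
        if pos in SWITCHABLE and tok in EN_MS:
            out.append(EN_MS[tok])
        else:
            out.append(tok)
    return " ".join(out)

print(augment("I want to eat rice now"))  # -> "i want to makan nasi sekarang"
```

Sentences produced this way can then be pooled with monolingual text to train an augmentation language model for interpolation with the baseline n-gram model.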
Acknowledgment
This research is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-084). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore. We would like to acknowledge the High Performance Computing Centre of Nanyang Technological University, Singapore, for providing the computing resources, facilities, and services that have contributed significantly to this work. This research is also supported by ST Engineering Mission Software & Services Pte. Ltd. under a collaboration programme (Research Collaboration No.: REQ0149132).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Prachaseree, C. et al. (2023). Adapting Code-Switching Language Models with Statistical-Based Text Augmentation. In: Nguyen, N.T., et al. Intelligent Information and Database Systems. ACIIDS 2023. Lecture Notes in Computer Science(), vol 13996. Springer, Singapore. https://doi.org/10.1007/978-981-99-5837-5_26
DOI: https://doi.org/10.1007/978-981-99-5837-5_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-5836-8
Online ISBN: 978-981-99-5837-5