Abstract
This paper introduces a statistical augmentation approach for generating code-switched sentences for code-switched language modeling. The proposed technique converts monolingual sentences from a target domain into their code-switched counterparts using pretrained monolingual part-of-speech (POS) tagging models. The work also shows that adding 150 handcrafted formal-to-informal word replacements further improves the naturalness of the augmented sentences. On an English-Malay code-switching corpus, an n-gram language model interpolated with a language model trained on the augmented texts and other monolingual texts achieves a 9.7% relative reduction in perplexity, and RNN language models achieve a 5.9% reduction.
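To illustrate the augmentation idea, the following sketch replaces switchable content words in a monolingual English sentence with Malay equivalents. The POS lookup and the English-to-Malay dictionary here are toy assumptions standing in for the pretrained monolingual POS taggers and replacement lists described in the paper, not the authors' actual implementation:

```python
# Toy stand-in for a pretrained English POS tagger (assumption: the real
# system uses pretrained monolingual POS tagging models).
TOY_POS = {"i": "PRON", "want": "VERB", "to": "PART", "eat": "VERB",
           "rice": "NOUN", "now": "ADV"}

# Hypothetical English -> Malay replacement dictionary for switchable words.
EN_MS = {"eat": "makan", "rice": "nasi", "now": "sekarang"}

# POS classes allowed to switch; restricting switches to content-word
# classes is a common assumption in code-switching theory.
SWITCHABLE = {"NOUN", "VERB", "ADV"}

def augment(sentence):
    """Convert a monolingual English sentence into a code-switched version
    by substituting switchable content words with Malay equivalents."""
    out = []
    for tok in sentence.lower().split():
        pos = TOY_POS.get(tok)
        if pos in SWITCHABLE and tok in EN_MS:
            out.append(EN_MS[tok])
        else:
            out.append(tok)
    return " ".join(out)

print(augment("I want to eat rice now"))  # -> "i want to makan nasi sekarang"
```

Sentences produced this way can then be pooled with monolingual text to train an augmentation language model for interpolation with the baseline n-gram model.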
Acknowledgment
This research is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-084). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore. We would like to acknowledge the High Performance Computing Centre of Nanyang Technological University, Singapore, for providing the computing resources, facilities, and services that have contributed significantly to this work. This research is also supported by ST Engineering Mission Software & Services Pte. Ltd. under a collaboration programme (Research Collaboration No.: REQ0149132).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Prachaseree, C. et al. (2023). Adapting Code-Switching Language Models with Statistical-Based Text Augmentation. In: Nguyen, N.T., et al. Intelligent Information and Database Systems. ACIIDS 2023. Lecture Notes in Computer Science(), vol 13996. Springer, Singapore. https://doi.org/10.1007/978-981-99-5837-5_26
DOI: https://doi.org/10.1007/978-981-99-5837-5_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-5836-8
Online ISBN: 978-981-99-5837-5