Adapting Code-Switching Language Models with Statistical-Based Text Augmentation

  • Conference paper
  • In: Intelligent Information and Database Systems (ACIIDS 2023)

Abstract

This paper introduces a statistical augmentation approach that generates code-switched sentences for code-switched language modeling. The proposed technique converts monolingual sentences from a particular domain into corresponding code-switched versions using pretrained monolingual part-of-speech (POS) tagging models. The work also shows that adding 150 handcrafted formal-to-informal word replacements further improves the naturalness of the augmented sentences. When tested on an English-Malay code-switching corpus, interpolating an n-gram language model with a model trained on the augmented texts and other monolingual texts yielded a relative perplexity reduction of 9.7%, and RNN language models (RNNLMs) showed a 5.9% perplexity reduction.
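Since the full text sits behind the access wall, the following Python sketch only illustrates the general idea described in the abstract: tag a monolingual sentence with a pretrained POS tagger, translate words from selected POS categories into the other language, and then apply informal word replacements. It is not the authors' implementation; the choice of switched POS categories (nouns and adjectives), the use of NLTK's tagger, the per-word use of deep-translator (the package linked under Notes), and the two sample informal replacements are all assumptions made for this example.

```python
# Illustrative sketch only: NOT the paper's implementation. The POS
# categories chosen for switching, the word-level translation step,
# and the informal replacements below are assumptions.
import nltk
from deep_translator import GoogleTranslator  # package linked under Notes

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

SWITCH_TAGS = {"NN", "NNS", "JJ"}  # assumed: switch nouns and adjectives

# In the spirit of the paper's 150 handcrafted formal-to-informal
# replacements; these two Malay entries are made-up examples.
INFORMAL = {"tidak": "tak", "hendak": "nak"}

def augment(sentence: str) -> str:
    """Turn a monolingual English sentence into a synthetic
    English-Malay code-switched sentence."""
    translator = GoogleTranslator(source="en", target="ms")
    switched = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if tag in SWITCH_TAGS:
            word = translator.translate(word) or word  # keep original on failure
        switched.append(INFORMAL.get(word.lower(), word))
    return " ".join(switched)

print(augment("I bought fresh vegetables at the market yesterday"))
# possible output: "I bought segar sayur-sayuran at the pasar yesterday"
```

Text generated this way would be used to train an additional language model, which the abstract reports interpolating with models trained on other monolingual texts to obtain the 9.7% (n-gram) and 5.9% (RNNLM) relative perplexity reductions.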


Notes

  1. https://github.com/nidhaloff/deep-translator.git


Acknowledgment

This research is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-084). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore. We would like to acknowledge the High Performance Computing Centre of Nanyang Technological University, Singapore, for providing the computing resources, facilities, and services that have contributed significantly to this work. This research is also supported by ST Engineering Mission Software & Services Pte. Ltd. under a collaboration programme (Research Collaboration No. REQ0149132).

Author information

Corresponding authors

Correspondence to Chaiyasait Prachaseree, Kshitij Gupta, or Thi Nga Ho.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Prachaseree, C., et al. (2023). Adapting Code-Switching Language Models with Statistical-Based Text Augmentation. In: Nguyen, N.T., et al. (eds.) Intelligent Information and Database Systems. ACIIDS 2023. Lecture Notes in Computer Science, vol. 13996. Springer, Singapore. https://doi.org/10.1007/978-981-99-5837-5_26


  • DOI: https://doi.org/10.1007/978-981-99-5837-5_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-5836-8

  • Online ISBN: 978-981-99-5837-5

  • eBook Packages: Computer Science, Computer Science (R0)
