Abstract
It has become common, especially among urban youth, for people to use more than one language in everyday conversation, a phenomenon linguists refer to as “code-switching”. With increasing globalization and the spread of code-switching in multilingual societies, there is growing demand for Natural Language Processing (NLP) applications that can handle such mixed data. In this paper, we present our efforts in language modeling for code-switch Arabic-English. Training a language model (LM) requires large amounts of text data in the respective language. However, the main challenge in language modeling for code-switch languages is the lack of available data. We propose an approach that artificially generates code-switch Arabic-English n-grams and thereby improves the language model. This is done by expanding the relatively small available corpus and its corresponding n-grams using translation-based approaches. The final LM achieved relative improvements of 1.97% in perplexity and 16.36% in OOV rate.
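To make the idea in the abstract concrete, here is a minimal sketch of what translation-based n-gram expansion and the two reported metrics could look like. It is not the authors' implementation: the toy lexicon, the transliterated word forms, and the function names `expand_ngram`, `oov_rate`, and `relative_improvement` are all assumptions introduced for illustration.

```python
from itertools import product

# Hypothetical toy lexicon (an assumption, not taken from the paper): each word
# maps to candidate translations in the other language, transliterated here.
TOY_LEXICON = {
    "meeting": ["ijtimaa"],
    "today": ["ennaharda"],
}


def expand_ngram(ngram, lexicon):
    """Return every variant of `ngram` in which at least one word has been
    swapped for one of its translations; the original n-gram is excluded."""
    options = [[word] + lexicon.get(word, []) for word in ngram]
    return [variant for variant in product(*options) if variant != tuple(ngram)]


def oov_rate(tokens, vocab):
    """Fraction of test tokens that do not appear in the LM vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token not in vocab) / len(tokens)


def relative_improvement(baseline, new):
    """Relative reduction, in percent, of a lower-is-better metric
    such as perplexity or OOV rate."""
    return 100.0 * (baseline - new) / baseline


variants = expand_ngram(("the", "meeting", "today"), TOY_LEXICON)
```

In this sketch each generated variant is a mixed-language n-gram that could be added to the training counts. A real system would also need to decide how to weight such artificial n-grams relative to observed ones, which the abstract does not specify.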
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hamed, I., Elmahdy, M., Abdennadher, S. (2019). Expanding N-grams for Code-Switch Language Models. In: Hassanien, A., Tolba, M., Shaalan, K., Azar, A. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018. AISI 2018. Advances in Intelligent Systems and Computing, vol 845. Springer, Cham. https://doi.org/10.1007/978-3-319-99010-1_20
DOI: https://doi.org/10.1007/978-3-319-99010-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99009-5
Online ISBN: 978-3-319-99010-1
eBook Packages: Intelligent Technologies and Robotics (R0)