Abstract
This paper introduces the South East Asia Mandarin–English (SEAME) corpus, a 63-h transcribed corpus of spontaneous Mandarin–English code-switching speech suitable for research on large-vocabulary continuous speech recognition (LVCSR) and on language change detection and identification. The corpus was recorded in unscripted interview and conversational settings from 157 Singaporean and Malaysian speakers who mixed Mandarin and English within a single sentence. About 82% of the transcribed utterances are intra-sentential code-switching speech, and the corpus will be released by LDC in 2015. This paper presents an analysis of the code-switching statistics of the corpus, such as the duration of monolingual segments and the frequency of language turns in code-switch utterances. We also summarize the development effort, including the processing time for transcription, validation and language boundary labelling. Lastly, we present textual analyses of code-switch segments, examining the word length of monolingual segments in code-switch utterances and the most common single words and two-word phrases in such segments.
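To make the two statistics named above concrete, the following minimal sketch counts monolingual segments and language turns in one whitespace-tokenized utterance. It is an illustration under stated assumptions, not the corpus's labelling procedure: the example utterance is invented, the function names are hypothetical, and each token's language is guessed from its script (any CJK ideograph means Mandarin), whereas the corpus relies on manual language boundary labelling.

```python
from itertools import groupby


def token_language(token: str) -> str:
    """Assumption: a token containing a CJK ideograph is Mandarin ('zh'), otherwise English ('en')."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in token) else "en"


def monolingual_segments(tokens):
    """Group consecutive same-language tokens into (language, [tokens]) segments."""
    return [(lang, list(group)) for lang, group in groupby(tokens, key=token_language)]


def language_turns(tokens):
    """A language turn is a boundary between two adjacent monolingual segments."""
    return max(len(monolingual_segments(tokens)) - 1, 0)


# Invented example utterance (not taken from the corpus).
utterance = "我 觉得 这个 project 的 deadline 太 紧 了".split()
segments = monolingual_segments(utterance)

for lang, words in segments:
    print(lang, len(words), " ".join(words))
print("language turns:", language_turns(utterance))
```

Under these assumptions the utterance splits into five monolingual segments (word lengths 3, 1, 1, 1, 3) with four language turns; an utterance with at least one turn would count as intra-sentential code-switching.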




Cite this article
Lyu, DC., Tan, TP., Chng, ES. et al. Mandarin–English code-switching speech corpus in South-East Asia: SEAME. Lang Resources & Evaluation 49, 581–600 (2015). https://doi.org/10.1007/s10579-015-9303-x
DOI: https://doi.org/10.1007/s10579-015-9303-x