Mandarin–English code-switching speech corpus in South-East Asia: SEAME

  • Original Paper
Language Resources and Evaluation

Abstract

This paper introduces the South East Asia Mandarin–English (SEAME) corpus, a 63-hour transcribed corpus of spontaneous Mandarin–English code-switching speech suitable for large-vocabulary continuous speech recognition (LVCSR) and language change detection/identification research. The corpus was recorded in unscripted interview and conversational settings from 157 Singaporean and Malaysian speakers who mixed Mandarin and English within a single sentence. About 82% of the transcribed utterances are intra-sentential code-switching speech, and the corpus will be released by LDC in 2015. This paper presents an analysis of the code-switching statistics of the corpus, such as the duration of monolingual segments and the frequency of language turns in code-switch utterances. We also summarize the development effort, including details such as the processing time for transcription, validation and language boundary labelling. Lastly, we present textual analyses of code-switch segments, examining the word length of monolingual segments in code-switch utterances and the most common single words and two-word phrases in such segments.
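
As an illustration of the statistics mentioned above, the following minimal sketch counts monolingual segments and language turns in one tokenized utterance. It is not the authors' procedure: the example utterance is invented, the function names are hypothetical, and language is guessed from a simple rule (a token containing a CJK ideograph is treated as Mandarin), whereas the corpus itself relies on manual language boundary labelling.

```python
from itertools import groupby


def is_mandarin(token: str) -> bool:
    """Assumption: a token is Mandarin if it contains a CJK Unified Ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)


def monolingual_segments(tokens):
    """Group consecutive same-language tokens into (language, word_count) runs."""
    labels = ["zh" if is_mandarin(t) else "en" for t in tokens]
    return [(lang, len(list(run))) for lang, run in groupby(labels)]


def language_turns(tokens):
    """A language turn is a switch between two adjacent monolingual segments."""
    return max(len(monolingual_segments(tokens)) - 1, 0)


# Invented example utterance, pre-segmented into words by whitespace.
utterance = "我 觉得 这个 project 真的 very interesting".split()
segments = monolingual_segments(utterance)
turns = language_turns(utterance)
print(segments)
print(turns)
```

Aggregating the word counts per segment over all utterances would give a word-length distribution of monolingual segments of the kind the abstract describes; segment durations would additionally require time-aligned transcripts.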

Author information

Corresponding author

Correspondence to Dau-Cheng Lyu.

About this article

Cite this article

Lyu, DC., Tan, TP., Chng, ES. et al. Mandarin–English code-switching speech corpus in South-East Asia: SEAME. Lang Resources & Evaluation 49, 581–600 (2015). https://doi.org/10.1007/s10579-015-9303-x

  • DOI: https://doi.org/10.1007/s10579-015-9303-x
