Mandarin–English code-switching speech corpus in South-East Asia: SEAME

Lyu, Dau-Cheng; Tan, Tien-Ping; Chng, Eng-Siong; Li, Haizhou

doi:10.1007/s10579-015-9303-x

Original Paper
Published: 20 May 2015

Language Resources and Evaluation
Volume 49, pages 581–600 (2015)

Dau-Cheng Lyu¹, Tien-Ping Tan⁴, Eng-Siong Chng¹,² & Haizhou Li¹,²,³,⁵

Abstract

This paper introduces the South East Asia Mandarin–English (SEAME) corpus, a 63-h transcribed corpus of spontaneous Mandarin–English code-switching speech suitable for LVCSR and language change detection/identification research. The corpus was recorded in unscripted interview and conversational settings from 157 Singaporean and Malaysian speakers who mixed Mandarin and English within a single sentence. About 82% of the transcribed utterances are intra-sentential code-switching speech, and the corpus will be released by LDC in 2015. This paper presents an analysis of the code-switching statistics of the corpus, such as the duration of monolingual segments and the frequency of language turns in code-switch utterances. We also summarize the development effort, including details such as the processing time for transcription, validation and language boundary labelling. Lastly, we present textual analyses of code-switch segments, examining the word length of monolingual segments in code-switch utterances and the most common single-word and two-word phrases in such segments.
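As a rough illustration of the statistics named in the abstract, the sketch below counts language turns and the word length of monolingual segments for one utterance. It is not the authors' procedure. It assumes the transcript is already whitespace-tokenized into words and labels a token as Mandarin when it contains a CJK Unified Ideograph. The function names and the example utterance are hypothetical and not taken from the corpus.

```python
from itertools import groupby


def token_language(token: str) -> str:
    """Label a token 'zh' if it contains a CJK Unified Ideograph, else 'en'.

    Assumption: a simple Unicode-range test stands in for real language labels.
    """
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in token) else "en"


def monolingual_segments(tokens):
    """Group consecutive same-language tokens into (language, word_count) pairs."""
    return [
        (lang, len(list(group)))
        for lang, group in groupby(tokens, key=token_language)
    ]


def language_turns(tokens):
    """Count language switches: one fewer than the number of monolingual segments."""
    return max(len(monolingual_segments(tokens)) - 1, 0)


# Hypothetical utterance, invented for illustration (not from SEAME).
utterance = "我 觉得 this project 很 interesting".split()
segments = monolingual_segments(utterance)
turns = language_turns(utterance)
is_intra_sentential = turns > 0  # at least one switch inside the utterance
```

Averaging the segment word counts and the turn counts over all utterances would give corpus-level figures of the kind the abstract describes. Segment duration would also need time-aligned language boundary labels, which this sketch does not model.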

Author information

Authors and Affiliations

Temasek Laboratories, Nanyang Technological University, Singapore, 639798, Singapore
Dau-Cheng Lyu, Eng-Siong Chng & Haizhou Li
School of Computer Engineering, Nanyang Technological University, Singapore, 639798, Singapore
Eng-Siong Chng & Haizhou Li
Institute for Infocomm Research, 1 Fusionopolis Way, Singapore, 138632, Singapore
Haizhou Li
School of Computer Sciences, Universiti Sains Malaysia, 11800, USM, Penang, Malaysia
Tien-Ping Tan
The University of New South Wales, Sydney, NSW, 2052, Australia
Haizhou Li

Corresponding author

Correspondence to Dau-Cheng Lyu.

About this article

Cite this article

Lyu, DC., Tan, TP., Chng, ES. et al. Mandarin–English code-switching speech corpus in South-East Asia: SEAME. Lang Resources & Evaluation 49, 581–600 (2015). https://doi.org/10.1007/s10579-015-9303-x

Published: 20 May 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s10579-015-9303-x
