Assembling Chinese-Mongolian Speech Corpus via Crowdsourcing

Su, Rihai; Shi, Shumin; Zhao, Meng; Huang, Heyan

doi:10.1007/978-3-319-61833-3_58

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10386)

Included in the following conference series:

International Conference on Swarm Intelligence

Abstract

The Chinese-Mongolian Speech Corpus (CMSC) has been used in many practical applications in recent years, yet it remains a low-resource corpus because it is costly to construct. We describe a crowdsourcing method for building a bilingual speech corpus through the messaging app WeChat, in which followers can freely send voice and text messages to our Official Account Platform. Because most followers are fluent in both Chinese and Mongolian, we gathered natural speech recordings from daily life and constructed a parallel speech corpus of 20,547 utterances from 296 speakers, totalling 21.43 h of speech, during the first 25 days after the collection notification was pushed. Moreover, we present a quality control measure in the evaluation part, in which independent subscribers vote on the translations of each source sentence; this markedly improves the quality of the corpus. We show that the WeChat Official Account Platform can be used to assemble a speech corpus quickly and cheaply, with near-expert accuracy. As basic research in natural language processing (NLP), the construction of a bilingual speech corpus via crowdsourcing has reference value for similar studies.
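
The abstract does not spell out how the subscriber votes are aggregated, so the following is only a minimal sketch of one plausible scheme: a strict-majority vote over candidate translations of a source sentence. The function name `select_translation`, the thresholds `min_votes` and `min_share`, and the candidate identifiers are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def select_translation(
    votes: Sequence[Hashable],
    min_votes: int = 3,       # assumed threshold, not from the paper
    min_share: float = 0.5,   # assumed threshold, not from the paper
) -> Optional[Hashable]:
    """Return the candidate translation that wins the vote, or None.

    `votes` holds one candidate id per independent voter. A candidate
    is accepted only if enough people voted, it is the unique top
    choice, and its share of the votes is strictly above `min_share`.
    """
    if not votes or len(votes) < min_votes:
        return None  # too few votes to decide

    ranked = Counter(votes).most_common(2)
    winner, winner_count = ranked[0]

    # Reject ties between the two most popular candidates.
    if len(ranked) > 1 and ranked[1][1] == winner_count:
        return None

    # Require a strict majority share.
    if winner_count / len(votes) <= min_share:
        return None

    return winner


# Hypothetical votes for two source sentences.
votes_by_sentence = {
    "s1": ["t1", "t1", "t2", "t1"],   # t1 has a clear majority
    "s2": ["t1", "t2", "t3", "t1"],   # top share is only 0.5, rejected
}
accepted = {
    sentence: select_translation(votes)
    for sentence, votes in votes_by_sentence.items()
}
```

Sentences for which the function returns `None` could be sent back for further votes or dropped from the corpus; which of these the authors do is not stated in the abstract.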

Acknowledgments

We thank the reviewers for their constructive comments, and gratefully acknowledge the support of the Natural Science Foundation of China (61671064) and the BIT Basic Research Fund (20160742017).

Author information

Authors and Affiliations

School of Computer Sciences and Technology, Beijing Institute of Technology, Beijing, China
Rihai Su, Shumin Shi, Meng Zhao & Heyan Huang
Beijing Engineering Research Centre of High Volume Language Information Processing and Cloud Computing Applications, Beijing, China
Shumin Shi & Heyan Huang

Corresponding author

Correspondence to Shumin Shi.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Ying Tan
Kyushu University, Fukuoka, Japan
Hideyuki Takagi
Southern University of Science and Technology, Shenzhen, China
Yuhui Shi
Shenzhen University, Shenzhen, China
Ben Niu

Cite this paper

Su, R., Shi, S., Zhao, M., Huang, H. (2017). Assembling Chinese-Mongolian Speech Corpus via Crowdsourcing. In: Tan, Y., Takagi, H., Shi, Y., Niu, B. (eds) Advances in Swarm Intelligence. ICSI 2017. Lecture Notes in Computer Science, vol 10386. Springer, Cham. https://doi.org/10.1007/978-3-319-61833-3_58

DOI: https://doi.org/10.1007/978-3-319-61833-3_58
Published: 24 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61832-6
Online ISBN: 978-3-319-61833-3
eBook Packages: Computer Science, Computer Science (R0)
