Abstract
Language models have become foundational across many NLP applications, but they have not been widely applied to the study of language varieties, even for a language as widely used as English. This paper is one of the few initial efforts to bring NLP technology to the World Englishes paradigm, specifically by creating a multi-variety corpus for studying Asian Englishes. We present an overview of CCAE, the Corpus of Chinese-based Asian English, a suite of corpora comprising six Chinese-based Asian English varieties. It contains 340 million tokens in 448 thousand web documents from six regions. The ontology of the data makes the corpus a helpful resource with considerable research potential for Asian Englishes (especially for Chinese Englishes, for which no publicly accessible corpus has existed so far) and a suitable source for variety-specific language modeling and downstream tasks, thus setting the stage for NLP-based World Englishes studies. Preliminary experiments on this corpus demonstrate the practical value of CCAE. Finally, we make CCAE available at https://huggingface.co/datasets/CCAE/CCAE-Corpus.
Data Availability
CCAE has been released under the CC-BY-NC-ND 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0) on Hugging Face: https://huggingface.co/datasets/CCAE/CCAE-Corpus. The license allows reusers to copy and distribute the material in its unadapted form, for noncommercial purposes only, provided that attribution is given to the creator.
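As a rough illustration of how the released corpus might be accessed, the sketch below derives the repository id from the URL given above and passes it to the `load_dataset` function of the Hugging Face `datasets` library. The helper names `repo_id_from_url` and `load_ccae` are hypothetical. Whether the repository loads without an explicit configuration name, and which splits it exposes, are assumptions not confirmed by the paper.

```python
from urllib.parse import urlparse

# Dataset location as stated in the Data Availability section.
DATASET_URL = "https://huggingface.co/datasets/CCAE/CCAE-Corpus"


def repo_id_from_url(url: str) -> str:
    """Return the '<namespace>/<name>' repository id from a Hugging Face dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:])


def load_ccae(streaming: bool = True):
    """Load the corpus (hypothetical helper; requires the `datasets` package and network access).

    Streaming avoids downloading the whole corpus up front. It is an
    assumption that the repository loads without a configuration name.
    """
    # Imported lazily so repo_id_from_url works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(DATASET_URL), streaming=streaming)
```

Calling `load_ccae()` should return an object keyed by split name. Inspect its keys before iterating, since the split names are not documented here. Any reuse must respect the noncommercial, no-derivatives terms of the license.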
Notes
1. Note that GloWbE, ICE, and ACE were not originally NLP-oriented; for each of them, we counted disk size, documents, and tokens only for its CAE part.
2. See the following Wikipedia page for more information on this standard file format: https://en.wikipedia.org/wiki/Web_ARChive.
3. SpaCy Tokenizer: https://spacy.io/api/tokenizer.
4. Internet Archive: https://archive.org/web.
5. Selenium: https://www.selenium.dev.
6. WebDriver for Chrome: https://chromedriver.chromium.org.
7. 2captcha, a captcha solving service: https://2captcha.com.
Acknowledgement
Many thanks to Mark Davis for his useful suggestions on data collection. We also thank the Internet Archive for providing its web archive service. This work was supported in part by the National Natural Science Foundation of China under Grant 62002016 and in part by the Fundamental Research Funds for the Central Universities under Grant 06500103.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Y., Qin, M.X., Wang, L., Huang, C. (2023). CCAE: A Corpus of Chinese-Based Asian Englishes. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_48
DOI: https://doi.org/10.1007/978-3-031-44696-2_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44695-5
Online ISBN: 978-3-031-44696-2
eBook Packages: Computer Science (R0)