
CCAE: A Corpus of Chinese-Based Asian Englishes

  • Conference paper
  • First Online:
Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14303)


Abstract

Language models have become foundational across NLP applications, yet they have scarcely been applied to language-variety studies, even for a language as widely used as English. This paper represents one of the first efforts to bring NLP technology into the World Englishes paradigm, specifically by creating a multi-variety corpus for studying Asian Englishes. We present an overview of CCAE (Corpus of Chinese-based Asian English), a suite of corpora comprising six Chinese-based Asian English varieties. It is built from 340 million tokens in 448 thousand web documents collected from six regions. The ontology of the data makes the corpus a valuable resource with great research potential for Asian Englishes (especially for Chinese Englishes, for which no publicly accessible corpus has existed so far) and an ideal source for variety-specific language modeling and downstream tasks, thus setting the stage for NLP-based World Englishes studies. Preliminary experiments on the corpus demonstrate its practical value. Finally, we make CCAE available at https://huggingface.co/datasets/CCAE/CCAE-Corpus.


Data Availability

CCAE has been released under the CC-BY-NC-ND 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0) on Hugging Face: https://huggingface.co/datasets/CCAE/CCAE-Corpus. The license allows reusers to copy and distribute the material in unadapted form only, for noncommercial purposes, and with attribution to the creators.
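For programmatic access, the released corpus can be pulled from the Hugging Face Hub. The sketch below is a minimal illustration, not part of the paper: the repository id is taken from the data-availability statement, while the use of the `datasets` library (and any configuration or split names) is an assumption to verify against the dataset card.

```python
# Minimal access sketch for the released corpus. The repository id
# "CCAE/CCAE-Corpus" comes from the data-availability statement; the
# `datasets` library and any config/split names are assumptions to
# check against the dataset card.
DATASET_ID = "CCAE/CCAE-Corpus"

def load_ccae():
    # Deferred import so the sketch runs even without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(DATASET_ID)

print(DATASET_ID)
```

Calling `load_ccae()` downloads the corpus on first use and caches it locally under the Hugging Face cache directory.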

Notes

  1. Note that GloWbE, ICE, and ACE were not originally NLP-oriented; disk size, document counts, and token counts were computed only on the Chinese-based Asian English (CAE) parts of each corpus, separately.

  2. See the following Wikipedia page for more information on this standard file format: https://en.wikipedia.org/wiki/Web_ARChive.

  3. spaCy tokenizer: https://spacy.io/api/tokenizer.

  4. Internet Archive: https://archive.org/web.

  5. Selenium: https://www.selenium.dev.

  6. WebDriver for Chrome: https://chromedriver.chromium.org.

  7. 2captcha, a CAPTCHA-solving service: https://2captcha.com.
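As a rough illustration of the token counting behind note 1, the following dependency-free sketch mimics (but does not reproduce) the spaCy tokenizer referenced in note 3 with a simple regular expression; its counts will differ slightly from spaCy's, but the order of magnitude matches.

```python
import re

# Stand-in for the spaCy tokenizer the authors used (note 3):
# match runs of word characters, or single non-space punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def count_tokens(text: str) -> int:
    """Return an approximate token count for one document."""
    return len(TOKEN_RE.findall(text))

doc = "CCAE covers six Chinese-based Asian English varieties."
print(count_tokens(doc))  # -> 10
```

Summing `count_tokens` over all documents in a variety gives per-variety totals comparable in spirit to the paper's 340-million-token figure.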


Acknowledgement

Many thanks to Mark Davis for his useful suggestions on data collection. We also thank the Internet Archive for its website time-archive service. This work was supported in part by the National Natural Science Foundation of China under Grant 62002016 and in part by the Fundamental Research Funds for the Central Universities under Grant 06500103.

Author information

Correspondence to Chao Huang.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, Y., Qin, M.X., Wang, L., Huang, C. (2023). CCAE: A Corpus of Chinese-Based Asian Englishes. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_48


  • DOI: https://doi.org/10.1007/978-3-031-44696-2_48


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44695-5

  • Online ISBN: 978-3-031-44696-2

  • eBook Packages: Computer Science (R0)
