RCWI: A Dataset for Chinese Complex Word Identification

Que, Mengxi; Zhang, Yufei; Yu, Dong

doi:10.1007/978-981-16-6471-7_25

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1466)

Included in the following conference series:

China Conference on Knowledge Graph and Semantic Computing

Abstract

Reasonable evaluation of lexical complexity is a prerequisite for many downstream NLP tasks, such as text simplification. At present, there is a lack of reliable Chinese lexical complexity datasets, and most existing foreign-language datasets focus only on the words that cause reading difficulty. This paper constructs RCWI-Dataset, a dataset for native Chinese speakers that contains 40,613 examples and three complexity categories. Each example is annotated by at least three annotators. We adopt a comparison method to annotate words that are more difficult than the average lexical complexity of their sentence, which yields more information about word complexity and improves the reliability of the dataset. We also provide baseline experiments based on feature engineering; the results show the validity of RCWI-Dataset.
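
The abstract states that each example receives at least three annotations over three complexity categories, but it does not say how those annotations are combined. As a rough illustration only, the sketch below assumes a simple majority vote; the label names (`simple`, `medium`, `complex`), the `aggregate` helper, and the toy examples are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical label names: the abstract only says there are three categories.
LABELS = ("simple", "medium", "complex")


def aggregate(votes):
    """Return the majority label among annotator votes, or None on a tie.

    Majority voting is an assumption made for illustration; the paper's
    actual aggregation procedure is not described in the abstract.
    """
    if len(votes) < 3:
        raise ValueError("expected at least three annotations per example")
    for vote in votes:
        if vote not in LABELS:
            raise ValueError(f"unknown label: {vote!r}")
    ranked = Counter(votes).most_common()
    # A tie for first place means no single majority label exists.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Toy examples with made-up annotations.
examples = {
    "word_a": ["complex", "complex", "medium"],
    "word_b": ["simple", "medium", "complex"],
}
aggregated = {word: aggregate(votes) for word, votes in examples.items()}
```

Returning `None` on a tie keeps disputed examples visible rather than silently assigning them a label; a real pipeline might instead send such examples back for another round of annotation.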

Acknowledgements

This work is funded by the Humanity and Social Science Youth Foundation of the Ministry of Education (19YJCZH230) and the Fundamental Research Funds for the Central Universities in BLCU (No. 17PT05).

Author information

Authors and Affiliations

Beijing Language and Culture University, Beijing, China
Mengxi Que, Yufei Zhang & Dong Yu

Editor information

Editors and Affiliations

Harbin Institute of Technology, Harbin, China
Bing Qin
Peking University, Beijing, China
Zhi Jin
Tongji University, Shanghai, China
Haofen Wang
University of Edinburgh, Edinburgh, UK
Jeff Pan
University of South China, Hengyang, China
Yongbin Liu
Chinese Academy of Sciences, Beijing, China
Bo An

About this paper

Cite this paper

Que, M., Zhang, Y., Yu, D. (2021). RCWI: A Dataset for Chinese Complex Word Identification. In: Qin, B., Jin, Z., Wang, H., Pan, J., Liu, Y., An, B. (eds) Knowledge Graph and Semantic Computing: Knowledge Graph Empowers New Infrastructure Construction. CCKS 2021. Communications in Computer and Information Science, vol 1466. Springer, Singapore. https://doi.org/10.1007/978-981-16-6471-7_25

DOI: https://doi.org/10.1007/978-981-16-6471-7_25
Published: 28 October 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-6470-0
Online ISBN: 978-981-16-6471-7
eBook Packages: Computer Science, Computer Science (R0)
