Abstract
Sound evaluation of lexical complexity is a prerequisite for many downstream NLP tasks, such as text simplification. At present, reliable Chinese lexical complexity datasets are lacking, and most existing datasets for other languages focus only on words that cause reading difficulty. This paper constructs RCWI-Dataset, a dataset for native Chinese speakers that contains 40,613 examples labeled with three complexity categories. Each example is annotated by at least three annotators. We adopt a comparison method to annotate words that are more difficult than the average lexical complexity of their sentence, which yields more information about word complexity and improves the reliability of the dataset. We provide baseline experiments based on feature engineering; the results show the validity of RCWI-Dataset.
Acknowledgements
This work is funded by the Humanities and Social Science Youth Foundation of the Ministry of Education (19YJCZH230) and the Fundamental Research Funds for the Central Universities in BLCU (No. 17PT05).
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Que, M., Zhang, Y., Yu, D. (2021). RCWI: A Dataset for Chinese Complex Word Identification. In: Qin, B., Jin, Z., Wang, H., Pan, J., Liu, Y., An, B. (eds) Knowledge Graph and Semantic Computing: Knowledge Graph Empowers New Infrastructure Construction. CCKS 2021. Communications in Computer and Information Science, vol 1466. Springer, Singapore. https://doi.org/10.1007/978-981-16-6471-7_25
DOI: https://doi.org/10.1007/978-981-16-6471-7_25
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-6470-0
Online ISBN: 978-981-16-6471-7
eBook Packages: Computer Science, Computer Science (R0)