
RCWI: A Dataset for Chinese Complex Word Identification

  • Conference paper
Knowledge Graph and Semantic Computing: Knowledge Graph Empowers New Infrastructure Construction (CCKS 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1466)

Abstract

Reasonable evaluation of lexical complexity is a prerequisite for many downstream NLP tasks, such as text simplification. At present, reliable Chinese lexical complexity datasets are lacking, and most existing foreign-language datasets focus only on the words that cause reading difficulty. This paper constructs RCWI-Dataset, a dataset for native Chinese speakers that contains 40,613 examples in three complexity categories. Each example is annotated by at least three annotators. We adopt a comparison method to annotate words that are more difficult than the average lexical complexity of their sentences, which yields more information about word complexity and improves the reliability of the dataset. We provide baseline experiments based on feature engineering; the results show the validity of RCWI-Dataset.
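The abstract states that each example receives labels from at least three annotators and that there are three complexity categories, but it does not say how the labels are combined. The following is a minimal sketch of one common option, strict majority voting, and is not taken from the paper. The label names (`low`, `medium`, `high`) and the function `aggregate` are hypothetical and introduced only for illustration.

```python
from collections import Counter
from typing import Optional, Sequence

# Hypothetical category names; the abstract only says there are three categories.
LABELS = ("low", "medium", "high")


def aggregate(votes: Sequence[str], min_annotators: int = 3) -> Optional[str]:
    """Return the strict-majority label, or None when no label has a majority.

    Majority voting is an assumption here, not the authors' documented method.
    """
    if len(votes) < min_annotators:
        raise ValueError(f"need at least {min_annotators} annotations, got {len(votes)}")
    unknown = set(votes) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    # most_common(1) returns the single most frequent (label, count) pair.
    label, count = Counter(votes).most_common(1)[0]
    # Require more than half of the votes so that ties yield None.
    return label if count * 2 > len(votes) else None


if __name__ == "__main__":
    print(aggregate(["high", "high", "low"]))
    print(aggregate(["low", "medium", "high"]))
```

Examples on which the annotators fully disagree return `None` in this sketch; a real pipeline might instead send such cases to an additional annotator.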

Acknowledgements

This work is funded by the Humanity and Social Science Youth Foundation of the Ministry of Education (19YJCZH230) and the Fundamental Research Funds for the Central Universities in BLCU (No. 17PT05).

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Que, M., Zhang, Y., Yu, D. (2021). RCWI: A Dataset for Chinese Complex Word Identification. In: Qin, B., Jin, Z., Wang, H., Pan, J., Liu, Y., An, B. (eds) Knowledge Graph and Semantic Computing: Knowledge Graph Empowers New Infrastructure Construction. CCKS 2021. Communications in Computer and Information Science, vol 1466. Springer, Singapore. https://doi.org/10.1007/978-981-16-6471-7_25

  • DOI: https://doi.org/10.1007/978-981-16-6471-7_25

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-6470-0

  • Online ISBN: 978-981-16-6471-7

  • eBook Packages: Computer Science, Computer Science (R0)
