Abstract
Automatic assessment of sentence readability can help educators select example sentences suited to different learning levels as a complement to teaching materials. Although document-level and passage-level Chinese readability assessment has been studied extensively, sentence-level assessment remains little explored. We bridge this gap by providing a research framework and a large corpus of nearly 40,000 sentences annotated with ten readability levels. We design experiments to analyze the influence of 88 linguistic features on sentence complexity. The results suggest that these features significantly improve predictive performance, with a best distance-1 adjacent accuracy of 70.78%. Model comparison also confirms that the proposed feature set reduces prediction bias without adding variance. We hope that our corpus, feature sets, and experimental validation will provide educators and linguists with language resources, insights, and automatic tools for future related research.
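The distance-1 adjacent accuracy reported above can be illustrated with a short sketch. This excerpt does not give the paper's exact definition, so the code assumes the common one: a prediction counts as correct when it lies within one level of the annotated level. The function name `adjacent_accuracy` and the sample labels are hypothetical.

```python
from typing import Sequence


def adjacent_accuracy(y_true: Sequence[int], y_pred: Sequence[int], distance: int = 1) -> float:
    """Share of predictions within `distance` levels of the gold level.

    Assumed definition (not taken from the paper): distance=0 is exact
    accuracy, distance=1 also accepts the two neighbouring levels.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")
    hits = sum(abs(t - p) <= distance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Made-up labels on a ten-level scale, for illustration only.
gold = [1, 4, 7, 10, 5]
pred = [2, 4, 9, 10, 3]

exact = adjacent_accuracy(gold, pred, distance=0)     # 2 of 5 exact matches
adjacent = adjacent_accuracy(gold, pred, distance=1)  # 3 of 5 within one level
```

With ten ordered levels, the adjacent variant is more forgiving than exact accuracy because it treats an off-by-one prediction as acceptable.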
Notes
- 2. The seven universities participating in the textbook survey are Renmin University of China, Beijing Language and Culture University, Sun Yat-sen University, Jinan University, South China Normal University, Huaqiao University, and Fujian Normal University.
Acknowledgements
This work was supported by the National Social Science Fund (Grant No. 17BGL068). We thank undergraduate students Zhiwei Wu, Yuansheng Wang, Xu Zhang, Yuan Chen, Hanwu Chen, Licong Tan, and Hao Zhang for their helpful assistance and support.
Appendix
Linguistic Features
| Category | Sub-category | Feature definition |
|---|---|---|
| Shallow features | Character features | 1. Percentage of most-common characters per sentence |
| | | 2. Percentage of second-most-common characters per sentence |
| | | 3. Percentage of all common-characters per sentence |
| | | 4. Percentage of low-stroke-count characters per sentence |
| | | 5. Percentage of medium-stroke-count characters per sentence |
| | | 6. Percentage of high-stroke-count characters per sentence |
| | | 7. Average number of strokes per word per sentence |
| | | 8. Percentage of HSK1 to HSK3-characters per sentence |
| | | 9. Percentage of HSK4 to HSK5-characters per sentence |
| | | 10. Percentage of HSK6-characters per sentence |
| | | 11. Percentage of not-HSK-characters per sentence |
| | Word features | 12. Average number of characters per word per sentence |
| | | 13. Average number of characters per unique word per sentence |
| | | 14. Number of two-character words per sentence |
| | | 15. Percentage of two-character words per sentence |
| | | 16. Number of three-character words per sentence |
| | | 17. Percentage of three-character words per sentence |
| | | 18. Number of four-character words per sentence |
| | | 19. Percentage of four-character words per sentence |
| | | 20. Number of five-up-character words per sentence |
| | | 21. Percentage of five-up-character words per sentence |
| | | 22. Percentage of HSK1 to HSK3-words per sentence |
| | | 23. Percentage of HSK4 to HSK5-words per sentence |
| | | 24. Percentage of HSK6-words per sentence |
| | | 25. Percentage of Not-HSK-words per sentence |
| | Sentence features | 26. Number of multi-character words per sentence |
| | | 27. Number of words per sentence |
| | | 28. Number of characters per sentence |
| | | 29. Number of characters (including punctuations, numerical, and symbols) per sentence |
| POS features | Adjectives | 30. Percentage of adjectives per sentence |
| | | 31. Percentage of unique adjectives per sentence |
| | | 32. Number of unique adjectives per sentence |
| | | 33. Number of adjectives per sentence |
| | Functional words | 34. Percentage of functional words per sentence |
| | | 35. Percentage of unique functional words per sentence |
| | | 36. Number of unique functional words per sentence |
| | | 37. Number of functional words per sentence |
| | Verbs | 38. Percentage of verbs per sentence |
| | | 39. Number of unique verbs per sentence |
| | | 40. Percentage of unique verbs per sentence |
| | | 41. Number of verbs per sentence |
| | Nouns | 42. Percentage of nouns per sentence |
| | | 43. Number of unique nouns per sentence |
| | | 44. Percentage of unique nouns per sentence |
| | | 45. Number of nouns per sentence |
| | | 46. Percentage of All-Nouns per sentence |
| | | 47. Number of unique All-Nouns per sentence |
| | | 48. Percentage of unique All-Nouns per sentence |
| | | 49. Number of All-Nouns per sentence |
| | Content words | 50. Percentage of content words per sentence |
| | | 51. Number of unique content words per sentence |
| | | 52. Percentage of unique content words per sentence |
| | | 53. Number of content words per sentence |
| | Idioms | 54. Percentage of idioms per sentence |
| | | 55. Number of unique idioms per sentence |
| | | 56. Percentage of unique idioms per sentence |
| | | 57. Number of idioms per sentence |
| | Adverbs | 58. Percentage of adverbs per sentence |
| | | 59. Percentage of unique adverbs per sentence |
| | | 60. Number of unique adverbs per sentence |
| | | 61. Number of adverbs per sentence |
| Syntactic features | Phrases | 62. Total number of noun phrases per sentence |
| | | 63. Total number of verbal phrases per sentence |
| | | 64. Total number of prepositional phrases per sentence |
| | | 65. Average length of noun phrases per sentence |
| | | 66. Average length of verbal phrases per sentence |
| | | 67. Average length of prepositional phrases per sentence |
| | Clauses | 68. Number of punctuation-clauses per sentence |
| | | 69. Average dependency distance per sentence |
| | | 70. Maximum dependency distance per sentence |
| | Sentences | 71. Height of parse tree per sentence |
| | | 72. Total number of dependency distances per sentence |
| | | 73. Average number of dependency distances per sentence |
| Discourse features | Entity density | 74. Total number of entities per sentence |
| | | 75. Total number of unique entities per sentence |
| | | 76. Percentage of entities per sentence |
| | | 77. Percentage of unique entities per sentence |
| | | 78. Percentage of named entities per sentence |
| | | 79. Percentage of named entities against total number of entities per sentence |
| | | 80. Percentage of Not-NE nouns per sentence |
| | | 81. Number of Not-NE nouns per sentence |
| | | 82. Number of Not-Entity nouns per sentence |
| | Cohesion | 83. Percentage of conjunctions per sentence |
| | | 84. Number of unique conjunctions per sentence |
| | | 85. Percentage of unique conjunctions per sentence |
| | | 86. Percentage of pronouns per sentence |
| | | 87. Number of unique pronouns per sentence |
| | | 88. Percentage of unique pronouns per sentence |
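A few of the listed features can be computed with a minimal sketch such as the one below. It is not the authors' implementation. It assumes the sentence has already been segmented into words with punctuation removed, that percentages are expressed as fractions between 0 and 1, and that a dependency parse is supplied as 1-based head positions with 0 marking the root (the CoNLL-U convention). The function names, the sample sentence, and its hand-written parse are all hypothetical.

```python
from typing import Dict, List


def word_length_features(words: List[str]) -> Dict[str, float]:
    """Features 12, 14, 15, 27 and 28 for one pre-segmented sentence."""
    if not words:
        raise ValueError("sentence must contain at least one word")
    n_words = len(words)
    n_chars = sum(len(w) for w in words)
    n_two = sum(1 for w in words if len(w) == 2)
    return {
        "avg_chars_per_word": n_chars / n_words,  # feature 12
        "num_two_char_words": n_two,              # feature 14
        "pct_two_char_words": n_two / n_words,    # feature 15 (as a fraction)
        "num_words": n_words,                     # feature 27
        "num_chars": n_chars,                     # feature 28
    }


def dependency_distance_features(heads: List[int]) -> Dict[str, float]:
    """Features 69 and 70 from 1-based head positions (0 marks the root).

    Assumption: dependency distance is the absolute difference between the
    linear positions of a word and its head, and the root is skipped.
    """
    distances = [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
    if not distances:
        return {"avg_dependency_distance": 0.0, "max_dependency_distance": 0}
    return {
        "avg_dependency_distance": sum(distances) / len(distances),  # feature 69
        "max_dependency_distance": max(distances),                   # feature 70
    }


# "He read a book in the library yesterday", segmented by hand.
words = ["他", "昨天", "在", "图书馆", "看", "书"]
# Illustrative parse only: the verb at position 5 is the root.
heads = [5, 5, 5, 3, 0, 5]

shallow = word_length_features(words)
syntactic = dependency_distance_features(heads)
```

In practice the segmentation and parse would come from an external Chinese NLP toolkit; the features that depend on HSK lists, stroke counts, or entity recognition need additional resources that are not sketched here.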
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Lu, D., Qiu, X., Cai, Y. (2020). Sentence-Level Readability Assessment for L2 Chinese Learning. In: Hong, JF., Zhang, Y., Liu, P. (eds) Chinese Lexical Semantics. CLSW 2019. Lecture Notes in Computer Science, vol 11831. Springer, Cham. https://doi.org/10.1007/978-3-030-38189-9_40
DOI: https://doi.org/10.1007/978-3-030-38189-9_40
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38188-2
Online ISBN: 978-3-030-38189-9
eBook Packages: Computer Science, Computer Science (R0)