DOI: 10.1145/3582768.3582802
research-article

CWITR: A Corpus for Automatic Complex Word Identification in Turkish Texts

Published: 27 June 2023

Abstract

The Complex Word Identification (CWI) task aims to reduce accessibility barriers for people with cognitive, language, and learning disabilities. The task is concerned with detecting words that are unusual and difficult for certain target groups to understand. CWI systems have a large impact on the output of Text Simplification (TS) systems. This paper revisits the CWI task and extends the available datasets with a new CWI corpus. We collect a new dataset (CWITR) of complex single- and multi-token words, drawn from different text genres in Turkish, and prepare it for investigating computational methods that discriminate between complex and non-complex word forms.

          Published In

          NLPIR '22: Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval
          December 2022
          241 pages
          ISBN: 9781450397629
          DOI: 10.1145/3582768
          Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. Complex word identification
          2. Crowdsourcing
          3. Lexical complexity
          4. Text simplification

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

          NLPIR 2022
