A Review of Parallel Corpora for Automatic Text Simplification. Key Challenges Moving Forward

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2023)

Abstract

This review of parallel corpora for automatic text simplification (ATS) analyses forty-nine papers in which the corpora are presented, focusing on corpora in the Indo-European languages of Western Europe. We improve on recent corpora reviews by reporting on the target audience of the ATS, the language and domain of the source text, and other metadata for each corpus, such as alignment level, annotation strategy, and the transformation applied to the simplified text. The key findings of the review are: 1) a lack of resources that address ATS in domains that are important for social inclusion, such as health and public administration; 2) a lack of resources aimed at audiences with mild cognitive impairment; 3) a scarcity of experiments in which the target audience was directly involved in the development of the corpus; 4) more than half of the proposals include no extra annotation, and so lack detail on how the simplification was done or on the linguistic phenomenon it tackled; 5) other types of annotation, such as the type and frequency of the transformation applied, could identify the most frequent simplification strategies; and 6) future strategies to advance the field of ATS could leverage automatic procedures to make the annotation process more agile and efficient.
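As a rough illustration of the kind of per-corpus metadata the abstract lists, the sketch below models one reviewed corpus as a record and computes two of the summary figures mentioned (the share of corpora without extra annotation, and the frequency of each transformation type). The field names, the helper functions, and the sample records are all hypothetical placeholders and are not taken from the paper or its tables.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CorpusRecord:
    """One reviewed corpus. Field names are assumptions, not the paper's schema."""
    name: str
    language: str
    domain: str
    target_audience: Optional[str]  # None when no audience is stated
    alignment_level: str            # e.g. "sentence" or "document"
    annotations: List[str] = field(default_factory=list)  # transformation labels, if any


def share_without_annotation(records: List[CorpusRecord]) -> float:
    """Fraction of corpora that carry no extra annotation."""
    if not records:
        return 0.0
    return sum(1 for r in records if not r.annotations) / len(records)


def transformation_frequencies(records: List[CorpusRecord]) -> Counter:
    """Count how often each transformation label appears across all corpora."""
    return Counter(label for r in records for label in r.annotations)


# Placeholder records for illustration only; these are not real corpora.
SAMPLE = [
    CorpusRecord("placeholder-A", "es", "news", None, "sentence"),
    CorpusRecord("placeholder-B", "de", "health", "language learners", "document",
                 ["lexical substitution", "sentence split"]),
    CorpusRecord("placeholder-C", "it", "public administration", None, "sentence"),
    CorpusRecord("placeholder-D", "fr", "news", "children", "sentence",
                 ["lexical substitution"]),
]

if __name__ == "__main__":
    print(share_without_annotation(SAMPLE))
    print(transformation_frequencies(SAMPLE).most_common())
```

Keeping the annotation labels as a plain list per corpus makes both summaries a single pass over the records; a real review would likely need a controlled vocabulary for the transformation labels so that counts are comparable across papers.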

Notes

  1. https://www.webofscience.com.

  2. https://www.scopus.com.

  3. https://european-union.europa.eu/principles-countries-history/languages_en.

Acknowledgements

This research was conducted as part of the CLEAR.TEXT project (TED2021-130707B-I00), funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR, and the R&D project CORTEX: Conscious Natural Text Generation (PID2021-123956OB-I00), funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”. It has also been partially funded by the Generalitat Valenciana through the project “NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation” (grant reference CIPROM/2021/21).

Author information

Corresponding author

Correspondence to Tania Josephine Martin.

A Appendix

(See Table 2).

Table 2. Corpora Availability details on accessed date

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Martin, T.J., Abreu Salas, J.I., Moreda Pozo, P. (2023). A Review of Parallel Corpora for Automatic Text Simplification. Key Challenges Moving Forward. In: Métais, E., Meziane, F., Sugumaran, V., Manning, W., Reiff-Marganiec, S. (eds) Natural Language Processing and Information Systems. NLDB 2023. Lecture Notes in Computer Science, vol 13913. Springer, Cham. https://doi.org/10.1007/978-3-031-35320-8_5

  • DOI: https://doi.org/10.1007/978-3-031-35320-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-35319-2

  • Online ISBN: 978-3-031-35320-8

  • eBook Packages: Computer Science, Computer Science (R0)
