Abstract
This review of parallel corpora for automatic text simplification (ATS) analyses forty-nine papers in which the corpora are presented, focusing on corpora in the Indo-European languages of Western Europe. We improve on recent corpora reviews by reporting the target audience of the ATS, the language and domain of the source text, and other metadata for each corpus, such as alignment level, annotation strategy, and the transformation applied to the simplified text. The key findings of the review are: 1) a lack of resources that address ATS for domains important to social inclusion, such as health and public administration; 2) a lack of resources aimed at audiences with mild cognitive impairment; 3) a scarcity of experiments in which the target audience was directly involved in developing the corpus; 4) more than half the proposals include no extra annotation and therefore lack detail on how the simplification was done or which linguistic phenomenon it tackled; 5) other types of annotation, such as the type and frequency of the transformation applied, could identify the most frequent simplification strategies; and 6) future strategies to advance the field of ATS could leverage automatic procedures to make the annotation process more agile and efficient.
Acknowledgements
This research was conducted as part of the CLEAR.TEXT project (TED2021-130707B-I00), funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR, and the R&D project CORTEX: Conscious Natural Text Generation (PID2021-123956OB-I00), funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”. It has also been partially funded by the Generalitat Valenciana through the project “NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation” (grant reference CIPROM/2021/21).
A Appendix
(See Table 2).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Martin, T.J., Abreu Salas, J.I., Moreda Pozo, P. (2023). A Review of Parallel Corpora for Automatic Text Simplification. Key Challenges Moving Forward. In: Métais, E., Meziane, F., Sugumaran, V., Manning, W., Reiff-Marganiec, S. (eds) Natural Language Processing and Information Systems. NLDB 2023. Lecture Notes in Computer Science, vol 13913. Springer, Cham. https://doi.org/10.1007/978-3-031-35320-8_5
DOI: https://doi.org/10.1007/978-3-031-35320-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35319-2
Online ISBN: 978-3-031-35320-8
eBook Packages: Computer Science, Computer Science (R0)