A Review of Parallel Corpora for Automatic Text Simplification. Key Challenges Moving Forward

Martin, Tania Josephine; Abreu Salas, José Ignacio; Moreda Pozo, Paloma

doi:10.1007/978-3-031-35320-8_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13913)

Included in the following conference series:

International Conference on Applications of Natural Language to Information Systems

Abstract

This review of parallel corpora for automatic text simplification (ATS) analyses forty-nine papers in which the corpora are presented, focusing on corpora in the Indo-European languages of Western Europe. We improve on recent corpora reviews by reporting on the target audience of the ATS, the language and domain of the source text, and other metadata for each corpus, such as alignment level, annotation strategy, and the transformation applied to the simplified text. The key findings of the review are: 1) a lack of resources that address ATS in domains which are important for social inclusion, such as health and public administration; 2) a lack of resources aimed at audiences with mild cognitive impairment; 3) a scarcity of experiments in which the target audience was directly involved in the development of the corpus; 4) more than half of the proposals include no extra annotation and therefore lack detail on how the simplification was done or which linguistic phenomenon it tackled; 5) other types of annotation, such as the type and frequency of the transformations applied, could identify the most frequent simplification strategies; and 6) future strategies to advance the field of ATS could leverage automatic procedures to make the annotation process more agile and efficient.

Acknowledgements

This research was conducted as part of the CLEAR.TEXT project (TED2021-130707B-I00), funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR, and the R&D project CORTEX: Conscious Natural Text Generation (PID2021-123956OB-I00), funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”. It has also been partially funded by the Generalitat Valenciana through the project “NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation” (grant reference CIPROM/2021/21).

Author information

Authors and Affiliations

Department of English Philology, University of Alicante, 03690, Alicante, Spain
Tania Josephine Martin
University Institute for Computing Research, University of Alicante, 03690, Alicante, Spain
Tania Josephine Martin & José Ignacio Abreu Salas
Department of Computing and Information Systems, University of Alicante, 03690, Alicante, Spain
Paloma Moreda Pozo

Corresponding author

Correspondence to Tania Josephine Martin.

Editor information

Editors and Affiliations

Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais
University of Derby, Derby, UK
Farid Meziane
Oakland University, Rochester, NY, USA
Vijayan Sugumaran
University of Derby, Derby, UK
Warren Manning
University of Derby, Derby, UK
Stephan Reiff-Marganiec

A Appendix

(See Table 2).

Table 2. Corpora Availability details on accessed date

Cite this paper

Martin, T.J., Abreu Salas, J.I., Moreda Pozo, P. (2023). A Review of Parallel Corpora for Automatic Text Simplification. Key Challenges Moving Forward. In: Métais, E., Meziane, F., Sugumaran, V., Manning, W., Reiff-Marganiec, S. (eds) Natural Language Processing and Information Systems. NLDB 2023. Lecture Notes in Computer Science, vol 13913. Springer, Cham. https://doi.org/10.1007/978-3-031-35320-8_5

DOI: https://doi.org/10.1007/978-3-031-35320-8_5
Published: 14 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35319-2
Online ISBN: 978-3-031-35320-8
eBook Packages: Computer Science, Computer Science (R0)
