Improving lexical coverage of text simplification systems for Spanish
Introduction
Lexical simplification (LS) aims to transform written content into simpler variants by replacing complex and infrequent words and phrases with shorter and more common ones, or by adding appropriate definitions to explain difficult concepts (Saggion, 2017). Such adaptation of written text (either manual or automatic) is recommended for making texts more understandable to wider audiences, e.g. children (De Belder & Moens, 2010), non-native speakers (Paetzold & Specia, 2016a; Petersen & Ostendorf, 2007), people with low literacy (Aluísio & Gasperin, 2010), or people with various kinds of reading or cognitive impairments, e.g. aphasia (Carroll et al., 1999), dyslexia (Rello, Baeza-Yates, Dempere-Marco, & Saggion, 2013), autism spectrum disorders (Martos et al., 2013), or Down’s syndrome (Saggion et al., 2015). At the same time, lexically simpler sentences have been shown to lead to better performance in information extraction (Beigman Klebanov, Knight, & Marcu, 2004), semantic role labelling (Vickrey & Koller, 2008), and machine translation (Štajner & Popović, 2016). For instance, two input English sentences (“Several Israeli security delegations have visited Egypt during the past two months to decide on a new embassy location.” and “Several Israeli security delegations have visited Egypt during the past two months to choose a new embassy location.”) which differ only in the lexical choice of the verb (“decide on” vs. “choose”) can lead to significant differences in the fluency and adequacy of the resulting machine translation (Štajner & Popović, 2016).
In order to use lexical simplification for either of the two aforementioned purposes (making texts more understandable to humans, or building better-performing NLP applications), we need reliable automatic lexical simplification systems. For English, the state-of-the-art LS systems are data-driven, and range from supervised approaches for learning lexical simplifications from the English Wikipedia – Simple English Wikipedia (EW–SEW) parallel corpus1 (Kauchak, 2013) by a feature-based SVM ranker (Horn, Manduca, & Kauchak, 2014), to unsupervised data-driven approaches (Glavaš & Štajner, 2015; Paetzold & Specia, 2016b) which rely on word embeddings trained on large corpora, and, more recently, supervised lexical simplification systems with neural architectures (Nisioi, Štajner, Ponzetto, & Dinu, 2017; Zhang & Lapata, 2017). For Spanish, the state-of-the-art LS systems are also data-driven, ranging from fully unsupervised approaches which rely on freely available resources, such as an on-line dictionary and the Web as a corpus (Bott, Rello, Drndarevic, & Saggion, 2012), through fully supervised approaches which treat LS as a monolingual machine translation (MT) problem and thus rely on a parallel corpus of original and manually simplified sentences (Štajner, Calixto, & Saggion, 2015b), to the recently proposed hybrid approach (Ferrés, Saggion, & Gómez Guinovart, 2017).
In recent years, the phrase-based statistical machine translation (PBSMT) approach (Koehn, Och, & Marcu, 2003) has been extensively used for text simplification in English (Coster & Kauchak, 2011; Kauchak, 2013; Štajner, Bechara, & Saggion, 2015a; Wubben, van den Bosch, & Krahmer, 2012), Spanish (Štajner, 2014; Štajner, Calixto, & Saggion, 2015b), and Brazilian Portuguese (Specia, 2010).
The main problem of the state-of-the-art supervised approaches is that they do not have sufficient coverage. The size of existing TS corpora is very limited – approximately 167,000 sentence pairs for the EW–SEW dataset, approximately 3,500 sentence pairs for Portuguese (Specia, 2010), and fewer than 1,000 sentence pairs for Spanish (Štajner et al., 2015b). Therefore, such corpora cannot offer sufficient coverage of the terms and phrases which we actually want to simplify automatically. Being infrequent and/or technical, these rarely occur in the EW–SEW corpus, and even less often in a 1,000-sentence-pair TS corpus of news articles in Spanish, for example. The comparison of the state-of-the-art supervised LS system (Horn et al., 2014) and the state-of-the-art unsupervised LS system (Glavaš & Štajner, 2015) showed that the unsupervised system has significantly higher coverage (96.0%, as opposed to 86.3% for the supervised system), measured as the percentage of the target words (in the test set) that were changed by the system. The target words were 500 words, one in each ‘original’ sentence, that were present in the English Wikipedia and were replaced by a different word in the Simple English Wikipedia (Horn et al., 2014). It is also important to note that the test set belongs to the same domain (Wikipedia articles) as the training dataset used for the supervised LS approach, which favoured high coverage of the supervised system.
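Coverage in this sense can be computed directly from system outputs. The sketch below is our own illustration of the metric (not code from any of the cited systems): a target word counts as covered if it no longer appears in the system's output for its sentence.

```python
def coverage(target_words, simplified_sents):
    """Share of target words that the system actually changed.

    target_words[i] is the complex word in the i-th original sentence;
    simplified_sents[i] is the system output for that sentence.
    """
    changed = sum(
        1
        for word, simp in zip(target_words, simplified_sents)
        if word.lower() not in simp.lower().split()
    )
    return changed / len(target_words)
```

For example, `coverage(["utilize"], ["we use many tools"])` gives 1.0, since the target word was replaced.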
A recent detailed evaluation of MT-based text simplification systems (with non-neural and neural architectures), trained on the English Wikipedia – Simple English Wikipedia or Newsela dataset2 (Štajner, Franco-Salvador, Ponzetto, Rosso, & Stuckenschmidt, 2017; Xu, Callison-Burch, & Napoles, 2015), showed that such systems perform as little as one change per sentence on average, some of those changes being lexical, and others being sentence splittings and content reductions (Štajner, Franco-Salvador, Ponzetto, Rosso, & Stuckenschmidt, 2017; Štajner & Nisioi, 2018).
The state-of-the-art unsupervised approaches to LS, for both English and Spanish (Baeza-Yates, Rello, & Dembowski, 2015; Bott, Rello, Drndarevic, & Saggion, 2012; Glavaš & Štajner, 2015; Paetzold & Specia, 2016b), have another problem. They only perform one-to-one word substitutions, and thus can neither simplify longer lexical phrases nor perform word reorderings within lexical phrases, which can be necessary after lexical substitutions, e.g. disolventes alternativos → otros disolventes [eng. alternative solvents → other solvents3] (see Section 4.2). The LS systems based on word embeddings (Glavaš & Štajner, 2015; Paetzold & Specia, 2016b) have an additional problem: they often change the meaning of the sentence, as word embeddings do not distinguish between synonyms and antonyms (Glavaš & Štajner, 2015) and cannot accommodate different meanings of the same (ambiguous) word (Paetzold & Specia, 2016b).
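The reordering limitation can be seen with a toy one-to-one substituter (the lexicon entry below is hypothetical, chosen to match the disolventes example): even with a perfect word-level substitution, the correct reordered phrase is out of reach.

```python
# Hypothetical one-to-one substitution lexicon (complex -> simple).
LEXICON = {"alternativos": "otros"}

def substitute_one_to_one(sentence):
    # Each word is replaced independently; no reordering is possible.
    return " ".join(LEXICON.get(word, word) for word in sentence.split())

substitute_one_to_one("disolventes alternativos")
# gives "disolventes otros", not the correct "otros disolventes"
```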
Our main objective is to build a lexical simplification system which is able to simplify phrases longer than one word (longer n-grams), and to perform word reorderings when necessary, while at the same time having good lexical coverage.4 We also wish to compile and make publicly available lexical simplification resources for Spanish, as no such resources exist so far. We focus on automatic LS for Spanish, as a language with fewer TS resources than English. In order to achieve our goals, we:
- 1.
Build new TS resources by filtering and ordering synonym and paraphrase pairs from the existing resources (Section 3.1).
- 2.
Train 27 PBSMT models using nine different combinations of the baseline TS dataset and the newly built TS resources, and three different language models (Section 3.2).
- 3.
Perform a detailed manual analysis of the output of all 27 systems in terms of their coverage and correctness of changes, and compare them with the current state-of-the-art LS system for Spanish (Section 4.2).
- 4.
Perform a human evaluation of quality of the output sentences of our best models in terms of their grammaticality, meaning preservation and simplicity (Section 4.3).
Our contributions to the field of text simplification are the following:
- 1.
We built several new TS resources consisting of (original, simple) pairs of synonyms and paraphrases (Section 3.1).
- 2.
We made the first steps in addressing the problem of language models (LMs) of ‘simple’ Spanish by building three different LMs, one trained on the Spanish Wikipedia and two trained on quasi-simple sentences from the Spanish Wikipedia (Section 3.2).
- 3.
We built an LS system for Spanish which is able to perform LS beyond one-to-one word substitution, as well as some simple word reorderings, while significantly outperforming the current state-of-the-art LS systems for Spanish (both supervised and unsupervised) in terms of coverage, meaning preservation, simplicity and grammaticality (Section 4).
- 4.
We investigated how much the newly built TS resources (of synonyms and paraphrases) improve lexical coverage of PBSMT systems for LS when added to the baseline TS training dataset (Section 4.2).
- 5.
We investigated the impact of the choice of lexical resources and LMs on the grammaticality, meaning preservation, and simplicity of the automatically simplified sentences (Section 4.3).
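To give a rough intuition for why the choice of LM matters, the sketch below is a minimal, unsmoothed count-based trigram scorer (our own toy illustration, not the actual n-gram LMs built in this work): a model whose counts come from simpler sentences assigns higher scores to simpler phrasings.

```python
from collections import Counter

def train_trigram_counts(sentences):
    """Collect trigram counts from whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts

def lm_score(counts, sentence):
    """Sum of trigram counts: higher means closer to the training data."""
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    return sum(counts[tuple(tokens[i:i + 3])] for i in range(len(tokens) - 2))
```

A scorer trained on the simple sentence "el perro come" ranks that phrasing above a more complex paraphrase such as "el canino ingiere"; real systems use smoothed probabilistic n-gram LMs for the same purpose.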
Related work
In this section, we provide an overview of the existing TS resources for Spanish (Section 2.1), statistical machine translation (SMT) approaches to lexical simplification (Section 2.2), state-of-the-art LS systems for Spanish (Section 2.3), and common evaluation methods in TS (Section 2.4).
Methodology
A schema of the entire workflow is presented in Fig. 1. In order to improve the coverage and the simplicity of the output generated by the PBSMT-based LS systems, we need: (1) a parallel dataset that, unlike the previously used one (Štajner et al., 2015b), is enriched with simplification-specific parallel datasets of synonyms and paraphrases; and (2) better language models, trained on sentences simpler than the previously used Europarl sentences (Štajner et al., 2015b).
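The enrichment idea in (1) amounts to concatenating the lexical pairs with the sentence-aligned data before PBSMT training, so that phrase extraction also learns them as translation units. The sketch below is a schematic illustration of that data preparation step, not the actual scripts used in this work.

```python
def build_training_corpus(sentence_pairs, synonym_pairs, paraphrase_pairs):
    """Concatenate (original, simple) units into one parallel corpus.

    Each element is a (source, target) pair of strings; synonym and
    paraphrase pairs are treated as tiny parallel 'sentences', so the
    PBSMT phrase table can learn them as substitution units.
    """
    return list(sentence_pairs) + list(synonym_pairs) + list(paraphrase_pairs)
```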
Results and discussion
We present and discuss the results obtained by each of the evaluation procedures separately, in the next three subsections. The summary of the most important results is presented in Section 4.4.
Conclusions
Lexical simplification plays an important role in making texts more accessible to wider audiences, and as a pre-processing step which improves the performance of various NLP systems. However, the current state-of-the-art automatic lexical simplification systems have one of the following shortcomings: they either (1) do not have sufficient coverage (supervised approaches), or (2) only perform one-to-one word substitutions and thus cannot simplify longer lexical phrases (these systems can be either supervised or unsupervised).
Acknowledgements
Horacio Saggion’s work is partly supported by the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502) and by the TUNER project (TIN2015-65308-C5-5-R, MINECO/FEDER, UE).
References (53)
- Fostering digital inclusion and accessibility: The PorSimples project for simplification of Portuguese texts. Proceedings of the NAACL HLT Young Investigators Workshop on Computational Approaches to Languages of the Americas (YIWCALA), 2010.
- Tipos de textos, complejidad lingüística y facilitación lectora. Actas del Sexto Congreso de Hispanistas de Asia, 2007.
- CASSA: A context-aware synonym simplification algorithm. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2015.
- Paraphrasing with bilingual parallel corpora. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005.
- Text simplification for information-seeking applications. On the Move to Meaningful Internet Systems, 2004.
- Can Spanish be simpler? LexSiS: Lexical simplification for Spanish. Proceedings of the 24th International Conference on Computational Linguistics (COLING), 2012.
- Simplifying text for language-impaired readers. Proceedings of the 9th Conference of the European Chapter of the ACL (EACL), 1999.
- Learning to simplify sentences using Wikipedia. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), 2011.
- Text simplification for children. Proceedings of the SIGIR Workshop on Accessible Search Systems, 2010.
- Towards automatic lexical simplification in Spanish: An empirical study. Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, 2012.
- Sentence simplification as tree transduction. Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations.
- An adaptable lexical simplification architecture for major Ibero-Romance languages. Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, 2017.
- PPDB: The Paraphrase Database. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2013.
- Simplifying lexical simplification: Do we need simplified corpora? Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Vol. 2: Short Papers, 2015.
- Learning a lexical simplifier using Wikipedia. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Vol. 2: Short Papers, 2014.
- Improving text simplification language modeling using unsimplified text data. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), Vol. 1: Long Papers, 2013.
- Statistical significance tests for machine translation evaluation. Proceedings of Empirical Methods in Natural Language Processing (EMNLP).
- Europarl: A parallel corpus for statistical machine translation. Proceedings of the Machine Translation Summit.
- Moses: Open source toolkit for statistical machine translation. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL).
- Statistical phrase-based translation. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT), Vol. 1, 2003.
- FIRST Deliverable – User preferences: Updated. Technical Report D2.2.
- The fewer, the better? A contrastive study about ways to simplify. Proceedings of the COLING Workshop on Automatic Text Simplification – Methods and Applications in the Multilingual Society (ATS-MA).
- Exploring neural text simplification models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017.
- Minimum error rate training in statistical machine translation. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL).
- A systematic comparison of various statistical alignment models. Computational Linguistics.
- FreeLing 3.0: Towards wider multilinguality. Proceedings of the Language Resources and Evaluation Conference (LREC).