Leveraging event-based semantics for automated text simplification
Introduction
Many texts we encounter daily are written using complex syntactic structures and specialised or sophisticated vocabulary, and thus cannot be understood by many readers. Many initiatives have raised awareness of this issue, offering guidelines for writing in an easy-to-read manner in order to produce texts that are more accessible to everyone, including non-native speakers and people with any kind of language or intellectual impairment, e.g. “Make it Simple, European Guidelines for the Production of Easy-to-Read Information for People with Learning Disability” (Freyhoff, Hess, Kerr, Tronbacke, & Van Der Veken, 1998) or “Federal Plain Language Guidelines” (PlainLanguage, 2011). However, manual simplification of existing texts is time-consuming and requires very specific training. At the same time, it has been noticed that syntactically complex sentences, as well as infrequent words and phrases, decrease the performance of various NLP tasks, such as parsing (Chandrasekar, Doran, & Srinivas, 1996), machine translation (Chandrasekar, 1994; Štajner & Popović, 2016), information extraction (Beigman Klebanov, Knight, & Marcu, 2004; Evans, 2011), and semantic role labelling (Vickrey & Koller, 2008). For both of these reasons, the late nineties saw the emergence of a new NLP task, Automated Text Simplification (ATS), which aims to (semi-)automatically transform complex texts into simpler variants that are more understandable to wider audiences and easier to process with various NLP tools.
It has been shown that complex sentence structures (passive constructions, long sentences, appositions, etc.) and infrequent or long words can be difficult to understand for many people, e.g. non-native speakers (Petersen & Ostendorf, 2007), people with low literacy levels (Aluísio, Specia, Pardo, Maziero, & Fortes, 2008), and people with different kinds of reading or cognitive impairments, such as dyslexia (Rello, 2012), aphasia (Devlin & Unthank, 2006), autism spectrum disorders (Martos, Freire, González, Gil, & Sebastian, 2012), or Down’s syndrome (Saggion et al., 2015).
The writing style of newspaper articles is particularly challenging. They often contain long sequences of adjectives, e.g. “twenty-five-year-old blond-haired mother-of-two Jane Smith” (Carroll, Minnen, Canning, Devlin, & Tait, 1998), which can cause problems for people with aphasia (Carroll et al., 1998), autism spectrum disorders (Martos et al., 2012), and intellectual disabilities (Feng, 2009). In order to present information in a more sensational way, newspaper articles often use passive constructions, which do not follow the canonical subject–verb–object structure and thus pose difficulties to people with aphasia (Carroll et al., 1999) or autism spectrum disorders (Martos et al., 2012). For example, instead of the straightforward active-voice sentence “The council today accepted a bid to build an incinerator on local wasteland”, which follows the canonical subject–verb–object structure, it is more common to find the same information expressed in the passive: “A bid to build an incinerator on local wasteland was today accepted by the council” (Carroll et al., 1998).
At the word level, it has been noticed that infrequent words can be difficult to understand for people with aphasia (Devlin, 1999) and autism spectrum disorders (Martos et al., 2012; Norbury, 2005) and lead to longer reading times in people with dyslexia (Rello, Baeza-Yates, Dempere-Marco, & Saggion, 2013).
At the discourse level, people with autism spectrum disorders or intellectual disabilities may have problems finding the main idea and inferring information (Feng, 2009; Martos et al., 2012), resolving anaphora (Ehrlich, Remond, & Tardieu, 1999; Martos et al., 2012; Shapiro & Milkes, 2004), and understanding texts written in dialogue format (Drndarević & Saggion, 2012; Martos et al., 2012). Furthermore, long texts pose additional problems to people with intellectual disabilities, as they have difficulties processing and retaining large amounts of information (Fajardo, Vila, Ferrer, Tavares, Gómez, & Hernández, 2014; Feng, 2009) and suppressing irrelevant information (Gernsbacher & Faust, 1991). Long texts can also affect self-efficacy and reading motivation in students with intellectual disabilities (Gómez, 2011; Morgan & Moni, 2008).
Long and syntactically or semantically complex sentences are not only difficult for humans to understand; they can also pose difficulties for machine processing. Many studies have thus tried to (manually) simplify such sentences in a pre-processing step in order to improve the performance of various NLP tools. It has been noticed that simple sentences generate a smaller number of possible parse trees and have fewer constituents, which leads to faster and less ambiguous parsing (Chandrasekar et al., 1996). Such sentences are also easier for machine translation systems to process, due to their simpler sentence structure, simpler vocabulary, and lower ambiguity (Chandrasekar, 1994; Štajner & Popović, 2016). Vickrey and Koller (2008) showed that a rule-based sentence simplification system used as a pre-processing step significantly improves the results of the semantic role labelling (SRL) task. Beigman Klebanov et al. (2004) showed that the use of Easy Access Sentences (EAS) – sentences with only one tensed verb and in which pronouns are substituted with the appropriate names – leads to better performance of information retrieval systems. The use of simplified sentences also improves information extraction in medical texts (Evans, 2011).
So far, the majority of the proposed ATS systems have been built for English, ranging from the early-days rule-based systems (Carroll et al., 1998; Devlin, 1999; Siddharthan, 2006), through data-driven approaches based on the comparable English Wikipedia – Simple English Wikipedia (EW–SEW) corpus (Coster & Kauchak, 2011b) using phrase-based statistical machine translation (Coster & Kauchak, 2011a; Kauchak, 2013; Štajner, Bechara, & Saggion, 2015a; Wubben, van den Bosch, & Krahmer, 2012) or syntactic machine translation (Woodsend & Lapata, 2011a; Zhu, Bernhard, & Gurevych, 2010), to a more recent hybrid approach (Siddharthan & Angrosh, 2014) that combines supervised data-driven lexical simplification with rule-based syntactic simplification.
All rule-based syntactic simplification modules proposed so far require a significant number of handcrafted rules. For example, the system proposed by Siddharthan and Angrosh (2014) contains 26 hand-crafted rules for apposition, relative clauses, and combinations of the two; 85 rules for subordination and coordination; 11 for conversion from passive to active voice; and 14 for the standardisation of quotations. The lexical simplification modules are usually supervised and require a parallel dataset for training, which limits them to the EW–SEW corpus (approx. 160,000 sentence pairs) and thus reduces their coverage. Our systems, in contrast, do not require a large number of handcrafted rules for the syntactic simplification module (they use only 11 rules in total), and our lexical simplification module is fully unsupervised, thus not requiring any parallel or comparable text simplification datasets for training.
An additional problem with most of the existing ATS systems is that they do not perform any kind of content reduction, while at the same time they often make simplified texts longer than the original texts by performing sentence splitting and adding explanations of difficult terms. Long texts – although lexically and syntactically simpler – can again pose problems to people with intellectual disabilities, as they have problems with memory load and with suppressing irrelevant information (see Section 1.1). The analysis of manual simplifications of texts for people with intellectual disabilities, done by trained human editors familiar with the user needs and following specific guidelines, revealed that human editors often delete irrelevant information, some sentence parts, or even whole sentences (Drndarević, Štajner, Bott, Bautista, & Saggion, 2013; Petersen & Ostendorf, 2007). However, apart from three ATS systems (Angrosh, Nomoto, & Siddharthan, 2014; Narayan & Gardent, 2014; Woodsend & Lapata, 2011a) which perform some very light content reduction (occasionally deleting an adjective phrase or a sentence argument), no ATS systems have addressed this important issue. Unlike those, our system performs transformations not only on the lexical and syntactic levels but also on the discourse level, leading to significantly more content reduction within a sentence and within a text (deleting even whole sentences).
We propose an end-to-end ATS system that overcomes all of the aforementioned shortcomings and dedicates special attention to content reduction. The event-based simplification (EBS) module is based on the state-of-the-art event extraction system (Glavaš & Šnajder, 2015) and uses only 11 rules to perform sentence splitting and deletion of irrelevant sentences or sentence parts. The lexical simplification (LS) module leverages word embeddings trained on a large (standard English) corpus, thus not requiring any parallel or comparable TS corpora. Yet, it performs comparably to, or better than, the state-of-the-art supervised lexical simplification model proposed by Horn, Manduca, and Kauchak (2014).
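The core idea behind such an embedding-based, unsupervised LS module can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy vectors, frequencies, and the similarity/frequency thresholds are all assumptions made for the example, whereas a real system would use embeddings trained on a large standard-English corpus.

```python
import math

# Toy word vectors and corpus frequencies (illustrative only).
VECTORS = {
    "purchase": [0.9, 0.1, 0.3],
    "buy":      [0.88, 0.12, 0.28],
    "acquire":  [0.85, 0.2, 0.35],
    "banana":   [0.1, 0.9, 0.2],
}
FREQ = {"purchase": 5_000, "buy": 120_000, "acquire": 8_000, "banana": 30_000}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def simpler_substitute(word, min_sim=0.95):
    """Return the most frequent candidate that is semantically close to
    `word` and more frequent than it, or `word` itself if none exists."""
    if word not in VECTORS:
        return word
    candidates = [
        w for w in VECTORS
        if w != word
        and cosine(VECTORS[w], VECTORS[word]) >= min_sim
        and FREQ[w] > FREQ[word]
    ]
    return max(candidates, key=FREQ.get, default=word)
```

Under these toy data, `simpler_substitute("purchase")` returns `"buy"`: it is both close in the embedding space and more frequent, which is the unsupervised signal standing in for a parallel TS corpus.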
We also examine how the order of the simplification modules influences system performance. On the one hand, the lexical substitutions performed on the original text can influence the event detection system (if we first apply the LS module) and thus lead to a different selection of sentences to be retained as relevant. Applying the EBS module before the LS module, on the other hand, can lead to an increased number of correct or incorrect substitutions, due to the repetition of the event actors during the sentence splitting process. Therefore, we perform an in-depth manual error analysis on 475 sentences simplified by two different system configurations: LexEv (first applying the LS module and then the EBS module) and EvLex (first applying the EBS module and then the LS module).
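Why the module order matters can be shown with two toy stand-ins for the LS and EBS modules (the functions, the cue lexicon, and the example text below are all hypothetical, chosen only to make the interaction visible):

```python
EVENT_CUES = {"accepted", "bought"}  # toy event-trigger lexicon

def lexical_simplify(text):
    # Toy LS module: substitute one infrequent word with a frequent synonym.
    return text.replace("purchased", "bought")

def event_simplify(text):
    # Toy EBS module: keep only sentences containing an event cue.
    kept = [s for s in text.split(". ") if EVENT_CUES & set(s.lower().split())]
    return ". ".join(kept)

text = "The firm purchased the plant. The sky was grey"

lexev = event_simplify(lexical_simplify(text))  # LS first, then EBS
evlex = lexical_simplify(event_simplify(text))  # EBS first, then LS
```

In this contrived case the LexEv order keeps the first sentence (because the substitution "purchased" → "bought" created a cue the event detector recognises), while the EvLex order discards everything: the two pipelines are not interchangeable, which is exactly the effect the error analysis above investigates.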
The work presented in this article makes several contributions to the field of automated text simplification:
- 1.
We propose a fully-fledged ATS system that overcomes the main problems of the existing ATS systems, as it does not require a large number of hand-crafted rules or parallel/comparable TS corpora (Section 3).
- 2.
Unlike all previously proposed ATS systems, which operate only on the lexical and/or syntactic level, our system also operates on the discourse level and performs significant content reduction (Section 3).
- 3.
We explore the importance of the order of execution of simplification modules in the ATS system, i.e. the influence of lexical substitution choices on syntactic simplification and vice versa (Section 4). To the best of our knowledge, this is the first study which addresses this issue.
- 4.
We compare the performance of the proposed ATS system with two state-of-the-art ATS systems on several levels: by comparing readability of the simplified texts, the quality of the output sentences (in terms of their grammaticality, meaning preservation and simplicity), and the post-editing effort (Section 4). This is the first study which performs such an in-depth comparison of state-of-the-art ATS systems which use different simplification approaches.
- 5.
We propose a novel type of evaluation for ATS systems by measuring the post-editing effort necessary to correct grammaticality and meaning preservation of the output sentences (Section 4.4).
- 6.
We build two evaluation datasets, 100 newswire articles and 100 Wikipedia articles, which allow for evaluation at the text level (necessary for ATS systems which attempt significant content reduction), and compare our systems with the state of the art across these two different text genres.
Related work
To date, many ATS systems have been proposed for English, e.g. (Coster & Kauchak, 2011a; Siddharthan, 2011; Siddharthan & Angrosh, 2014; Woodsend & Lapata, 2011a; Xu, Napoles, Pavlick, Chen, & Callison-Burch, 2016), and a few more have been proposed for other languages, e.g. Spanish (Bott, Rello, Drndarevic, & Saggion, 2012a; Saggion, Štajner, Bott, Mille, Rello, & Drndarevic, 2015; Štajner, Calixto, & Saggion, 2015b), Portuguese (Specia, 2010), French (Brouwers, Bernhard, Ligozat, & François, 2014), etc. We
Leveraging event-based semantics for ATS
Besides entity mentions, mentions of real-world events are the most important information in news and other types of texts (e.g. biographies, police reports). Given that most entity mentions appear as event arguments (e.g. agent or location), it is natural to consider events the central concepts in news. Although news stories focus on events, there are non-negligible portions of descriptive text that do not relate to any of the event mentions. Those are not important for understanding the story, while
Evaluation
We provide three types of experimental evaluation. First, we evaluate the readability of the simplified texts using a palette of standard readability metrics (Section 4.2). However, readability metrics alone do not suffice to assess the quality of the simplification output, because (1) they are completely oblivious to key aspects of a well-simplified text, such as grammaticality and preservation of the original meaning, and (2) it is easy to achieve high readability scores with meaningless
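A typical metric from such a palette is the Flesch Reading Ease score, which rewards shorter sentences and shorter words. The sketch below (not the paper's evaluation code) computes it with a crude vowel-group syllable heuristic; the regular expressions are assumptions for the illustration:

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels (incl. 'y').
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = max(1, len(words))
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)
```

This also illustrates the metric's blind spot noted above: the score depends only on surface counts, so a sequence of short, meaningless sentences would score very well regardless of grammaticality or meaning preservation.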
Conclusions
Many texts we encounter in our everyday lives (e.g. news articles or Wikipedia articles) can be lexically, syntactically, or semantically too complex for wider audiences, especially for people with cognitive disabilities or autism spectrum disorders, non-native speakers, etc. At the same time, such texts pose difficulties for various natural language processing tools and tasks, e.g. parsing, machine translation, summarisation. Although there are many guidelines for how to write easy-to-read documents,
References (77)
- et al.
Text simplification for people with autistic spectrum disorders
- et al.
Towards Brazilian Portuguese automatic text simplification systems
Proceedings of the eighth ACM symposium on document engineering, DocEng’08
(2008) - et al.
Lexico-syntactic text simplification and compression with typed dependencies
Proceedings of the 25th international conference on computational linguistics (COLING 2014), Dublin, Ireland
(2014) - et al.
PET: A tool for post-editing and assessing machine translation
Proceedings of the 8th language resources and evaluation conference (LREC)
(2012) - et al.
Feasibility analysis for semiautomatic conversion of text to improve readability
Proceedings of the second international conference on information and communication technology and accessibility (ICTA)
(2009) - et al.
Text simplification for information-seeking applications
On the move to meaningful internet systems
(2004) ClearTK-TimeML: A minimalist approach to TempEval 2013
Proceedings of the second joint conference on lexical and computational semantics (*SEM)
(2013) - et al.
Putting it simply: A context-aware approach to lexical simplification
Proceedings of the ACL-HLT 2011
(2011) - et al.
Can Spanish be simpler? LexSiS: Lexical simplification for Spanish
Proceedings of COLING
(2012) - et al.
Text simplification tools for Spanish
Proceedings of LREC
(2012)
Syntactic sentence simplification for French
Proceedings of the EACL workshop on predicting and improving text readability for target reader populations (PITR), Gothenburg, Sweden
Practical simplification of English newspaper text to assist aphasic readers
Proceedings of AAAI-98 workshop on integrating artificial intelligence and assistive technology
Simplifying text for language-impaired readers
Proceedings of the 9th conference of the European chapter of the ACL (EACL’99)
A hybrid approach to machine translation using man-machine communication
Motivations and methods for text simplification
Proceedings of the sixteenth international conference on computational linguistics (COLING)
Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit
Psychological Bulletin
A computer readability formula designed for machine scoring
Journal of Applied Psychology
Learning to simplify sentences using Wikipedia
Proceedings of the 49th annual meeting of the association for computational linguistics (ACL)
Simple English Wikipedia: A new text simplification task
Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (ACL&HLT).
Text simplification for children
Proceedings of the SIGIR workshop on accessible search systems
Simplifying natural language text for aphasic readers
Helping aphasic people process online information
Proceedings of the 8th international ACM sigaccess conference on computers and accessibility (ASSETS)
Machine translation using probabilistic synchronous dependency insertion grammars
Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Automatic evaluation of machine translation quality using n-gram cooccurrence statistics
Proceedings of the 2nd conference on human language technology research, San Diego, USA
Towards automatic lexical simplification in Spanish: An empirical study
Proceedings of the first workshop on predicting and improving text readability for target reader populations
Automatic text simplification in Spanish: A comparative evaluation of complementing components
Proceedings of the 12th international conference on intelligent text processing and computational linguistics (CICLing)
The principles of readability
Processing of anaphoric devices in young skilled and less skilled comprehenders: Differences in metacognitive monitoring
Reading and Writing
Comparing methods for the syntactic simplification of sentences in information extraction
Literary and Linguistic Computing
Easy-to-read texts for students with intellectual disability: Linguistic factors affecting comprehension
Journal of Applied Research in Intellectual Disabilities
Sentence simplification as tree transduction
Proceedings of the second workshop on predicting and improving text readability for target reader populations (pitr)
Automatic readability assessment for people with intellectual disabilities
ACM sigaccess accessibility and computing
The art of readable writing
Make it simple, European guidelines for the production of easy-to-read information for people with learning disability
The mechanism of suppression: A component of general comprehension skill
Journal of Experimental Psychology: Learning, Memory, and Cognition
Construction and evaluation of event graphs
Natural Language Engineering
Simplifying lexical simplification: Do we need simplified corpora?
Proceedings of 53rd annual meeting of association for computational linguistics
Event-centered simplification of news stories
Proceedings of the student workshop at RANLP 2013