Elsevier

Expert Systems with Applications

Volume 82, 1 October 2017, Pages 383-395

Leveraging event-based semantics for automated text simplification

https://doi.org/10.1016/j.eswa.2017.04.005

Highlights

  • Event-based automatic text simplification system.

  • Lexical and syntactic text simplification with content reduction.

  • No need for parallel datasets or a large set of handcrafted simplification rules.

  • Proposed simplification system is highly competitive.

  • Novel evaluation method based on measuring the post-editing effort.

Abstract

Automated Text Simplification (ATS) aims to transform complex texts into simpler variants that are easier for wider audiences to understand and easier to process with natural language processing (NLP) tools. While simplification can be applied at the lexical, syntactic, and discourse levels, all previously proposed ATS systems operate only on the first two, thus failing to simplify texts at the discourse level. We present a semantically motivated ATS system which is the first system to operate at the discourse level. By exploiting a state-of-the-art event extraction system, it is the first ATS system able to eliminate large portions of irrelevant information from texts, retaining only those parts of the original text that belong to factual event mentions. A few handcrafted rules ensure that the output of the system is syntactically simple, placing each factual event mention in a separate short sentence, while a state-of-the-art unsupervised lexical simplification module, based on word embeddings, replaces complex and infrequent words with simpler variants. We perform a thorough evaluation, both automatic and manual, showing that our system produces more readable and simpler texts than the state-of-the-art ATS systems. Our newly proposed post-editing evaluation further reveals that, on news articles, our system requires less human effort for correcting grammaticality and meaning preservation than the state-of-the-art ATS system.

Introduction

Many texts we encounter daily are written using very complex syntactic structures and specialised or sophisticated vocabulary, and thus cannot be understood by many readers. Many initiatives have raised awareness about this issue, offering guidelines for writing in an easy-to-read manner in order to produce texts more accessible to everyone, including non-native speakers and people with any kind of language or intellectual impairment, e.g. “Make it Simple, European Guidelines for the Production of Easy-to-Read Information for People with Learning Disability” (Freyhoff, Hess, Kerr, Tronbacke, & Van Der Veken, 1998) or the “Federal Plain Language Guidelines” (PlainLanguage, 2011). However, manual simplification of existing texts is time-consuming and requires very specific training. At the same time, it has been noticed that syntactically complex sentences, as well as infrequent words and phrases, decrease the performance of various NLP tasks, such as parsing (Chandrasekar, Doran, & Srinivas, 1996), machine translation (Chandrasekar, 1994, Štajner, Popović, 2016), information extraction (Beigman Klebanov, Knight, Marcu, 2004, Evans, 2011), and semantic role labelling (Vickrey & Koller, 2008). For both these reasons, the late nineties yielded a new NLP task, Automated Text Simplification (ATS), which aims to (semi-)automatically transform complex texts into simpler variants that are more understandable to wider audiences and easier to process with various NLP tools.

It has been shown that complex sentence structures (passive constructions, long sentences, appositions, etc.) and infrequent or long words can be difficult to understand for many people, e.g. non-native speakers (Petersen & Ostendorf, 2007), people with low literacy levels (Aluísio, Specia, Pardo, Maziero, & Fortes, 2008), and people with different kinds of reading or cognitive impairments, such as dyslexia (Rello, 2012), aphasia (Devlin & Unthank, 2006), autism spectrum disorders (Martos, Freire, González, Gil, & Sebastian, 2012), or Down’s syndrome (Saggion et al., 2015).

The writing style of newspaper articles is particularly challenging. They often contain long sequences of adjectives, e.g. “twenty-five-year-old blond-haired mother-of-two Jane Smith” (Carroll, Minnen, Canning, Devlin, & Tait, 1998), which can cause problems for people with aphasia (Carroll et al., 1998), autism spectrum disorders (Martos et al., 2012), and intellectual disabilities (Feng, 2009). To present information in a more sensational way, newspaper articles often use passive constructions, which do not follow the canonical subject–verb–object structure and thus pose difficulties for people with aphasia (Carroll et al., 1999) or autism spectrum disorders (Martos et al., 2012). For example, instead of the straightforward active sentence “The council today accepted a bid to build an incinerator on local wasteland”, it is more common to find the same information in the passive form “A bid to build an incinerator on local wasteland was today accepted by the council” (Carroll et al., 1998).

At the word level, it has been noticed that infrequent words can be difficult to understand for people with aphasia (Devlin, 1999) and autism spectrum disorders (Martos, Freire, González, Gil, Sebastian, 2012, Norbury, 2005) and lead to a longer reading time in people with dyslexia (Rello, Baeza-Yates, Dempere-Marco, & Saggion, 2013).

At the discourse level, people with autism spectrum disorders or intellectual disabilities may have problems finding the main idea and inferring information (Feng, 2009, Martos, Freire, González, Gil, Sebastian, 2012), resolving anaphors (Ehrlich, Remond, Tardieu, 1999, Martos, Freire, González, Gil, Sebastian, 2012, Shapiro, Milkes, 2004), and understanding texts written in dialogue format (Drndarević, Saggion, 2012, Martos, Freire, González, Gil, Sebastian, 2012). Furthermore, long texts pose additional problems for people with intellectual disabilities, as they have difficulties processing and retaining large amounts of information (Fajardo, Vila, Ferrer, Tavares, Gómez, Hernández, 2014, Feng, 2009) and suppressing irrelevant information (Gernsbacher & Faust, 1991). Long texts can also affect self-efficacy and reading motivation in students with intellectual disability (Gómez, 2011, Morgan, Moni, 2008).

Long and syntactically or semantically complex sentences are not only difficult for humans to understand; they can also pose difficulties for machine processing. Many studies have thus tried to (manually) simplify such sentences in a pre-processing step in order to improve the performance of various NLP tools. It has been noticed that simple sentences generate a smaller number of possible parse trees and have fewer constituents, which leads to faster and less ambiguous parsing (Chandrasekar et al., 1996). Such sentences are also easier to process by machine translation systems due to their simpler structure, simpler vocabulary, and lower ambiguity (Chandrasekar, 1994, Štajner, Popović, 2016). Vickrey and Koller (2008) showed that a rule-based sentence simplification system used as a pre-processing step significantly improves results on the semantic role labelling (SRL) task. Beigman Klebanov et al. (2004) showed that the use of Easy Access Sentences (EAS) – sentences with only one tensed verb and in which pronouns are substituted with the appropriate names – leads to better performance of information retrieval systems. The use of simplified sentences also improves information extraction from medical texts (Evans, 2011).

So far, the majority of the proposed ATS systems have been built for English, ranging from the early-days rule-based systems (Carroll, Minnen, Canning, Devlin, Tait, 1998, Devlin, 1999, Siddharthan, 2006), through data-driven approaches based on the comparable English Wikipedia – Simple English Wikipedia (EW–SEW) corpus (Coster & Kauchak, 2011b) using phrase-based statistical machine translation (Coster, Kauchak, 2011a, Kauchak, 2013, Štajner, Bechara, Saggion, 2015a, Wubben, van den Bosch, Krahmer, 2012) or syntactic machine translation (Woodsend and Lapata, 2011a; Zhu, Bernhard, & Gurevych, 2010), to a more recent hybrid approach (Siddharthan & Angrosh, 2014) that combines supervised data-driven lexical simplification with rule-based syntactic simplification.

All rule-based syntactic simplification modules proposed so far require a significant amount of handcrafted rules. For example, the system proposed by Siddharthan and Angrosh (2014) contains 26 hand-crafted rules for apposition, relative clauses, and combinations of the two; 85 rules for subordination and coordination; 11 for conversion from passive to active voice; and 14 for the standardisation of quotations. The lexical simplification modules are usually supervised and require a parallel dataset for training, which limits them to the EW–SEW corpus (approx. 160,000 sentence pairs) and thus reduces their coverage. Our systems, in contrast, do not require a large number of handcrafted rules for the syntactic simplification module (they use only 11 rules in total), and our lexical simplification module is fully unsupervised, thus not requiring any parallel or comparable text simplification datasets for training.

An additional problem with most of the existing ATS systems is that they do not perform any kind of content reduction; on the contrary, they often make simplified texts longer than the originals by performing sentence splitting and adding explanations of difficult terms. Long texts – although lexically and syntactically simpler – can again pose problems to people with intellectual disabilities, as they have problems with memory load and with suppressing irrelevant information (see Section 1.1). The analysis of manual simplifications of texts for people with intellectual disabilities, done by trained human editors familiar with the user needs and following specific guidelines, revealed that human editors often delete irrelevant information, some sentence parts, or even whole sentences (Drndarević, Štajner, Bott, Bautista, Saggion, 2013, Petersen, Ostendorf, 2007). However, apart from three ATS systems (Angrosh, Nomoto, Siddharthan, 2014, Narayan, Gardent, 2014, Woodsend, Lapata, 2011a) which perform some very light content reduction (occasionally deleting an adjective phrase or a sentence argument), no ATS system has addressed this important issue. Unlike those, our system performs transformations not only on the lexical and syntactic levels but also on the discourse level, leading to significantly more content reduction within a sentence and within a text (deleting even whole sentences).

We propose an end-to-end ATS system that overcomes all aforementioned shortcomings and dedicates special attention to content reduction.1 The event-based simplification (EBS) module is based on the state-of-the-art event extraction system (Glavaš & Šnajder, 2015) and uses only 11 rules to perform sentence splitting and deletion of irrelevant sentences or sentence parts. The lexical simplification (LS) module leverages word embeddings trained on large (standard English) corpora, thus not requiring any parallel or comparable TS corpora. Yet, it performs comparably to, or better than, the state-of-the-art supervised lexical simplification model proposed by Horn, Manduca, and Kauchak (2014).
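The embedding-based substitution idea behind the LS module can be illustrated with a minimal, self-contained sketch. The toy vectors, frequency counts, and the `simplify_word` helper below are illustrative assumptions, not the actual module (which uses embeddings trained on large corpora and additional candidate filtering), but the core heuristic is the same: replace a word with a semantically close, more frequent alternative.

```python
import math

# Toy word vectors and corpus frequencies (illustrative only).
VECTORS = {
    "commence": [0.9, 0.1, 0.2],
    "start":    [0.85, 0.15, 0.25],
    "begin":    [0.8, 0.2, 0.3],
    "finish":   [-0.7, 0.6, 0.1],
}
FREQUENCY = {"commence": 120, "start": 9800, "begin": 8700, "finish": 5600}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def simplify_word(word, sim_threshold=0.9):
    """Replace `word` with a near-synonym that is more frequent
    (a proxy for 'simpler'); keep the word if no candidate qualifies."""
    if word not in VECTORS:
        return word
    candidates = [
        w for w in VECTORS
        if w != word
        and cosine(VECTORS[w], VECTORS[word]) >= sim_threshold
        and FREQUENCY.get(w, 0) > FREQUENCY.get(word, 0)
    ]
    # Among semantically close candidates, pick the most frequent one.
    return max(candidates, key=FREQUENCY.get, default=word)
```

With these toy values, `simplify_word("commence")` returns `"start"`, while a word with no close, more frequent neighbour (e.g. `"finish"`) is left unchanged.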

We also examine how the order of the simplification modules influences the system performance. On the one hand, the lexical substitutions performed on the original text can influence the event detection system (if we first apply the LS module) and thus lead to a different selection of sentences to be retained as relevant. Applying the EBS module before the LS module, on the other hand, can lead to an increased number of correct or incorrect substitutions, due to the repetition of the event actors during the sentence splitting process. Therefore, we perform an in-depth manual error analysis on 475 sentences simplified by two different system configurations: LexEv (first applying the LS module and then the EBS module) and EvLex (first applying the EBS module and then the LS module).
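The interaction between the two orderings can be sketched with stub functions. Everything below is hypothetical (a toy trigger lexicon stands in for the real event extraction system), but it shows concretely how a lexical substitution applied first can change which sentences the event-based step retains:

```python
EVENT_TRIGGERS = {"commence", "accept"}  # toy event-trigger lexicon

def lexical_simplify(text):
    # Stand-in for the LS module: replace a complex word.
    return text.replace("commence", "start")

def event_based_simplify(text):
    # Stand-in for the EBS module: keep only sentences that
    # contain a recognised event trigger.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    keep = [s for s in sentences
            if any(w.lower() in EVENT_TRIGGERS for w in s.split())]
    return ". ".join(keep)

def lex_ev(text):
    """LexEv configuration: LS first, then EBS."""
    return event_based_simplify(lexical_simplify(text))

def ev_lex(text):
    """EvLex configuration: EBS first, then LS."""
    return lexical_simplify(event_based_simplify(text))

doc = "Talks will commence tomorrow. The hall was cold."
# In LexEv, "commence" is replaced before event detection; since the toy
# lexicon does not know "start", the event sentence is lost entirely.
# In EvLex, the event sentence is retained first and simplified after.
```

Here `lex_ev(doc)` yields an empty string while `ev_lex(doc)` yields "Talks will start tomorrow", illustrating why the order of the modules matters.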

The work presented in this article makes several contributions to the field of automated text simplification:

  • 1.

    We propose a fully-fledged ATS system that overcomes the main problems of the existing ATS systems, as it does not require a large number of hand-crafted rules or parallel/comparable TS corpora (Section 3).

  • 2.

    Unlike all previously proposed ATS systems, which operate only on the lexical and/or syntactic level, our system also operates on the discourse level and performs significant content reduction (Section 3).

  • 3.

    We explore the importance of the order of execution of simplification modules in the ATS system, i.e. the influence of lexical substitution choices on syntactic simplification and vice versa (Section 4). To the best of our knowledge, this is the first study which addresses this issue.

  • 4.

    We compare the performance of the proposed ATS system with two state-of-the-art ATS systems on several levels: by comparing readability of the simplified texts, the quality of the output sentences (in terms of their grammaticality, meaning preservation and simplicity), and the post-editing effort (Section 4). This is the first study which performs such an in-depth comparison of state-of-the-art ATS systems which use different simplification approaches.

  • 5.

    We propose a novel type of evaluation for ATS systems by measuring the post-editing effort necessary to correct grammaticality and meaning preservation of the output sentences (Section 4.4).

  • 6.

    We build two evaluation datasets, 100 newswire articles and 100 Wikipedia articles, which allow for evaluation at the text level (necessary for ATS systems which attempt significant content reduction), and compare our systems with the state of the art across these two text genres.
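One common way to quantify post-editing effort, sketched below under the assumption of an HTER-style measure (the study's own annotation protocol may differ), is the word-level edit distance between a system output and its human post-edited version, normalised by the length of the post-edited reference:

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between a system output (hyp)
    and its post-edited version (ref); fewer edits = less effort."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(h)][len(r)]

def editing_effort(hyp, ref):
    """HTER-style normalised effort: edits per post-edited word."""
    return word_edit_distance(hyp, ref) / max(len(ref.split()), 1)
```

For example, `editing_effort("the cat sat", "the cat sat down")` is 0.25: one insertion over four reference words.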

Section snippets

Related work

To date, many ATS systems have been proposed for English, e.g. (Coster, Kauchak, 2011a, Siddharthan, 2011, Siddharthan, Angrosh, 2014, Woodsend, Lapata, 2011a, Xu, Napoles, Pavlick, Chen, Callison-Burch, 2016), and a few more have been proposed for other languages, e.g. Spanish (Bott, Rello, Drndarevic, Saggion, 2012a, Saggion, Štajner, Bott, Mille, Rello, Drndarevic, 2015, Štajner, Calixto, Saggion, 2015b), Portuguese (Specia, 2010), French (Brouwers, Bernhard, Ligozat, & François, 2014), etc. We

Leveraging event-based semantics for ATS

Besides entity mentions, mentions of real-world events are the most important information in news and other types of texts (e.g. biographies, police reports). Given that most entity mentions appear as event arguments (e.g. agent or location), it is natural to consider events as the central concepts in news. Although news stories focus on events, there are non-negligible portions of descriptive text that do not relate to any of the event mentions. Those are not important for understanding the story, while
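The event-centred reduction described above can be sketched as follows. The `EventMention` structure and its factuality flag are simplified stand-ins for the output of the event extraction system of Glavaš and Šnajder (2015): each factual mention becomes one short canonical sentence, and text belonging to no mention is dropped.

```python
from dataclasses import dataclass

@dataclass
class EventMention:
    """Toy representation of an extracted event mention:
    a trigger verb with its agent and target arguments."""
    agent: str
    trigger: str
    target: str
    factual: bool

def eventwise_simplify(mentions):
    """Emit one short subject-verb-object sentence per factual event
    mention; non-factual mentions and leftover text are discarded."""
    return [f"{m.agent} {m.trigger} {m.target}."
            for m in mentions if m.factual]

mentions = [
    EventMention("The council", "accepted", "a bid", factual=True),
    EventMention("Critics", "might oppose", "the plan", factual=False),
]
```

Applied to these two toy mentions, only the factual one survives, yielding the single short sentence "The council accepted a bid."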

Evaluation

We provide three types of experimental evaluation. First, we evaluate the readability of the simplified texts using a palette of standard readability metrics (Section 4.2). However, readability metrics alone do not suffice to assess the quality of the simplification output because (1) they are completely oblivious to key aspects of a well-simplified text, such as grammaticality and preservation of the original meaning, and (2) it is easy to achieve high readability scores with meaningless
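As an illustration of the kind of readability metric used in the automatic evaluation, the sketch below computes the classic Flesch Reading Ease score (Flesch, 1949) with a rough vowel-group syllable heuristic; production readability tools use more careful syllable counting and tokenisation.

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a trailing
    silent 'e'; real tools use dictionaries or better rules."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words).
    Higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Short sentences of monosyllabic words, such as "The cat sat. The dog ran.", score near the top of the scale, which is precisely why readability metrics alone cannot certify a good simplification.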

Conclusions

Many texts we encounter in our everyday lives (e.g. news articles or Wikipedia articles) can be lexically, syntactically, or semantically complex for wider audiences, especially people with cognitive disabilities, autism spectrum disorders, non-native speakers, etc. At the same time, such texts pose difficulties for various natural language processing tools and tasks, e.g. parsing, machine translation, summarisation. Although there are many guidelines for how to write easy-to-read documents,

References (77)

  • C. Orasan et al.

    Text simplification for people with autistic spectrum disorders

  • S.M. Aluísio et al.

    Towards Brazilian Portuguese automatic text simplification systems

    Proceedings of the eighth ACM symposium on document engineering, DocEng’08

    (2008)
  • M. Angrosh et al.

    Lexico-syntactic text simplification and compression with typed dependencies

    Proceedings of the 25th international conference on computational linguistics (COLING 2014), Dublin, Ireland

    (2014)
  • W. Aziz et al.

    PET: A tool for post-editing and assessing machine translation

    Proceedings of the 8th language resources and evaluation conference (LREC)

    (2012)
  • S. Bautista et al.

    Feasibility analysis for semiautomatic conversion of text to improve readability

    Proceedings of the second international conference on information and communication technology and accessibility (ICTA)

    (2009)
  • B. Beigman Klebanov et al.

    Text simplification for information-seeking applications

    On the move to meaningful internet systems

    (2004)
  • S. Bethard

    ClearTK-TimeML: A minimalist approach to TempEval 2013

    Proceedings of the second joint conference on lexical and computational semantics (*SEM)

    (2013)
  • O. Biran et al.

    Putting it simply: A context-aware approach to lexical simplification

    Proceedings of the ACL-HLT 2011

    (2011)
  • S. Bott et al.

    Can Spanish be simpler? LexSiS: Lexical simplification for Spanish

    Proceedings of COLING

    (2012)
  • S. Bott et al.

    Text simplification tools for Spanish

    Proceedings of LREC

    (2012)
  • L. Brouwers et al.

    Syntactic sentence simplification for French

    Proceedings of the EACL workshop on predicting and improving text readability for target reader populations (PITR), Gothenburg, Sweden

    (2014)
  • J. Carroll et al.

    Practical simplification of English newspaper text to assist aphasic readers

    Proceedings of AAAI-98 workshop on integrating artificial intelligence and assistive technology

    (1998)
  • J. Carroll et al.

    Simplifying text for language-impaired readers

    Proceedings of the 9th conference of the European chapter of the ACL (EACL’99)

    (1999)
  • R. Chandrasekar

    A hybrid approach to machine translation using man machine communication

    (1994)
  • R. Chandrasekar et al.

    Motivations and methods for text simplification

    Proceedings of the sixteenth international conference on computational linguistics (COLING)

    (1996)
  • J. Cohen

    Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit

    Psychological Bulletin

    (1968)
  • M. Coleman et al.

    A computer readability formula designed for machine scoring

    Journal of Applied Psychology

    (1975)
  • W. Coster et al.

    Learning to simplify sentences using Wikipedia

    Proceedings of the 49th annual meeting of the association for computational linguistics (ACL)

    (2011)
  • W. Coster et al.

    Simple English Wikipedia: A new text simplification task

    Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (ACL&HLT).

    (2011)
  • J. De Belder et al.

    Text simplification for children

    Proceedings of the SIGIR workshop on accessible search systems

    (2010)
  • S. Devlin

    Simplifying natural language text for aphasic readers

    (1999)
  • S. Devlin et al.

    Helping aphasic people process online information

    Proceedings of the 8th international ACM sigaccess conference on computers and accessibility (ASSETS)

    (2006)
  • Y. Ding et al.

    Machine translation using probabilistic synchronous dependency insertion grammars

    Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics

    (2005)
  • G. Doddington

    Automatic evaluation of machine translation quality using n-gram cooccurrence statistics

    Proceedings of the 2nd conference on human language technology research, San Diego, USA

    (2002)
  • B. Drndarević et al.

    Towards automatic lexical simplification in Spanish: An empirical study

    Proceedings of the first workshop on predicting and improving text readability for target reader populations

    (2012)
  • B. Drndarević et al.

    Automatic text simplification in Spanish: A comparative evaluation of complementing components

    Proceedings of the 12th international conference on intelligent text processing and computational linguistics (CICLing)

    (2013)
  • W.H. DuBay

    The principles of readability

    (2004)
  • M. Ehrlich et al.

    Processing of anaphoric devices in young skilled and less skilled comprehenders: Differences in metacognitive monitoring

    Reading and Writing

    (1999)
  • R.J. Evans

    Comparing methods for the syntactic simplification of sentences in information extraction

    Literary and Linguistic Computing

    (2011)
  • I. Fajardo et al.

    Easy-to-read texts for students with intellectual disability: Linguistic factors affecting comprehension

    Journal of Applied Research in Intellectual Disabilities

    (2014)
  • D. Feblowitz et al.

    Sentence simplification as tree transduction

    Proceedings of the second workshop on predicting and improving text readability for target reader populations (pitr)

    (2013)
  • L. Feng

    Automatic readability assessment for people with intellectual disabilities

    ACM sigaccess accessibility and computing

    (2009)
  • R. Flesch

    The art of readable writing

    (1949)
  • G. Freyhoff et al.

    Make it simple, European guidelines for the production of easy-to-read information for people with learning disability

    (1998)
  • M.A. Gernsbacher et al.

    The mechanism of suppression: A component of general comprehension skill

    Journal of Experimental Psychology: Learning, Memory, and Cognition

    (1991)
  • G. Glavaš et al.

    Construction and evaluation of event graphs

    Natural Language Engineering

    (2015)
  • G. Glavaš et al.

    Simplifying lexical simplification: Do we need simplified corpora?

    Proceedings of 53rd annual meeting of association for computational linguistics

    (2015)
  • G. Glavaš et al.

    Event-centered simplification of news stories

    Proceedings of the student workshop at RANLP 2013

    (2013)