1 Introduction

Automatic text simplification (ATS) aims at producing a simpler version of a given input text while preserving its original information, semantic coherence and grammaticality (Candido et al., 2009; Horn et al., 2014). The resulting text is expected to be linguistically less complex, which is of interest from a human-oriented perspective, as it provides adapted texts for different target audiences, such as children (De Belder & Moens, 2010), individuals with low literacy skills (Aluisio et al., 2010) or people with dyslexia (Rello et al., 2013). From a machine-oriented perspective, it proves to be a valuable pre-processing step for other NLP applications such as machine translation (MT) (Stajner & Popovic, 2016) or information extraction (Evans & Orasan, 2019).

Nevertheless, ATS models are prone to generating outputs that, while indeed simpler, still retain a degree of complexity. This stems from the inherently relative nature of simplification, in which a given reference text is rewritten into a comparatively simpler version. Yet, simpler does not necessarily equate to simple, and the outputs may still exhibit typically complex linguistic features, such as high lexical density or long subordinate clauses (Brunato et al., 2022; Ormaechea et al., 2024).

Predicting sentence complexity is a valuable ancillary task in this respect, as it can help evaluate the simplification effectiveness of the generated output. In addition, it can contribute to the automatic creation of monolingual complex-simpler pairs, which are a scarce resource in ATS, especially for languages less resource-rich than English. Prior research has often addressed sentence complexity assessment with binary classification models (Paetzold & Specia, 2016; Stajner et al., 2017), through which an input is categorized as either complex or simple on an absolute basis. However, this approach proves somewhat coarse in the context of simplification, given its acknowledged relative nature. Since ATS models operate relative to a provided text, we believe that sentential complexity should also be estimated in a reference-aware manner.

In this paper, we contribute a BERT-based fine-tuning approach to assess sentence complexity, specifically in French. Despite the substantial resources available for many NLP tasks in this language, ATS research on French remains largely unexplored given the scarcity of the parallel simplification corpora on which supervised ATS models rely for training. To mitigate data paucity, recent approaches have leveraged unsupervised methods (Qiang & Xindong, 2021) that exploit unlabeled data to generate simplified sentences, thus significantly lessening the need for aligned texts, although not eliminating it. These unsupervised approaches are often complemented with labeled data, either to gain additional knowledge of simplification (Surya et al., 2019), or to mine aligned pairs and thus generate more data to improve the performance of simplification models (Martin et al., 2020, 2022). However, such aligned texts are often available only in English, making data-driven ATS harder to implement in other languages such as French.

To alleviate this issue, we introduce a triad of increasingly fine-grained models to: i) determine whether a French sentence is inherently complex or simple; ii) assess whether the second sentence in a pair is simpler than the first; and iii) measure the simplification gain achieved by the second sentence with respect to the original one. Additionally, based on the proposed method, we provide a general-purpose parallel sentence simplification dataset for the French language. To prove its utility, we subsequently use this corpus to fine-tune large language models (LLMs) and thus automatically generate simpler texts in French. To do so, we examine and compare instruction fine-tuning, reinforcement learning and parameter-efficient fine-tuning techniques through an objective and a subjective evaluation.

The structure of this paper is as follows. Section 2 discusses the different linguistic levels of ATS, the often omitted distinction between simple and simpler, along with the methods for collecting simplification pairs and determining sentence complexity. Section 3 presents the corpora used to predict sentence complexity, and the proposed fine-grained method to extract complex-simpler pairs is introduced in Sects. 4 and 5. In Sect. 6, the resulting parallel corpus is utilized to fine-tune LLMs using various techniques, facilitating the automatic generation of simpler French. In Sect. 7, an objective and subjective evaluation is performed on the basis of the fine-tuned models derived from the previous section. Finally, Sect. 8 offers some final remarks and outlines potential pathways for future research.

2 Background and related work

In this section, we present the various intralingual transformations carried out in the context of ATS. Subsequently, we outline the various approaches used to collect parallel text simplification corpora, and lastly, we explore the methods used to predict sentence complexity.

2.1 Automatic text simplification: sentential operations

As noted in the introduction, the main goal of ATS is to produce a simpler version of a given input text, while maintaining intact, as much as possible, its original information, semantic coherence and grammaticality. In order to perform this task, ATS operations are traditionally subdivided at different levels depending on the linguistic span being considered for such adaptation:

  • lexical-level (LS), which addresses the substitution of potentially complex terms with simpler alternatives;

  • sentence-level (SS), which aims at transforming sentences into simpler equivalents; and

  • document-level (DS), which addresses the challenges that arise in text simplification at a discursive level, such as coherence and cohesion.

While they are mutually distinguishable, there is a significant overlap between them. In this respect, it is worth noting that SS is the most prone to include operations that are characteristic of the other two levels, as it seeks to detect potentially complex constructions (e.g., deviations from canonical linear order, long subordinate clauses, etc.) and rewrite them into simpler versions, but without entailing meaning loss. To achieve this, it operates on two linguistic axes:

  1. syntagmatic, thus decomplexifying the syntax and grammatical structure of the input sentence; and

  2. paradigmatic, hence replacing complex lexical terms with simpler ones (akin to lexical-level simplification).

These changes do not necessarily involve one single sentence, but depending on the transformations made, they may affect a broader scope (therefore, at paragraph- or document-level). More specifically:

  1. Intra-sentential operations, referring to the simplification changes that are produced within the scope of one single sentence (that is, on a 1:1 basis). These include word substitution, reordering, paraphrasing, as well as deletion of superfluous information.

  2. Inter-sentential operations, referring to the changes involving several sentences, that is, on an n:m basis (two examples are shown in Table 1):

    • Divergence (or splitting), viz., dividing long sentences into shorter and less complex segments (with m>n).

    • Convergence (or compression), namely, rewriting n sentences into a simpler and more compact version (with n>1 and m<n).

Table 1 Inter-sentential examples extracted from the French versions of Wikipedia (Original) and Vikidia (Simpler). The text fragments showing a Convergence or a Divergence operation are in bold. A gloss in English is provided below each segment for clarity purposes

It should be noted that, when automating SS, the explicit implementation of such transformations may vary depending on the model being designed, as each system tends to target a specific linguistic axis or operation (Zhu et al., 2010).

2.2 Simple vs. simpler: a distinction often overlooked in ATS

The performance of ATS models is normally judged upon three criteria (Martin, 2021): i) how fluent the simplified output is; ii) how well the meaning of the source text is preserved in the output; and, most notably, iii) how simple it is compared to the original unsimplified text. A successful model is thus expected to produce a fluent, meaning-preserving text that is comparatively simpler in form than its original counterpart. This implies that the system is not necessarily designed to generate simple text, but rather to achieve a simplicity gain with respect to a given text. In other words, the model aims to produce a comparatively simpler version of a provided input. Yet simpler does not equal simple by definition. A complex text can be transformed into a relatively simpler version but still show complex features that make it inadequate with respect to the constraints of simple language.

Then the question that arises is: what is the notion of simple? Is there such a thing as an absolute and objective simplicity that defines one particular text? The concept of simple language has been extensively investigated in prior literature, especially in the context of text accessibility. It has been broadly defined as a variety of language that shows low lexical and syntactic complexity (Klaper et al., 2013). Nevertheless, providing proper simplified texts requires a more precise delineation, as it is greatly influenced by the needs of specific target readers (e.g., individuals with cognitive disabilities, foreign language learners, children, etc.), which condition the preferred simplification operations accordingly. As can be noted, the audience is not a negligible factor, as it shows that text simplification is a strongly subject-dependent task: the perception of a text as being more easily accessible or comprehensible may vary substantially according to the target reader (Dmitrieva et al., 2021).

In recent years, growing awareness of the reading comprehension difficulties posed by some types of documents (e.g., technical, administrative, but also general-domain) (Stajner, 2021), as well as regulations ratified by institutional frameworks (Nomura & Nielsen, 2010), has fostered the definition of easy-to-understand manual simplification style guides, such as Easy Language or Plain Language (Maaß, 2020). These initiatives were created to provide standards for writing comprehensibility-enhanced texts, and to guarantee the quality and appropriateness of the resulting simplifications. Nonetheless, such guidelines often advise overly broad or imprecise simplification-oriented rules, such as the use of short sentences and simple words, or the avoidance of non-essential information (Candido et al., 2009). Such vagueness hinders their applicability within ATS solutions. More importantly, it makes it difficult to objectively quantify the extent to which a text complies with a specific guideline (Fajardo et al., 2013; Sutherland & Isherwood, 2016), thus precluding a consensual definition of simple language and a common characterization of simple text.

2.3 Approaches for building parallel text simplification corpora

The creation of relevant resources for text simplification is a crucial procedure for the subsequent training and evaluation of data-driven ATS models. However, it poses a significant challenge due to the intricacies associated with defining simplicity, as discussed earlier, and also the strong reliance on monolingual parallel corpora comprising representative simplified texts and their corresponding complex references. The paucity of such data collections has significantly hindered progress on this task, both method- and language-wise. To mitigate this issue, previous research has employed two approaches for building parallel complex-simple(r) text resources: manual and automatic, with a special focus on sentence-level simplifications.

2.3.1 Manually-created corpora

Manually crafted monolingual parallel corpora for ATS are usually created from scratch, by asking experts (i.e., teachers, translators or speech therapists) to simplify a set of texts (usually genre- or domain-specific), for a particular audience (Brunato et al., 2022). By relying on pre-existing or ad hoc target-aware style guidelines and professional editors’ expertise, the resulting sentence simplification pairs are expected to provide a reliable and high-quality parallel dataset.

On this basis, several datasets have been released, such as Newsela (Wei et al., 2015), in English and Spanish, or PorSimples (Aluisio et al., 2010) in Brazilian Portuguese. As for French, the only existing parallel corpus in this language is Alector (Gala et al., 2020). This collection includes approximately 2,300 complex-simple sentence pairs from literary and scientific documents, specifically simplified for child audiences. Parallel corpora derived from this approach are notable for the highly reliable simplification operations performed on the original text. However, this process is costly, both economically and time-wise, due to the requirement of trained human editors. Furthermore, it limits the size of the resulting dataset, which, with the exception of Newsela, does not easily support the implementation of ML algorithms able to infer the transformations needed to generate simplified text.

2.3.2 Automatically-created corpora

With the goal of providing large-scale ATS-oriented parallel monolingual datasets, automatic data acquisition approaches rely on existing comparable corpora (usually Wiki-based) that associate standard texts with their simplified versions. These resources are later used to extract complex-simple sentence pairs, giving rise to labeled data collections, like WikiSmall (Zhu et al., 2010), Ew-Sew (Hwang et al., 2015) or WikiLarge (Zhang & Lapata, 2017). While widely used for training ATS models in prior literature (Nisioi et al., 2017; Martin et al., 2020; Sheang & Saggion, 2021), the adequacy of the simplifications within these datasets has been called into question (Wei et al., 2015). This is due to the potential disparity between the source text and its comparatively simpler counterpart, given that comparable corpora are often written independently. In addition, their limited controllability has also been debated, since it is difficult to determine to what extent they follow any style manual, or whether the performed simplifications are target-aware or target-oblivious. Another impediment is that such resources often exist only in English, making data-driven ATS in less resource-rich languages harder to implement.

Yet, the main reason to question the suitability of these datasets lies in the potential suboptimality of the methods used to mine register-diversified comparable corpora. To capture monolingual parallel data that is relevant for ATS, prior research has typically relied on automatic alignment algorithms. While originally designed to align bilingual texts, sentence alignment mechanisms have also garnered attention in the context of monolingual tasks like summarization, style transfer and simplification. More precisely, a number of language-independent tools have been proposed to allow the alignment of complex-simple pairs from comparable monolingual documents, such as MassAlign (Paetzold et al., 2017), CATS (Stajner et al., 2018) and LHA (Nikolov & Hahnloser, 2019).

Furthermore, semantic similarity measures have been used to estimate the semantic closeness between sentence pairs. Such is the case of SBERT (Reimers & Gurevych, 2019), which modifies the pre-trained BERT network (Devlin et al., 2019) by using siamese and triplet network structures to compute sentence embeddings that can later be compared using a cosine similarity measure. SBERT has been applied in the context of standard and simplified sentence mapping, so as to obtain 1:1 (Aumiller & Gertz, 2022) and n:1 (Ebling et al., 2022; Sun et al., 2023) alignments.

Although these strategies are prone to error, they aid in assessing the semantic closeness between two sentences, and thus serve as a proxy for meaning preservation. However, they do not suffice on their own, as they fail to ascertain whether the target text genuinely constitutes a simpler version with respect to the corresponding input. Given that simplicity gain is a sine qua non condition for a simplified text to be considered valid, recent studies have explored the use of classification and regression models to estimate sentence complexity, as we will see below.

2.4 Automatic assessment of sentence complexity

Automatically determining the complexity of a sentence proves to be a valuable ancillary task for ATS, as it can potentially serve as a preliminary step in creating labeled simplification data. Additionally, it can aid in evaluating the simplification effectiveness of the generated output.

Prior literature has approached sentence complexity prediction in various ways, depending on the ultimate objective, typically: i) detecting the complex sentences that need to be simplified, and ii) quantifying the degree of simplification achieved within a pair. Each objective has shaped the type of model used for the assessment. To address the first goal, previous works have mainly employed absolute complexity classifiers. These models assign a discrete label to an input text representing its difficulty. This can be treated as a binary classification problem (Paetzold & Specia, 2016; Stajner et al., 2017) or, if a finer granularity is required, as a multi-class discrimination problem (Vajjala & Meurers, 2014; Khallaf & Sharoff, 2021). On the other hand, relative sentence complexity classifiers (Ambati et al., 2016) and, more particularly, regression models have been prioritized to address the second objective (Iavarone et al., 2021), as they can represent linguistic complexity as a continuum and help predict the degree of complexity reduction achieved by a simplified sentence.

It is also worth noting that such regressors have commonly been used from the perspective of automatic readability assessment (Lee & Vajjala, 2022). While readability is a notion complementary to simplification, the two are not equivalent. Readability primarily focuses on language clarity and accessibility, and it does not strictly target the relation between meaning preservation and simplicity gain. In addition, readability formulae were designed for document-level application, which means that they may not be completely reliable at the sentence level (Stajner et al., 2017). This suggests the need to introduce new methods within ATS, so as to properly quantify the gain or loss of simplicity in a complex-simpler pair.

3 Corpora

As previously stated, automatically determining the complexity of a sentence (or a pair of sentences) can potentially serve as a helpful preliminary step in creating labeled simplification data in languages such as French, where ATS-specific aligned data is scarce. In this section, we showcase the corpora we used to make such prediction as well as to automatically mine complex-simpler pairs.

3.1 WikiLarge-Fr

Assessing sentence simplicity automatically is generally based on data-driven approaches. We therefore opted to rely on WikiLarge (Zhang & Lapata, 2017), a well-established dataset that has been used to develop and refine simplification models in previous ATS research (e.g., Zhao et al., 2020; Qiang & Xindong, 2021; Martin et al., 2022). However, a significant obstacle was that the texts in WikiLarge were originally written in English and needed to be translated into French. To tackle this issue, we used Google Translate to obtain translations for each pair, creating WikiLarge-Fr. However, due to the large size of the corpus, we were unable to manually verify the correctness of these translations. One potential solution to this limitation would be to compare a subset of WikiLarge-Fr with the corresponding original texts and conduct a human evaluation of translation accuracy.

Table 2 Overview of size (in sentence pairs) and data distribution of the WikiLarge-Fr dataset

During this process, we identified that certain pairs were too similar, so we kept only those with a Levenshtein ratio below 0.95. We then partitioned the data into train, validation and test sets using an 80:10:10 stratified split (see Table 2).
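This pre-processing can be sketched as follows (a minimal sketch, assuming the python-Levenshtein and scikit-learn libraries and hypothetical file and column names; the original implementation may differ):

```python
import pandas as pd
from Levenshtein import ratio  # python-Levenshtein
from sklearn.model_selection import train_test_split

# Hypothetical column layout for the translated corpus.
pairs = pd.read_csv("wikilarge_fr.tsv", sep="\t", names=["complex", "simple", "label"])

# Discard near-identical pairs (keep those with Levenshtein ratio < 0.95).
keep = pairs.apply(lambda r: ratio(r["complex"], r["simple"]) < 0.95, axis=1)
pairs = pairs[keep]

# 80:10:10 split, stratified on the (assumed) label column.
train, rest = train_test_split(pairs, test_size=0.2, stratify=pairs["label"], random_state=42)
valid, test = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=42)
print(len(train), len(valid), len(test))
```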

Fig. 1 Overview of the pipeline to obtain complex-simpler sentence pairs from the French Wikipedia and Vikidia

3.2 Wikipedia-Vikidia data acquisition

Prior studies have highlighted the potential of Wiki-based articles for the creation of ATS resources (Brouwers et al., 2012). For this reason, we decided to use the French-language editions of register-differentiated comparable corpora to subsequently extract parallel simplification pairs. More precisely, we relied on Wikipedia and Vikidia, where the latter constitutes an adapted version of the former, created to provide texts that can be more easily understood by children between 8 and 13 years old. At present, French Vikidia comprises more than 40k articles, which makes it a significant resource for ATS. Although French is a reasonably well-resourced natural language, the available aligned data for this task is limited (Seretan, 2012; Cardon & Grabar, 2019).

In order to retrieve the textual content from the articles of both sources, we considered the total number of parallel articles between the two encyclopedias and extracted their complete text content. The implemented pipeline was the following (a code sketch of the last two steps is given after the list):

  1. The extraction process was initiated by parsing the URL list of all available articles from Vikidia (as shown in region a in Fig. 1). The output yielded a total of 34,357 article links.

  2. Subsequently, the HTML content of the extracted URLs was parsed to find the corresponding Wikipedia articles using interlanguage links.

  3. The extracted articles were then pre-processed and segmented into sentences.

  4. Finally, sentences exceeding 128 word pieces were filtered out to avoid truncation when encoded into a sentence embedding.
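Steps 3 and 4 can be approximated as below (a minimal sketch; the spaCy French pipeline and the tokenizer checkpoint used to count word pieces are illustrative assumptions):

```python
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("fr_core_news_sm")  # French sentence segmenter (assumed)
tokenizer = AutoTokenizer.from_pretrained("flaubert/flaubert_base_cased")  # word-piece counter (assumed)

def segment_and_filter(article_text: str, max_pieces: int = 128) -> list[str]:
    """Split an article into sentences and drop those exceeding max_pieces word pieces."""
    sentences = [s.text.strip() for s in nlp(article_text).sents]
    return [s for s in sentences if len(tokenizer.tokenize(s)) <= max_pieces]
```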

Table 3 An overview of the data collected from the Wikipedia and Vikidia articles

In an earlier work (Ormaechea & Tsourakis, 2023), we chose to extract only the summaries included in the articles' prefaces (also known as lead sections), based on the hypothesis that the definitional style prevailing in such sections would facilitate the discovery of aligned sentences. To expand the size of this preliminary dataset, we decided to extract the full content of the articles and apply the same procedure. As shown in Table 3, Wikipedia articles are typically much lengthier than Vikidia ones, as evidenced by the number of sentences obtained from each source.

4 Meaning preservation pre-filtering

As discussed in Sect. 2.2, the output produced by an ATS model is expected to meet two primary conditions: i) retain the meaning and information conveyed in the input text, and ii) obtain a linguistic simplicity gain with respect to the reference. Based on this definition, we addressed these two dimensions sequentially (as shown in Fig. 1). In order to determine suitable complex-simpler pairs for ATS, we must first assess whether they are semantically equivalent. If their meaning is divergent, no assessment on simplicity gain is applicable.

4.1 n:m-aware automatic sentence alignment

In order to identify the Wikipedia-Vikidia pairs exhibiting a high semantic overlap, we implemented a meaning preservation filtering method. To this effect, we relied on SBERT (Reimers & Gurevych, 2019), which modifies the pre-trained BERT network (Devlin et al., 2019) by using a siamese architecture to compute sentence embeddings. After mapping the sentences to a 768-dimensional dense vector space, we computed the cosine similarity for the resulting encoded pairs. It should be noted that, since we intended to capture both intra- and inter-sentential simplification operations, we computed sentence embeddings on a multi-sentence basis and employed an n:m-aware sentence alignment. To achieve this, we fed SBERT with \(W_{n}\) Wikipedia sentences and \(V_{m}\) Vikidia sentences, with \(1 \le n, m \le 3\), where \(n, m \in \mathbb {N}\). As a result, we obtained the cosine similarity value corresponding to each input sentence pair.
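A minimal sketch of this n:m alignment step is given below (the SBERT checkpoint and the sliding-window construction are our assumptions made for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # assumed 768-dim multilingual SBERT

def windows(sentences, max_n=3):
    """All contiguous concatenations of 1 to max_n consecutive sentences."""
    return [" ".join(sentences[i:i + n])
            for i in range(len(sentences))
            for n in range(1, max_n + 1)
            if i + n <= len(sentences)]

wikipedia_sentences = ["..."]  # segmented sentences of one Wikipedia article
vikidia_sentences = ["..."]    # segmented sentences of the matched Vikidia article

wiki_spans = windows(wikipedia_sentences)  # W_n, 1 <= n <= 3
viki_spans = windows(vikidia_sentences)    # V_m, 1 <= m <= 3

emb_wiki = model.encode(wiki_spans, convert_to_tensor=True)
emb_viki = model.encode(viki_spans, convert_to_tensor=True)

# Cosine similarity between every Wikipedia span and every Vikidia span.
similarity = util.cos_sim(emb_wiki, emb_viki)  # shape: (len(wiki_spans), len(viki_spans))
```

Pairs whose similarity exceeds the cutoff threshold defined below (Sect. 4.2) are retained for the simplicity filtering stage.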

4.2 Manual annotation

Once the SBERT-based cosine similarity scores were computed, we needed to assess which pairs showed sufficient semantic consistency. To this end, we chose to rely on a manual annotation of 500 randomly picked sentence pairs from our initial dataset. Two subjects were selected to determine to what extent each Wikipedia-Vikidia pair conveyed the same meaning. To conduct the annotation, they were given three judgment labels (Table 4 shows an example of each case):

  • valid, where the meaning and information from source to target is fully preserved;

  • partially valid, where information is partially lost from source to target or vice versa; and

  • non-valid, where information between the two sentences is divergent.

Fig. 2 Box and whisker plot distribution of SBERT-derived cosine similarity values for each human judgment label

After the first annotation round, the two experts convened to discuss and reached a consensus, resulting in a Cohen's kappa score of 0.87. With 500 annotated sentence pairs at our disposal, we plotted the distribution of the SBERT scores for each judgment label (see Fig. 2). On average, pairs labeled as valid exhibit distinctively higher SBERT-derived values than partially valid and non-valid pairs, whose mean values are 0.70 and 0.55, respectively. This indicates a direct correlation between SBERT scoring and human judgments of sentence similarity. The mean score for valid pairs was 0.81, which we adopted as the cutoff threshold for the meaning preservation pre-filtering.

Table 4 Manually annotated examples on meaning preservation

5 Simplicity filtering

After addressing the meaning preservation dimension, we focused on how to extract the simplicity gain obtained by the target sentence with respect to the reference. Our approach consists of three distinct steps that assess absolute and relative simplification and estimate a gain score (as shown in Fig. 1), and aims to properly address the relative nature of simplification. An absolute binary categorization of a sentence as complex or simple is somewhat insufficient and ill-suited to ATS. Indeed, a complex sentence (C) transformed into a simple one (S) constitutes a simplification; conversely, an S\(\rightarrow\)C process gives rise to a complexification. Nevertheless, an absolute classifier can equally categorize a source and target sentence as C\(\rightarrow\)C or S\(\rightarrow\)S. Given that simplification and complexification operations are reference-dependent, they may validly occur in both cases.

Because several phenomena are involved in simplicity assessment, we split the problem into increasingly fine-grained subtasks. First, we used the WikiLarge-Fr dataset to elicit pairs of complex-simpler sentences for fine-tuning different versions of FlauBERT (Le et al., 2020). For the classification task, we created two models: one to assess the simplicity of each sentence in the pair, and another to determine whether the target sentence is simpler than the corresponding source. Subsequently, based on a set of features, we calculated the simplicity gain for each pair, which allowed the creation of a regressor model to automate this process. For a clearer depiction of the specific steps involved, refer to Fig. 3.

Fig. 3 Overview of the simplicity assessment approach

5.1 Classification models for sentence complexity

Fine-tuning pre-trained classification models can help leverage their learned knowledge and transfer it to a new classification task. By adapting the model to the target task with labeled data, we can improve its generalization, capture domain-specific nuances, and achieve better results. In our work, we employed a specific architecture based on the FlauBERT language model to perform sentence complexity classification: a variant of the model adapted for sequence classification, in which the pre-trained encoder is combined with additional layers and a classification head that enable it to classify sequences into different categories.
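The setup can be sketched with the Hugging Face transformers library (the checkpoint, hyperparameters and dataset contents below are illustrative assumptions; the paper compares the small, base and large FlauBERT variants):

```python
from datasets import Dataset
from transformers import (FlaubertTokenizer, FlaubertForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "flaubert/flaubert_base_cased"  # also run with the small and large variants
tokenizer = FlaubertTokenizer.from_pretrained(checkpoint)
model = FlaubertForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Placeholder data; in practice the WikiLarge-Fr splits are used.
train_ds = Dataset.from_dict({"text": ["..."], "label": [0]})
valid_ds = Dataset.from_dict({"text": ["..."], "label": [1]})

def encode(batch):
    # The absolute classifier encodes a single sentence; the relative classifier
    # would encode the source-target pair jointly (tokenizer(src, tgt, ...)).
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="flaubert-ac", num_train_epochs=3,
                           per_device_train_batch_size=16, evaluation_strategy="epoch"),
    train_dataset=train_ds.map(encode, batched=True),
    eval_dataset=valid_ds.map(encode, batched=True),
)
trainer.train()
```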

5.1.1 Absolute sentence complexity assessment

In the first experiment, we treated each sentence in the input pairs independently to determine whether it is categorized as simple or complex. To achieve this, we assigned a binary label to each of the sentences in the WikiLarge-Fr dataset (see Sect. 3.1). The performance on the test set is presented on the left side of Table 5. Utilizing different variants of the FlauBERT model, we contrasted the performance of each baseline model (untuned) with the one obtained after training (tuned). We observed a significant improvement in the second case, consistent across all three variants: the baseline untuned models performed no better than random chance in distinguishing between the two classes (\(\sim\)50%), versus \(\sim\)70% for the tuned ones. It is worth noting that the small version of the untuned FlauBERT model is only partially trained, which may impact its performance. Nevertheless, it was included for debugging purposes.

5.1.2 Relative sentence complexity assessment

The second classifier aims to assess the relative simplification between the source and target sentence pairs, answering the question of whether the second is a simpler version of the first. To accomplish this, we juxtaposed the sentences, alternating their order, to form two sets of pairs signifying either simplification or complexification. This time, we significantly improved upon the baseline performance (\(\sim\)50% versus \(\sim\)93%). To reinforce the validity of this outcome, we also utilized the manually annotated dataset of Sect. 4, which included human annotations of relative simplification. Aside from asking the annotators to provide a judgment on meaning preservation, we also asked them to assess the simplicity gain dimension, using the following labels for the target sentence: i) simpler than the original, ii) as complex as the original, and iii) more complex than the original. The manual set used in this section contains the sentence pairs where both judges agreed on identifying a simplification or a complexification operation, totaling approximately 100 examples.

The results shown on the right side of Table 5 corroborate our previous assessment. As the dataset is imbalanced, the baseline classifiers’ performance mirrors the class distribution and can largely be attributed to chance. However, the tuned models improve those significantly (\(\sim\)94%).

Table 5 Accuracy results in % obtained for the absolute complexity classifier (AC) on the test set, and for the relative complexity classifier (RC) on the test and manual evaluation sets

5.2 A regression model for simplicity gain

The classification models presented above allow us to discern in a binary manner whether a sentence is complex or simple, or whether a pair of sentences has undergone a process of simplification or complexification. However, these models lack the capacity to indicate to what extent a target sentence is simpler than its original counterpart. For this reason, we aimed to quantify the simplification shift produced within a pair of sentences traditionally categorized as complex-simple, by training a regression model. In this way, we sought to measure the simplicity gain achieved from the original sentence to its simplified version.

As noted in Sect. 2.4, similar regression models have been used from a readability perspective, but they prioritize the measurement of clarity and accessibility aspects, and do not explicitly address the challenges of ATS. This is why we sought to examine the quantification of the simplicity gain.

5.2.1 Definition of features

We extracted a set of pertinent features, shown in Table 12 of the Appendix, chosen on the basis of previous literature on sentence simplicity assessment (Tanguy & Tulechki, 2009; Brunato et al., 2022). These describe the WikiLarge-Fr dataset along three dimensions and are grouped into structural, lexical and syntactic features. Based on these features, we calculated their values for each sentence in the pair and performed an element-wise subtraction, yielding, for each pair, a vector of per-feature differences that we subsequently standardized.
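A minimal sketch of this computation follows (the feature extractor is a placeholder for the structural, lexical and syntactic measures listed in Table 12; the direction of the subtraction is a modelling choice):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def extract_features(sentence: str) -> np.ndarray:
    """Placeholder for the structural, lexical and syntactic features of Table 12
    (e.g., sentence length, number of words, lexical density, parse-tree depth, ...)."""
    raise NotImplementedError

def feature_gains(pairs: list[tuple[str, str]]) -> np.ndarray:
    # Element-wise difference between the source and target feature vectors.
    diffs = np.array([extract_features(src) - extract_features(tgt) for src, tgt in pairs])
    # Standardize each feature difference (zero mean, unit variance).
    return StandardScaler().fit_transform(diffs)
```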

While using a predictive model to estimate the simplicity gain from complex-simpler pairs might not be necessary when a direct calculation process is available, there are potential benefits to consider. Predictive models can assist in quality assessment by identifying cases where direct calculations may falter due to assumptions or heuristics. They offer generalization capabilities, making predictions for new data and variations that the direct process may not cover. Additionally, these models can uncover hidden patterns, adapt to changes in data distributions, and provide robustness against noisy or imperfect data, enhancing their value in real-world scenarios. For that reason, LLMs can be beneficial by leveraging their capacity to comprehend and learn from intricate language patterns in the data.

To tackle the challenge of collinearity, we calculated the correlation between the simplicity gain features, shown in the left heatmap of Fig. 4. This heatmap aids in detecting patterns and dependencies among the features, helping to identify the impact of each one on the overall simplicity gain and to decide which to keep in the subsequent analysis. We observe that certain pairs demonstrate a high correlation, such as Sentence length and Number of words (row: 0 – col: 1) or IDT and IDT-DLT (row: 9 – col: 19). We therefore excluded the second feature in each such pair, ending with 18 features in total.

Fig. 4 Correlation heatmaps among the feature gains for the WikiLarge-Fr and Alector datasets

We also performed a symmetric analysis on the aforementioned Alector dataset (shown in the right heatmap of Fig. 4). Given that it was manually created by expert linguists, the produced simplifications are expected to be highly reliable. This, in turn, helps to reinforce our decision to maintain or exclude features according to their relevance to the simplicity assessment. Interestingly, we observe similar patterns of correlation, indicating that the features have a similar effect in both datasets.

5.2.2 Simplicity gain estimation

Similarly to the classification tasks, we fine-tuned FlauBERT for regression. Using MSE as the loss function, the Adam optimizer and a batch size of 16, we trained FlauBERT to map its linguistic representations to continuous target variables. The regressor receives the complex-simpler pair as input, with a maximum input size of 512 tokens, and is trained to predict the pair's simplicity gain score.
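The regression setup can be sketched as follows (the checkpoint and learning rate are assumptions; the loss, optimizer and batch size follow the text):

```python
import torch
from transformers import FlaubertTokenizer, FlaubertForSequenceClassification

checkpoint = "flaubert/flaubert_large_cased"  # best-performing variant in Table 6
tokenizer = FlaubertTokenizer.from_pretrained(checkpoint)
model = FlaubertForSequenceClassification.from_pretrained(checkpoint, num_labels=1)  # one continuous output

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # learning rate is an assumption
loss_fn = torch.nn.MSELoss()

def training_step(batch):
    # batch: {"source": [...], "target": [...], "gain": [...]} with batch size 16
    enc = tokenizer(batch["source"], batch["target"], truncation=True,
                    padding=True, max_length=512, return_tensors="pt")
    predictions = model(**enc).logits.squeeze(-1)
    loss = loss_fn(predictions, torch.tensor(batch["gain"], dtype=torch.float))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```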

Table 6 MSE scores from the gain regressor (GR)

Table 6 contrasts the performance on the test set using either an untuned or a tuned FlauBERT model. We observe a significant improvement in all three cases. Specifically, the tuned models achieved a much lower MSE, demonstrating their ability to capture underlying patterns in the data and provide more accurate predictions. The flaubert-large model yields the best performance with an MSE of 0.23, which is acceptable given that the regressor's gain varies between -5 and 5. Note, however, that minor differences in the regressor's outputs are not substantially meaningful when comparing different candidate simplifications. Classifying each pair into broader categories such as slightly, moderately or considerably simplified could provide a better understanding of the degree to which a sentence has been simplified.

5.3 Wikipedia-Vikidia Corpus (WiViCo)

With this triad of models in place, we were able to implement our fine-grained method for sentence simplicity and extract relevant pairs for ATS. To do so, we applied our best-performing models to the data compiled in Sect. 3.2. As a result, we generated the Wikipedia-Vikidia Corpus (WiViCo), which contains 46,525 aligned pairs. These include standard C\(\rightarrow\)S labeled examples, but also C\(\rightarrow\)C and S\(\rightarrow\)S ones where a simplification operation was performed. Furthermore, by including not only 1:1 complex-simpler alignments but also n:m inter-sentential ones, we provide a set that covers a more exhaustive representation of simplification operations. A detailed description of the resulting dataset is shown in Table 7.

Table 7 Detailed description of WiViCo. We purposely use texts and not sentences because our dataset includes intersentential examples

For greater explicitness, Table 8 presents the application of the triad of models on three mined examples from the two encyclopedias. Upon observation, Pair\(_1\) represents a typical valid complex-simpler pair, where the AC model identifies the Wikipedia sentence as C and the Vikidia one as S. Both the RC and GR concur with this assessment: the former confirms that the Vikidia sentence has undergone simplification, and the latter quantifies such process with a positive value.

Pair\(_2\) presents a less standard, yet still valid example of a complex-simpler pair (C\(\rightarrow\)C), where the target sentence is comparatively simpler than the source. This observation is supported by both the RC and GR models. On the contrary, Pair\(_3\) shows a simplification counterexample, where all three models agree that the second sentence has undergone a complexification process compared to its Wikipedia counterpart.

Table 8 Applying the triad of BERT-based fine-tuned models to Wikipedia-Vikidia sentence pairs. A gloss in English is provided below each segment for clarity purposes (AC: absolute complexity classifier, RC: relative complexity classifier and GR: gain regressor)

6 Simpler text generation

Inspired by MT, ATS has been conceived over the last decade as a monolingual text-to-text translation task (Wubben et al., 2012; Kajiwara & Komachi, 2016; Qiang & Xindong, 2021), where the input text is rewritten into a simpler version retaining its original information. Supervised data-driven approaches may offer a comprehensive solution to sentence-level simplification, by applying both paradigmatic and syntagmatic transformations. Nevertheless, the reliance on parallel yet monolingual data, which is scarce per se, has greatly constrained advancement in this field. For this reason, there has also been a growing interest in alleviating the data-bottleneck problem by resorting to data augmentation techniques (Aprosio et al., 2019), or by exploiting unlabeled data (Surya et al., 2019; Martin et al., 2020, 2022). Furthermore, increasing attention has been paid to addressing ATS through reinforcement learning (RL), with the introduction of simplification-specific rewards to encourage the generation of simpler yet meaning-preserving outputs (Zhang & Lapata, 2017; Nakamachi et al., 2020; Yanamoto et al., 2022). However, unsupervised methods often resort to labeled pairs to mine aligned data and generate additional examples, aiming to enhance the performance of simplification models. Likewise, the typical need for ground-truth simplifications in reward function design underscores the reliance on aligned datasets for RL. Such aligned texts are, however, frequently limited to English, making data-driven ATS in other languages, such as French, harder to implement.

In this section, we utilize the resulting parallel French corpus from Sect. 5.3, namely WiViCo, to fine-tune LLMs and thus enable the generation of simpler texts in French. For this purpose, we resort to different fine-tuning techniques, i.e., instruction tuning, RL and parameter-efficient fine-tuning to obtain machine-generated simplifications.

6.1 Instruction fine-tuning

Pre-trained LLMs have proven effective for various NLP tasks despite being created as task-agnostic in the first place. With the appropriate prompt instructions and few-shot exemplars, they can achieve excellent results in various NLP tasks (Brown et al., 2020). Most of the time, however, their performance can be significantly improved without the need for exemplars by using task-specific data and fine-tuning (Ouyang et al., 2022; Wei et al., 2022). In fact, LLM fine-tuning is a crucial step in harnessing their potential, as it allows the adaptation of these models to generate contextually relevant content.

Fig. 5 Prompt completion template with a source sentence example and its reference simplification. English glosses are provided for clarity purposes

One effective approach for this task is instruction fine-tuning, which improves a model by training it with examples of desired responses to specific instructions. We chose to incorporate different versions of the FLAN-T5 multilingual model, an encoder-decoder model pre-trained on various language tasks. Tuning these LLMs involves creating pairs of prompt-completion examples, including an instruction. In Fig. 5, we present the prompt template used to instruct the model to simplify an input sentence ({source_text}). Notice that we explicitly request that the output be in French to ensure that the model does not generate multilingual results. During training, the model is presented with pairs of sentences, its output is compared to the reference simplification ({reference_simplification}), and the model's weights are updated accordingly. This process is shown schematically in Fig. 6.
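This setup can be sketched as a standard sequence-to-sequence fine-tuning run (the prompt wording only approximates Fig. 5, and the hyperparameters and dataset contents are assumptions):

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "google/flan-t5-base"  # the small, base and large variants are compared
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

PROMPT = "Simplifie la phrase suivante en français : {source_text}"  # approximation of Fig. 5

# Placeholder data; in practice the WiViCo train split is used.
train_ds = Dataset.from_dict({"source": ["..."], "reference_simplification": ["..."]})

def encode(batch):
    enc = tokenizer([PROMPT.format(source_text=s) for s in batch["source"]],
                    truncation=True, max_length=512)
    enc["labels"] = tokenizer(text_target=batch["reference_simplification"],
                              truncation=True, max_length=512)["input_ids"]
    return enc

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan-t5-wivico", num_train_epochs=3,
                                  per_device_train_batch_size=8, predict_with_generate=True),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=train_ds.map(encode, batched=True, remove_columns=train_ds.column_names),
)
trainer.train()
```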

Fig. 6 Overview of the instruction fine-tuning approach followed

6.2 Instruction fine-tuning with reinforcement learning

A recent trend in fine-tuning LLMs stems from the concept of reinforcement learning from human feedback (RLHF) that directs a model to produce output that is aligned with the expectations of humans (Stiennon et al., 2020; Ouyang et al., 2022). In RL, agents learn from trial and error, interacting with an environment and receiving rewards or penalties based on their actions. RLHF represents a valuable approach to combining the strengths of human expertise and machine learning to train more capable and reliable RL agents.

Following this approach, we propose a reward model based on human expertise that particularly focuses on ATS, so as to optimize an objective that encourages outputs to comply with ATS-specific constraints. More specifically, we first implemented a reward model to quantify the simplification gain achieved between a pair of sentences, which is based on the simplicity gain regressor (GR) presented in Sect. 5.2. We then fine-tuned the best performing model of Sect. 6.1 using RL.

Fig. 7 High-level overview of the approach followed using instruction fine-tuning with reinforcement learning

With the reward model at hand, we use RL to update the weights of the instruct model. The input prompts had the structure shown in Fig. 5, excluding the reference simplification. The instruct model received each prompt and generated a completion. Both were sent to the reward model to emit a simplicity gain score (\({score}_{sg}\)). We also computed their sentence embeddings through SBERT (Reimers & Gurevych, 2019) and compared them using a cosine similarity measure; this calculation quantifies meaning preservation (\({score}_{mp}\)). The reward function R consists of these two components and is defined as:

$$\begin{aligned} R = w_1 \cdot {score}_{sg} + w_2 \cdot {score}_{mp} \end{aligned}$$

The weights \(w_1\) and \(w_2\) signify the impact of each component on the reward. We also fed the prompt to a mirror instruct model whose weights remained fixed (frozen). The outputs of the two models were then compared using the Kullback–Leibler (KL) divergence to control how much the model adapts: a high KL divergence indicates significant differences in output distributions, suggesting that the fine-tuned model may have deviated too far from the original, which could lead to undesirable behavior. Consequently, the updated fine-tuned model was penalized if it generated completions that differed too much from those of the frozen model. The KL divergence and the simplification gain were aggregated and fed to a proximal policy optimization (PPO) (Schulman et al., 2017) module that updated the instruct model's weights (see Fig. 7).
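The reward computation itself can be sketched as follows (the PPO update, performed with a standard RL library, is omitted; the SBERT checkpoint and the local path of the gain regressor from Sect. 5.2 are placeholders):

```python
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import FlaubertTokenizer, FlaubertForSequenceClassification

sbert = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # assumed checkpoint
gr_path = "path/to/gain-regressor"                                    # GR model of Sect. 5.2 (placeholder)
gr_tokenizer = FlaubertTokenizer.from_pretrained(gr_path)
gr_model = FlaubertForSequenceClassification.from_pretrained(gr_path, num_labels=1)

def reward(source: str, completion: str, w1: float = 0.5, w2: float = 0.5) -> float:
    # Simplicity gain predicted by the fine-tuned gain regressor (GR).
    enc = gr_tokenizer(source, completion, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        score_sg = gr_model(**enc).logits.squeeze().item()
    # Meaning preservation approximated by the SBERT cosine similarity.
    emb = sbert.encode([source, completion], convert_to_tensor=True)
    score_mp = util.cos_sim(emb[0], emb[1]).item()
    return w1 * score_sg + w2 * score_mp  # R = w1 * score_sg + w2 * score_mp
```

With w1=1 and w2=0 this corresponds to the RLg configuration introduced in Sect. 7, and with w1=w2=0.5 to the RLg+c one.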

6.3 Parameter efficient fine-tuning

Both instruction fine-tuning and instruction tuning with reinforcement learning, presented in the previous sections, require significant manipulation of the model's parameters or the use of external reinforcement signals to achieve the LLM tuning. These methods enhance the model's understanding and execution of complex instructions, thereby improving its performance on specific tasks. However, they can be resource-intensive and may not always be feasible in resource-constrained environments. In contrast, a Low-Rank Adaptation (LoRA) adapter (Hu et al., 2021) within the PEFT framework offers a complementary approach by enabling the fine-tuning of LLMs without extensive modification of the core model.

We reused the pipeline of Sect. 6.1, adding a LoRA adapter and freezing the weights of the FLAN-T5 model, as shown in Fig. 8. The adapter offers 9.4M trainable parameters, which constitute 1.19% of those of the large FLAN-T5 model.
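The adapter configuration can be sketched with the peft library (the rank, scaling and target modules are assumptions chosen to give a trainable-parameter count of the reported order of magnitude):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # encoder-decoder fine-tuning
    r=32,                             # low-rank dimension (assumed)
    lora_alpha=64,                    # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q", "v"],        # T5 attention projections (assumed)
)

model = get_peft_model(base, lora_config)  # the base weights remain frozen
model.print_trainable_parameters()         # reports trainable vs. total parameters
```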

Fig. 8 High-level overview of the parameter efficient fine-tuning approach

7 Results

In evaluating the proposed models, we employed both objective and subjective methods, and in this section we present the various evaluation tasks. Using quantifiable metrics, we first objectively assess the performance of the models, not solely in terms of their simplification capacity but also their computational efficiency and environmental impact. Subjective evaluation, on the other hand, was conducted by participants assessing the models' outputs and rating them on different aspects.

7.1 Objective evaluation

7.1.1 Models’ performance

We divided the WiViCo dataset resulting from Sect. 5.3 into train, validation and test splits under an 80:10:10 distribution scheme, and fine-tuned three variants (small, base and large) of the FLAN-T5 model using the different techniques presented above. As a result, four types of models were trained: an instruction-based tuned model (TM), an RL-based model using a simplicity gain reward function (RLg), an RL-based model with a reward function including simplicity gain and meaning preservation scores (RLg+c), and a PEFT-based model (PF). For comparison purposes, we added an untuned model (UM), i.e., the pre-trained FLAN-T5 model without any fine-tuning.

Table 9 reports and contrasts the UM, TM, PF, RLg and RLg+c results on the test split using automatic metrics for evaluating the overall performance of ATS systems. More specifically, we resorted to SARI (Wei et al., 2016), classically associated with simplicity gain, and BLEU (Papineni et al., 2002), highly correlated with meaning preservation judgments.
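Both metrics can be computed with off-the-shelf implementations; the sketch below uses the Hugging Face evaluate library, which is an assumption about tooling rather than a description of the exact setup used:

```python
import evaluate

sari = evaluate.load("sari")
bleu = evaluate.load("sacrebleu")

sources = ["..."]       # complex sentences from the WiViCo test split
predictions = ["..."]   # outputs of a fine-tuned model
references = [["..."]]  # one (or more) reference simplification(s) per source

sari_score = sari.compute(sources=sources, predictions=predictions, references=references)
bleu_score = bleu.compute(predictions=predictions, references=references)
print(sari_score["sari"], bleu_score["score"])
```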

Results show an improvement on both metrics for the TM with respect to the UM across all three model variants. In this respect, it is worth noting that PF offers a performance similar to the TM, with the caveat of achieving slightly lower BLEU scores. Given the resource efficiency of the adapter, which uses just over 1% of the full model's parameters, this represents a significant reduction in computational demand without compromising performance.

Table 9 SARI and BLEU results on the WiViCo test set

We extend Table 9 with the performance results obtained for the two versions of the RL-based models. As noted in Sect. 6.2, we only fine-tuned the best-performing TM model, flan-t5-large. Using w1=1, w2=0 (RLg) and w1=0.5, w2=0.5 (RLg+c), we observe an almost identical performance in terms of SARI, and an improvement of 0.4 BLEU points for the latter. This may be explained by the weight dedicated to the meaning preservation dimension within the reward function, resulting in a better balance between the two metrics.

In light of the results obtained, it becomes evident that TM and RLg provide the best results in terms of BLEU and SARI, respectively. However, high performance on one metric appears to come at the expense of the other. In comparative terms, it is rather RLg+c that achieves the best trade-off between the two, obtaining the second-best result on both and coming within 0.2 points of the best-performing models.

7.1.2 Models’ computational efficiency

In the next part of the objective evaluation, we focus on the computational efficiency of the large-based FLAN-T5 models. First, we report the average inference duration for the test set in Table 10. Notice that we repeated the calculations 10 times for each of the 5 models, obtaining a total of 50 data points. The experiments took place on a server with one GPU (NVIDIA TITAN X - Pascal). As observed, the PF model is the slowest, as it is affected by the extra processing step of the LoRA adapter. Conversely, RLg exhibits the fastest inference, which can mainly be attributed to its inherent tendency to produce shorter outputs. This is confirmed by the sentence length (number of characters) reported in the table, at whose two extremes we find the TM and RLg models. The RLg+c model occupies a middle ground in terms of inference time and sentence length, which we can relate to the findings for SARI and BLEU reported earlier. This specific model keeps a good balance among simplification gain, meaning preservation and time efficiency.

Table 10 Computational efficiency and environmental impact of the models (each measurement is the average of all output sentences)

Besides computational efficiency, we also assess the environmental impact of our models. This dimension has become increasingly important: as LLMs grow in size and complexity, their energy consumption and carbon footprint raise significant environmental concerns (Soltan et al., 2022; Dinarelli et al., 2022). In this respect, we measured the electric energy consumption in kWh and its conversion into grams of CO2 (indicated as gCO2 in Table 10) using the codecarbon tool. Since this tool overestimates the conversion between kWh and gCO2, we followed (Dinarelli et al., 2022) and used the official coefficient of 68 g/kWh (the latest available value). Again, we observe that the RLg+c model constitutes a good trade-off in terms of environmental impact.
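Energy measurement can be wrapped around inference with codecarbon (a minimal sketch; the output column name is an assumption, and 68 g/kWh is the coefficient cited above):

```python
import pandas as pd
from codecarbon import EmissionsTracker

GCO2_PER_KWH = 68  # official conversion coefficient used in the text

tracker = EmissionsTracker(output_file="emissions.csv", log_level="error")
tracker.start()
# ... run inference over the test set with one of the fine-tuned models ...
tracker.stop()

# codecarbon logs the measured energy (in kWh) to its output file (column name assumed).
energy_kwh = pd.read_csv("emissions.csv")["energy_consumed"].iloc[-1]
print(f"{energy_kwh:.4f} kWh ~ {energy_kwh * GCO2_PER_KWH:.2f} gCO2")
```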

A one-way ANOVA was conducted to compare the run times across the five generative models, revealing a statistically significant difference (F(4, 45) = 139.297, p < 0.00001). Post-hoc pairwise comparisons using Tukey’s HSD test revealed significant differences in run times between most model pairs (p < 0.05) except between TM and RLg+c. A similar analysis was performed for electric energy consumption, yielding a statistically significant difference (F(4, 45) = 170.544, p < 0.00001). Tukey’s HSD test showed significant differences in energy consumption between all models.
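These statistical tests can be reproduced with standard scientific Python tooling (a sketch; the measurement dictionaries are assumed to hold the 10 repeated values per model):

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare(measurements: dict[str, list[float]]) -> None:
    """One-way ANOVA followed by Tukey's HSD over per-model measurements."""
    f_stat, p_value = f_oneway(*measurements.values())
    print(f"F = {f_stat:.3f}, p = {p_value:.6f}")
    values = np.concatenate(list(measurements.values()))
    groups = np.repeat(list(measurements.keys()), [len(v) for v in measurements.values()])
    print(pairwise_tukeyhsd(values, groups, alpha=0.05).summary())

# run_times = {"UM": [...], "TM": [...], "PF": [...], "RLg": [...], "RLg+c": [...]}
# compare(run_times)           # inference durations
# compare(energy_consumption)  # electric energy in kWh
```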

7.2 Subjective evaluation

To assess the suitability of the generated sentences for the ATS task, we conducted a human evaluation on 50 sentences randomly selected from the test set. Two annotators were then asked to score the given sentence pairs on a five-point Likert scale, on the basis of two criteria (see Table 11):

  1. simplicity gain (SG), i.e., how much simpler is the generated simplification compared to the original sentence?

  2. meaning preservation (MP), i.e., how much of the meaning in the original sentence is preserved in the generated simplification?

Table 11 Labels assigned to each Likert scale value, inspired by the taxonomy developed by Yamaguchi et al. (2023)

Judges were shown the original unsimplified sentences along with each of the simplifications generated by the 5 models, in random order. Each sentence output received a rating for SG and MP. The evaluation of the aforementioned 50 sentences resulted in a total of 1,000 annotations: 2 annotators x 2 dimensions x 5 models x 50 sentences. The effort of this task, as informally reported by the judges, was non-trivial; they had to review the models' outputs multiple times to consistently assign scores across the two dimensions under study.

The inter-annotator agreement (Cohen's kappa) was 0.62 for SG and 0.69 for MP, which signifies substantial agreement. Interestingly, the TM and RLg models sit at opposite extremes across the two criteria, suggesting an inverse correlation between them: the more the text is simplified, the more prone it is to lose information, and conversely, the more information is retained, the less the text is simplified (see Fig. 9). Based on the results, RLg+c demonstrates a good compromise between the two tendencies, producing a non-negligible simplicity gain with respect to the original sentence without incurring substantial information loss.

Fig. 9 Average rating of SG and MP from the two judges

RLg+c is also the model that achieves the best trade-off between SARI and BLEU, as seen in Table 9. Similarly to the metric-based evaluation, judges give the highest MP and SG scores to the TM and RLg models, respectively. Note, however, that human judgments slightly deviate from the automatic assessment. For example, according to human judges, the RLg+c model ranks third (and not second) in the SG dimension, behind RLg and PF. This difference can be explained by the fact that in SARI- and BLEU-based evaluations, candidate simplifications are paired with both complex and simple references for the calculation of such metrics. However, in a human evaluation, candidate simplifications are only presented alongside the complex references. In this way, the presence or absence of these ground-truth simplifications may have impacted the results obtained in each evaluation task.

8 Conclusions and further work

This paper presents an increasingly fine-grained approach for assessing sentence simplicity. Through a comprehensive three-dimensional analysis, our objective was to estimate sentence simplicity in a manner suitable for ATS, which is an inherently relative operation. Aside from assessing text complexity in a finer-grained manner, our work can serve as a relevant and reproducible method to automatically create parallel simplification datasets. This can in turn be of great interest for reasonably well-resourced natural languages like French that still lack sufficient resources for the ATS task. In this work, we provide public access to the dataset that derives from the application of our approach, WiViCo.

Besides allowing other researchers interested in this field to utilize this resource, we took a step further by fine-tuning different language models with the dataset we curated ourselves. This process permitted experimentation with different techniques like instruction fine-tuning, reinforcement learning, and parameter-efficient fine-tuning. A recurring pattern in all evaluation tasks revealed the competing priorities between simplification gain and meaning preservation, with the RLg+c model achieving a good compromise. To reinforce our claims, we also resorted to human annotators who assessed the simplification outputs of our ATS models leading to very similar conclusions.

With the various models proposed and the thorough objective and subjective evaluations conducted, we have demonstrated their good performance on the WiViCo test partition. However, we also find it interesting to explore the generalization capabilities of the models on a manually-created French simplification dataset such as Alector. This investigation would help assess their robustness in out-of-domain test sets. Similarly, we intend to compare the results obtained using different fine-tuning techniques on FLAN-T5 against other state-of-the-art LLMs.

On another note, an extension of our investigations points to the creation of configurable ATS models. We could incorporate our triad of models into a larger pipeline designed for text simplification and use them to rank a set of candidate simplified sentences, with the goal of selecting the most simplified sentence that best preserves the original meaning of the input. Similarly, we plan to investigate reward models that also encompass human feedback on meaning preservation and grammaticality in order to have a holistic human-in-the-loop paradigm. This could serve as a guide during the simplification process by providing a continuous feedback signal to a generative ATS model and therefore adjust its output to attain a desired level of simplification.