For a summarization task, both human and automated evaluation are needed to analyze the generated summary in terms of its accuracy, conciseness, word repetition, preservation of the intended meaning, and similar factors. The evaluation of the HindiSumm dataset encompasses two distinct approaches: extrinsic evaluation (human evaluation and inter-rater agreement score) and intrinsic evaluation (redundancy, conciseness, novel n-grams, and abstractivity). These evaluation methods are discussed in the following subsections.
4.1 Extrinsic: Human Evaluation
Human evaluation involves the assessment of the dataset by human evaluators, who analyze and rate the extracted summaries based on predefined criteria. This process allows for subjective judgments, capturing the human perspective on a summary’s relevance, coherence, and overall quality. The feedback provided by human evaluators helps validate the dataset’s quality and its alignment with the intended goals. Hence, each extracted summary is evaluated and validated by independent linguistic experts based on the following five criteria (C):
— C1: Does the output summary convey the exact meaning of the input text?
— C2: Is the output summary concise?
— C3: Is the output summary grammatically correct?
— C4: Is any relevant information missing from the output summary?
— C5: Is the output summary free from any unnecessary or extra information?
C1 aims to determine the validity and quality of the summary. C2 focuses on whether the summary uses the minimum number of words necessary to form a correct sentence, gauging the brevity and succinctness of the generated output. C3 is designed to detect any grammatical errors in the generated output compared with the input text, while C4 checks whether any relevant information is missing from the summary, ensuring that no essential details are omitted. Lastly, C5 checks the summary for additional information that may be considered extraneous. Evaluating this criterion through intrinsic evaluation alone is challenging, because judging what counts as extraneous requires the subjective interpretation of a human expert.
These five criteria are evaluated by three experts providing binary responses (yes or no) based on their observations, as shown in Table 2. The average percentage of positive responses indicates the extent to which the generated summaries meet the evaluation criteria. Based on the human evaluation results, the summary quality is reported as 93.44% for C1, 86.13% for C2, 94.64% for C3, 5.39% for C4, and 92.50% for C5 (note that C4 asks whether relevant information is missing, so a lower percentage indicates better quality). These percentages reflect the overall effectiveness of the summarization process according to the evaluation criteria. The evaluation process, driven by the expertise of linguistic professionals, guides the refinement of the dataset: based on their evaluations, the samples flagged by the experts were corrected and the HindiSumm dataset was updated accordingly, ensuring a comprehensive and reliable resource for abstractive summarization research.
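As a concrete illustration, a minimal aggregation of such binary judgments into per-criterion percentages could look like the following sketch; the data structure, function name, and values are hypothetical and not the authors' actual evaluation script.

```python
# Minimal sketch (not the authors' exact evaluation script): aggregating
# binary expert judgments into per-criterion percentages of positive responses.
from typing import Dict, List

def positive_percentage(responses: Dict[str, List[List[int]]]) -> Dict[str, float]:
    """responses maps a criterion (e.g., 'C1') to a list of per-sample
    [expert-1, expert-2, expert-3] judgments (1 = yes, 0 = no);
    returns the average percentage of positive responses per criterion."""
    result = {}
    for criterion, samples in responses.items():
        votes = [vote for sample in samples for vote in sample]
        result[criterion] = 100.0 * sum(votes) / len(votes)
    return result

# Hypothetical example: three samples, each rated by three experts.
demo = {"C1": [[1, 1, 1], [1, 0, 1], [1, 1, 1]]}
print(positive_percentage(demo))  # {'C1': 88.88...}
```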
Inter-Rater Agreement: Inter-rater agreement is a measure that assesses the level of concordance among the responses provided by two or more independent raters. It quantitatively captures the agreement between the raters and the consistency with which they distinguish between different responses. To calculate the inter-rater agreement of the linguistic experts in the extrinsic evaluation, the kappa score (\(\kappa\)) [9] is used, which is measured using Equation (1). An experiment was conducted in which 50,000 random sentences were picked, and the three linguistic experts were asked to rate these sentences independently (by answering yes or no) against the five criteria listed above. For each criterion, the kappa score is calculated separately for each pair of experts: \(\kappa _1\) is the kappa score between expert-1 and expert-2, \(\kappa _2\) between expert-2 and expert-3, and \(\kappa _3\) between expert-1 and expert-3. The average kappa score is then calculated for each of the five criteria. In Equation (1), \(\bar{P}\) is the sum of agreed observations and \(\bar{P_{e}}\) is the sum of agreed observations expected by chance.
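Assuming Equation (1) follows the standard definition of Cohen’s kappa [9], it can be written in terms of these quantities as \(\kappa = \frac{\bar{P} - \bar{P_{e}}}{1 - \bar{P_{e}}}\), so that \(\kappa = 1\) corresponds to perfect agreement and \(\kappa = 0\) to agreement no better than chance.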
The kappa score, calculated based on the assessments of the three experts, indicates the level of agreement among them. For criteria C1, C2, C3, C4, and C5, the average kappa scores are 0.768, 0.701, 0.734, 0.679, and 0.715, respectively. Overall, the average kappa score across all criteria is \(\kappa = 0.720\), indicating a substantial level of agreement among the three experts regarding the assessments of the HindiSumm dataset and a high degree of consistency in their evaluations. However, achieving perfect agreement among experts is challenging, especially in tasks such as summarization where subjective interpretations can vary; acknowledging these differences explains the remaining variation in ratings, and the inclusion of diverse perspectives enriches the dataset. Table 3 shows the inter-rater agreement score for criterion C1.
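The pairwise computation described above can be sketched as follows. This is a straightforward transcription of the kappa definition for binary ratings, not the authors' code, and the expert names and ratings are hypothetical.

```python
# Illustrative sketch of the pairwise kappa computation for binary (yes/no) ratings.
from itertools import combinations

def cohen_kappa(r1, r2):
    """Cohen's kappa for two equal-length lists of binary ratings of the same items."""
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n       # observed agreement
    p_yes = (sum(r1) / n) * (sum(r2) / n)                 # chance of both answering yes
    p_no = (1 - sum(r1) / n) * (1 - sum(r2) / n)          # chance of both answering no
    p_exp = p_yes + p_no                                  # expected agreement by chance
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical ratings (1 = yes, 0 = no) from three experts on one criterion.
ratings = {"expert-1": [1, 1, 0, 1, 0],
           "expert-2": [1, 1, 0, 0, 0],
           "expert-3": [1, 0, 0, 1, 0]}
pairwise = {f"{a} vs {b}": cohen_kappa(ratings[a], ratings[b])
            for a, b in combinations(ratings, 2)}
print(pairwise, sum(pairwise.values()) / len(pairwise))   # kappa_1..kappa_3 and their average
```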
4.2 Intrinsic Evaluation
Although the results from the human evaluation are strong, intrinsic evaluation is also needed to establish the dataset’s quality. For intrinsic evaluation, we use metrics established by the research community: redundancy, novel n-gram ratio, abstractivity, and conciseness. These metrics operate on a sample \(\langle T, S\rangle\), where T is the text and S is the summary of T, with \(S_i \in\) S summarizing \(T_i \in\) T. \(\vert {S}\vert\) denotes the number of words and \(\Vert {S}\Vert\) the number of sentences in a summary.
As noted above, perfect agreement among experts is not always achievable, since subjective interpretations and perceptions vary, especially in tasks like summarization where multiple valid interpretations exist. Sentences that did not receive unanimous agreement were therefore retained, allowing for a more diverse and comprehensive dataset.
Redundancy: Redundancy (RED) occurs when information is unnecessarily repeated in a summary, making it less effective at conveying the most important or informative parts. Redundancy can be calculated using the ROUGE score to measure the overlap between sentences in the summary. Although various authors calculate redundancy for single-line summaries using the formula of Reference [13], based on the frequency of n-grams in a sentence, our summaries span multiple lines; hence, redundancy is calculated using the generalized metric given by Reference [3]. Equation (2) calculates the average ROUGE score across all possible combinations of sentences x and y of the summary, where x and y are two different sentences of the summary. The ROUGE scores of all possible unique sentence pairs of the summary are calculated, and their average is reported,
where sentences x and y are of length m and n, respectively. ROUGE-N is calculated using N-gram recall, as depicted by Equation (3). Here, N stands for N-gram co-occurrence; R, P, and F denote recall, precision, and F-measure; and L stands for the longest common subsequence. ROUGE-L is the F-measure calculated using Equation (4).
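A simplified sketch of this redundancy computation is shown below, using a self-contained ROUGE-L implementation (with a balanced F-measure) rather than the exact formulation of Reference [3]; it averages the pairwise score over all unique sentence pairs of a summary.

```python
# Hedged sketch: average pairwise ROUGE-L F-measure over all unique pairs of
# summary sentences, as a proxy for the redundancy metric described above.
from itertools import combinations

def lcs_len(x, y):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l_f(x_tokens, y_tokens, beta=1.0):
    """ROUGE-L F-measure between two tokenized sentences."""
    lcs = lcs_len(x_tokens, y_tokens)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(x_tokens), lcs / len(y_tokens)
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)

def redundancy(summary_sentences):
    """Average ROUGE-L F over all unique sentence pairs of the summary."""
    pairs = list(combinations([s.split() for s in summary_sentences], 2))
    if not pairs:
        return 0.0
    return sum(rouge_l_f(x, y) for x, y in pairs) / len(pairs)
```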
Novel n-gram ratio: The novel n-gram ratio [21] is a measure for assessing the effectiveness of a summarization model in generating a high-quality summary. It calculates the proportion of n-grams in the summary that are not present in the source text, relative to the total number of n-grams in the summary, and helps indicate how well the summary rephrases, rather than copies, the essential information from the source text. It is given by Equation (5).
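A minimal sketch of this ratio, under the assumption that Equation (5) normalizes by the total number of summary n-grams as described above, is:

```python
# Minimal sketch of the novel n-gram ratio: the fraction of summary n-grams
# (bigrams here) that do not appear anywhere in the source text.
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_ratio(source_text, summary, n=2):
    """Proportion of summary n-grams absent from the source text."""
    source_ngrams = ngram_set(source_text.split(), n)
    summary_tokens = summary.split()
    summary_ngrams = [tuple(summary_tokens[i:i + n])
                      for i in range(len(summary_tokens) - n + 1)]
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for g in summary_ngrams if g not in source_ngrams)
    return novel / len(summary_ngrams)
```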
Abstractivity: Abstractivity (ABS) uses a greedy approach to match the abstract words in the summary sentences [10]. As defined by the authors, it is calculated using fragment coverage \(\mathcal {F}(T_i, S_i)\), the degree to which the summary contains all the essential information in the source text. To calculate fragment coverage, the source text is divided into smaller units, such as sentences or paragraphs, and each unit is marked as essential or non-essential. The summary is then evaluated to see whether it contains all the essential units: if it includes all essential units, it has high fragment coverage, while it has low fragment coverage if it omits essential units or includes non-essential units. ABS is calculated using a normalized version of fragment coverage, given by Equation (6), where \(|f|\) represents the fragments of the sentence.
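The greedy fragment matching can be sketched as follows. This follows the spirit of Reference [10], with ABS approximated here as one minus the normalized fragment coverage; the exact normalization is an assumption, since it is given by Equation (6).

```python
# Hedged sketch: greedy extractive-fragment matching between source text and
# summary, with ABS approximated as 1 minus normalized fragment coverage.
def extractive_fragments(text_tokens, summary_tokens):
    """Greedily find the longest shared token spans (fragments) between
    the source text and the summary."""
    fragments, i = [], 0
    while i < len(summary_tokens):
        best, j = [], 0
        while j < len(text_tokens):
            if summary_tokens[i] == text_tokens[j]:
                k = 0
                while (i + k < len(summary_tokens) and j + k < len(text_tokens)
                       and summary_tokens[i + k] == text_tokens[j + k]):
                    k += 1
                if k > len(best):
                    best = summary_tokens[i:i + k]
                j += k
            else:
                j += 1
        if best:
            fragments.append(best)
        i += max(len(best), 1)
    return fragments

def abstractivity(text, summary):
    """Approximate ABS: 1 minus the fraction of summary words covered by fragments."""
    text_tokens, summary_tokens = text.split(), summary.split()
    coverage = sum(len(f) for f in extractive_fragments(text_tokens, summary_tokens))
    return 1.0 - coverage / max(len(summary_tokens), 1)
```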
Conciseness: Conciseness is a metric that measures the minimum number of words required to describe a complete sentence. It is also called compression (C) [3] and is defined by Equation (7). The higher the value of C, the better the result.
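As a rough sketch, and assuming Equation (7) defines compression as the word-count ratio between the source text and its summary (a common convention), conciseness could be computed as:

```python
# Hedged sketch of the compression/conciseness metric: the ratio of source-text
# length to summary length in words; a higher value indicates a terser summary.
def compression(text, summary):
    summary_len = len(summary.split())
    return len(text.split()) / summary_len if summary_len else float("inf")
```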
The quality of the HindiSumm dataset is validated by the intrinsic evaluation results presented in Table 4. Compared with other state-of-the-art datasets, HindiSumm demonstrates superior performance, which can be attributed to the extensive pre-processing applied to the data as well as the inclusion of multi-line summaries. It is important to note that none of these measures alone can fully capture the complexity of redundancy in summarization, and human evaluation is often necessary to determine the overall quality and usefulness of a summary. A sample of the HindiSumm dataset is provided via the accompanying link.2