Research article | Open access

HindiSumm: A Hindi Abstractive Summarization Benchmark Dataset

Published: 25 November 2024

Abstract

Abstractive Text Summarization (ATS) is the task of creating a novel summary by generating fresh sentences that introduce new words or rephrase the source article. It is a complex task, as the model needs to understand the semantic similarity between the sentences of the text. This requires large annotated benchmark datasets, which are available for resource-rich languages such as English and other non-Indic languages. In contrast, for less-resourced languages, such as the Indic languages, the available datasets are limited and involve very short summaries. Hence, a language-specific abstractive summarization dataset called HindiSumm is introduced for Hindi, consisting of 570,000 text-summary pairs from the Navbharat Times across 21 domains. The efficiency of the HindiSumm dataset is evaluated both extrinsically and intrinsically using various metrics. Furthermore, two recent multilingual pre-trained models are fine-tuned individually on the HindiSumm dataset, and an ensemble approach using weighted averaging is also incorporated to assess the efficacy of the proposed dataset. The models are tested on the in-house created dataset, and the results, evaluated with ROUGE scores, show significant improvements of around 13.2% for the proposed HindiSumm compared with other benchmark datasets. In the future, the HindiSumm dataset will promote the progress of ATS for Indian languages.

1 Introduction

With the proliferation of the web, the field of Natural Language Processing (NLP) [22] has experienced an unprecedented surge in the amount of available information. This has led to automatic summarization techniques, which aim to reduce the time and effort involved in finding concise and relevant information. Summarization [19] involves condensing the content of a text while retaining its meaning. Summarization techniques can be broadly classified into extractive and abstractive, depending on how the content is selected and organized in the summary.
Extractive summarization [24] methods select significant sentences based on statistical features. In contrast, abstractive techniques generate novel sentences by rephrasing or using new vocabulary rather than relying solely on sentence selection [11, 26, 28]. Deep analysis and reasoning techniques [1] are employed to generate new sentences that accurately capture the essence of the original text.
Unlike extractive summarization datasets, which often rely on the automatic extraction of sentences, summaries for abstractive summarization are typically crafted manually by linguistic experts. These experts write summaries and provide annotations and insights, ensuring the quality and linguistic richness of the resulting summaries. This manual annotation process plays a crucial role in capturing the essence of the source text and producing coherent and meaningful abstractive summaries. Hence, creating an abstractive summarization dataset is expensive and time-consuming. Numerous ATS datasets are available for English, including well-known ones such as CNN/Dailymail [20], NEWSROOM [10], the New York Times Corpus [23], and the DUC datasets [12]. Additionally, there are ATS datasets for non-Indic languages, such as CASS [4], LCSTS [15], IndoSum [16], and TurkishSum [8].
Although abstractive summarization is growing tremendously, research on Indo-Aryan languages such as Hindi remains limited. Hindi, the third most spoken language in the world,1 suffers from a lack of summarization datasets. This might be due to the low availability of Hindi datasets and the feature engineering required to build them. Existing datasets such as Wikilingua [17], XL-Sum [13], and MassiveSumm [27] have drawbacks concerning the number of samples collected and preprocessed, the limited range of domains covered, and the reliance on translated data.
Most abstractive summarization data is available in English and is then translated into other languages. However, translations for low-resource languages often contain inaccuracies, grammatical errors, or altered meanings, making such datasets less reliable. This motivates us to develop HindiSumm, a massive dataset of 570,000 text-summary pairs in Hindi extracted from the Hindi e-news website Navbharat Times. It is the largest curated dataset for Hindi abstractive summarization. Figure 1 depicts the sample format of the summarization dataset. The effectiveness of the HindiSumm dataset is evaluated using intrinsic and extrinsic measures. The extrinsic evaluation involves human evaluation, where linguistic experts assess the quality of the extracted summaries and inter-rater agreement scores are calculated to measure the consensus among the experts. Intrinsic measures such as redundancy, conciseness, the presence of novel n-grams, and abstractivity are employed to analyze the dataset’s characteristics. Furthermore, we fine-tune the state-of-the-art multilingual pre-trained text-to-text transformer (mT5) [30] and the multilingual sequence-to-sequence denoising auto-encoder model (mBART25) [6] on the HindiSumm dataset. The proposed work follows an ensembling approach [5] to combine decisions from multiple models and enhance overall accuracy. This allows us to gauge the performance of the dataset in generating high-quality summaries. The mT5, mBART25, and ensemble model results are compared with benchmark datasets to assess HindiSumm’s relative performance and effectiveness. In short, the contributions of this article are:
Fig. 1. An example of the HindiSumm dataset showing the text-summary pair.
Developed HindiSumm, a dataset of 570,000 text-summary pairs with multi-line summaries for the Hindi language, extracted from 21 domains.
Evaluated the dataset using extrinsic and intrinsic measures to demonstrate its efficiency, and fine-tuned the pre-trained mT5 and mBART models as well as their ensemble on it.
Compared the results of the proposed work with other benchmark datasets.

2 Related Work

In this section, an extensive literature review is performed to gain a comprehensive understanding of the progress made in the field of abstractive summarization for both Indic and non-Indic languages. The complete details of the various datasets are arranged in Table 1.
Table 1. Summary of Summarization Datasets for Various Languages

Monolingual datasets/corpora:
Name of Dataset/Corpus | Language | Number of Samples | Data Source | Reference | Open Source
New York Times | English | 650,000 | Newswire | [23] | No
CASS | French | 129,445 | The French Court | [4] | No
LCSTS | Chinese | 2,400,591 | Sina Weibo | [15] | No
CNN/Dailymail | English | 312,000 | CNN.com and Dailymail.co.uk | [14] | Yes
NEWSROOM | English | 1,321,995 | Social Metadata | [10] | Yes
IndoSum | Indonesian | 20,000 | CNN Indonesia | [16] | Yes
TurkishSum | Turkish | 112,833 | Multiple Sources | [8] | Yes

Multilingual datasets/corpora:
Name of Dataset/Corpus | Language | Number of Samples | Data Source | Reference | Open Source
MLSUM | French / German / Spanish / Russian / Turkish / English | 424,763 / 242,982 / 290,465 / 27,063 / 273,617 / 311,971 | Le Monde / Süddeutsche Zeitung / El País / Moskovskij Komsomolets / Internet Haber / Archive.org | [25] | Yes
WikiLingua | 18 languages | 770,087 (all languages), 9,929 (Hindi) | WikiHow | [17] | Yes
XL-Sum | 44 languages | 1,005,292 (all languages), 88,472 (Hindi) | BBC News | [13] | Yes
MassiveSumm | 92 languages | 12,443,003 (all languages), 563,477 (Hindi) | Archive.org | [27] | No

2.1 Summarization Datasets for the Indic Language

Recent years have seen summarization advance in almost every domain, as data grows exponentially and readers have limited time and attention to go through all the content available online. This section covers the main existing datasets used to generate summaries for Indic languages. Wikilingua [17], a large dataset for cross-lingual abstractive summarization, was introduced for 18 languages scraped from the WikiHow website. The authors also proposed a novel method of translation using Neural Machine Translation (NMT). The dataset is unique in its source, but it contains only 9,929 article-summary pairs for Hindi. It is also a translated version, so sentence alignment problems occur, and this amount of data is too small to train a model effectively. XL-Sum [13] is a large dataset of about one million article-summary pairs covering 44 languages; the authors fine-tuned the pre-trained mT5 model [30] to evaluate it. The dataset is quite large, but it contains only 88,472 samples for Hindi, and the results for Hindi are low compared with the other languages covered.
Daniel Varab and Natalie Schluter [27] constructed a large-scale multilingual dataset covering 92 languages. After quality-control filtering, the valid count for Hindi is 563,477 samples. The major drawbacks of this dataset are that it is not open source and that it contains only single-line summaries, hindering work on multi-line summarization.

2.2 Summarization Datasets for Non-Indic Languages

Numerous datasets are available for languages other than Hindi. These datasets cater to a wide range of languages and provide valuable resources for various summarization tasks. The popular English Document Understanding Conference (DUC) dataset [7] contains human-generated summaries specially designed for it. Each article contains multiple reference summaries; however, the dataset is relatively small and unsuitable for training deep learning models due to insufficient training data. The CNN/Daily Mail dataset [14] is used for various applications such as question answering and summarization. The summaries were generated from news stories on the CNN and Daily Mail websites: the stories were transformed into questions by hiding one of the entities, while the corresponding paragraphs served as the answers. The NEWSROOM dataset [10] contains 1.3 million summaries and articles from 38 websites, including social media and search engine platforms. This dataset is prepared for English, and its summaries are extractive, abstractive, and mixed.
Apart from English, various summarization datasets have also been developed for other non-Indic languages such as Indonesian, Chinese, and Turkish. The IndoSum dataset [16] is an Indonesian dataset containing around 20,000 article-summary pairs for this low-resource language. A Chinese dataset named Large Scale Chinese Short Text Summarization (LCSTS) [15] was created from the Chinese micro-blogging site Sina Weibo; it contains over 2 million text-summary pairs, of which 10,666 summaries are manually tagged with their short texts.

3 HindiSumm Dataset

The scraping and extraction process for the HindiSumm dataset resulted in 570,000 text-summary pairs. We have provided the extraction and scraping scripts to encourage researchers to work on the Hindi language. The comprehensive description of the dataset construction is mentioned in the following subsections:

3.1 Dataset Collection and Pre-Processing

The HindiSumm dataset is developed by web scraping the Navbharat Times news website using the Selenium library. During scraping, a vast collection of 1,016,000 HTML links to Hindi news articles was crawled from various URLs. The BeautifulSoup (bs4) library performs the extraction, carefully selecting and retaining only the relevant information while discarding irrelevant content such as images, popups, advertisements, and other unnecessary elements. Since the Navbharat Times website lacks an archive feature, we conducted web crawling to access all the links present on a starting page. The collected URLs span 21 distinct domains, including sports, entertainment, world news, states, blogs, food, health, travel, dharma, astrology, science, technology, education, politics, automobiles, budget, weather, lifestyle, share market, jokes, and business. This comprehensive coverage of domains makes it the largest collection available. To ensure data quality, an additional preprocessing step is performed to eliminate redundant and duplicate information. This preprocessing stage helps to refine the dataset and ensure its reliability and accuracy. The complete dataset collection and pre-processing architecture is presented in Figure 2, and a rough sketch of the crawling and extraction pipeline is given after the figure.
Fig. 2. Dataset construction and evaluation workflow.
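As a rough illustration of this pipeline (not the released scripts), a minimal crawling and extraction sketch with Selenium and BeautifulSoup might look as follows; the start URL, link filter, and output path are illustrative assumptions.

```python
# Minimal sketch of the crawling/extraction pipeline described above.
# The start URL, link filter, and output path are illustrative assumptions;
# the released HindiSumm scripts may differ.
import json
from selenium import webdriver
from bs4 import BeautifulSoup

START_URL = "https://navbharattimes.indiatimes.com/"  # assumed entry point

def collect_links(driver, url):
    """Open a page and return the article links found on it."""
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return {a["href"] for a in soup.find_all("a", href=True)
            if "navbharattimes" in a["href"]}

def extract_paragraphs(driver, url):
    """Return the visible paragraph text of an article, dropping images/ads/scripts."""
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for tag in soup(["script", "style", "img", "iframe"]):  # non-textual elements
        tag.decompose()
    return [p.get_text(" ", strip=True) for p in soup.find_all("p") if p.get_text(strip=True)]

if __name__ == "__main__":
    driver = webdriver.Chrome()
    links = collect_links(driver, START_URL)
    articles = {url: extract_paragraphs(driver, url) for url in list(links)[:10]}
    driver.quit()
    with open("raw_articles.json", "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
```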

3.2 Text-Summary Pair Extraction

The extraction process for obtaining text-summary pairs varies across datasets. In the XSum dataset, the initial line of the article is taken as the summary, while the remaining content serves as the input text. The CNN/DM dataset, an English dataset, uses bullet points to create the summary. XL-Sum, on the other hand, selects the bold paragraph as the summary, with the remaining text serving as the input. In our extraction process, we adopted a specific criterion to obtain the summary: we identified the first paragraph preceding the “down arrow” or the “read more” index. Typically, this paragraph consists of 3 to 4 lines and is manually composed by professionals to provide an abstractive summary of the entire article. The remaining content of the article is taken as the main text. Several features are identified to make the extraction effective; they are listed below, followed by a sketch of the heuristic:
The summary, which condenses the main points of the article, is positioned at the beginning of the text, typically comprising around 3 to 4 lines.
The words or sentences in the summary do not exactly match those in the corresponding article.
The article is approximately two to three times the length of the summary.
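A minimal sketch of this split heuristic is shown below; the marker strings and the length thresholds are assumptions for illustration, since the exact page markup is not reproduced here.

```python
# Sketch of the text-summary split heuristic described above: the paragraph
# preceding the "read more" / down-arrow marker is taken as the summary and the
# rest as the article text. Marker strings and thresholds are assumptions.
READ_MORE_MARKERS = ("और पढ़ें", "read more", "⬇")  # assumed marker strings

def split_text_summary(paragraphs):
    """paragraphs: list of paragraph strings in page order."""
    marker_idx = None
    for i, para in enumerate(paragraphs):
        if any(m in para.lower() for m in READ_MORE_MARKERS):
            marker_idx = i
            break
    if marker_idx is None or marker_idx == 0:
        return None  # no usable marker; skip this article
    summary = paragraphs[marker_idx - 1].strip()
    text = " ".join(paragraphs[marker_idx + 1:]).strip()
    # Apply the feature checks listed above.
    if len(text.split()) < 2 * len(summary.split()):
        return None  # the article should be roughly 2-3 times longer than the summary
    if summary in text:
        return None  # the summary must not be copied verbatim from the article
    return {"text": text, "summary": summary}
```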

3.3 Dataset Cleaning

Once the text-summary pairs are generated, the dataset undergoes a cleaning process to remove non-Hindi words, symbols, emojis, and other irrelevant elements. This ensures that the dataset contains only relevant and meaningful content. Subsequently, the cleaned text is transformed into a standardized JSON format, which allows for easy evaluation of the dataset using various metrics. The cleaning and formatting steps contribute to the overall quality and usability of the proposed dataset, making it suitable for evaluation and research purposes.
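A minimal sketch of such a cleaning and JSON-export step is given below; the allowed character ranges and the JSON field names are illustrative assumptions.

```python
# Sketch of the cleaning step described above: strip emojis, symbols, and
# non-Hindi characters, then store each pair in JSON. The allowed character
# ranges and the JSON field names are illustrative assumptions.
import json
import re

# Keep Devanagari characters (U+0900-U+097F), digits, whitespace, danda marks,
# and basic punctuation; drop everything else (emojis, Latin text, symbols).
NON_HINDI = re.compile(r"[^\u0900-\u097F0-9\s\u0964\u0965,.?!-]")

def clean(text: str) -> str:
    text = NON_HINDI.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

def write_pairs(pairs, path="hindisumm.json"):
    records = [{"text": clean(t), "summary": clean(s)} for t, s in pairs]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```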

4 Evaluation of HindiSumm Dataset

For a summarization task, both human and automated evaluation are much needed to analyze the generated summary in terms of its accuracy, conciseness, repetition of words, whether the correct meaning is generated or not, and various such factors. The evaluation of the HindiSumm dataset encompasses two distinct approaches: Extrinsic Evaluation (human evaluation and inter-rater agreement score) and Intrinsic evaluation (redundancy, conciseness, novel n-grams, and abstractivity). These evaluation methods are discussed in the following subsections.

4.1 Extrinsic: Human Evaluation

Human evaluation involves the assessment of the dataset by human evaluators, who analyze and rate the extracted summaries based on predefined criteria. This process allows for subjective judgments, capturing the human perspective on the summary’s relevance, coherence, and overall quality. The feedback provided by human evaluators helps validate the dataset’s performance and its alignment with the intended goals. Hence, the extracted summary is evaluated and validated by other linguistic experts based on the following five criteria (C):
C1: Is the output summary producing the exact meaning or not?
C2: Is the output summary concise or not?
C3: Is the output summary grammatically correct or not?
C4: Is any relevant information missing from the output summary?
C5: Is the output summary free from any unnecessary or extra information or not?
C1 aims to determine the validity and quality of the summary. C2 focuses on whether the summary uses the minimum number of words necessary to form correct sentences; this criterion helps gauge the brevity and succinctness of the extracted output. C3 is designed to detect any grammatical errors in the summary compared with the input text, while C4 evaluates whether any essential information is missing from the summary. Lastly, C5 checks whether the summary contains additional, extraneous information. Evaluating these criteria through intrinsic evaluation alone is challenging because judging extraneous information depends on the human expert’s subjective interpretation.
These five criteria are evaluated by three experts providing binary responses (yes or no) based on their observations, as shown in Table 2. The average percentage of positive responses indicates the extent to which the extracted summaries meet the evaluation criteria. Based on the human evaluation results, the summary quality is 93.44% for C1, 86.13% for C2, 94.64% for C3, 5.39% for C4, and 92.50% for C5. These percentages reflect the overall effectiveness of the summarization process according to the evaluation criteria. The evaluation process, driven by the expertise of linguistic professionals, guides the refinement of the dataset: based on their evaluations, the samples given to the experts were corrected and the HindiSumm dataset was updated, ensuring a comprehensive and reliable resource for abstractive summarization research.
Table 2. Human Evaluation on the HindiSumm Dataset: Percentage of “Yes” Responses out of the Total Text-Summary Pairs Rated by Each Expert

Criteria | Expert-1 | Expert-2 | Expert-3 | Average
C1 | 96.30 | 92.82 | 91.20 | 93.44
C2 | 89.40 | 90.01 | 79.00 | 86.13
C3 | 96.00 | 89.92 | 98.00 | 94.64
C4 | 8.20 | 2.67 | 5.32 | 5.39
C5 | 94.62 | 90.27 | 92.61 | 92.50
Inter-Rater Agreement: This measure assesses the level of concordance among the responses provided by two or more independent raters. It quantitatively captures the agreement between the raters and evaluates the consistency with which they distinguish between different responses. To calculate the inter-rater agreement of the linguistic experts in the extrinsic evaluation, the Kappa score (\(\kappa\)) [9] is used, measured using Equation (1). An experiment was conducted in which 50,000 random sentences were picked, and the three linguistic experts rated these sentences separately (by answering Yes or No). The experts were given the five criteria listed in Section 4.1, and for each criterion the kappa score is calculated separately for each pair of experts: \(\kappa _1\) is the kappa score between expert-1 and expert-2, \(\kappa _2\) between expert-2 and expert-3, and \(\kappa _3\) between expert-1 and expert-3. The average kappa score is then calculated for each of the five criteria.
\begin{equation} \kappa =\frac{\bar{P}-\bar{P_{e}}}{1-{\bar{P_{e}}}}. \end{equation}
(1)
\(\bar{P}\) is the observed proportion of agreement and \(\bar{P_{e}}\) is the proportion of agreement expected by chance.
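For illustration, Equation (1) applied to a single pair of raters with yes/no answers (as in the contingency tables of Table 3) can be computed as follows; the function name is ours.

```python
# Sketch of Cohen's kappa (Equation (1)) for two raters giving yes/no answers.
# Cell values are fractions of the rated samples, laid out as in Table 3.
def cohen_kappa(yes_yes, yes_no, no_yes, no_no):
    total = yes_yes + yes_no + no_yes + no_no
    p_observed = (yes_yes + no_no) / total                         # \bar{P}
    p_yes_a = (yes_yes + yes_no) / total                           # rater A's "yes" rate
    p_yes_b = (yes_yes + no_yes) / total                           # rater B's "yes" rate
    p_chance = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)   # \bar{P_e}
    return (p_observed - p_chance) / (1 - p_chance)

# Expert-2 vs. Expert-1 cells for criterion C1 from Table 3:
print(round(cohen_kappa(0.65, 0.00, 0.10, 0.25), 3))
# 0.765 for this pair; the reported C1 average over all three expert pairs is 0.768.
```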
The kappa score, calculated based on the assessments of the three experts, indicates the level of agreement among them. For the criteria C1, C2, C3, C4, and C5, the average kappa scores are 0.768, 0.701, 0.734, 0.679, and 0.715, respectively. Overall, the average kappa score across all criteria is \(\kappa =0.720\), indicating a substantial level of agreement among the three experts regarding the assessments of the HindiSumm dataset and a high level of consistency in the evaluations. However, achieving perfect agreement among experts is challenging, especially in tasks like summarization, where subjective interpretations vary. It is important to acknowledge these differences and understand that they contribute to variations in ratings. Despite not reaching perfect agreement, the inclusion of diverse perspectives enriches the dataset. Table 3 shows the inter-rater agreement scores for criterion C1.
Table 3. Inter-Rater Agreement Score for Criterion C1, Shown as Three Pairwise Contingency Tables

Expert-2 (rows) vs. Expert-1 (columns):
 | Yes | No | Total
Yes | 65% | 0% | 65%
No | 10% | 25% | 35%
Total | 75% | 25% | 100%

Expert-3 (rows) vs. Expert-2 (columns):
 | Yes | No | Total
Yes | 68% | 2% | 70%
No | 6% | 24% | 30%
Total | 74% | 26% | 100%

Expert-1 (rows) vs. Expert-3 (columns):
 | Yes | No | Total
Yes | 77% | 3% | 80%
No | 5% | 15% | 20%
Total | 82% | 18% | 100%

4.2 Intrinsic Evaluation

Although the results from the human evaluation are strong, intrinsic evaluation is also needed to establish the dataset’s quality. For intrinsic evaluation, several metrics are predefined by the research community: redundancy, novel n-gram ratio, abstractivity, and conciseness. The metrics consider a sample <T, S>, where T is the text and S is the summary of T; \(S_i \in\) S summarises \(T_i \in\) T. \(\vert {S}\vert\) denotes the number of words in a sentence, and \(\Vert {S}\Vert\) represents the number of sentences in the summary.
It’s indeed important to consider that perfect agreement among experts may not always be achievable due to subjective interpretations and variations in perceptions, especially in tasks like summarization where multiple valid interpretations exist. Retaining sentences that do not achieve unanimous agreement allows for a more diverse and comprehensive dataset.
Redundancy: Redundancy (RED) occurs when information is unnecessarily repeated in a summary, making it less effective at conveying the most important or informative parts. Redundancy can be calculated using the ROUGE score to measure the overlap between sentences in the summary. Various authors calculate redundancy for single-line summaries using the formula of Reference [13], based on the frequency of n-grams in a sentence, but in our case the summaries span multiple lines. Hence, redundancy is calculated using the generalized metric given by Reference [3]. Equation (2) computes the average ROUGE score across all pairs of distinct sentences x and y of the summary: the ROUGE scores of all possible unique sentence pairs are calculated, and their average is reported.
\begin{equation} RED(S_i) = \underset{(x,y)\in S_{i} \times S_{i},\ x \neq y}{\operatorname{mean}}\, ROUGE(x,y), \end{equation}
(2)
where sentences x and y have lengths m and n, respectively. ROUGE-N is calculated using N-gram recall, as given in Equation (3). The term N stands for N-gram co-occurrence; R, P, and F denote recall, precision, and F-measure; and L stands for the longest common subsequence. ROUGE-L is the F-measure calculated using Equation (4).
\begin{equation} ROUGE\text{-}N(x,y)=\frac{\sum _{s \in \text{reference summary}}\ \sum _{gram_n \in s} count_{match}(gram_n)}{\sum _{s \in \text{reference summary}}\ \sum _{gram_n \in s} count(gram_n)}, \end{equation}
(3)
\begin{equation} R_{L}=\frac{LCS(x,y)}{m} , \end{equation}
(4a)
\begin{equation} P_{L}=\frac{LCS(x,y)}{n} , \end{equation}
(4b)
\begin{equation} F_{L}=\frac{\left(1+\beta ^2 \right)R_{L}P_{L}}{R_{L}+\beta ^2 P_{L}}. \end{equation}
(4c)
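As an illustration of the redundancy metric in Equation (2), the sketch below computes a simple ROUGE-N recall (in place of a full ROUGE implementation) and averages it over all pairs of distinct summary sentences.

```python
# Sketch of the redundancy metric in Equation (2): the mean ROUGE-N score over
# all ordered pairs of distinct sentences in a summary. A plain n-gram recall
# is used here instead of a full ROUGE package.
from collections import Counter
from itertools import permutations

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n=1):
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())  # clipped match counts
    return overlap / sum(ref.values())

def redundancy(summary_sentences, n=1):
    """Mean pairwise ROUGE-N over all ordered pairs of distinct sentences."""
    pairs = list(permutations(summary_sentences, 2))
    if not pairs:
        return 0.0  # a single-sentence summary has no pairwise overlap
    return sum(rouge_n_recall(x, y, n) for x, y in pairs) / len(pairs)
```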
Novel n-gram ratio: The novel n-gram ratio [21] assesses the quality of a summary by calculating the proportion of n-grams in the summary that are not present in the source text, relative to the total number of n-grams in the summary. It indicates how much new phrasing the summary introduces relative to the source text. It is given by Equation (5).
\begin{equation} \text{novel n-gram ratio} = \frac{\text{number of n-grams in } S \text{ that are not in } T}{\text{total number of n-grams in } S}. \end{equation}
(5)
Abstractivity: Abstractivity (ABS) uses a greedy approach to match the shared fragments between the summary sentences and the source text [10]. As defined by the authors, it is calculated using the fragment coverage \(\mathcal {F}(T_i, S_i)\), the degree to which the summary contains the essential information of the source text. To calculate fragment coverage, the source text is divided into smaller units, such as sentences or paragraphs, and each unit is marked as essential or non-essential. The summary is then evaluated to see whether it contains all the essential units: if it does, it has high fragment coverage, while it has low fragment coverage if it omits essential units or includes non-essential ones. ABS is calculated using a normalized version of fragment coverage, given by Equation (6), where \(|f |\) denotes the length of a fragment f.
\begin{equation} ABS (T_i, S_i) = 1-\frac{\sum _{f\in \mathcal {F}(T_i,S_i)}|f |}{|S_i |}. \end{equation}
(6)
Conciseness: This metric captures how compactly the summary conveys the text relative to its length. It is also called compression (C) [3] and is defined by Equation (7). The higher the value of C, the better.
\begin{equation} C(T_i,S_i)=1-{\frac{\Vert {S_i}\Vert }{\Vert {T_i}\Vert }}. \end{equation}
(7)
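The remaining intrinsic measures reduce to simple counting; the sketch below illustrates the novel n-gram ratio (Equation (5)) and conciseness (Equation (7)), with sentence splitting on the Devanagari danda as an assumption for illustration.

```python
# Sketch of the novel n-gram ratio (Equation (5)) and conciseness (Equation (7)).
# Splitting sentences on the Devanagari danda is an assumption for illustration.
import re

def novel_ngram_ratio(text, summary, n=1):
    def grams(s):
        toks = s.split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    summary_grams = grams(summary)
    if not summary_grams:
        return 0.0
    text_grams = set(grams(text))
    novel = sum(1 for g in summary_grams if g not in text_grams)
    return novel / len(summary_grams)

def split_sentences(s):
    return [x for x in re.split(r"[।!?\n]+", s) if x.strip()]

def conciseness(text, summary):
    """1 - (number of sentences in the summary / number of sentences in the text)."""
    return 1 - len(split_sentences(summary)) / len(split_sentences(text))
```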
The quality of the HindiSumm dataset is validated by the intrinsic evaluation results presented in Table 4. Compared with other state-of-the-art datasets, HindiSumm demonstrates superior performance. This can be attributed to the extensive pre-processing applied to the data as well as the inclusion of multi-line summaries. It is important to note that none of these measures alone can fully capture the complexity of redundancy in summarization, and human evaluation is often necessary to determine the overall quality and usefulness of a summary. A sample of the HindiSumm dataset is provided via the link.2
Table 4. Results of the Intrinsic Evaluation Metrics

Dataset | Summary length | Redundancy (r=1) | Redundancy (r=L) | Novel n-grams (n=1) | (n=2) | (n=3) | (n=4) | Abstractivity | Conciseness
XL-Sum [13] | Single-line | 0.09 | 0.37 | 0.29 | 0.75 | 0.91 | 0.96 | 0.65 | 0.93
Wikilingua [17] | Multi-line | 0.24 | 0.54 | 0.34 | 0.78 | 0.92 | 0.95 | 0.75 | 0.81
MassiveSumm [27] | Single-line | 0.58 | 0.31 | 0.27 | 0.56 | 0.69 | 0.72 | 0.81 | 0.90
HindiSumm | Multi-line | 0.03 | 0.12 | 0.38 | 0.77 | 0.95 | 0.96 | 0.83 | 0.81

5 Experiments and Results Discussion

To assess the performance and effectiveness of the HindiSumm dataset, we conducted fine-tuning experiments using two popular multilingual pre-trained models: the multilingual Text-To-Text Transfer Transformer (mT5) and mBART25. By leveraging the capabilities of these models, we aimed for a balanced and comprehensive evaluation that accounts for the limitations of the individual models. This approach allowed us to explore the strengths and weaknesses of the proposed dataset more holistically, considering the collective insights obtained by ensembling the mT5 and mBART25 models (see Figure 3). The accuracy of the summary predictions is improved by combining the predictions of the mT5 and mBART models with a weighted average. This approach assigns a weight to each model’s predictions, and a grid search was employed to determine the optimal weights [2]; a sketch of this weighting and grid search follows Figure 3.
Fig. 3. Ensemble prediction model using weighted average.
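The article does not detail how the weighted average is formed during decoding; assuming token-level interpolation of the two models’ output distributions, the weighting and the grid search could be sketched as follows, where ensemble_generate and rouge_l are hypothetical helpers.

```python
# Sketch of the weighted-average ensembling and grid search described above.
# Token-level probability interpolation is an assumption; ensemble_generate and
# rouge_l are hypothetical helpers supplied by the caller.
import numpy as np

def ensemble_step_probs(p_mt5, p_mbart, w):
    """Combine the two models' next-token distributions with weight w."""
    return w * p_mt5 + (1.0 - w) * p_mbart

def grid_search_weight(val_articles, val_references, ensemble_generate, rouge_l):
    """Pick the weight in [0, 1] that maximizes validation ROUGE-L."""
    best_w, best_score = None, -1.0
    for w in np.arange(0.0, 1.01, 0.05):          # coarse grid over [0, 1]
        preds = [ensemble_generate(a, w) for a in val_articles]
        score = float(np.mean([rouge_l(p, r) for p, r in zip(preds, val_references)]))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```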

5.1 Experimental Setup

In this experiment, the two models are fine-tuned on the proposed HindiSumm dataset. The experimental setup for each model is described in the following section:
mT5: mT5 is based on the transformer architecture, known for its self-attention mechanism and scalability. It consists of an encoder-decoder structure suitable for a variety of text-based tasks. mT5 results from pre-training a T5 model on 101 languages, with about 24 billion tokens for Hindi. It is among the largest publicly available multilingual pre-trained models for NLP applications. The mT5 model is trained with a masked language modeling (MLM) objective.
mBART25: Multilingual BART 25 is a version of the Bidirectional and Auto-Regressive Transformers (BART) model adapted for handling text in multiple languages. BART is a sequence-to-sequence transformer model that combines the bidirectional encoding found in models like BERT with the autoregressive decoding of models like GPT. mBART extends this framework to multiple languages, making it versatile for multilingual NLP tasks such as translation, summarization, and other text generation activities. It is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora for 25 languages with multiple denoising objectives. It was trained on 1.7 billion tokens for Hindi and has 12 encoder and 12 decoder layers.
The dataset is split into a 90:10 ratio for training and testing. The training split consists of 513,000 samples, of which 50,000 are held out as a validation set; the test set contains 57,000 samples. The same set of hyperparameters is used to fine-tune the mT5 and mBART models, ensuring a fair comparison of their performance. Both models were fine-tuned for ten epochs with a batch size of 16. The Adam optimizer was employed with a maximum learning rate of \(10^{-5}\) and a weight decay of 0.02. To accommodate memory constraints, input sequences were truncated to 1,024 tokens and output sequences were limited to 128 tokens. During inference with the ensemble, beam search of size four with a length penalty of 0.6 was used [29]. The trained models were evaluated on three abstractive summarization benchmark datasets, XL-Sum, Wikilingua, and MassiveSumm, in addition to HindiSumm.
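A sketch of this fine-tuning setup using the Hugging Face Transformers Seq2SeqTrainer is given below; the checkpoint names, data files, and field names are assumptions, and the authors’ actual training scripts may differ.

```python
# Sketch of the fine-tuning setup described above using Hugging Face Transformers.
# Checkpoint names, data files, and JSON field names are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "google/mt5-base"            # or "facebook/mbart-large-cc25" for mBART25
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

raw = load_dataset("json", data_files={"train": "hindisumm_train.json",
                                       "validation": "hindisumm_val.json"})

def preprocess(batch):
    # Truncation limits follow Section 5.1: 1,024 input tokens, 128 output tokens.
    inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
    labels = tokenizer(batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = raw.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="hindisumm-mt5",
    num_train_epochs=10,                  # hyperparameters from Section 5.1
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    weight_decay=0.02,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()
```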

5.2 Discussion

The performance of the summarization task is measured using the n-gram co-occurrence metrics ROUGE-1, ROUGE-2, and ROUGE-L. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the similarity between two summaries [18], as given in Equation (3); the generated summary is compared against the gold-standard summary. Table 5 reports the performance of the fine-tuned mT5 and mBART25 models, while Table 6 shows the results of the ensemble model on the XL-Sum, Wikilingua, and MassiveSumm datasets and our proposed HindiSumm dataset. Among the benchmark datasets, the HindiSumm dataset demonstrates superior performance, and on it the mT5 model outperforms the mBART25 model. Combining these models in an ensemble yields the best results overall: the ensemble applied to the HindiSumm dataset notably improves the ROUGE scores, i.e., 45.38 (R-1), 21.50 (R-2), and 43.20 (R-L). Specifically, the R-1, R-2, and R-L scores are enhanced by 10.15%, 17.36%, and 12.09%, respectively, compared with the best individual model. This is because of the large sample size and the presence of multi-line summaries in the HindiSumm dataset. Based on the results, we have also performed an error analysis (see Section A.1) of the proposed dataset to identify the nature of the errors present.
Table 5. Performance of the mT5 and mBART25 Models on Various Datasets (R/P/F = ROUGE recall/precision/F-measure)

Dataset | Model | ROUGE-1 (R / P / F) | ROUGE-2 (R / P / F) | ROUGE-L (R / P / F)
XL-Sum | mT5 | 34.51 / 58.66 / 43.46 | 13.55 / 52.68 / 21.56 | 28.23 / 68.66 / 40.01
XL-Sum | mBART25 | 31.82 / 60.26 / 41.65 | 9.03 / 41.79 / 14.85 | 25.33 / 60.39 / 35.69
Wikilingua | mT5 | 33.68 / 63.74 / 44.07 | 12.74 / 49.29 / 20.25 | 27.62 / 68.81 / 39.42
Wikilingua | mBART25 | 28.56 / 59.38 / 38.57 | 18.32 / 40.25 / 25.18 | 22.89 / 59.44 / 33.05
MassiveSumm | mT5 | 36.05 / 63.15 / 45.90 | 16.13 / 51.37 / 24.55 | 34.90 / 69.42 / 46.45
MassiveSumm | mBART25 | 31.05 / 59.84 / 40.90 | 17.54 / 43.48 / 25.00 | 23.67 / 60.56 / 34.04
HindiSumm | mT5 | 41.20 / 65.23 / 50.50 | 17.22 / 53.24 / 26.02 | 38.54 / 70.23 / 49.77
HindiSumm | mBART25 | 34.66 / 58.45 / 43.52 | 14.38 / 42.01 / 21.43 | 34.17 / 62.18 / 44.10
Table 6. Performance of the Ensemble Model on Various Datasets (R/P/F = ROUGE recall/precision/F-measure)

Dataset | ROUGE-1 (R / P / F) | ROUGE-2 (R / P / F) | ROUGE-L (R / P / F)
XL-Sum | 36.22 / 67.11 / 47.05 | 13.61 / 53.52 / 21.70 | 28.33 / 71.57 / 40.59
Wikilingua | 33.82 / 66.87 / 44.92 | 20.65 / 53.01 / 29.72 | 30.66 / 70.43 / 42.72
MassiveSumm | 36.98 / 67.04 / 47.67 | 19.11 / 53.19 / 28.12 | 25.49 / 68.33 / 37.13
HindiSumm | 45.38 / 69.27 / 54.84 | 21.50 / 54.25 / 30.08 | 43.20 / 72.69 / 54.19

6 Conclusion

The article introduces HindiSumm, an extensive abstractive text summarization dataset for Hindi. It contains 570,000 samples from the Navbharat Times, making it the largest open-source benchmark for Hindi summarization. HindiSumm offers multi-line summaries, distinguishing it from other datasets. The dataset is carefully curated and evaluated intrinsically and extrinsically, demonstrating its effectiveness in producing concise and abstractive summaries that capture the main ideas. mT5 and mBART25 models are fine-tuned individually and then combined using an ensembling approach on the HindiSumm dataset. Results from the individual and ensembled approach are compared with state-of-the-art datasets, showcasing the dataset’s efficacy. Error analysis is conducted to understand evaluation challenges, and dataset scripts are provided for researchers to contribute and expand. To conclude, HindiSumm can be useful for the NLP community to work on low-resource languages. In the future, we plan to explore other potential uses for our dataset, such as cross-lingual summarization tasks.

Footnotes

A Appendix

A.1 Error Analysis

While the HindiSumm dataset demonstrates satisfactory performance when used to train mT5, mBART25, and the ensemble model, it is essential to identify any errors that occur and understand their underlying causes. We noticed that, even after fine-tuning the pre-trained models, there were a few sentences for which the models did not predict the summary accurately. Digging further into the dataset, we concluded that a few anomalies in the data confuse the model. We therefore took a random sample of 100 text-summary pairs and examined their summaries. While analyzing these pairs, we noted that texts with fewer lines and few words mostly yield less accurate summaries than texts with multiple sentences. Hence, avoiding very short texts of only 5–6 words is always suggested in order to generate high-quality, meaningful summaries; the HindiSumm dataset is free from such sentences.
Furthermore, we observed that some summaries contain information such as abbreviations, synonyms, and morphological variants (different morphs of the same lemma) that differ from the reference summary, and in some cases extraneous (repetitive or unnecessary) information is also responsible for low scores. We encountered another problem: a word can be misleading on its own or, combined with other information, distract attention from the topic under discussion. Figure 4 shows examples extracted from the HindiSumm dataset illustrating morphological errors, synonym errors, abbreviation errors, and unnecessary-information errors. Such errors in the dataset are a major cause of low-quality summaries; encouragingly, very few summaries exhibit them. Although human experts write the summaries in the HindiSumm dataset, the possibility of errors still exists, which can result in low-quality summary generation. Therefore, there is always room to improve the overall quality and accuracy of the summaries. Figure 5 depicts a snapshot of a sample article from the Navbharat Times showing the summary and article/text scraped from the e-newspaper.
Fig. 4. Examples of errors: (A) morphological errors, (B) abbreviation errors, (C) synonym errors, and (D) extra information.
Fig. 5. Snapshot of a sample article from the Navbharat Times.

References

[1]
Ayham Alomari, Norisma Idris, Aznul Qalid Md Sabri, and Izzat Alsmadi. 2022. Deep reinforcement and transfer learning for abstractive text summarization: A review. Computer Speech & Language 71 (2022), 101276. DOI:
[2]
James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, 2 (2012), 281–305.
[3]
Rishi Bommasani and Claire Cardie. 2020. Intrinsic evaluation of summarization datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 8075–8096.
[4]
Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, and Cécile Pereira. 2019. STRASS: A light and effective method for extractive summarization based on sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. Association for Computational Linguistics, Florence, Italy, 243–252. DOI:
[5]
J. Briskilal and C. N. Subalalitha. 2022. An ensemble model for classifying idioms and literal texts using BERT and RoBERTa. Information Processing & Management 59, 1 (2022), 102756.
[6]
Hugh A. Chipman, Edward I. George, Robert E. McCulloch, and Thomas S. Shively. 2022. mbart: Multidimensional monotone bart. Bayesian Analysis 17, 2 (2022), 515–544.
[7]
Günes Erkan and Dragomir R. Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22 (2004), 457–479.
[8]
Fatih Ertam and Galip Aydin. 2022. Abstractive text summarization using deep learning with a new Turkish summarization benchmark dataset. Concurrency and Computation: Practice and Experience 34, 9 (2022), e6482.
[9]
Joseph L. Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33, 3 (1973), 613–619.
[10]
Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 Million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 708–719.
[11]
Vishal Gupta and Gurpreet Singh Lehal. 2010. A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence 2, 3 (2010), 258–268.
[12]
Donna Harman and Paul Over. 2004. The effects of human variation in DUC summarization evaluation. Text Summarization Branches Out. 10–17.
[13]
Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M Sohel Rahman, and Rifat Shahriyar. 2021. XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 4693–4703.
[14]
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Proceedings of the 28th International Conference on Neural Information Processing Systems 1 (2015), 1693–1701.
[15]
Baotian Hu, Qingcai Chen, and Fangze Zhu. 2015. LCSTS: A large scale chinese short text summarization dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1967–1972.
[16]
Kemal Kurniawan and Samuel Louvan. 2018. Indosum: A new benchmark dataset for indonesian text summarization. In 2018 International Conference on Asian Language Processing (IALP’18). IEEE, 215–220.
[17]
Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen Mckeown. 2020. WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020. 4034–4048.
[18]
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
[19]
Parth Mehta. 2016. From extractive to abstractive summarization: A journey. In Proceedings of the ACL 2016 Student Research Workshop. 100–106.
[20]
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. 280–290.
[21]
Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1797–1807.
[22]
Philip Resnik and Jimmy Lin. 2010. Evaluation of NLP systems. The Handbook of Computational Linguistics and Natural Language Processing (2010), 271–295.
[23]
Evan Sandhaus. 2008. New york times corpus: Corpus overview. LDC Catalogue Entry LDC2008T19 (2008), 1–22.
[24]
Brenda Santana, Ricardo Campos, Evelin Amorim, Alípio Jorge, Purificação Silvano, and Sérgio Nunes. 2023. A survey on narrative extraction from textual data. Artificial Intelligence Review 56, 8 (2023), 8393–8435.
[25]
Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2020. MLSUM: The multilingual summarization corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 8051–8067.
[26]
Shengli Song, Haitao Huang, and Tongxiao Ruan. 2019. Abstractive text summarization using LSTM-CNN based deep learning. Multimedia Tools and Applications 78, 1 (2019), 857–875.
[27]
Daniel Varab and Natalie Schluter. 2021. MassiveSumm: A very large-scale, very multilingual, news summarisation dataset. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 10150–10161.
[28]
Pradeepika Verma, Sukomal Pal, and Hari Om. 2019. A comparative analysis on Hindi and English extractive text summarization. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 18, 3 (2019), 1–39.
[29]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144. Retrieved from https://arxiv.org/abs/1609.08144
[30]
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 483–498.

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 12
December 2024, 237 pages
EISSN: 2375-4702
DOI: 10.1145/3613720

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 November 2024
Online AM: 17 September 2024
Accepted: 03 September 2024
Revised: 19 July 2024
Received: 11 August 2023
Published in TALLIP Volume 23, Issue 12


Author Tags

  1. Natural Language Processing (NLP)
  2. Abstractive Text Summarization (ATS)
  3. Deep Learning (DL)
  4. text-to-text transfer transformer (T5)
  5. Hindi dataset
