Abstract
The availability of a large collection of text documents on a variety of topics, such as tweets, web pages, news articles, and stories, in different languages. Due to these electronic documents, users get exhausted reading the entire document and face many difficulties finding relevant information. Text summarization is the solution for retrieving important information from multiple documents. The summarization process should be robust, language-independent, and able to generate summaries from multi-language corpora. With this aim, in this work, we propose a language-independent framework of extractive text summarization for the text documents, which are primarily based on English, and then the translated corpus of BBC, CNN, and DUC 2004 News stories from English to Hindi is exploited for this study. In the absence of a Hindi corpus for these datasets, we have used Google, Bing, and Systran machine translators for Hindi text generation to show the effectiveness of the proposed approach for a low resource language. The proposed method is based on the N-gram language model and statistical modeling using maximum likelihood estimation (MLE). Sentence ranking has been used to generate summaries of English and Hindi documents. We evaluate our model on the BBC, CNN, and DUC 2004 news stories datasets in both English and Hindi using the ROUGE metric at different levels to measure the performance of the generated summaries. Further, we compare the proposed method with other existing baseline and state-of-the-art methods, such as LexRank, TextRank, Lead, Luhn, LSA, and SumBasic. Experimental results of our approach indicate its effectiveness over existing summarization methods.
Similar content being viewed by others
Data Availability
The data associated with this work will be provided on a reasonable request.
Change history
08 November 2023
A Correction to this paper has been published: https://doi.org/10.1007/s42979-023-02456-y
References
ElKassas WS, Salama CR, Rafea AA, Mohamed HK. Automatic text summarization: a comprehensive survey. Expert Syst Appl. 2021;165:113679.
Gambhir M, Gupta V. Recent automatic text summarization techniques: a survey. Artif Intell Rev. 2017;47(1):1–66.
Ferreira R, De Souza Cabral L, Lins RD, Silva GP, Freitas F, Cavalcanti GD, Favaro L. Assessing sentence scoring techniques for extractive text summarization. Expert Syst Appl. 2013;40(14):5755–64.
Gupta V, Lehal GS. A survey of text summarization extractive techniques. J Emerg Technol Web Intell. 2010;2(3):258–68.
Gao S, Chen X, Li P, Ren Z, Bing L, Zhao D, Yan R. Abstractive text summarization by incorporating reader comments. Proc AAAI Conf Artif Intell. 2019;33:6399–406.
Neto JL, Freitas AA, Kaestner CA. Automatic text summarization using a machine learning approach. In: Brazilian symposium on artificial intelligence. Berlin, Heidelberg: Springer; 2002. p. 205–15.
https://www.tensorflow.org/datasets/catalog/cnn_dailymail. Accessed 29 July 2022.
https://www.kaggle.com/pariza/bbc-news-summary. Accessed 29 July 2022.
https://www.kaggle.com/datasets/usmanniazi/duc-2004-dataset. Accessed 29 July 2022.
https://www.microsofttranslator.com. Accessed 01 Aug 2022.
https://translate.goolge.com. Accessed 01 Aug 2022.
https://www.systran.net/en/translate/. Accessed 01 Aug 2022.
Lin CY, Hovy E. Automatic evaluation of summaries using n-gram co-occurrence statistics. In: Proceedings of the 2003 human language technology conference of the North American chapter of the association for computational linguistics. ACL; 2003. p. 150–157.
Hong K, Nenkova A. Improving the estimation of word importance for news multi-document summarization. In: Proceedings of the 14th conference of the European chapter of the association for computational linguistics. ACL; 2014. p. 712–721.
Chiche A, Yitagesu B. Part of speech tagging: a systematic review of deep learning and machine learning approaches. J Big Data. 2022;9(1):1–25.
Lovins JB. Development of a stemming algorithm. Mech Transl Comput Linguist. 1968;11(1–2):22–31.
Moratanch N, Chitrakala S. A survey on extractive text summarization. In: 2017 international conference on computer, communication and signal processing (ICCCSP) ACL. 2017. p. 1–6.
Lin CY. Rouge: a package for automatic evaluation of summaries. In: Text summarization branches out. IEEE; 2004. p. 74–81.
Mallick C, Das AK, Dutta M, Das AK, Sarkar A. Graph-based text summarization using modified TextRank. In: Soft computing in data analytics. Singapore: Springer; 2019. p. 137–46.
Elbarougy R, Behery G, El Khatib A. Extractive Arabic text summarization using modified PageRank algorithm. Egypt Inf J. 2020;21(2):73–81.
Mihalcea R. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In: Proceedings of the ACL interactive poster and demonstration sessions. ACL; 2004. p. 170–173.
Radev DR, Allison T, Blair-Goldensohn S, Blitzer J, Celebi A, Dimitrov S, Zhang Z, MEAD-a platform for multidocument multilingual text summarization. In: Proceedings of the 4th international conference on language resources and evaluation. Lisbon, Portugal, 2004. p. 699–702.
Abdulateef S, Khan NA, Chen B, Shang X. Multidocument Arabic text summarization based on clustering and Word2Vec to reduce redundancy. Information. 2020;11(2):59.
Oufaida H, Blache P, Nouali O. Using distributed word representations and mRMR discriminant analysis for multilingual text summarization. In: International Conference on Applications of Natural Language to Information Systems. Cham: Springer; 2015. p. 51–63.
Kaljahi R, Foster J, Roturier J. Semantic Role Labelling with minimal resources: Experiments with French. In: * SEM@ COLING. 2014. p. 87–92
Kabadjov M, Atkinson M, Steinberger J, Steinberger R, Goot EVD. NewsGist: a multilingual statistical news summarizer. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Berlin, Heidelberg: Springer; 2010. p. 591–4.
Rani R, Lobiyal DK. Document vector embedding based extractive text summarization system for Hindi and English text. Appl Intell. 2022;52:9353–72.
Edmundson HP. New methods in automatic extracting. JACM. 1969;16(2):264–85.
Luhn HP. The automatic creation of literature abstracts. IBM J Res Dev. 1958;2(2):159–65.
Koh HY, Ju J, Liu M, Pan S. An empirical survey on long document summarization: datasets, models, and metrics. ACM Comput Surv. 2022;55:1–35.
Mishra R, Bian J, Fiszman M, Weir CR, Jonnalagadda S, Mostafa J, Del Fiol G. Text summarization in the biomedical domain: a systematic review of recent research. J Biomed Inf. 2014;52:457–67.
Afsharizadeh M, Ebrahimpour-Komleh H, Bagheri A. Query-oriented text summarization using sentence extraction technique. In: 2018 4th international conference on web research (ICWR). IEEE; 2018. p. 128–32.
Yang K, He H, Al.Sabahi K, Zhang Z. EcForest: extractive document summarization through enhanced sentence embedding and cascade forest. Concurr Comput Pract Exp. 2019;31(17):e5206.
Yousefi-Azar M, Hamey L. Text summarization using unsupervised deep learning. Expert Syst Appl. 2017;68:93–105.
Erkan G, Radev DR. Lexrank: graph-based lexical centrality as salience in text summarization. J Artif Intell Res. 2004;22:457–79.
https://www.nltk.org/nltk_data/. Accessed 02 Aug 2022
Shrivastava M, Bhattacharyya P, Hindi POS tagger using naive stemming: harnessing morphological information without extensive linguistic knowledge. In: International Conference on NLP (ICON08). Pune, India. ACL; 2008.
Porter MF. An algorithm for suffix stripping program: electronic library and information systems. Program. 1980;14(3):130–7.
Chouigui A, Ben Khiroun O, Elayeb B. An arabic multi-source news corpus: experimenting on single-document extractive summarization. Arab J Sci Eng. 2021;46(4):3925–38.
Alami N, En-nahnahi N, Ouatik SA, Meknassi M. Using unsupervised deep learning for automatic summarization of Arabic documents. Arab J Sci Eng. 2018;43(12):7803–15.
http://www-nlpir.nist.gov/related_projects/tipster_summac/cmp_lg.html. Accessed 07 Dec 2022.
https://catalog.ldc.upenn.edu/LDC2003T05. Accessed 07 Dec 2022.
Koupaee M, Wang Y. WikiHow: a large scale text summarization dataset. arXiv preprint. arXiv:1810.09305 (2018)
Nenkova A, Vanderwende L. The impact of frequency on summarization. Redmond: Microsoft Research; 2005. p. 101.
Joshi A, Fidalgo E, Alegre E, Alaiz-Rodriguez R. RankSum—an unsupervised extractive text summarization based on rank fusion. Expert Syst Appl. 2022;200: 116846.
Joshi A, Fidalgo E, Alegre E, Fernández-Robles L. SummCoder: an unsupervised framework for extractive text summarization based on deep auto-encoders. Expert Syst Appl. 2019;129:200–15.
Abualigah L, Bashabsheh MQ, Alabool H, Shehab M. Text summarization: a brief review. In: Recent Advances in NLP: the case of Arabic language. Cham: ACL; 2020. p. 1–15.
Bialy AA, Gaheen MA, ElEraky RM, ElGamal AF, Ewees AA, Single Arabic document summarization using natural language processing technique. In: Recent Advances in NLP: The Case of Arabic Language. Cham: ACL; 2020. p. 17–37.
Fakhrezi MF, Bijaksana MA, Huda AF. Implementation of automatic text summarization with TextRank method in the development of Al-qur’an vocabulary encyclopedia. Procedia Computer Science. 2021;179:391–8.
Yadav D, Desai J, Yadav AK. Automatic text summarization methods: a comprehensive review. 2022. arXiv preprint arXiv:2204.01849
Elsaid A, Mohammed A, Ibrahim LF, Sakre MM. A comprehensive review of arabic text summarization. IEEE Access. 2022;10:38012–30.
Gulati V, Kumar D, Popescu DE, Hemanth JD. Extractive article summarization using integrated TextRank and BM25+ algorithm. Electronics. 2023;12(2):372.
Cajueiro DO, Nery AG, Tavares I, De Melo MK, Reis SAD, Weigang L, Celestino VR. A comprehensive review of automatic text summarization techniques: method, data, evaluation and coding. 2023. arXiv:2301.03403
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interest
The authors declare that there is no conflict of interest regarding this manuscript and received no funding for this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Research Trends in Computational Intelligence” guest edited by Anshul Verma, Pradeepika Verma, Vivek Kumar Singh, and S. Karthikeyan.
The original online version of this article was revised: Due to the header of the second part was used (repeated) in the first part by mistake during the proof correction stage in Table 10. Now, the table has been corrected.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gupta, P., Nigam, S. & Singh, R. A Statistical Language Modeling Framework for Extractive Summarization of Text Documents. SN COMPUT. SCI. 4, 750 (2023). https://doi.org/10.1007/s42979-023-02241-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s42979-023-02241-x