A Statistical Language Modeling Framework for Extractive Summarization of Text Documents

Gupta, Pooja; Nigam, Swati; Singh, Rajiv

doi:10.1007/s42979-023-02241-x

Original Research
Published: 28 September 2023

Volume 4, article number 750, (2023)
SN Computer Science

A Correction to this article was published on 08 November 2023

This article has been updated

Abstract

Large collections of text documents, such as tweets, web pages, news articles, and stories, are now available on a wide variety of topics and in many languages. Faced with this volume of electronic text, users tire of reading entire documents and struggle to find relevant information. Text summarization addresses this problem by retrieving the important information from multiple documents. The summarization process should be robust, language-independent, and able to generate summaries from multi-language corpora. With this aim, we propose a language-independent framework for extractive text summarization. The framework is developed primarily on English documents; the BBC, CNN, and DUC 2004 news story corpora are then translated from English to Hindi and used in this study. Because no Hindi corpus exists for these datasets, we used the Google, Bing, and Systran machine translators to generate the Hindi text and to show the effectiveness of the proposed approach for a low-resource language. The proposed method is based on an N-gram language model and statistical modeling using maximum likelihood estimation (MLE). Sentence ranking is used to generate summaries of English and Hindi documents. We evaluate our model on the BBC, CNN, and DUC 2004 news story datasets in both English and Hindi, using the ROUGE metric at different levels to measure the quality of the generated summaries. Further, we compare the proposed method with existing baseline and state-of-the-art methods, such as LexRank, TextRank, Lead, Luhn, LSA, and SumBasic. Experimental results indicate that our approach is more effective than these existing summarization methods.
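To make the idea of ranking sentences with an MLE-estimated N-gram model concrete, the following is a minimal sketch, not the authors' implementation. It assumes a bigram model (N = 2), estimates P(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1}) from the document itself, and scores each sentence by its average bigram probability. The scoring rule, the tokenizer, and the function names (`tokenize`, `train_bigram_mle`, `score_sentence`, `summarize`) are all illustrative assumptions; the abstract does not specify them.

```python
import re
from collections import Counter

PUNCT = ".,;:!?\"'()[]।"  # includes the Devanagari danda used in Hindi


def tokenize(sentence):
    # Whitespace tokenization keeps Devanagari words intact; punctuation
    # is stripped from token edges. (Assumed preprocessing, not the paper's.)
    tokens = (t.strip(PUNCT) for t in sentence.lower().split())
    return [t for t in tokens if t]


def train_bigram_mle(tokenized_sentences):
    # MLE for a bigram model: P(cur | prev) = c(prev, cur) / c(prev),
    # where c(prev) counts prev as a bigram context.
    bigrams, contexts = Counter(), Counter()
    for tokens in tokenized_sentences:
        padded = ["<s>"] + tokens  # sentence-start symbol
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return {bg: n / contexts[bg[0]] for bg, n in bigrams.items()}


def score_sentence(tokens, probs):
    # Hypothetical scoring rule: mean bigram probability of the sentence.
    padded = ["<s>"] + tokens
    pairs = list(zip(padded, padded[1:]))
    if not pairs:
        return 0.0
    return sum(probs.get(p, 0.0) for p in pairs) / len(pairs)


def summarize(text, k=2):
    # Split on sentence-final punctuation (including the danda), score
    # every sentence, and return the top-k in original document order.
    sentences = [s for s in re.split(r"(?<=[.!?।])\s+", text.strip()) if s]
    tokenized = [tokenize(s) for s in sentences]
    probs = train_bigram_mle(tokenized)
    scores = [score_sentence(t, probs) for t in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Because the tokenizer relies only on whitespace and punctuation, the same code runs unchanged on English and Hindi text, which is the sense in which such a pipeline can be language-independent. Note that unsmoothed MLE assigns probability 1 to any bigram whose context word occurs only once, so a practical system would likely need smoothing or frequency weighting on top of this sketch.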

Data Availability

The data associated with this work will be provided on reasonable request.

Change history

08 November 2023
A Correction to this paper has been published: https://doi.org/10.1007/s42979-023-02456-y

Author information

Authors and Affiliations

Department of Computer Science & Centre for Artificial Intelligence, Banasthali Vidyapith, Rajasthan, India
Pooja Gupta, Swati Nigam & Rajiv Singh

Corresponding author

Correspondence to Rajiv Singh.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest regarding this manuscript and that they received no funding for this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Research Trends in Computational Intelligence” guest edited by Anshul Verma, Pradeepika Verma, Vivek Kumar Singh, and S. Karthikeyan.

The original online version of this article was revised: in Table 10, the header of the second part was mistakenly repeated in the first part during the proof correction stage. The table has now been corrected.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gupta, P., Nigam, S. & Singh, R. A Statistical Language Modeling Framework for Extractive Summarization of Text Documents. SN COMPUT. SCI. 4, 750 (2023). https://doi.org/10.1007/s42979-023-02241-x

Received: 05 February 2023
Accepted: 09 August 2023
Published: 28 September 2023
DOI: https://doi.org/10.1007/s42979-023-02241-x