
A Statistical Language Modeling Framework for Extractive Summarization of Text Documents

  • Original Research
  • Published:
SN Computer Science

A Correction to this article was published on 08 November 2023


Abstract

A large collection of text documents on a variety of topics, such as tweets, web pages, news articles, and stories, is now available in many languages. Faced with this volume of electronic documents, users become exhausted reading entire documents and struggle to find relevant information. Text summarization addresses this problem by retrieving the important information from one or more documents. The summarization process should be robust, language-independent, and able to generate summaries from multi-language corpora. With this aim, we propose a language-independent framework for extractive summarization of text documents. The framework is first applied to English documents, and then to the BBC, CNN, and DUC 2004 news stories translated from English to Hindi. In the absence of a Hindi corpus for these datasets, we used the Google, Bing, and Systran machine translators to generate the Hindi text, demonstrating the effectiveness of the proposed approach for a low-resource language. The proposed method is based on an N-gram language model and statistical modeling using maximum likelihood estimation (MLE); sentence ranking is then used to generate summaries of the English and Hindi documents. We evaluate our model on the BBC, CNN, and DUC 2004 news stories datasets in both English and Hindi using the ROUGE metric at different levels to measure the quality of the generated summaries. Further, we compare the proposed method with existing baseline and state-of-the-art methods, such as LexRank, TextRank, Lead, Luhn, LSA, and SumBasic. Experimental results indicate the effectiveness of our approach over these existing summarization methods.
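To make the described pipeline concrete, the sketch below illustrates one way an N-gram MLE sentence-ranking summarizer of the kind outlined in the abstract can be implemented. It is a minimal illustration under stated assumptions, not the authors' actual system: the tokenizer, the sentence splitter, the interpolation weight lam, and the summary length n_sentences are all choices made for this example.

```python
# Minimal sketch (not the authors' exact implementation): score each sentence
# by the average log-probability of its words under unigram/bigram MLE models
# estimated from the document itself, then pick the top-ranked sentences.
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a real system would use NLTK or an Indic tokenizer."""
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    """Naive sentence splitter on ., ?, ! and the Hindi danda (।)."""
    return [s.strip() for s in re.split(r"[.?!।]", text) if s.strip()]


def mle_models(sentences):
    """Unigram and bigram counts over the whole document (MLE estimates)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = tokenize(sent)
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams


def sentence_score(sent, unigrams, bigrams, total, lam=0.7):
    """Length-normalized, interpolated unigram/bigram log-probability."""
    toks = tokenize(sent)
    if not toks:
        return float("-inf")
    score = 0.0
    for i, w in enumerate(toks):
        p_uni = unigrams[w] / total
        p_bi = bigrams[(toks[i - 1], w)] / unigrams[toks[i - 1]] if i > 0 else 0.0
        p = lam * p_bi + (1 - lam) * p_uni  # simple interpolation as smoothing
        score += math.log(p + 1e-12)
    return score / len(toks)


def summarize(text, n_sentences=3):
    sentences = split_sentences(text)
    unigrams, bigrams = mle_models(sentences)
    total = sum(unigrams.values())
    ranked = sorted(sentences,
                    key=lambda s: sentence_score(s, unigrams, bigrams, total),
                    reverse=True)
    chosen = set(ranked[:n_sentences])
    # Preserve the original sentence order in the extracted summary.
    return " ".join(s for s in sentences if s in chosen)
```

Summaries produced this way can then be compared against reference summaries with ROUGE-1, ROUGE-2, and ROUGE-L, for example using the rouge-score Python package, which is how the evaluation described in the abstract is typically carried out.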




Data Availability

The data associated with this work will be provided upon reasonable request.




Author information


Corresponding author

Correspondence to Rajiv Singh.

Ethics declarations

Conflict of Interest

The authors declare that there is no conflict of interest regarding this manuscript and that they received no funding for this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Research Trends in Computational Intelligence” guest edited by Anshul Verma, Pradeepika Verma, Vivek Kumar Singh, and S. Karthikeyan.

The original online version of this article was revised: in Table 10, the header of the second part was mistakenly repeated in the first part during the proof correction stage. The table has now been corrected.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Gupta, P., Nigam, S. & Singh, R. A Statistical Language Modeling Framework for Extractive Summarization of Text Documents. SN COMPUT. SCI. 4, 750 (2023). https://doi.org/10.1007/s42979-023-02241-x


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s42979-023-02241-x

Keywords
