Extractive text summarization using clustering-based topic modeling

Belwal, Ramesh Chandra; Rai, Sawan; Gupta, Atul

doi:10.1007/s00500-022-07534-6

Application of soft computing
Published: 04 October 2022

Volume 27, pages 3965–3982 (2023)

Soft Computing

Abstract

Text summarization is the process of condensing an input document into a shorter form while preserving its overall meaning. It is primarily achieved in two ways: abstractive and extractive. Extractive summarizers select a few of the best sentences from the input document, whereas abstractive methods may modify sentence structure or introduce new sentences. The proposed approach is an extractive text summarization technique in which we extend topic modeling so that it can be applied to multiple lower-level specialized entities (i.e., groups) embedded in a single document. Our goal is to overcome the lack of coherence found in existing summarization techniques. Topic modeling was initially proposed to model text data at the multi-document and word levels, without considering sentence modeling. It was subsequently applied at the sentence level and used for document summarization, but with certain limitations: topic modeling does not perform as expected when applied to a single document at the sentence level. To address this shortcoming, we propose a summarization approach that operates at the level of the individual document and its clusters (instead of the sentence level). We aim to choose the best sentence from each group (containing sentences of the same kind) found in the given text. We attempt to select the most suitable topic by evaluating the probability distributions of the words and their respective topics at the cluster level. The method is evaluated on two standard datasets and shows significant performance gains over existing text summarization techniques. In automatic evaluation, the ROUGE scores show a considerable improvement in the F-measure, precision, and recall of the generated summary compared with other text summarization techniques. Furthermore, a manual evaluation demonstrated that the proposed approach outperforms current state-of-the-art text summarization approaches.
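The general shape of the pipeline the abstract describes (group similar sentences, model word probabilities per group, keep one sentence per group, score with ROUGE) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simple bag-of-words k-means-style clustering and substitutes a per-cluster unigram distribution for a real topic model such as LDA. The function names, the stopword list, and the scoring rule are hypothetical choices made only for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on",
             "for", "that", "this", "with", "as", "are", "was"}


def tokenize(sentence):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_sentences(sentences, k, iterations=10):
    """K-means-style clustering on bag-of-words vectors.

    The first k sentences seed the centroids, so the result is deterministic.
    Returns one cluster index per sentence.
    """
    vectors = [Counter(tokenize(s)) for s in sentences]
    k = min(k, len(vectors))
    centroids = [Counter(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        for c in range(k):
            members = [vectors[i] for i, a in enumerate(assignment) if a == c]
            if members:
                # Cosine is scale-invariant, so a sum serves as the centroid.
                centroids[c] = sum(members, Counter())
    return assignment


def summarize(sentences, k):
    """Keep one sentence per cluster: the one whose words have the highest
    mean log-probability under that cluster's unigram distribution
    (a stand-in for a cluster-level topic model)."""
    assignment = cluster_sentences(sentences, k)
    chosen = []
    for c in sorted(set(assignment)):
        idx = [i for i, a in enumerate(assignment) if a == c]
        counts = Counter(w for i in idx for w in tokenize(sentences[i]))
        total = sum(counts.values())

        def score(i):
            words = tokenize(sentences[i])
            if not words:
                return float("-inf")
            return sum(math.log(counts[w] / total) for w in words) / len(words)

        chosen.append(max(idx, key=score))
    # Emit selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]


def rouge_1(candidate, reference):
    """Unigram-overlap ROUGE-1 precision, recall, and F-measure."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    doc = [
        "The battery lasts two days on a single charge.",
        "The screen is bright and sharp.",
        "Battery life is excellent and the battery charges fast.",
        "The screen colors are sharp and vivid.",
    ]
    summary = summarize(doc, k=2)
    print(summary)
    print(rouge_1(" ".join(summary), "Great battery life and a sharp screen."))
```

Selecting one sentence per cluster is what keeps the summary from repeating the same theme. Scoring by mean log-probability favors sentences built from a cluster's most frequent words, which is only a rough proxy for the topic-probability evaluation the abstract mentions.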

Data availability

Enquiries about data availability should be directed to the authors.

Notes

https://cs.nyu.edu/~kcho/DMQA/
https://github.com/kavgan/opinosis-summarization

Funding

No funding was received for this work.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Information Technology Design and Manufacturing, Jabalpur, India
Ramesh Chandra Belwal & Atul Gupta
School of Computer Science Engineering and Technology, Bennett University, Greater Noida, India
Sawan Rai

Contributions

RCB and AG conceived this research and designed experiments. SR participated in editing and drafting the article. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ramesh Chandra Belwal.

Ethics declarations

Conflict of interest

There is no conflict of interest associated with this publication. The authors declare that there are no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not applicable. This article does not contain any studies with human participants or animals performed by the authors. Formal consent is not required.

Informed consent

Not applicable. No individual or personal data were used.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Belwal, R.C., Rai, S. & Gupta, A. Extractive text summarization using clustering-based topic modeling. Soft Comput 27, 3965–3982 (2023). https://doi.org/10.1007/s00500-022-07534-6

Accepted: 22 September 2022
Published: 04 October 2022
Issue Date: April 2023
DOI: https://doi.org/10.1007/s00500-022-07534-6
