Keyword extraction using supervised cumulative TextRank

Bordoloi, Monali; Chatterjee, Preetam Chayan; Biswas, Saroj Kumar; Purkayastha, Biswajit

doi:10.1007/s11042-020-09335-1

Published: 21 August 2020

Volume 79, pages 31467–31496, (2020)

Multimedia Tools and Applications

Monali Bordoloi (ORCID: orcid.org/0000-0002-0685-9268)¹, Preetam Chayan Chatterjee¹, Saroj Kumar Biswas¹ & Biswajit Purkayastha¹

Abstract

Keyword extraction is a major step in extracting valuable and meaningful information from the rich source that is the World Wide Web (WWW). Many keyword extraction algorithms have been proposed, each with its own advantages and disadvantages. Vector Space Model (VSM) algorithms prove quite effective for keyword extraction but do not take into account the class label information of classified data. Supervised Term Weighting (STW) algorithms address this problem but suffer from high dimensionality; they also do not incorporate semantic relationships between terms. Graph Based Models (GBM) were introduced to address these problems, but they rely on unsupervised learning. Hence, this paper proposes Keyword Extraction using Supervised Cumulative TextRank (KESCT), a technique that combines the benefits of VSM and GBM techniques. The proposed algorithm modifies TextRank by incorporating a novel Unique Statistical Supervised Weight (USSW) that captures the class label information of classified data. To account for the relatedness between terms, the mutual information between terms is also included. The proposed algorithm is validated on four review datasets, and the results are compared with those of traditional TextRank and its variants using a Support Vector Machine (SVM) classifier, a Naïve Bayes (NB) classifier, and an ensemble classifier. The experimental results demonstrate the efficacy of the proposed algorithm over the existing algorithms.
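The general idea described in the abstract, a TextRank-style ranking that uses class labels for node weights and mutual information for edge weights, can be illustrated with a minimal Python sketch. This is not the KESCT algorithm itself: the abstract does not give the USSW formula, so `class_prior` below is a hypothetical stand-in (the largest share of a term's document frequency that falls in one class). Using positive pointwise mutual information at document level for the edges, and using the supervised weight as the teleport distribution, are likewise assumptions made only for illustration. All function names and the toy data are invented here.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def class_prior(docs, labels):
    """Hypothetical supervised node weight (NOT the paper's USSW): the largest
    share of a term's document frequency that falls in a single class."""
    per_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        per_class[label].update(set(tokens))
    total = Counter()
    for counts in per_class.values():
        total.update(counts)
    return {t: max(c[t] for c in per_class.values()) / total[t] for t in total}


def pmi_edges(docs):
    """Undirected edge weights from positive pointwise mutual information,
    with co-occurrence counted per document (an assumed choice)."""
    n = len(docs)
    df, pair_df = Counter(), Counter()
    for tokens in docs:
        terms = sorted(set(tokens))
        df.update(terms)
        pair_df.update(combinations(terms, 2))
    edges = defaultdict(dict)
    for (a, b), n_ab in pair_df.items():
        pmi = math.log((n_ab * n) / (df[a] * df[b]))
        if pmi > 0:  # keep only positively associated pairs
            edges[a][b] = edges[b][a] = pmi
    return edges


def supervised_textrank(docs, labels, damping=0.85, iterations=50):
    """Weighted TextRank whose teleport term is biased by the class prior."""
    prior = class_prior(docs, labels)
    edges = pmi_edges(docs)
    z = sum(prior.values())
    teleport = {t: w / z for t, w in prior.items()}
    out_weight = {t: sum(nbrs.values()) for t, nbrs in edges.items()}
    scores = dict(teleport)
    for _ in range(iterations):
        scores = {
            t: (1 - damping) * teleport[t]
            + damping * sum(scores[u] * w / out_weight[u]
                            for u, w in edges.get(t, {}).items())
            for t in teleport
        }
    return scores


# Toy labelled reviews (invented data, for illustration only).
docs = [
    ["great", "battery"],
    ["great", "battery", "phone"],
    ["poor", "screen", "phone"],
    ["poor", "screen"],
]
labels = ["pos", "pos", "neg", "neg"]

scores = supervised_textrank(docs, labels)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.4f}")
```

In this toy example, "phone" appears equally in both classes and has no positively associated neighbour, so it ranks below the class-specific terms. Isolated terms receive only the teleport share, which means the scores are relative ranks rather than a normalized probability distribution.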

Author information

Authors and Affiliations

Computer Science and Engineering Department, NIT Silchar, Silchar, Assam, India
Monali Bordoloi, Preetam Chayan Chatterjee, Saroj Kumar Biswas & Biswajit Purkayastha

Corresponding author

Correspondence to Monali Bordoloi.

About this article

Cite this article

Bordoloi, M., Chatterjee, P.C., Biswas, S.K. et al. Keyword extraction using supervised cumulative TextRank. Multimed Tools Appl 79, 31467–31496 (2020). https://doi.org/10.1007/s11042-020-09335-1

Received: 21 October 2019
Revised: 26 June 2020
Accepted: 13 July 2020
Published: 21 August 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s11042-020-09335-1
