Abstract
Keyword extraction is a key step in mining valuable and meaningful information from the World Wide Web (WWW). Many keyword extraction algorithms have been proposed, each with its own advantages and disadvantages. Vector Space Model (VSM) algorithms are quite effective for keyword extraction, but they do not use the class label information of classified data. Supervised Term Weighting (STW) algorithms address this problem, but they suffer from high dimensionality and do not incorporate semantic relationships between terms. Graph Based Models (GBM) were introduced to address these problems; however, they too rely on unsupervised learning. Hence, this paper proposes a Keyword Extraction using Supervised Cumulative TextRank (KESCT) technique that combines the benefits of both VSM and GBM techniques. The proposed algorithm modifies TextRank by incorporating a novel Unique Statistical Supervised Weight (USSW) that captures the class label information of classified data. To emphasize the relatedness between terms, the mutual information between terms is also included. The proposed algorithm is validated on four review datasets, and the results are compared with traditional TextRank and its variants using a Support Vector Machine (SVM) classifier, a Naïve Bayes (NB) classifier and an ensemble classifier. Experimental results demonstrate the efficacy of the proposed algorithm over existing algorithms.
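To make the idea of a TextRank variant with a node-level weight and mutual-information edges more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not define USSW, so `node_prior` is a hypothetical stand-in supplied by the caller. Computing pointwise mutual information at the document level and injecting the node weight through the teleport term of the ranking update are both assumptions made only for illustration, as are the function names and the toy review data.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def pmi_edges(docs):
    """Document-level positive PMI for every pair of co-occurring terms.

    Assumption: PMI is estimated from document frequencies; pairs whose
    PMI is not positive are dropped, so they create no edge.
    """
    n_docs = len(docs)
    term_df = Counter()
    pair_df = Counter()
    for tokens in docs:
        vocab = sorted(set(tokens))
        term_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    edges = {}
    for (a, b), joint in pair_df.items():
        pmi = math.log((joint * n_docs) / (term_df[a] * term_df[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges


def weighted_textrank(edges, node_prior, damping=0.85, iterations=50):
    """Edge-weighted TextRank whose teleport term is biased by node_prior.

    node_prior is a hypothetical placeholder for a supervised term weight;
    it is normalised to sum to 1 over the nodes of the graph.
    """
    neighbours = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    nodes = sorted(neighbours)
    prior_total = sum(node_prior.get(n, 0.0) for n in nodes)
    if prior_total > 0:
        prior = {n: node_prior.get(n, 0.0) / prior_total for n in nodes}
    else:
        # Fall back to the uniform teleport of plain TextRank.
        prior = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(neighbours[n].values()) for n in nodes}
    score = dict(prior)
    for _ in range(iterations):
        score = {
            v: (1 - damping) * prior[v]
            + damping * sum(
                score[u] * w / out_weight[u] for u, w in neighbours[v].items()
            )
            for v in nodes
        }
    return score


# Toy, made-up review snippets (already tokenised) and a made-up prior.
docs = [
    ["battery", "life", "great"],
    ["battery", "drain", "poor"],
    ["screen", "great"],
    ["screen", "poor", "drain"],
]
prior = {"battery": 1.0, "life": 1.0, "great": 1.0, "drain": 3.0, "poor": 1.0}
scores = weighted_textrank(pmi_edges(docs), prior)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Terms with no positive-PMI neighbour never enter the graph in this sketch, so they receive no score; a real pipeline would need to decide how to handle such isolated terms.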
Cite this article
Bordoloi, M., Chatterjee, P.C., Biswas, S.K. et al. Keyword extraction using supervised cumulative TextRank. Multimed Tools Appl 79, 31467–31496 (2020). https://doi.org/10.1007/s11042-020-09335-1
DOI: https://doi.org/10.1007/s11042-020-09335-1