Keyword extraction using supervised cumulative TextRank

Multimedia Tools and Applications

Abstract

Keyword extraction is a key step in extracting valuable and meaningful information from the rich source that is the World Wide Web (WWW). Many keyword extraction algorithms have been proposed, each with its own advantages and disadvantages. Vector Space Model (VSM) algorithms are quite effective for keyword extraction but do not take into account the class label information of classified data. Supervised Term Weighting (STW) algorithms address this problem but suffer from high dimensionality and do not incorporate semantic relationships between terms. Graph Based Models (GBM) were introduced to address these problems; however, they too rely on unsupervised learning. Hence, this paper proposes Keyword Extraction using Supervised Cumulative TextRank (KESCT), a technique that combines the benefits of VSM and GBM techniques. The proposed algorithm modifies TextRank by incorporating a novel Unique Statistical Supervised Weight (USSW) that captures the class label information of classified data. To emphasize the relatedness between terms, the mutual information between terms is also included. The proposed algorithm is validated on four review datasets, and its results are compared with those of traditional TextRank and its variants using a Support Vector Machine (SVM) classifier, a Naïve Bayes (NB) classifier and an ensemble classifier. Experimental results demonstrate the efficacy of the proposed algorithm over existing algorithms.
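
As a rough illustration of the idea described in the abstract, the following Python sketch runs a TextRank-style ranking over a word co-occurrence graph and uses a class-aware node weight as the teleport prior. It is not the KESCT algorithm: the abstract does not give the USSW formula, so `class_skew_weight` is a hypothetical stand-in, and the function names, the window size, the damping factor and the use of raw co-occurrence counts (rather than mutual information) as edge weights are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def cooccurrence_graph(tokens, window=2):
    """Symmetric co-occurrence counts for words at most `window` positions apart."""
    edges = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:  # skip self-loops
                edges[(w, tokens[j])] += 1.0
                edges[(tokens[j], w)] += 1.0
    return edges


def class_skew_weight(counts_by_class):
    """Hypothetical supervised weight (NOT the paper's USSW).

    For each term, returns the largest share of its occurrences that falls
    in a single class: 1.0 if the term occurs in one class only, and
    1/k if it is spread evenly over k classes.
    """
    totals = Counter()
    for counts in counts_by_class.values():
        totals.update(counts)
    return {
        term: max(counts[term] for counts in counts_by_class.values()) / n
        for term, n in totals.items()
    }


def weighted_textrank(edges, node_weight=None, damping=0.85, iters=50):
    """TextRank-style power iteration with a node-weighted teleport prior."""
    nodes = sorted({u for u, _ in edges})
    node_weight = node_weight or {}
    # Nodes without a supplied weight default to 1.0; the prior sums to 1.
    prior = {n: node_weight.get(n, 1.0) for n in nodes}
    total = sum(prior.values())
    prior = {n: p / total for n, p in prior.items()}

    # Total outgoing edge weight per node, used to normalise transitions.
    out_sum = defaultdict(float)
    for (u, _), w in edges.items():
        out_sum[u] += w

    score = dict(prior)
    for _ in range(iters):
        new = {n: (1 - damping) * prior[n] for n in nodes}
        for (u, v), w in edges.items():
            new[v] += damping * score[u] * w / out_sum[u]
        score = new
    return score
```

A possible usage, again under the same assumptions: build the graph from a tokenized review with `cooccurrence_graph`, compute per-term weights from labeled training counts with `class_skew_weight`, and pass them as `node_weight` to `weighted_textrank`; terms with the highest scores are the keyword candidates.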

Author information

Corresponding author

Correspondence to Monali Bordoloi.

About this article

Cite this article

Bordoloi, M., Chatterjee, P.C., Biswas, S.K. et al. Keyword extraction using supervised cumulative TextRank. Multimed Tools Appl 79, 31467–31496 (2020). https://doi.org/10.1007/s11042-020-09335-1

  • DOI: https://doi.org/10.1007/s11042-020-09335-1
