Designing an efficient unigram keyword detector for documents using Relative Entropy

Multimedia Tools and Applications

Abstract

In this work we propose a statistical approach to identifying unigram keywords for a document. We identify features that effectively capture the importance of a word in a document and evaluate its potential to be a keyword. Relative entropy, displacement, and variance of terms in a document are evaluated in the context of keyword identification. The proposed approach works on single documents and requires no pre-training of a model. We also evaluate the effectiveness of the proposed feature set against the gold standard of “term frequency”. The results of the proposed method are presented and compared with existing algorithms.
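The abstract names its features but does not give their formulas, so the following is only an illustrative sketch and not the authors' method. It assumes one plausible reading: the document is split into equal-width segments, the relative entropy of a word is the Kullback-Leibler divergence of its occurrence distribution over those segments from a uniform distribution, and its variance is the variance of its normalized token positions. The function name `term_features`, the parameter `n_segments`, and the tokenizer are hypothetical; displacement is omitted because the abstract does not define it.

```python
import math
import re
from collections import defaultdict


def term_features(text, n_segments=4):
    """Return per-word term frequency, relative entropy, and positional variance.

    Assumption (not from the paper): relative entropy is KL(P || U), where P is
    the word's occurrence distribution over n_segments equal-width segments and
    U is the uniform distribution over those segments.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    n = len(tokens)

    # Record every token index at which each word occurs.
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    features = {}
    for word, pos in positions.items():
        tf = len(pos)

        # Count occurrences of the word in each segment.
        counts = [0] * n_segments
        for p in pos:
            counts[min(p * n_segments // n, n_segments - 1)] += 1

        # KL divergence from uniform; empty segments contribute zero.
        q = 1.0 / n_segments
        kl = sum((c / tf) * math.log((c / tf) / q) for c in counts if c > 0)

        # Population variance of positions normalized to [0, 1).
        rel = [p / n for p in pos]
        mean = sum(rel) / tf
        var = sum((r - mean) ** 2 for r in rel) / tf

        features[word] = {"tf": tf, "relative_entropy": kl, "variance": var}
    return features


if __name__ == "__main__":
    sample = "x a a a x b c d x e f g x h i j"
    for word, feats in sorted(term_features(sample).items()):
        print(word, feats)
```

Under these assumptions, a word spread evenly across the document scores a relative entropy of zero, while a word concentrated in one segment scores the maximum, log(n_segments). Which of the two behaviours signals a keyword is a design question the abstract leaves open.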

Author information

Corresponding author

Correspondence to R. N. Rathi.

About this article

Cite this article

Rathi, R.N., Mustafi, A. Designing an efficient unigram keyword detector for documents using Relative Entropy. Multimed Tools Appl 81, 37747–37761 (2022). https://doi.org/10.1007/s11042-022-12657-x

  • DOI: https://doi.org/10.1007/s11042-022-12657-x
