Designing an efficient unigram keyword detector for documents using Relative Entropy

Rathi, R. N.; Mustafi, A.

doi:10.1007/s11042-022-12657-x

Published: 22 April 2022

Volume 81, pages 37747–37761, (2022)

Multimedia Tools and Applications

Abstract

In this work we propose a statistical approach to identify unigram keywords for a document. We identify features that effectively capture the importance of a word in a document and evaluate its potential to be a keyword. Specifically, the relative entropy, displacement and variance of terms in a document are evaluated in the context of keyword identification. The proposed approach works on single documents and requires no pre-training of a model. We also compare the usefulness of the proposed feature set against the gold standard of “term frequency”. The results of the proposed method are presented and compared with existing algorithms.
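
The abstract names the features but does not define them, so the following is only an illustrative sketch and not the authors' formulation. It assumes one plausible reading: the relative entropy (Kullback-Leibler divergence) of a term's distribution over equal-width segments of the document, measured against a uniform distribution, together with the variance of the term's normalised positions. The function name `term_position_features`, the `n_bins` parameter and the binning scheme are all hypothetical. "Displacement" is left out because the abstract gives no hint of its definition.

```python
import math


def term_position_features(tokens, term, n_bins=4):
    """Hypothetical single-document features for one term.

    Assumption (not from the paper): the document is cut into n_bins
    equal-width segments and the term's occurrences are compared with
    a uniform spread across those segments.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    if not positions:
        return None

    n = len(tokens)
    total = len(positions)

    # Count how many occurrences fall into each segment of the document.
    counts = [0] * n_bins
    for p in positions:
        counts[min(p * n_bins // n, n_bins - 1)] += 1

    # KL divergence D(P || U) in bits; empty bins contribute zero.
    uniform = 1.0 / n_bins
    relative_entropy = sum(
        (c / total) * math.log2((c / total) / uniform)
        for c in counts
        if c > 0
    )

    # Population variance of positions scaled to [0, 1).
    scaled = [p / n for p in positions]
    mean = sum(scaled) / total
    variance = sum((x - mean) ** 2 for x in scaled) / total

    return {
        "tf": total,
        "relative_entropy": relative_entropy,
        "position_variance": variance,
    }


if __name__ == "__main__":
    doc = "entropy of a term entropy of a word entropy of a text".split()
    print(term_position_features(doc, "entropy", n_bins=3))
```

Under this reading, a term spread evenly through the document scores a relative entropy of zero, while a term concentrated in one segment scores up to log2(n_bins). Whether high or low values indicate a keyword is a question the abstract does not answer.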

Author information

Authors and Affiliations

Birla Institute of Technology, Mesra, India
R. N. Rathi & A. Mustafi

Corresponding author

Correspondence to R. N. Rathi.

About this article

Cite this article

Rathi, R.N., Mustafi, A. Designing an efficient unigram keyword detector for documents using Relative Entropy. Multimed Tools Appl 81, 37747–37761 (2022). https://doi.org/10.1007/s11042-022-12657-x

Received: 17 December 2020
Revised: 11 March 2021
Accepted: 09 February 2022
Published: 22 April 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s11042-022-12657-x
