Abstract
In this work we propose a statistical approach to identifying unigram keywords in a document. We identify features that effectively capture the importance of a word in a document and evaluate its potential to be a keyword. The relative entropy, displacement and variance of terms in a document are evaluated in the context of keyword identification. The proposed approach works on single documents and requires no pre-training of a model. We also evaluate the proposed feature set against the gold standard of “term frequency” and compare the usefulness of the two. The results of the proposed method are presented and compared with existing algorithms.
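The abstract does not give formulas for these features, so the following is only an illustrative sketch of how per-term statistics of this kind could be computed from a single document, not the authors' method. It assumes that relative entropy is measured as the Kullback-Leibler divergence between a term's distribution over equal-sized document segments and a uniform distribution, and that variance is taken over the term's normalized token positions. The function name `term_features` and the parameter `n_segments` are hypothetical. Displacement is left out because the abstract does not define it.

```python
import math
import re
from collections import defaultdict


def term_features(text, n_segments=4):
    """Return per-term frequency, relative entropy and positional variance.

    Assumption (not from the paper): relative entropy is KL(p || uniform),
    where p is the term's distribution over n_segments equal-sized segments.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)

    # Record every token position at which each term occurs.
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    features = {}
    for term, pos in positions.items():
        tf = len(pos)

        # Count occurrences of the term in each segment of the document.
        counts = [0] * n_segments
        for p in pos:
            counts[p * n_segments // total] += 1

        # KL divergence from the uniform distribution over segments:
        # 0 when the term is spread evenly, log(n_segments) when it
        # is confined to a single segment.
        kl = sum((c / tf) * math.log((c / tf) * n_segments) for c in counts if c)

        # Population variance of positions scaled to [0, 1).
        scaled = [p / total for p in pos]
        mean = sum(scaled) / tf
        var = sum((x - mean) ** 2 for x in scaled) / tf

        features[term] = {"tf": tf, "relative_entropy": kl, "variance": var}
    return features


if __name__ == "__main__":
    sample = "alpha beta alpha beta gamma delta epsilon zeta"
    for term, feats in sorted(term_features(sample).items()):
        print(term, feats)
```

Under these assumptions, a term concentrated in one part of the document scores a high relative entropy and a low variance, while a term spread evenly scores near zero relative entropy. How such scores would be combined into a keyword ranking is not specified in the abstract.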
Cite this article
Rathi, R.N., Mustafi, A. Designing an efficient unigram keyword detector for documents using Relative Entropy. Multimed Tools Appl 81, 37747–37761 (2022). https://doi.org/10.1007/s11042-022-12657-x
DOI: https://doi.org/10.1007/s11042-022-12657-x