Diverse feature set based Keyphrase extraction and indexing techniques

Sharma, Saurabh; Gupta, Vishal; Juneja, Mamta

doi:10.1007/s11042-020-09423-2

Published: 26 September 2020

Volume 80, pages 4111–4142, (2021)

Multimedia Tools and Applications

Abstract

The internet has changed the way people communicate, and this has led to a vast amount of text being available in electronic format, including e-mail, technical and scientific reports, tweets, physician notes, and military field reports. Providing keyphrases for these extensive text collections allows users to grasp the essence of lengthy content quickly and helps them locate information efficiently. When designing a keyword extraction and indexing system, it is essential to pick distinguishing properties, called features. In this article, we propose several unsupervised keyword extraction approaches that are independent of the structure, size, and domain of the documents. The proposed methods rely on a novel, cognitively inspired set of standard, phrase, word-embedding, and external-knowledge-source features. Results for individual features and for selected feature subsets are reported through experiments on four datasets: SemEval, KDD, Inspec, and DUC. Across all of these datasets, the features chosen by feature selection and the word-embedding-based features are the best feature sets for keyword extraction and indexing. That is, the proposed distributed word vectors with additional knowledge improve the results significantly over individual features, over combined features after feature selection, and over the state of the art. Having developed these keyphrase extraction methods, we also applied them to a document classification task.

Acknowledgements

The authors thank the reviewers for their helpful comments. The first author thanks the Ministry of Electronics and IT, Government of India, for providing a fellowship under grant number PhD-MLA/4(61)/2015-16 (Visvesvaraya PhD Scheme for Electronics and IT) to pursue his Ph.D. work.

Author information

Authors and Affiliations

University Institute of Engineering & Technology, Panjab University, Chandigarh, India
Saurabh Sharma, Vishal Gupta & Mamta Juneja

Corresponding author

Correspondence to Vishal Gupta.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sharma, S., Gupta, V. & Juneja, M. Diverse feature set based Keyphrase extraction and indexing techniques. Multimed Tools Appl 80, 4111–4142 (2021). https://doi.org/10.1007/s11042-020-09423-2

Received: 16 October 2019
Revised: 05 June 2020
Accepted: 21 July 2020
Published: 26 September 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s11042-020-09423-2
