Diverse feature set based Keyphrase extraction and indexing techniques

Multimedia Tools and Applications

Abstract

The internet has changed the way people communicate, producing a vast amount of text in electronic format, including e-mail, technical and scientific reports, tweets, physician notes and military field reports. Providing keyphrases for these extensive text collections lets users grasp the essence of lengthy content quickly and locate information efficiently. When designing a keyword extraction and indexing system, it is essential to pick unique properties, called features. In this article, we propose several unsupervised keyword extraction approaches that are independent of the structure, size and domain of the documents. The proposed methods rely on a novel, cognitively inspired set of standard, phrase, word embedding and external knowledge source features. Results for individual and selected features are reported through experiments on four datasets, viz. SemEval, KDD, Inspec and DUC. The selected (feature selection) and word embedding based features are the best feature sets for keyword extraction and indexing across all of these datasets. That is, the proposed distributed word vectors with additional knowledge improve the results significantly over individual features, combined features after feature selection, and the state of the art. Having developed these keyphrase extraction methods, we also applied them to a document classification task.
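To make the idea of feature-based unsupervised extraction concrete, the following is a minimal sketch and not the authors' method. It assumes only two simple statistical features (phrase frequency and position of first occurrence), a tiny illustrative stopword list, and a hypothetical scoring formula; the phrase, word embedding and external knowledge features described in the abstract are not modelled here. The function names `candidate_phrases` and `score_candidates` are invented for this illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use a fuller list).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "for", "to",
             "is", "are", "with", "that", "this", "by", "from", "as", "it"}


def candidate_phrases(text, max_len=3):
    """Return (phrase, start_offset) pairs plus the total token count.

    Candidates are runs of consecutive non-stopword tokens, broken at
    stopwords and punctuation; runs longer than max_len words are dropped.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    candidates, current, start = [], [], 0
    for i, tok in enumerate(tokens):
        if tok.isalnum() and tok not in STOPWORDS:
            if not current:
                start = i  # remember where this run begins
            current.append(tok)
        else:
            if current and len(current) <= max_len:
                candidates.append((" ".join(current), start))
            current = []
    if current and len(current) <= max_len:
        candidates.append((" ".join(current), start))
    return candidates, len(tokens)


def score_candidates(text, top_k=5):
    """Rank candidates by frequency, discounted by how late they first appear.

    The weighting freq * (1 - first_position / n_tokens) is a hypothetical
    choice made for this sketch only.
    """
    candidates, n_tokens = candidate_phrases(text)
    freq = Counter(phrase for phrase, _ in candidates)
    first = {}
    for phrase, pos in candidates:
        first.setdefault(phrase, pos)  # keep the earliest offset only
    scores = {p: freq[p] * (1.0 - first[p] / n_tokens) for p in freq}
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    sample = "Keyphrase extraction is useful. Keyphrase extraction is fast."
    for phrase, score in score_candidates(sample):
        print(phrase, round(score, 3))
```

In this sketch no training data is needed, which is what makes the approach unsupervised; richer features would be added as extra terms in the scoring step.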

Notes

  1. https://github.com/nltk/nltk

  2. http://morpho.aalto.fi/projects/morpho/morfessor2.shtml

  3. https://www.ldoceonline.com/dictionary

  4. https://wordnet.princeton.edu/download

  5. https://github.com/alvations/pywsd

Acknowledgements

The authors thank the reviewers for their helpful comments. The first author would like to thank the Ministry of Electronics and IT, Government of India, for providing a fellowship under grant number PhD-MLA/4(61)/2015-16 (Visvesvaraya PhD Scheme for Electronics and IT) to pursue his Ph.D. work.

Author information

Corresponding author

Correspondence to Vishal Gupta.

About this article

Cite this article

Sharma, S., Gupta, V. & Juneja, M. Diverse feature set based Keyphrase extraction and indexing techniques. Multimed Tools Appl 80, 4111–4142 (2021). https://doi.org/10.1007/s11042-020-09423-2

  • DOI: https://doi.org/10.1007/s11042-020-09423-2
