Automatic keyphrase extraction using word embeddings

Zhang, Yuxiang; Liu, Huan; Wang, Suge; Ip., W. H.; Fan, Wei; Xiao, Chunjing

doi:10.1007/s00500-019-03963-y

Published: 29 March 2019

Volume 24, pages 5593–5608, (2020)

Soft Computing

Yuxiang Zhang¹,
Huan Liu¹,
Suge Wang²,
W. H. Ip.^3,4,
Wei Fan¹ &
…
Chunjing Xiao¹

966 Accesses
42 Citations

Abstract

Unsupervised random-walk keyphrase extraction models rely mainly on the global structural information of the word graph, in which nodes represent candidate words and edges capture the co-occurrence information between them. However, using word embedding methods to integrate multiple kinds of useful information into the random-walk model for better keyphrase extraction remains relatively unexplored. In this paper, we propose a random-walk-based ranking method that extracts keyphrases from text documents using word embeddings. Specifically, we first design a heterogeneous text graph embedding model that integrates local context information of the word graph (i.e., the local word collocation patterns) with some crucial features of the candidate words and edges of the word graph. Then, a novel random-walk-based ranking model is designed to score candidate words by leveraging the learned word embeddings. Finally, a new and generic similarity-based phrase scoring model using word embeddings is proposed to score phrases, and the top-scoring phrases are selected as keyphrases. Experimental results show that the proposed method consistently outperforms eight state-of-the-art unsupervised methods on three real datasets for keyphrase extraction.
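To make the word-graph setting in the abstract concrete, the following is a minimal sketch of the generic baseline that such methods build on: a co-occurrence word graph scored by a plain weighted PageRank-style random walk, in the spirit of TextRank. It is not the embedding-based model proposed in the paper. The function names, the window size, the damping factor, and the sum-of-word-scores phrase score are all illustrative assumptions.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Undirected word graph: the weight of an edge counts how often two
    distinct words co-occur within `window` positions of each other.
    (`window` is an assumed, illustrative parameter.)"""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def random_walk_scores(graph, damping=0.85, iterations=50):
    """Plain weighted PageRank over the word graph (generic baseline,
    not the paper's embedding-biased random walk)."""
    nodes = list(graph)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # The graph is symmetric, so graph[node][nbr] is also the
            # weight of the edge leading from nbr to node.
            incoming = sum(
                scores[nbr] * weight / out_weight[nbr]
                for nbr, weight in graph[node].items()
            )
            updated[node] = (1.0 - damping) / len(nodes) + damping * incoming
        scores = updated
    return scores


def score_phrase(phrase, word_scores):
    """Common baseline phrase score: the sum of the member words' scores.
    The paper proposes a similarity-based model instead."""
    return sum(word_scores.get(word, 0.0) for word in phrase)


if __name__ == "__main__":
    tokens = "word graph random walk word graph ranking".split()
    word_scores = random_walk_scores(build_cooccurrence_graph(tokens))
    candidates = [("word", "graph"), ("random", "walk"), ("ranking",)]
    for phrase in sorted(candidates, key=lambda p: -score_phrase(p, word_scores)):
        print(" ".join(phrase), round(score_phrase(phrase, word_scores), 4))
```

In this baseline the transition probabilities depend only on co-occurrence counts. The abstract describes a method that instead uses learned word embeddings both to score candidate words in the random walk and to score phrases.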

Acknowledgements

This work was partially supported by grants from the National Natural Science Foundation of China (Nos. U1333109, 61632011, 61573231, U1533104), the Department of Industrial and Systems Engineering, Hong Kong Polytechnic University (Project code H-ZG3K), and the Open Project Foundation of Intelligent Information Processing Key Laboratory of Shanxi Province (No. CICIP2018004).

Author information

Authors and Affiliations

School of Computer Science and Technology, Civil Aviation University of China, Tianjin, China
Yuxiang Zhang, Huan Liu, Wei Fan & Chunjing Xiao
School of Computer and Information Technology, Shanxi University, Taiyuan, China
Suge Wang
Department of Industrial and Systems Engineering, Hong Kong Polytechnic University, Kowloon, Hong Kong, SAR, China
W. H. Ip.
Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Canada
W. H. Ip.

Corresponding author

Correspondence to Yuxiang Zhang.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by B. B. Gupta.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, Y., Liu, H., Wang, S. et al. Automatic keyphrase extraction using word embeddings. Soft Comput 24, 5593–5608 (2020). https://doi.org/10.1007/s00500-019-03963-y

Published: 29 March 2019
Issue Date: April 2020
DOI: https://doi.org/10.1007/s00500-019-03963-y
