research-article

NamedKeys: Unsupervised Keyphrase Extraction for Biomedical Documents

Authors:

Joyce C. Ho

BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pages 328 - 337

https://doi.org/10.1145/3307339.3342147

Published: 04 September 2019

Abstract

A vast amount of biomedical literature is generated and digitized every year. As a result, there is a growing need to develop methods for discovering, accessing, and sharing knowledge from medical literature. Keyphrase extraction is the task of summarizing a text by identifying its key concepts. Keyphrases can be single-word or multi-word linguistic units that concisely represent a document. Although a variety of models have been proposed for automated keyphrase extraction, their performance is poor in comparison with other natural language processing tasks. The problem is even more daunting for the biomedical domain, where the text is filled with highly domain-specific terminology. We propose a new method, NamedKeys, to automatically identify meaningful and informative keyphrases from biomedical text. NamedKeys integrates named entity recognition, phrase embedding, phrase quality scoring, ranking, and clustering to extract author-assigned keywords from biomedical documents. Performance evaluation on PubMed abstracts demonstrates that NamedKeys achieves significant improvements over existing state-of-the-art keyphrase extraction models. Furthermore, we propose the first benchmark dataset for keyphrase extraction from biomedical text.
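To make the ranking and clustering stages named in the abstract more concrete, the following is a minimal sketch in Python. It is not the NamedKeys implementation: the abstract does not specify the scoring function, the embedding model, or the clustering algorithm, so everything here is an assumption chosen for illustration. The function name `rank_keyphrases`, the toy three-dimensional vectors, the centroid-similarity score, and the `dedup_threshold` parameter are all hypothetical. The sketch assumes candidate phrases have already been found (for example by a named entity recognizer) and embedded.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_keyphrases(candidates, top_k=3, dedup_threshold=0.95):
    """Rank candidate phrases and drop near-duplicates.

    candidates: dict mapping phrase -> embedding vector (hypothetical input).
    Scoring by similarity to the mean embedding is an illustrative stand-in,
    not the phrase quality score used in the paper.
    """
    vectors = list(candidates.values())
    dim = len(vectors[0])
    # Document centroid: the mean of all candidate embeddings.
    centroid = [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
    # Rank candidates by how close they are to the centroid.
    ranked = sorted(
        candidates, key=lambda p: cosine(candidates[p], centroid), reverse=True
    )
    # Greedy grouping: skip any phrase too similar to one already selected.
    selected = []
    for phrase in ranked:
        if all(
            cosine(candidates[phrase], candidates[s]) < dedup_threshold
            for s in selected
        ):
            selected.append(phrase)
        if len(selected) == top_k:
            break
    return selected


# Toy embeddings, invented for this example only.
candidates = {
    "myocardial infarction": [0.90, 0.10, 0.00],
    "heart attack": [0.88, 0.12, 0.00],
    "aspirin": [0.20, 0.90, 0.10],
    "hospital": [0.10, 0.10, 0.90],
}
result = rank_keyphrases(candidates)
```

In this toy input the two near-synonymous phrases have almost identical vectors, so only one of them survives the grouping step. A real system would replace the greedy threshold with a proper clustering method and the centroid score with a learned or corpus-based quality measure.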

Index Terms

NamedKeys: Unsupervised Keyphrase Extraction for Biomedical Documents
1. Applied computing
  1. Life and medical sciences
    1. Health care information systems
    2. Health informatics
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Information extraction
      2. Summarization

Published In

BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

September 2019

716 pages

ISBN:9781450366663

DOI:10.1145/3307339

General Chairs: Xinghua (Mindy) Shi (Temple University, USA), Michael Buck (University of Buffalo, USA)

Program Chairs: Jian Ma (Carnegie Mellon University, USA), Pierangelo Veltri (University Magna Graecia of Catanzaro, Italy)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGBio: ACM Special Interest Group on Bioinformatics

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

BCB '19: 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Sponsor: SIGBio

September 7 - 10, 2019

Niagara Falls, NY, USA

Acceptance Rates

BCB '19 Paper Acceptance Rate 42 of 157 submissions, 27%;

Overall Acceptance Rate 254 of 885 submissions, 29%

Bibliometrics

Total Citations: 9

Total Downloads: 373

Downloads (Last 12 months): 20

Downloads (Last 6 weeks): 1

Reflects downloads up to 19 Feb 2025

Citations

Cited By

Zengeya T, Fonou Dombeu J, Gwetu M (2024). A Centrality-Weighted Bidirectional Encoder Representation from Transformers Model for Enhanced Sequence Labeling in Key Phrase Extraction from Scientific Texts. Big Data and Cognitive Computing 8:12 (182). Online publication date: 4-Dec-2024.
https://doi.org/10.3390/bdcc8120182
Glazkova A, Morozov D (2024). Cross-Domain Robustness of Transformer-Based Keyphrase Generation. Data Analytics and Management in Data Intensive Domains (249-265). Online publication date: 1-Oct-2024.
https://doi.org/10.1007/978-3-031-67826-4_19
Abulaish M, Fazil M, Zaki M (2022). Domain-Specific Keyword Extraction Using Joint Modeling of Local and Global Contextual Semantics. ACM Transactions on Knowledge Discovery from Data 16:4 (1-30). Online publication date: 8-Jan-2022.
https://dl.acm.org/doi/10.1145/3494560
Forghani M, Firstkov A, AlyanNezhadi M, Forghani K (2022). Evaluating Keyphrase Extraction Methods for Clustering Influenza-Related Scientific Papers. 2022 8th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS) (1-7). Online publication date: 28-Dec-2022.
https://doi.org/10.1109/ICSPIS56952.2022.10043863
Deka P, Jurek-Loughrey A, Deepak (2021). Unsupervised Keyword Combination Query Generation from Online Health Related Content for Evidence-Based Fact Checking. The 23rd International Conference on Information Integration and Web Intelligence (267-277). Online publication date: 29-Nov-2021.
https://dl.acm.org/doi/10.1145/3487664.3487701
Ding H, Luo X (2021). Attention-based Unsupervised Keyphrase Extraction and Phrase Graph for COVID-19 Medical Literature Retrieval. ACM Transactions on Computing for Healthcare 3:1 (1-16). Online publication date: 15-Oct-2021.
https://dl.acm.org/doi/10.1145/3473939
Celikten A, Ugur A, Bulut H (2021). Keyword Extraction from Biomedical Documents Using Deep Contextualized Embeddings. 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA) (1-5). Online publication date: 25-Aug-2021.
https://doi.org/10.1109/INISTA52262.2021.9548470
Gero Z, Ho J (2021). Uncertainty-based Self-training for Biomedical Keyphrase Extraction. 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI) (1-4). Online publication date: 27-Jul-2021.
https://doi.org/10.1109/BHI50953.2021.9508592
Carmichael A, Bhowmik D, Baily J, Brownlow A, Gunn G, Reeves A (2020). Ir-Man. Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (1-9). Online publication date: 21-Sep-2020.
https://dl.acm.org/doi/10.1145/3388440.3412417
