Abstract
Searching, reading, and finding information in massive medical text collections is challenging. A typical biomedical search engine cannot navigate each article to find critical information or keyphrases. Moreover, few tools visualize the phrases relevant to a query. There is therefore a need to extract keyphrases from each document for indexing and efficient search. The transformer-based neural network BERT has been used for various natural language processing tasks. Its built-in self-attention mechanism can capture the associations between words and phrases in a sentence. This research investigates whether self-attention can be used to extract keyphrases from a document in an unsupervised manner and to identify the relevancy between phrases, in order to construct a query relevancy phrase graph that visualizes the phrases of the search corpus by their relevancy and importance. A comparison with six baseline methods shows that the self-attention-based unsupervised keyphrase extraction works well on a medical literature dataset. This unsupervised keyphrase extraction model can also be applied to other text data. The query relevancy graph model is applied to the COVID-19 literature dataset to demonstrate that the attention-based phrase graph can successfully identify the medical phrases relevant to the query terms.
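The idea of ranking phrases by self-attention and linking them into a phrase graph can be illustrated with a minimal sketch. The abstract does not say how attention weights are aggregated, so everything here is an assumption made for illustration: the function names `phrase_scores` and `phrase_edges`, the hand-written toy attention matrix (standing in for one averaged BERT attention map), and the choice of mean received attention as the importance score are hypothetical and are not the authors' method.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def phrase_scores(attn: List[List[float]], spans: Dict[str, Span]) -> Dict[str, float]:
    """Score a phrase by the mean attention its tokens receive (assumed rule)."""
    n = len(attn)
    # received[j] = total attention token j gets from every token i
    received = [sum(attn[i][j] for i in range(n)) for j in range(n)]
    return {
        phrase: sum(received[start:end]) / (end - start)
        for phrase, (start, end) in spans.items()
    }


def phrase_edges(attn: List[List[float]], spans: Dict[str, Span]) -> Dict[Tuple[str, str], float]:
    """Weight an edge by the mean attention between two phrases, averaged over both directions."""
    def block(a: Span, b: Span) -> float:
        (s1, e1), (s2, e2) = a, b
        total = sum(attn[i][j] for i in range(s1, e1) for j in range(s2, e2))
        return total / ((e1 - s1) * (e2 - s2))

    names = list(spans)
    edges = {}
    for x in range(len(names)):
        for y in range(x + 1, len(names)):
            p, q = names[x], names[y]
            edges[(p, q)] = 0.5 * (block(spans[p], spans[q]) + block(spans[q], spans[p]))
    return edges


# Toy input: tokens "covid 19 causes severe pneumonia"; each row sums to 1.
# In practice this matrix would come from a pretrained model (assumption).
attn = [
    [0.4, 0.3, 0.1, 0.1, 0.1],
    [0.3, 0.4, 0.1, 0.1, 0.1],
    [0.3, 0.2, 0.1, 0.1, 0.3],
    [0.1, 0.1, 0.1, 0.3, 0.4],
    [0.2, 0.2, 0.1, 0.2, 0.3],
]
spans = {"covid 19": (0, 2), "causes": (2, 3), "severe pneumonia": (3, 5)}

scores = phrase_scores(attn, spans)
edges = phrase_edges(attn, spans)
ranked = sorted(scores, key=scores.get, reverse=True)  # most important phrase first
```

Here `ranked` orders the candidate phrases by importance, and `edges` holds the weights from which a query relevancy graph could be drawn. Other aggregation rules, such as using a single head or layer, or taking the maximum instead of the mean, would fit the same structure.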
Index Terms
- Attention-based Unsupervised Keyphrase Extraction and Phrase Graph for COVID-19 Medical Literature Retrieval