DOI: 10.1145/3583780.3615036
research-article

Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking

Published: 21 October 2023

Abstract

Discovering entity mentions in text that have no corresponding entry in a Knowledge Base (KB) plays a critical role in KB maintenance, but has not yet been fully explored. Current methods are mostly limited to simple threshold-based approaches and feature-based classification, and evaluation datasets are scarce. We propose BLINKout, a new BERT-based Entity Linking (EL) method that identifies mentions without corresponding KB entities by matching them to a special NIL entity. To better utilize BERT, we propose new techniques including NIL entity representation and classification, with synonym enhancement. We also apply KB Pruning and Versioning strategies to automatically construct out-of-KB datasets from common in-KB EL datasets. Results on five datasets of clinical notes, biomedical publications, and Wikipedia articles in various domains show the advantages of BLINKout over existing methods in identifying out-of-KB mentions for the medical ontologies UMLS and SNOMED CT and for the general KB WikiData.
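
To make the ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It contrasts a threshold-based NIL decision with treating NIL as one more candidate entity, and shows one plausible reading of KB pruning: remove some entities from the KB and relabel their mentions as out-of-KB. All names (`Mention`, `prune_kb`, `link_with_threshold`, `link_with_nil_candidate`), the entity IDs, and the scores are hypothetical; in a real system the scores would come from a trained model such as a BERT-based ranker.

```python
from dataclasses import dataclass

NIL = "NIL"  # hypothetical label meaning "no matching KB entity"


@dataclass(frozen=True)
class Mention:
    text: str
    gold_entity: str  # entity ID from the original in-KB dataset


def prune_kb(kb, mentions, removed):
    """Assumed reading of KB pruning: drop the entities in `removed`
    and relabel mentions that pointed to them as NIL (out-of-KB)."""
    pruned = {eid: name for eid, name in kb.items() if eid not in removed}
    relabeled = [
        Mention(m.text, m.gold_entity if m.gold_entity in pruned else NIL)
        for m in mentions
    ]
    return pruned, relabeled


def link_with_threshold(scores, threshold):
    """Threshold baseline: predict NIL when the best in-KB score is too low."""
    if not scores:
        return NIL
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else NIL


def link_with_nil_candidate(scores, nil_score):
    """NIL-as-entity: NIL competes with the in-KB candidates in one ranking."""
    candidates = dict(scores)
    candidates[NIL] = nil_score
    return max(candidates, key=candidates.get)


# Toy data; IDs, names, and scores are made up for illustration.
kb = {"E1": "asthma", "E2": "long COVID", "E3": "influenza"}
mentions = [Mention("asthma attack", "E1"), Mention("post-COVID syndrome", "E2")]

pruned_kb, out_of_kb_data = prune_kb(kb, mentions, removed={"E2"})
print(sorted(pruned_kb))                             # ['E1', 'E3']
print([m.gold_entity for m in out_of_kb_data])       # ['E1', 'NIL']

scores = {"E1": 0.42, "E3": 0.31}                    # pretend model scores
print(link_with_threshold(scores, threshold=0.5))    # NIL
print(link_with_nil_candidate(scores, nil_score=0.6))  # NIL
```

The threshold variant needs a cut-off tuned separately from the ranker, whereas the NIL-candidate variant lets the NIL score be learned alongside the in-KB scores, which is the general direction the abstract describes.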

Supplementary Material

MP4 File (1685-video.mp4)
Presentation video on Out-of-Knowledge-Base Mention Discovery with Entity Linking

Published In

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
October 2023
5508 pages
ISBN: 9798400701245
DOI: 10.1145/3583780
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. WikiData
  2. biomedical ontologies
  3. entity linking
  4. knowledge base enrichment
  5. language models

Qualifiers

  • Research-article

Conference

CIKM '23
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 131
  • Downloads (Last 6 weeks): 12

Reflects downloads up to 03 Mar 2025

Cited By

  • (2025) Musical heritage historical entity linking. Artificial Intelligence Review 58:5. DOI: 10.1007/s10462-024-11102-9. Online publication date: 20-Feb-2025
  • (2024) Use of SNOMED CT in Large Language Models: Scoping Review. JMIR Medical Informatics 12 (e62924). DOI: 10.2196/62924. Online publication date: 7-Oct-2024
  • (2024) Taxonomy Completion via Implicit Concept Insertion. Proceedings of the ACM Web Conference 2024 (2159-2169). DOI: 10.1145/3589334.3645584. Online publication date: 13-May-2024
  • (2024) A Language Model Based Framework for New Concept Placement in Ontologies. The Semantic Web (79-99). DOI: 10.1007/978-3-031-60626-7_5. Online publication date: 26-May-2024
  • (2023) Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (5316-5320). DOI: 10.1145/3583780.3615126. Online publication date: 21-Oct-2023
