
A Language Model Based Framework for New Concept Placement in Ontologies

  • Conference paper
The Semantic Web (ESWC 2024)

Abstract

We investigate the task of inserting new concepts extracted from texts into an ontology using language models. We explore an approach with three steps: edge search, which finds a set of candidate locations for insertion (i.e., subsumptions between concepts); edge formation and enrichment, which leverages the ontological structure to produce and enhance the edge candidates; and edge selection, which eventually locates the edge to be placed into. In all steps, we propose to leverage neural methods: we apply embedding-based methods and contrastive learning with Pre-trained Language Models (PLMs) such as BERT for edge search, and adapt a BERT fine-tuning-based multi-label Edge-Cross-encoder, as well as Large Language Models (LLMs) such as the GPT series, FLAN-T5, and Llama 2, for edge selection. We evaluate the methods on recent datasets created using the SNOMED CT ontology and the MedMentions entity linking benchmark. The best settings in our framework use a fine-tuned PLM for search and a multi-label Cross-encoder for selection. Zero-shot prompting of LLMs is still not adequate for the task, and we propose explainable instruction tuning of LLMs for improved performance. Our study shows the advantages of PLMs and highlights the encouraging performance of LLMs, motivating future studies.
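As a rough illustration of the embedding-based edge search step (a minimal sketch with an assumed PLM checkpoint and an assumed edge verbalisation template, not the implementation released with the paper, see footnote 1), the mention and verbalised candidate edges ⟨parent, child⟩ can be encoded with a PLM and ranked by cosine similarity:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(texts):
        # Mean-pool the last hidden states to obtain one vector per text.
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1)

    mention = "chronic renal impairment"
    # Hypothetical candidate edges <parent, child>, verbalised as text.
    edges = [("Kidney disease", "Chronic kidney disease stage 3"),
             ("Disorder of abdomen", "Kidney stone")]
    edge_texts = [f"{child} is a {parent}" for parent, child in edges]

    scores = torch.nn.functional.cosine_similarity(embed([mention]), embed(edge_texts))
    ranking = scores.argsort(descending=True)  # ranked candidates passed on to edge selection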


Notes

  1. Our implementation of the methods and experiments is available at https://github.com/KRR-Oxford/LM-ontology-concept-placement.

  2. We focus on the common case in which only the parent can be a complex concept, as in the explicit axioms of the SNOMED CT ontology.

  3. This means that Neurocognitive Impairment belongs to a role group [25], i.e., a grouping of the characteristic of being caused by (“due to”) a disease.
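    As an illustration (an assumed rendering, not the exact SNOMED CT axiom), such a statement can be written in description logic as \(\textit{Neurocognitive impairment} \sqsubseteq \exists \textit{roleGroup}.(\exists \textit{dueTo}.\textit{Disease})\).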

  4. https://zenodo.org/records/10432003.

  5. We also investigated k up to 300: while the insertion rate at k improves, the overall results after edge selection are worse than with smaller k values such as 10 and 50, and a larger k also leads to substantially longer running times for edge enrichment and selection.

  6. https://platform.openai.com/docs/models/gpt-3-5.

  7. https://huggingface.co/docs/trl/sft_trainer.

  8. More details on experimental settings and time usage are in Appendix 1.

References

  1. Baader, F., Horrocks, I., Lutz, C., Sattler, U.: A Basic Description Logic, pp. 10–49. Cambridge University Press, Cambridge (2017). https://doi.org/10.1017/9781139025355.002

  2. Baader, F., Horrocks, I., Lutz, C., Sattler, U.: Ontology Languages and Applications, pp. 205–227. Cambridge University Press, Cambridge (2017). https://doi.org/10.1017/9781139025355.008

  3. Chen, J., et al.: Knowledge graphs for the life sciences: recent developments, challenges and opportunities. arXiv preprint arXiv:2309.17255 (2023)

  4. Chen, J., He, Y., Geng, Y., Jiménez-Ruiz, E., Dong, H., Horrocks, I.: Contextual semantic embeddings for ontology subsumption prediction. World Wide Web, pp. 1–23 (2023)

  5. Chung, H.W., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)

  6. Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314 (2023)

  7. Dong, H., Chen, J., He, Y., Horrocks, I.: Ontology enrichment from texts: a biomedical dataset for concept discovery and placement. In: Proceedings of the 32nd ACM International Conference on Information & Knowledge Management. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3583780.3615126

  8. Dong, H., Chen, J., He, Y., Liu, Y., Horrocks, I.: Reveal the unknown: out-of-knowledge-base mention discovery with entity linking. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 452–462. CIKM ’23, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3583780.3615036

  9. Gao, Y., et al.: Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997 (2023)

  10. Gibaja, E., Ventura, S.: A tutorial on multilabel learning. ACM Comput. Surv. 47(3) (2015). https://doi.org/10.1145/2716262

  11. Glauer, M., Memariani, A., Neuhaus, F., Mossakowski, T., Hastings, J.: Interpretable ontology extension in chemistry. Semantic Web (Pre-press), 1–22 (2023)

  12. Grau, B.C., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P., Sattler, U.: OWL 2: the next step for OWL. J. Web Semant. 6(4), 309–322 (2008). Semantic Web Challenge 2006/2007

  13. Gu, Y., et al.: Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3(1) (2021). https://doi.org/10.1145/3458754

  14. He, Y., Chen, J., Dong, H., Horrocks, I.: Exploring large language models for ontology alignment. arXiv preprint arXiv:2309.07172 (2023)

  15. He, Y., et al.: DeepOnto: a Python package for ontology engineering with deep learning. arXiv preprint arXiv:2307.03067 (2023)

  16. He, Y., Chen, J., Jimenez-Ruiz, E., Dong, H., Horrocks, I.: Language model analysis for ontology subsumption inference. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Findings of the Association for Computational Linguistics: ACL 2023, pp. 3439–3453. Association for Computational Linguistics, Toronto, Canada, July 2023. https://doi.org/10.18653/v1/2023.findings-acl.213, https://aclanthology.org/2023.findings-acl.213

  17. Hertling, S., Paulheim, H.: Transformer based semantic relation typing for knowledge graph integration. In: Pesquita, C., et al. (eds.) The Semantic Web. ESWC 2023. LNCS, vol. 13870, pp. 105–121. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-33455-9_7

  18. Jurafsky, D., Martin, J.H.: Speech and Language Processing (3rd Edition) (2023). Online

  19. Kudo, T., Richardson, J.: SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: Blanco, E., Lu, W. (eds.) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71. Association for Computational Linguistics, Brussels, Belgium, November 2018. https://doi.org/10.18653/v1/D18-2012, https://aclanthology.org/D18-2012

  20. Liu, F., Shareghi, E., Meng, Z., Basaldella, M., Collier, N.: Self-alignment pretraining for biomedical entity representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4228–4238. Association for Computational Linguistics, Online, June 2021. https://doi.org/10.18653/v1/2021.naacl-main.334

  21. Liu, H., Perl, Y., Geller, J.: Concept placement using BERT trained by transforming and summarizing biomedical ontology structure. J. Biomed. Inform. 112(C) (2020)

  22. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Association for Computational Linguistics, Hong Kong, China, November 2019. https://doi.org/10.18653/v1/D19-1410

  23. Ruas, P., Couto, F.M.: NILINKER: attention-based approach to NIL entity linking. J. Biomed. Inform. 104137 (2022). https://doi.org/10.1016/j.jbi.2022.104137, https://www.sciencedirect.com/science/article/pii/S1532046422001526

  24. Shen, W., Wang, J., Han, J.: Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Trans. Knowl. Data Eng. 27(2), 443–460 (2014)

  25. Spackman, K.A., Dionne, R., Mays, E., Weis, J.: Role grouping as an extension to the description logic of Ontylog, motivated by concept modeling in SNOMED. In: Proceedings of the AMIA Symposium, p. 712. American Medical Informatics Association (2002)

  26. Touvron, H., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  27. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017)

  28. Veseli, B., Singhania, S., Razniewski, S., Weikum, G.: Evaluating language models for knowledge base completion. In: Pesquita, C., et al. (eds.) The Semantic Web. ESWC 2023. LNCS, vol. 13870, pp. 227–243. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-33455-9_14

  29. Wang, Q., Gao, Z., Xu, R.: Exploring the in-context learning ability of large language model for biomedical concept linking. arXiv preprint arXiv:2307.01137 (2023)

  30. Wang, S., Zhao, R., Zheng, Y., Liu, B.: QEN: applicable taxonomy completion via evaluating full taxonomic relations. In: Proceedings of the ACM Web Conference 2022, pp. 1008–1017. WWW ’22, Association for Computing Machinery, New York, NY, USA (2022). https://github.com/sheryc/QEN

  31. Wu, L., Petroni, F., Josifoski, M., Riedel, S., Zettlemoyer, L.: Scalable zero-shot entity linking with dense entity retrieval. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6397–6407. Association for Computational Linguistics, Online, November 2020. https://doi.org/10.18653/v1/2020.emnlp-main.519

  32. Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133–138. Association for Computational Linguistics, Las Cruces, New Mexico, USA, June 1994. https://doi.org/10.3115/981732.981751, https://aclanthology.org/P94-1019

  33. Zeng, Q., Lin, J., Yu, W., Cleland-Huang, J., Jiang, M.: Enhancing taxonomy completion with concept generation via fusing relational representations. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2104–2113. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3447548.3467308, https://github.com/DM2-ND/GenTaxo

  34. Zhang, J., Song, X., Zeng, Y., Chen, J., Shen, J., Mao, Y., Li, L.: Taxonomy completion via triplet matching network. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4662–4670. AAAI Press, Palo Alto, California, USA (2021). https://github.com/JieyuZ2/TMN

  35. Zhao, W.X., et al.: A survey of large language models. arXiv preprint arXiv:2303.18223 (2023)

Acknowledgements

This work is supported by EPSRC projects, including ConCur (EP/V050869/1), OASIS (EP/S032347/1), UK FIRES (EP/S019111/1); and Samsung Research UK (SRUK).

Author information

Corresponding author

Correspondence to Hang Dong.

Appendices

Appendix 1: Detailed model settings and time usage

The approaches are implemented using PyTorch and Huggingface Transformers. Edge-Bi-encoder and Edge-Cross-encoder are based on the architectures of BLINKout [8] (which in turn builds on BLINK [31]). The inverted index over ontology concepts is built with the DeepOnto library [15]. The batch sizes for Edge-Bi-encoder and Edge-Cross-encoder are 16 and 1, resp., and their fine-tuning takes 1 and 4 epochs, resp. We limit the training data for the Edge-Cross-encoder models to 200,000 rows, as this amount is sufficient for model convergence and training on more would take substantially longer. The instruction tuning of Llama-2-7B uses 4-bit quantisation and takes 3 epochs with a batch size of 4.
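The instruction tuning step can be sketched as follows with the TRL SFT trainer (footnote 7) and 4-bit quantisation; the checkpoint name, LoRA rank, learning rate, and dataset format below are assumptions rather than the paper's settings, and argument names may differ across TRL versions:

    import torch
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig, TrainingArguments)
    from peft import LoraConfig
    from trl import SFTTrainer

    # 4-bit (QLoRA-style [6]) quantisation of the base model.
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                    bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                                 quantization_config=bnb_config,
                                                 device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    tokenizer.pad_token = tokenizer.eos_token

    peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                             task_type="CAUSAL_LM")  # assumed LoRA settings
    train_args = TrainingArguments(output_dir="llama2-edge-selection",
                                   per_device_train_batch_size=4,  # batch size 4 (Appendix 1)
                                   num_train_epochs=3,             # 3 epochs (Appendix 1)
                                   learning_rate=2e-4)             # assumed

    # Hypothetical JSON file with a "text" field holding instruction, edge options, and answer.
    dataset = load_dataset("json", data_files="edge_selection_instructions.json", split="train")

    trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=dataset,
                         peft_config=peft_config, args=train_args,
                         dataset_text_field="text", max_seq_length=1024)
    trainer.train()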

Time Usage. We ran all models on an NVIDIA Quadro RTX 8000 GPU (48 GB). We report time usage estimates for MM-S14-Disease under the top-50 setting. Training the Edge-Bi-encoder took around 29 h, training the Edge-Cross-encoder around 4 h, and instruction tuning of Llama-2-7B around 16 h. Inference with fixed embeddings and with the inverted index (with edge enrichment) takes around 0.5 and 1 s per mention, resp. Inference with the Edge-Bi-encoder alone takes around 0.2 s per mention, and the full pipeline with both Edge-Bi-encoder and Edge-Cross-encoder takes around 2.3 s per mention. Prompting the explainable instruction-tuned Llama-2-7B model takes around 78 s per mention to output natural language explanations.

Appendix 2: Detailed results on edge enrichment

We applied edge formation and enrichment on top of the inverted index and the fixed embedding approach. Results in Table 4 show a substantial improvement for \(InR_{any}\) and \(InR_{all}\). Mentions to be placed into non-leaf edges are not improved with the inverted index and fixed embeddings, but are improved with the fine-tuned Edge-Bi-encoder; this is because the latter assigns a more lenient score to leaf edges and thus does not always rank them before the non-leaf edges.

Table 4. Results on edge search and enrichment (vs. not using edge enrichment) for MM-S14-Disease, under the top-50 setting. Each setting has validation and testing results, separated by a slash (/) sign. “lf” and “nlf” mean leaf and non-leaf, resp.
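A minimal sketch of how the insertion rates can be computed, under our reading of \(InR_{any}\) and \(InR_{all}\) (an assumption: the fraction of mentions for which at least one, resp. all, of the gold edges appear among the top-k candidates):

    def insertion_rates(gold_edges_per_mention, topk_edges_per_mention):
        # Each element is a list of (parent, child) edges for one mention.
        any_hits, all_hits, n = 0, 0, len(gold_edges_per_mention)
        for gold, topk in zip(gold_edges_per_mention, topk_edges_per_mention):
            any_hits += bool(set(gold) & set(topk))   # any gold edge retrieved
            all_hits += set(gold) <= set(topk)        # all gold edges retrieved
        return any_hits / n, all_hits / n

    # Hypothetical example: the second mention has two gold edges, one of which is missed.
    gold = [[("Kidney disease", "CKD stage 3")],
            [("Disorder", "X"), ("Disease", "X")]]
    topk = [[("Kidney disease", "CKD stage 3"), ("Disorder", "Y")],
            [("Disorder", "X")]]
    print(insertion_rates(gold, topk))  # (1.0, 0.5)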

Appendix 3: Qualitative examples

Examples of a non-leaf and a leaf concept placement, with prompt options, model predictions, and instruction-tuned Llama-2-7B’s explanations, are in Table 5.

Table 5. Examples of two mentions in the out-of-KB test set of MM-S14-Disease to enrich SNOMED CT 2014.09. The correct predictions are in bold. (Note: the concept Chronic kidney disease in SNOMED CT ver 2017.03 is not in ver 2014.09; it was modified from Chronic renal impairment, ID 236425005, in the older ontology.)


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dong, H., Chen, J., He, Y., Gao, Y., Horrocks, I. (2024). A Language Model Based Framework for New Concept Placement in Ontologies. In: Meroño Peñuela, A., et al. The Semantic Web. ESWC 2024. Lecture Notes in Computer Science, vol 14664. Springer, Cham. https://doi.org/10.1007/978-3-031-60626-7_5

  • DOI: https://doi.org/10.1007/978-3-031-60626-7_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-60625-0

  • Online ISBN: 978-3-031-60626-7

  • eBook Packages: Computer Science, Computer Science (R0)
