Navigating Ontology Development with Large Language Models

Conference paper

The Semantic Web (ESWC 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14664)

Abstract

Ontology engineering is a complex and time-consuming task, even with the help of current modelling environments. Often the result is error-prone unless developed by experienced ontology engineers. However, with the emergence of new tools, such as generative AI, inexperienced modellers might receive assistance. This study investigates the capability of Large Language Models (LLMs) to generate OWL ontologies directly from ontological requirements. Specifically, our research question centres on the potential of LLMs to assist human modellers by generating OWL modelling suggestions and alternatives. We experiment with several state-of-the-art models. Our methodology incorporates diverse prompting techniques, such as Chain of Thought (CoT), Graph of Thoughts (GoT), and Decomposed Prompting, along with the Zero-shot method. Results show that, currently, GPT-4 is the only model capable of providing suggestions of sufficient quality, and we also note the benefits and drawbacks of the prompting techniques. Overall, we conclude that it seems feasible to use advanced LLMs to generate OWL suggestions that are at least comparable in quality to those produced by novice human modellers. Our research is a pioneering contribution in this area, being the first to systematically study the ability of LLMs to assist ontology engineers.

Notes

  1. https://github.com/LiUSemWeb/LLMs4OntologyDev-ESWC2024.

  2. https://protege.stanford.edu/.

  3. https://allegrograph.com/topbraid-composer/.

  4. See for instance the vocabularies section of https://www.w3.org/TR/ld-bp/ or the whitepaper at https://www.nist.gov/document/nist-ai-rfi-cubrcinc002pdf for OBO ontologies.

  5. Details related to the versions and settings of these models can be found in our supplementary material.

  6. The test is passed if a query can be formulated, i.e., no test data is used, and the complexity of the queries has not been analysed so far.


Acknowledgement

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement no. 101058682 (Onto-DESIDE), and is supported by the strategic research area Security Link. The student solutions used in the research were collected as part of a master’s course taught by Assoc. Prof. Blomqvist while employed at Jönköping University.

ChatGPT was used to enhance the readability of some of the text and improve the language of this paper, after the content was first added manually. All material was then checked manually before submission.

Author information

Correspondence to Mohammad Javad Saeedizade.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Appendix

1.1 Motivations, Limitations and Negative Results

Prompt Components: As mentioned in the methodology section (Sect. 3), each prompt consists of four sections. At a quick glance, the header and story sections appear to be necessary, since they provide the brief task instruction and the story requirements; the helper and footer sections may seem optional. However, removing the helper section causes the LLM to avoid modelling reifications altogether and to misplace properties, for instance using a datatype property as the range of an object property. The helper begins by outlining strategies for establishing a taxonomy, which is otherwise often ignored by LLMs.
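
To make this structure concrete, the following is a minimal sketch of how such a four-section prompt could be assembled. The section texts are illustrative paraphrases of what is described here, not the exact prompts used in the study, and the names HEADER, HELPER, and FOOTER are our own.

```python
# Minimal sketch of the four-section prompt layout described above.
# Section texts are illustrative paraphrases, not the exact prompts
# used in the study.

HEADER = (
    "You are an experienced ontology engineer. Model the requirements "
    "below as an OWL ontology, serialised in Turtle."
)

HELPER = (
    "Begin by establishing a taxonomy of classes (rdfs:subClassOf). "
    "When a relation must carry attributes, reify it as a class. "
    "Never use a datatype property as the range of an object property."
)

FOOTER = (
    "Avoid these pitfalls: do not return empty output; answer in Turtle, "
    "not as a Python list; always include a class taxonomy; do not "
    "execute the whole plan in a single step; return code, not "
    "explanations."
)

def build_prompt(story: str) -> str:
    """Assemble the header, story, helper, and footer sections."""
    return "\n\n".join([HEADER, story, HELPER, FOOTER])
```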

The footer, or pitfall section, also enhances the output significantly. It presents the LLM with common mistakes that it tends to produce and instructs it to avoid them: (1) returning empty output for the given prompt; (2) answering with a list of items in Python syntax instead of using Turtle syntax; (3) providing OWL output without establishing any taxonomy of classes; (4) in the thought-based prompting techniques, executing the complete plan (several steps) at the current step, since LLMs can ignore instructions and give the complete answer in the first step; and (5) providing explanations instead of code.
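
Several of these pitfalls (empty output, non-Turtle output, missing taxonomy) can be detected mechanically. Below is a minimal sketch using rdflib; this is our illustration of such a check, not tooling described in the paper.

```python
from rdflib import Graph
from rdflib.namespace import RDFS

def detect_pitfalls(candidate: str) -> list[str]:
    """Flag pitfalls (1)-(3) in a candidate LLM answer."""
    if not candidate.strip():
        return ["pitfall 1: empty output"]
    g = Graph()
    try:
        g.parse(data=candidate, format="turtle")
    except Exception:
        # Covers pitfall 2, e.g. a Python-style list instead of Turtle.
        return ["pitfall 2: output is not parseable Turtle"]
    problems = []
    if not list(g.triples((None, RDFS.subClassOf, None))):
        problems.append("pitfall 3: no class taxonomy (rdfs:subClassOf)")
    return problems
```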

Ontology Design Patterns (ODPs) serve as guides for ontology engineers when modelling an ontology. However, adding ODP examples to the prompts seemed to degrade output quality. Even when the prompt and story fit within the context window, the 32K-context LLMs tended to forget the ontology story (we also tried the 128K-context GPT-4 Turbo model, and it failed as well). This could be because the large context distracts current LLMs, possibly due to limitations of their attention layers. We use the term “distraction” because the model starts modelling the ODPs themselves in the output instead of addressing the given task.

Limitations: This study, while insightful, has several limitations. Our choice of evaluation method was also influenced by the time constraints faced by the human experts who manually evaluated the outputs. While this approach was necessary given the available resources, it may not capture the full depth and nuance of the LLM-generated ontologies, compared to a more thorough, albeit time-consuming, manual evaluation.

Due to their extensive branching, the tree of thoughts and the full version of the graph of thoughts techniques proved expensive. This complexity led to slower processing times and increased costs, limiting their practicality for larger-scale or time-sensitive applications.

We used the Microsoft Azure API to access GPT-3.5 and GPT-4, the 0613 versions, trained on data up to 2021. Consequently, our analysis does not reflect any advancements or updates to these models after that point, including the seed feature introduced in newer versions. This may limit the relevance of our findings with respect to the latest LLM capabilities. Access to hyperparameters in GPT-4 and GPT-3.5 is also limited, which presented challenges in our experiments. Despite setting the temperature and penalty parameters to zero (except in plan generation for GoT and CoT-SC, where they were set to 0.5), we observed inconsistencies in the outcomes for identical prompts. This variability underscores the value of open-source LLMs for achieving more consistent and reproducible behaviour, rather than depending on such unpredictable factors.
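
For reference, a chat-completion call with these settings could look as follows through the Azure OpenAI Python client; the endpoint, deployment name, and prompt below are placeholders, not values from the study.

```python
import os
from openai import AzureOpenAI

# Placeholder endpoint and deployment name; substitute your own.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",
    azure_endpoint="https://example-resource.openai.azure.com",
)

prompt = "Model the following requirements as an OWL ontology in Turtle: ..."

response = client.chat.completions.create(
    model="gpt-4",               # Azure deployment name (placeholder)
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,             # 0.5 only for plan generation in GoT/CoT-SC
    frequency_penalty=0.0,
    presence_penalty=0.0,
)
print(response.choices[0].message.content)
```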

We faced another setback in our attempts to produce more efficient OWL code, either to reduce context size or to improve the modelling in general. For example, in CQbyCQ, when a CQ is addressed, we simply merge its output with that of the previous CQs, instead of asking the LLM to merge it only if the CQ has not already been addressed. This choice was made because LLMs often forgot to merge classes (or properties) from the previous output, which resulted in incomplete models.
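
A minimal sketch of this merge-by-concatenation strategy, assuming a hypothetical ask_llm helper that wraps a chat-completion call such as the one above:

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical helper wrapping a chat-completion call."""
    raise NotImplementedError

def cq_by_cq(story: str, cqs: list[str]) -> str:
    """Address one competency question (CQ) at a time, carrying the
    Turtle produced so far, and merge outputs by concatenation."""
    ontology = ""
    for cq in cqs:
        prompt = (
            f"Story:\n{story}\n\n"
            f"Ontology so far (Turtle):\n{ontology}\n\n"
            "Extend the ontology so it can answer this competency "
            f"question, and output only the new Turtle:\n{cq}"
        )
        # Concatenate rather than asking the LLM to merge selectively:
        # selective merging often dropped previously modelled classes
        # and properties, yielding incomplete models.
        ontology += "\n" + ask_llm(prompt)
    return ontology
```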

Lastly, we encountered another challenge when experimenting with few-shot prompting techniques, in which a few worked examples are included in the prompt. We had difficulty finding ontology modelling examples that were not too similar to the ontology story, since overly similar examples could effectively hand the answer to the LLM. At the same time, this setup risks the same problem we observed earlier with ODPs (LLM distraction due to the large context size).
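
For concreteness, a few-shot chat prompt can be sketched as alternating example turns; the library example below is an invented placeholder, chosen to be dissimilar from the actual story.

```python
# Sketch of a few-shot prompt: one worked (story -> Turtle) pair
# followed by the real task. All example texts are invented.
few_shot_messages = [
    {"role": "system",
     "content": "You are an experienced ontology engineer. Answer in Turtle."},
    # Worked example; must be dissimilar enough from the real story
    # that it does not leak the answer.
    {"role": "user",
     "content": "Story: A library lends books to registered members."},
    {"role": "assistant",
     "content": ("@prefix : <http://example.org/lib#> .\n"
                 "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
                 ":Book a owl:Class .\n"
                 ":Member a owl:Class .\n"
                 ":lends a owl:ObjectProperty .")},
    # The real task goes in the final user turn.
    {"role": "user", "content": "Story: <the actual ontology story>"},
]
```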

1.2 Initial Experiment Result Details

Due to space limitations, we could not present all details of the initial experiment in the main paper body, only a summary of the conclusions. The detailed results of phase 2 of the initial experiment are instead presented here. Table 2 shows the LLM-prompting scores, averaged over the three tasks and eight criteria; a threshold of 0.9 was chosen as the passing mark.

Table 2. After phase two of the initial experiment, it was decided that CoT, CoT-SC, CQbyCQ, and GoT would move to the next stage (score > 0.9). GPT-3.5 was excluded, as its performance was found to be equal to or lower than that of GPT-4.
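
As a sanity check on how the pass criterion operates as we read it, a trivial sketch:

```python
def passes(scores: list[float], threshold: float = 0.9) -> bool:
    """Average the per-task, per-criterion scores (3 tasks x 8 criteria
    = 24 values) and compare against the pass threshold."""
    return sum(scores) / len(scores) > threshold
```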


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Saeedizade, M.J., Blomqvist, E. (2024). Navigating Ontology Development with Large Language Models. In: Meroño Peñuela, A., et al. The Semantic Web. ESWC 2024. Lecture Notes in Computer Science, vol 14664. Springer, Cham. https://doi.org/10.1007/978-3-031-60626-7_8

  • DOI: https://doi.org/10.1007/978-3-031-60626-7_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-60625-0

  • Online ISBN: 978-3-031-60626-7

  • eBook Packages: Computer Science, Computer Science (R0)
