Abstract
Ontology engineering is a complex and time-consuming task, even with the help of current modelling environments, and the result is often error-prone unless developed by experienced ontology engineers. However, with the emergence of new tools, such as generative AI, inexperienced modellers might receive assistance. This study investigates the capability of Large Language Models (LLMs) to generate OWL ontologies directly from ontological requirements. Specifically, our research question centres on the potential of LLMs to assist human modellers by generating OWL modelling suggestions and alternatives. We experiment with several state-of-the-art models. Our methodology incorporates diverse prompting techniques, such as Chain of Thoughts (CoT), Graph of Thoughts (GoT), and Decomposed Prompting, alongside the Zero-shot method. Results show that, currently, GPT-4 is the only model capable of providing suggestions of sufficient quality; we also note the benefits and drawbacks of the prompting techniques. Overall, we conclude that it seems feasible to use advanced LLMs to generate OWL suggestions that are at least comparable in quality to those of novice human modellers. Our research is a pioneering contribution in this area, being the first to systematically study the ability of LLMs to assist ontology engineers.
Notes
- 1.
- 2.
- 3.
- 4.
See for instance the vocabularies section of https://www.w3.org/TR/ld-bp/ or the whitepaper at https://www.nist.gov/document/nist-ai-rfi-cubrcinc002pdf for OBO ontologies.
- 5.
Details related to the versions and settings of these models can be found in our supplementary material.
- 6.
The test is passed if a query can be formulated, i.e., no test data is used, and the complexity of the queries has not been analysed so far.
Acknowledgement
This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement no. 101058682 (Onto-DESIDE), and is supported by the strategic research area Security Link. The student solutions used in the research were collected as part of a master’s course taught by Assoc. Prof. Blomqvist while employed at Jönköping University.
ChatGPT was used to enhance the readability of some of the text and improve the language of this paper, after the content had first been written manually. All material was then checked manually before submission.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Appendix
1.1 Motivations, Limitations and Negative Results
Prompt Components: As mentioned in the methodology (Sect. 3), each prompt consists of four sections. At a quick glance, the header and story sections appear to be necessary, since they provide the brief task instruction and the story requirements, while the helper and footer sections may be considered optional. However, removing the helper section causes the LLM to avoid modelling reifications altogether and to misplace properties, for example by using a datatype property as the range of an object property. The helper begins by outlining strategies for establishing a taxonomy, which is otherwise often ignored by LLMs.
The footer, or pitfall section, also improves the output significantly. It presents the LLM with common mistakes that it tends to produce. The pitfalls to avoid include: (1) returning empty output for the given prompt; (2) ignoring Turtle syntax and instead returning a list of items in Python syntax; (3) producing OWL output without establishing any taxonomy of classes; (4) in the thought-based prompting techniques, executing the complete plan (several steps) at the current step, since LLMs can ignore instructions and give the complete answer at the first step; and (5) providing explanations instead of code.
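To make the four-section layout concrete, the following is a minimal sketch of how such a prompt could be assembled. The section texts are hypothetical paraphrases of the kind of content described above, not the exact prompts used in our experiments.

```python
# Hypothetical sketch of the four-section prompt layout (header, story, helper, footer).
# The section texts are illustrative paraphrases, not the prompts used in the paper.

HEADER = (
    "You are an ontology engineer. Model the requirements below as an OWL "
    "ontology and return it in Turtle syntax."
)

HELPER = (
    "First establish a taxonomy of classes. Use reification where a relation "
    "needs its own attributes. Never use a datatype property as the range of "
    "an object property."
)

FOOTER = (
    "Avoid these pitfalls: do not return empty output; do not replace Turtle "
    "with a Python list of items; do not omit the class taxonomy; do not "
    "execute the whole plan in a single step; return code, not explanations."
)

def build_prompt(story: str) -> str:
    """Concatenate the four prompt sections around the ontology story."""
    return "\n\n".join([HEADER, story, HELPER, FOOTER])

if __name__ == "__main__":
    print(build_prompt("A museum wants to describe its exhibits, artists and loans ..."))
```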
Ontology Design Patterns (ODPs) serve as guides for ontology engineers when modelling an ontology. However, adding ODP examples to the prompts seems to degrade output quality. Even though the prompt and story fit within the 32K context window, the LLMs tend to forget the ontology story (we also tried the 128K-context GPT-4-turbo model, and it failed as well). This could be because the large context distracts current LLMs, possibly due to limitations of their attention layers. We use the term “distraction” because the model starts modelling the ODPs themselves in the output instead of addressing the given task.
Limitations: This study, while insightful, has several limitations. Our choice of evaluation method was also influenced by the time constraints faced by the human experts in manually evaluating the outputs. While this approach was necessary given the available resources, it may not capture the full depth and nuance of the LLM-generated ontologies, compared to a more thorough, albeit more time-consuming, manual evaluation.
Due to their extensive branching, the Tree of Thoughts technique and the full version of the Graph of Thoughts technique proved expensive. This complexity led to slower processing times and increased costs, limiting their practicality for larger-scale or time-sensitive applications.
We used the Microsoft Azure API to access GPT-3.5 and GPT-4, version 0613, trained on data up to 2021. Consequently, our analysis did not consider any advancements or updates in these models post-2021, including the seed feature introduced in newer releases, which might limit the relevance of our findings with respect to the latest LLM capabilities. The limited access to hyperparameters in GPT-4 and GPT-3.5 also presented challenges in our experiments. Despite setting the temperature and penalty parameters to zero (except in plan generation for GoT and CoT-SC, where the temperature was set to 0.5), we observed inconsistencies in the outcomes when using identical prompts. This variability underscores the value of open-source LLMs for achieving more consistent and reliable performance, rather than depending on unpredictable factors.
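As an illustration of the parameter settings described above, the sketch below calls a GPT model through the Azure OpenAI API with temperature and penalties set to zero. The endpoint, deployment name, and API version are placeholders, not our actual configuration; as noted, identical prompts may still yield different outputs.

```python
# Illustrative sketch only: endpoint, deployment name and API version are placeholders,
# not the configuration used in the experiments.
import os
from openai import AzureOpenAI  # openai>=1.0 Python package

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_version="2023-07-01-preview",                           # placeholder
)

prompt = "Model the following ontology story as an OWL ontology in Turtle: ..."

response = client.chat.completions.create(
    model="gpt-4-0613",        # placeholder deployment name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,             # 0 everywhere, except 0.5 for plan generation in GoT / CoT-SC
    frequency_penalty=0,
    presence_penalty=0,
)
print(response.choices[0].message.content)
```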
We faced another setback in our attempts to produce more compact OWL code, whether to reduce the context size or to improve the modelling in general. For example, in CQbyCQ, once a CQ has been addressed, we simply merge its output with that of the previous CQs ourselves, instead of asking the LLM to check whether the CQ has already been covered and to merge accordingly. This choice was made because LLMs often forgot to merge classes (or properties) from the previous output, which resulted in incomplete models.
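The sketch below illustrates the kind of merging we perform outside the LLM in CQbyCQ: the Turtle fragment produced for each CQ is parsed and its triples accumulated in a running graph. The function and variable names are ours, for illustration only, and are not taken from the paper's implementation.

```python
# Illustrative sketch of merging per-CQ Turtle fragments outside the LLM;
# names are ours, not from the paper's implementation.
from rdflib import Graph

def merge_cq_outputs(turtle_fragments):
    """Parse the Turtle produced for each CQ and accumulate all triples in one graph."""
    merged = Graph()
    for fragment in turtle_fragments:
        part = Graph()
        part.parse(data=fragment, format="turtle")
        merged += part  # rdflib adds every triple of `part` to `merged`
    return merged.serialize(format="turtle")
```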
Lastly, we encountered a further challenge when experimenting with few-shot prompting techniques, in which a few worked examples are provided to the LLM as part of the prompt. We found it difficult to select examples of ontology modelling that were not too similar to the ontology story, as such examples could effectively hand the answer to the LLM. Moreover, this setting risks the same effect as described above for ODPs (LLM distraction due to the large context size).
1.2 Initial Experiment Result Details
Due to space limitations, we were not able to present all details of the initial experiment in the main body of the paper, only a summary of the conclusions. The detailed results of the initial experiment (phase 2) are therefore reported here. Table 2 presents the LLM-prompting scores, averaged over the three tasks and the 8 criteria, with a threshold of 0.9 chosen as the pass level.
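As an illustration of how such averaged scores and the pass threshold relate, the sketch below averages hypothetical per-task, per-criterion scores and applies the 0.9 threshold; the score values shown are made up and only the averaging scheme and threshold follow the text above.

```python
# Hypothetical score values; only the averaging scheme and the 0.9 pass threshold
# follow the description in the text.
scores = {            # scores[task] = list of 8 criterion scores in [0, 1]
    "task1": [1.0, 1.0, 1.0, 0.5, 1.0, 1.0, 1.0, 1.0],
    "task2": [1.0, 1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0],
    "task3": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}

all_scores = [s for task_scores in scores.values() for s in task_scores]
average = sum(all_scores) / len(all_scores)
passed = average >= 0.9   # the pass threshold used in the initial experiment
print(f"average = {average:.3f}, passed = {passed}")
```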
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Saeedizade, M.J., Blomqvist, E. (2024). Navigating Ontology Development with Large Language Models. In: Meroño Peñuela, A., et al. The Semantic Web. ESWC 2024. Lecture Notes in Computer Science, vol 14664. Springer, Cham. https://doi.org/10.1007/978-3-031-60626-7_8
DOI: https://doi.org/10.1007/978-3-031-60626-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-60625-0
Online ISBN: 978-3-031-60626-7
eBook Packages: Computer Science, Computer Science (R0)