Using terms and informal definitions to classify domain entities into top-level ontology concepts: An approach based on language models
Introduction
The development of domain ontologies based on top-level ontology concepts has proved valuable in many domains, such as geology [1], [2] and biomedicine [3], [4], [5]. One reason is that different domain ontologies developed under the same top-level ontology become semantically interoperable because their domain entities specialize in the same top-level concepts. However, identifying which top-level concept a domain entity specializes in is laborious and time-consuming because it is usually performed manually and requires a high level of expertise in both the target domain and ontology engineering [6].
In this work, we propose an approach for classifying domain entities into top-level concepts by combining the terms representing the domain entities with their informal definitions. Our approach uses terms and informal definitions because they provide the intended meaning of the domain entities in a particular domain and are available early in the ontology development process. Thus, we combine both into a single text sentence, which we feed into a deep neural architecture that includes a pre-trained language model as a layer and outputs the predicted top-level concept. In addition, this work proposes a methodology for extracting two novel datasets from the OntoWordNet ontology and the Dolce-Lite and Dolce-Lite-Plus top-level ontologies. Each resulting dataset contains 120,489 instances, where each instance comprises the term representing the OntoWordNet entity, its informal definition, and its respective Dolce-Lite or Dolce-Lite-Plus top-level concept as the target class.
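The combination step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the separator token and function name are assumptions, since the paper only states that the term and its informal definition are combined into a single text sentence before being fed to the language model.

```python
def build_input(term: str, definition: str, sep: str = " [SEP] ") -> str:
    """Combine a term and its informal definition into one text sentence.

    The "[SEP]" separator is a common convention for BERT-style models,
    but it is an assumption here; the paper does not specify the format.
    """
    return f"{term.strip()}{sep}{definition.strip()}"

# Hypothetical example: an OntoWordNet-style term with its gloss.
sentence = build_input("granite", "a coarse-grained intrusive igneous rock")
```

The resulting sentence is then tokenized and passed through the pre-trained language model layer of the classification architecture.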
In our experiments, we evaluated eight freely available transformer-based language models within the proposed deep neural architecture: BERT Tiny, Mini, Small, Medium, and Base [7]; RoBERTa Base [8]; ALBERT Base [9]; and ELECTRA Small [10]. We fine-tuned each model for our classification task during the training stage using an unbalanced, stratified sample. The experimental results suggest that combining the term representing the domain entity and its informal definition with language models yields promising results. In addition, the experiments show that the BERT-Base model achieves the best overall scores, with 94% and 87% micro F1-score using the Dolce-Lite-Plus and Dolce-Lite datasets, respectively.
The paper is organized as follows. In Section 2, we present the background notions that support this proposal, revisiting the OntoWordNet ontology, the top-level ontologies of Dolce-Lite-Plus and Dolce-Lite, and the main proposals on language models and ontology learning. In Section 3, we describe the methodology for extracting datasets from existing domain ontologies and the proposed approach for classifying domain entities into top-level concepts using terms, informal definitions and language models. In Section 4, we show the experiments performed using the two extracted datasets and the obtained results. Finally, in Section 5, we present the concluding remarks of our work.
Section snippets
Related works
This section describes core notions that are relevant to the proposed work. Firstly, we discuss the state of the art in transformer-based language models. After that, we describe the OntoWordNet ontology and the Dolce-Lite and Dolce-Lite-Plus top-level ontologies. In this work, we use these top-level ontologies and the OntoWordNet ontology to develop the datasets considered in our experiments. Finally, we present the current approaches that use text sentences as input for classifying domain entities into top-level concepts.
Proposed approach
This section describes the two main contributions of this work. Firstly, we describe the extraction of two novel datasets based on the OntoWordNet ontology and the Dolce-Lite and Dolce-Lite-Plus top-level ontologies. After that, we present the proposed deep neural network architecture for classifying domain entities into top-level concepts using their terms and informal definitions.
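Each dataset instance extracted by the methodology pairs a term and its informal definition with a top-level concept as the target class. A minimal sketch of this record structure is shown below; the class and field names are assumptions for illustration, not the authors' actual schema.

```python
from dataclasses import dataclass


@dataclass
class DatasetInstance:
    """One instance of the extracted datasets (hypothetical schema).

    Each of the two datasets (Dolce-Lite and Dolce-Lite-Plus) contains
    120,489 such instances, differing only in the target concept.
    """
    term: str         # term representing the OntoWordNet entity
    definition: str   # its informal definition
    top_concept: str  # target Dolce-Lite or Dolce-Lite-Plus top-level concept


# Hypothetical example instance.
example = DatasetInstance(
    term="granite",
    definition="a coarse-grained intrusive igneous rock",
    top_concept="physical-object",
)
```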
Evaluation
This section describes the experiments performed to evaluate the proposed approach for classifying domain entities into concepts specified by a top-level ontology. Firstly, we present the general settings adopted for each evaluated model in our experiments and the evaluation metrics applied. After that, we describe the two experiments using the proposed datasets and the obtained results.
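The experiments report micro F1-score. For single-label multi-class classification, as here, micro-averaged precision and recall both count every prediction exactly once, so micro F1 reduces to accuracy. A minimal sketch of the metric under that assumption:

```python
def micro_f1(y_true, y_pred):
    """Micro F1 for single-label multi-class predictions.

    With one label per instance, summed true positives equal the number of
    correct predictions, and summed false positives equal summed false
    negatives, so micro-precision = micro-recall = micro-F1 = accuracy.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

In practice, a library implementation such as scikit-learn's `f1_score(..., average="micro")` would typically be used instead of hand-rolling the metric.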
Conclusion
This work described the application of a deep neural network architecture based on pre-trained language models to classify domain entities into concepts specified by top-level ontologies. The proposed architecture takes as input text samples that combine the term representing a given domain entity and its informal definition. In addition, we developed two novel datasets for evaluating approaches to this problem. We achieved our best result using the BERT-Base model, with 94% average micro F1-score on the Dolce-Lite-Plus dataset.
CRediT authorship contribution statement
Alcides Lopes: Conceptualization, Methodology, Software, Writing – original draft. Joel Carbonera: Methodology, Writing – review & editing. Daniela Schmidt: Writing – review & editing. Luan Garcia: Writing – review & editing. Fabricio Rodrigues: Writing – review & editing. Mara Abel: Writing – review & editing, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by Petrobras, Brazil. It is also partially financed by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil and the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
References (21)
[1] et al., The GeoCore ontology: A core ontology for general use in Geology, Comput. Geosci. (2020)
[2] et al., GeoReservoir: An ontology for deep-marine depositional system geometry description, Comput. Geosci. (2022)
[3] et al., Phenotype ontologies: the bridge between genomics and evolution, Trends Ecol. Evol. (2007)
[4] et al., ChEBI: a database and ontology for chemical entities of biological interest, Nucleic Acids Res. (2007)
[5] The gene ontology resource: 20 years and still GOing strong, Nucleic Acids Res. (2019)
[6] et al., Predicting the top-level ontological concepts of domain entities using word embeddings, informal definitions, and deep learning, Expert Syst. Appl. (2022)
[7] et al., BERT: Pre-training of deep bidirectional transformers for language understanding (2018)
[8] et al., RoBERTa: A robustly optimized BERT pretraining approach (2019)
[9] et al., ALBERT: A lite BERT for self-supervised learning of language representations (2019)
[10] et al., ELECTRA: Pre-training text encoders as discriminators rather than generators (2020)
Cited by (4)
- Chebifier: automating semantic classification in ChEBI to accelerate data-driven discovery. Digital Discovery (2024)
- Ontologies in the era of large language models - a perspective. Applied Ontology (2023)
- Using BERT Models to Automatically Classify Domain Concepts into DOLCE Top-Level Concepts: A Study of the OAEI Ontologies. CEUR Workshop Proceedings (2023)
- GPT-4: A Stochastic Parrot or Ontological Craftsman? Discovering Implicit Knowledge Structures in Large Language Models. Proceedings - 2023 5th International Conference on Transdisciplinary AI, TransAI 2023