Using terms and informal definitions to classify domain entities into top-level ontology concepts: An approach based on language models
Introduction
The development of domain ontologies based on top-level ontology concepts has proved valuable in many domains, such as geology [1], [2] and biomedicine [3], [4], [5]. One reason is that different domain ontologies developed under the same top-level ontology become semantically interoperable because their domain entities specialize in the same top-level concepts. However, identifying which top-level concept a domain entity specializes in is laborious and time-consuming because it is usually performed manually and requires a high level of expertise in both the target domain and ontology engineering [6].
In this work, we propose an approach for classifying domain entities into top-level concepts by combining the terms representing the domain entities with their informal definitions. Our approach uses terms and informal definitions because they provide the intended meaning of the domain entities in a particular domain and are available early in the ontology development process. Thus, we combine both into a single text sentence, which we feed into a deep neural architecture that includes a pre-trained language model as a layer and outputs the predicted top-level concept. In addition, this work proposes a methodology for extracting two novel datasets from the OntoWordNet ontology and the Dolce-Lite and Dolce-Lite-Plus top-level ontologies. Each resulting dataset contains 120,489 instances, where each instance comprises the term representing the OntoWordNet entity, its informal definition, and its respective Dolce-Lite or Dolce-Lite-Plus top-level concept as the target class.
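The combination step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the separator token and function name are assumptions, since the paper only states that the term and its informal definition are combined into a single text sentence before being fed to the language model.

```python
def build_input(term: str, definition: str, sep: str = " [SEP] ") -> str:
    """Combine a term and its informal definition into one text sentence.

    The "[SEP]" separator is a common convention for BERT-style models,
    but it is an assumption here; the paper does not specify the format.
    """
    return f"{term.strip()}{sep}{definition.strip()}"

# Hypothetical example: an OntoWordNet-style term with its gloss.
sentence = build_input("granite", "a coarse-grained intrusive igneous rock")
```

The resulting sentence is then tokenized and passed through the pre-trained language model layer of the classification architecture.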
In our experiments, we evaluated eight freely available transformer-based language models within the proposed deep neural architecture: BERT Tiny, Mini, Small, Medium, and Base [7]; RoBERTa Base [8]; ALBERT Base [9]; and ELECTRA Small [10]. We fine-tuned each model for our classification task during the training stage using an unbalanced, stratified sample. The experimental results suggest that combining the term representing the domain entity and its informal definition with language models yields promising results. In addition, the experiments show that the BERT-Base model achieves the best overall scores, with 94% and 87% micro F1-score using the Dolce-Lite-Plus and Dolce-Lite datasets, respectively.
The paper is organized as follows. In Section 2, we present the background notions that support this proposal, revisiting the OntoWordNet ontology, the top-level ontologies of Dolce-Lite-Plus and Dolce-Lite, and the main proposals on language models and ontology learning. In Section 3, we describe the methodology for extracting datasets from existing domain ontologies and the proposed approach for classifying domain entities into top-level concepts using terms, informal definitions and language models. In Section 4, we show the experiments performed using the two extracted datasets and the obtained results. Finally, in Section 5, we present the concluding remarks of our work.
Section snippets
Related works
This section describes core notions that are relevant to the proposed work. Firstly, we discuss the state of the art in transformer-based language models. After that, we describe the OntoWordNet ontology and the Dolce-Lite and Dolce-Lite-Plus top-level ontologies. In this work, we use these top-level ontologies and the OntoWordNet ontology to develop the datasets considered in our experiments. Finally, we present the current approaches that use text sentences as input for classifying domain entities into top-level concepts.
Proposed approach
This section describes the two main contributions of this work. Firstly, we describe the extraction of two novel datasets based on the OntoWordNet ontology and the Dolce-Lite and Dolce-Lite-Plus top-level ontologies. After that, we present the proposed deep neural network architecture for classifying domain entities into top-level concepts using their terms and informal definitions.
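Each dataset instance extracted by the methodology pairs a term and its informal definition with a top-level concept as the target class. A minimal sketch of this record structure is shown below; the class and field names are assumptions for illustration, not the authors' actual schema.

```python
from dataclasses import dataclass


@dataclass
class DatasetInstance:
    """One instance of the extracted datasets (hypothetical schema).

    Each of the two datasets (Dolce-Lite and Dolce-Lite-Plus) contains
    120,489 such instances, differing only in the target concept.
    """
    term: str         # term representing the OntoWordNet entity
    definition: str   # its informal definition
    top_concept: str  # target Dolce-Lite or Dolce-Lite-Plus top-level concept


# Hypothetical example instance.
example = DatasetInstance(
    term="granite",
    definition="a coarse-grained intrusive igneous rock",
    top_concept="physical-object",
)
```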
Evaluation
This section describes the experiments performed to evaluate the proposed approach for classifying domain entities into concepts specified by a top-level ontology. Firstly, we present the general settings adopted for each evaluated model in our experiments and the evaluation metrics applied. After that, we describe the two experiments using the proposed datasets and the obtained results.
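The experiments report micro F1-score. For single-label multi-class classification, as here, micro-averaged precision and recall both count every prediction exactly once, so micro F1 reduces to accuracy. A minimal sketch of the metric under that assumption:

```python
def micro_f1(y_true, y_pred):
    """Micro F1 for single-label multi-class predictions.

    With one label per instance, summed true positives equal the number of
    correct predictions, and summed false positives equal summed false
    negatives, so micro-precision = micro-recall = micro-F1 = accuracy.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

In practice, a library implementation such as scikit-learn's `f1_score(..., average="micro")` would typically be used instead of hand-rolling the metric.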
Conclusion
This work described the application of a deep neural network architecture based on pre-trained language models to classify domain entities into concepts specified by top-level ontologies. The proposed architecture takes as input text samples that combine the term representing a given domain entity and its informal definition. In addition, we developed two novel datasets for evaluating approaches to this problem. We achieved our best result using the BERT-Base model, with 94% average micro F1-score on the Dolce-Lite-Plus dataset.
CRediT authorship contribution statement
Alcides Lopes: Conceptualization, Methodology, Software, Writing – original draft. Joel Carbonera: Methodology, Writing – review & editing. Daniela Schmidt: Writing – review & editing. Luan Garcia: Writing – review & editing. Fabricio Rodrigues: Writing – review & editing. Mara Abel: Writing – review & editing, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by Petrobras, Brazil. It is also partially financed by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil and the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
References (21)
[1] et al., The GeoCore ontology: A core ontology for general use in Geology, Comput. Geosci. (2020)
[2] et al., GeoReservoir: An ontology for deep-marine depositional system geometry description, Comput. Geosci. (2022)
[3] et al., Phenotype ontologies: the bridge between genomics and evolution, Trends Ecol. Evol. (2007)
[4] et al., ChEBI: a database and ontology for chemical entities of biological interest, Nucleic Acids Res. (2007)
[5] The gene ontology resource: 20 years and still GOing strong, Nucleic Acids Res. (2019)
[6] et al., Predicting the top-level ontological concepts of domain entities using word embeddings, informal definitions, and deep learning, Expert Syst. Appl. (2022)
[7] et al., BERT: Pre-training of deep bidirectional transformers for language understanding (2018)
[8] et al., RoBERTa: A robustly optimized BERT pretraining approach (2019)
[9] et al., ALBERT: A lite BERT for self-supervised learning of language representations (2019)
[10] et al., ELECTRA: Pre-training text encoders as discriminators rather than generators (2020)
Cited by (4)
- Chebifier: automating semantic classification in ChEBI to accelerate data-driven discovery. Digital Discovery (2024)
- Ontologies in the era of large language models - a perspective. Applied Ontology (2023)
- Using BERT Models to Automatically Classify Domain Concepts into DOLCE Top-Level Concepts: A Study of the OAEI Ontologies. CEUR Workshop Proceedings (2023)
- GPT-4: A Stochastic Parrot or Ontological Craftsman? Discovering Implicit Knowledge Structures in Large Language Models. Proceedings - 2023 5th International Conference on Transdisciplinary AI, TransAI 2023