A flexible framework to experiment with ontology learning techniques

https://doi.org/10.1016/j.knosys.2007.11.009

Abstract

Ontology learning refers to extracting conceptual knowledge from several sources and either building an ontology from scratch or enriching or adapting an existing one. It uses methods from a diverse spectrum of fields such as natural language processing, artificial intelligence and machine learning. However, a crucial challenge is to quantitatively evaluate the usefulness and accuracy of both individual techniques and combinations of techniques when applied to ontology learning. The problem remains open because there are no published comparative studies.

We are developing a flexible framework for ontology learning from text which provides a cyclical process that involves the successive application of various NLP techniques and learning algorithms for concept extraction and ontology modelling. To address the above challenge, the framework supports evaluating the usefulness and accuracy of different techniques and of possible combinations of techniques into specific processes. We show our framework’s efficacy as a workbench for testing and evaluating concept identification. Our initial experiment supports our assumption about the usefulness of our approach.

Introduction

The Semantic Web is an evolving extension of the World-Wide Web, in which content is encoded in a formal and explicit way, and can be read and used by software agents [2]. It depends heavily on the proliferation of ontologies. An ontology constitutes a formal conceptualization of a particular domain shared by a group of people. In complex domains, identifying, defining, and conceptualizing a domain manually can be a costly and error-prone task. This problem can be eased by generating an ontology semi-automatically.

Most knowledge about domain entities and their properties and relationships is embodied in text collections, with varying degrees of explicitness and precision. Ontology learning from text has therefore been among the most important strategies for building an ontology. Machine learning and automated language-processing techniques have been used to extract concepts and relationships from structured and unstructured data, such as databases and text. For instance, Cimiano et al. [7] use statistical analysis to extract terms and produce a taxonomy. Similarly, Reinberger and Spyns [21] use shallow linguistic parsing for concept formation and identify some types of relationships by using prepositions.

Researchers have realized that the output of the ontology learning process is far from perfect [14]. One problem is that in most cases it is not obvious how to use, configure and combine techniques from different fields for a specific domain. Although there are a few published results about combinations of techniques, for instance [23], the problem is far from being solved. For example, some researchers use different text processing techniques such as stopword filtering [5], lemmatization [4] or stemming [13] to generate a set of pre-processed data as input for concept identification. However, there are no comparative studies that show the effectiveness of these linguistic pre-processing techniques. An additional problem for ontology learning is that most frameworks use a pre-defined combination of techniques. Thus, they include no mechanism for carrying out experiments with combinations, nor the ability to include new techniques. Reinberger et al. [22] point out that: “To our knowledge no comparative study has been published yet on the efficiency and effectiveness of the various techniques applied to ontology learning”.

Our motivation is to help make the ontology learning process controllable. To do so, it is important to know the contribution of each available technique and the efficiency of a given combination of techniques. We think that the failure to evaluate the relative efficacy of different NLP techniques is likely to hinder the development of effective learning and knowledge acquisition support for ontology engineering. To address this problem, we propose both a flexible framework and an integrated tool-suite to configure and combine techniques applied to ontology learning. The general architecture of our solution integrates an existing linguistic tool (WMatrix [20]), which provides part-of-speech (POS) and semantic tagging, an ontology workbench for information extraction, and an existing open source ontology editor called Protégé [16]. This work is part of a larger project to build ontologies semi-automatically by processing a collection of domain texts. It involves dealing with four fundamental issues: extracting the relevant domain terminology, discovering concepts, deriving a concept hierarchy, and identifying and labeling ontological relations. Our work involves the innovative adaptation, integration and application of existing NLP and machine learning techniques in order to answer the following research question:

Can shallow analysis of the kind enabled by a range of linguistic and statistical NLP and corpus linguistic techniques identify key domain concepts? Can it do so with sufficient confidence in the correctness and completeness of the result?

The main contributions of our project are:

  • Providing ontology engineers with a coordinated and integrated tool for knowledge objects extraction and ontology modelling.

  • Evaluating the contribution of different NLP and machine learning techniques and their combinations for ontology learning.

  • Proposing a guideline to configure and combine techniques applied to ontology learning.

In this paper we present the results achieved so far:

  • The definition of a framework which provides support for testing different NLP and machine learning techniques to support the semi-automatic ontology learning process.

  • A prototype workbench for knowledge object extraction which provides support for the framework. This workbench integrates a set of NLP and corpus linguistics techniques for experimenting with them.

  • Comparative analysis using a set of linguistic and statistical techniques.

The remainder of our paper is organized as follows. We begin by introducing related work. Then, we present the main parts of the framework by describing and characterizing each of the activities that form the process. Next, we present experiments using a set of linguistic and statistical techniques. Finally, we discuss the results of the experiments and present the conclusions.

Section snippets

Background

In recent years, a number of frameworks that support ontology learning processes have been reported. They implement several techniques from different fields such as knowledge acquisition, machine learning, information retrieval, natural language processing, artificial intelligence reasoning and database management, as shown by the following work:

  • ASIUM [11] learns verb frames and taxonomic knowledge, based on statistical analysis of syntactic parsing of French texts.

  • Text2Onto [6] is a complete

The ontology framework: OntoLancs

Our research project principally addresses the issue of quantitatively evaluating the usefulness or accuracy of techniques and combinations of techniques applied to ontology learning. We have integrated a first set of natural language processing, corpus linguistics and machine learning techniques for experimentation. They are: (a) POS grouping, (b) stopwords filtering, (c) frequency filtering, (d) POS filtering, (e) lemmatization, (f) stemming, (g) frequency profiling, (h) concordance, (i)
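As a rough illustration of what "combinations of techniques" can mean in practice, the sketch below chains three of the techniques named above (stopword filtering, stemming and frequency filtering) into an interchangeable pipeline. It is not the OntoLancs implementation: every function name, the tiny stopword list and the crude suffix-stripping stemmer are assumptions made only for this example.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set

# All names below are hypothetical; none of them come from OntoLancs.
Technique = Callable[[List[str]], List[str]]

# Tiny illustrative stopword list, not a real linguistic resource.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}


def stopword_filter(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def naive_stem(tokens: List[str]) -> List[str]:
    """Crude suffix stripping, standing in for a real stemmer."""
    stemmed = []
    for token in tokens:
        for suffix in ("ing", "es", "s"):
            # Only strip when at least three characters would remain.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


def make_frequency_filter(min_count: int) -> Technique:
    """Build a technique that keeps tokens occurring at least min_count times."""
    def frequency_filter(tokens: List[str]) -> List[str]:
        counts = Counter(tokens)
        return [t for t in tokens if counts[t] >= min_count]
    return frequency_filter


def candidate_concepts(tokens: List[str], techniques: Sequence[Technique]) -> Set[str]:
    """Apply the techniques in order and return the surviving distinct terms."""
    for technique in techniques:
        tokens = technique(tokens)
    return set(tokens)


if __name__ == "__main__":
    text = "the players kick the ball and the keeper catches the ball"
    pipeline = [stopword_filter, naive_stem, make_frequency_filter(2)]
    print(candidate_concepts(text.split(), pipeline))
```

Because each technique has the same token-list-in, token-list-out shape, a different combination is simply a different list, which is what makes side-by-side comparison of combinations straightforward in a sketch like this.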

Experiments

In this section we describe the mechanism our framework provides for evaluating the efficacy of different NLP techniques for the crucial second phase of the ontology learning process described in Section 3.1.

The experiments were designed to extract a set of candidate concepts from a domain corpus using a combination of NLP and machine learning techniques and to check the correspondence between the candidate concepts and the classes of a DAML reference ontology. In order to assess the efficiency
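One common way to quantify the correspondence between candidate concepts and reference classes is precision and recall. The excerpt above does not state which measures the experiments use, so the sketch below is only an assumed illustration. The function names, the exact-match comparison after lowercasing, and the sample labels are all hypothetical.

```python
from typing import Iterable, NamedTuple


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def normalise(label: str) -> str:
    """Assumed matching rule: case-insensitive exact match on trimmed labels."""
    return label.strip().lower()


def score(candidates: Iterable[str], reference_classes: Iterable[str]) -> Scores:
    """Compare extracted candidate concepts with the classes of a reference ontology."""
    cand = {normalise(c) for c in candidates}
    ref = {normalise(r) for r in reference_classes}
    matched = cand & ref
    # Precision: share of extracted candidates that are reference classes.
    precision = len(matched) / len(cand) if cand else 0.0
    # Recall: share of reference classes recovered by the extraction.
    recall = len(matched) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


if __name__ == "__main__":
    # Made-up labels for illustration only, not data from the experiments.
    extracted = ["Player", "ball", "referee", "stadium"]
    reference = ["player", "ball", "goal", "team", "referee"]
    print(score(extracted, reference))
```

Running the same scoring function over the output of each technique combination would give directly comparable figures, under the assumption that exact label matching is an adequate notion of correspondence.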

Conclusions and further work

In this paper, we have described an ongoing project which proposes a flexible framework for the ontology learning process. This framework is designed as a cyclical process to experiment with different techniques and combinations of techniques. It provides support to determine which techniques or combinations provide optimal performance for the ontology learning process. An ontology engineer can decide which techniques or combinations will be used to extract concepts and turn them into an

References (25)

  • M. Craven et al., Learning to construct knowledge bases from the World Wide Web, Artif. Intell. (2000)
  • M. Sabou et al., Learning domain ontologies for semantic web service descriptions, J. Web Sem. (2005)
  • R. Alkula, From plain character strings to meaningful words: producing better full text databases for inflectional and compounding languages with morphological analysis software, Inf. Retr. (2001)
  • T. Berners-Lee et al., The Semantic Web – a new form of Web content that is meaningful to computers will unleash a revolution of new possibilities, Sci. Am. (2001)
  • P. Buitelaar, M. Sintek, OntoLT version 1.0: middleware for ontology extraction from text, in: Proc. Demo Session at...
  • P. Buitelaar, S. Ramaka, Unsupervised ontology-based semantic tagging for knowledge markup, in: S.B. Wray Buntine, A....
  • S. Bloehdorn et al., Learning ontologies to improve text clustering and classification
  • P. Cimiano, J. Volker, Text2onto – a framework for ontology learning and data-driven change discovery, in: Proc. NLDB...
  • P. Cimiano, L. Schmidt-Thieme, A. Pivk, S. Staab, Learning taxonomic relations from heterogeneous evidence, in: P....
  • H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust...
  • M. Dean et al., OWL Web Ontology Language Reference, W3C (2004)
  • D. Faure, T. Poibeau, First experiences of using semantic knowledge learned by ASIUM for information extraction task...