1 Introduction

Building rich ontologies with reasoning capabilities is a difficult and time-consuming task. It requires both the knowledge of domain experts and the experience of ontology engineers. This is one of the main reasons why the current Semantic Web and linked data rely mostly on lightweight ontologies. Automating axiom extraction is a step towards creating richer domain concept descriptions [4] and building a Semantic Web that goes beyond explicit knowledge for query answering. Ontology learning, i.e. the automatic extraction of ontologies from text, can help automate the extraction of primitive, named and complex classes. Few state-of-the-art approaches have been developed to achieve this goal, and most of them are pattern-based [2, 3]. To our knowledge, LExO [3] is the most advanced system for complex class extraction. This paper describes our approach to extracting defined and primitive class axioms from Wikipedia concept definitions using SPARQL. The main contributions of this work are (i) the use of SPARQL graph matching capabilities to model patterns for axiom extraction, (ii) the description of SPARQL patterns for complex class extraction from definitions, and (iii) the enrichment of DBpedia concept descriptions using OWL axioms and defined classes. We also briefly compare our preliminary results with those of LExO.

2 Methodology

We rely on a pattern-based approach to detect syntactic constructs that denote complex class axioms. These axioms are extracted from Wikipedia definitions.

Definition Representation and General Pipeline:

We process definition sentences and first construct an RDF graph that represents the dependency structure of the definition, together with each word's part of speech and position in the sentence. This representation simplifies the subsequent pattern-matching step, which relies on SPARQL requests. For every word, we specify its label, its part of speech, its position in the sentence and its grammatical relations with the other words, based on the output of the Stanford parser [1]. Figure 1 presents an example of the RDF graph of a definition. For this example, we use the definition of the Wikipedia concept Vehicle from our dataset, which is “Vehicles are non-living means of transportation”.

Fig. 1. The RDF representation of the definition of vehicles.
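As a rough illustration of this representation, the sketch below encodes the example sentence as a set of subject-predicate-object triples in plain Python. It is a stand-in for an actual RDF graph, not the implementation used in this work. The part-of-speech tags and dependency relations are written by hand and are assumptions about what a dependency parser might return. The node names (`w1`, `w2`, ...) and predicate names (`label`, `pos`, `position`) are hypothetical.

```python
# Hand-written stand-in for parser output: the tags and relations below are
# assumptions made for illustration, not actual Stanford parser output.
TOKENS = [
    # (position, label, part of speech)
    (1, "Vehicles", "NNS"),
    (2, "are", "VBP"),
    (3, "non-living", "JJ"),
    (4, "means", "NNS"),
    (5, "of", "IN"),
    (6, "transportation", "NN"),
]
DEPENDENCIES = [
    # (relation, governor position, dependent position)
    ("nsubj", 4, 1),
    ("cop", 4, 2),
    ("amod", 4, 3),
    ("prep_of", 4, 6),
]


def build_graph(tokens, dependencies):
    """Return a set of (subject, predicate, object) triples for one sentence."""
    triples = set()
    for position, label, pos_tag in tokens:
        node = f"w{position}"  # hypothetical node naming scheme
        triples.add((node, "label", label))
        triples.add((node, "pos", pos_tag))
        triples.add((node, "position", position))
    for relation, governor, dependent in dependencies:
        # Each grammatical relation becomes an edge between two word nodes.
        triples.add((f"w{governor}", relation, f"w{dependent}"))
    return triples


graph = build_graph(TOKENS, DEPENDENCIES)
for triple in sorted(graph, key=str):
    print(triple)
```

Each word contributes three descriptive triples, and each grammatical relation contributes one edge, so the six-word example yields 22 triples in this sketch.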

Based on this RDF representation, we execute a pipeline of SPARQL requests on the obtained RDF graphs (see Fig. 2).

Fig. 2. Pipeline.

First, we execute SPARQL aggregation requests to extract complex expressions such as nominal and verbal groups and to define subclass axioms. For instance, for the sentence “Vehicles are non-living means of transportation”, we obtain the following expressions: vehicles, non-living means of transportation and means of transportation. We also extract the axiom subClassOf (Non-living means of transportation, Means of transportation). Finally, we execute a set of SPARQL axiom queries to identify occurrences of patterns that can be mapped to OWL complex class definitions.
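The aggregation step can be sketched in plain Python as follows. This is only an approximation of what the SPARQL aggregation requests do: it groups a head noun with its prepositional complement and its adjectival modifiers, and emits a subclass axiom between the modified and unmodified groups. The dependency relations are hand-written assumptions, and the helper names (`dependents`, `nominal_groups`) are hypothetical.

```python
# Hypothetical, hand-written parse of the example sentence (not parser output).
LABELS = {1: "vehicles", 2: "are", 3: "non-living",
          4: "means", 5: "of", 6: "transportation"}
DEPENDENCIES = [("nsubj", 4, 1), ("cop", 4, 2), ("amod", 4, 3), ("prep_of", 4, 6)]


def dependents(head, relation):
    """Positions of the words attached to `head` by `relation`, in sentence order."""
    return sorted(d for rel, g, d in DEPENDENCIES if rel == relation and g == head)


def nominal_groups(head):
    """Return (base group, modified group or None) for a head noun."""
    base = [LABELS[head]]
    for complement in dependents(head, "prep_of"):
        base += ["of", LABELS[complement]]
    modifiers = [LABELS[m] for m in dependents(head, "amod")]
    base_text = " ".join(base)
    full_text = " ".join(modifiers + base) if modifiers else None
    return base_text, full_text


base, full = nominal_groups(4)
# A modified group is assumed to be a subclass of its unmodified counterpart.
axioms = [("subClassOf", full, base)] if full else []
print(base)    # means of transportation
print(full)    # non-living means of transportation
print(axioms)
```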

SPARQL Pattern Representation:

Based on a randomly chosen set of 110 definitions from Wikipedia and their sentence representation, we identified several recurring syntactic structures manually and built their corresponding SPARQL patterns. Next, we mapped patterns to complex class axioms using SPARQL CONSTRUCT. Table 1 presents the most common patterns that we identified in our dataset, in addition to their corresponding axioms. Each pattern is modeled using a single SPARQL request. This mechanism provides simple ways to enrich our approach with patterns that we do not support yet.
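To illustrate how a syntactic pattern can be mapped to an axiom, the sketch below implements a very small triple-pattern matcher in plain Python and uses it in the spirit of a SPARQL CONSTRUCT request. It is not a SPARQL engine, and the pattern shown (a subject linked to a nominal head through a copula, mapped to a subclass axiom) is a hypothetical example rather than one of the patterns of Table 1. The graph assumes that labels have already been lemmatized and aggregated.

```python
def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if isinstance(term, str) and term.startswith("?"):
                # Variable: bind it, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break  # Constant term that does not match this triple.
        else:
            yield from match(rest, triples, candidate)


# Hypothetical graph: labels are assumed to be lemmatized and aggregated already.
GRAPH = [
    ("w1", "label", "Vehicle"),
    ("w2", "label", "are"),
    ("w4", "label", "MeansOfTransportation"),
    ("w4", "nsubj", "w1"),
    ("w4", "cop", "w2"),
]

# Hypothetical pattern: "<subject> is/are <head>" read as a subclass relation.
PATTERN = [
    ("?head", "nsubj", "?subject"),
    ("?head", "cop", "?verb"),
    ("?subject", "label", "?sub"),
    ("?head", "label", "?sup"),
]

axioms = [("subClassOf", b["?sub"], b["?sup"]) for b in match(PATTERN, GRAPH)]
print(axioms)
```

Keeping each pattern as a separate list of triple patterns mirrors the idea of one request per pattern: supporting a new syntactic structure amounts to adding a new pattern and its axiom template.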

Table 1. Most frequent patterns and their respective axioms.

3 Preliminary Evaluation and Discussion

We compared the generated axioms with a manually built gold standard containing 20 definitions chosen randomly from our initial dataset. We assessed the correctness of the axioms using standard precision and recall, focusing on named classes, predicates and complete axioms (see Table 2). The metrics for complete axioms are calculated by counting the number of classes, predicates and logical operators that match those in the gold standard. We obtain a macro-averaged precision of 0.86 and a macro-averaged recall of 0.59. We also propose an axiom evaluation based on the Levenshtein similarity metric, which considers each axiom as a string. The higher the Levenshtein similarity between the generated axiom and the reference, the more similar the axioms are. We tested multiple similarity levels, as shown in Table 3. We notice that we usually generate the right axioms for (i) short sentences, (ii) sentences with a simple grammatical structure and (iii) longer sentences that have no grammatical ambiguities. We also notice that false positives are rarely generated, and that the errors in our results are usually caused by incomplete axioms. This is explained by the limited number of implemented patterns (10 patterns).
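The macro-averaged precision and recall can be sketched as follows, assuming that each definition is represented by the set of elements (here, class names) found in its generated axiom and in its gold-standard axiom. The two toy definitions below are invented for illustration only. They are not taken from our gold standard and do not reproduce the reported figures.

```python
def precision_recall(generated, gold):
    """Precision and recall of one generated element set against its reference."""
    generated, gold = set(generated), set(gold)
    matched = len(generated & gold)
    precision = matched / len(generated) if generated else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


def macro_average(pairs):
    """Average the per-definition scores, giving each definition equal weight."""
    scores = [precision_recall(generated, gold) for generated, gold in pairs]
    count = len(scores)
    return (sum(p for p, _ in scores) / count,
            sum(r for _, r in scores) / count)


# Toy data, invented for illustration (NOT the gold standard of this paper).
PAIRS = [
    ({"Vehicle", "MeansOfTransportation"},
     {"Vehicle", "MeansOfTransportation", "NonLivingMeansOfTransportation"}),
    ({"Minister", "Politician"},
     {"Minister", "Secretary", "Politician", "PublicOffice"}),
]

macro_precision, macro_recall = macro_average(PAIRS)
print(f"macro precision = {macro_precision:.2f}, macro recall = {macro_recall:.2f}")
```

In this toy setting every generated element is correct but some reference elements are missing, which is the same kind of behaviour as described above: few false positives, and errors that come mostly from incomplete axioms.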

Table 2. Evaluation results.
Table 3. Axioms’ precision based on Levenshtein similarity with the gold standard.
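The string-based evaluation can be sketched as below. The edit distance is the standard Levenshtein distance. The normalization into a similarity in [0, 1] (one minus the distance divided by the length of the longer string) and the thresholding used to count an axiom as correct are assumptions made for this sketch, since the exact formulation is not detailed here. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def precision_at(threshold, pairs):
    """Share of generated axioms whose similarity to the reference reaches the threshold."""
    hits = sum(1 for generated, reference in pairs
               if similarity(generated, reference) >= threshold)
    return hits / len(pairs)


# Toy axiom strings, invented for illustration only.
PAIRS = [
    ("subClassOf(Vehicle, MeansOfTransportation)",
     "subClassOf(Vehicle, MeansOfTransportation)"),
    ("subClassOf(Minister, Politician)",
     "subClassOf(Secretary, Politician)"),
]
for threshold in (0.7, 1.0):
    print(threshold, precision_at(threshold, PAIRS))
```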

While LExO [3] adopted an approach similar to ours, it did not rely on standard Semantic Web languages such as SPARQL for its patterns and did not take into account the aggregation of nominal and verbal groups or the extraction of taxonomical relations. For example, given the definition “A minister or a secretary is a politician who holds significant public office in a national or regional government”, LExO generates (Minister ∪ Secretary)  (Politician ∩ ∃holds.((Office ∩ Significant ∩ Public) ∩ ∃in.(Government ∩ (National ∪ Regional)))). In contrast, our system generates the axiom Minister  Secretary  (Politician ∩ ∃holds.(SignificantPublicOffice ∩ ∃in.(NationalGovernment ∪ RegionalGovernment))). In addition, it generates a taxonomy in which SignificantPublicOffice is a subclass of PublicOffice, and NationalGovernment and RegionalGovernment are subclasses of Government.

4 Conclusion and Future Work

This paper describes an approach to extracting OWL axioms from Wikipedia definitions using SPARQL requests, with the aim of logically defining DBpedia concepts. We are currently working on the implementation of our pipeline as a Web service, which has not yet been proposed in the state of the art. More importantly, one original contribution of this paper is the reliance on Semantic Web languages (RDF, SPARQL) to model sentences, patterns and axioms, which eases the reuse and enrichment of the defined patterns.