1 Introduction

The Linked Data paradigm, which is now the prominent enabler for sharing huge volumes of data using Semantic Web technologies, has created novel challenges for non-relational data management technologies such as RDF and graph database systems. The semantics of Linked Data are expressed in terms of the RDF Schema Language (RDFS) and the OWL Web Ontology Language. RDFS and OWL vocabularies are used by nearly all data sources in the LOD cloud. Moreover, according to a recent study, 36.49 % of LOD datasets use various OWL fragments, so it becomes critical to optimize RDF engines by taking OWL features into account.

Commercial RDF engines implement RDFS and OWL rules by performing forward or backward reasoning. Regardless of the reasoning approach, they typically store RDF data in a single large triple table, so the evaluation of a SPARQL query boils down to a query with a large number of costly self-joins. To evaluate such demanding SPARQL queries, a number of prototype systems have been proposed. Many of these approaches map the regular, schema-conforming part of the RDF dataset into a set of relational tables [1, 2, 4] and rely on the optimization techniques of the underlying DBMSs for query evaluation. Other approaches [6, 7, 9, 13] build extensive main-memory indexes over RDF triples. In either case, the information residing in OWL schemas is rarely taken into account, with [3, 5, 8, 11] being notable exceptions; we therefore believe that an OWL schema-aware SPARQL query optimizer could complement those approaches, since many datasets (especially in the LOD Cloud) come with good-quality schemas.

In this paper we discuss how schema information expressed in terms of OWL ontologies can be used to perform interesting, possibly complex, optimizations that improve SPARQL query execution plans and, consequently, the performance of RDF engines. Such optimizations can be employed in a complementary fashion to traditional ones to further improve query planners’ performance. Our intention in this work is not to provide full solutions, but to present the potential of the idea (fully described in [10]) by discussing some possible types of optimizations (Sect. 2); many more may exist.

2 Schema Based Optimization Techniques

2.1 Constraint Violation

An RDF engine could take advantage, at compile time, of class and property constraints expressed in an OWL schema; these include equivalence (owl:equivalentClass, owl:equivalentProperty) and disjointness (owl:disjointWith, owl:propertyDisjointWith) of classes and properties, as well as constraints on a property’s domain and range (\({\mathtt {rdfs{:}domain}}\) and \({\mathtt {rdfs{:}range}}\), resp.). For instance, a query looking for an instance of two disjoint classes (owl:disjointWith construct) is certain to return no answers, so it should be answerable in constant time, without having the query engine evaluate it against the data. This kind of information is important for RDF engines that follow either a forward or a backward reasoning approach for computing the inferred knowledge.
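As a minimal sketch (the classes :Student and :Staff and the default prefix are hypothetical; standard rdf: and owl: prefix declarations are omitted), the query below contradicts the schema’s disjointness axiom, so the optimizer can return an empty result without touching the data:

# Schema (Turtle): the two classes are declared disjoint
:Student  owl:disjointWith  :Staff .

# Query (SPARQL): ?x would have to be an instance of both disjoint
# classes, so the result is necessarily empty and can be returned
# in constant time
SELECT ?x WHERE {
  ?x  rdf:type  :Student .
  ?x  rdf:type  :Staff .
}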

2.2 Selectivity Estimation

Cardinality Constraints: OWL allows defining cardinality restrictions through the min (owl:minCardinality), max (owl:maxCardinality) and exact (owl:cardinality) cardinality constraints for object and datatype properties, which state how many values of the property a resource can have. These schema-level constraints can be used to guide the optimizer toward an efficient join ordering without resorting to statistics [3, 5]. To do so, triple patterns that refer to more selective properties (e.g., functional properties, owl:FunctionalProperty) could be pushed down in the plan to reduce intermediate results, as in the sketch below.
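A minimal sketch (the properties :hasSSN and :worksWith are hypothetical; prefix declarations are omitted): declaring :hasSSN functional tells the optimizer that the corresponding pattern yields at most one binding per subject, so it can be joined early.

# Schema (Turtle): each subject has at most one value for :hasSSN
:hasSSN  rdf:type  owl:FunctionalProperty .

# Query (SPARQL): the :hasSSN pattern contributes at most one row per ?x,
# whereas :worksWith may contribute many; evaluating the functional
# pattern first keeps intermediate results small
SELECT ?x ?ssn ?y WHERE {
  ?x  :hasSSN     ?ssn .
  ?x  :worksWith  ?y .
}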

Complex Class Expressions: The selectivity of triple patterns in a SPARQL query can be estimated through OWL constructs that define classes via set operations, such as intersection (\(\mathtt {owl\!\!:\!\!intersectionOf}\)) and union (\(\mathtt {owl\!\!:\!\!unionOf}\)). For example, consider a query that requests instances \(\mathtt {?x}\) of a class \(\mathtt {<\!\!C\!\!>}\), the latter defined as the intersection of classes \(\mathtt {<\!\!C1\!\!>}\) and \(\mathtt {<\!\!C2\!\!>}\), in conjunction with triple patterns (with predicates \(\mathtt {<\!\!P1\!\!>}\) and \(\mathtt {<\!\!P2\!\!>}\)) that relate \(\mathtt {?x}\) to instances \(\mathtt {?y}\) and \(\mathtt {?z}\) of the intersected classes. The class \(\mathtt {<\!\!C\!\!>}\), being more selective, should be considered first in a bushy plan whose two sub-trees (around \(\mathtt {?x}\) and \(\mathtt {?y}\), respectively) are joined with a hash join (right side of Fig. 1). Without the knowledge of schema constraints, the query optimizer would put the three triple patterns with the \(\mathtt {rdf\!\!:\!\!type}\) predicate at the end, since such patterns usually match a large number of triples (left side of Fig. 1) [12]. An analogous line of thought can be followed for the \(\mathtt {owl\!\!:\!\!unionOf}\) construct.

Fig. 1. Optimal plan, which considers C = C1 \(\mathtt {owl\!\!:\!\!intersectionOf}\) C2 (right), and suboptimal plan, which ignores the rule (left).
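The following is one possible SPARQL rendering of a query of this shape (the placeholders <C>, <C1>, <C2>, <P1>, <P2> follow the text; the exact query behind Fig. 1 may differ):

# Schema (Turtle): <C> is defined as the intersection of <C1> and <C2>
<C>  owl:intersectionOf  ( <C1> <C2> ) .

# Query (SPARQL): every instance of <C> is also an instance of <C1> and
# <C2>, so the type pattern on <C> is the most selective one and can
# seed a sub-tree of a bushy plan instead of being deferred
SELECT ?x ?y ?z WHERE {
  ?x  rdf:type  <C>  .
  ?y  rdf:type  <C1> .
  ?z  rdf:type  <C2> .
  ?x  <P1>      ?y   .
  ?x  <P2>      ?z   .
}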

Class and Property Hierarchies: Hierarchies of classes and properties (expressed through rdfs:subClassOf and rdfs:subPropertyOf) can also improve selectivity estimation. In this case, triple patterns that request instances of classes found lower in a class hierarchy should be considered earlier in the query plan (depending on the form of the query) when deciding the join ordering.
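A minimal sketch (the hierarchy and the :advisor property are hypothetical; prefix declarations are omitted): the type pattern on the leaf class matches far fewer triples than the one on the root class, so it is a better starting point for the join.

# Schema (Turtle): a small class hierarchy
:PhDStudent       rdfs:subClassOf  :GraduateStudent .
:GraduateStudent  rdfs:subClassOf  :Person .

# Query (SPARQL): the pattern on :PhDStudent (low in the hierarchy)
# should be evaluated before the pattern on :Person (high in the
# hierarchy), which matches many more instances
SELECT ?x ?y WHERE {
  ?x  rdf:type  :PhDStudent .
  ?y  rdf:type  :Person .
  ?x  :advisor  ?y .
}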

2.3 Advanced Optimizations

In this section we present a set of cases where schema information can help the query engine determine the optimal plan in a more sophisticated way.

Inference: In backward reasoning systems, the knowledge inferred through OWL reasoning rules is computed at query time. In some cases, the same information may be obtained in several ways. For example, assume that we have a long hierarchy where \(\mathtt {<\!\!B_{i}\!\!>}\) is a subclass (rdfs:subClassOf) of \(\mathtt {<\!\!B_{i+1}\!\!>}\), \(i=1,\ldots , n\). Consider also that the domain (\({\mathtt {rdfs{:}domain}}\)) of property \(\mathtt {<\!\!P\!\!>}\) is class \(\mathtt {<\!\!A\!\!>}\) and all its values (\(\mathtt {owl\!\!:\!allValuesFrom}\)) come from the root class \(\mathtt {<\!\!B_{n+1}\!\!>}\). In a query that asks for instances \(\mathtt {?v}\) of class \(\mathtt {<\!\!B_{n+1}\!\!>}\) that are also values of property \(\mathtt {<\!\!P\!\!>}\), there are two ways to obtain the instances \(\mathtt {?v}\): one through \(\mathtt {owl\!\!:\!allValuesFrom}\) (the OWL 2 RL cls-avf rule), and another through the transitivity of rdfs:subClassOf (the cax-sco rule). For large n, class \(\mathtt {<\!\!B_{n+1}\!\!>}\) is positioned high in the hierarchy, so the engine should use the \(\mathtt {owl\!\!:\!allValuesFrom}\) construct to obtain the values for \(\mathtt {?v}\). The alternative may be better if the two classes are sufficiently “close” in the hierarchy, especially given that subsumption-related inference is the most heavily optimized type of inference (due to its widespread use).
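A sketch of the schema and query for n = 3 (all IRIs are placeholders; the owl:allValuesFrom constraint is expressed, as usual, through a restriction class):

# Schema (Turtle): subclass chain :B1 ... :B4, with :B4 the root
:B1  rdfs:subClassOf  :B2 .
:B2  rdfs:subClassOf  :B3 .
:B3  rdfs:subClassOf  :B4 .
:P   rdfs:domain      :A .
:A   rdfs:subClassOf  [ rdf:type           owl:Restriction ;
                        owl:onProperty     :P ;
                        owl:allValuesFrom  :B4 ] .

# Query (SPARQL): ?v rdf:type :B4 can be derived either through the
# restriction on :P (cls-avf) or by walking the subclass chain from
# the asserted class of ?v up to :B4 (cax-sco); for a long chain the
# cls-avf route is preferable
SELECT ?v WHERE {
  ?x  :P        ?v .
  ?v  rdf:type  :B4 .
}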

Star Query Transformation: Schema information can also be used by the query optimizer to rewrite SPARQL queries into equivalent ones whose form lends itself to already known optimization techniques. For example, when a triple pattern involving a symmetric property (owl:SymmetricProperty) “breaks” a star-shaped query pattern (the subject shared by the remaining triple patterns appears in its object position), a schema-aware optimizer should, according to the semantics of owl:SymmetricProperty, rewrite the query into an equivalent one in which all triple patterns share the same subject, as sketched below.
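A sketch of such a rewrite (the property names are hypothetical; prefix declarations are omitted):

# Schema (Turtle):
:collaboratesWith  rdf:type  owl:SymmetricProperty .

# Original query (SPARQL): the second pattern breaks the star on ?x,
# since ?x appears in the object position
SELECT ?x ?n ?a WHERE {
  ?x  :name              ?n .
  ?y  :collaboratesWith  ?x .
  ?x  :age               ?a .
}

# Equivalent star-shaped query: valid because :collaboratesWith is
# symmetric; all patterns now share the subject ?x, so star-join
# optimizations apply
SELECT ?x ?n ?a WHERE {
  ?x  :name              ?n .
  ?x  :collaboratesWith  ?y .
  ?x  :age               ?a .
}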

3 Conclusions

We advocated the use of OWL schema information for improving SPARQL query planning, and described some optimizations that can be employed in this direction. Our proposal is meant to be complementary to well-known optimizations (e.g., statistics-based ones) for query planning, and is most appropriate for datasets and benchmarks that use a rich schema structure (e.g., UOBM). In the future, we plan to work further on understanding the different possible optimizations and their potential trade-offs, so that they can be implemented on top of an RDF store in order to quantify the achieved speed-up.