Elsevier

Information Systems

Volume 37, Issue 4, June 2012, Pages 306-319
PoweRGen: A power-law based generator of RDFS schemas

https://doi.org/10.1016/j.is.2011.09.005

Abstract

As the number of RDF datasets available on the Web has grown significantly in recent years, the scalability and performance of Semantic Web (SW) systems are gaining importance. Current RDF benchmarking efforts either consider schema-less RDF datasets or rely on fixed RDFS schemas. In this paper, we present the first RDFS schema generator, termed PoweRGen, which takes into account the features exhibited by real SW schemas. It considers the power-law functions involved in (a) the combined in- and out-degree distribution of the property graph (which captures the domains and ranges of the properties defined in a schema) and (b) the out-degree distribution of the transitive closure (TC) of the subsumption graph (which essentially captures the class hierarchy). The synthetic schemas generated by PoweRGen respect the power-law functions given as input with an accuracy ranging between 89% and 96%, as well as various morphological characteristics of the subsumption hierarchy, such as its depth and structure.

Highlights

► We present the first RDFS schema generator, termed PoweRGen. ► It considers power-law property and subsumption graphs. ► Linear Programming reductions for graph generation.

Introduction

As the number of RDF datasets available on the Web has grown significantly in recent years through initiatives like Linking Open Data [4], the scalability and performance of Semantic Web (SW) systems are gaining importance. Typically, such systems provide services for storing, querying, and updating large volumes of RDF graphs [20] by taking into account additional information (e.g. subsumption relationships) encoded in one or more associated RDFS schemas [7].

Most of the recent benchmarking efforts [1], [31], [24], [30] focus exclusively on the performance of SPARQL [28] pattern matching against real or synthetic schema-less RDF graphs stored according to different relational representations (horizontal vs vertical) [35] and database support (tuple vs column based). However, to benchmark the full potential of SW systems [39], we also need to consider the impact of the associated RDFS schemas on the design of RDF graph stores (whether or not they materialize data inferred from subsumption relationships) as well as on the query or update workloads [35]. This need is particularly highlighted by on-going extensions of SPARQL with path expressions spanning over RDFS graphs [25], [2] as well as by recent RDFS-based change management services [41], [19]. It is worth noting that the performance of these tools is bounded less by the size of the RDFS schemas than by their morphology, which determines the number of intermediate queries that need to be executed for computing inferred data on the fly. Instead of ignoring RDFS schemas or considering only fixed ones (as with existing XML [29], [3] or RDF [14], [30], [5] data generators), in this paper we are interested in the synthetic generation of RDFS schemas whose morphological features are similar to those frequently exhibited in reality [37].

RDFS schemas are graphs whose arcs are of two different kinds, namely, (a) arcs representing subsumption relationships among classes (e.g. is_a), and (b) arcs representing relations between classes or attributes (e.g. has_a), collectively called properties. In this context, for each RDFS schema we essentially need to generate two graphs that have the same set of nodes (i.e. classes or literal types), that is, the subsumption graph and the property graph. The total-degree distribution of the property graph, as well as the out-degree (i.e. the class descendants) distribution of the Transitive Closure (TC) of the subsumption graph, usually follows a power-law distribution (i.e. $P(X = x) \sim x^{-\beta}$) [37]. Furthermore, classes that appear as the domain of many properties are located high in the class subsumption hierarchies, i.e. classes with high out-degree in the property graph are typically located at the higher levels of the subsumption graph.
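To make the second distribution concrete, the following minimal sketch counts, for every class of a small subsumption graph, its number of distinct descendants, which is the out-degree of that class in the transitive closure. The toy hierarchy, the class names, and the function name `descendant_counts` are illustrative assumptions and do not come from the paper.

```python
from collections import Counter


def descendant_counts(children):
    """Return {class: number of distinct descendants} for a DAG given as
    {class: [direct subclasses]}. This equals the out-degree of each class
    in the transitive closure of the subsumption graph."""
    memo = {}

    def descendants(node):
        # Memoised depth-first traversal; a set keeps descendants that are
        # reachable through several parents (DAG case) from being counted twice.
        if node not in memo:
            found = set()
            for child in children.get(node, []):
                found.add(child)
                found |= descendants(child)
            memo[node] = found
        return memo[node]

    nodes = set(children) | {c for kids in children.values() for c in kids}
    return {n: len(descendants(n)) for n in nodes}


# Hypothetical toy hierarchy (not taken from the paper).
toy = {
    "Resource": ["Agent", "Document"],
    "Agent": ["Person", "Organization"],
    "Person": ["Author"],
    "Document": ["Article"],
}
counts = descendant_counts(toy)
# Degree histogram: how many classes have a given number of descendants.
histogram = Counter(counts.values())
```

In this toy example most classes have few or no descendants and a single class has many, which is the skewed shape a power-law distribution describes at a larger scale.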

In this paper we propose the first synthetic RDFS schema generator, termed PoweRGen, which receives as input: (a) the number of schema classes and properties, (b) the characteristic exponents of the aforementioned power laws, (c) the depth of the subsumption graph, and (d) whether the subsumption graph should be a DAG or a tree. The property (resp. subsumption) graph approximates a power law with an 89–95% (resp. 93–96%) accuracy, and the characteristic exponent of that power law matches the one given as input with a 97–99% (resp. 91–99%) accuracy.
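This excerpt does not define how the accuracy figures are measured. As a hedged illustration only, one common way to check how well a degree histogram follows a power law is a least-squares fit on a log-log scale; the function names `power_law_counts` and `fit_exponent` below are assumptions made for this sketch, not part of PoweRGen.

```python
import math


def power_law_counts(n_nodes, beta, max_degree):
    """Expected number of nodes per degree d = 1..max_degree when
    P(X = d) is proportional to d ** -beta."""
    weights = [d ** -beta for d in range(1, max_degree + 1)]
    total = sum(weights)
    return [n_nodes * w / total for w in weights]


def fit_exponent(counts):
    """Estimate beta as minus the least-squares slope of log(count)
    against log(degree); counts[0] is the count for degree 1."""
    pts = [(math.log(d), math.log(c))
           for d, c in enumerate(counts, start=1) if c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
             / sum((x - mean_x) ** 2 for x, _ in pts))
    return -slope


ideal = power_law_counts(n_nodes=1000, beta=1.5, max_degree=20)
estimated_beta = fit_exponent(ideal)
```

On an ideal (unrounded) histogram the fit recovers the input exponent up to floating-point error; on a generated graph the gap between the fitted and the requested exponent gives one possible notion of how closely the input is respected.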

To the best of our knowledge, there is no other related work addressing the problem of synthetic RDFS schema generation. With respect to the version of our work originally presented in [36], the main contributions of this paper are:

  • (i) A new Linear Programming reduction for the generation of the subsumption graph that has lower complexity (i.e. $O(N^3)$) than that proposed in [36] (i.e. $O(N^5)$).

  • (ii) A thorough experimental evaluation of the effectiveness and efficiency of PoweRGen.

We should mention at this point that the ideas and techniques presented in this paper are useful for producing samples of most other kinds of graphs. The reason is fairly general and of a fundamental nature: the overwhelming majority of the properties of graphs (and of all sorts of combinatorial objects, for that matter) that arise in practical applications belong to the class NP, that is, they can be verified in polynomial time. Since the Integer Linear Programming problem (ILP) is NP-complete, we can always express the fact that a graph has such a property as an ILP instance. We cannot solve such instances efficiently, but we can efficiently solve the relaxed (rational) version of these instances and obtain an integer solution by suitably rounding a rational solution. Obviously some errors may arise in this way, but if we are only interested in producing a sample obeying some statistical properties, these errors become insignificant, since they take the appearance of statistical fluctuations. As we shall see subsequently, this approach has given quite good results in the specific domain to which we have applied it (RDF schemas).
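The rounding scheme used by PoweRGen is not described in this excerpt. Purely as an illustration of the relax-then-round idea, the sketch below applies largest-remainder rounding (a swapped-in, generic technique) to a hypothetical rational solution, so that the integer result keeps the same total. The function name and the sample values are assumptions.

```python
import math


def round_preserving_sum(values):
    """Round non-negative rationals to integers whose sum equals
    round(sum(values)), using largest-remainder rounding."""
    floors = [math.floor(v) for v in values]
    remaining = round(sum(values)) - sum(floors)
    # Hand the leftover units to the entries with the largest fractional parts.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] - floors[i],
                   reverse=True)
    for i in order[:remaining]:
        floors[i] += 1
    return floors


# Hypothetical rational solution, e.g. fractional node counts per degree.
fractional = [4.6, 2.7, 1.4, 0.8, 0.5]
integral = round_preserving_sum(fractional)
```

Each entry moves by less than one unit, which is the kind of small deviation the paragraph above describes as looking like a statistical fluctuation in the generated sample.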

The remainder of this paper is organized as follows: Section 2 introduces the main features of the property and subsumption graph forming an RDFS schema. Section 3 presents the generation of RDFS schemas. Section 4 presents the results of an experimental evaluation, while Section 5 compares our graph generation method with related work. Finally, Section 6 identifies issues for future research.

Section snippets

Semantic web schema graphs

RDFS schemas are usually represented as directed labeled graphs, whose nodes are classes or literal types and whose arcs are properties. These graphs may have self-loops (representing recursive properties) and multiple arcs (when two classes are connected by several properties). The upper left part of Fig. 1 depicts an example of a schema. In particular, SW schemas have two different kinds of arcs: subsumption arcs (rdfs:subClassOf), and user-defined arcs. The former comprises subsumption

Synthetic SW schema generation

Synthetic RDFS schemas are used to benchmark various approaches to store, query or update RDF data. In this respect, we focus on widely used statistical and morphological features of schema graphs rather than naming conventions (e.g. for classes or properties) or other humanly interpreted semantic properties. In this Section, we present the main algorithmic steps of the Power-Law based RDFS synthetic schema generator (PoweRGen), which comprises the:

  • (i) generation of the total-degree (resp.

Experimental evaluation

Given that PoweRGen is the first parameterized generator of synthetic RDFS schemas, there are no other similar systems to compare it with. In this section we experimentally evaluate it on two axes, namely, the effectiveness and the efficiency of the RDFS schema generation algorithm. Regarding the effectiveness, we experimentally demonstrate the ability of PoweRGen to generate graphs that respect power-law distributions given as input (paragraph 4.1), as well as the morphological features

Related work

To the best of our knowledge, there is no other related work addressing the problem of synthetic SW schema generation. The majority of existing benchmarking efforts in the relational or XML context focus exclusively on the synthetic data generation. For instance, in [29], [3] the synthetically generated XML data are considered to be valid w.r.t. either a predefined DTD [29] or an XML schema given by the user [3]. However, the full potential of SW systems lies in the existence of schemas, which

Conclusions & future work

To the best of our knowledge, PoweRGen is the first RDFS schema generator that simulates the graph features SW schemas frequently exhibit in reality, such as power-law degree distributions, with accuracy ranging between 89% and 96%. By employing a Linear Programming (LP) reduction, PoweRGen generates RDFS schemas efficiently, since Simplex (i.e. the algorithm used to solve the LP instances) has smoothed complexity polynomial in the input size [32], which is $O(N^3)$ where N is the number of schema

References (41)

  • F. Alkhateeb et al., Extending SPARQL with regular expression patterns (for querying RDF), Journal of Web Semantics (2009)
  • D.J. Abadi, A. Marcus, S.R. Madden, K. Hollenbach, Scalable semantic web data management using vertical partitioning,...
  • D. Barbosa, A. Mendelzon, J. Keenleyside, K. Lyons, Toxgene: a template-based data generator for xml, in: Proceedings...
  • C. Bizer et al., Linked data - the story so far, International Journal on Semantic Web and Information Systems (2009)
  • C. Bizer, A. Schultz, Benchmarking the performance of storage systems that expose SPARQL endpoints, in: Proceedings of...
  • J. Blitzstein, P. Diaconis, A sequential importance sampling algorithm for generating random graphs with prescribed...
  • D. Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, 10 February...
  • D. Chakrabarti et al., Graph mining: laws, generators, and algorithms, ACM Computing Surveys (CSUR) (2006)
  • C. Berge, Graphs and Hypergraphs (1973)
  • L. Ding, T. Finin, Characterizing the semantic web on the web, in: Proceedings of the Fifth International Semantic Web...
  • P. Erdös et al., Graphs with prescribed degree of vertices, Matematikai Lapok (1960)
  • M. Faloutsos, P. Faloutsos, C. Faloutsos, On power law relationships of the internet topology, in: Proceedings of ACM...
  • Y. Guo, J. Heflin, Z. Pan, Benchmarking DAML+OIL Repositories, in: Proceedings of the second International Semantic Web...
  • Y. Guo et al., A requirements driven framework for benchmarking semantic web knowledge base systems, IEEE Transactions on Knowledge and Data Engineering: Special Issue: Knowledge and Data Engineering in the Semantic Web Era (2007)
  • S. Hakimi, On the realizability of a set of integers as degrees of the vertices of a graph, Society for Industrial and Applied Mathematics (1962)
  • V. Havel, A remark on the existence of finite graphs, Casopis Pest. Mat. (1955)
  • P. Hayes, RDF semantics, W3C Recommendation, 10 February...
  • A.J. Hoffman et al., Integral boundary points of convex polyhedra
  • G. Konstantinidis, G. Flouris, G. Antoniou, V. Christophides, A formal approach for RDF/S ontology evolution, in:...
  • O. Lassila, R. Swick, Resource description framework (RDF) model and syntax specification, W3C Recommendation, February...