PoweRGen: A power-law based generator of RDFS schemas
Highlights
► We present the first RDFS schema generator, termed PoweRGen.
► It considers power-law property and subsumption graphs.
► Linear Programming reductions for graph generation.
Introduction
As the number of RDF datasets available on the Web has grown significantly in recent years through initiatives like Linking Open Data [4], the scalability and performance of Semantic Web (SW) systems are gaining importance. Typically, such systems provide services for storing, querying, and updating large volumes of RDF graphs [20] by taking into account additional information (e.g. subsumption relationships) encoded in one or more associated schemas [7].
Most recent benchmarking efforts [1], [31], [24], [30] focus exclusively on the performance of SPARQL [28] pattern matching against real or synthetic schema-less RDF graphs stored according to different relational representations (horizontal vs. vertical) [35] and database support (tuple- vs. column-based). However, to benchmark the full potential of SW systems [39], we also need to consider the impact of the associated schemas on the design of RDF graph stores (whether or not to materialize inferred data based on subsumption relationships) as well as on the query or update workloads [35]. This need is particularly highlighted by ongoing extensions of SPARQL with path expressions spanning over RDF graphs [25], [2] as well as by recent RDFS-based change management services [41], [19]. It is worth noting that the performance of these tools is bounded less by the size of the schemas than by their morphology, which determines the number of intermediate queries that need to be executed for computing inferred data on the fly. Instead of ignoring schemas or considering only fixed ones (as existing XML [29], [3] or RDF [14], [30], [5] data generators do), in this paper we are interested in the synthetic generation of SW schemas whose morphological features are similar to those frequently exhibited in reality [37].
SW schemas are graphs whose arcs are of two different kinds, namely (a) arcs representing subsumption relationships among classes (e.g. is_a), and (b) arcs representing relations between classes or attributes (e.g. has_a), collectively called properties. In this context, for each schema we essentially need to generate two graphs that have the same set of nodes (i.e. classes or literal types): the subsumption graph and the property graph. The total-degree distribution of the property graph, as well as the out-degree (i.e. the number of class descendants) distribution of the Transitive Closure (TC) of the subsumption graph, usually follows a power-law distribution [37]. Furthermore, classes that appear as the domain of many properties are located high in the class subsumption hierarchies, i.e. classes with high out-degree in the property graph are typically located at the higher levels of the subsumption graph.
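The two quantities mentioned above (total degree in the property graph, and number of descendants in the transitive closure of the subsumption graph) can be computed with a few lines of Python. This is only an illustrative sketch and not part of PoweRGen: the function names, the (parent, child) encoding of subsumption arcs, and the (domain, property, range) encoding of property arcs are assumptions made for the example.

```python
from collections import defaultdict


def descendant_counts(subsumption_arcs):
    """Map each class to its out-degree in the transitive closure.

    subsumption_arcs: iterable of (parent, child) pairs (assumed encoding).
    The graph is assumed to be acyclic (a tree or a DAG).
    """
    children = defaultdict(set)
    nodes = set()
    for parent, child in subsumption_arcs:
        children[parent].add(child)
        nodes.update((parent, child))

    reachable = {}  # memo: class -> set of all its descendants

    def reach(node):
        if node not in reachable:
            found = set()
            for child in children[node]:
                found.add(child)
                found |= reach(child)
            reachable[node] = found
        return reachable[node]

    return {node: len(reach(node)) for node in nodes}


def total_degrees(property_arcs):
    """Map each node to in-degree + out-degree in the property multigraph.

    property_arcs: iterable of (domain, property_name, range) triples
    (assumed encoding). A self-loop contributes 2 to its node.
    """
    degree = defaultdict(int)
    for domain, _name, range_ in property_arcs:
        degree[domain] += 1
        degree[range_] += 1
    return dict(degree)


# Hypothetical toy schema, used only to exercise the two functions.
subsumption = [("Agent", "Person"), ("Agent", "Organization"),
               ("Person", "Student")]
properties = [("Person", "worksFor", "Organization"),
              ("Person", "knows", "Person")]
print(descendant_counts(subsumption))
print(total_degrees(properties))
```

Memoizing the descendant sets keeps the computation simple for small schemas; for very large subsumption graphs a bitset-based closure would likely be preferable.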
In this paper we propose the first synthetic SW schema generator, termed PoweRGen, which receives as input: (a) the number of schema classes and properties, (b) the characteristic exponents of the aforementioned power laws, (c) the depth of the subsumption graph, and (d) whether the subsumption graph should be a DAG or a tree. The generated property (resp. subsumption) graph approximates a power law with 89–95% (resp. 93–96%) accuracy, and its characteristic exponent matches the one given as input with 97–99% (resp. 91–99%) accuracy.
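To make the role of the characteristic exponent concrete, the sketch below derives a target degree sequence from a node count, a total degree budget, and an exponent. It assumes a rank-degree (Zipf-like) form of the power law, in which the r-th highest degree is proportional to r^(-exponent); this form, the function name, and the largest-remainder rounding are assumptions for illustration and are not taken from the PoweRGen algorithm.

```python
def rank_power_law_degrees(num_nodes, total_degree, exponent):
    """Split total_degree among num_nodes following a rank-degree power law.

    The r-th node (r = 1 is the highest-ranked) receives a share
    proportional to r ** (-exponent). Shares are rounded to integers with
    the largest-remainder method so that they sum to total_degree.
    """
    weights = [rank ** -exponent for rank in range(1, num_nodes + 1)]
    scale = total_degree / sum(weights)
    ideal = [w * scale for w in weights]

    # Start from the floor of each ideal share.
    degrees = [int(share) for share in ideal]

    # Hand out the units lost to flooring, largest fractional part first.
    leftover = total_degree - sum(degrees)
    by_fraction = sorted(range(num_nodes),
                         key=lambda i: ideal[i] - degrees[i],
                         reverse=True)
    for i in by_fraction[:leftover]:
        degrees[i] += 1
    return degrees


# Hypothetical input: 100 classes, 300 arc endpoints, exponent 0.8.
sequence = rank_power_law_degrees(100, 300, 0.8)
print(sequence[:10], "...", sequence[-3:])
```

A sequence of this kind only fixes the target degrees; deciding which arcs realize those degrees is the part that the paper addresses through Linear Programming reductions.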
To the best of our knowledge, there is no other related work addressing the problem of synthetic schema generation. With respect to the version of our work originally presented in [36], the main contributions of this paper are:
- (i) A new Linear Programming reduction for the generation of the subsumption graph that has lower complexity (i.e. O(N^3)) than that proposed in [36] (i.e. O(N^5)).
- (ii) A thorough experimental evaluation of the effectiveness and efficiency of PoweRGen.
We should mention at this point that the ideas and techniques presented in this paper are also useful for producing samples of most other kinds of graphs. The reason is general and fundamental: the overwhelming majority of properties of graphs (and of all sorts of combinatorial objects, for that matter) that arise in practical applications belong to the class NP, that is, they can be verified in polynomial time. Since Integer Linear Programming (ILP) is an NP-complete problem, we can always express the fact that a graph has such a property as an ILP instance. We cannot solve such instances efficiently, but we can efficiently solve their relaxed (rational) version and obtain an integer solution by suitably rounding a rational solution. Obviously some errors may arise in this way, but if we are only interested in producing a sample obeying some statistical properties, these errors become insignificant since they take the appearance of statistical fluctuations. As we shall see subsequently, this approach has given quite good results in the specific domain to which we have applied it (RDF schemas).
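In generic terms, the relax-and-round idea described above can be written as follows. This is a textbook-style sketch rather than the specific reduction used by PoweRGen: the symbols A, b, c, x and the threshold 1/2 are illustrative assumptions, and the text above only speaks of "a suitable rounding" without fixing a rule.

```latex
\begin{align*}
\text{(ILP)}\quad
  & \min_{x}\; c^{\top} x
    \quad \text{s.t.}\quad A x \le b,\quad x \in \{0,1\}^{n} \\[4pt]
\text{(LP relaxation)}\quad
  & \min_{x}\; c^{\top} x
    \quad \text{s.t.}\quad A x \le b,\quad 0 \le x_i \le 1 \ \ (i = 1,\dots,n) \\[4pt]
\text{(rounding)}\quad
  & \hat{x}_i =
    \begin{cases}
      1 & \text{if } x^{*}_i \ge \tfrac{1}{2},\\
      0 & \text{otherwise,}
    \end{cases}
    \qquad \text{where } x^{*} \text{ is an optimal solution of the relaxation.}
\end{align*}
```

The rounded vector may violate some constraints of the original ILP; in a sampling setting these violations are the errors that the paragraph above treats as statistical fluctuations.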
The remainder of this paper is organized as follows: Section 2 introduces the main features of the property and subsumption graphs forming an SW schema. Section 3 presents the synthetic generation of SW schemas. Section 4 presents the results of an experimental evaluation, while Section 5 compares our graph generation method with related work. Finally, Section 6 identifies issues for future research.
Section snippets
Semantic web schema graphs
SW schemas are usually represented as directed labeled graphs, whose nodes are classes or literal types and whose arcs are properties. These graphs may have self-loops (representing recursive properties) and multiple arcs (when two classes are connected by several properties). The upper left part of Fig. 1 depicts an example of a schema. In particular, SW schemas have two different kinds of arcs: subsumption arcs (rdfs:subClassOf), and user-defined arcs. The former comprises subsumption
Synthetic SW schema generation
Synthetic RDFS schemas are used to benchmark various approaches to storing, querying or updating RDF data. In this respect, we focus on widely used statistical and morphological features of schema graphs rather than on naming conventions (e.g. for classes or properties) or other human-interpreted semantic properties. In this section, we present the main algorithmic steps of the Power-Law based synthetic schema generator (PoweRGen), which comprises the:
- (i) generation of the total-degree (resp.
Experimental evaluation
Given that PoweRGen is the first parameterized generator of synthetic RDFS schemas, there are no other similar systems to compare with. In this section we experimentally evaluate it on two axes, namely, the effectiveness and the efficiency of the schema generation algorithm. Regarding the effectiveness, we experimentally demonstrate the ability of PoweRGen to generate graphs that respect the distributions given as input (paragraph 4.1), as well as the morphological features
Related work
To the best of our knowledge, there is no other related work addressing the problem of synthetic schema generation. The majority of existing benchmarking efforts in the relational or XML context focus exclusively on synthetic data generation. For instance, in [29], [3] the synthetically generated XML data are considered to be valid w.r.t. either a predefined DTD [29] or an XML schema given by the user [3]. However, the full potential of SW systems lies in the existence of schemas, which
Conclusions & future work
To the best of our knowledge, PoweRGen is the first schema generator that simulates the graph features SW schemas frequently exhibit in reality, such as power-law degree distributions, with accuracy ranging from 89% to 96%. By employing a Linear Programming (LP) reduction, PoweRGen generates schemas efficiently since Simplex (i.e. the algorithm used to solve the LP instances) has smoothed complexity polynomial in the input size [32], which is O(N^3) where N is the number of schema
References (41)
- et al., Extending SPARQL with regular expression patterns (for querying RDF), Journal of Web Semantics (2009)
- D.J. Abadi, A. Marcus, S.R. Madden, K. Hollenbach, Scalable semantic web data management using vertical partitioning,...
- D. Barbosa, A. Mendelzon, J. Keenleyside, K. Lyons, ToXgene: a template-based data generator for XML, in: Proceedings...
- et al., Linked data - the story so far, International Journal on Semantic Web and Information Systems (2009)
- C. Bizer, A. Schultz, Benchmarking the performance of storage systems that expose SPARQL endpoints, in: Proceedings of...
- J. Blitzstein, P. Diaconis, A sequential importance sampling algorithm for generating random graphs with prescribed...
- D. Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, 10 February...
- et al., Graph mining: laws, generators, and algorithms, ACM Computing Surveys (CSUR) (2006)
- Graphs and Hypergraphs (1973)
- L. Ding, T. Finin, Characterizing the semantic web on the web, in: Proceedings of the Fifth International Semantic Web...