DOI: 10.1145/3323878.3325804

Relational schemata for distributed SPARQL query processing

Published: 05 July 2019

Abstract

To benefit from mature database technology, RDF stores are often built on top of relational databases, with SPARQL queries mapped into SQL. A shared-nothing computer cluster offers a way to scale by processing queries over large RDF datasets in a distributed fashion. With this goal in mind, the paper examines the impact of relational schema design when queries are mapped into Apache Spark SQL. Four designs are considered: a single triple table; a set of tables obtained by partitioning by predicate; a single wide table covering all properties; and a set of tables derived from the application model specification, called the domain-dependent schema. For each design, the rows of the corresponding tables are stored in the distributed file system HDFS using the columnar format Parquet. Experiments with standard benchmarks show that the single wide property table, despite its simplicity, outperforms the other approaches. Further experiments show that this single-table approach remains attractive even when the data is repartitioned by key (RDF subject) before the queries are executed.
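To make the compared layouts concrete, the following is a minimal sketch, not the paper's implementation. It assumes Python's built-in `sqlite3` in place of Apache Spark SQL over Parquet, and a toy dataset whose subjects and predicates (`alice`, `name`, `age`, `worksFor`) are invented for illustration. It builds the triple table, the predicate-partitioned tables, and the wide property table from the same triples, then maps one star-shaped SPARQL pattern, `{ ?x name ?n . ?x age ?a }`, to SQL for each layout. The domain-dependent schema is omitted because it depends on an application model that the abstract does not spell out.

```python
import sqlite3

# Toy RDF triples (subject, predicate, object); every name here is hypothetical.
TRIPLES = [
    ("alice", "name", "Alice"),
    ("alice", "age", "30"),
    ("alice", "worksFor", "acme"),
    ("bob", "name", "Bob"),
    ("bob", "worksFor", "acme"),
    ("acme", "name", "ACME Corp"),
]


def build(conn, triples):
    """Materialise three relational layouts from the same set of triples."""
    cur = conn.cursor()

    # 1. Single triple table: one row per triple.
    cur.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    cur.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)

    predicates = sorted({p for _, p, _ in triples})

    # 2. Partitioning by predicate: one (s, o) table per predicate.
    for p in predicates:
        cur.execute(f'CREATE TABLE "vp_{p}" (s TEXT, o TEXT)')
        cur.execute(
            f'INSERT INTO "vp_{p}" SELECT s, o FROM triples WHERE p = ?', (p,)
        )

    # 3. Single wide property table: one row per subject, one column per
    #    predicate, NULL where a subject lacks the property.
    #    Assumption: every property is single-valued in this toy data.
    cols = ", ".join(f'"{p}" TEXT' for p in predicates)
    cur.execute(f"CREATE TABLE wide (s TEXT PRIMARY KEY, {cols})")
    for s in sorted({s for s, _, _ in triples}):
        cur.execute("INSERT INTO wide (s) VALUES (?)", (s,))
    for s, p, o in triples:
        cur.execute(f'UPDATE wide SET "{p}" = ? WHERE s = ?', (o, s))

    conn.commit()


# The SPARQL pattern { ?x name ?n . ?x age ?a } mapped to SQL per layout.
QUERIES = {
    # Triple table: a self-join on the subject, filtered by predicate.
    "triple": (
        "SELECT t1.o, t2.o FROM triples t1 JOIN triples t2 ON t1.s = t2.s "
        "WHERE t1.p = 'name' AND t2.p = 'age'"
    ),
    # Predicate tables: a join between the two relevant tables.
    "vp": 'SELECT n.o, a.o FROM "vp_name" n JOIN "vp_age" a ON n.s = a.s',
    # Wide table: no join, only a filter on non-NULL columns.
    "wide": (
        'SELECT "name", "age" FROM wide '
        'WHERE "name" IS NOT NULL AND "age" IS NOT NULL'
    ),
}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build(conn, TRIPLES)
    for layout, sql in QUERIES.items():
        print(layout, conn.execute(sql).fetchall())
```

In this sketch the wide table answers the star pattern without any join, while the other two layouts need one join per additional triple pattern. This is only a plausible intuition for the abstract's finding, not an explanation taken from the paper.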

Published In

SBD '19: Proceedings of the International Workshop on Semantic Big Data
July 2019
57 pages
ISBN:9781450367660
DOI:10.1145/3323878

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. RDF
  2. SPARQL
  3. parquet
  4. relational schema
  5. spark SQL

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '19

Acceptance Rates

SBD '19 Paper Acceptance Rate 8 of 15 submissions, 53%;
Overall Acceptance Rate 30 of 54 submissions, 56%

Cited By

  • An Effective Framework for Enhancing Query Answering in a Heterogeneous Data Lake. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 770-780, 2023. DOI: 10.1145/3539618.3591637. Online publication date: 19-Jul-2023
  • Strabo 2: Distributed Management of Massive Geospatial RDF Datasets. The Semantic Web – ISWC 2022, pp. 411-427, 2022. DOI: 10.1007/978-3-031-19433-7_24. Online publication date: 16-Oct-2022
  • Towards Prescriptive Analyses of Querying Large Knowledge Graphs. New Trends in Database and Information Systems, pp. 639-647, 2022. DOI: 10.1007/978-3-031-15743-1_59. Online publication date: 29-Aug-2022
  • Towards making sense of Spark-SQL performance for processing vast distributed RDF datasets. Proceedings of The International Workshop on Semantic Big Data, pp. 1-6, 2020. DOI: 10.1145/3391274.3393632. Online publication date: 14-Jun-2020
  • On Complex Value Relations in Hive. Advances in Conceptual Modeling, pp. 146-156, 2019. DOI: 10.1007/978-3-030-34146-6_13. Online publication date: 27-Oct-2019
