DOI: 10.1145/3070607.3070610
research-article

Querying Semantic Knowledge Bases with SQL-on-Hadoop

Published: 14 May 2017

Abstract

The constant growth of semantically annotated data and an increasing interest in cross-domain knowledge bases raise the need for expressive query languages for RDF and for novel approaches that enable their evaluation at web scale. However, SPARQL, the W3C standard query language for RDF, has only a limited capability to express navigational queries. More expressive languages have been studied theoretically but not implemented. In this paper, we continue our work on TriAL-QL, an expressive (SQL-like) RDF query language based on the Triple Algebra with Recursion [31]. We present a new version of our TriAL-QL processor, which takes advantage of the current momentum in in-memory SQL-on-Hadoop solutions and is built on top of Impala and Spark while using one unified data storage. We use our system to study the application of multiple evaluation algorithms, storage strategies, and optimizations on Impala and Spark, highlighting their properties. Comprehensive experiments compare the performance of our system with that of other competitive RDF management systems. The results demonstrate its suitability for querying semantic knowledge bases: selective queries on datasets with more than one billion triples achieve interactive response times, and more data-intensive use cases that produce, e.g., over 25 billion results finish in the order of minutes.
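The abstract does not spell out which evaluation algorithms the system uses for recursion. As an illustration only, the following minimal sketch shows semi-naive evaluation of a transitive closure over a triple set, a standard technique for recursive navigational queries. It runs in memory in plain Python rather than on Impala or Spark, and the function name `transitive_closure` and the sample triples are hypothetical, not taken from the paper.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def transitive_closure(triples: Iterable[Triple], predicate: str) -> Set[Tuple[str, str]]:
    """Semi-naive evaluation of the transitive closure of one predicate."""
    # Base relation: (subject, object) pairs for the chosen predicate.
    edges = {(s, o) for s, p, o in triples if p == predicate}

    # Index edges by subject so each join step is a dictionary lookup.
    successors = {}
    for s, o in edges:
        successors.setdefault(s, set()).add(o)

    closure = set(edges)
    delta = set(edges)  # pairs discovered in the previous round
    while delta:
        # Join only the newly found pairs with the base relation
        # (this restriction to delta is the semi-naive step).
        candidates = {(s, o2) for s, o in delta for o2 in successors.get(o, ())}
        delta = candidates - closure
        closure |= delta
    return closure


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    data = [
        ("a", "knows", "b"),
        ("b", "knows", "c"),
        ("c", "knows", "d"),
        ("a", "likes", "d"),
    ]
    for pair in sorted(transitive_closure(data, "knows")):
        print(pair)
```

Each round joins only the pairs found in the previous round against the base relation, so already-derived pairs are not recomputed. In a SQL-on-Hadoop setting the same loop would presumably be expressed as a sequence of join queries over a triples table, but that mapping is an assumption here.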

References

[1]
D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In VLDB, pages 411--422, 2007.
[2]
S. Abiteboul, M. Bienvenu, A. Galland, and M. Rousset. Distributed datalog revisited. In Datalog Reloaded - First International Workshop, Datalog 2010, Oxford, UK, March 16-19, 2010. Revised Selected Papers, pages 252--261, 2010.
[3]
F. N. Afrati, V. R. Borkar, M. J. Carey, N. Polyzotis, and J. D. Ullman. Map-reduce extensions and recursive queries. In EDBT 2011, Sweden, March 21-24, 2011.
[4]
F. N. Afrati and J. D. Ullman. Transitive closure and recursive datalog implemented on clusters. In EDBT'12, Berlin, Germany, March 27-30, 2012, pages 132--143, 2012.
[5]
G. Aluc, O. Hartig, M. Özsu, and K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In ISWC, volume 8796 of LNCS, pages 197--212, 2014.
[6]
R. Angles. A comparison of current graph database models. In 28th ICDE Workshops, 2012, Arlington, USA, 2012.
[7]
Apache. Apache Parquet. http://parquet.io.
[8]
M. Arenas, G. Gottlob, and A. Pieris. Expressive languages for querying the semantic web. In Proc. of the 33rd ACM Symposium on Principles of Database Systems, PODS'14, Snowbird, UT, USA, June 22-27, 2014, pages 14--26, 2014.
[9]
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1383--1394. ACM, 2015.
[10]
C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154--165, 2009.
[11]
P. A. Boncz, O. Erling, and M.-D. Pham. Experiences with Virtuoso Cluster RDF Column Store. In Linked Data Manag., pages 239--259. Chapman and Hall/CRC, 2014.
[12]
V. Boshnjaku. A Scalable Engine for TriAL-QL on SQL-on-Hadoop. M.Sc. Thesis, University Freiburg, 2016.
[13]
J. Broekstra, A. Kampman, and F. v. Harmelen. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In The Semantic Web - ISWC 2002, number 2342 in Lecture Notes in Computer Science, pages 54--68. Springer Berlin Heidelberg, 2002.
[14]
F. Cacace, S. Ceri, and M. A. Houtsma. An overview of parallel strategies for transitive closure on algebraic machines. In Parallel Database Systems, pages 44--62. Springer, 1991.
[15]
J. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and K. Wilkinson. Jena: Implementing the Semantic Web Recommendations. In Proc. WWW Alt., pages 74--83, 2004.
[16]
P. Csermely, T. Korcsmáros, H. J. Kiss, G. London, and R. Nussinov. Structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review. Pharmacology & therapeutics, 138(3):333--408, 2013.
[17]
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating System Design and Implementation (OSDI), pages 137--150, San Francisco, California, USA, 2004. USENIX Association.
[18]
X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion. In Proc. of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601--610. ACM, 2014.
[19]
O. Erling and I. Mikhailov. Virtuoso: RDF Support in a Native RDBMS. In Semantic Web Information Manag., pages 501--519. Springer Berlin Heidelberg, 2010.
[20]
G. H. L. Fletcher, M. Gyssens, D. Leinders, J. V. den Bussche, D. V. Gucht, S. Vansummeren, and Y. Wu. The impact of transitive closure on the expressiveness of navigational query languages on unlabeled graphs. Ann. Math. Artif. Intell., 73(1-2):167--203, 2015.
[21]
S. Gurajada, S. Seufert, I. Miliaraki, and M. Theobald. TriAD: A Distributed Shared-nothing RDF Engine Based on Asynchronous Message Passing. In Proc. of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 289--300. ACM, 2014.
[22]
M. Hammoud, D. A. Rabbou, R. Nouri, S.-M.-R. Beheshti, and S. Sakr. DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication. Proc. VLDB Endow., 8(6):654--665, 2015.
[23]
S. Harris and N. Gibbins. 3store: Efficient Bulk RDF Storage. In Proceedings of the First International Workshop on Practical and Scalable Semantic Systems, volume 89 of CEUR Workshop Proceedings, 2003.
[24]
S. Harris, N. Lamb, and N. Shadbolt. 4store: The Design and Implementation of a Clustered RDF Store. In Proceedings of the 5th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2009), volume 517 of CEUR Workshop Proceedings, 2009.
[25]
A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A Federated Repository for Querying Graph Structured Data from the Web. In The Semantic Web, number 4825 in Lecture Notes in Computer Science, pages 211--224. Springer Berlin Heidelberg, 2007.
[26]
J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. AI, 194:28--61, 2013.
[27]
J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF Graphs. PVLDB, 4(11):1123--1134, 2011.
[28]
M. F. Husain, J. P. McGlothlin, M. M. Masud, L. R. Khan, and B. M. Thuraisingham. Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing. IEEE TKDE, 23(9), 2011.
[29]
Y. E. Ioannidis. On the computation of the transitive closure of relational operators. In VLDB '86, August 25-28, 1986, Kyoto, Japan, Proceedings., pages 403--411, 1986.
[30]
K. Lee and L. Liu. Scaling Queries over Big RDF Graphs with Semantic Hash Partitioning. Proc. VLDB Endow., 6(14):1894--1905, 2013.
[31]
L. Libkin, J. L. Reutter, and D. Vrgoc. Trial for RDF: adapting graph query languages for RDF data. In Proceedings of the 32nd ACM PODS 2013, New York, NY, USA - June 22-27, 2013, pages 201--212, 2013.
[32]
F. Manola, E. Miller, and B. McBride. RDF 1.1 Primer. http://www.w3.org/TR/rdf-primer/, 2014.
[33]
Neo-Technology. Neo4j. https://neo4j.com/, 2016.
[34]
T. Neumann and G. Weikum. The RDF-3X Engine for Scalable Management of RDF Data. The VLDB Journal, 19(1):91--113, 2010.
[35]
A. Owens. Clustered TDB: A Clustered Triple Store for Jena. In Proceedings of the 18th international conference on World Wide Web (WWW), 2009.
[36]
N. Papailiou, I. Konstantinou, D. Tsoumakos, P. Karras, and N. Koziris. H2rdf+: High-performance distributed joins over large-scale RDF graphs. In 2013 IEEE International Conference on Big Data, pages 255--263, 2013.
[37]
J. Pérez, M. Arenas, and C. Gutierrez. nSPARQL: A navigational language for RDF. J. Web Sem., 8(4):255--270, 2010.
[38]
M. Przyjaciel-Zablocki, A. Schätzle, and A. Lange. TriAL-QL: Distributed Processing of Navigational Queries. In Proc. AMW, Lima, Peru, volume 1378 of CEUR Workshop Proceedings, 2015.
[39]
M. Przyjaciel-Zablocki, A. Schätzle, and G. Lausen. TriAL-QL: Distributed Processing of Navigational Queries. In Proc. of the 18th International Workshop on Web and Databases (WebDB), Melbourne, Australia, WebDB '15, pages 48--54, 2015.
[40]
J. L. Reutter, A. Soto, and D. Vrgoc. Recursion in SPARQL. In The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I, pages 19--35, 2015.
[41]
K. Rohloff and R. E. Schantz. Clause-iteration with MapReduce to Scalably Query Datagraphs in the SHARD Graph-store. In Proc. of the 4th International Workshop on Data-intensive Distributed Computing, DIDC '11, pages 35--44. ACM, 2011.
[42]
A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen. PigSPARQL: Mapping SPARQL to Pig Latin. In Proceedings of the International Workshop on Semantic Web Information Management (SWIM), Athens, Greece, SWIM'11, pages 4:1--4:8, 2011.
[43]
A. Schätzle, M. Przyjaciel-Zablocki, A. Neu, and G. Lausen. Sempala: Interactive SPARQL Query Processing on Hadoop. In Proceedings of the 13th International Semantic Web Conference (ISWC), Riva del Garda, Italy, volume 8796 of Lecture Notes in Computer Science (LNCS), pages 164--179, 2014.
[44]
A. Schätzle, M. Przyjaciel-Zablocki, S. Skilevic, and G. Lausen. S2RDF: RDF Querying with SPARQL on Spark. Proceedings of the VLDB Endowment (PVLDB), 9(10):804--815, 2016.
[45]
J. Seo, J. Park, J. Shin, and M. S. Lam. Distributed socialite: A datalog-based language for large-scale graph analysis. PVLDB, 6(14):1906--1917, 2013.
[46]
L. Shala. Distributed Processing of RDFPath Queries. M.Sc. thesis, University Freiburg, 2016.
[47]
M. Shaw, P. Koutris, B. Howe, and D. Suciu. Optimizing large-scale semi-naïve datalog evaluation in hadoop. In Datalog in Academia and Industry, Datalog 2.0, Vienna, Austria, September 11-13, pages 165--176, 2012.
[48]
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A Warehousing Solution over a Map-Reduce Framework. Proc. VLDB Endow., 2(2):1626--1629, 2009.
[49]
J. Urbani, S. Kotoulas, J. Maassen, F. van Harmelen, and H. E. Bal. WebPIE: A Web-scale Parallel Inference Engine using MapReduce. J. Web Sem., 10:59--75, 2012.
[50]
T. White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (2. ed.). O'Reilly, 2011.
[51]
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Fast and Interactive Analytics Over Hadoop Data with Spark. USENIX ;login:, 34(4):45--51, 2012.

Cited By

  • (2023) Technological Prospects of Cloud Computing in Web Mining: Recent Trends and Opportunities. international journal of engineering technology and management sciences, 7:1 (98-104). DOI: 10.46647/ijetms.2023.v07i01.017. Online publication date: 28-Feb-2023
  • (2018) Report from the Fourth Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR '17). ACM SIGMOD Record, 46:4 (44-48). DOI: 10.1145/3186549.3186561. Online publication date: 22-Feb-2018
  • (2018) Storing and Querying Semantic Data in the Cloud. Reasoning Web. Learning, Uncertainty, Streaming, and Scalability (173-222). DOI: 10.1007/978-3-030-00338-8_7. Online publication date: 22-Sep-2018

Published In

BeyondMR'17: Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond
May 2017
76 pages
ISBN:9781450350198
DOI:10.1145/3070607

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS'17

Acceptance Rates

BeyondMR'17 Paper Acceptance Rate 9 of 17 submissions, 53%;
Overall Acceptance Rate 19 of 36 submissions, 53%

