Abstract
State-of-the-art distributed RDF systems partition data across multiple compute nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation. Others try to minimize inter-node communication, which requires an expensive data preprocessing phase and leads to a high startup cost. A priori knowledge of the query workload has also been used to create partitions, which, however, are static and do not adapt to workload changes. In this paper, we propose AdPart, a distributed RDF system that addresses the shortcomings of previous work. First, AdPart applies lightweight partitioning on the initial data, distributing triples by hashing on their subjects; this keeps its startup overhead low. At the same time, AdPart's locality-aware query optimizer takes full advantage of the partitioning to (1) support fully parallel processing of join patterns on subjects and (2) minimize data communication for general queries by hash-distributing intermediate results instead of broadcasting them, wherever possible. Second, AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among the workers. As a result, the communication cost of future queries is drastically reduced or even eliminated. To control replication, AdPart implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdPart (1) starts faster than all existing systems, (2) processes thousands of queries before other systems come online, and (3) gracefully adapts to the query load, evaluating queries on billion-scale RDF data in sub-second time.
Notes
For simplicity, we use \(i = t.subject \bmod W\).
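The subject-hash assignment from this note can be sketched as follows (an illustrative sketch, assuming string-valued subjects and Python's built-in `hash`; the helper name `partition_by_subject` is not from the paper). Because all triples sharing a subject hash to the same worker, subject-subject joins can run fully in parallel without communication:

```python
from collections import defaultdict

def partition_by_subject(triples, num_workers):
    """Assign each triple (s, p, o) to worker hash(s) mod W, so that all
    triples with the same subject are co-located on one worker."""
    partitions = defaultdict(list)
    for s, p, o in triples:
        i = hash(s) % num_workers
        partitions[i].append((s, p, o))
    return partitions

triples = [("ex:alice", "ex:knows", "ex:bob"),
           ("ex:alice", "rdf:type", "ex:Person"),
           ("ex:bob", "rdf:type", "ex:Person")]
parts = partition_by_subject(triples, 4)
# All of ex:alice's triples land on the same worker.
```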
In many RDF datasets, vertex degrees follow a power-law distribution, where a few vertices have extremely high degrees. For example, vertices that appear as objects in triples with rdf:type have very high degree centrality. Treating such vertices as cores results in imbalanced partitions and prevents the system from taking full advantage of parallelism [19].
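Such high-degree vertices can be flagged with a simple in-degree count, as in this hypothetical sketch (the function name and the threshold value are assumptions for illustration, not AdPart's actual mechanism):

```python
from collections import Counter

def high_degree_objects(triples, threshold):
    """Return object vertices whose in-degree exceeds the threshold,
    e.g. class vertices targeted by many rdf:type triples."""
    in_degree = Counter(o for _, _, o in triples)
    return {v for v, d in in_degree.items() if d > threshold}

# 100 entities of the same class: ex:Person becomes a high-degree hub.
triples = [("ex:s%d" % i, "rdf:type", "ex:Person") for i in range(100)]
triples.append(("ex:s0", "ex:knows", "ex:s1"))
hubs = high_degree_objects(triples, threshold=10)
# → {"ex:Person"}
```

Excluding such hubs from the set of core vertices avoids the imbalanced partitions described above.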
Recall that if a core vertex is a subject, we do not redistribute it.
Auto-tuning the frequency threshold is a subject of our future work.
Only the query patterns are used. Classes and properties are fixed so that queries return large intermediate results.
References
Aluç, G., Özsu, M.T., Daudjee, K.: Workload matters: why RDF databases need a new design. PVLDB 7(10), 837–840 (2014)
Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix “Bit” loaded: a scalable lightweight join query processor for RDF data. In: WWW (2010)
Blanas, S., Li, Y., Patel, J.M.: Design and evaluation of main memory hash join algorithms for multi-core CPUs. In: SIGMOD (2011)
Bol’shev, L., Ubaidullaeva, M.: Chauvenet’s test in the classical theory of errors. Theory Prob. Appl. 19(4), 683–692 (1975)
Boyer, R.S., Moore, J.S.: MJRTY: a fast majority vote algorithm. In: Boyer, R.S. (ed.) Automated Reasoning: Essays in Honor of Woody Bledsoe, pp. 105–118. Kluwer, London (1991)
Chong, Z., Chen, H., Zhang, Z., Shu, H., Qi, G., Zhou, A.: RDF pattern matching using sortable views. In: CIKM (2012)
Curino, C., Jones, E., Zhang, Y., Madden, S.: Schism: a workload-driven approach to database replication and partitioning. PVLDB 3(1–2), 48–57 (2010)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI (2004)
Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1–2), 515–529 (2010)
Dritsou, V., Constantopoulos, P., Deligiannakis, A., Kotidis, Y.: Optimizing query shortcuts in RDF databases. In: ESWC (2011)
Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)
Message Passing Interface Forum: MPI: a message-passing interface standard. Tech. rep., Knoxville, TN, USA (1994)
Galarraga, L., Hose, K., Schenkel, R.: Partout: a distributed engine for efficient RDF processing. CoRR arXiv:1212.5636 (2012)
Gallego, M.A., Fernández, J.D., Martínez-Prieto, M.A., de la Fuente, P.: An empirical study of real-world SPARQL queries. In: USEWOD (2011)
Goasdoué, F., Karanasos, K., Leblay, J., Manolescu, I.: View selection in semantic web databases. PVLDB 5(2), 97–108 (2011)
Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. In: SIGMOD (2014)
Harth, A., Umbrich, J., Hogan, A., Decker, S.: YARS2: a federated repository for querying graph structured data from the Web. In: ISWC/ASWC, vol. 4825 (2007)
Hose, K., Schenkel, R.: WARP: workload-aware replication and partitioning for RDF. In: ICDEW (2013)
Huang, J., Abadi, D., Ren, K.: Scalable SPARQL querying of large RDF graphs. PVLDB 4(11), 1123–1134 (2011)
Husain, M., McGlothlin, J., Masud, M., Khan, L., Thuraisingham, B.: Heuristics-based query processing for large RDF graphs using cloud computing. TKDE 23(9), 1312–1327 (2011)
Idreos, S., Kersten, M.L., Manegold, S.: Database cracking. In: CIDR (2007)
Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)
Lee, K., Liu, L.: Scaling queries over big RDF graphs with semantic hash partitioning. PVLDB 6(14), 1894–1905 (2013)
Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: SIGMOD (2010)
Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 91–113 (2010)
Papailiou, N., Konstantinou, I., Tsoumakos, D., Karras, P., Koziris, N.: H2RDF+: high-performance distributed joins over large-scale RDF graphs. In: IEEE Big Data (2013)
Punnoose, R., Crainiceanu, A., Rapp, D.: Rya: a scalable RDF triple store for the clouds. In: Cloud-I (2012)
Rietveld, L., Hoekstra, R., Schlobach, S., Guéret, C.: Structural properties as proxy for semantic relevance in RDF graph sampling. In: ISWC (2014)
Rohloff, K., Schantz, R.E.: High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store. In: PSI EtA (2010)
Shen, Y., Chen, G., Jagadish, H.V., Lu, W., Ooi, B.C., Tudor, B.M.: Fast failure recovery in distributed graph processing systems. PVLDB 8(4), 437–448 (2014)
Stonebraker, M., Madden, S., Abadi, D., Harizopoulos, S., Hachem, N., Helland, P.: The end of an architectural era (it’s time for a complete rewrite). In: VLDB, pp. 1150–1160 (2007)
Wang, L., Xiao, Y., Shao, B., Wang, H.: How to partition a billion-node graph. In: ICDE (2014)
Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1), 1008–1019 (2008)
Wu, B., Zhou, Y., Yuan, P., Liu, L., Jin, H.: Scalable SPARQL querying using path partitioning. In: ICDE (2015)
Yang, S., Yan, X., Zong, B., Khan, A.: Towards effective partition management for large graphs. In: SIGMOD (2012)
Yuan, P., Liu, P., Wu, B., Jin, H., Zhang, W., Liu, L.: TripleBit: a fast and compact system for large scale RDF data. PVLDB 6(7), 517–528 (2013)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: USENIX (2010)
Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for web scale RDF data. PVLDB 6(4), 265–276 (2013)
Zhang, X., Chen, L., Tong, Y., Wang, M.: EAGRE: towards scalable I/O efficient SPARQL query evaluation on the cloud. In: ICDE (2013)
Zou, L., Özsu, M.T., Chen, L., Shen, X., Huang, R., Zhao, D.: gStore: a graph-based SPARQL query engine. VLDB J. 23(4), 565–590 (2014)
Appendices
Appendix 1: Workload queries repetition
In this experiment, we test AdPart’s performance on a realistic workload, in which a certain percentage of the queries is repeated while new queries keep arriving. We use three workloads, each containing 10K random LUBM queries, out of which a certain percentage is repeated. Figure 20a shows AdPart’s performance as the percentage of repeated queries varies among 20, 40 and 80%. As the results suggest, the higher the percentage of repeated queries, the lower the workload execution time. Since AdPart monitors query patterns rather than individual queries, it captures most of the patterns in the workload even when only 20% of its queries are repeated.
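The reason 20% repetition already suffices can be sketched by canonicalizing each query to its pattern, so that queries differing only in constants count as the same pattern (a hypothetical simplification for illustration; `to_pattern` and the placeholder naming are not from the paper):

```python
def to_pattern(query_triples):
    """Canonicalize a query: constants in subject/object positions become
    numbered placeholders, while predicates stay fixed. Two queries that
    differ only in their constants map to the same pattern."""
    pattern = []
    counter = 0
    for s, p, o in query_triples:
        if not s.startswith("?"):
            counter += 1
            s = "?c%d" % counter
        if not o.startswith("?"):
            counter += 1
            o = "?c%d" % counter
        pattern.append((s, p, o))
    return tuple(pattern)

q1 = [("?x", "ub:memberOf", "ex:Dept1")]
q2 = [("?x", "ub:memberOf", "ex:Dept7")]
# Both queries share one pattern: to_pattern(q1) == to_pattern(q2)
```

Monitoring at this granularity lets a system adapt to many unseen queries from a few observed ones.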
Appendix 2: Average partition size
In this experiment, we report how the average partition size changes during workload execution. Using the 10K-query LUBM workload, Fig. 20b shows how the partition size increases as more queries are executed. Initially, each partition contains around 19M triples, which corresponds to a 0% replication ratio, as AdPart loads only the original dataset. As the system adapts, the size of each partition gradually increases until it reaches an average of around 33M triples, which corresponds to a replication ratio of 72% after executing the whole 10K-query workload.
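With the rounded figures above, the replication ratio follows directly as the replicated fraction over the original partition size (a back-of-the-envelope sanity check, not system output; the small gap to the reported 72% comes from rounding the partition sizes to 19M and 33M):

```python
# Rounded averages from Fig. 20b (approximate, for illustration only).
ORIGINAL = 19_000_000  # avg. triples per partition at load time (0% replication)
FINAL = 33_000_000     # avg. triples per partition after the 10K-query workload

# Replication ratio: extra (replicated) triples relative to the original data.
replication_ratio = (FINAL - ORIGINAL) / ORIGINAL
print(round(replication_ratio, 2))  # ≈ 0.74 with these rounded inputs,
                                    # consistent with the reported ~72%
```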
Cite this article
Harbi, R., Abdelaziz, I., Kalnis, P. et al. Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. The VLDB Journal 25, 355–380 (2016). https://doi.org/10.1007/s00778-016-0420-y