Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning

  • Regular Paper
  • Published in The VLDB Journal

Abstract

State-of-the-art distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation. Others try to minimize inter-node communication, which requires an expensive data preprocessing phase, leading to a high startup cost. A priori knowledge of the query workload has also been used to create partitions, which, however, are static and do not adapt to workload changes. In this paper, we propose AdPart, a distributed RDF system that addresses the shortcomings of previous work. First, AdPart applies lightweight partitioning on the initial data, distributing triples by hashing on their subjects; this keeps its startup overhead low. At the same time, the locality-aware query optimizer of AdPart takes full advantage of the partitioning to (1) support fully parallel processing of join patterns on subjects and (2) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent patterns among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdPart implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdPart: (1) starts faster than all existing systems; (2) processes thousands of queries before other systems come online; and (3) gracefully adapts to the query load, evaluating queries on billion-scale RDF data in sub-second time.
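
To make the subject-hash locality concrete, the following is a minimal sketch of hash partitioning triples by subject across a set of workers, in the spirit of the scheme described above (note 4 below gives the corresponding formula for integer-encoded subjects). The Triple struct and function names are illustrative assumptions, not AdPart's actual code.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative triple representation (an assumption; the mod formula in
// note 4 suggests subjects are integer-encoded in the actual system).
struct Triple {
    std::string subject, predicate, object;
};

// Assign every triple to a worker by hashing its subject, so all triples
// that share a subject end up on the same worker. With integer-encoded
// subjects this reduces to i = subject mod W.
std::vector<std::vector<Triple>> partitionBySubject(
        const std::vector<Triple>& triples, std::size_t numWorkers) {
    std::vector<std::vector<Triple>> partitions(numWorkers);
    std::hash<std::string> hasher;
    for (const Triple& t : triples) {
        partitions[hasher(t.subject) % numWorkers].push_back(t);
    }
    return partitions;
}
```

Because all triples with the same subject are co-located, star joins on a common subject can be evaluated fully in parallel without communication, which is what the locality-aware optimizer exploits; for other joins, intermediate results can be hash-distributed on the join column instead of broadcast.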



Notes

  1. http://www.bio2rdf.org/

  2. http://yago-knowledge.org/

  3. http://www.w3.org/TR/rdf-sparql-query/

  4. For simplicity, we use: \(i = t.subject \mod W\).

  5. In many RDF datasets, vertex degrees follow a power-law distribution, in which a few vertices have extremely high degrees. For example, vertices that appear as objects in triples with rdf:type have very high degree centrality. Treating such vertices as cores results in imbalanced partitions and prevents the system from taking full advantage of parallelism [19] (see the illustrative sketch after these notes).

  6. Recall that if a core vertex is a subject, we do not redistribute it.

  7. Auto-tuning the frequency threshold is a subject of our future work.

  8. http://swat.cse.lehigh.edu/projects/lubm/

  9. http://db.uwaterloo.ca/watdiv/

  10. http://yago-knowledge.org/

  11. http://download.bio2rdf.org/release/2/

  12. http://cloud.kaust.edu.sa/Pages/adpart.aspx

  13. http://db.uwaterloo.ca/watdiv/basic-testing.shtml

  14. Only query patterns are used. Classes and properties are fixed so queries return large intermediate results.
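
As referenced in note 5 above, treating power-law hubs (such as the objects of rdf:type triples) as cores leads to imbalanced partitions. The following is a minimal, hypothetical sketch of how such high-degree vertices could be filtered out when selecting core candidates; the degree cutoff and the selection rule are assumptions made for illustration and are not AdPart's actual criterion.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Count how often each vertex appears as an object and keep only vertices
// whose degree stays below a cutoff, excluding power-law hubs from the
// pool of core candidates.
std::unordered_set<std::string> coreCandidates(
        const std::vector<std::pair<std::string, std::string>>& subjectObjectEdges,
        std::size_t degreeCutoff) {
    std::unordered_map<std::string, std::size_t> inDegree;
    for (const auto& edge : subjectObjectEdges) {
        ++inDegree[edge.second];          // occurrences as an object
    }
    std::unordered_set<std::string> candidates;
    for (const auto& [vertex, degree] : inDegree) {
        if (degree <= degreeCutoff) {     // skip hubs such as rdf:type classes
            candidates.insert(vertex);
        }
    }
    return candidates;
}
```

How the cutoff itself is chosen (a fixed value, a percentile, or an outlier test over the degree distribution) is orthogonal to this sketch.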

References

  1. Aluç, G., Özsu, M.T., Daudjee, K.: Workload matters: Why RDF databases need a new design. PVLDB 7(10), 837–840 (2014)

  2. Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix “Bit” loaded: a scalable lightweight join query processor for RDF data. In: WWW (2010)

  3. Blanas, S., Li, Y., Patel, J.M.: Design and evaluation of main memory hash join algorithms for multi-core CPUs. In: SIGMOD (2011)

  4. Bol’shev, L., Ubaidullaeva, M.: Chauvenet’s test in the classical theory of errors. Theory Prob. Appl. 19(4), 683–692 (1975)

  5. Boyer, R.S., Strother Moore, J.: MJRTY: a fast majority vote algorithm. In: Boyer, R.S. (ed.) Automated Reasoning: Essays in Honor of Woody Bledsoe, pp. 105–118. Kluwer, London (1991)

  6. Chong, Z., Chen, H., Zhang, Z., Shu, H., Qi, G., Zhou, A.: RDF pattern matching using sortable views. In: CIKM (2012)

  7. Curino, C., Jones, E., Zhang, Y., Madden, S.: Schism: a workload-driven approach to database replication and partitioning. PVLDB 3(1–2), 48–57 (2010)

  8. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI (2004)

  9. Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1–2), 515–529 (2010)

  10. Dritsou, V., Constantopoulos, P., Deligiannakis, A., Kotidis, Y.: Optimizing query shortcuts in RDF databases. In: ESWC (2011)

  11. Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)

  12. Message Passing Interface Forum: MPI: a message-passing interface standard. Tech. rep., Knoxville, TN, USA (1994)

  13. Galarraga, L., Hose, K., Schenkel, R.: Partout: a distributed engine for efficient RDF processing. CoRR arXiv:1212.5636 (2012)

  14. Gallego, M.A., Fernández, J.D., Martínez-Prieto, M.A., de la Fuente, P.: An empirical study of real-world SPARQL queries. In: USEWOD (2011)

  15. Goasdoué, F., Karanasos, K., Leblay, J., Manolescu, I.: View selection in semantic web databases. PVLDB 5(2), 97–108 (2011)

  16. Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: TriAD: A distributed shared-nothing RDF engine based on asynchronous message passing. In: SIGMOD (2014)

  17. Harth, A., Umbrich, J., Hogan, A., Decker, S.: YARS2: A federated repository for querying graph structured data from the Web. In: ISWC/ASWC, vol. 4825 (2007)

  18. Hose, K., Schenkel, R.: WARP: workload-aware replication and partitioning for RDF. In: ICDEW (2013)

  19. Huang, J., Abadi, D., Ren, K.: Scalable SPARQL querying of large RDF graphs. PVLDB 4(11), 1123–1134 (2011)

  20. Husain, M., McGlothlin, J., Masud, M., Khan, L., Thuraisingham, B.: Heuristics-based query processing for large RDF graphs using cloud computing. TKDE 23(9), 1312–1327 (2011)

  21. Idreos, S., Kersten, M.L., Manegold, S.: Database cracking. In: CIDR (2007)

  22. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)

  23. Lee, K., Liu, L.: Scaling queries over big RDF graphs with semantic hash partitioning. PVLDB 6(14), 1894–1905 (2013)

  24. Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: SIGMOD (2010)

  25. Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 91–113 (2010)

  26. Papailiou, N., Konstantinou, I., Tsoumakos, D., Karras, P., Koziris, N.: H2RDF+: high-performance distributed joins over large-scale RDF graphs. In: IEEE Big Data (2013)

  27. Punnoose, R., Crainiceanu, A., Rapp, D.: Rya: a scalable RDF triple store for the clouds. In: Cloud-I (2012)

  28. Rietveld, L., Hoekstra, R., Schlobach, S., Guéret, C.: Structural properties as proxy for semantic relevance in RDF graph sampling. In: ISWC (2014)

  29. Rohloff, K., Schantz, R.E.: High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store. In: PSI EtA (2010)

  30. Shen, Y., Chen, G., Jagadish, H.V., Lu, W., Ooi, B.C., Tudor, B.M.: Fast failure recovery in distributed graph processing systems. PVLDB 8(4), 437–448 (2014)

  31. Stonebraker, M., Madden, S., Abadi, D., Harizopoulos, S., Hachem, N., Helland, P.: The end of an architectural era (it’s time for a complete rewrite). In: VLDB, pp. 1150–1160 (2007)

  32. Wang, L., Xiao, Y., Shao, B., Wang, H.: How to partition a billion-node graph. In: ICDE (2014)

  33. Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1), 1008–1019 (2008)

  34. Wu, B., Zhou, Y., Yuan, P., Liu, L., Jin, H.: Scalable SPARQL querying using path partitioning. In: ICDE (2015)

  35. Yang, S., Yan, X., Zong, B., Khan, A.: Towards effective partition management for large graphs. In: SIGMOD (2012)

  36. Yuan, P., Liu, P., Wu, B., Jin, H., Zhang, W., Liu, L.: TripleBit: a fast and compact system for large scale RDF data. PVLDB 6(7), 517–528 (2013)

  37. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: USENIX (2010)

  38. Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for web scale RDF data. PVLDB 6(4), 265–276 (2013)

  39. Zhang, X., Chen, L., Tong, Y., Wang, M.: EAGRE: towards scalable I/O efficient SPARQL query evaluation on the cloud. In: ICDE (2013)

  40. Zou, L., Özsu, M.T., Chen, L., Shen, X., Huang, R., Zhao, D.: gStore: a graph-based SPARQL query engine. VLDB J. 23(4), 565–590 (2014)

Author information

Corresponding author

Correspondence to Razen Harbi.

Appendices

Appendix 1: Workload query repetition

In this experiment, we test AdPart’s performance on a realistic workload in which a certain percentage of the queries is repeated while new queries continue to arrive. We use three workloads, each containing 10K random LUBM queries, out of which a certain percentage is repeated. Figure 20a shows AdPart’s performance as the fraction of repeated queries varies among 20, 40, and 80%. As the results suggest, the more queries are repeated, the lower the workload execution time. Because AdPart monitors query patterns rather than individual queries, it captures most of the patterns in the workload even when only 20% of its queries are repeated.
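
The behavior in Fig. 20a is consistent with monitoring at the level of query patterns rather than query strings: repeated queries map to the same pattern, so even modest repetition quickly marks the hot patterns. Below is a minimal sketch under the assumption of a simple per-pattern counter with a fixed frequency threshold (note 7 above states that auto-tuning this threshold is future work); the class, its interface, and the redistribute hook are hypothetical, not AdPart’s actual implementation.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical pattern-level monitor with a fixed frequency threshold.
class PatternMonitor {
public:
    explicit PatternMonitor(std::size_t threshold) : threshold_(threshold) {}

    // Called once per pattern occurrence in an executed query. Different
    // queries with the same structure share one pattern key, so hot
    // patterns cross the threshold even at low query-repetition rates.
    void observe(const std::string& patternKey) {
        if (++counts_[patternKey] == threshold_) {
            redistribute(patternKey);  // placeholder: replicate this pattern's data
        }
    }

private:
    void redistribute(const std::string& /*patternKey*/) {
        // In a real system this would trigger hash-based redistribution and
        // replication of the pattern's triples among workers, after which
        // queries containing the pattern need no communication.
    }

    std::size_t threshold_;
    std::unordered_map<std::string, std::size_t> counts_;
};
```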

Appendix 2: Average partition size

In this experiment, we report how the average partition size changes during workload execution. Using the 10K-query LUBM workload, Fig. 20b shows how the partition size increases as more queries are executed. Initially, each partition contains around 19M triples, which corresponds to a 0% replication ratio because AdPart loads only the original dataset. As the system adapts, the size of each partition gradually increases until it reaches an average of around 33M triples, corresponding to a replication ratio of about 72% after the entire 10K-query workload has been executed.
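
For a back-of-the-envelope check, if the replication ratio is taken to be the volume of replicated triples relative to the original triples per worker (an assumed definition for this sketch), the reported partition sizes are roughly consistent with the reported ratio: \((33\text{M} - 19\text{M}) / 19\text{M} \approx 0.74\), which agrees with the stated 72% once the rounding of the partition sizes is taken into account.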

Cite this article

Harbi, R., Abdelaziz, I., Kalnis, P. et al. Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. The VLDB Journal 25, 355–380 (2016). https://doi.org/10.1007/s00778-016-0420-y

