Abstract
State-of-the-art distributed RDF systems partition data across multiple compute nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation. Others try to minimize inter-node communication, which requires an expensive data preprocessing phase and leads to a high startup cost. A priori knowledge of the query workload has also been used to create partitions, which, however, are static and do not adapt to workload changes. In this paper, we propose AdPart, a distributed RDF system that addresses the shortcomings of previous work. First, AdPart applies lightweight partitioning on the initial data, distributing triples by hashing on their subjects; this keeps its startup overhead low. At the same time, AdPart's locality-aware query optimizer takes full advantage of the partitioning to (1) support fully parallel processing of join patterns on subjects and (2) minimize data communication for general queries by hash-distributing intermediate results instead of broadcasting them, wherever possible. Second, AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among the workers. As a result, the communication cost of future queries is drastically reduced or even eliminated. To control replication, AdPart implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdPart (1) starts faster than all existing systems, (2) processes thousands of queries before other systems come online, and (3) gracefully adapts to the query load, evaluating queries on billion-scale RDF data in sub-second time.
Notes
For simplicity, we use \(i = t.subject \bmod W\).
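The subject-hash assignment from this note can be sketched as follows (an illustrative sketch, assuming string-valued subjects and Python's built-in `hash`; the helper name `partition_by_subject` is not from the paper). Because all triples sharing a subject hash to the same worker, subject-subject joins can run fully in parallel without communication:

```python
from collections import defaultdict

def partition_by_subject(triples, num_workers):
    """Assign each triple (s, p, o) to worker hash(s) mod W, so that all
    triples with the same subject are co-located on one worker."""
    partitions = defaultdict(list)
    for s, p, o in triples:
        i = hash(s) % num_workers
        partitions[i].append((s, p, o))
    return partitions

triples = [("ex:alice", "ex:knows", "ex:bob"),
           ("ex:alice", "rdf:type", "ex:Person"),
           ("ex:bob", "rdf:type", "ex:Person")]
parts = partition_by_subject(triples, 4)
# All of ex:alice's triples land on the same worker.
```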
In many RDF datasets, vertex degrees follow a power-law distribution, where a few vertices have extremely high degrees. For example, vertices that appear as objects in triples with rdf:type have very high degree centrality. Treating such vertices as cores results in imbalanced partitions and prevents the system from taking full advantage of parallelism [19].
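Such high-degree vertices can be flagged with a simple in-degree count, as in this hypothetical sketch (the function name and the threshold value are assumptions for illustration, not AdPart's actual mechanism):

```python
from collections import Counter

def high_degree_objects(triples, threshold):
    """Return object vertices whose in-degree exceeds the threshold,
    e.g. class vertices targeted by many rdf:type triples."""
    in_degree = Counter(o for _, _, o in triples)
    return {v for v, d in in_degree.items() if d > threshold}

# 100 entities of the same class: ex:Person becomes a high-degree hub.
triples = [("ex:s%d" % i, "rdf:type", "ex:Person") for i in range(100)]
triples.append(("ex:s0", "ex:knows", "ex:s1"))
hubs = high_degree_objects(triples, threshold=10)
# → {"ex:Person"}
```

Excluding such hubs from the set of core vertices avoids the imbalanced partitions described above.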
Recall that if a core vertex is a subject, we do not redistribute it.
Auto-tuning the frequency threshold is a subject of our future work.
Only the query patterns are used. Classes and properties are fixed so that queries return large intermediate results.
References
Aluç, G., Özsu, M.T., Daudjee, K.: Workload matters: why RDF databases need a new design. PVLDB 7(10), 837–840 (2014)
Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix “Bit” loaded: a scalable lightweight join query processor for RDF data. In: WWW (2010)
Blanas, S., Li, Y., Patel, J.M.: Design and evaluation of main memory hash join algorithms for multi-core CPUs. In: SIGMOD (2011)
Bol’shev, L., Ubaidullaeva, M.: Chauvenet’s test in the classical theory of errors. Theory Prob. Appl. 19(4), 683–692 (1975)
Boyer, R.S., Moore, J.S.: MJRTY: a fast majority vote algorithm. In: Boyer, R.S. (ed.) Automated Reasoning: Essays in Honor of Woody Bledsoe, pp. 105–118. Kluwer, London (1991)
Chong, Z., Chen, H., Zhang, Z., Shu, H., Qi, G., Zhou, A.: RDF pattern matching using sortable views. In: CIKM (2012)
Curino, C., Jones, E., Zhang, Y., Madden, S.: Schism: a workload-driven approach to database replication and partitioning. PVLDB 3(1–2), 48–57 (2010)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI (2004)
Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1–2), 515–529 (2010)
Dritsou, V., Constantopoulos, P., Deligiannakis, A., Kotidis, Y.: Optimizing query shortcuts in RDF databases. In: ESWC (2011)
Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)
Message Passing Interface Forum: MPI: a message-passing interface standard. Tech. rep., Knoxville, TN, USA (1994)
Galarraga, L., Hose, K., Schenkel, R.: Partout: a distributed engine for efficient RDF processing. CoRR arXiv:1212.5636 (2012)
Gallego, M.A., Fernández, J.D., Martínez-Prieto, M.A., de la Fuente, P.: An empirical study of real-world SPARQL queries. In: USEWOD (2011)
Goasdoué, F., Karanasos, K., Leblay, J., Manolescu, I.: View selection in semantic web databases. PVLDB 5(2), 97–108 (2011)
Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. In: SIGMOD (2014)
Harth, A., Umbrich, J., Hogan, A., Decker, S.: YARS2: a federated repository for querying graph structured data from the Web. In: ISWC/ASWC, vol. 4825 (2007)
Hose, K., Schenkel, R.: WARP: workload-aware replication and partitioning for RDF. In: ICDEW (2013)
Huang, J., Abadi, D., Ren, K.: Scalable SPARQL querying of large RDF graphs. PVLDB 4(11), 1123–1134 (2011)
Husain, M., McGlothlin, J., Masud, M., Khan, L., Thuraisingham, B.: Heuristics-based query processing for large RDF graphs using cloud computing. TKDE 23(9), 1312–1327 (2011)
Idreos, S., Kersten, M.L., Manegold, S.: Database cracking. In: CIDR (2007)
Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)
Lee, K., Liu, L.: Scaling queries over big RDF graphs with semantic hash partitioning. PVLDB 6(14), 1894–1905 (2013)
Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: SIGMOD (2010)
Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 91–113 (2010)
Papailiou, N., Konstantinou, I., Tsoumakos, D., Karras, P., Koziris, N.: H2RDF+: high-performance distributed joins over large-scale RDF graphs. In: IEEE Big Data (2013)
Punnoose, R., Crainiceanu, A., Rapp, D.: Rya: a scalable RDF triple store for the clouds. In: Cloud-I (2012)
Rietveld, L., Hoekstra, R., Schlobach, S., Guéret, C.: Structural properties as proxy for semantic relevance in RDF graph sampling. In: ISWC (2014)
Rohloff, K., Schantz, R.E.: High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store. In: PSI EtA (2010)
Shen, Y., Chen, G., Jagadish, H.V., Lu, W., Ooi, B.C., Tudor, B.M.: Fast failure recovery in distributed graph processing systems. PVLDB 8(4), 437–448 (2014)
Stonebraker, M., Madden, S., Abadi, D., Harizopoulos, S., Hachem, N., Helland, P.: The end of an architectural era (it’s time for a complete rewrite). In: VLDB, pp. 1150–1160 (2007)
Wang, L., Xiao, Y., Shao, B., Wang, H.: How to partition a billion-node graph. In: ICDE (2014)
Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1), 1008–1019 (2008)
Wu, B., Zhou, Y., Yuan, P., Liu, L., Jin, H.: Scalable SPARQL querying using path partitioning. In: ICDE (2015)
Yang, S., Yan, X., Zong, B., Khan, A.: Towards effective partition management for large graphs. In: SIGMOD (2012)
Yuan, P., Liu, P., Wu, B., Jin, H., Zhang, W., Liu, L.: TripleBit: a fast and compact system for large scale RDF data. PVLDB 6(7), 517–528 (2013)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: USENIX (2010)
Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for web scale RDF data. PVLDB 6(4), 265–276 (2013)
Zhang, X., Chen, L., Tong, Y., Wang, M.: EAGRE: towards scalable I/O efficient SPARQL query evaluation on the cloud. In: ICDE (2013)
Zou, L., Özsu, M.T., Chen, L., Shen, X., Huang, R., Zhao, D.: gStore: a graph-based SPARQL query engine. VLDB J. 23(4), 565–590 (2014)
Appendices
Appendix 1: Workload queries repetition
In this experiment, we test AdPart’s performance on a realistic workload, in which a certain percentage of the queries is repeated while new queries keep arriving. We use three workloads, each containing 10K random LUBM queries, out of which a certain percentage is repeated. Figure 20a shows AdPart’s performance as the percentage of repeated queries varies among 20, 40 and 80%. As the results suggest, the higher the percentage of repeated queries, the lower the workload execution time. Since AdPart monitors query patterns rather than individual queries, it captures most of the patterns in the workload even when only 20% of its queries are repeated.
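The reason 20% repetition already suffices can be sketched by canonicalizing each query to its pattern, so that queries differing only in constants count as the same pattern (a hypothetical simplification for illustration; `to_pattern` and the placeholder naming are not from the paper):

```python
def to_pattern(query_triples):
    """Canonicalize a query: constants in subject/object positions become
    numbered placeholders, while predicates stay fixed. Two queries that
    differ only in their constants map to the same pattern."""
    pattern = []
    counter = 0
    for s, p, o in query_triples:
        if not s.startswith("?"):
            counter += 1
            s = "?c%d" % counter
        if not o.startswith("?"):
            counter += 1
            o = "?c%d" % counter
        pattern.append((s, p, o))
    return tuple(pattern)

q1 = [("?x", "ub:memberOf", "ex:Dept1")]
q2 = [("?x", "ub:memberOf", "ex:Dept7")]
# Both queries share one pattern: to_pattern(q1) == to_pattern(q2)
```

Monitoring at this granularity lets a system adapt to many unseen queries from a few observed ones.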
Appendix 2: Average partition size
In this experiment, we report how the average partition size changes during workload execution. Using the 10K-query LUBM workload, Fig. 20b shows how the partition size increases as more queries are executed. Initially, each partition contains around 19M triples, which corresponds to a 0% replication ratio, as AdPart loads only the original dataset. As the system adapts, the size of each partition gradually increases until it reaches an average of around 33M triples, which corresponds to a replication ratio of 72% after executing the whole 10K-query workload.
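With the rounded figures above, the replication ratio follows directly as the replicated fraction over the original partition size (a back-of-the-envelope sanity check, not system output; the small gap to the reported 72% comes from rounding the partition sizes to 19M and 33M):

```python
# Rounded averages from Fig. 20b (approximate, for illustration only).
ORIGINAL = 19_000_000  # avg. triples per partition at load time (0% replication)
FINAL = 33_000_000     # avg. triples per partition after the 10K-query workload

# Replication ratio: extra (replicated) triples relative to the original data.
replication_ratio = (FINAL - ORIGINAL) / ORIGINAL
print(round(replication_ratio, 2))  # ≈ 0.74 with these rounded inputs,
                                    # consistent with the reported ~72%
```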
Cite this article
Harbi, R., Abdelaziz, I., Kalnis, P. et al. Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. The VLDB Journal 25, 355–380 (2016). https://doi.org/10.1007/s00778-016-0420-y