Abstract
The ever-increasing amount of available RDF data requires that data be partitioned across multiple servers. Research on scaling RDF query processing through suitable data distribution methods has made progress. In general, these methods work well for queries that match simple triple patterns, but they are inefficient for queries involving more complex patterns. In this paper, we present an RDF data distribution method that overcomes the shortcomings of current approaches in order to scale RDF storage with respect to both data volume and query processing. We apply a method that identifies frequent patterns accessed by queries in order to keep related data in the same partition. We perform our reasoning on a summarized view of the data in order to avoid exhaustive analysis of large datasets. As a result, partitioning templates are obtained from data items in an RDF structure. In addition, we provide an approach for dynamic data insertions, even if new data do not conform to the original RDF structure. Instead of repartitioning, we use an overflow repository to store data that may not follow the original schema. Our study shows that our method scales well and effectively improves overall performance by decreasing the amount of message passing among servers, compared to alternative data distribution approaches for RDF.
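To make the general idea concrete, the following is a minimal sketch of workload-aware grouping with an overflow store. It is not the authors' algorithm: the function names (`co_access_counts`, `group_predicates`, `place_triples`), the `min_support` threshold, and the simplification of treating a query as a plain list of predicates are all assumptions made for illustration. Predicates that queries access together often enough are merged into one group, each group stands in for a partition, and triples whose predicate was never seen in the workload are diverted to an overflow list.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(workload):
    """Count how often each pair of predicates appears in the same query."""
    counts = Counter()
    for predicates in workload:
        for pair in combinations(sorted(set(predicates)), 2):
            counts[pair] += 1
    return counts


def group_predicates(workload, min_support):
    """Merge predicates co-accessed at least min_support times (hypothetical rule)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for predicates in workload:
        for p in predicates:
            find(p)  # register every predicate seen in the workload
    for (a, b), n in co_access_counts(workload).items():
        if n >= min_support:
            parent[find(a)] = find(b)
    return {p: find(p) for p in list(parent)}


def place_triples(triples, groups):
    """Send each triple to its predicate's group, or to the overflow store."""
    partitions, overflow = {}, []
    for s, p, o in triples:
        if p in groups:
            partitions.setdefault(groups[p], []).append((s, p, o))
        else:
            overflow.append((s, p, o))  # predicate unknown to the workload
    return partitions, overflow


# Hypothetical workload: each query is reduced to the predicates it touches.
workload = [
    ["name", "email"],
    ["name", "email"],
    ["name", "age"],
    ["title", "author"],
    ["title", "author"],
]
triples = [
    ("a", "name", "Ann"),
    ("a", "email", "ann@example.org"),
    ("b", "title", "T"),
    ("c", "color", "red"),
]
groups = group_predicates(workload, min_support=2)
partitions, overflow = place_triples(triples, groups)
```

In this toy run, `name` and `email` end up in the same group because they are queried together twice, `age` stays on its own, and the `color` triple goes to the overflow list because no query in the workload mentions that predicate. A real system would also need to bound partition sizes and work on a summary of the data rather than on raw triples, as the abstract describes.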
Cite this article
Schroeder, R., Penteado, R.R.M. & Hara, C.S. A data distribution model for RDF. Distrib Parallel Databases 39, 129–167 (2021). https://doi.org/10.1007/s10619-020-07296-w
DOI: https://doi.org/10.1007/s10619-020-07296-w