Abstract
The flexibility of the RDF data model has led a growing number of organizations and communities to publish their data in RDF format, and with it comes a growing need to query these data in a scalable and efficient way. MapReduce is a parallel processing framework for large, data-intensive workloads, but it does not directly support join-intensive workloads. In this paper, we present a schema-based hybrid partitioning technique that places RDF triples according to the relationships between them, reducing the number of MapReduce (MR) cycles needed for each SPARQL query job. We then propose a lightweight sideways information passing technique that passes join information across MR jobs to reduce the intermediate results involved in join operations. Experimental results on the LUBM benchmark show that our approach achieves a substantial performance improvement, outperforming the previous system by a factor of 2 to 20.
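The general idea of passing join information sideways between jobs can be illustrated with a small in-memory sketch. This is not the authors' implementation: the helper `reduce_side_join`, the toy triples, and the choice of a plain Python set as the passed filter are all assumptions made for illustration, and a real MapReduce deployment would ship a compact structure to the mappers of the next job instead.

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Hypothetical stand-in for one MR join job: group both inputs by key, then pair them."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return [(k, l, r) for k, (ls, rs) in groups.items() for l in ls for r in rs]

# Toy data (assumed), loosely in the spirit of LUBM predicates.
takes_course = [("alice", "db"), ("bob", "os"), ("carol", "ml")]          # (student, course)
teacher_of = [("prof_x", "db"), ("prof_y", "os")]                         # (professor, course)
advisor = [("alice", "prof_x"), ("carol", "prof_z"), ("dave", "prof_y")]  # (student, professor)

# Job 1: join takesCourse with teacherOf on the course variable.
job1 = reduce_side_join(
    [(course, student) for student, course in takes_course],
    [(course, prof) for prof, course in teacher_of],
)

# Sideways information passing: collect the student keys that survived job 1
# and hand them to job 2 so it can discard non-matching triples before the shuffle.
passed_keys = {student for _, student, _ in job1}
advisor_filtered = [(s, p) for s, p in advisor if s in passed_keys]

# Job 2: join the job 1 output with the pre-filtered advisor triples on the student variable.
job2 = reduce_side_join(
    [(student, (course, prof)) for course, student, prof in job1],
    advisor_filtered,
)

# Keep only bindings where the course teacher is also the student's advisor.
result = [(s, c, p) for s, (c, p), adv in job2 if p == adv]
```

In this sketch the filter removes two of the three `advisor` triples before the second join, which is the kind of intermediate-result reduction the abstract refers to; the final bindings are the same as they would be without the filter.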
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wu, B., Jin, H., Yuan, P. (2013). Scalable SAPRQL Querying Processing on Large RDF Data in Cloud Computing Environment. In: Zu, Q., Hu, B., Elçi, A. (eds) Pervasive Computing and the Networked World. ICPCA/SWS 2012. Lecture Notes in Computer Science, vol 7719. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37015-1_55
DOI: https://doi.org/10.1007/978-3-642-37015-1_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37014-4
Online ISBN: 978-3-642-37015-1
eBook Packages: Computer Science, Computer Science (R0)