ABSTRACT
To exploit the growing amount of RDF data in decision-making, there is an increasing demand for analytics-style processing of such data. RDF data is modeled as a labeled graph that represents a collection of binary relations (triples). In this context, analytical queries can be interpreted as consisting of three main constructs: pattern matching, grouping, and aggregation. Unlike traditional OLAP systems, where data is already suitably organized, such queries require several join operations to reassemble the triples into the n-ary relations relevant to the given query. MapReduce-based parallel processing systems like Pig have been successful at processing analytical workloads at scale. However, these systems offer only relational-algebra-style operators, which require an iterative n-tuple reassembly process in which intermediate results need to be materialized. This leads to high I/O costs that negatively impact performance. In this paper, we propose UDFs that (i) refactor analytical processing on RDF graphs in a way that enables more parallelized processing and (ii) perform look-ahead processing to reduce the cost of subsequent operators in the query execution plan. These functions have been integrated into the Pig Latin function library, and the experimental results show up to 50% improvement in execution times for certain classes of queries. An important impact of this work is that it could serve as the foundation for additional physical operators in systems such as Pig for more efficient graph processing.
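As a rough illustration of the three constructs named in the abstract (pattern matching, grouping, aggregation), the following Python sketch reassembles binary triples into n-ary tuples and then aggregates them. It is not the paper's UDFs and does not use Pig; the data, the property names, and the function names `star_join` and `group_and_average` are hypothetical, and each subject is assumed to have at most one value per property.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, property, object); not taken from the paper.
TRIPLES = [
    ("offer1", "product", "p1"),
    ("offer1", "price", 10),
    ("offer2", "product", "p1"),
    ("offer2", "price", 30),
    ("offer3", "product", "p2"),
    ("offer3", "price", 5),
]


def star_join(triples, properties):
    """Pattern matching: reassemble triples sharing a subject into n-ary tuples.

    Assumes each subject has at most one value per property.
    """
    by_subject = defaultdict(dict)
    for subject, prop, obj in triples:
        if prop in properties:
            by_subject[subject][prop] = obj
    # Keep only subjects that matched every requested property.
    return [
        tuple(row[p] for p in properties)
        for _, row in sorted(by_subject.items())
        if len(row) == len(properties)
    ]


def group_and_average(rows):
    """Grouping and aggregation: average the second column per first column."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


rows = star_join(TRIPLES, ["product", "price"])
avg_price = group_and_average(rows)
print(rows)
print(avg_price)
```

With relational-style operators, the reassembly step would typically be expressed as a join per additional property, each producing an intermediate result; grouping by subject, as in this sketch, is one way to picture doing the reassembly in a single pass.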
Index Terms
- Towards scalable RDF graph analytics on MapReduce