Towards scalable RDF graph analytics on MapReduce

Authors:
Padmashree Ravindra, North Carolina State University, Raleigh, NC
Vikas V. Deshpande, North Carolina State University, Raleigh, NC
Kemafor Anyanwu, North Carolina State University, Raleigh, NC

MDAC '10: Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, April 2010, Article No.: 5, Pages 1–6, https://doi.org/10.1145/1779599.1779604

Published: 26 April 2010

ABSTRACT

To exploit the growing amount of RDF data in decision-making, there is an increasing demand for analytics-style processing of such data. RDF data is modeled as a labeled graph that represents a collection of binary relations (triples). In this context, analytical queries can be interpreted as consisting of three main constructs, namely pattern matching, grouping, and aggregation. Unlike traditional OLAP systems, where data is already suitably organized, such queries require several join operations to reassemble the triples into the n-ary relations relevant to the given query. MapReduce-based parallel processing systems like Pig have been successful at processing analytical workloads at scale. However, these systems offer only relational-algebra-style operators, which would require an iterative n-tuple reassembly process in which intermediate results need to be materialized. This leads to high I/O costs that negatively impact performance. In this paper, we propose user-defined functions (UDFs) that (i) re-factor analytical processing on RDF graphs in a way that enables more parallelized processing, and (ii) perform look-ahead processing to reduce the cost of subsequent operators in the query execution plan. These functions have been integrated into the Pig Latin function library, and the experimental results show up to 50% improvement in execution times for certain classes of queries. An important impact of this work is that it could serve as the foundation for additional physical operators in systems such as Pig for more efficient graph processing.
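
The reassembly problem described in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' UDF implementation and does not use Pig; the toy triples, the property names (`product`, `price`), and all function names are hypothetical. The first function rebuilds n-ary tuples with one join per property, where each loop iteration stands in for a join whose intermediate result would have to be materialized. The second groups triples by subject in a single pass, which is one possible alternative shown here only for comparison.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, property, object) triples.
TRIPLES = [
    ("sale1", "product", "laptop"),
    ("sale1", "price", 900),
    ("sale2", "product", "laptop"),
    ("sale2", "price", 1100),
    ("sale3", "product", "phone"),
    ("sale3", "price", 500),
]


def select(triples, prop):
    """Binary relation for one property: subject -> list of objects."""
    relation = defaultdict(list)
    for s, p, o in triples:
        if p == prop:
            relation[s].append(o)
    return relation


def reassemble_by_joins(triples, props):
    """Relational style: one join on subject per additional property.

    Each loop iteration stands in for a separate join step whose
    intermediate rows would be written out between jobs.
    """
    first = select(triples, props[0])
    rows = [(s, o) for s, objs in first.items() for o in objs]
    for prop in props[1:]:
        relation = select(triples, prop)
        rows = [row + (o,) for row in rows for o in relation.get(row[0], [])]
    return sorted(rows)


def reassemble_by_grouping(triples, props):
    """Group all triples by subject in one pass, then emit n-tuples."""
    groups = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        groups[s][p].append(o)
    rows = []
    for s, by_prop in groups.items():
        partial = [(s,)]
        for prop in props:
            partial = [row + (o,) for row in partial for o in by_prop.get(prop, [])]
        rows.extend(partial)
    return sorted(rows)


def total_by_key(rows):
    """Grouping and aggregation: sum the third column per second column."""
    totals = defaultdict(int)
    for _subject, key, value in rows:
        totals[key] += value
    return dict(totals)


joined = reassemble_by_joins(TRIPLES, ["product", "price"])
grouped = reassemble_by_grouping(TRIPLES, ["product", "price"])
```

Both functions produce the same `(subject, product, price)` rows on this data, so `total_by_key` yields the same per-product totals either way. The difference is only in how many passes and intermediate results the reassembly needs, which is the cost the abstract attributes to relational-style operators.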

Index Terms

Towards scalable RDF graph analytics on MapReduce
1. Information systems
  1. Data management systems
    1. Query languages
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query languages (principles)

Published in
MDAC '10: Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
April 2010
53 pages
ISBN: 9781605589916
DOI: 10.1145/1779599
Conference Chairs:
Ullas Nambiar, IBM India Research Lab, New Delhi, India
John McPherson, IBM Almaden Research Center
David Konopnicki, IBM Haifa Research Lab, Israel
Copyright © 2010 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
MapReduce
Pig Latin
RDF analytics
Qualifiers
- research-article