DOI: 10.1145/1779599.1779604
research-article

Towards scalable RDF graph analytics on MapReduce

Published: 26 April 2010

ABSTRACT

To exploit the growing amount of RDF data in decision-making, there is increasing demand for analytics-style processing of such data. RDF data is modeled as a labeled graph that represents a collection of binary relations (triples). In this context, analytical queries can be interpreted as consisting of three main constructs: pattern matching, grouping, and aggregation. Unlike traditional OLAP systems, where data is suitably organized, such queries require several join operations to reassemble the triples into the n-ary relations relevant to the given query. MapReduce-based parallel processing systems like Pig have gained success in processing analytical workloads at scale. However, these systems offer only relational-algebra-style operators, which require an iterative n-tuple reassembly process in which intermediate results need to be materialized. This leads to high I/O costs that negatively impact performance. In this paper, we propose UDFs that (i) refactor analytical processing on RDF graphs in a way that enables more parallelized processing and (ii) perform look-ahead processing to reduce the cost of subsequent operators in the query execution plan. These functions have been integrated into the Pig Latin function library, and the experimental results show up to 50% improvement in execution times for certain classes of queries. An important impact of this work is that it could serve as the foundation for additional physical operators in systems such as Pig for more efficient graph processing.
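To make the three constructs named in the abstract concrete, the following is a minimal single-process Python sketch, not the paper's UDFs and not a MapReduce or Pig implementation. It reassembles triples into n-ary rows by joining on the subject (pattern matching), then groups and aggregates the rows. The toy triples, the property names (`vendor`, `price`), and the helper names `reassemble` and `average_by` are all hypothetical and chosen only for illustration; the sketch also assumes each property is single-valued per subject.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, property, object); not taken from the paper.
triples = [
    ("p1", "type", "Product"),
    ("p1", "vendor", "v1"),
    ("p1", "price", "20"),
    ("p2", "type", "Product"),
    ("p2", "vendor", "v1"),
    ("p2", "price", "30"),
    ("p3", "type", "Product"),
    ("p3", "vendor", "v2"),
    ("p3", "price", "15"),
    ("v1", "type", "Vendor"),
]


def reassemble(triples, properties):
    """Pattern matching: join triples on the subject into n-ary rows.

    Only subjects that have a value for every requested property are kept,
    which mimics a star-shaped graph pattern. Assumes single-valued properties.
    """
    by_subject = defaultdict(dict)
    for s, p, o in triples:
        if p in properties:
            by_subject[s][p] = o
    return [
        (s,) + tuple(values[p] for p in properties)
        for s, values in sorted(by_subject.items())
        if all(p in values for p in properties)
    ]


def average_by(rows, key_index, value_index):
    """Grouping and aggregation: average of a numeric column per group key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(float(row[value_index]))
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Rows of the form (subject, vendor, price), then average price per vendor.
rows = reassemble(triples, ("vendor", "price"))
avg_price = average_by(rows, key_index=1, value_index=2)
```

With purely relational operators, each additional property in the pattern would typically be expressed as another join whose result is materialized before the next step. The sketch sidesteps that by collecting all properties of a subject in one pass, which is only meant to show why reassembly dominates the cost, not how the proposed UDFs are implemented.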


Published in

MDAC '10: Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
April 2010
53 pages
ISBN: 9781605589916
DOI: 10.1145/1779599

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
