Abstract
MapReduce is a programming paradigm for efficiently processing large datasets in distributed environments using two functions, map and reduce. The map phase emits (key, value) pairs, and the reduce phase aggregates the values that share a key. In other words, a MapReduce application defines and reduces one set of values for each key, so the user learns only one aspect of each key. Advanced OLAP applications, however, require multiple sets, not necessarily mutually disjoint, to be defined and reduced for the same key. The challenge is to extend MapReduce to support this in a syntactically simple and computationally efficient way. We propose an extension to the classic MapReduce model, called Tagged MapReduce, in which data is represented as (key, value, tag) triplets. Users map triplets, and reduction takes place for each key and for each tag. For example, given a set of pages, one may want to count word occurrences per page type, with the page type represented by the tag. While classic MapReduce can handle this class of queries, an efficient implementation requires effort and possibly advanced programming skills. For example, should the tag form a compound object with the key or with the value? Our formalism is simpler for the programmer to use and makes it easier for the system to identify and apply efficient algorithms.
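The word-count-per-page-type example can be illustrated with a minimal in-memory sketch in Python. This is not the authors' implementation or API: the function names `map_page` and `tagged_reduce`, the sample pages, and the single-process grouping are all assumptions made for illustration. It only shows the shape of the model, where the mapper emits (key, value, tag) triplets and reduction runs once per key and per tag.

```python
from collections import defaultdict


def map_page(page_type, text):
    """Hypothetical mapper: emit one (key, value, tag) triplet per word.

    key = the word, value = 1, tag = the page type.
    """
    for word in text.lower().split():
        yield (word, 1, page_type)


def tagged_reduce(triplets, reducer=sum):
    """Hypothetical reducer: group values by key, then by tag,
    and apply `reducer` to each (key, tag) group of values."""
    groups = defaultdict(lambda: defaultdict(list))
    for key, value, tag in triplets:
        groups[key][tag].append(value)
    return {
        key: {tag: reducer(values) for tag, values in by_tag.items()}
        for key, by_tag in groups.items()
    }


# Illustrative input (assumed): (page type, page text) pairs.
pages = [
    ("news", "map reduce map"),
    ("blog", "map tag"),
    ("news", "tag"),
]

triplets = [t for page_type, text in pages for t in map_page(page_type, text)]
counts = tagged_reduce(triplets)

for word, by_tag in sorted(counts.items()):
    print(word, by_tag)
```

In this sketch the tag stays a separate field instead of being folded into a compound key or value, which is the design question the abstract raises. Sets that are not mutually disjoint could be modelled by having the mapper emit the same (key, value) pair under more than one tag.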
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Williams, A., Mitsoulis-Ntompos, P., Chatziantoniou, D. (2011). Tagged MapReduce: Efficiently Computing Multi-analytics Using MapReduce. In: Cuzzocrea, A., Dayal, U. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2011. Lecture Notes in Computer Science, vol 6862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23544-3_18
DOI: https://doi.org/10.1007/978-3-642-23544-3_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23543-6
Online ISBN: 978-3-642-23544-3
eBook Packages: Computer Science (R0)