Parallel Join Algorithms in MapReduce

Blanas, Spyros

doi:10.1007/978-3-319-63962-8_206-1

Spyros Blanas


Definition

The MapReduce framework is often used to analyze large volumes of unstructured and semi-structured data. A common analysis pattern involves combining a massive file that describes events (commonly in the form of a log) with much smaller reference datasets. This analytical operation corresponds to a parallel join. Parallel joins have been extensively studied in data management research, and many algorithms are tailored to take advantage of interesting properties of the input or the analysis in a relational database management system. However, the MapReduce framework was designed to operate on a single input and is a cumbersome framework for join processing. As a consequence, a new class of parallel join algorithms has been designed, implemented, and optimized specifically for the MapReduce framework.
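To make the single-input constraint concrete, the following is a minimal in-memory sketch of one commonly described strategy, a repartition (reduce-side) join. It is not taken from this entry: the function names, the `"log"`/`"ref"` source tags, and the sample records are all hypothetical, and the shuffle is simulated with a dictionary rather than a real MapReduce runtime. Each record is tagged with its source so that both datasets can travel through the same map output stream.

```python
from collections import defaultdict

def map_tagged(source, records, key):
    """Map step: emit (join_key, (source, record)) so one stream carries both datasets."""
    for record in records:
        yield record[key], (source, record)

def shuffle(pairs):
    """Group map output by join key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for join_key, tagged in pairs:
        groups[join_key].append(tagged)
    return groups

def reduce_join(tagged_records):
    """Reduce step: split one key's records by source tag, then emit every matching pair."""
    refs = [r for source, r in tagged_records if source == "ref"]
    events = [r for source, r in tagged_records if source == "log"]
    for event in events:
        for ref in refs:
            yield {**ref, **event}

def repartition_join(log, reference, key):
    """Inner equi-join of a log with a reference dataset on `key` (illustrative only)."""
    pairs = list(map_tagged("log", log, key)) + list(map_tagged("ref", reference, key))
    joined = []
    for tagged_records in shuffle(pairs).values():
        joined.extend(reduce_join(tagged_records))
    return joined

# Hypothetical sample data: a click log and a small user reference table.
log = [
    {"user_id": 1, "url": "/a"},
    {"user_id": 2, "url": "/b"},
    {"user_id": 1, "url": "/c"},
    {"user_id": 9, "url": "/d"},  # no matching user, dropped by the inner join
]
users = [
    {"user_id": 1, "country": "GR"},
    {"user_id": 2, "country": "US"},
]

for row in repartition_join(log, users, "user_id"):
    print(row)
```

This sketch buffers all records for a key in memory inside the reduce step, which is acceptable only for illustration. When the reference dataset is small enough to fit in memory on every worker, a map-side alternative could load it into a hash table and avoid the shuffle altogether.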

Overview

Since its introduction, the MapReduce framework (Dean and Ghemawat 2004) has become extremely popular for analyzing large datasets. The success of MapReduce stems from...


References

Abouzeid A, Bajda-Pawlikowski K, Abadi DJ, Rasin A, Silberschatz A (2009) HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB 2(1):922–933. http://www.vldb.org/pvldb/2/vldb09-861.pdf
Abouzied A, Abadi DJ, Silberschatz A (2013) Invisible loading: access-driven data transfer from raw files into database systems. In: EDBT
Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment. In: Proceedings of the 13th international conference on extending database technology, EDBT ’10. ACM, New York, pp 99–110. http://doi.acm.org/10.1145/1739041.1739056
Alagiannis I, Borovica R, Branco M, Idreos S, Ailamaki A (2012) NoDB: efficient query execution on raw data files. In: SIGMOD
AsterixDB (2017) Apache AsterixDB. https://asterixdb.apache.org/. Accessed Dec 2017
Avro (2017) Apache Avro. https://avro.apache.org/. Accessed Dec 2017
Bajda-Pawlikowski K, Abadi DJ, Silberschatz A, Paulson E (2011) Efficient processing of data warehousing queries in a split execution environment. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, SIGMOD ’11. ACM, New York, pp 1165–1176. http://doi.acm.org/10.1145/1989323.1989447
Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing, SOCC ’11. ACM, New York, pp 7:1–7:14. http://doi.acm.org/10.1145/2038916.2038923
Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: ACM SIGMOD. http://doi.acm.org/10.1145/1807167.1807273
Blanas S, Wu K, Byna S, Dong B, Shoshani A (2014) Parallel data analysis directly on scientific file formats. In: SIGMOD
Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K (2015) Apache Flink: stream and batch processing in a single engine. IEEE Data Eng Bull 38(4):28–38. http://sites.computer.org/debull/A15dec/p28.pdf
Cheng Y, Rusu F (2015) SCANRAW: a database meta-operator for parallel in-situ processing and loading. TODS 40(3)
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design & implementation – volume 6, OSDI’04. USENIX Association, Berkeley, pp 10–10. http://dl.acm.org/citation.cfm?id=1251254.1251264
DeWitt DJ, Halverson A, Nehme R, Shankar S, Aguilar-Saborit J, Avanes A, Flasza M, Gramling J (2013) Split query processing in Polybase. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, SIGMOD ’13. ACM, New York, pp 1255–1266. http://doi.acm.org/10.1145/2463676.2463709
Drill (2017) Apache Drill. https://drill.apache.org/. Accessed Dec 2017
Eltabakh MY, Tian Y, Özcan F, Gemulla R, Krettek A, McPherson J (2011) CoHadoop: flexible data placement and its exploitation in Hadoop. PVLDB 4(9):575–585. http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf
Floratou A, Patel JM, Shekita EJ, Tata S (2011) Column-oriented storage techniques for MapReduce. PVLDB 4(7):419–429. http://www.vldb.org/pvldb/vol4/p419-floratou.pdf
Hadoop (2017) Apache Hadoop. https://hadoop.apache.org/. Accessed Dec 2017
He Y, Lee R, Huai Y, Shao Z, Jain N, Zhang X, Xu Z (2011) RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: Proceedings of the 27th international conference on data engineering, ICDE 2011, 11–16 Apr 2011, Hannover, pp 1199–1208. https://doi.org/10.1109/ICDE.2011.5767933
Herodotou H, Babu S (2011) Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB 4(11):1111–1122. http://www.vldb.org/pvldb/vol4/p1111-herodotou.pdf
Hive (2017) Apache Hive. https://hive.apache.org/. Accessed Dec 2017
Impala (2017) Apache Impala. https://impala.apache.org/. Accessed Dec 2017
Liu F, Blanas S (2015) Forecasting the cost of processing multi-join queries via hashing for main-memory databases. In: Proceedings of the sixth ACM symposium on cloud computing, SoCC 2015, Kohala Coast, 27–29 Aug 2015, pp 153–166. http://doi.acm.org/10.1145/2806777.2806944
Okcan A, Riedewald M (2011) Processing theta-joins using MapReduce. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, SIGMOD ’11. ACM, New York, pp 949–960. http://doi.acm.org/10.1145/1989323.1989423
Parquet (2017) Apache Parquet. https://parquet.apache.org. Accessed Dec 2017
Quickstep (2017) Apache Quickstep. https://quickstep.incubator.apache.org/. Accessed Dec 2017
Spark (2017) Apache Spark. https://spark.apache.org/. Accessed Dec 2017
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, NSDI’12. USENIX Association, Berkeley, pp 2–12. http://dl.acm.org/citation.cfm?id=2228298.2228301


Author information

Authors and Affiliations

Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Spyros Blanas


Corresponding author

Correspondence to Spyros Blanas.

Editor information

Editors and Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia
Sherif Sakr
School of Information Technologies, Building J12, University of Sydney, Sydney, Australia
Albert Zomaya

Section Editor information

Delft University of Technology, Delft, Netherlands
Asterios Katsifodimos
School of Informatics, University of Edinburgh, Edinburgh, UK
Pramod Bhatotia


About this entry

Cite this entry

Blanas, S. (2018). Parallel Join Algorithms in MapReduce. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_206-1


DOI: https://doi.org/10.1007/978-3-319-63962-8_206-1
Published: 05 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering
