Definition
The MapReduce framework is often used to analyze large volumes of unstructured and semi-structured data. A common analysis pattern combines a massive file that describes events (commonly in the form of a log) with much smaller reference datasets. This analytical operation corresponds to a parallel join. Parallel joins have been studied extensively in data management research, and many algorithms are tailored to exploit specific properties of the input or of the analysis in a relational database management system. However, the MapReduce framework was designed to operate on a single input, which makes it cumbersome for join processing because a join combines two or more inputs. As a consequence, a new class of parallel join algorithms has been designed, implemented, and optimized specifically for the MapReduce framework.
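To make the single-input restriction concrete, the following is a minimal single-process sketch of one common way to express a join in a map/shuffle/reduce style, often called a repartition or reduce-side join. It is an illustration under stated assumptions, not the algorithm of any specific system or paper cited in this entry. The function names (`map_phase`, `reduce_phase`, `run_join`), the source tags `"log"` and `"ref"`, and the sample records are all hypothetical. The idea is that both datasets are merged into one input, every record carries a tag naming its source, and the reducer separates the tagged records again before pairing them.

```python
from collections import defaultdict


def map_phase(record):
    """Emit the record under its join key, keeping the source tag with the payload."""
    source, key, payload = record
    yield key, (source, payload)


def reduce_phase(key, tagged_values):
    """Split the values for one key by source tag, then pair events with references."""
    refs = [payload for source, payload in tagged_values if source == "ref"]
    events = [payload for source, payload in tagged_values if source == "log"]
    for event in events:
        for ref in refs:
            yield key, event, ref


def run_join(records):
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_phase(key, groups[key]))
    return output


# Hypothetical single input: log events and reference rows mixed together.
records = [
    ("log", "u1", "click /home"),
    ("log", "u2", "click /cart"),
    ("log", "u1", "click /buy"),
    ("ref", "u1", "Alice"),
    ("ref", "u2", "Bob"),
    ("log", "u3", "click /home"),
]
joined = run_join(records)
for row in joined:
    print(row)
```

In this sketch the event for key `u3` produces no output because no reference row shares its key, which matches inner-join semantics. A real MapReduce job would distribute the grouping step across machines, and the entire log would be shuffled over the network even though the reference data is small. That cost is one reason specialized join algorithms for this setting are worth studying.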
Overview
Since its introduction, the MapReduce framework (Dean and Ghemawat 2004) has become extremely popular for analyzing large datasets. The success of MapReduce stems from...
References
Abouzeid A, Bajda-Pawlikowski K, Abadi DJ, Rasin A, Silberschatz A (2009) HadoopDB: an architectural hybrid of mapreduce and DBMS technologies for analytical workloads. PVLDB 2(1):922–933. http://www.vldb.org/pvldb/2/vldb09-861.pdf
Abouzied A, Abadi DJ, Silberschatz A (2013) Invisible loading: access-driven data transfer from raw files into database systems. In: EDBT
Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment. In: Proceedings of the 13th international conference on extending database technology, EDBT ’10. ACM, New York, pp 99–110. http://doi.acm.org/10.1145/1739041.1739056
Alagiannis I, Borovica R, Branco M, Idreos S, Ailamaki A (2012) NoDB: efficient query execution on raw data files. In: SIGMOD
AsterixDB (2017) Apache AsterixDB. https://asterixdb.apache.org/. Accessed Dec 2017
Avro (2017) Apache Avro. https://avro.apache.org/. Accessed Dec 2017
Bajda-Pawlikowski K, Abadi DJ, Silberschatz A, Paulson E (2011) Efficient processing of data warehousing queries in a split execution environment. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, SIGMOD ’11. ACM, New York, pp 1165–1176. http://doi.acm.org/10.1145/1989323.1989447
Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing, SOCC ’11. ACM, New York, pp 7:1–7:14. http://doi.acm.org/10.1145/2038916.2038923
Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: ACM SIGMOD. http://doi.acm.org/10.1145/1807167.1807273
Blanas S, Wu K, Byna S, Dong B, Shoshani A (2014) Parallel data analysis directly on scientific file formats. In: SIGMOD
Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K (2015) Apache Flink: stream and batch processing in a single engine. IEEE Data Eng Bull 38(4):28–38. http://sites.computer.org/debull/A15dec/p28.pdf
Cheng Y, Rusu F (2015) SCANRAW: a database meta-operator for parallel in-situ processing and loading. TODS 40(3)
Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design & implementation – volume 6, OSDI’04. USENIX Association, Berkeley, pp 10–10. http://dl.acm.org/citation.cfm?id=1251254.1251264
DeWitt DJ, Halverson A, Nehme R, Shankar S, Aguilar-Saborit J, Avanes A, Flasza M, Gramling J (2013) Split query processing in Polybase. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, SIGMOD ’13. ACM, New York, pp 1255–1266. http://doi.acm.org/10.1145/2463676.2463709
Drill (2017) Apache Drill. https://drill.apache.org/. Accessed Dec 2017
Eltabakh MY, Tian Y, Özcan F, Gemulla R, Krettek A, McPherson J (2011) CoHadoop: flexible data placement and its exploitation in Hadoop. PVLDB 4(9):575–585. http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf
Floratou A, Patel JM, Shekita EJ, Tata S (2011) Column-oriented storage techniques for mapreduce. PVLDB 4(7):419–429. http://www.vldb.org/pvldb/vol4/p419-floratou.pdf
Hadoop (2017) Apache Hadoop. https://hadoop.apache.org/. Accessed Dec 2017
He Y, Lee R, Huai Y, Shao Z, Jain N, Zhang X, Xu Z (2011) RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: Proceedings of the 27th international conference on data engineering, ICDE 2011, 11–16 Apr 2011, Hannover, pp 1199–1208. https://doi.org/10.1109/ICDE.2011.5767933
Herodotou H, Babu S (2011) Profiling, what-if analysis, and cost-based optimization of mapreduce programs. PVLDB 4(11):1111–1122. http://www.vldb.org/pvldb/vol4/p1111-herodotou.pdf
Hive (2017) Apache Hive. https://hive.apache.org/. Accessed Dec 2017
Impala (2017) Apache Impala. https://impala.apache.org/. Accessed Dec 2017
Liu F, Blanas S (2015) Forecasting the cost of processing multi-join queries via hashing for main-memory databases. In: Proceedings of the sixth ACM symposium on cloud computing, SoCC 2015, Kohala Coast, 27–29 Aug 2015, pp 153–166. http://doi.acm.org/10.1145/2806777.2806944
Okcan A, Riedewald M (2011) Processing theta-joins using mapreduce. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, SIGMOD ’11. ACM, New York, pp 949–960. http://doi.acm.org/10.1145/1989323.1989423
Parquet (2017) Apache Parquet. https://parquet.apache.org. Accessed Dec 2017
Quickstep (2017) Apache Quickstep. https://quickstep.incubator.apache.org/. Accessed Dec 2017
Spark (2017) Apache Spark. https://spark.apache.org/. Accessed Dec 2017
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation, NSDI’12. USENIX Association, Berkeley, pp 2–12. http://dl.acm.org/citation.cfm?id=2228298.2228301
Copyright information
© 2018 Springer International Publishing AG
About this entry
Cite this entry
Blanas, S. (2018). Parallel Join Algorithms in MapReduce. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_206-1
DOI: https://doi.org/10.1007/978-3-319-63962-8_206-1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference Mathematics, Reference Module Computer Science and Engineering