MapReduce

Scientific Fundamentals

MapReduce refers to both a programming model and the corresponding distributed framework. The model consists of two phases, map and reduce, which operate on data formatted as key-value pairs. The map phase applies a user-defined function to each input split and emits intermediate key-value pairs, which are partitioned and sorted by key, whereas the reduce phase applies a user-defined function to all values that share the same key. In this way, MapReduce is a typical divide-and-conquer framework designed for embarrassingly parallel problems, namely problems that can be split into sub-tasks with little or no synchronization cost.
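The two phases can be illustrated with a minimal single-process sketch of the classic word-count job. It is only an illustration of the programming model, not the API of any particular framework: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the in-memory grouping step stands in for the sort and shuffle that a real engine performs across machines.

```python
from collections import defaultdict


def map_fn(_key, line):
    """User-defined map: emit one (word, 1) pair per word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-defined reduce: combine all values that share the same key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical driver that runs both phases sequentially in memory."""
    # Map phase: every input record is processed independently.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Grouping step: collect the values of each key together
    # (a real engine would sort and shuffle these pairs between nodes).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per distinct key, visited in sorted key order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


# Input records are (line number, line text) pairs.
lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "the fox")]
result = run_mapreduce(lines, map_fn, reduce_fn)
print(result)
```

Because each call to `map_fn` depends only on its own record, and each call to `reduce_fn` only on the values of one key, both loops could be distributed over many workers without further coordination.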

Definition

MapReduce is a programming framework that allows users to process large-scale data by exploiting parallelism across a cluster of nodes. The term also refers to the distributed engine that splits and distributes users’ jobs and monitors their execution in the cluster. MapReduce is a typical divide-and-conquer framework, since it transforms the user code into an embarrassingly parallel job, where little or no effort...
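The engine's role of splitting a job and spreading the pieces over workers can be sketched as follows. This is a rough illustration under stated assumptions rather than a description of how any real engine is implemented: a thread pool stands in for the cluster nodes, `NUM_REDUCERS`, `partition`, `map_task`, `reduce_task`, and `run_job` are hypothetical names, and fault handling and job monitoring are left out entirely.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_REDUCERS = 2  # hypothetical number of reduce tasks


def partition(key, num_reducers):
    """Deterministic hash, so a given key always reaches the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_task(split):
    """One map task: count words in one input split, bucketed per reducer."""
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for line in split:
        for word in line.split():
            buckets[partition(word, NUM_REDUCERS)][word].append(1)
    return buckets


def reduce_task(partials):
    """One reduce task: merge its bucket from every map task and sum per key."""
    merged = defaultdict(int)
    for bucket in partials:
        for word, ones in bucket.items():
            merged[word] += sum(ones)
    return dict(merged)


def run_job(lines, split_size=2, workers=4):
    """Split the input, run map tasks in parallel, then run reduce tasks."""
    splits = [lines[i:i + split_size] for i in range(0, len(lines), split_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map tasks share no state, so they can run concurrently.
        map_outputs = list(pool.map(map_task, splits))
        # Shuffle: reducer r receives bucket r from every map task.
        reduce_inputs = [[out[r] for out in map_outputs]
                         for r in range(NUM_REDUCERS)]
        reduce_outputs = list(pool.map(reduce_task, reduce_inputs))
    # Reducers own disjoint key sets, so their outputs merge without conflict.
    totals = {}
    for part in reduce_outputs:
        totals.update(part)
    return totals


counts = run_job(["a b a", "b c", "a c c", "d"])
print(counts)
```

The only point at which tasks exchange data is the shuffle between the two phases, which is what keeps the synchronization cost low.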

Author information

Correspondence to Sai Wu.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

Wu, S. (2018). MapReduce. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_80802