Research Article
DOI: 10.1145/3210563.3210568

OMR: out-of-core MapReduce for large data sets

Published: 18 June 2018

Abstract

While single-machine MapReduce systems can squeeze maximum performance out of the available multicores, they are often limited by the size of main memory and can thus only process small datasets. Our experience shows that Metis, the state-of-the-art single-machine in-memory MapReduce system, frequently crashes with out-of-memory errors. Even though today's computers are equipped with efficient secondary storage devices, these frameworks do not utilize them, mainly because disk access latencies are much higher than those of main memory. Consequently, the single-machine setup of the Hadoop system performs much slower when presented with datasets larger than main memory. Moreover, such frameworks require tuning many parameters, which puts an added burden on the programmer. In this paper we present OMR, an out-of-core MapReduce system that not only successfully handles datasets far larger than main memory but also guarantees linear scaling with growing data sizes. OMR actively minimizes the amount of data read from and written to disk via on-the-fly aggregation, and it uses block-sequential disk reads and writes whenever disk accesses become necessary to avoid running out of memory. We theoretically prove OMR's linear scalability and empirically demonstrate it by processing datasets up to 5x larger than main memory. Our experiments show that, compared to the standalone single-machine setup of Hadoop, OMR delivers far higher performance. In contrast to Metis, OMR avoids out-of-memory crashes on large datasets and also delivers higher performance when datasets are small enough to fit in main memory.
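The mechanisms named in the abstract (on-the-fly aggregation plus spilling sorted runs to disk with block-sequential I/O, followed by a streaming merge) follow the classic external-aggregation pattern. Below is a minimal Python sketch of that general pattern applied to word count. It is an illustration only, not OMR's actual implementation: the memory budget MAX_KEYS_IN_MEMORY, the temporary run-file handling, and the word-count workload are all assumptions made for the example.

    # Sketch of out-of-core MapReduce-style word count (illustrative only,
    # NOT OMR's implementation): aggregate on the fly in a bounded in-memory
    # table, spill sorted runs to disk sequentially, then stream-merge them.
    import heapq
    import itertools
    import os
    import tempfile

    MAX_KEYS_IN_MEMORY = 100_000  # assumed memory budget (distinct keys held in RAM)

    def spill(table, run_files):
        # Write the current partial aggregates as one sorted run (sequential write).
        f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
        for key in sorted(table):
            f.write(f"{key}\t{table[key]}\n")
        f.close()
        run_files.append(f.name)
        table.clear()

    def out_of_core_wordcount(lines):
        table, run_files = {}, []
        # Map phase with on-the-fly aggregation: combine counts as records
        # are produced instead of buffering raw key/value pairs.
        for line in lines:
            for word in line.split():
                table[word] = table.get(word, 0) + 1
                if len(table) >= MAX_KEYS_IN_MEMORY:
                    spill(table, run_files)
        if table:
            spill(table, run_files)

        def records(path):
            # Stream one sorted run back from disk (sequential read).
            with open(path) as f:
                for rec in f:
                    key, count = rec.rstrip("\n").split("\t")
                    yield key, int(count)

        # Reduce phase: merge the sorted runs; only one record per run is
        # resident at a time, so memory use is independent of input size.
        merged = heapq.merge(*(records(p) for p in run_files), key=lambda kv: kv[0])
        for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
            yield key, sum(count for _, count in group)
        for p in run_files:
            os.remove(p)

    # Example usage: dict(out_of_core_wordcount(open("input.txt")))

Because the merge keeps at most one record per run in memory, peak memory stays bounded no matter how large the input grows, which is the intuition behind the linear-scaling property the abstract claims.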


Cited By

  • (2020) Resource-aware MapReduce runtime for multi/many-core architectures. Proceedings of the 23rd Conference on Design, Automation and Test in Europe, pages 897-902. DOI: 10.5555/3408352.3408556. Online publication date: 9 March 2020.
  • (2020) Resource-Aware MapReduce Runtime for Multi/Many-core Architectures. 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 897-902. DOI: 10.23919/DATE48585.2020.9116281. Online publication date: March 2020.
  • (2020) An efficient garbage collection in Java virtual machine via swap I/O optimization. Proceedings of the 35th Annual ACM Symposium on Applied Computing, pages 1238-1245. DOI: 10.1145/3341105.3373982. Online publication date: 30 March 2020.

Published In

ISMM 2018: Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management
June 2018, 119 pages
ISBN: 9781450358019
DOI: 10.1145/3210563

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2018

Author Tags

  1. Data Processing
  2. Fixed/Variable Sized Records
  3. Lockless Memory Constrained Processing
  4. MapReduce
  5. Out-of-Core
  6. Parallel Programming
  7. Single Machine

Qualifiers

  • Research-article

Conference

ISMM '18

Acceptance Rates

Overall acceptance rate: 72 of 156 submissions (46%)

Article Metrics

  • Downloads (last 12 months): 1
  • Downloads (last 6 weeks): 0
Reflects downloads up to 12 Feb 2025
