CooMR: cross-task coordination for efficient data management in MapReduce programs

Authors:
Xiaobing Li, Auburn University, AL
Yandong Wang, Auburn University, AL
Yizheng Jiao, Auburn University, AL
Cong Xu, Auburn University, AL
Weikuan Yu, Auburn University, AL

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, November 2013, Article No.: 42, Pages 1–11, https://doi.org/10.1145/2503210.2503276

Published: 17 November 2013

ABSTRACT

Hadoop is a widely adopted open-source implementation of the MapReduce programming model for big data processing. It represents system resources as available map and reduce slots and assigns them to various tasks. This execution model gives little regard to the need for cross-task coordination on the use of shared system resources on a compute node, which results in task interference. In addition, the existing Hadoop merge algorithm can cause excessive I/O. In this study, we undertake an effort to address both issues. Accordingly, we have designed a cross-task coordination framework called CooMR for efficient data management in MapReduce programs. CooMR consists of three component schemes: cross-task opportunistic memory sharing and log-structured I/O consolidation, which are designed to facilitate task coordination, and the key-based in-situ merge (KISM) algorithm, which is designed to enable the sorting/merging of Hadoop intermediate data without actually moving the <key, value> pairs. Our evaluation demonstrates that CooMR is able to increase task coordination, improve system resource utilization, and significantly reduce the execution time of MapReduce programs.
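
The idea of merging by key without moving the <key, value> pairs can be illustrated with a small sketch. This is not the KISM algorithm itself, whose details the abstract does not give; it only shows the general technique of merging lightweight (key, location) entries while the values stay where they were written. The segment layout and the names build_index, merge_keys, and read_in_order are hypothetical and chosen for illustration.

```python
import heapq
from typing import Iterator, List, Tuple

# Assumed layout (hypothetical): each segment is a list of (key, value)
# pairs that is already sorted by key, as a sorted spill would be.
Segment = List[Tuple[str, bytes]]
# An index entry records where a pair lives: (key, segment id, position).
Entry = Tuple[str, int, int]


def build_index(segments: List[Segment]) -> List[List[Entry]]:
    """Build one sorted list of (key, segment_id, position) per segment.

    Only keys and locations are copied; values are left in place.
    """
    return [
        [(key, seg_id, pos) for pos, (key, _value) in enumerate(segment)]
        for seg_id, segment in enumerate(segments)
    ]


def merge_keys(segments: List[Segment]) -> List[Entry]:
    """Merge the per-segment indexes into one globally sorted index."""
    return list(heapq.merge(*build_index(segments)))


def read_in_order(
    segments: List[Segment], merged: List[Entry]
) -> Iterator[Tuple[str, bytes]]:
    """Fetch each value from its original location, in merged key order."""
    for key, seg_id, pos in merged:
        yield key, segments[seg_id][pos][1]


if __name__ == "__main__":
    demo = [
        [("a", b"1"), ("c", b"3")],
        [("b", b"2"), ("d", b"4")],
    ]
    for key, value in read_in_order(demo, merge_keys(demo)):
        print(key, value)
```

In this sketch the merge touches only the index entries, so the cost of reordering does not depend on the size of the values; a value is read once, from its original position, when the consumer asks for it.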

Published in
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN: 9781450323789
DOI: 10.1145/2503210
General Chair: William Gropp, University of Illinois at Urbana-Champaign, Urbana, Illinois
Program Chair: Satoshi Matsuoka, Tokyo Institute of Technology, Tokyo, Japan
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
Acceptance Rates
SC '13 Paper Acceptance Rate: 91 of 449 submissions, 20%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%