ABSTRACT
Hadoop is a widely adopted open-source implementation of the MapReduce programming model for big data processing. It represents system resources as available map and reduce slots and assigns them to various tasks. This execution model pays little regard to the need for cross-task coordination in the use of shared system resources on a compute node, which results in task interference. In addition, the existing Hadoop merge algorithm can cause excessive I/O. In this study, we address both issues. We have designed a cross-task coordination framework called CooMR for efficient data management in MapReduce programs. CooMR consists of three component schemes: cross-task opportunistic memory sharing and log-structured I/O consolidation, which are designed to facilitate task coordination, and the key-based in-situ merge (KISM) algorithm, which is designed to sort and merge Hadoop intermediate data without actually moving the <key, value> pairs. Our evaluation demonstrates that CooMR is able to increase task coordination, improve system resource utilization, and significantly reduce the execution time of MapReduce programs.
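The abstract does not spell out how KISM works, so the following is only a minimal sketch of the general idea of merging by key while leaving the <key, value> pairs where they are. It assumes each sorted segment can be addressed by a (segment id, offset) pointer; the names `key_index`, `in_situ_merge`, and `read_merged` are hypothetical and are not taken from CooMR or Hadoop.

```python
import heapq
from typing import Iterator, List, Tuple

# Stand-in for a sorted segment of intermediate data (assumption: in a real
# system this would live on disk; here it is an in-memory list).
Segment = List[Tuple[str, bytes]]
Pointer = Tuple[str, int, int]  # (key, segment id, offset within segment)


def key_index(seg_id: int, segment: Segment) -> Iterator[Pointer]:
    """Yield one pointer per record; the value itself is never copied."""
    for offset, (key, _value) in enumerate(segment):
        yield (key, seg_id, offset)


def in_situ_merge(segments: List[Segment]) -> List[Pointer]:
    """Merge already-sorted segments by key, producing only ordered pointers."""
    streams = [key_index(i, seg) for i, seg in enumerate(segments)]
    # heapq.merge performs a k-way merge over the sorted pointer streams.
    return list(heapq.merge(*streams))


def read_merged(segments: List[Segment],
                pointers: List[Pointer]) -> Iterator[Tuple[str, bytes]]:
    """Follow the pointers to fetch each value from its original location."""
    for key, seg_id, offset in pointers:
        yield key, segments[seg_id][offset][1]


if __name__ == "__main__":
    segs = [[("a", b"1"), ("c", b"3")], [("b", b"2"), ("c", b"4")]]
    for key, value in read_merged(segs, in_situ_merge(segs)):
        print(key, value)
```

In this sketch only the small pointer tuples are reordered; each value is read once, from its original position, when the merged stream is consumed.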