ABSTRACT
Stock Hadoop is a reliable, scalable, open-source implementation of the MapReduce framework for processing data-intensive applications in a distributed, parallel environment. In a shared cluster serving multiple users with diverse applications, the number of jobs typically exceeds the available resources, so jobs run in multiple waves. Shuffling, the longest phase of a job, has the most adverse effect on job execution time because of the network traffic it generates. On one hand, because the shuffle phase is bound to reduce tasks, it cannot start until a reduce task has been scheduled; on the other hand, the static scheduling of reduce tasks wastes reduce slots. This paper presents our ongoing effort to design an intelligent service in which the sort/merge and shuffle phases are fully decoupled from the map and reduce phases and can run in parallel with them. This parallelism shortens job completion time.
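The decoupling described above can be illustrated with a minimal sketch. All names here (`shuffle_service`, `map_task`, the per-partition queues) are hypothetical and are not the paper's actual implementation: each map task pushes its partitioned output to a standalone shuffle service as records are emitted, so merging overlaps the map wave instead of waiting for a reduce task to be scheduled and start pulling.

```python
# Hypothetical sketch of a shuffle service decoupled from reduce tasks:
# map tasks push partitioned output into per-partition queues, and
# shuffle threads merge the data in parallel with the running maps.
import threading
import queue
from collections import defaultdict

NUM_REDUCERS = 2

# One queue per reduce partition; the "shuffle service" drains them
# concurrently with the still-running map tasks.
shuffle_queues = [queue.Queue() for _ in range(NUM_REDUCERS)]
merged = [defaultdict(list) for _ in range(NUM_REDUCERS)]

def map_task(records):
    # Standard word-count map: emit (word, 1), partition by hash.
    for word in records:
        part = hash(word) % NUM_REDUCERS
        shuffle_queues[part].put((word, 1))

def shuffle_service(part):
    # Runs in parallel with the map tasks: merges pairs as they arrive,
    # independently of whether any reduce task has been scheduled yet.
    while True:
        item = shuffle_queues[part].get()
        if item is None:          # sentinel: all maps finished
            break
        key, value = item
        merged[part][key].append(value)

def reduce_task(part):
    # By the time a reduce slot frees up, its input is already merged.
    return {k: sum(v) for k, v in merged[part].items()}

# Drive one tiny job: two map tasks, two reduce partitions.
shufflers = [threading.Thread(target=shuffle_service, args=(p,))
             for p in range(NUM_REDUCERS)]
for t in shufflers:
    t.start()

maps = [threading.Thread(target=map_task, args=(recs,))
        for recs in (["a", "b", "a"], ["b", "c"])]
for t in maps:
    t.start()
for t in maps:
    t.join()
for q in shuffle_queues:
    q.put(None)                   # signal end of the map wave
for t in shufflers:
    t.join()

result = {}
for p in range(NUM_REDUCERS):
    result.update(reduce_task(p))
```

In stock Hadoop, by contrast, the merge work modeled by `shuffle_service` only begins after a reduce task is scheduled and starts fetching map output, which serializes shuffling behind reduce-slot availability.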
Index Terms
- POSTER: An Intelligent Framework to Parallelize Hadoop Phases