Editorial Notes
NOTICE OF CONCERN: ACM has received evidence that casts doubt on the integrity of the peer review process for the ICIA 2016 Conference. As a result, ACM is issuing a Notice of Concern for all papers published and strongly suggests that the papers from this Conference not be cited in the literature until ACM's investigation has concluded and final decisions have been made regarding the integrity of the peer review process for this Conference.
ABSTRACT
This paper investigates the partition skew problem in the reduce phase of MapReduce jobs. Our study of Hadoop finds that existing work addresses this problem in either an offline or an online manner. The offline approach is heuristics-based: it has to wait for the map tasks to complete and incurs computation overhead to estimate the partition sizes. The other approach distributes overloaded tasks across other nodes, which requires extra split and merge operations. These extra operations, in turn, hamper the performance of the system. In this paper, we propose Aegeus, an online, streaming-based skew mitigation approach for MapReduce jobs that avoids both the long waiting time and the extra operations. Aegeus predicts the partition size of each map task and creates the resource specification based on its requirements even before the map phase completes. Hence, the proposed system can create containers sized to the workload, which can improve the overall job completion time and system performance. We evaluated Aegeus on benchmark datasets and compared its performance with naive Hadoop. Based on our observations, Aegeus outperforms naive Hadoop by 42%, improving the overall performance of both the application and the system.
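The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of predicting reduce-partition sizes before the map phase ends and sizing containers accordingly. It assumes simple linear extrapolation from the map tasks that have already finished; the function names, the memory bounds, and the proportional sizing rule are all hypothetical and are not taken from Aegeus.

```python
from collections import defaultdict


def predict_partition_sizes(completed_outputs, total_map_tasks):
    """Extrapolate final partition sizes from the map tasks finished so far.

    completed_outputs: one dict per finished map task, mapping
    partition id -> bytes emitted for that partition.
    Assumption: unfinished map tasks emit, on average, the same amount
    per partition as the finished ones (linear extrapolation).
    """
    if not completed_outputs:
        raise ValueError("need at least one completed map task")
    totals = defaultdict(int)
    for output in completed_outputs:
        for partition, size in output.items():
            totals[partition] += size
    scale = total_map_tasks / len(completed_outputs)
    return {partition: size * scale for partition, size in totals.items()}


def container_specs(predicted, base_mb=1024, max_mb=8192):
    """Pick a memory size (MB) per reduce container from predicted sizes.

    Hypothetical rule: a partition at or below the mean size gets base_mb;
    larger partitions get proportionally more, capped at max_mb.
    """
    mean = sum(predicted.values()) / len(predicted)
    specs = {}
    for partition, size in predicted.items():
        ratio = size / mean if mean > 0 else 1.0
        specs[partition] = min(int(base_mb * max(1.0, ratio)), max_mb)
    return specs


if __name__ == "__main__":
    # Two of four map tasks have finished; partition 0 is heavily skewed.
    finished = [{0: 100, 1: 10}, {0: 300, 1: 30}]
    predicted = predict_partition_sizes(finished, total_map_tasks=4)
    print(container_specs(predicted))
```

In this sketch the skewed partition receives a larger container before the map phase completes, which is the effect the abstract describes; a real system would refine the estimate as more map tasks report in.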