DOI: 10.1145/2980258.2980461
research-article

AEGEUS: An online partition skew mitigation algorithm for mapreduce

Published: 25 August 2016

Editorial Notes

NOTICE OF CONCERN: ACM has received evidence that casts doubt on the integrity of the peer review process for the ICIA 2016 Conference. As a result, ACM is issuing a Notice of Concern for all papers published and strongly suggests that the papers from this Conference not be cited in the literature until ACM's investigation has concluded and final decisions have been made regarding the integrity of the peer review process for this Conference.

ABSTRACT

This paper investigates the partition skew problem in the reduce phase of MapReduce jobs. Our study of Hadoop finds that this problem has been addressed in both offline and online manners. The offline approach is heuristics-based: it has to wait for the map tasks to complete and incurs computation overhead to estimate the partition sizes. The other approach distributes overloaded tasks across other nodes, which requires extra split and merge operations. These extra operations, in turn, hamper the performance of the system. In this paper, we propose Aegeus, an online, streaming-based skew mitigation approach for MapReduce jobs that avoids both the long waiting time and the extra operations. Aegeus predicts the partition size of each map task and creates a resource specification based on its requirements even before the map phase completes. Hence, the proposed system can create containers sized to the workload, which can improve overall job completion time and system performance. We evaluated Aegeus using benchmark datasets and compared its performance with naive Hadoop. Based on our observations, Aegeus outperforms naive Hadoop by 42%, improving the overall performance of both the application and the system.
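
The abstract does not give the prediction model or the resource specification format, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a simple linear extrapolation from the bytes each partition has received so far, and sizes each reducer's container in proportion to its predicted share of the data. The function names, the `fraction_done` parameter, and the memory defaults (`base_mb`, `step_mb`, `max_mb`) are all hypothetical.

```python
import math


def predict_partition_sizes(observed_bytes, fraction_done):
    """Extrapolate final partition sizes from partial map output.

    Assumption (not from the paper): map output arrives at a roughly
    uniform rate, so the final size is observed / fraction_done.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return {pid: size / fraction_done for pid, size in observed_bytes.items()}


def container_specs(predicted, base_mb=1024, step_mb=512, max_mb=8192):
    """Map predicted partition sizes to per-reducer memory requests (MB).

    Hypothetical policy: scale base_mb by the partition's size relative
    to the mean, round up to a multiple of step_mb, and clamp the result
    to the range [base_mb, max_mb].
    """
    mean = sum(predicted.values()) / len(predicted)
    specs = {}
    for pid, size in predicted.items():
        ratio = size / mean if mean > 0 else 1.0
        mb = math.ceil(base_mb * ratio / step_mb) * step_mb
        specs[pid] = min(max(mb, base_mb), max_mb)
    return specs


# Example: halfway through the map phase, partition 2 is clearly skewed.
observed = {0: 100, 1: 100, 2: 400}
predicted = predict_partition_sizes(observed, fraction_done=0.5)
specs = container_specs(predicted)
```

In this sketch the skewed partition gets a larger container before the map phase ends, so no split or merge step is needed afterwards. A real system would need a more robust estimator than linear extrapolation.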


  • Published in

    ACM Other conferences
    ICIA-16: Proceedings of the International Conference on Informatics and Analytics
    August 2016
    868 pages
    ISBN:9781450347563
    DOI:10.1145/2980258

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited
