DOI: 10.1145/3167132.3167150
research-article

Data locality and VM interference aware mitigation of data skew in hadoop leveraging modern portfolio theory

Published: 09 April 2018

ABSTRACT

Data skew, the uneven distribution of data among tasks in big data processing frameworks such as MapReduce, causes significant variation in task execution time and makes placing tasks on computing resources more challenging. Moreover, with the proliferation of big data processing in the cloud, interference among virtual machines co-located on the same physical machine exacerbates this variation. To tackle this challenge, we propose the Locality and Interference aware Portfolio-based Task Assignment (LIPTA) approach. LIPTA leverages modern portfolio theory to mitigate the variation in task execution time while considering the interference of virtual machines and the locality of input data. It selects and assigns a group of tasks (the portfolio) to each machine such that the variation of their total execution time is reduced by the portfolio effect. Experimental results using real-world workload logs demonstrate the effectiveness of the LIPTA approach: it reduces the total execution time of workloads by up to 46.7% compared with several variation-oblivious approaches.
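The portfolio effect mentioned in the abstract can be illustrated with a small sketch. It is not the LIPTA algorithm, which the abstract does not specify in detail; it only shows why grouping tasks whose execution times move in opposite directions lowers the variance of their total, using the identity Var(T_1 + ... + T_n) = sum over i, j of Cov(T_i, T_j). The task names, the sample values, and the helper functions below are hypothetical, and locality and VM interference are ignored.

```python
from statistics import mean


def covariance(xs, ys):
    """Sample covariance of two equally long series of execution times."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def portfolio_variance(samples, group):
    """Variance of the total execution time of the tasks in `group`.

    Uses Var(sum_i T_i) = sum_i sum_j Cov(T_i, T_j).
    """
    return sum(covariance(samples[a], samples[b]) for a in group for b in group)


# Hypothetical execution times (seconds) observed over four past runs.
samples = {
    "t1": [10, 20, 10, 20],
    "t2": [20, 10, 20, 10],  # slow when t1 is fast, and vice versa
    "t3": [10, 20, 10, 20],  # slow and fast together with t1
}

# Pairing t1 with t2: the fluctuations cancel, so the total is stable.
mixed = portfolio_variance(samples, ["t1", "t2"])

# Pairing t1 with t3: the fluctuations add up, so the total varies more.
aligned = portfolio_variance(samples, ["t1", "t3"])

print(f"variance of t1+t2: {mixed:.2f}")
print(f"variance of t1+t3: {aligned:.2f}")
```

Both pairs have the same expected total execution time (30 seconds), yet the first pair is far more predictable. A variance-aware scheduler would prefer such groupings when deciding which tasks to place together on one machine.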


        Published in

        SAC '18: Proceedings of the 33rd Annual ACM Symposium on Applied Computing
        April 2018
        2327 pages
        ISBN: 9781450351911
        DOI: 10.1145/3167132

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
