ABSTRACT
Data skew, the uneven distribution of data among tasks in big data processing frameworks such as MapReduce, causes significant variation in task execution time and makes placing tasks on computing resources more challenging. Moreover, as big data processing moves to the cloud, interference among virtual machines co-located on the same physical machine exacerbates this variation. To tackle this challenge, we propose the Locality- and Interference-aware Portfolio-based Task Assignment (LIPTA) approach. LIPTA leverages modern portfolio theory to mitigate the variation in task execution time while accounting for virtual machine interference and the locality of input data. It selects and assigns a group of tasks (the portfolio) to each machine such that the variation of the group's total execution time is reduced through the portfolio effect. Experimental results using real-world workload logs demonstrate the effectiveness of LIPTA: it reduces the total execution time of workloads by up to 46.7% compared with several variation-oblivious approaches.
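The portfolio effect mentioned above can be illustrated with a small sketch. It is not the LIPTA algorithm, which the abstract does not specify; it only shows the underlying identity that the variance of a sum of task times equals the sum of all pairwise covariances, so grouping tasks whose execution times are negatively correlated lowers the variance of the total. The covariance matrix, the function names `portfolio_variance` and `best_group`, and the exhaustive search are all assumptions made for illustration.

```python
from itertools import combinations


def portfolio_variance(cov, tasks):
    """Variance of the summed execution time of `tasks`.

    Var(sum_i X_i) = sum_i sum_j Cov(X_i, X_j), so we add every
    entry of the covariance sub-matrix selected by `tasks`.
    """
    return sum(cov[i][j] for i in tasks for j in tasks)


def best_group(cov, size):
    """Exhaustively pick the `size` tasks whose total time varies least.

    Brute force is only viable for tiny inputs; it is used here purely
    to make the portfolio effect visible, not as a scheduling method.
    """
    n = len(cov)
    return min(combinations(range(n), size),
               key=lambda group: portfolio_variance(cov, group))


# Hypothetical covariance matrix (seconds squared) for four tasks.
# Tasks 0 and 1 are positively correlated; tasks 0 and 2 negatively.
cov = [
    [4.0, 3.0, -2.0, 0.0],
    [3.0, 4.0, 0.0, 0.5],
    [-2.0, 0.0, 4.0, 1.0],
    [0.0, 0.5, 1.0, 4.0],
]

if __name__ == "__main__":
    print(portfolio_variance(cov, (0, 1)))  # positively correlated pair
    print(portfolio_variance(cov, (0, 2)))  # negatively correlated pair
    print(best_group(cov, 2))
```

In this toy input every task has the same individual variance, so the difference between groups comes entirely from the covariance terms, which is the effect a portfolio-based assignment relies on.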
Index Terms
- Data locality and VM interference aware mitigation of data skew in Hadoop leveraging modern portfolio theory