Abstract
Cloud computing has emerged as a new way of sharing resources. MapReduce has become the de facto standard for parallel, data-intensive computation in the cloud. Hadoop is an open-source framework that implements MapReduce on clusters of commodity hardware. A cluster built from different generations of commodity hardware (nodes) is heterogeneous, and such heterogeneity is now common in industry as well as in research centers. Hadoop's current implementation assumes that the nodes are homogeneous and distributes the workload evenly among them. In a heterogeneous Hadoop environment, this homogeneity assumption creates a load imbalance among the nodes, which in turn leads to stragglers. Stragglers are nodes that remain available in the environment but perform very poorly. This paper proposes a historical data based data placement (HDBDP) policy that balances the workload among heterogeneous nodes according to their computing capabilities, in order to improve Map task data locality and reduce job turnaround time. The approach introduces an agent that measures each node's computing capability from job history information and helps the NameNode decide the block count for each node. The proposed policy is evaluated on HiBench, Hadoop's most popular benchmark suite. Compared with Hadoop's default data placement policy and other policies, HDBDP reduces job turnaround time by an average of 14–26% across several workloads and improves Map task data locality by nearly 27% in a heterogeneous Hadoop environment.
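The idea of assigning block counts in proportion to measured node capability can be illustrated with a short sketch. This is not the HDBDP algorithm itself, whose formulas are not given in the abstract. It assumes, purely for illustration, that a node's capability is the inverse of its mean historical Map task time and that blocks are split proportionally with largest-remainder rounding. The names `capability_ratios`, `block_counts`, and the sample node labels are hypothetical.

```python
from typing import Dict, List


def capability_ratios(history: Dict[str, List[float]]) -> Dict[str, float]:
    """Return each node's normalized share of the cluster's capability.

    Assumption (not from the paper): capability is the inverse of the mean
    historical task time, so faster nodes receive a larger share.
    Every node must have at least one recorded, positive task time.
    """
    speeds = {node: len(times) / sum(times) for node, times in history.items()}
    total = sum(speeds.values())
    return {node: speed / total for node, speed in speeds.items()}


def block_counts(history: Dict[str, List[float]], total_blocks: int) -> Dict[str, int]:
    """Split total_blocks across nodes in proportion to their capability."""
    ratios = capability_ratios(history)
    exact = {node: ratio * total_blocks for node, ratio in ratios.items()}
    counts = {node: int(value) for node, value in exact.items()}
    # Blocks lost to rounding down go to the nodes with the largest
    # fractional parts, so the counts always sum to total_blocks.
    leftover = total_blocks - sum(counts.values())
    by_fraction = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for node in by_fraction[:leftover]:
        counts[node] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical job history: task times in seconds per node.
    history = {"fast": [10.0, 10.0], "medium": [20.0, 20.0], "slow": [40.0, 40.0]}
    print(block_counts(history, 70))
```

With this sample history, the node that finishes tasks four times faster receives four times as many blocks as the slowest node, which is the kind of imbalance correction the even default split does not provide.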
Cite this article
Bawankule, K.L., Dewang, R.K. & Singh, A.K. Historical data based approach for straggler avoidance in a heterogeneous Hadoop cluster. J Ambient Intell Human Comput 12, 9573–9589 (2021). https://doi.org/10.1007/s12652-020-02699-0
DOI: https://doi.org/10.1007/s12652-020-02699-0