Historical data based approach for straggler avoidance in a heterogeneous Hadoop cluster

Bawankule, Kamalakant Laxman; Dewang, Rupesh Kumar; Singh, Anil Kumar

doi:10.1007/s12652-020-02699-0

Original Research
Published: 08 February 2021

Volume 12, pages 9573–9589, (2021)

Journal of Ambient Intelligence and Humanized Computing

Abstract

Cloud computing has emerged as a new way of sharing resources. MapReduce has become the de facto standard for data-intensive parallel computation in the cloud. Hadoop is an open-source framework that implements MapReduce on clusters of commodity hardware. A cluster built from different generations of commodity hardware (nodes) is heterogeneous, and such heterogeneity is now common in industry as well as in research centers. Hadoop's current implementation assumes that the nodes are homogeneous and distributes the workload evenly among them. In a heterogeneous Hadoop environment, this assumption creates a load imbalance among the nodes, which in turn leads to stragglers. Stragglers are nodes that are available in the environment but perform very poorly. This paper proposes a Historical Data Based Data Placement (HDBDP) policy that balances the workload among heterogeneous nodes according to their computing capabilities, in order to improve Map task data locality and reduce job turnaround time. The approach introduces an agent that measures each node's computing capability using job history information and helps the NameNode decide the block count for each node. The proposed policy is evaluated on the HiBench benchmark suite, one of the most popular Hadoop benchmarks. Compared with Hadoop's default data placement policy and other policies, the proposed HDBDP policy reduces the job turnaround time for several workloads by an average of 14–26% and improves Map task data locality by nearly 27% in a heterogeneous Hadoop environment.
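
The idea of sizing each node's share of blocks by its measured capability can be illustrated with a small sketch. This is not the authors' HDBDP implementation: the history record layout `(node, input_bytes, seconds)`, the throughput-based capability metric, the largest-remainder rounding, and the function names `node_capabilities` and `block_counts` are all assumptions chosen for illustration.

```python
from collections import defaultdict


def node_capabilities(history):
    """Estimate per-node throughput (bytes/second) from past Map task records.

    `history` is an iterable of (node, input_bytes, seconds) tuples. This
    record layout is an assumption made for the sketch.
    """
    total_bytes = defaultdict(float)
    total_seconds = defaultdict(float)
    for node, input_bytes, seconds in history:
        total_bytes[node] += input_bytes
        total_seconds[node] += seconds
    # Skip nodes with no recorded run time to avoid dividing by zero.
    return {
        node: total_bytes[node] / total_seconds[node]
        for node in total_bytes
        if total_seconds[node] > 0
    }


def block_counts(capabilities, num_blocks):
    """Split `num_blocks` across nodes in proportion to capability.

    Largest-remainder rounding keeps the counts summing to `num_blocks`.
    """
    total = sum(capabilities.values())
    if total <= 0:
        raise ValueError("at least one node needs a positive capability")
    shares = {n: num_blocks * c / total for n, c in capabilities.items()}
    counts = {n: int(s) for n, s in shares.items()}
    leftover = num_blocks - sum(counts.values())
    # Hand the remaining blocks to the nodes with the largest fractional parts.
    by_remainder = sorted(shares, key=lambda n: shares[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    return counts


# Hypothetical job history: three nodes processing 128 MB splits.
history = [
    ("fast", 128e6, 10.0),
    ("fast", 128e6, 10.0),
    ("medium", 128e6, 20.0),
    ("slow", 128e6, 40.0),
]
caps = node_capabilities(history)
plan = block_counts(caps, 70)
print(plan)
```

With this made-up history, the fast node is measured at four times the throughput of the slow node and so receives four times as many blocks, whereas an even split would give each node the same count regardless of speed.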

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology Allahabad, Prayagraj, Uttar Pradesh, India
Kamalakant Laxman Bawankule, Rupesh Kumar Dewang & Anil Kumar Singh

Corresponding author

Correspondence to Kamalakant Laxman Bawankule.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bawankule, K.L., Dewang, R.K. & Singh, A.K. Historical data based approach for straggler avoidance in a heterogeneous Hadoop cluster. J Ambient Intell Human Comput 12, 9573–9589 (2021). https://doi.org/10.1007/s12652-020-02699-0

Received: 28 April 2020
Accepted: 16 November 2020
Published: 08 February 2021
Issue Date: October 2021
DOI: https://doi.org/10.1007/s12652-020-02699-0
