Historical data based approach to mitigate stragglers from the Reduce phase of MapReduce in a heterogeneous Hadoop cluster

Cluster Computing

Abstract

Hadoop MapReduce processes data on a cluster of commodity hardware (nodes) in two phases, using Map and Reduce tasks. Yet Another Resource Negotiator (YARN), a dynamic resource manager, allocates resources for Map tasks so as to preserve data locality. In contrast, it may allocate resources for Reduce tasks on any node in the cluster. This policy performs well in a homogeneous environment, where the nodes’ computing capabilities are identical. In a heterogeneous environment, however, its performance degrades: allocating containers for Reduce tasks on arbitrary nodes slows down Reduce task execution and leads to computational skew. To mitigate computational skew in the Reduce phase of MapReduce, we propose the Historical data based Reduce tasks scheduling (HDRTS) technique. The technique has two algorithms. The first computes the node average response time (NART) of each node by interpreting the job history information. The second allocates resources on the faster processing nodes (FPN) to schedule the Reduce tasks. We evaluate the policy’s performance using the popular HiBench benchmark suite. Compared with Hadoop’s default policy and several other policies, the proposed HDRTS policy reduces the Reduce task execution time for reduce-input-heavy jobs by nearly 25% to 37%. It thereby mitigates the computational skew and the stragglers in the Reduce phase of MapReduce in heterogeneous environments.
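The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the NART formula, the rule for deciding which nodes count as faster, or the history format. The sketch assumes NART is the arithmetic mean of past task durations per node, that a node is "faster" when its NART is at or below the cluster-wide mean, and that Reduce tasks are spread round-robin over those nodes. The record shape and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical history record: (node name, task duration in seconds).
TaskRecord = Tuple[str, float]


def node_average_response_time(history: Iterable[TaskRecord]) -> Dict[str, float]:
    """Assumed NART: mean duration of past tasks, computed per node."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for node, duration in history:
        totals[node] += duration
        counts[node] += 1
    return {node: totals[node] / counts[node] for node in totals}


def faster_processing_nodes(nart: Dict[str, float]) -> List[str]:
    """Assumed FPN rule: nodes whose NART is at or below the cluster mean."""
    if not nart:
        return []
    cluster_mean = sum(nart.values()) / len(nart)
    fast = [node for node, avg in nart.items() if avg <= cluster_mean]
    # Fastest first; ties are broken by node name for a stable order.
    return sorted(fast, key=lambda node: (nart[node], node))


def assign_reduce_tasks(num_tasks: int, fpn: List[str]) -> List[str]:
    """Spread Reduce tasks over the faster nodes in round-robin order."""
    if num_tasks > 0 and not fpn:
        raise ValueError("no faster processing node available")
    return [fpn[i % len(fpn)] for i in range(num_tasks)]


if __name__ == "__main__":
    history = [
        ("node-a", 10.0), ("node-a", 14.0),
        ("node-b", 30.0),
        ("node-c", 18.0), ("node-c", 20.0),
    ]
    nart = node_average_response_time(history)
    fpn = faster_processing_nodes(nart)
    print(nart)
    print(fpn)
    print(assign_reduce_tasks(3, fpn))
```

In a real cluster the durations would come from the MapReduce job history, and container placement would go through the YARN scheduler rather than a plain list of node names.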



Data availability

Not applicable.

Code availability

Not applicable.


Funding

The authors would like to thank the Quality Improvement Program of the All India Council for Technical Education (AICTE), India, for supporting the research.

Author information

Corresponding author

Correspondence to Kamalakant Laxman Bawankule.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bawankule, K.L., Dewang, R.K. & Singh, A.K. Historical data based approach to mitigate stragglers from the Reduce phase of MapReduce in a heterogeneous Hadoop cluster. Cluster Comput 25, 3193–3211 (2022). https://doi.org/10.1007/s10586-021-03530-x

  • DOI: https://doi.org/10.1007/s10586-021-03530-x
