A strategy for scheduling reduce task based on intermediate data locality of the MapReduce

Shang, Fengjun; Chen, Xuanling; Yan, Chenyun

doi:10.1007/s10586-017-0972-7

Published: 17 June 2017

Volume 20, pages 2821–2831, (2017)

Cluster Computing

Fengjun Shang¹, Xuanling Chen¹ & Chenyun Yan¹


Abstract

Task scheduling is one way to improve the performance of a Hadoop system from the perspective of resource allocation and management. To save network bandwidth in a Hadoop cluster and improve system performance, this paper improves a ReduceTask scheduling strategy based on data locality. During the MapReduce stage, two main data streams travel over the cluster network: slow-task migration and remote copies of data. These two overlapping bursts of data transfer can easily become bottlenecks of the cluster network. To reduce the amount of remotely copied data, we combine data locality with a minimum network resource consumption model (MNRC), which calculates the network resource consumption of a ReduceTask. Based on this model, we design a delay priority scheduling policy for ReduceTasks that is driven by the cost of network resource consumption. Finally, MNRC is verified by simulation experiments. Evaluation results show that MNRC saves an average of 7.5% of cluster network resources in heterogeneous environments.
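The abstract does not give the MNRC formulas, so the following is only a minimal sketch of the general idea it describes, not the authors' model. It assumes that the network cost of running a ReduceTask on a node is the number of intermediate bytes stored on other nodes, and that a free node that is not the cheapest one is skipped a bounded number of times before the task is launched anyway. The function names, the `max_skips` parameter, and the byte-count cost are all hypothetical.

```python
from typing import Dict, List, Optional


def remote_copy_cost(partition_sizes: Dict[str, int], candidate: str) -> int:
    """Bytes that would cross the network if the reduce task ran on `candidate`.

    `partition_sizes` maps a node name to the bytes of this reduce task's
    intermediate (map output) data stored on that node. Data already on the
    candidate node is local and costs nothing (an assumed cost model).
    """
    return sum(size for node, size in partition_sizes.items() if node != candidate)


def cheapest_node(partition_sizes: Dict[str, int], nodes: List[str]) -> str:
    """Node with the smallest remote-copy cost; ties broken by node name."""
    return min(nodes, key=lambda n: (remote_copy_cost(partition_sizes, n), n))


def schedule_with_delay(
    partition_sizes: Dict[str, int],
    nodes: List[str],
    free_node: str,
    skipped: int,
    max_skips: int,
) -> Optional[str]:
    """Decide whether to launch the reduce task on `free_node`.

    Returns `free_node` if it is the cheapest node or the task has already
    been skipped `max_skips` times; otherwise returns None, meaning the task
    waits for a better offer.
    """
    best = cheapest_node(partition_sizes, nodes)
    if free_node == best or skipped >= max_skips:
        return free_node
    return None


if __name__ == "__main__":
    # Hypothetical layout of one reduce task's intermediate data.
    sizes = {"n1": 100, "n2": 300, "n3": 50}
    cluster = ["n1", "n2", "n3"]
    for node in cluster:
        print(node, remote_copy_cost(sizes, node))
    print(schedule_with_delay(sizes, cluster, "n1", skipped=0, max_skips=2))
```

In this sketch the skip limit bounds how long a ReduceTask can wait for its preferred node, which trades some locality for a guarantee that the task eventually starts.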

Acknowledgements

The authors would like to thank the Chongqing Basic and Frontier Research Project (Grant No. cstc2016jcyjA0590). This work was partly funded by the National Natural Science Foundation of China (No. 61672004).

Author information

Authors and Affiliations

Institute of Computer Network Engineering, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
Fengjun Shang, Xuanling Chen & Chenyun Yan

Corresponding author

Correspondence to Fengjun Shang.

About this article

Cite this article

Shang, F., Chen, X. & Yan, C. A strategy for scheduling reduce task based on intermediate data locality of the MapReduce. Cluster Comput 20, 2821–2831 (2017). https://doi.org/10.1007/s10586-017-0972-7

Received: 21 February 2017
Revised: 01 June 2017
Accepted: 05 June 2017
Published: 17 June 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s10586-017-0972-7
