A strategy for scheduling reduce task based on intermediate data locality of the MapReduce

Cluster Computing

Abstract

This paper studies task scheduling as a way to improve the performance of the Hadoop system from the perspective of resource allocation and management. To save network bandwidth in a Hadoop cluster and improve system performance, we improve a ReduceTask scheduling strategy based on data locality. During MapReduce execution, two main data streams cross the cluster network: slow-task migration and remote copies of data. These two overlapping, bursty transfers can easily become bottlenecks of the cluster network. To reduce the amount of remotely copied data, we combine data locality with a minimum network resource consumption model (MNRC), which calculates the network resources consumed by a ReduceTask. Based on this model, we design a delay priority scheduling policy for ReduceTasks that is driven by the cost of network resource consumption. Finally, MNRC is verified by simulation experiments. Evaluation results show that MNRC reduces cluster network resource consumption by an average of 7.5% in heterogeneous environments.
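
The abstract does not give the MNRC formulas, so the following is only a minimal sketch of the general idea, not the authors' model. It assumes that the network cost of running a ReduceTask on a node is simply the number of intermediate bytes that would have to be copied from other nodes, and that a delay-style policy declines a free node for a bounded number of scheduling rounds when a cheaper node exists. The names `remote_copy_cost`, `cheapest_node`, `schedule`, and `max_skips` are hypothetical.

```python
from typing import Dict, Optional


def remote_copy_cost(partition_bytes: Dict[str, int], node: str) -> int:
    """Bytes that must cross the network if the reduce task runs on `node`.

    `partition_bytes` maps each host to the size of the intermediate data it
    holds for this reduce partition (an assumed input, not from the paper).
    """
    return sum(size for host, size in partition_bytes.items() if host != node)


def cheapest_node(partition_bytes: Dict[str, int]) -> str:
    """Host with the lowest remote-copy cost; ties broken by host name."""
    return min(
        sorted(partition_bytes),
        key=lambda host: remote_copy_cost(partition_bytes, host),
    )


def schedule(
    partition_bytes: Dict[str, int],
    free_node: str,
    skipped: int,
    max_skips: int = 3,  # hypothetical bound on how long the task may wait
) -> Optional[str]:
    """Return the node to launch on, or None to keep waiting this round."""
    best = cheapest_node(partition_bytes)
    if free_node == best or skipped >= max_skips:
        return free_node
    return None


if __name__ == "__main__":
    data = {"node-a": 600, "node-b": 300, "node-c": 100}
    print(remote_copy_cost(data, "node-a"))   # cost of placing the task on node-a
    print(schedule(data, "node-b", skipped=0))  # a cheaper node exists, so wait
    print(schedule(data, "node-b", skipped=3))  # waited long enough, accept
```

Bounding the wait with `max_skips` keeps the sketch from starving a task when its cheapest node stays busy; the policy described in the abstract may weigh this trade-off differently.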

Acknowledgements

The authors would like to thank the Chongqing Basic and Frontier Research Project (Grant No. cstc2016jcyjA0590). This work was partly funded by the National Natural Science Foundation of China (No. 61672004).

Author information

Corresponding author

Correspondence to Fengjun Shang.

About this article

Cite this article

Shang, F., Chen, X. & Yan, C. A strategy for scheduling reduce task based on intermediate data locality of the MapReduce. Cluster Comput 20, 2821–2831 (2017). https://doi.org/10.1007/s10586-017-0972-7

  • DOI: https://doi.org/10.1007/s10586-017-0972-7
