Abstract
This paper studies task scheduling as a way to improve the performance of a Hadoop system from the perspective of resource allocation and management. To save network bandwidth in a Hadoop cluster environment and improve system performance, we improve a ReduceTask scheduling strategy based on data locality. During MapReduce execution, two main data streams cross the cluster network: slow-task migration and remote copies of data. When these two bursty transfers overlap, they can easily become a bottleneck of the cluster network. To reduce the amount of data copied remotely, we exploit data locality and establish a minimum network resource consumption model (MNRC), which is used to calculate the network resource consumption of a ReduceTask. Based on this model, we design a delay priority scheduling policy for ReduceTasks that is driven by the cost of network resource consumption. Finally, MNRC is verified by simulation experiments. Evaluation results show that MNRC saves cluster network resources by an average of 7.5% in heterogeneous environments.
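The idea of scoring a ReduceTask placement by its network cost and delaying the launch until a cheap node offers a slot can be illustrated with a minimal sketch. The abstract does not give the MNRC formulas, so everything below is an assumption made for illustration: the cost is taken to be simply the number of intermediate bytes that are not already on the candidate node, and the names `remote_copy_cost`, `cheapest_node`, `DelayScheduler`, and `max_skips` are hypothetical and not taken from the paper or from Hadoop.

```python
from typing import Dict


def remote_copy_cost(partition_bytes: Dict[str, int], node: str) -> int:
    """Assumed cost: bytes of the partition stored on nodes other than `node`.

    `partition_bytes` maps a node name to the size of the intermediate data
    for one reduce partition held on that node (a hypothetical input).
    """
    return sum(size for host, size in partition_bytes.items() if host != node)


def cheapest_node(partition_bytes: Dict[str, int]) -> str:
    """Node holding data with the lowest assumed cost (ties broken by name)."""
    return min(sorted(partition_bytes),
               key=lambda n: remote_copy_cost(partition_bytes, n))


class DelayScheduler:
    """Illustrative delay policy: decline costly slot offers a bounded number of times."""

    def __init__(self, max_skips: int) -> None:
        self.max_skips = max_skips          # hypothetical tuning parameter
        self.skips: Dict[str, int] = {}     # task id -> offers declined so far

    def offer(self, task_id: str, partition_bytes: Dict[str, int], node: str) -> bool:
        """Return True if the task should start on `node`, False to keep waiting."""
        best = min((remote_copy_cost(partition_bytes, n) for n in partition_bytes),
                   default=0)
        cost = remote_copy_cost(partition_bytes, node)
        # Accept when the node is as cheap as the best one, or when the task
        # has already waited the maximum number of offers.
        if cost <= best or self.skips.get(task_id, 0) >= self.max_skips:
            self.skips.pop(task_id, None)
            return True
        self.skips[task_id] = self.skips.get(task_id, 0) + 1
        return False


if __name__ == "__main__":
    data = {"n1": 100, "n2": 300, "n3": 50}
    sched = DelayScheduler(max_skips=2)
    for node in ["n1", "n3", "n1"]:
        print(node, sched.offer("r0", data, node))
```

In this sketch the bound on skipped offers keeps a task from waiting indefinitely when the cheapest node never frees a slot; a real scheduler would also have to weigh topology distance and node speed, which the sketch leaves out.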
Acknowledgements
The authors would like to thank the Chongqing Basic and Frontier Research Project for its support under Grant No. cstc2016jcyjA0590. This work is partly funded by the National Natural Science Foundation of China (No. 61672004).
About this article
Cite this article
Shang, F., Chen, X. & Yan, C. A strategy for scheduling reduce task based on intermediate data locality of the MapReduce. Cluster Comput 20, 2821–2831 (2017). https://doi.org/10.1007/s10586-017-0972-7
DOI: https://doi.org/10.1007/s10586-017-0972-7