Abstract
As one of the most popular parallel data processing models, MapReduce has been widely used in many fields. Task scheduling is the core module of a MapReduce system, and the quality of the scheduling algorithm directly affects the system's processing capacity. Because new nodes are continually added to a cluster to increase its processing capacity, the cluster inevitably becomes heterogeneous. Heterogeneous environments are common in practice, yet there has been little research on task scheduling in such environments. This paper therefore presents an in-depth study of task scheduling in heterogeneous environments and proposes a new task scheduling algorithm, HTD. First, we give a formal definition of the throughput-driven task scheduling problem in a heterogeneous environment. Second, we design the scheduling algorithm HTD, which quickly obtains the completion sequence of a job set and optimizes the task scheduling details in a heterogeneous environment. Finally, a series of experiments demonstrates the efficiency and effectiveness of the algorithm.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant Nos. 61602076, 61702072, 62002039, 61976032), the China Postdoctoral Science Foundation funded projects (Grant Nos. 2017M611211, 2017M6211, 2019M661077), the Natural Science Foundation of Liaoning Province (Grant No. 20180540003), and the CERNET Innovation Project (Grant No. NGII20190902).
About this article
Cite this article
Wang, X., Wang, C., Bai, M. et al. HTD: heterogeneous throughput-driven task scheduling algorithm in MapReduce. Distrib Parallel Databases 40, 135–163 (2022). https://doi.org/10.1007/s10619-021-07375-6
DOI: https://doi.org/10.1007/s10619-021-07375-6