HTD: heterogeneous throughput-driven task scheduling algorithm in MapReduce

Wang, Xite; Wang, Chaojin; Bai, Mei; Ma, Qian; Li, Guanyu

doi:10.1007/s10619-021-07375-6

Published: 28 October 2021

Volume 40, pages 135–163, (2022)

Distributed and Parallel Databases

Xite Wang¹, Chaojin Wang¹ (ORCID: orcid.org/0000-0001-9559-4478), Mei Bai¹, Qian Ma¹ & Guanyu Li¹

Abstract

MapReduce is one of the most popular parallel data processing models and has been widely used in many fields. Task scheduling is the core module of a MapReduce system, and the quality of the scheduling algorithm directly affects the system's processing capacity. Because new nodes are continually added to a cluster to increase its processing capacity, the cluster inevitably becomes heterogeneous. Heterogeneous environments are common in practice, yet there has been little research on task scheduling in them. This paper therefore presents an in-depth study of task scheduling in heterogeneous environments and proposes a new task scheduling algorithm, HTD. First, we give a formal definition of the throughput-driven task scheduling problem in a heterogeneous environment. Second, we design the scheduling algorithm HTD, which quickly obtains the completion sequence of a job set and optimizes the task scheduling details in a heterogeneous environment. Finally, a series of experiments demonstrates the efficiency and effectiveness of the algorithm.
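To make the heterogeneity problem concrete, the following minimal sketch compares two simple assignment policies on nodes of different speeds. It is not the HTD algorithm, whose details are not given in this abstract. The function names, the task sizes, the node speeds, and the simplified cost model (running time = task size / node speed, no data locality or network cost) are all assumptions made for illustration.

```python
def round_robin_makespan(task_sizes, node_speeds):
    """Assign tasks to nodes in turn, ignoring node speed; return the makespan."""
    finish = [0.0] * len(node_speeds)
    for i, size in enumerate(task_sizes):
        node = i % len(node_speeds)
        # Assumed cost model: running time = task size / node speed.
        finish[node] += size / node_speeds[node]
    return max(finish)


def earliest_finish_makespan(task_sizes, node_speeds):
    """Greedy heuristic: place each task (largest first) on the node
    that would complete it earliest; return the makespan."""
    finish = [0.0] * len(node_speeds)
    for size in sorted(task_sizes, reverse=True):
        node = min(
            range(len(node_speeds)),
            key=lambda n: finish[n] + size / node_speeds[n],
        )
        finish[node] += size / node_speeds[node]
    return max(finish)


if __name__ == "__main__":
    tasks = [8] * 6            # six equal-sized tasks (hypothetical units)
    speeds = [1.0, 2.0, 4.0]   # a slow, a medium, and a fast node
    for name, policy in [("round robin", round_robin_makespan),
                         ("earliest finish", earliest_finish_makespan)]:
        makespan = policy(tasks, speeds)
        # Throughput here means tasks completed per unit time.
        print(f"{name}: makespan={makespan}, throughput={len(tasks) / makespan}")
```

Under these assumptions, the speed-unaware policy is limited by the slowest node, while the speed-aware heuristic finishes the same task set in half the time and therefore doubles the throughput. This gap is the kind of effect a scheduler designed for heterogeneous clusters aims to exploit.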

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61602076, 61702072, 62002039, 61976032), the China Postdoctoral Science Foundation funded projects (Grant Nos. 2017M611211, 2017M6211, 2019M661077), the Natural Science Foundation of Liaoning Province (Grant No. 20180540003), and the CERNET Innovation Project (Grant No. NGII20190902).

Author information

Authors and Affiliations

Information Science and Technology College, Dalian Maritime University, Dalian, China
Xite Wang, Chaojin Wang, Mei Bai, Qian Ma & Guanyu Li

Corresponding author

Correspondence to Xite Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, X., Wang, C., Bai, M. et al. HTD: heterogeneous throughput-driven task scheduling algorithm in MapReduce. Distrib Parallel Databases 40, 135–163 (2022). https://doi.org/10.1007/s10619-021-07375-6

Accepted: 01 October 2021
Published: 28 October 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s10619-021-07375-6
