An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems

Tang, Xiaoyong; Li, Kenli; Liao, Guiping

doi:10.1007/s10586-014-0372-1

Published: 05 April 2014

Cluster Computing, Volume 17, pages 1413–1425 (2014)

Xiaoyong Tang¹,², Kenli Li¹ & Guiping Liao³

Abstract

In large-scale heterogeneous cluster computing systems, processor and network failures are inevitable and can adversely affect the applications executing on such systems. One way of taking failures into account is to employ a reliable scheduling algorithm. However, most existing scheduling algorithms for precedence-constrained tasks in heterogeneous systems consider only schedule length and do not efficiently satisfy the reliability requirements of tasks. In recognition of this problem, we build an application reliability analysis model based on the Weibull distribution, which can dynamically measure the reliability of tasks executing on heterogeneous clusters with arbitrary network architectures. Then, we propose a reliability-driven earliest finish time with duplication scheduling algorithm (REFTD), which incorporates task reliability overhead into scheduling. Furthermore, to improve system reliability, it duplicates a task if the task's hazard rate exceeds a threshold \(\theta \). A comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithm can shorten schedule length and improve system reliability significantly.
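As a rough illustration of the duplication rule mentioned in the abstract, the sketch below uses the textbook two-parameter Weibull distribution, with reliability \(R(t) = e^{-(t/\eta)^{\beta}}\) and hazard rate \(h(t) = (\beta/\eta)(t/\eta)^{\beta-1}\). The abstract does not give the paper's actual parameterization or say how the hazard rate is evaluated for a task, so the function names, the `shape` and `scale` parameters, and the choice of evaluating the hazard at a single time `t` are all assumptions made for illustration, not the REFTD algorithm itself.

```python
import math


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Textbook Weibull reliability R(t) = exp(-(t / scale) ** shape), for t >= 0."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Textbook Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def should_duplicate(t: float, shape: float, scale: float, theta: float) -> bool:
    """Hypothetical decision rule: duplicate when the hazard rate exceeds theta.

    Evaluating the hazard at one time point t is an assumption of this sketch;
    the abstract only states that a task is duplicated when its hazard rate
    is above the threshold.
    """
    return weibull_hazard(t, shape, scale) > theta


if __name__ == "__main__":
    # With shape = 2 the hazard grows linearly in t, so later tasks are
    # more likely to cross the threshold than earlier ones.
    for t in (5.0, 10.0):
        print(t, weibull_hazard(t, 2.0, 10.0), should_duplicate(t, 2.0, 10.0, 0.15))
```

With `shape = 1` the hazard is constant and the model reduces to the exponential distribution; a shape other than 1 lets the failure rate change over time, which is what makes a time-dependent threshold test meaningful.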



Acknowledgments

This research was partially funded by the National Science Foundation of China (Grant Nos. 61133005, 61070057, 61370098), the National Science Foundation for Distinguished Young Scholars of Hunan (Grant No. 12JJ1011), and the Scientific Research Fund of Hunan Provincial Education Department (Grant No. 12A062).

Author information

Authors and Affiliations

School of Information Science and Engineering, Hunan University, Changsha, 410082, China
Xiaoyong Tang & Kenli Li
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210046, China
Xiaoyong Tang
Information Science and Technology College, Hunan Agricultural University, Changsha, 410128, China
Guiping Liao

Corresponding author

Correspondence to Xiaoyong Tang.

About this article

Cite this article

Tang, X., Li, K. & Liao, G. An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems. Cluster Comput 17, 1413–1425 (2014). https://doi.org/10.1007/s10586-014-0372-1

Received: 13 April 2013
Revised: 04 February 2014
Accepted: 21 March 2014
Published: 05 April 2014
Issue Date: December 2014
DOI: https://doi.org/10.1007/s10586-014-0372-1
