An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems

Cluster Computing

Abstract

In large-scale heterogeneous cluster computing systems, processor and network failures are inevitable and can adversely affect the applications executing on such systems. One way of taking failures into account is to employ a reliable scheduling algorithm. However, most existing scheduling algorithms for precedence-constrained tasks in heterogeneous systems consider only schedule length and do not efficiently satisfy the reliability requirements of tasks. In recognition of this problem, we build an application reliability analysis model based on the Weibull distribution, which can dynamically measure the reliability of tasks executing on a heterogeneous cluster with arbitrary network architectures. We then propose a reliability-driven earliest finish time with duplication scheduling algorithm (REFTD), which incorporates task reliability overhead into scheduling. Furthermore, to improve system reliability, it duplicates a task if the task's hazard rate exceeds a threshold \(\theta\). A comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithm can shorten schedule length and improve system reliability significantly.
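As a rough illustration of the quantities the abstract mentions, the sketch below computes the textbook two-parameter Weibull reliability and hazard rate, and applies a threshold test of the kind used to decide on duplication. It is not the paper's model: the function names, the `shape` and `scale` parameters, the conditional-reliability formula for a task interval, and the use of the task's start time in the threshold test are all assumptions made for illustration, since the preview does not give the actual formulation.

```python
import math


def weibull_reliability(t, shape, scale):
    """Standard Weibull survival function: exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Standard Weibull hazard rate: (shape / scale) * (t / scale) ** (shape - 1).

    Assumes t > 0; for shape < 1 the hazard is undefined at t = 0.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def task_reliability(start, duration, shape, scale):
    """Probability that a processor alive at `start` survives until
    `start + duration` (conditional reliability; an assumed task model)."""
    return (weibull_reliability(start + duration, shape, scale)
            / weibull_reliability(start, shape, scale))


def should_duplicate(start, shape, scale, theta):
    """Hypothetical duplication rule: duplicate the task when the hazard
    rate at its start time exceeds the threshold theta."""
    return weibull_hazard(start, shape, scale) > theta


if __name__ == "__main__":
    # Illustrative parameters only: shape > 1 models a wear-out failure pattern.
    shape, scale, theta = 2.0, 100.0, 0.005
    for start in (10.0, 50.0):
        print(start,
              round(task_reliability(start, 5.0, shape, scale), 4),
              should_duplicate(start, shape, scale, theta))
```

With `shape = 1` the hazard rate is constant and the model reduces to the exponential distribution, so a task's conditional reliability then depends only on its duration and not on when it starts.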

Acknowledgments

This research was partially funded by the National Science Foundation of China (Grant Nos. 61133005, 61070057, 61370098), the National Science Foundation for Distinguished Young Scholars of Hunan (12JJ1011), and the Scientific Research Fund of Hunan Provincial Education Department (Grant No. 12A062).

Author information

Corresponding author

Correspondence to Xiaoyong Tang.

About this article

Cite this article

Tang, X., Li, K. & Liao, G. An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems. Cluster Comput 17, 1413–1425 (2014). https://doi.org/10.1007/s10586-014-0372-1

  • DOI: https://doi.org/10.1007/s10586-014-0372-1
