Fault-Tolerant Dynamic Rescheduling for Heterogeneous Computing Systems

Journal of Grid Computing

Abstract

As heterogeneous computing systems grow in scale and complexity, failures become frequent and hinder the execution of large-scale applications. Fault-tolerant scheduling is therefore essential for large-scale computing systems. Existing fault-tolerant scheduling algorithms are static: they allocate multiple copies of each task to several processors regardless of whether processor failures actually affect the execution of those tasks. Such active replication strategies not only waste resources but also lengthen the makespan. Moreover, they cannot guarantee that an application executes successfully. In this paper, we propose a fault-tolerant dynamic rescheduling algorithm named FTDR that overcomes these drawbacks. FTDR continuously listens for processor failures and reschedules the suspended tasks once a failure occurs. Because FTDR reschedules the tasks that are suspended because of failures, it can tolerate an arbitrary number of failures. We evaluate the algorithm on randomly generated DAGs. Experimental results show that the proposed algorithm achieves good performance in terms of makespan and resource consumption compared with its direct competitors.
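The reschedule-on-failure idea described in the abstract can be illustrated with a toy sketch. This is not the FTDR algorithm from the paper, whose details are not given here. The names `Slot` and `reschedule_on_failure`, the cost table, and the greedy earliest-finish placement are all assumptions made for illustration, and the sketch ignores DAG precedence and communication costs by treating tasks as independent.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    task: str
    proc: int
    start: float
    finish: float


def reschedule_on_failure(schedule, cost, alive, fail_time):
    """Move work lost to failed processors onto surviving ones.

    Hypothetical helper, not the paper's method.
    schedule:  list of Slot for independent tasks (no DAG edges modeled)
    cost:      cost[task][proc] = execution time of task on proc
    alive:     processors still running after the failure
    fail_time: time at which the failure is detected
    """
    if not alive:
        raise RuntimeError("no surviving processor")
    alive_set = set(alive)

    def is_lost(slot):
        # Lost work: on a failed processor and not finished at detection time.
        return slot.proc not in alive_set and slot.finish > fail_time

    kept = [s for s in schedule if not is_lost(s)]
    suspended = [s for s in schedule if is_lost(s)]

    # A survivor is free once its kept work ends, and never before fail_time.
    free_at = {p: fail_time for p in alive}
    for s in kept:
        if s.proc in free_at:
            free_at[s.proc] = max(free_at[s.proc], s.finish)

    # Greedy choice (an assumption): earliest finish time on a survivor.
    for s in sorted(suspended, key=lambda s: s.start):
        best = min(alive, key=lambda p: free_at[p] + cost[s.task][p])
        start = free_at[best]
        finish = start + cost[s.task][best]
        kept.append(Slot(s.task, best, start, finish))
        free_at[best] = finish
    return kept


if __name__ == "__main__":
    cost = {"A": {0: 2, 1: 3}, "B": {0: 4, 1: 5}, "C": {0: 3, 1: 2}}
    schedule = [Slot("A", 0, 0, 2), Slot("B", 0, 2, 6), Slot("C", 1, 0, 2)]
    # Processor 0 fails at time 3: task B is suspended and restarts on processor 1.
    for slot in reschedule_on_failure(schedule, cost, alive=[1], fail_time=3):
        print(slot)
```

Because the function takes the current set of surviving processors, it can be called again after each further failure, which mirrors the abstract's claim that rescheduling suspended tasks tolerates any number of failures as long as one processor survives. Unlike active replication, no extra copies run unless a failure actually interrupts a task.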

Author information

Corresponding author

Correspondence to Kenli Li.

About this article

Cite this article

Mei, J., Li, K., Zhou, X. et al. Fault-Tolerant Dynamic Rescheduling for Heterogeneous Computing Systems. J Grid Computing 13, 507–525 (2015). https://doi.org/10.1007/s10723-015-9331-1

  • DOI: https://doi.org/10.1007/s10723-015-9331-1
