Fault-Tolerant Dynamic Rescheduling for Heterogeneous Computing Systems

Mei, Jing; Li, Kenli; Zhou, Xu; Li, Keqin

doi:10.1007/s10723-015-9331-1

Published: 14 April 2015

Volume 13, pages 507–525, (2015)
Journal of Grid Computing

Abstract

As the scale and complexity of heterogeneous computing systems grow, failures occur frequently and adversely affect the execution of large-scale applications. Fault-tolerant scheduling is therefore essential for large-scale computing systems. The existing fault-tolerant scheduling algorithms are static: they allocate multiple copies of each task to several processors regardless of whether processor failures actually affect the execution of tasks. Such active replication strategies not only waste resources but also degrade the makespan. Moreover, they cannot guarantee the successful execution of applications. In this paper, we propose a fault-tolerant dynamic rescheduling algorithm named FTDR, which overcomes the above drawbacks. FTDR continuously monitors for processor failures and reschedules the suspended tasks once a failure occurs. Because FTDR reschedules the tasks that are suspended because of failures, it can tolerate an arbitrary number of failures. The algorithm is evaluated on randomly generated DAGs. Experimental results show that the proposed algorithm achieves good performance in terms of makespan and resource consumption compared with its direct competitors.
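The idea of rescheduling suspended tasks after a failure can be illustrated with a minimal sketch. This is not the FTDR algorithm itself, whose details are not given in the abstract. It is a generic greedy earliest-finish-time placement, and it assumes zero communication cost, results of completed tasks that survive a processor failure, and a failed task that restarts from scratch. All names (`place`, `reschedule`, the task and processor labels, and the cost table) are hypothetical.

```python
from typing import Dict, List, Tuple

Placement = Tuple[str, float, float]  # (processor, start time, finish time)


def place(order: List[str], preds: Dict[str, List[str]],
          cost: Dict[str, Dict[str, float]], procs: List[str],
          kept: Dict[str, Placement], not_before: float) -> Dict[str, Placement]:
    """Greedily place every task not in `kept` on the processor that
    finishes it earliest. `order` must be a topological order of the DAG."""
    sched = dict(kept)
    # A processor is free at `not_before`, or later if a kept task occupies it.
    free = {p: not_before for p in procs}
    for proc, _, finish in kept.values():
        if proc in free:
            free[proc] = max(free[proc], finish)
    for task in order:
        if task in sched:
            continue
        # A task can start only after all of its predecessors have finished.
        data_ready = max((sched[q][2] for q in preds.get(task, [])), default=0.0)
        options = []
        for p in procs:
            start = max(free[p], data_ready)
            options.append((start + cost[task][p], start, p))
        finish, start, proc = min(options)
        sched[task] = (proc, start, finish)
        free[proc] = finish
    return sched


def reschedule(sched: Dict[str, Placement], order: List[str],
               preds: Dict[str, List[str]], cost: Dict[str, Dict[str, float]],
               procs: List[str], failed: str, t_fail: float) -> Dict[str, Placement]:
    """Re-place the tasks affected by the failure of `failed` at time `t_fail`."""
    survivors = [p for p in procs if p != failed]
    if not survivors:
        raise RuntimeError("no processor left to run the remaining tasks")
    kept = {}
    for task, (proc, start, finish) in sched.items():
        done = finish <= t_fail                       # assumed: result survives
        running_elsewhere = proc != failed and start <= t_fail
        if done or running_elsewhere:
            kept[task] = (proc, start, finish)
    return place(order, preds, cost, survivors, kept, t_fail)


# Hypothetical four-task DAG: A precedes B and C, which both precede D.
order = ["A", "B", "C", "D"]
preds = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
cost = {"A": {"P1": 2, "P2": 3}, "B": {"P1": 3, "P2": 4},
        "C": {"P1": 4, "P2": 2}, "D": {"P1": 2, "P2": 3}}
procs = ["P1", "P2"]

initial = place(order, preds, cost, procs, {}, 0.0)
recovered = reschedule(initial, order, preds, cost, procs, "P1", 3.0)
```

In this example the failure of `P1` at time 3 interrupts task B and prevents D from starting, so only those two tasks are placed again on `P2`. Tasks A and C keep their original placements, and no task is replicated in advance.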

Author information

Authors and Affiliations

College of Information Science and Engineering, Hunan University, and National Supercomputing Center in Changsha, Hunan, 410082, China
Jing Mei, Kenli Li, Xu Zhou & Keqin Li
Department of Computer Science, State University of New York, New Paltz, New York, 12561, USA
Keqin Li

Corresponding author

Correspondence to Kenli Li.

About this article

Cite this article

Mei, J., Li, K., Zhou, X. et al. Fault-Tolerant Dynamic Rescheduling for Heterogeneous Computing Systems. J Grid Computing 13, 507–525 (2015). https://doi.org/10.1007/s10723-015-9331-1

Received: 03 September 2014
Accepted: 25 March 2015
Published: 14 April 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s10723-015-9331-1
