A Reliability-aware Task Scheduling Algorithm Based on Replication on Heterogeneous Computing Systems

Wang, Shuli; Li, Kenli; Mei, Jing; Xiao, Guoqing; Li, Keqin

doi:10.1007/s10723-016-9386-7

A Reliability-aware Task Scheduling Algorithm Based on Replication on Heterogeneous Computing Systems

Published: 30 November 2016

Volume 15, pages 23–39, (2017)
Cite this article

Journal of Grid Computing Aims and scope Submit manuscript

Shuli Wang^1,4,
Kenli Li^1,2,
Jing Mei^1,5,
Guoqing Xiao¹ &
…
Keqin Li^1,3

450 Accesses
40 Citations
Explore all metrics

Abstract

Over the past several years, a heterogeneous computing (HC) system has become more competitive as a commercial computing platform than a homogeneous system. With the growing scale of HC systems, network failures become inevitable. To achieve high performance, communication reliability should be considered while designing reliability-aware task scheduling algorithms. In this paper, we propose a new algorithm called RMSR (Replication-based scheduling for Maximizing System Reliability), which incorporates task communication into system reliability. To maximize communication reliability, an improved algorithm which searches all optimal reliability communication paths for current tasks is proposed. During the task replication phase, the task reliability threshold is determined by users and each task has dynamic replicas. Our comparative studies for both randomly generated graphs and application graphs of real-world problems show that our RMSR algorithm outperforms existing scheduling algorithms in terms of system reliability. For randomly generated graphs, several factors affecting the performance are analyzed in the paper. For an application graph of a real-world problem with a fixed DAG, the system reliability of the RMSR algorithm is at most influenced by one factor.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A low redundancy and high time efficiency large-scale task assignment strategy for heterogeneous service-oriented cloud computing systems

Article 20 August 2020

A fault-tolerant scheduling algorithm that minimizes the number of replicas in heterogeneous service-oriented cloud computing systems

Article 27 February 2024

Maximizing reliability and performance with reliability-driven task scheduling in heterogeneous distributed computing systems

Article 27 June 2018

References

Arabnejad, H., Barbosa, J.G.: List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans. Parallel Distrib. Syst. 25(3), 682–694 (2014)
Article Google Scholar
Benoit, A., Hakem, M., Robert, Y.: Fault tolerant scheduling of precedence task graphs on heterogeneous platforms. In: IEEE International Symposium on Parallel and Distributed Processing, 2008. IPDPS 2008, pp. 1–8 (2008)
Bittencourt, L.F., Sakellariou, R., Madeira, E.R.: Dag scheduling using a lookahead variant of the heterogeneous earliest finish time algorithm. In: 2010 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 27–34. IEEE (2010)
Chiu, C.C., Yeh, Y.S., Chou, J.S.: A fast algorithm for reliability-oriented task assignment in a distributed system. Comput. Commun. 25(17), 1622–1630 (2002)
Article Google Scholar
Chung, Y.C., Ranka, S.: Applications and performance analysis of a compile-time optimization approach for list scheduling algorithms on distributed memory multiprocessors. In: Supercomputing’92., Proceedings, pp. 512–521. IEEE (1992)
Daoud, M.I., Kharma, N.: A high performance algorithm for static task scheduling in heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 68(4), 399–409 (2008)
Article MATH Google Scholar
Dogan, A., Ozguner, F.: Optimal and suboptimal reliable scheduling of precedence-constrained tasks in heterogeneous distributed computing. In: Proceedings of the 2000 International Workshops on Parallel Processing, 2000, pp. 429–436. IEEE (2000)
Dogan, A., Ozguner, F.: Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 308–323 (2002)
Article Google Scholar
Dongarra, J.J., Jeannot, E., Saule, E., Shi, Z.: Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems. In: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pp. 280–288. ACM (2007)
El-Rewini, H., Lewis, T.G.: Scheduling parallel program tasks onto arbitrary target machines. J. Parallel Distrib. Comput. 9(2), 138–153 (1990)
Article MATH Google Scholar
Elegbede, A.C., Chu, C., Adjallah, K.H., Yalaoui, F.: Reliability allocation through cost minimization. IEEE Trans. Reliab. 52(1), 106–111 (2003)
Article Google Scholar
Hsieh, C.C.: Optimal task allocation and hardware redundancy policies in distributed computing systems. Eur. J. Oper. Res. 147(2), 430–447 (2003)
Article MathSciNet MATH Google Scholar
Iverson, M.A., Özgüner, F., Follen, G.J.: Parallelizing existing applications in a distributed heterogeneous environment. In: 4TH Heterogeneous Computing Workshop (HCW’95. Citeseer (1995)
Kartik, S., Murthy, C.S.R.: Task allocation algorithms for maximizing reliability of distributed computing systems. IEEE Trans. Comput. 46(6), 719–724 (1997)
Article Google Scholar
Kwok, Y.K., Ahmad, I.: Benchmarking and comparison of the task graph scheduling algorithms. J. Parallel Distrib. Comput. 59(3), 381–422 (1999)
Article MATH Google Scholar
Liu, G., Poh, K.L., Xie, M.: Iterative list scheduling for heterogeneous computing. J. Parallel Distrib. Comput. 65(5), 654–665 (2005)
Article MATH Google Scholar
Michael, R.G., Johnson, D.S.: Computers and intractability: A guide to the theory of np-completeness. WH Freeman &Co., San Francisco (1979)
MATH Google Scholar
Plank, J.S., Elwasif, W.R.: Experimental assessment of workstation failures and their impact on checkpointing systems. In: Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, 1998. Digest of Papers. pp. 48–57. IEEE (1998)
Qin, X., Jiang, H.: A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters. J. Parallel Distrib. Comput. 65(8), 885–900 (2005)
Article MATH Google Scholar
Raj, J.S., Rachel, I.S.: A survey on reliability scheduling on grid computing. In: 2013 7th International Conference on Intelligent Systems and Control (ISCO), pp. 331–334. IEEE (2013)
Ramírez-Alcaraz, J.M., Tchernykh, A., Yahyapour, R., Schwiegelshohn, U., Quezada-Pina, A., González-García, J.L., Hirales-Carbajal, A.: Job allocation strategies with user run time estimates for online scheduling in hierarchical grids. Journal of Grid Computing 9(1), 95–116 (2011)
Article Google Scholar
Sakellariou, R., Zhao, H.: A hybrid heuristic for dag scheduling on heterogeneous systems. In: Proceedings. 18th International Parallel and Distributed Processing Symposium, 2004. p. 111. IEEE (2004)
Shatz, S.M., Wang, J.P., Goto, M.: Task allocation for maximizing reliability of distributed computer systems. IEEE Trans. Comput. 41(9), 1156–1168 (1992)
Article Google Scholar
Sih, G.C., Lee, E.A.: A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Trans. Parallel Distrib. Syst. 4(2), 175–187 (1993)
Article Google Scholar
Srinivasan, S., Jha, N.K.: Safety and reliability driven task allocation in distributed systems. IEEE Trans. Parallel Distrib. Syst. 10(3), 238–251 (1999)
Article Google Scholar
Tang, X., Li, K., Li, R., Veeravalli, B.: Reliability-aware scheduling strategy for heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 70(9), 941–952 (2010)
Article MATH Google Scholar
Tang, X., Li, K., Liao, G.: An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems. Clust. Comput., 1–13 (2014)
Tang, X., Li, K., Qiu, M., Sha, E.H.M.: A hierarchical reliability-driven scheduling algorithm in grid systems. J. Parallel Distrib. Comput. 72(4), 525–535 (2012)
Article Google Scholar
Tao, Y., Jin, H., Shi, X.: Grid workflow scheduling based on reliability cost. In: Proceedings of the 2nd international conference on Scalable information systems, p. 12. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering) (2007)
Tom, P.A., Murthy, C.: Algorithms for reliability-oriented module allocation in distributed computing systems. J. Syst. Softw. 40(2), 125–138 (1998)
Article Google Scholar
Topcuoglu, H., Hariri, S., Wu, M.Y.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 260–274 (2002)
Article Google Scholar
Tosun, S.: Energy-and reliability-aware task scheduling onto heterogeneous mpsoc architectures. J. Supercomput. 62(1), 265–289 (2012)
Article Google Scholar
Ullman, J.D.: NP-complete scheduling problems. J. Comput. Syst. Sci. 10(3), 384–393 (1975)
Article MathSciNet MATH Google Scholar
Wang, X., Yeo, C.S., Buyya, R., Su, J.: Optimizing the makespan and reliability for workflow applications with reputation and a look-ahead genetic algorithm. Futur. Gener. Comput. Syst. 27(8), 1124–1134 (2011)
Article Google Scholar
Wu, M.Y., Gajski, D.D.: Hypertool: A programming aid for message-passing systems. IEEE Trans. Parallel Distrib. Syst. 1(3), 330–343 (1990)
Article Google Scholar
Zhang, J., Gu, J.: An approach to analyze grid service reliability subject to failures. In: 4th International Conference on Computer Science & Education, 2009. ICCSE’09. pp. 343–347. IEEE (2009)
Zhang, X., Zagorodnov, D., Hiltunen, M., Marzullo, K., Schlichting, R.D.: Fault-tolerant grid services using primary-backup: feasibility and performance. In: 2004 IEEE International Conference on Cluster Computing, pp. 105–114. IEEE (2004)
Zhao, B., Aydin, H., Zhu, D.: On maximizing reliability of real-time embedded applications under hard energy constraint. IEEE Trans. Ind. Inf. 6(3), 316–328 (2010)
Article Google Scholar
Zhao, L., Ren, Y., Sakurai, K.: A resource minimizing scheduling algorithm with ensuring the deadline and reliability in heterogeneous systems. In: 2011 IEEE International Conference on Advanced Information Networking and Applications (AINA), pp. 275–282. IEEE (2011)
Zhao, L., Ren, Y., Xiang, Y., Sakurai, K.: Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems. In: 2010 12th IEEE International Conference on High Performance Computing and Communications (HPCC), pp. 434–441. IEEE (2010)
Zheng, Q., Veeravalli, B., Tham, C.K.: On the design of fault-tolerant scheduling strategies using primary-backup approach for computational grids with low replication costs. IEEE Trans. Comput. 58(3), 380–393 (2009)
Article MathSciNet Google Scholar
Zhu, D., Melhem, R., Mossé, D.: The effects of energy management on reliability in real-time embedded systems. In: IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004. pp. 35–40. IEEE (2004)

Download references

Acknowledgments

A preliminary version of the paper will be presented in the Fourth Workshop on Parallel Computing and Optimization in conjunction with IPDPS 2014, Phoenix, Arizona, May 23, 2014. The authors would like to thank the three anonymous reviewers for their valuable and helpful comments on improving the manuscript.

Author information

Authors and Affiliations

College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China
Shuli Wang, Kenli Li, Jing Mei, Guoqing Xiao & Keqin Li
National Supercomputing Center in Changsha, Changsha, Hunan, 410082, China
Kenli Li
Department of Computer Science, State University of New York, New Paltz, NY, 12561, USA
Keqin Li
Hunan Zhongyi Communication Technology Engineering Co., Ltd., Changsha, Hunan, 410003, China
Shuli Wang
College of Mathematics and Computer Science, Hunan Normal University, Changsha, Hunan, 410006, China
Jing Mei

Authors

Shuli Wang
View author publications
You can also search for this author in PubMed Google Scholar
Kenli Li
View author publications
You can also search for this author in PubMed Google Scholar
Jing Mei
View author publications
You can also search for this author in PubMed Google Scholar
Guoqing Xiao
View author publications
You can also search for this author in PubMed Google Scholar
Keqin Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Kenli Li.

Additional information

The research was partially funded by the Key Program of National Natural Science Foundation of China (Grant Nos. 61133005, 61432005), the National Outstanding Youth Science Program of National Natural Science Foundation of China (Grant No. 61625202), the International Science & Technology Cooperation Program of China (Grant No. 2015DFA11240), the National Key R&D Program of China (Grant No. 2016YFB0201402), the Hunan Provincial Innovation Foundation For Postgraduate (Grant No. CX2016B066), and the Outstanding Graduate Student Innovation Fund Program of Collaborative Innovation Center of High Performance Computing.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Wang, S., Li, K., Mei, J. et al. A Reliability-aware Task Scheduling Algorithm Based on Replication on Heterogeneous Computing Systems. J Grid Computing 15, 23–39 (2017). https://doi.org/10.1007/s10723-016-9386-7

Download citation

Received: 12 April 2014
Accepted: 01 November 2016
Published: 30 November 2016
Issue Date: March 2017
DOI: https://doi.org/10.1007/s10723-016-9386-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Reliability-aware Task Scheduling Algorithm Based on Replication on Heterogeneous Computing Systems

Abstract

Access this article

Similar content being viewed by others

A low redundancy and high time efficiency large-scale task assignment strategy for heterogeneous service-oriented cloud computing systems

A fault-tolerant scheduling algorithm that minimizes the number of replicas in heterogeneous service-oriented cloud computing systems

Maximizing reliability and performance with reliability-driven task scheduling in heterogeneous distributed computing systems

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A Reliability-aware Task Scheduling Algorithm Based on Replication on Heterogeneous Computing Systems

Abstract

Access this article

Similar content being viewed by others

A low redundancy and high time efficiency large-scale task assignment strategy for heterogeneous service-oriented cloud computing systems

A fault-tolerant scheduling algorithm that minimizes the number of replicas in heterogeneous service-oriented cloud computing systems

Maximizing reliability and performance with reliability-driven task scheduling in heterogeneous distributed computing systems

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation