Skip to main content
Log in

A Reliability-aware Task Scheduling Algorithm Based on Replication on Heterogeneous Computing Systems

  • Published:
Journal of Grid Computing Aims and scope Submit manuscript

Abstract

Over the past several years, a heterogeneous computing (HC) system has become more competitive as a commercial computing platform than a homogeneous system. With the growing scale of HC systems, network failures become inevitable. To achieve high performance, communication reliability should be considered while designing reliability-aware task scheduling algorithms. In this paper, we propose a new algorithm called RMSR (Replication-based scheduling for Maximizing System Reliability), which incorporates task communication into system reliability. To maximize communication reliability, an improved algorithm which searches all optimal reliability communication paths for current tasks is proposed. During the task replication phase, the task reliability threshold is determined by users and each task has dynamic replicas. Our comparative studies for both randomly generated graphs and application graphs of real-world problems show that our RMSR algorithm outperforms existing scheduling algorithms in terms of system reliability. For randomly generated graphs, several factors affecting the performance are analyzed in the paper. For an application graph of a real-world problem with a fixed DAG, the system reliability of the RMSR algorithm is at most influenced by one factor.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Arabnejad, H., Barbosa, J.G.: List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans. Parallel Distrib. Syst. 25(3), 682–694 (2014)

    Article  Google Scholar 

  2. Benoit, A., Hakem, M., Robert, Y.: Fault tolerant scheduling of precedence task graphs on heterogeneous platforms. In: IEEE International Symposium on Parallel and Distributed Processing, 2008. IPDPS 2008, pp. 1–8 (2008)

  3. Bittencourt, L.F., Sakellariou, R., Madeira, E.R.: Dag scheduling using a lookahead variant of the heterogeneous earliest finish time algorithm. In: 2010 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 27–34. IEEE (2010)

  4. Chiu, C.C., Yeh, Y.S., Chou, J.S.: A fast algorithm for reliability-oriented task assignment in a distributed system. Comput. Commun. 25(17), 1622–1630 (2002)

    Article  Google Scholar 

  5. Chung, Y.C., Ranka, S.: Applications and performance analysis of a compile-time optimization approach for list scheduling algorithms on distributed memory multiprocessors. In: Supercomputing’92., Proceedings, pp. 512–521. IEEE (1992)

  6. Daoud, M.I., Kharma, N.: A high performance algorithm for static task scheduling in heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 68(4), 399–409 (2008)

    Article  MATH  Google Scholar 

  7. Dogan, A., Ozguner, F.: Optimal and suboptimal reliable scheduling of precedence-constrained tasks in heterogeneous distributed computing. In: Proceedings of the 2000 International Workshops on Parallel Processing, 2000, pp. 429–436. IEEE (2000)

  8. Dogan, A., Ozguner, F.: Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 308–323 (2002)

    Article  Google Scholar 

  9. Dongarra, J.J., Jeannot, E., Saule, E., Shi, Z.: Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems. In: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pp. 280–288. ACM (2007)

  10. El-Rewini, H., Lewis, T.G.: Scheduling parallel program tasks onto arbitrary target machines. J. Parallel Distrib. Comput. 9(2), 138–153 (1990)

    Article  MATH  Google Scholar 

  11. Elegbede, A.C., Chu, C., Adjallah, K.H., Yalaoui, F.: Reliability allocation through cost minimization. IEEE Trans. Reliab. 52(1), 106–111 (2003)

    Article  Google Scholar 

  12. Hsieh, C.C.: Optimal task allocation and hardware redundancy policies in distributed computing systems. Eur. J. Oper. Res. 147(2), 430–447 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  13. Iverson, M.A., Özgüner, F., Follen, G.J.: Parallelizing existing applications in a distributed heterogeneous environment. In: 4TH Heterogeneous Computing Workshop (HCW’95. Citeseer (1995)

  14. Kartik, S., Murthy, C.S.R.: Task allocation algorithms for maximizing reliability of distributed computing systems. IEEE Trans. Comput. 46(6), 719–724 (1997)

    Article  Google Scholar 

  15. Kwok, Y.K., Ahmad, I.: Benchmarking and comparison of the task graph scheduling algorithms. J. Parallel Distrib. Comput. 59(3), 381–422 (1999)

    Article  MATH  Google Scholar 

  16. Liu, G., Poh, K.L., Xie, M.: Iterative list scheduling for heterogeneous computing. J. Parallel Distrib. Comput. 65(5), 654–665 (2005)

    Article  MATH  Google Scholar 

  17. Michael, R.G., Johnson, D.S.: Computers and intractability: A guide to the theory of np-completeness. WH Freeman &Co., San Francisco (1979)

    MATH  Google Scholar 

  18. Plank, J.S., Elwasif, W.R.: Experimental assessment of workstation failures and their impact on checkpointing systems. In: Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, 1998. Digest of Papers. pp. 48–57. IEEE (1998)

  19. Qin, X., Jiang, H.: A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters. J. Parallel Distrib. Comput. 65(8), 885–900 (2005)

    Article  MATH  Google Scholar 

  20. Raj, J.S., Rachel, I.S.: A survey on reliability scheduling on grid computing. In: 2013 7th International Conference on Intelligent Systems and Control (ISCO), pp. 331–334. IEEE (2013)

  21. Ramírez-Alcaraz, J.M., Tchernykh, A., Yahyapour, R., Schwiegelshohn, U., Quezada-Pina, A., González-García, J.L., Hirales-Carbajal, A.: Job allocation strategies with user run time estimates for online scheduling in hierarchical grids. Journal of Grid Computing 9(1), 95–116 (2011)

    Article  Google Scholar 

  22. Sakellariou, R., Zhao, H.: A hybrid heuristic for dag scheduling on heterogeneous systems. In: Proceedings. 18th International Parallel and Distributed Processing Symposium, 2004. p. 111. IEEE (2004)

  23. Shatz, S.M., Wang, J.P., Goto, M.: Task allocation for maximizing reliability of distributed computer systems. IEEE Trans. Comput. 41(9), 1156–1168 (1992)

    Article  Google Scholar 

  24. Sih, G.C., Lee, E.A.: A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Trans. Parallel Distrib. Syst. 4(2), 175–187 (1993)

    Article  Google Scholar 

  25. Srinivasan, S., Jha, N.K.: Safety and reliability driven task allocation in distributed systems. IEEE Trans. Parallel Distrib. Syst. 10(3), 238–251 (1999)

    Article  Google Scholar 

  26. Tang, X., Li, K., Li, R., Veeravalli, B.: Reliability-aware scheduling strategy for heterogeneous distributed computing systems. J. Parallel Distrib. Comput. 70(9), 941–952 (2010)

    Article  MATH  Google Scholar 

  27. Tang, X., Li, K., Liao, G.: An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems. Clust. Comput., 1–13 (2014)

  28. Tang, X., Li, K., Qiu, M., Sha, E.H.M.: A hierarchical reliability-driven scheduling algorithm in grid systems. J. Parallel Distrib. Comput. 72(4), 525–535 (2012)

    Article  Google Scholar 

  29. Tao, Y., Jin, H., Shi, X.: Grid workflow scheduling based on reliability cost. In: Proceedings of the 2nd international conference on Scalable information systems, p. 12. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering) (2007)

  30. Tom, P.A., Murthy, C.: Algorithms for reliability-oriented module allocation in distributed computing systems. J. Syst. Softw. 40(2), 125–138 (1998)

    Article  Google Scholar 

  31. Topcuoglu, H., Hariri, S., Wu, M.Y.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 260–274 (2002)

    Article  Google Scholar 

  32. Tosun, S.: Energy-and reliability-aware task scheduling onto heterogeneous mpsoc architectures. J. Supercomput. 62(1), 265–289 (2012)

    Article  Google Scholar 

  33. Ullman, J.D.: NP-complete scheduling problems. J. Comput. Syst. Sci. 10(3), 384–393 (1975)

    Article  MathSciNet  MATH  Google Scholar 

  34. Wang, X., Yeo, C.S., Buyya, R., Su, J.: Optimizing the makespan and reliability for workflow applications with reputation and a look-ahead genetic algorithm. Futur. Gener. Comput. Syst. 27(8), 1124–1134 (2011)

    Article  Google Scholar 

  35. Wu, M.Y., Gajski, D.D.: Hypertool: A programming aid for message-passing systems. IEEE Trans. Parallel Distrib. Syst. 1(3), 330–343 (1990)

    Article  Google Scholar 

  36. Zhang, J., Gu, J.: An approach to analyze grid service reliability subject to failures. In: 4th International Conference on Computer Science & Education, 2009. ICCSE’09. pp. 343–347. IEEE (2009)

  37. Zhang, X., Zagorodnov, D., Hiltunen, M., Marzullo, K., Schlichting, R.D.: Fault-tolerant grid services using primary-backup: feasibility and performance. In: 2004 IEEE International Conference on Cluster Computing, pp. 105–114. IEEE (2004)

  38. Zhao, B., Aydin, H., Zhu, D.: On maximizing reliability of real-time embedded applications under hard energy constraint. IEEE Trans. Ind. Inf. 6(3), 316–328 (2010)

    Article  Google Scholar 

  39. Zhao, L., Ren, Y., Sakurai, K.: A resource minimizing scheduling algorithm with ensuring the deadline and reliability in heterogeneous systems. In: 2011 IEEE International Conference on Advanced Information Networking and Applications (AINA), pp. 275–282. IEEE (2011)

  40. Zhao, L., Ren, Y., Xiang, Y., Sakurai, K.: Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems. In: 2010 12th IEEE International Conference on High Performance Computing and Communications (HPCC), pp. 434–441. IEEE (2010)

  41. Zheng, Q., Veeravalli, B., Tham, C.K.: On the design of fault-tolerant scheduling strategies using primary-backup approach for computational grids with low replication costs. IEEE Trans. Comput. 58(3), 380–393 (2009)

    Article  MathSciNet  Google Scholar 

  42. Zhu, D., Melhem, R., Mossé, D.: The effects of energy management on reliability in real-time embedded systems. In: IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004. pp. 35–40. IEEE (2004)

Download references

Acknowledgments

A preliminary version of the paper will be presented in the Fourth Workshop on Parallel Computing and Optimization in conjunction with IPDPS 2014, Phoenix, Arizona, May 23, 2014. The authors would like to thank the three anonymous reviewers for their valuable and helpful comments on improving the manuscript.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Kenli Li.

Additional information

The research was partially funded by the Key Program of National Natural Science Foundation of China (Grant Nos. 61133005, 61432005), the National Outstanding Youth Science Program of National Natural Science Foundation of China (Grant No. 61625202), the International Science & Technology Cooperation Program of China (Grant No. 2015DFA11240), the National Key R&D Program of China (Grant No. 2016YFB0201402), the Hunan Provincial Innovation Foundation For Postgraduate (Grant No. CX2016B066), and the Outstanding Graduate Student Innovation Fund Program of Collaborative Innovation Center of High Performance Computing.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Wang, S., Li, K., Mei, J. et al. A Reliability-aware Task Scheduling Algorithm Based on Replication on Heterogeneous Computing Systems. J Grid Computing 15, 23–39 (2017). https://doi.org/10.1007/s10723-016-9386-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10723-016-9386-7

Keywords

Navigation