Modeling and Analysis of Grid Service Reliability Considering Fault Recovery

New Generation Computing

Abstract

The extreme complexity of grid systems makes it very difficult to achieve high service reliability, and the situation is aggravated by the fact that many grid services perform time-consuming tasks that may require several days or even months of computation. To improve grid service reliability, this paper studies a fault recovery technique for grid systems and investigates grid reliability modeling and analysis with fault recovery. Grid failures are classified into two categories: unrecoverable failures and recoverable failures. Software reliability is taken into account as well. To make fault recovery more practical, certain constraints on fault recovery are introduced, and grid service reliability models under these constraints are developed. Numerical examples are presented, and based on the results, the impact of fault recovery and of the practical constraints on grid service reliability is discussed.

Author information

Corresponding author

Correspondence to Suchang Guo.

About this article

Cite this article

Guo, S., Huang, HZ. & Liu, Y. Modeling and Analysis of Grid Service Reliability Considering Fault Recovery. New Gener. Comput. 29, 345–364 (2011). https://doi.org/10.1007/s00354-009-0114-8

  • DOI: https://doi.org/10.1007/s00354-009-0114-8
