Abstract
The extreme complexity of grid systems makes high service reliability very difficult to achieve, and the problem is aggravated by the fact that many grid services perform time-consuming tasks that may require several days or even months of computation. To improve grid service reliability, this paper studies a fault recovery technique for grid systems and develops grid reliability models and analysis that account for fault recovery. Grid failures are classified into two categories: unrecoverable failures and recoverable failures. Software reliability is also taken into account. To make fault recovery more practical, certain constraints on fault recovery are introduced, and grid service reliability models under these constraints are developed. Numerical examples are presented, and based on the results, the impact of fault recovery and of the practical constraints on grid service reliability is discussed.
Cite this article
Guo, S., Huang, HZ. & Liu, Y. Modeling and Analysis of Grid Service Reliability Considering Fault Recovery. New Gener. Comput. 29, 345–364 (2011). https://doi.org/10.1007/s00354-009-0114-8
DOI: https://doi.org/10.1007/s00354-009-0114-8