Abstract
In grid environments, the large number of components (both hardware and software) involved in application execution means that the overall probability that at least one of them is (temporarily) non-functional increases rapidly. Traditional operating systems flag such failures as fatal and stop the application, relying on a restart after the problem has been fixed. In a large grid system, this approach is not feasible: failures happen too frequently, and error diagnosis might not be possible at all.
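As a rough illustration of why this probability grows with the number of components, the sketch below computes the chance that at least one of n components is down. It assumes every component fails independently with the same probability p. Both the independence assumption and the sample values are illustrative and are not taken from the chapter.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n components is non-functional.

    Assumes (for illustration only) that each component is down
    independently with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Complement of "all n components are up".
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-component failure probability of 0.1%.
    for n in (10, 100, 1000):
        print(n, prob_any_failure(0.001, n))
```

Even with a small per-component failure probability, the result approaches 1 as n grows, which is why a restart-on-failure policy becomes impractical at grid scale.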
This scenario calls for a different approach to application execution, in which detecting and circumventing error conditions is an integral part. We present a service that keeps track of an application’s life cycle, from submission by the user to successful completion of its execution. In a case study, we describe how GRID superscalar, a grid application programming environment, can benefit from our service.
Copyright information
© 2008 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Krępska, E., Kielmann, T., Sirvent, R., Badia, R.M. (2008). A Service for Reliable Execution of Grid Applications. In: Gorlatch, S., Bubak, M., Priol, T. (eds) Achievements in European Research on Grid Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-72812-4_14
DOI: https://doi.org/10.1007/978-0-387-72812-4_14
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-72811-7
Online ISBN: 978-0-387-72812-4
eBook Packages: Computer Science, Computer Science (R0)