A Service for Reliable Execution of Grid Applications

Krępska, Elżbieta; Kielmann, Thilo; Sirvent, Raül; Badia, Rosa M.

doi:10.1007/978-0-387-72812-4_14

Abstract

In grid environments, the large number of hardware and software components involved in executing an application means that the probability that at least one of them is (temporarily) non-functional rises rapidly as the number of components grows. Traditional operating systems treat such failures as fatal: the application is stopped and must be restarted once the problem has been fixed. In a large grid system this approach is not feasible, because failures happen too frequently and error diagnosis may not be possible at all.

This scenario calls for a different approach to application execution, one in which detecting and circumventing error conditions is an integral part. We present a service that keeps track of an application’s life cycle, from submission by the user to successful completion of its execution. In a case study, we describe how GRID superscalar, a grid application programming environment, can benefit from our service.
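
The abstract's opening claim about failure probability can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the chapter: it assumes that components fail independently and that each is non-functional with the same probability p, in which case the probability that at least one of n components is down is 1 - (1 - p)^n. The function name and the value p = 0.01 are hypothetical choices made for this example.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n components is non-functional.

    Assumption (not from the chapter): components fail independently,
    each with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # All n components work with probability (1 - p)^n;
    # the complement is "at least one is down".
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-component failure probability of 1%.
    for n in (1, 10, 100, 1000):
        print(f"n={n:5d}  P(at least one failure)={prob_any_failure(0.01, n):.4f}")
```

Under these assumptions, a per-component failure probability of 1% already gives a better than 60% chance that something is broken once 100 components are involved, which is why stopping and restarting on every failure does not scale.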

Author information

Authors and Affiliations

Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands
Elżbieta Krępska & Thilo Kielmann
Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, Barcelona, Spain
Raül Sirvent & Rosa M. Badia

Editor information

Editors and Affiliations

Universität Münster, FB Mathematik und Informatik, Institut für Informatik, Einsteinstr. 62, 48149 Münster, Germany
Sergei Gorlatch
Academy of Mining and Metallurgy (AGH), Institute of Computer Science, Al. A. Mickiewicza 30, 30-059 Kraków, Poland
Marian Bubak
IRISA / INRIA Rennes, Campus de Beaulieu, 35042 Rennes Cedex, France
Thierry Priol

About this chapter

Cite this chapter

Krępska, E., Kielmann, T., Sirvent, R., Badia, R.M. (2008). A Service for Reliable Execution of Grid Applications. In: Gorlatch, S., Bubak, M., Priol, T. (eds) Achievements in European Research on Grid Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-72812-4_14

DOI: https://doi.org/10.1007/978-0-387-72812-4_14
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-72811-7
Online ISBN: 978-0-387-72812-4
eBook Packages: Computer Science, Computer Science (R0)
