A Service for Reliable Execution of Grid Applications

  • Chapter
Achievements in European Research on Grid Systems

Abstract

In grid environments, a large number of hardware and software components are involved in application execution, so the overall probability that at least one of them is (temporarily) non-functional increases rapidly with the number of components. Traditional operating systems flag such failures as fatal and stop the application, relying on a restart after the problem has been fixed. In a large grid system, this approach is not feasible: failures happen too frequently, and error diagnostics might not be possible at all.
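To make the scaling argument concrete, the following minimal sketch computes the chance that at least one of n components is down. It assumes, purely for illustration, that components fail independently and that each is unavailable with the same probability p; neither assumption nor the value p = 0.01 comes from the chapter.

```python
def prob_any_failure(n: int, p: float) -> float:
    """Probability that at least one of n components is non-functional.

    Assumes (for illustration only) that components fail independently
    and that each one is down with the same probability p.
    """
    return 1.0 - (1.0 - p) ** n


# Illustrative per-component unavailability of 1% (an assumed value).
for n in (1, 10, 100, 1000):
    print(f"{n:5d} components -> {prob_any_failure(n, 0.01):.4f}")
```

Under these assumptions, even a small per-component failure probability makes a fault somewhere in the system nearly certain once the component count reaches the hundreds or thousands.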

This scenario calls for a different approach to application execution, in which detection and circumvention of error conditions are an integral part. We present a service that keeps track of an application’s life cycle, from submission by the user to successful completion of its execution. In a case study, we describe how GRID superscalar, a grid application programming environment, can benefit from our service.
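The general idea of tracking a job from submission to completion and circumventing failures can be sketched as below. This is not the service described in the chapter: the state names, the `run_reliably` helper, the resource names, and the resubmit-elsewhere policy are all hypothetical, chosen only to illustrate the concept.

```python
from enum import Enum, auto
from typing import Callable, List, Sequence, Tuple


class JobState(Enum):
    # Hypothetical life-cycle states, not taken from the chapter.
    SUBMITTED = auto()
    RUNNING = auto()
    FAILED = auto()
    COMPLETED = auto()


def run_reliably(
    job: Callable[[str], None],
    resources: Sequence[str],
    max_attempts: int = 3,
) -> List[Tuple[str, JobState]]:
    """Run `job` until it completes, moving to the next resource on failure.

    Returns the recorded life-cycle history as (resource, state) pairs.
    """
    history: List[Tuple[str, JobState]] = []
    for attempt in range(max_attempts):
        # Cycle through the available resources, one per attempt.
        resource = resources[attempt % len(resources)]
        history.append((resource, JobState.SUBMITTED))
        history.append((resource, JobState.RUNNING))
        try:
            job(resource)
        except Exception:
            # Failure detected: record it and circumvent by resubmitting.
            history.append((resource, JobState.FAILED))
            continue
        history.append((resource, JobState.COMPLETED))
        return history
    raise RuntimeError(f"job failed on all {max_attempts} attempts")


def flaky_job(resource: str) -> None:
    # Stand-in workload: the (made-up) resource "site-a" is always down.
    if resource == "site-a":
        raise ConnectionError("site-a is unreachable")


history = run_reliably(flaky_job, ["site-a", "site-b"])
for resource, state in history:
    print(resource, state.name)
```

Here the failure on the first resource is treated as a recoverable event rather than a fatal one, and the recorded history gives the user a view of the whole life cycle.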

Copyright information

© 2008 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Krępska, E., Kielmann, T., Sirvent, R., Badia, R.M. (2008). A Service for Reliable Execution of Grid Applications. In: Gorlatch, S., Bubak, M., Priol, T. (eds) Achievements in European Research on Grid Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-72812-4_14

  • DOI: https://doi.org/10.1007/978-0-387-72812-4_14

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-72811-7

  • Online ISBN: 978-0-387-72812-4

  • eBook Packages: Computer Science, Computer Science (R0)
