A Fault Avoidance Strategy Improving the Reliability of the EGI Production Grid Infrastructure

Palmieri, Francesco; Pardi, Silvio; Veronesi, Paolo

doi:10.1007/978-3-642-17653-1_14

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6490)

Included in the following conference series:

International Conference On Principles Of Distributed Systems

Abstract

Reliability is a crucial issue for the development of stable and effective production grid infrastructures: grid users must be able to rely on the runtime services they request and receive from the underlying grid. Many runtime services and capabilities offered by modern grid infrastructures are not available in advance to application developers and are bound dynamically only at execution time, leading to an increased incidence of interaction faults. In this work we propose, implement and evaluate a novel low-impact fault-avoidance scheme, specifically conceived to improve grid reliability from the user/application point of view by providing proper service status information to the workload management system. In particular, starting from the EGEE experience, we designed a strategy that inhibits the use of specific runtime capabilities on the available resources as soon as the monitoring system detects any anomalous behavior associated with these capabilities, and re-integrates them when they start working correctly again. The results of a significant set of tests run on the production EGEE infrastructure are presented to show the effectiveness of our approach.
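
The inhibit/re-integrate idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `CapabilityRegistry`, its methods, and the resource and capability names are all hypothetical, and the monitoring system is reduced to a single `report` call per probe result. The sketch only shows how per-capability status could be kept and then used to filter resources at matchmaking time.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CapabilityRegistry:
    """Hypothetical store of which runtime capabilities are usable per resource."""

    # resource name -> capabilities that the resource advertises
    advertised: dict[str, set[str]]
    # (resource, capability) pairs currently inhibited by monitoring
    inhibited: set[tuple[str, str]] = field(default_factory=set)

    def report(self, resource: str, capability: str, healthy: bool) -> None:
        """Record one probe result (assumed to come from the monitoring system)."""
        key = (resource, capability)
        if healthy:
            self.inhibited.discard(key)  # re-integrate once the capability works again
        else:
            self.inhibited.add(key)      # inhibit as soon as an anomaly is seen

    def usable(self, resource: str) -> set[str]:
        """Capabilities of a resource that are advertised and not inhibited."""
        caps = self.advertised.get(resource, set())
        return {c for c in caps if (resource, c) not in self.inhibited}

    def eligible_resources(self, required: set[str]) -> list[str]:
        """Resources on which every required capability is currently usable."""
        return sorted(r for r in self.advertised if required <= self.usable(r))


if __name__ == "__main__":
    registry = CapabilityRegistry({"ce-a": {"mpi", "python"}, "ce-b": {"mpi"}})
    registry.report("ce-a", "mpi", healthy=False)
    print(registry.eligible_resources({"mpi"}))
    registry.report("ce-a", "mpi", healthy=True)
    print(registry.eligible_resources({"mpi"}))
```

In this sketch a failing probe removes only the affected capability from matchmaking, so a resource with a broken capability can still receive jobs that do not require it, and a later healthy probe restores it without manual intervention.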

Author information

Authors and Affiliations

Università degli studi di Napoli Federico II, Via Cinthia, 5, 80126, Napoli, Italy
Francesco Palmieri
INFN Sezione di Napoli and INDAM, Via Cinthia, 5, 80126, Napoli, Italy
Silvio Pardi
INFN CNAF, Viale Berti Pichat 6/2, 40127, Bologna, Italy
Paolo Veronesi

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Washington University, Campus Box 1045, One Brookings Drive, 63130, St. Louis, MO, USA
Chenyang Lu
Osaka University, Japan
Toshimitsu Masuzawa
LaBRI, University of Bordeaux, 351 cours de la Libération, 33405, Talence, France
Mohamed Mosbah

Cite this paper

Palmieri, F., Pardi, S., Veronesi, P. (2010). A Fault Avoidance Strategy Improving the Reliability of the EGI Production Grid Infrastructure. In: Lu, C., Masuzawa, T., Mosbah, M. (eds) Principles of Distributed Systems. OPODIS 2010. Lecture Notes in Computer Science, vol 6490. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17653-1_14

DOI: https://doi.org/10.1007/978-3-642-17653-1_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17652-4
Online ISBN: 978-3-642-17653-1
eBook Packages: Computer Science, Computer Science (R0)
