Abstract
In a distributed, inherently dynamic Grid environment, the reliability of individual resources cannot be guaranteed. The more resources and components are involved, the more error-prone the system becomes. It is therefore important to enhance the dependability of the system with fault-tolerance mechanisms. In this paper, we present Migol, a fault-tolerant, self-healing Grid service infrastructure for MPI applications.
The benefit of the Grid is that, in case of a failure, an application may be migrated to another site and restarted there from a checkpoint file. This approach requires a service infrastructure that handles the necessary activities transparently for the application. However, a migration framework cannot support fault-tolerant applications if it is not fault-tolerant itself.
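The restart-from-checkpoint idea mentioned above can be illustrated with a minimal single-process sketch. This is not Migol's implementation and involves neither MPI nor Grid services: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` parameter are all assumptions made for illustration only.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run(path, steps, fail_at=None):
    """Sum 1..steps, writing a checkpoint after every completed step.

    fail_at (hypothetical) simulates a resource failure just before
    the given step would execute.
    """
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, steps + 1):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure before step {step}")
        state = {"step": step, "total": state["total"] + step}
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run` a second time with the same checkpoint path resumes after the last completed step instead of starting over. In a Grid setting the checkpoint file would first have to be made available at the new site; the sketch leaves that transfer out.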
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Luckow, A., Schnor, B. (2005). Migol: A Fault-Tolerant Service Framework for MPI Applications in the Grid. In: Di Martino, B., Kranzlmüller, D., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2005. Lecture Notes in Computer Science, vol 3666. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11557265_35
DOI: https://doi.org/10.1007/11557265_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29009-4
Online ISBN: 978-3-540-31943-6
eBook Packages: Computer Science, Computer Science (R0)