Abstract
Utilizing the computing power of idle workstations and tolerating failures of the computing nodes that run parallel message-passing applications is an active research area in computer science. The Channel Memory based approach has been shown to tolerate faults in the tasks of parallel applications. The first work to use this approach, in conjunction with a specially designed checkpointing and recovery protocol, resulted in the MPICH-V architecture. In this paper, we present the Channel Memory based Dynamic Environment (CMDE), a stand-alone distributed program system based on the MPICH-V architecture. We also present an approach to tolerating faults of the Channel Memories themselves, based on the CMDE architecture and on a Limited Replication of Channel Memories algorithm introduced in this paper.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Selikhov, A., Germain, C. (2003). CMDE: A Channel Memory Based Dynamic Environment for Fault-Tolerant Message Passing Based on MPICH-V Architecture. In: Malyshkin, V.E. (eds) Parallel Computing Technologies. PaCT 2003. Lecture Notes in Computer Science, vol 2763. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45145-7_50
DOI: https://doi.org/10.1007/978-3-540-45145-7_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40673-0
Online ISBN: 978-3-540-45145-7
eBook Packages: Springer Book Archive