Abstract
Checkpointing is a valuable tool for any scheduling system. With the ability to checkpoint, schedulers are not locked into a single allocation of resources to jobs, but can instead stop running jobs and reallocate resources without sacrificing any completed computation. Checkpointing techniques are not new, but they have not been widely available on parallel platforms. We have implemented CoCheck, a system for checkpointing message-passing parallel programs. Parallel programs tend to be large in terms of their aggregate memory utilization, so their checkpoints are also large. Because of this, checkpoints must be handled carefully to avoid overloading the system when checkpoints take place. Today's distributed file systems do not handle this situation well. We therefore propose the use of checkpoint servers, which are specifically designed to move checkpoints from the checkpointing process, across the interconnection network, and onto stable storage. A scheduling system can utilize numerous checkpoint servers in any configuration in order to provide good checkpointing performance.
Copyright information
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pruyne, J., Livny, M. (1996). Managing checkpoints for parallel programs. In: Feitelson, D.G., Rudolph, L. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 1996. Lecture Notes in Computer Science, vol 1162. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0022292
DOI: https://doi.org/10.1007/BFb0022292
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61864-5
Online ISBN: 978-3-540-70710-3
eBook Packages: Springer Book Archive