Managing checkpoints for parallel programs

Pruyne, Jim; Livny, Miron

doi:10.1007/BFb0022292

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1162)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

Checkpointing is a valuable tool for any scheduling system. With the ability to checkpoint, schedulers are not locked into a single allocation of resources to jobs; they can stop running jobs and reallocate resources without sacrificing any completed computation. Checkpointing techniques are not new, but they have not been widely available on parallel platforms. We have implemented CoCheck, a system for checkpointing message-passing parallel programs. Parallel programs tend to be large in terms of their aggregate memory utilization, so their checkpoints are also large. Because of this, checkpoints must be handled carefully to avoid overloading the system when they are taken. Today's distributed file systems do not handle this situation well. We therefore propose the use of checkpoint servers, which are specifically designed to move checkpoints from the checkpointing process, across the interconnection network, and onto stable storage. A scheduling system can use any number of checkpoint servers, in any configuration, to provide good checkpointing performance.
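
As a rough illustration of the checkpoint-server idea described above, the following Python sketch shows one way a server could receive a checkpoint image over a connection and commit it to disk. It is not CoCheck's protocol or code: the function names `send_checkpoint` and `store_checkpoint`, the chunk size, and the write-to-temporary-file-then-rename scheme are all assumptions made for this example.

```python
import os
import socket
import tempfile
import threading

CHUNK = 64 * 1024  # read size in bytes; an arbitrary illustrative choice


def send_checkpoint(conn: socket.socket, image: bytes) -> None:
    """Checkpointing-process side (hypothetical): stream the image, then signal end of data."""
    conn.sendall(image)
    conn.shutdown(socket.SHUT_WR)


def store_checkpoint(conn: socket.socket, dest_path: str) -> int:
    """Server side (hypothetical): copy the stream to disk and return the byte count.

    Data is written to a temporary file and renamed only after fsync, so a
    server crash mid-transfer never leaves a partial file at dest_path.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    received = 0
    try:
        with os.fdopen(fd, "wb") as out:
            while True:
                chunk = conn.recv(CHUNK)
                if not chunk:  # sender closed its write side
                    break
                out.write(chunk)
                received += len(chunk)
            out.flush()
            os.fsync(out.fileno())  # push the data to stable storage
        os.replace(tmp_path, dest_path)  # atomic commit of the finished file
    except BaseException:
        os.unlink(tmp_path)
        raise
    return received


if __name__ == "__main__":
    # Demo over a local socket pair; a real deployment would use the network.
    proc_side, server_side = socket.socketpair()
    image = os.urandom(1_000_000)  # stand-in for a process's memory image
    sender = threading.Thread(target=send_checkpoint, args=(proc_side, image))
    sender.start()
    with tempfile.TemporaryDirectory() as storage:
        stored = store_checkpoint(server_side, os.path.join(storage, "proc0.ckpt"))
        print("bytes stored:", stored)
    sender.join()
    proc_side.close()
    server_side.close()
```

In this sketch, end of data is signalled only by the sender closing its write side, so a sender that dies mid-transfer would look like a complete (but truncated) checkpoint. A practical protocol would presumably add a length header or checksum to detect that case.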

Author information

Authors and Affiliations

Department of Computer Sciences, University of Wisconsin-Madison, USA
Jim Pruyne & Miron Livny

Editor information

Dror G. Feitelson, Larry Rudolph

About this paper

Cite this paper

Pruyne, J., Livny, M. (1996). Managing checkpoints for parallel programs. In: Feitelson, D.G., Rudolph, L. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 1996. Lecture Notes in Computer Science, vol 1162. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0022292

DOI: https://doi.org/10.1007/BFb0022292
Published: 15 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61864-5
Online ISBN: 978-3-540-70710-3
eBook Packages: Springer Book Archive
