Managing checkpoints for parallel programs

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 1996)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1162)

Abstract

Checkpointing is a valuable tool for any scheduling system. With the ability to checkpoint, schedulers are not locked into a single allocation of resources to jobs; instead, they can stop running jobs and reallocate resources without sacrificing any completed computation. Checkpointing techniques are not new, but they have not been widely available on parallel platforms. We have implemented CoCheck, a system for checkpointing message-passing parallel programs. Parallel programs tend to be large in terms of their aggregate memory utilization, so their checkpoints are also large. Because of this, checkpoints must be handled carefully to avoid overloading the system when checkpoints take place. Today's distributed file systems do not handle this situation well. We therefore propose the use of checkpoint servers, which are specifically designed to move checkpoints from the checkpointing process, across the interconnection network, and onto stable storage. A scheduling system can use numerous checkpoint servers in any configuration to provide good checkpointing performance.

Editor information

Dror G. Feitelson, Larry Rudolph

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pruyne, J., Livny, M. (1996). Managing checkpoints for parallel programs. In: Feitelson, D.G., Rudolph, L. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 1996. Lecture Notes in Computer Science, vol 1162. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0022292

  • DOI: https://doi.org/10.1007/BFb0022292

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-61864-5

  • Online ISBN: 978-3-540-70710-3

  • eBook Packages: Springer Book Archive
