Abstract
Running parallel applications in a network of workstations (NOW) requires a resource management system with batch queueing and load balancing, both to make use of idle workstations and to avoid load imbalance across the network.
A resource management system for parallel jobs also needs dedicated functionality to schedule jobs onto hosts and to support checkpointing and migration of parallel applications. This paper describes the essential components of a distributed resource management system that supports parallel computations in a NOW, and shows how existing resource management components can be reused in this approach.
The implementation of a distributed resource manager demonstrates the practical relevance of the design concept.
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Maier, U., Stellner, G. (1997). Distributed resource management for parallel applications in networks of workstations. In: Hertzberger, B., Sloot, P. (eds) High-Performance Computing and Networking. HPCN-Europe 1997. Lecture Notes in Computer Science, vol 1225. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0031618
DOI: https://doi.org/10.1007/BFb0031618
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-62898-9
Online ISBN: 978-3-540-69041-2
eBook Packages: Springer Book Archive