Abstract
Metacomputing systems are intended to support remote and/or concurrent use of geographically distributed computational resources. Resource management in such systems is complicated by five concerns that do not typically arise in other situations: site autonomy and heterogeneous substrates at the resources, and application requirements for policy extensibility, co-allocation, and online control. We describe a resource management architecture that addresses these concerns. This architecture distributes the resource management problem among distinct local manager, resource broker, and resource co-allocator components and defines an extensible resource specification language to exchange information about requirements. We describe how these techniques have been implemented in the context of the Globus metacomputing toolkit and used to implement a variety of different resource management strategies. We report on our experiences applying our techniques in a large testbed, GUSTO, incorporating 15 sites, 330 computers, and 3600 processors.
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Czajkowski, K. et al. (1998). A resource management architecture for metacomputing systems. In: Feitelson, D.G., Rudolph, L. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 1998. Lecture Notes in Computer Science, vol 1459. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0053981
DOI: https://doi.org/10.1007/BFb0053981
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64825-3
Online ISBN: 978-3-540-68536-4
eBook Packages: Springer Book Archive