Abstract
Bringing clusters of computers into the mainstream as general-purpose computing systems requires better facilities for transparent remote execution of parallel and sequential applications. Although much research has been done in this area, most of it remains inaccessible for clusters built from contemporary hardware and operating systems. Existing implementations are too old or not publicly available, require operating systems that modern hardware does not support, or simply do not meet the functional requirements of practical use in real-world settings. To address these issues, we designed REXEC, a decentralized, secure remote execution facility. It provides high availability, scalability, transparent remote execution, dynamic cluster configuration, decoupled node discovery and selection, a well-defined failure and cleanup model, parallel and distributed program support, and strong authentication and encryption. The system is implemented and is currently installed and in use on a 32-node cluster of 2-way SMPs running the Linux 2.2.5 operating system.
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chun, B.N., Culler, D.E. (2000). REXEC: A Decentralized, Secure Remote Execution Environment for Clusters. In: Falsafi, B., Lauria, M. (eds) Network-Based Parallel Computing. Communication, Architecture, and Applications. CANPC 2000. Lecture Notes in Computer Science, vol 1797. Springer, Berlin, Heidelberg. https://doi.org/10.1007/10720115_1
DOI: https://doi.org/10.1007/10720115_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67879-3
Online ISBN: 978-3-540-44655-2
eBook Packages: Springer Book Archive