Abstract
Programmers and users of compute-intensive scientific applications often do not want to (or even cannot) code load balancing and fault tolerance into their programs.
The Beam system [18] uses a global virtual name space to provide migration and rollback transparency in user space for distributed groups of processes on workstations. The system calls are interposed and their parameters translated between the name spaces. Unlike other migration mechanisms, Beam does not require the applications to be written for a specific programming model or communication library.
In this paper we describe the design and implementation of a separate system call interposition process [3] that accesses the application via the debugging interface. The main advantage of this approach is that it can handle even unmodified (e.g., commercially bought) application programs. We compare measured performance figures with those of previous similar approaches [15, 20].
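As a rough illustration of the name space translation mentioned above, the following minimal sketch maps stable virtual process IDs to (host, real PID) pairs and translates the parameter of a kill-like call. All names here (`VirtualNameSpace`, `interposed_kill`, the host names and PIDs) are hypothetical and chosen for illustration; this is not Beam's actual implementation, and no real system call is issued.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical illustration: class and function names are assumptions,
# not data structures or interfaces taken from Beam.


@dataclass
class VirtualNameSpace:
    """Maps stable virtual PIDs to the current (host, real PID) location."""
    table: Dict[int, Tuple[str, int]] = field(default_factory=dict)
    next_vpid: int = 1

    def register(self, host: str, real_pid: int) -> int:
        """Assign a fresh virtual PID to a process and return it."""
        vpid = self.next_vpid
        self.next_vpid += 1
        self.table[vpid] = (host, real_pid)
        return vpid

    def to_real(self, vpid: int) -> Tuple[str, int]:
        """Translate a virtual PID into its current real location."""
        return self.table[vpid]

    def to_virtual(self, host: str, real_pid: int) -> int:
        """Translate a real location back into the virtual PID."""
        for vpid, location in self.table.items():
            if location == (host, real_pid):
                return vpid
        raise KeyError((host, real_pid))

    def migrate(self, vpid: int, new_host: str, new_real_pid: int) -> None:
        """Record a migration; the virtual PID seen by applications is unchanged."""
        self.table[vpid] = (new_host, new_real_pid)


def interposed_kill(ns: VirtualNameSpace, vpid: int, sig: int) -> Tuple[str, int, int]:
    """Translate the PID parameter of a kill-like call.

    Returns the (host, real PID, signal) triple that an interposer would
    forward; the call itself is not performed in this sketch.
    """
    host, real_pid = ns.to_real(vpid)
    return (host, real_pid, sig)


ns = VirtualNameSpace()
worker = ns.register("hostA", 4711)
before = interposed_kill(ns, worker, 15)
ns.migrate(worker, "hostB", 123)
after = interposed_kill(ns, worker, 15)
```

In this sketch the application keeps using the same virtual PID (`worker`) before and after the migration; only the table entry changes, which is the sense in which the translation hides migration from the program.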
At the time of writing funded by DFG contract SFB 342 at Institute for Computer Science, Technical University Munich
References
A.D. Alexandrov, M. Ibel, K.E. Schauser, and C.J. Scheiman. Extending the Operating System at the User Level: the Ufo Global File System. In USENIX Technical Conference Proceedings, pages 77–90, Anaheim, CA, January 1997.
D. Andres, C. Elford, B. Fin, and L. Smith. Dynamic load balancing in PVM. Technical report, University of Illinois at Urbana-Champaign, April 1993.
M. Bolz. Transparent Redirection of System Calls for Unmodified Programs in Beam. Master's thesis, Institut für Betriebssysteme und Rechnerverbund, TU Braunschweig, November 1997. (In German).
J. Cargille and B.P. Miller. Binary Wrapping: A Technique for Instrumenting Object Code. ACM Sigplan Notices, 27(6):17–18, June 1992.
J. Casas, D.L. Clark, R. Konuru, S.W. Otto, R.M. Prouty, and J. Walpole. MPVM: A migration transparent version of PVM. Computing Systems, 8(2):171–216, 1995.
CCS Annual Report. WWW page, Center for Computational Sciences, Oak Ridge National Laboratory, 1995. http://www.ccs.ornl.org/AnRep95/CCS95.html.
R. Faulkner and R. Gomes. The Process File System and Process Model in UNIX System V. In USENIX Technical Conference Proceedings, pages 243–252, Dallas, TX, January 1991.
Al Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine — A Users' Guide and Tutorial for Networked Parallel Computing. The MIT Press, Cambridge, Massachusetts, 1994.
M.B. Jones. Transparently Interposing User Code at the System Interface. PhD thesis, CMU, September 1992.
A.H. Karp, M. Heath, and Al Geist. 1995 Gordon Bell Prize Winners. IEEE Computer, 29(1):79–85, January 1996.
J. León, A.L. Fisher, and P. Steenkiste. Fail-safe PVM: A Portable Package for Distributed Programming with Transparent Recovery. Report CMU-CS-93-124, Carnegie Mellon University, February 1993.
M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpointing and Migration of UNIX Processes in the Condor Distributed Processing System. Report 1346, University of Wisconsin-Madison Computer Sciences, April 1997.
M.J. Litzkow and M. Solomon. Supporting Checkpointing and Process Migration Outside the UNIX Kernel. In USENIX Technical Conference Proceedings, pages 283–290, San Francisco, CA, January 1992.
D. Long, J. Caroll, and C. Park. A Study of the Reliability of Internet Sites. In Proceedings of the 10th Symposium on Reliable Distributed Systems, pages 177–186, 1991.
K.I. Mandelberg and V.S. Sunderam. Process Migration in UNIX Networks. In USENIX Technical Conference Proceedings, pages 357–363, Dallas, TX, February 1988.
Message Passing Interface Forum MPIF. MPI-2: Extensions to the Message-Passing Interface. Technical report, University of Tennessee, Knoxville, July 1997. http://www.mpi-forum.org.
S. Petri, M. Bolz, and H. Langendörfer. Transparent Migration and Rollback for Unmodified Applications in Workstation Clusters. Informatik-Bericht 98-02, TU Braunschweig, April 1998. To appear.
S. Petri and H. Langendörfer. Load Balancing and Fault Tolerance in Workstation Clusters — Migrating Groups of Communicating Processes. Operating Systems Review, 29(4):25–36, October 1995.
S. Petri, B. Schnor, M. Becker, B. Hinrichs, T. Tschamtke, and H. Langendörfer. Evaluation of Multicast Methods to Maintain a Global Name Space for Transparent Process Migration in Workstation Clusters. In Kommunikation in Verteilten Systemen, pages 224–234. GI/ITG Fachtagung KIVS'97, Springer, February 1997.
S. Petri, B. Schnor, H. Langendörfer, and J. Steinborn. Consistent Global Checkpoints for Distributed Applications on Clusters of Unix Workstations. In Paralleles und Verteiltes Rechnen — Beiträge zum 4. Workshop über Wissenschaftliches Rechnen, pages 77–86, Aachen, October 1996. TU Braunschweig, Shaker.
T. Shirakihara, H. Hirayama, K. Sato, and T. Kanai. ARTEMIS: Advanced Reliable disTributed Environment Middleware System. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'97, pages 97–106, Las Vegas, NV, July 1997.
G. Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), Honolulu, Hawaii, April 1996.
Sun Microsystems. SunOS Reference Manual, 1990. Revision A.
J. Trinitis. An External Checkpointing Technique for Integration into a Parallel Tool Environment. In preparation. trinitis@informatik.tu-muenchen.de, 1998.
J.J.J. Vesseur, R.N. Heederik, B.J. Overeinder, and P.M.A. Sloot. Experiments in Dynamic Load Balancing for Parallel Cluster Computing. In Proceedings of the Workshop on Parallel Programming and Computation (ZEUS'95) and the 4th Nordic Transputer Conference (NTUG'95), pages 189–194, Amsterdam, June 1995. IOS Press.
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Petri, S., Bolze, M., Langendörfer, H. (1998). Migration and rollback transparency for arbitrary distributed applications in workstation clusters. In: Rolim, J. (eds) Parallel and Distributed Processing. IPPS 1998. Lecture Notes in Computer Science, vol 1388. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-64359-1_686
DOI: https://doi.org/10.1007/3-540-64359-1_686
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64359-3
Online ISBN: 978-3-540-69756-5
eBook Packages: Springer Book Archive