Abstract
A major challenge facing grid applications is the appropriate handling of failures. In this paper we address the problem of making parallel Java applications based on Remote Method Invocation (RMI) fault tolerant in a way that is transparent to the programmer. We use globally consistent checkpointing to avoid restarting long-running computations from scratch after a system crash. The application’s execution state can be captured at any time, even when some of the application’s threads are blocked waiting for the result of a (nested) remote method call. We modify only the program’s bytecode, which makes our solution independent of any particular Java Virtual Machine (JVM) implementation. The bytecode transformation algorithm performs a compile-time analysis to reduce the number of modifications to the application’s code, which directly affects the application’s performance. The fault-tolerance extensions also cover RMI components such as the RMI registry. Since essential data such as checkpoints are replicated, our system is resilient to simultaneous failures of multiple machines. Experimental results show negligible performance overhead of our fault-tolerance extensions.
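As a rough illustration of the checkpoint-and-resume idea only, the following single-process sketch saves a method's local state at intervals and restarts from the last saved state after a simulated crash. It is not the paper's algorithm: it does not rewrite bytecode, involve RMI, or coordinate a globally consistent snapshot across machines. All names (`CheckpointSketch`, `Frame`, `SimulatedCrash`, `INTERVAL`) are hypothetical, and the in-memory `lastCheckpoint` field merely stands in for replicated storage.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class CheckpointSketch {

    /** Hypothetical snapshot of one method's locals: where to resume and the partial result. */
    static final class Frame implements Serializable {
        private static final long serialVersionUID = 1L;
        final int next;   // loop index to resume at
        final long sum;   // sum of all indices below 'next'
        Frame(int next, long sum) { this.next = next; this.sum = sum; }
    }

    /** Unchecked exception used to simulate a machine failure. */
    static final class SimulatedCrash extends RuntimeException {
        private static final long serialVersionUID = 1L;
        SimulatedCrash(String message) { super(message); }
    }

    static final int INTERVAL = 100;   // assumed checkpoint interval
    static byte[] lastCheckpoint;      // stand-in for replicated stable storage

    /** Serializes a frame into a byte image. */
    static byte[] save(Frame frame) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(frame);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds a frame from a byte image produced by save(). */
    static Frame restore(byte[] image) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(image))) {
            return (Frame) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Sums 1..n, checkpointing every INTERVAL iterations; crashes at crashAt (use -1 for no crash). */
    static long sumTo(int n, Frame resume, int crashAt) {
        int i = (resume == null) ? 1 : resume.next;
        long sum = (resume == null) ? 0L : resume.sum;
        for (; i <= n; i++) {
            if (i % INTERVAL == 0) {
                lastCheckpoint = save(new Frame(i, sum));
            }
            if (i == crashAt) {
                throw new SimulatedCrash("simulated crash at i=" + i);
            }
            sum += i;
        }
        return sum;
    }

    /** Runs the computation and, after a crash, resumes from the last checkpoint if one exists. */
    static long runWithRecovery(int n, int crashAt) {
        lastCheckpoint = null;
        try {
            return sumTo(n, null, crashAt);
        } catch (SimulatedCrash crash) {
            Frame resume = (lastCheckpoint == null) ? null : restore(lastCheckpoint);
            return sumTo(n, resume, -1);
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithRecovery(1000, 550)); // prints 500500
    }
}
```

In this sketch the crash at iteration 550 loses only the work done since the checkpoint at iteration 500; a crash before the first checkpoint falls back to a restart from scratch.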
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Garbacki, P., Biskupski, B., Bal, H. (2005). Transparent Fault Tolerance for Grid Applications. In: Sloot, P.M.A., Hoekstra, A.G., Priol, T., Reinefeld, A., Bubak, M. (eds) Advances in Grid Computing - EGC 2005. EGC 2005. Lecture Notes in Computer Science, vol 3470. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11508380_68
DOI: https://doi.org/10.1007/11508380_68
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26918-2
Online ISBN: 978-3-540-32036-4
eBook Packages: Computer Science