Implementation of Fault-Tolerant GridRPC Applications

Journal of Grid Computing

Abstract

A task-parallel application was implemented with Ninf-G, a GridRPC system, and a series of experiments was conducted on the Grid testbed in the Asia-Pacific region over three months. Through tens of long executions, typical fault patterns were collected, and instability of network throughput was found to be a major cause of the faults. Several points are stressed for avoiding a decline in task throughput due to fault-recovery operations: minimizing the timeout for fault detection, performing recovery in the background, assigning tasks redundantly, and so on. The study also offers guidance for designing an automated fault-tolerant mechanism in a layer above the GridRPC framework.
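Two of the points named in the abstract, a short detection timeout and redundant task assignment, can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the Ninf-G or GridRPC API: plain Python threads stand in for remote servers, and the names `run_redundant`, `simulated_call`, and the server labels `"fast"`, `"slow"`, and `"down"` are hypothetical.

```python
import concurrent.futures as cf
import time


def simulated_call(server, task):
    """Hypothetical stand-in for a remote call; not a GridRPC function."""
    if server == "slow":
        time.sleep(0.5)  # stands in for a stalled network transfer
    if server == "down":
        raise ConnectionError("server unreachable")
    return task * task


def run_redundant(task, servers, call, timeout):
    """Assign one task to every server and return the first result.

    Returns (result, faulty). A server that raises, or that has not
    answered when `timeout` seconds have passed, is listed in `faulty`
    so the caller can schedule its recovery separately.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(servers))
    futures = {pool.submit(call, s, task): s for s in servers}
    result, faulty = None, []
    try:
        for fut in cf.as_completed(futures, timeout=timeout):
            try:
                result = fut.result()
                break  # first successful copy wins
            except Exception:
                faulty.append(futures[fut])
    except cf.TimeoutError:
        # Short timeout: unanswered servers are treated as faulty.
        faulty.extend(s for f, s in futures.items() if not f.done())
    pool.shutdown(wait=False)  # do not block the master on stragglers
    return result, faulty


if __name__ == "__main__":
    print(run_redundant(3, ["down", "fast"], simulated_call, timeout=0.2))
    print(run_redundant(3, ["slow"], simulated_call, timeout=0.1))
```

The master never waits longer than `timeout` for a task, and a stalled server costs no throughput as long as one redundant copy finishes. Recovery of the servers reported in `faulty` would run in the background in a fuller design; that part is left out here.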



Author information


Corresponding author

Correspondence to Yusuke Tanimura.


About this article

Cite this article

Tanimura, Y., Ikegami, T., Nakada, H. et al. Implementation of Fault-Tolerant GridRPC Applications. J Grid Computing 4, 145–157 (2006). https://doi.org/10.1007/s10723-006-9044-6


  • DOI: https://doi.org/10.1007/s10723-006-9044-6
