Deploying fault-tolerance and task migration with NetSolve

  • Conference paper
Applied Parallel Computing Large Scale Scientific and Industrial Problems (PARA 1998)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1541)

Abstract

Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve’s structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve.

This work was supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy, under contract DE-AL04-94AL85000 with Lockheed Martin Energy Research Corporation.

Editor information

Bo Kågström Jack Dongarra Erik Elmroth Jerzy Waśniewski

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Plank, J.S., Casanova, H., Beck, M., Dongarra, J. (1998). Deploying fault-tolerance and task migration with NetSolve. In: Kågström, B., Dongarra, J., Elmroth, E., Waśniewski, J. (eds) Applied Parallel Computing Large Scale Scientific and Industrial Problems. PARA 1998. Lecture Notes in Computer Science, vol 1541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0095364

  • DOI: https://doi.org/10.1007/BFb0095364

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-65414-8

  • Online ISBN: 978-3-540-49261-0

  • eBook Packages: Springer Book Archive
