Migol: A Fault-Tolerant Service Framework for MPI Applications in the Grid

  • Conference paper
Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI 2005)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 3666)

Abstract

In a distributed, inherently dynamic Grid environment, the reliability of individual resources cannot be guaranteed. The more resources and components are involved, the more error-prone the system becomes. It is therefore important to enhance the dependability of the system with fault-tolerance mechanisms. In this paper, we present Migol, a fault-tolerant, self-healing Grid service infrastructure for MPI applications.

The benefit of the Grid is that, in case of a failure, an application may be migrated and restarted from a checkpoint file on another site. This approach requires a service infrastructure that handles the necessary activities transparently for the application. However, a migration framework cannot support fault-tolerant applications if it is not fault-tolerant itself.




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Luckow, A., Schnor, B. (2005). Migol: A Fault-Tolerant Service Framework for MPI Applications in the Grid. In: Di Martino, B., Kranzlmüller, D., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2005. Lecture Notes in Computer Science, vol 3666. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11557265_35

  • DOI: https://doi.org/10.1007/11557265_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29009-4

  • Online ISBN: 978-3-540-31943-6

  • eBook Packages: Computer Science, Computer Science (R0)
