Creating a transparent, distributed, and resilient computing environment: the OpenRTE project

The Journal of Supercomputing

Abstract

Meeting the future computing needs of the scientific community will likely require the development of petascale computing environments based on the integration of significant numbers of processors into large-scale clusters, and the (possibly heterogeneous) aggregation of multiple clusters for use by individual and/or synchronized applications. Despite the best of efforts, such complex systems dictate that applications must expect to encounter failures of their computing resources and/or networks during the course of execution.

The Open Run-Time Environment (OpenRTE) has been designed to support high-performance computing applications in such environments. To gain acceptance by the user community, OpenRTE must not only meet basic functional requirements but also provide users with (a) a transparent interface that avoids the need to customize applications when moving between specific computing and/or communication resources; (b) effective strategies that can be selected at run-time for dealing with faults; (c) transparent support for inter-process communication, resource discovery and allocation, and process launch across a variety of platforms; and (d) the ability to launch their applications remotely from their desktop, disconnect from them, and reconnect at a later time to monitor progress.

This paper provides an updated description of OpenRTE and discusses its relation to the current grid protocols. In addition, we introduce the concept of resilient computing—a next-generation approach to fault tolerance—and describe how OpenRTE will utilize this concept in the future.

Author information

Corresponding author

Correspondence to Ralph H. Castain.

About this article

Cite this article

Castain, R., Squyres, J.M. Creating a transparent, distributed, and resilient computing environment: the OpenRTE project. J Supercomput 42, 107–123 (2007). https://doi.org/10.1007/s11227-006-0040-1

  • DOI: https://doi.org/10.1007/s11227-006-0040-1
