Abstract
Meeting the future computing needs of the scientific community will likely require the development of petascale computing environments based on the integration of significant numbers of processors into large-scale clusters, and the (possibly heterogeneous) aggregation of multiple clusters for use by individual and/or synchronized applications. Despite best efforts, the complexity of such systems means that applications must expect to encounter failures of their computing resources and/or networks during the course of execution.
The Open Run-Time Environment (OpenRTE) has been designed to support high-performance computing applications in such environments. Gaining acceptance by the user community requires that OpenRTE not only meet basic functional requirements but also provide users with (a) a transparent interface that avoids the need to customize applications when moving between specific computing and/or communication resources; (b) effective strategies that can be selected at run-time for dealing with faults; (c) transparent support for inter-process communication, resource discovery and allocation, and process launch across a variety of platforms; and (d) the ability to launch their applications remotely from their desktop, disconnect from them, and reconnect at a later time to monitor progress.
This paper provides an updated description of OpenRTE and discusses its relation to the current grid protocols. In addition, we introduce the concept of resilient computing—a next-generation approach to fault tolerance—and describe how OpenRTE will utilize this concept in the future.
Cite this article
Castain, R., Squyres, J.M. Creating a transparent, distributed, and resilient computing environment: the OpenRTE project. J Supercomput 42, 107–123 (2007). https://doi.org/10.1007/s11227-006-0040-1
DOI: https://doi.org/10.1007/s11227-006-0040-1