ABSTRACT
The probability that a hardware failure occurs during the execution of application software continues to increase with the scale of modern systems. Existing parallel development approaches cannot effectively recover from these failures except by means of expensive checkpoint/restart files. As a result, many CPU hours of scientific simulation are lost to hardware failures.
Relentless Computing is a data-oriented approach to software development that allows many classes of distributed and parallel problems, from those requiring no data sharing to those with intensive data sharing, to be solved in both loosely and tightly coupled environments. Because a process requires no knowledge of the current runtime status of any other process to begin contributing, the execution pool can shrink and grow, and can recover from hardware failures, automatically.
We present the motivation for the development of Relentless Computing, describe how it works, give examples of using Relentless Computing to solve several types of problems, and report initial scaling results.
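To make the data-oriented model described above concrete, the sketch below is our own illustration, not code from the poster: workers coordinate only through a shared task store, never with each other, so any worker can join, leave, or fail at any time, and an expired lease lets a surviving worker redo a failed worker's task. The names TaskStore, claim, and complete are hypothetical, and a lock-protected dict stands in for what would be a distributed data store.

```python
# Hypothetical sketch of a data-oriented worker pool: processes share only
# a task store, so the pool can shrink, grow, and recover from failures.
import threading
import time

class TaskStore:
    """Shared store of work units. A real deployment would use a
    distributed key-value store; a lock-protected dict stands in here."""
    def __init__(self, tasks):
        self._lock = threading.Lock()
        # task id -> {"state": "pending" | "done", "lease": expiry time}
        self._tasks = {t: {"state": "pending", "lease": 0.0} for t in tasks}
        self.results = {}

    def claim(self, lease_seconds=2.0):
        """Atomically claim one pending (or lease-expired) task.
        A crashed worker's claim expires, so another worker can redo it."""
        now = time.monotonic()
        with self._lock:
            for tid, meta in self._tasks.items():
                if meta["state"] == "pending" and meta["lease"] < now:
                    meta["lease"] = now + lease_seconds
                    return tid
        return None

    def complete(self, tid, result):
        """Record a result idempotently; duplicate completions are harmless."""
        with self._lock:
            self._tasks[tid]["state"] = "done"
            self.results.setdefault(tid, result)

def worker(store):
    # A real worker would poll and retry; for this demo, exit when no
    # claimable task remains.
    while (tid := store.claim()) is not None:
        store.complete(tid, tid * tid)  # stand-in for real computation

if __name__ == "__main__":
    store = TaskStore(range(20))
    threads = [threading.Thread(target=worker, args=(store,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sorted(store.results.items()))
```

Because workers never exchange state directly, adding a fifth worker mid-run, or losing one, requires no coordination; idempotent completion is what makes re-execution after a lease expiry safe.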
Index Terms
- Poster: The relentless computing paradigm: a data-oriented programming model for distributed-memory computation