Performance evaluation of an application-level checkpointing solution on grids

https://doi.org/10.1016/j.future.2010.04.016

Abstract

In recent years there has been a significant effort to develop middleware that facilitates the execution of applications on Grid infrastructures. However, support for fault-tolerant execution continues to be scarce. The CPPC-G framework is a service-based architecture designed to provide efficient fault-tolerant mechanisms for the execution of sequential and parallel applications on grids. Applications to be managed by CPPC-G are expected to be preprocessed with CPPC (ComPiler for Portable Checkpointing), a tool for automatically inserting portable checkpoint instrumentation into the code of parallel applications. Built on top of existing Globus services, CPPC-G services are in charge of submitting and monitoring CPPC applications, managing generated checkpoint files, detecting failures and automatically restarting failed executions. In this paper the feasibility of this approach is assessed by measuring the performance of CPPC-G, quantitatively addressing its impact on application performance. Results show that the increase in overall throughput and availability comes with minor performance degradation.

Introduction

The emergence of Grid infrastructures has allowed users to remotely execute computationally demanding applications on powerful nodes of the Grid, or to distribute them over several nodes. But grids are highly dynamic systems formed by thousands of interconnected resources, and the reliability of individual resources and communications cannot be guaranteed. Furthermore, as parallel platforms connected to the Grid increase in complexity, so does their overall failure rate. This is especially troublesome for long-running applications, whose mean time to complete execution amply exceeds the mean time to failure (MTTF) of the underlying hardware. In such a context, ensuring that not all of the computation done is lost in the event of failure is a must.
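The relationship between run length and MTTF can be made concrete with a small sketch. It assumes exponentially distributed, independent node failures and a job that is lost as soon as any one node fails; neither assumption comes from the paper, and the function names are hypothetical.

```python
import math


def system_mttf(node_mttf_hours: float, n_nodes: int) -> float:
    """MTTF of a platform that fails when any one of n_nodes fails.

    Assumes independent, exponentially distributed node failures,
    so the failure rates simply add up.
    """
    if node_mttf_hours <= 0 or n_nodes <= 0:
        raise ValueError("node_mttf_hours and n_nodes must be positive")
    return node_mttf_hours / n_nodes


def prob_no_failure(run_hours: float, mttf_hours: float) -> float:
    """Probability that a run of run_hours completes without any failure."""
    if run_hours < 0 or mttf_hours <= 0:
        raise ValueError("run_hours must be >= 0 and mttf_hours > 0")
    return math.exp(-run_hours / mttf_hours)
```

Under these assumptions, a run whose length equals the platform MTTF finishes without failure only about 37% of the time, and the odds fall off exponentially for longer runs, which is why losing all completed work on failure quickly becomes unacceptable.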

Checkpointing is a widely used technique to obtain fault tolerance in such environments. It periodically saves the computation state to stable storage, so that the application execution can be resumed by restoring that state. A number of solutions and techniques have been proposed [1], each with its own pros and cons. But because Grid environments are highly heterogeneous, successful application of checkpointing requires an evolution from traditional non-portable state saving techniques towards truly portable tools that allow a computation to be resumed on a wide range of different machines. Portability in this context means providing the following fundamental features:

  • OS-independence: checkpointing strategies must be compatible with any given operating system. This means having at least a basic modular structure to allow for the substitution of certain critical sections of code (e.g. filesystem access) depending on the underlying OS.

  • Support for parallel applications with communication protocol independence: the checkpointing framework should not make any assumption about the communication interface or implementation being used. Even recognizing the role of MPI as the de facto message-passing standard, computational grids include machines belonging to independent entities which cannot be forced to provide a certain version of the MPI interface. To provide a truly portable, reusable approach, the checkpointing technique must not be tied to a specific MPI implementation.

  • Reduced checkpoint file sizes: the tool should optimize the amount of data being saved, avoiding dumping state which will not be necessary upon application restart. This improves performance, which depends heavily on state file sizes. This is especially true in Grid environments, where state files may have to be moved from one site to another when migrating an application after a failure.

  • Portable data recovery: the state of an application can be seen as a structure containing different types of data. The checkpointing tool must be able to recover all these data in a portable way. This includes recovery of opaque state, such as MPI communicators, as well as of OS-dependent state, such as the file table or the execution stack.
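As an illustration of the features listed above, the following is a minimal sketch of application-level checkpointing in Python. It is not how CPPC is implemented: JSON stands in for a portable state format, the saved state is reduced to the two variables needed to resume, and all names (`save_checkpoint`, `load_checkpoint`, `run`, `fail_at`) are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically write state (a dict of JSON-serializable values) to path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def run(path, n_steps, every=10, fail_at=None):
    """Sum the squares of 0..n_steps-1, checkpointing every `every` steps.

    Only the loop counter and the running total are saved; that is all
    the computation needs in order to resume.
    """
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    i, total = state["i"], state["total"]
    while i < n_steps:
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        total += i * i
        i += 1
        if i % every == 0:
            save_checkpoint(path, {"i": i, "total": total})
    return total
```

Calling `run` again with the same path after a failure resumes from the last saved iteration instead of from zero. Because the state file is plain text with no machine-specific layout, it could in principle be read back on a different architecture or operating system.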

CPPC (ComPiler for Portable Checkpointing) [2] provides all these features, which are key to fault tolerance support on heterogeneous systems. The CPPC compiler automatically instruments a parallel code, transforming it into a fault-tolerant version. CPPC is aimed at message-passing parallel codes, but it can also be used with sequential codes. The analyses and transformations performed by the compiler completely automate the instrumentation process.

But a portable checkpointing tool such as CPPC is not sufficient to provide users with a truly transparent fault-tolerant execution service on Grid environments. Services are needed for managing an execution on behalf of the user, monitoring its state, making backups of generated state files, automatically detecting faults and taking the necessary corrective actions. CPPC-G [3] is a set of new Grid services implemented on top of Globus 4 [4] which provides such functionalities for CPPC-instrumented fault-tolerant applications (CPPC applications from now on). CPPC-G services are in charge of submitting and monitoring the execution, as well as of managing and replicating the generated state files. Upon detecting a failure, CPPC-G services restart the application from the most recent consistent state, in a completely transparent way. A large number of experiments have been performed to assess the impact of CPPC-G on the execution of an application. Results show that the overhead is negligible, especially when compared with typical execution times of long-running applications.
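The submit, monitor, detect and restart cycle described above can be sketched as a small supervisor loop. This is only a toy model under stated assumptions: it runs in a single process, an in-memory list stands in for remote checkpoint storage, and every name (`JobFailed`, `run_fault_tolerant`, `make_flaky_job`) is hypothetical rather than part of CPPC-G or Globus.

```python
from typing import Callable, Optional


class JobFailed(Exception):
    """Raised by a job to signal a (simulated) resource failure."""


def run_fault_tolerant(
    job: Callable[[Optional[dict], Callable[[dict], None]], int],
    max_restarts: int = 3,
) -> int:
    """Run job, restarting it from the latest stored checkpoint on failure."""
    warehouse = []  # stand-in for remote checkpoint storage
    for _ in range(max_restarts + 1):
        latest = warehouse[-1] if warehouse else None
        try:
            # "Submit" the job, handing it the state to resume from
            # and a callback for storing new checkpoints.
            return job(latest, warehouse.append)
        except JobFailed:
            pass  # failure detected: loop around and restart from latest state
    raise RuntimeError("job did not complete within the restart budget")


def make_flaky_job(n_steps: int, fail_once_at: int):
    """Build a job that sums 0..n_steps-1 and fails once at the given step."""
    already_failed = {"value": False}

    def job(state, store):
        i = state["i"] if state else 0
        total = state["total"] if state else 0
        while i < n_steps:
            if i == fail_once_at and not already_failed["value"]:
                already_failed["value"] = True
                raise JobFailed(f"node lost at step {i}")
            total += i
            i += 1
            if i % 5 == 0:
                store({"i": i, "total": total})
        return total

    return job
```

With `run_fault_tolerant(make_flaky_job(20, 12))`, the first attempt fails at step 12 and the second resumes from the checkpoint taken at step 10, so only two steps are recomputed and the caller never sees the failure.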

The rest of the paper is structured as follows. Section 2 gives an overview of the CPPC tool. Section 3 introduces CPPC-G. Section 4 details the performance evaluation of CPPC-G. Section 5 addresses the framework behavior with real large-scale applications. Section 6 describes related work and, finally, Section 7 concludes the paper.

The CPPC tool

Grid computing presents new challenges for checkpointing techniques [5]. Its inherently heterogeneous nature makes it impossible to apply traditional state saving techniques, which use non-portable strategies for recovering structures such as the application stack, heap or communication state. The scale of the computations for which grids are designed discourages the use of runtime coordination, as it becomes a scalability bottleneck. As such, modern checkpointing techniques need to provide

The CPPC-G framework

Providing users with a truly transparent fault-tolerant execution service on Grid environments requires not only a portable checkpointing tool such as CPPC, but also an extension of existing Grid middleware functionality. New services have to be defined to transparently manage functionalities such as resource discovery, remote execution and monitoring of applications, detection and restart of failed executions, etc. This section focuses on the most relevant design and

Performance evaluation

For the experimental evaluation of the framework, the impact of the CPPC-G services on an application execution was measured. The experimental setup and deployment of the services are shown in Fig. 3. The FaultTolerantJob service was deployed on a desktop Intel Core Duo E8400 with 1 GB of RAM (Desktop Machine A). The CkptWarehouse service was deployed on another desktop machine, an AMD Athlon 64 3200+ with 1 GB of RAM (Desktop Machine B). The rest of the services were deployed

Case study

As stated in Section 4, the NAS Parallel Benchmarks are a good choice for experimentally evaluating the low-level behavior of the implemented services. However, the usability and practical value of the CPPC-G framework are better tested when confronted with real-world, large-scale applications. The focus of this section is to validate the CPPC-G approach in such an environment, outlining the relevant steps that a user needs to take in order to integrate a large-scale application with CPPC,

Related work

There has been extensive work both on fault tolerance for parallel applications [1], [11] and on Grid reliability [5]. Work related to fault tolerance for parallel applications is referenced throughout Section 2. This section reviews only the work most closely related to CPPC-G, which includes middleware and tools that provide fault tolerance for parallel applications in grids.

As for working groups, there have been a number of initiatives towards achieving fault tolerance on

Conclusions and future work

The CPPC-G framework for the fault-tolerant execution of applications on Grid environments has been described and evaluated. The framework provides new services that can be deployed on any Globus-based grid, extending existing functionality. The CPPC-G services manage remote fault-tolerant executions on behalf of the user, submitting and monitoring the application, making remote backups of checkpoint files and automatically detecting faults, migrating and restarting failed executions. Although

Acknowledgements

This research was supported by the Ministry of Science and Innovation of Spain and FEDER funds of the European Union (Project TIN-2007-67537-C03-02) and by the Galician Government (Consolidation of Competitive Research Groups, Xunta de Galicia ref. 2006/3).

Gabriel Rodríguez received the B.S. (2004), M.S (2004) and Ph.D. (2008) degrees in Computer Science from the University of A Coruña, Spain. Currently he is an Assistant Professor in the Department of Electronics and Systems at the University of A Coruña. His research interests include fault-tolerance for message-passing applications, parallelizing compilers and Grid computing.

References (43)

  • E.N. Elnozahy et al., A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys (2002)
  • G. Rodríguez et al., CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications, Concurrency and Computation: Practice and Experience (2010)
  • G. Rodríguez et al., A fault tolerance solution for sequential and MPI applications on the Grid, Scalable Computing: Practice and Experience (2008)
  • I. Foster, Globus Toolkit version 4: software for service-oriented systems, Journal of Computer Science and Technology (2006)
  • C. Dabrowski, Reliability in Grid computing systems, Concurrency and Computation: Practice and Experience (2009)
  • M.C. Cardoso et al., MPI support on opportunistic grids based on the InteGrade middleware, Concurrency and Computation: Practice and Experience (2010)
  • J. Hursey et al., Interconnect agnostic checkpoint/restart in Open MPI
  • J. Hursey et al., The design and implementation of checkpoint/restart process fault tolerance for Open MPI
  • G. Bosilca et al., MPICH-V: toward a scalable fault tolerant MPI for volatile nodes
  • D. Buntinas et al., Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols, Future Generation Computer Systems (2008)
  • J.P. Walters, V. Chaudhary, Application-level checkpointing techniques for parallel programs, in: ICDCIT’06:...
  • National Center for Supercomputing Applications. HDF-5: File Format Specification. Last accessed April 2010....
  • G. Gibson et al., Failure tolerance in petascale computers, CTWatch Quarterly (2007)
  • E.N. Elnozahy et al., Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing (2004)
  • Y. Chen et al., CLIP: a checkpointing tool for message-passing parallel programs
  • S. Krishnan et al., XCAT3: a framework for CCA components as OGSA services
  • M. Schulz et al., Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs
  • G. Stellner, CoCheck: checkpointing and process migration for MPI
  • G. Rodríguez et al., A heuristic approach for the automatic insertion of checkpoints in message-passing codes, Journal of Universal Computer Science (2009)
  • J.M. Hélary et al., Consistency issues in distributed checkpoints, IEEE Transactions on Software Engineering (1999)
  • G. Rodríguez et al., Controller/precompiler for portable checkpointing, IEICE Transactions on Information and Systems (2006)

Xoán C. Pardo received the B.S. (1994) and M.S. (1995) degrees in Computer Science and the Ph.D. (2004) degree in Computer Engineering from the University of A Coruña, Spain. Currently he is an Associate Professor in the Department of Electronics and Systems at the University of A Coruña. His research interests include fault tolerant distributed systems and Grid and Cloud computing.

María J. Martín is an Associate Professor of Computer Engineering at the University of A Coruña. She earned the B.S. (1993), M.S. (1994) and Ph.D. (1999) degrees in Physics from the University of Santiago de Compostela, Spain. Her major research interests include parallel algorithms and applications, parallelizing compilers, Grid computing and fault-tolerance for message-passing applications.

Patricia González received the B.S. (1996), M.S. (1996) and Ph.D. (2001) degrees in Physics from the University of Santiago de Compostela. Currently she is an Associate Professor in the Department of Electronics and Systems at the University of A Coruña. Her research interests include parallel algorithms and numerical methods for solving dense and sparse systems of equations, prediction and improvement of performance for irregular problems, distributed job management systems, Grid computing and fault-tolerance for message-passing applications.