Fault-tolerant virtual cluster experiments on federated sites using BonFIRE

https://doi.org/10.1016/j.future.2013.12.027

Highlights

  • A proposal for a fault-tolerant virtual cluster architecture for the Cloud.

  • A new Elasticity Engine that uses measured application performance.

  • Elasticity of virtual clusters using application performance monitoring.

  • Experimental results on using elasticity to fulfill a Specific Deadline Objective.

  • Fault-tolerance experiment results using BonFIRE’s federated infrastructure.

Abstract

The failure of Cloud sites and the variability of virtual machine (VM) performance in this environment are two issues that software providers have to take into account. If they want to guarantee the on-time return of results to their customers, their virtual infrastructure must be designed to adapt itself to these scenarios. This is especially critical for compute-intensive applications that execute on virtual clusters with a large number of VMs, because they can need hours or days to produce valid results. Changes in performance can mean longer times to produce results and, probably, higher costs. Site failures usually force a restart from the beginning, losing many computing hours. In this paper we present a fault-tolerant virtual cluster architecture that can tackle both issues in the context of compute-intensive bag-of-tasks applications. It includes an Elasticity Engine that uses the application performance to decide about enlarging or reducing the virtual cluster to fulfill the expectations of the final users. The architecture has been tested in three experiments: execution of the application in a multi-site configuration, which showed that it does not suffer any penalty from running in a distributed environment; an experiment on a Specific Deadline Objective, where the Elasticity Engine decides to enlarge the cluster with new VMs to end the simulation on time; and a fault-tolerance test where one part of a distributed virtual cluster is lost and the application performance is restored on the surviving Cloud site using recovery mechanisms and elasticity rules, without interruption of the service.

Introduction

Cloud providers aim to offer a pervasive service to their customers, as other utilities do. However, like the electric grid, they are not free of outages. In fact, in the summer of 2012, two of the biggest Cloud providers in the market suffered service interruptions in some areas [1], [2], [3]. As a consequence, some of their customers’ services went down totally or partially. Because these interruptions were at the site level, the only way to mitigate their impact is either to deploy services on several sites of the same Cloud provider or to use more than one provider. In both cases, a fault-tolerant architecture is mandatory. In addition to this issue, Software-as-a-Service providers that use these big Cloud infrastructures must tackle the possible variability in application performance due to many causes, such as changes in the assigned hardware between deployments or sharing the infrastructure with other customers. In fact, Schad et al. [4] have observed performance variance in Amazon EC2, even finding two different levels of performance for the same configuration of their virtual infrastructure when measured at different times. In this case, a possible strategy to maintain the quality of the service is to exploit the elasticity of the Cloud to self-adapt to this undesired variability. This means using a key application performance indicator to trigger the enlargement of the backend virtual infrastructure when needed, or its reduction when it is idle, to avoid unnecessary costs; a minimal sketch of such a rule follows below. A set of applications that can benefit from this Cloud horizontal elasticity is those that can be divided into several independent tasks, such as Monte Carlo simulations or engineering parametric studies. For Software-as-a-Service providers of these applications it is important to return the results on time, with some level of guarantee, and at the lowest possible cost. In some cases, they may have Service Level Agreements (SLAs) with their customers to return the results before a deadline. Examples of such services are executions of ensembles of atmospheric numerical solutions for operational weather forecasting, engineering simulations for product design optimization, or Monte Carlo simulations of clinical radiotherapy treatments.
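To make this strategy concrete, the following minimal sketch shows one possible deadline-driven scaling rule for a bag-of-tasks workload. It is our illustrative example, not the Elasticity Engine described later in this paper; the function name scaling_decision and all parameters are hypothetical.

# Minimal sketch of a performance-driven elasticity rule for a
# bag-of-tasks workload. All names and values are illustrative
# assumptions, not the paper's actual Elasticity Engine.

import math

def scaling_decision(tasks_remaining: int,
                     tasks_per_hour_per_vm: float,
                     current_vms: int,
                     hours_to_deadline: float) -> int:
    """Return the change in VM count (+n to enlarge, -n to reduce, 0 to keep).

    The key performance indicator is the measured task throughput per VM;
    the objective is finishing all remaining tasks before the deadline.
    """
    if hours_to_deadline <= 0 or tasks_per_hour_per_vm <= 0:
        raise ValueError("deadline already passed or no throughput measured")

    # VMs needed so that the remaining work fits in the remaining time.
    needed = math.ceil(tasks_remaining / (tasks_per_hour_per_vm * hours_to_deadline))

    if needed > current_vms:
        return needed - current_vms   # enlarge: at risk of missing the deadline
    if needed < current_vms:
        return needed - current_vms   # reduce: idle capacity costs money
    return 0                          # keep the current size

# Example: 600 tasks left, each VM completes 20 tasks/hour, 4 VMs, 5 hours left.
# 600 / (20 * 5) = 6 VMs needed, so the rule asks for 2 more.
print(scaling_decision(600, 20.0, 4, 5.0))   # -> 2

The key design point is that the trigger is an application-level indicator (task throughput) rather than a low-level metric such as CPU load, which is what lets the rule react directly to the performance variability described above.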

In this paper we present a fault-tolerant virtual cluster architecture that can tackle both performance changes and site failures in the context of compute-intensive bag-of-tasks applications. It includes an Elasticity Engine that uses the application performance to decide about enlarging or reducing the virtual cluster to fulfill the expectations of the final users. To validate the design, the architecture has been tested in three experiments: execution of the application in a multi-site configuration; an experiment on a Specific Deadline Objective, where the Elasticity Engine decides to enlarge the cluster with new VMs to end the simulation on time; and a fault-tolerance test where one part of a distributed virtual cluster is lost and the application performance is restored on the surviving Cloud site using recovery mechanisms and elasticity rules, without interruption of the service. The presented architecture is well suited to cases where one virtual cluster is allocated to a single execution of the application for one customer, without sharing it among customers or other applications, and is deployed on demand on one or several Cloud providers. It is completely autonomous, adapting itself to changes in the Cloud provider infrastructure.

The remainder of this paper is divided into five sections. First, technical details about the proposed virtual cluster architecture, elasticity management using application performance, the use case, and the experiment infrastructure are presented. The next section describes the experiments executed to check the proposed architecture, followed by a brief review of related work. Finally, the last two sections summarize the conclusions and describe future work.

Section snippets

Virtual Cluster Architecture

The full virtual cluster (VC) architecture comprises two main nodes (“master” and “shadow”) and two sets of computing nodes (CEs); each set is associated with one of the main nodes (see Fig. 1). The VC is thus split into two partitions, which can be deployed at different locations. To build it, two different VM images have to be configured: one for the master and shadow nodes, and another one for the CEs. The dependencies between nodes, the location of each partition, and other technical…
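As an illustration of this two-partition layout, the following sketch describes the topology in code. The class names, site identifiers, image names, and VM counts are hypothetical placeholders, not the paper’s actual deployment descriptors.

# Sketch of the two-partition virtual cluster topology described above:
# a "master" partition and a "shadow" partition, each with its own set
# of computing elements (CEs), possibly hosted at different Cloud sites.
# Site names, image names and VM counts are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Partition:
    site: str                          # Cloud site hosting this partition
    head_node: str                     # "master" or "shadow"
    ce_count: int                      # number of computing-element VMs
    head_image: str = "head-image"     # VM image shared by master and shadow
    ce_image: str = "ce-image"         # VM image shared by all CEs

@dataclass
class VirtualCluster:
    partitions: list = field(default_factory=list)

    def total_ces(self) -> int:
        return sum(p.ce_count for p in self.partitions)

# Two partitions deployed on two (hypothetical) federated sites.
vc = VirtualCluster(partitions=[
    Partition(site="site-a", head_node="master", ce_count=8),
    Partition(site="site-b", head_node="shadow", ce_count=8),
])
print(vc.total_ces())   # -> 16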

Experiments

We have executed three experiments to check that the defined and implemented architecture is valid for its design intent: a multi-site deployment, a Specific Deadline Objective case, and finally a fault-tolerance experiment. The experiments are thus proof-of-concept tests and allow us to gain knowledge for future improvements. Before executing the experiments, a parameter selection for the Elasticity Engine was carried out, as explained in the next subsection.

Related work

Tordsson et al. [19] have defined a virtual cluster architecture that can be deployed in a multi-site Cloud environment using a Cloud brokering mechanism. They demonstrated that using multiple Clouds for executing HTC workloads can be more favorable in performance and cost than single-site Cloud usage. However, their virtual cluster architecture does not support fault-tolerance as the one proposed in this paper does. Montero et al. [20] also defined a generic cluster architecture using virtual…

Conclusions

This paper has described a general virtual cluster architecture that supports the execution of bag-of-tasks applications. This architecture can be deployed in several different modes on single and multi-site Cloud IaaS platforms: single site; distributed sites; and fault-tolerant single and distributed sites. The architecture is complemented with an Elasticity Engine that uses the performance of the application as the trigger for the decision to enlarge or reduce the cluster size. Three experiments…

Future work

There are several issues that can be improved in the presented model. First, we have to find a solution to prevent the cluster from entering recovery mode when only the connection between sites is lost; the sketch below illustrates the ambiguity. Currently, both master and shadow will interpret this situation as a failure of the other site, generating two different virtual clusters which will apply the elasticity rules independently. The solution up to now needs manual intervention from an external operator, who must decide to stop one…
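The split-brain behavior described above can be made concrete with a minimal sketch, assuming a simple heartbeat-timeout failure detector (the class and timeout below are hypothetical, not the paper’s implementation): a partition that stops receiving heartbeats cannot tell whether the peer site failed or only the inter-site link did, so both sides independently enter recovery mode.

# Illustrative sketch of the split-brain ambiguity described above.
# From one partition's point of view, "peer site down" and "inter-site
# link down" are indistinguishable: both just look like missing
# heartbeats. All names and timeouts are hypothetical assumptions.

import time

HEARTBEAT_TIMEOUT = 30.0   # seconds without a heartbeat before recovery

class PartitionMonitor:
    def __init__(self, name: str):
        self.name = name
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def check_peer(self) -> str:
        elapsed = time.monotonic() - self.last_heartbeat
        if elapsed > HEARTBEAT_TIMEOUT:
            # Cannot tell WHY heartbeats stopped: the peer may be dead,
            # or only the link may be down. Each side enters recovery,
            # producing two independent clusters (split brain).
            return self.name + ": peer unreachable, entering recovery mode"
        return self.name + ": peer alive"

# If only the link fails, BOTH monitors time out and both recover:
master = PartitionMonitor("master")
shadow = PartitionMonitor("shadow")
master.last_heartbeat -= HEARTBEAT_TIMEOUT + 1   # simulate missing heartbeats
shadow.last_heartbeat -= HEARTBEAT_TIMEOUT + 1
print(master.check_peer())   # master enters recovery...
print(shadow.check_peer())   # ...and so does shadow: split brain

Common remedies, such as a quorum or an external witness node reachable from both sites, require additional infrastructure; which of them fits this architecture best is part of the future work mentioned here.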

Acknowledgments

The research leading to these results has received funding from the European Commission’s Seventh Framework Programme (FP7/2007–2013) under grant agreement number 257386. The authors want to acknowledge the support of the BonFIRE team, especially Maxence Dunnewind (INRIA), Michael Gienger (HLRS), Ally Hume (EPCC), Kostas Kavoussanakis (EPCC), and David García Pérez (Atos Research).

References (23)

  • J. Tordsson et al., Cloud brokering mechanisms for optimized placement of virtual machines across multiple providers, Future Gener. Comput. Syst. (2012)
  • R.S. Montero et al., An elasticity model for high throughput computing clusters, J. Parallel Distrib. Comput. (2011)
  • Amazon Failure Summer 2012, Online, cited 03.12.2012. URL:...
  • Amazon Failure, Online, cited 03.12.2012. URL:...
  • Azure outage summer 2012, Online, cited 16.09.2012. URL:...
  • J. Schad et al., Runtime measurements in the cloud: observing, analyzing, and reducing variance, Proc. VLDB (2010)
  • DRBD, Online, cited 03.12.2012. URL:...
  • OCFS2, Online, cited 02.12.2012. URL:...
  • Open Grid Scheduler, Online, cited 03.12.2012. URL:...
  • A. Hume, Y. Al-Hazmi, B. Belter, BonFIRE: a multi-cloud test facility for Internet of services experimentation, in: 8th...
  • BonFIRE project Website, Online, cited 03.12.2012. URL:...
A. Gómez is the Projects and Applications manager at CESGA and holds a Ph.D. in Physics from the University of Santiago de Compostela. He has worked for several industrial and IT companies, mainly in distributed systems programming, design, and management. Since 2001 he has been working at CESGA, where he has participated in several IST European research projects. He has published more than 60 technical and scientific publications in journals and conferences. His research interests are focused on the Cloud for HPC, the performance of parallel applications, the development of medical physics software tools, and the improvement of the usability of computing resources.

L.M. Carril is a Telecommunication Engineer (University of Vigo, 2007) with an M.Sc. in High Performance Computing (University of Santiago de Compostela, 2012); he has worked on several projects at CESGA and the University of Santiago de Compostela, including national and European projects in Cloud computing. Currently he is engaged in research at the Karlsruhe Institute of Technology (KIT) on error detection in multicore parallel programming.

    R. Valin received an M.Sc. degree in Electronic Physics from the University of Santiago de Compostela, Spain, in 2006 and a Ph.D. degree in Electronics and Computer Science in 2011. In 2007 he joined the Departamento de Electrónica y Computación of the same university, where he worked towards his Ph.D. in the field of parallelization and optimization of numerical simulators of semiconductor devices on advanced architectures and its application to variability studies on SOI devices. He was a visiting post-graduate student at the Edinburgh Parallel Computing Centre (EPCC), UK, in 2011. In 2012 he moved to the Supercomputing Centre of Galicia (CESGA) where he worked on the BonFIRE project. His research interests include parallelization and optimization of numerical simulators of semiconductor devices, grid and cloud computing.

J.C. Mouriño graduated in Computer Science in 2000 from the University of A Coruña. He holds a Ph.D. in Computer Science from the same university (2006). Between 1999 and 2005 he worked for the University of A Coruña and the University of Santiago de Compostela as a researcher on several R&D projects. He works at CESGA as an Applications Senior Technician. His research interests include parallel and distributed computing, Cloud computing, and HPC in the Cloud.

C. Cotelo is an Applications Senior Technician at the Galicia Supercomputing Centre (CESGA) and a Ph.D. student in Electronics and Computer Science at the University of Santiago de Compostela (USC). Since 2001 she has participated in several RTD projects related to Cloud, Grid, and scientific applications. Her research interests are focused on application parallelization, science gateways and Clouds, the performance prediction of parallel applications, and scheduling techniques in Grid and Cloud environments.