An elasticity model for High Throughput Computing clusters

https://doi.org/10.1016/j.jpdc.2010.05.005

Abstract

Different methods have been proposed to dynamically provide scientific applications with execution environments that hide the complexity of distributed infrastructures. Recently, virtualization has emerged as a promising technology to provide such environments. In this work we present a generic cluster architecture that extends the classical benefits of virtual machines to the cluster level, thus providing cluster consolidation, cluster partitioning and support for heterogeneous environments. Additionally, the capacity of the virtual clusters can be supplemented with resources from a commercial cloud provider. The performance of this architecture has been evaluated on High Throughput Computing workloads. Results show that, in spite of the overhead induced by the virtualization and cloud layers, these virtual clusters constitute a feasible and well-performing HTC platform. Additionally, we propose a performance model to characterize these variable-capacity (elastic) cluster environments. The model can be used to dynamically dimension the cluster with cloud resources according to a fixed budget, or to estimate the cost of completing a given workload in a target time.
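As a hedged illustration of the two use cases the abstract names, dimensioning the cluster for a target completion time and estimating the cost of a run, the following sketch assumes the simplest elasticity model consistent with the highlights: throughput grows linearly with the number of cloud worker nodes. This is not the paper's model; all function names, parameters and figures are illustrative assumptions.

```python
import math

# Sketch of the two use cases stated in the abstract, under the simplifying
# assumption that cluster throughput scales linearly with the number of
# cloud worker nodes. Names and figures are illustrative, not the paper's.

def dimension_cluster(tasks, target_hours, tasks_per_node_hour):
    """Smallest number of worker nodes that completes `tasks` in `target_hours`."""
    return math.ceil(tasks / (tasks_per_node_hour * target_hours))

def workload_cost(nodes, target_hours, price_per_node_hour):
    """Cost of renting `nodes` cloud nodes for `target_hours`."""
    return nodes * target_hours * price_per_node_hour

# Example: 10,000 tasks, a 10-hour deadline, 50 tasks/node/hour,
# and a hypothetical price of $0.10 per node-hour.
nodes = dimension_cluster(10_000, 10, 50)   # 10,000 / 500 tasks-per-node -> 20
cost = workload_cost(nodes, 10, 0.10)       # 20 nodes * 10 h * $0.10 -> $20.00
```

In practice the paper's model would replace the linear rate with a measured characterization that accounts for virtualization and cloud overheads; the structure of the calculation stays the same.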

Research highlights

  • Classical HTC cluster architectures can be used in Public and Hybrid Clouds.
  • Virtualization and communication overheads can be neglected for HTC computations.
  • Performance of HTC clusters grows linearly with the number of Cloud worker nodes.
  • Cloud Computing is a cost-effective solution for HTC workloads.
  • Clouds and virtualization deliver efficient, flexible and elastic HTC clusters.

Introduction

The introduction of highly distributed computing paradigms, such as Grid Computing, has brought unprecedented computational power to broad areas of the scientific community. An important research effort has been devoted to effectively delivering these raw processing resources to scientific applications. Usually, however, the characteristics of the distributed environment, such as its heterogeneity or dynamism, hinder the efficient use of the infrastructure.

One of the methods proposed in the literature to address these problems consists of overlaying a custom software stack on top of the existing middleware layer. For example, Walker et al. developed MyCluster [29], which creates a Condor or Sun Grid Engine cluster on top of TeraGrid services. Another example is the Falkon [22] system, which provides a lightweight high-throughput execution environment on top of the Globus GRAM service.

Additionally, several projects have investigated partitioning a distributed infrastructure to dynamically provide independent customized clusters. Jeffrey Chase et al., from Duke University, describe [5] a cluster management system called COD (Cluster On Demand) that dynamically allocates servers from a common pool to multiple virtual clusters. Similarly, the VIOcluster [24] project makes it possible to dynamically adjust the capacity of a computing cluster by sharing resources between peer domains.

Recently, the dramatic performance improvements in hypervisor technologies have made it possible to experiment with virtual machines (VM) as basic building blocks for flexible computational platforms. The first works in this area integrated resource management systems with VMs to provide custom execution environments on a per-job basis, see, for example, the works of Emeneker et al. [8] and Fallenbeck et al. [10].

Virtualization has also brought about a new computing model, called cloud computing, for the on-demand provision of virtualized resources as a service. The Amazon Elastic Compute Cloud (Amazon EC2 [2]) and the work by Wolski et al. [19] are probably the best examples of this new paradigm for providing elastic capacity. This paradigm is also being studied in the Reservoir [23] project to build a holistic service management platform over a federated infrastructure of cloud providers. Finally, cloud computing has also been studied by Freeman et al. to deliver on-demand clusters in the context of the Virtual Workspace Service (VWS) [11].

In this paper we propose a flexible and generic cluster architecture that combines virtual machines and cloud computing to dynamically deliver heterogeneous computational environments. Moreover, the introduction of a new virtualization layer between the computational environments and the physical infrastructure makes it possible to adjust the capacity allocated to each environment and to supplement it with resources from an external cloud provider. In this way, the architecture proposed here can be effectively used to deliver flexible and elastic HTC environments that integrate seamlessly with current applications and software tools.

Probably the most important obstacle to the adoption of virtualization-based computational solutions is the potential performance degradation that scientific applications may suffer. In this work, we will show that this overhead can be neglected for a wide range of High Throughput Computing (HTC) applications. To this end, we will characterize the performance of these virtual cluster environments in the execution of a benchmarking HTC application. Additionally, we propose a model to predict their performance when additional capacity from the cloud is allocated to the virtual cluster.

The rest of the paper is organized as follows: first, in Section 2, we describe the main characteristics of the proposed cluster architecture. Then, in Section 3, we analyze the impact of using virtualized and cloud resources in several cluster systems. In Section 4, we present a performance model for elastic cluster architectures. Section 5 summarizes the main conclusions of our work.

Section snippets

Elastic management of computing clusters

Medium-size computing clusters have been used for decades, by institutions of all kinds, as an affordable and reasonably performing computational platform. Consequently, these architectures and the related software tools have been considerably improved over the years. However, as the use of clusters increased and users started demanding more functionality, some limitations of these platforms became evident:

  • support for heterogeneous configurations, usually applications from different

Performance evaluation of the computing cluster infrastructure

In this section we evaluate the previous architecture by studying the effect of virtualizing the worker nodes, and deploying them in the cloud, in several cluster subsystems. In particular, we consider a cluster based on the Sun Grid Engine (SGE) local resource manager, where the front-end acts as the SGE master and provides NFS and NIS services for every worker node. The cloud worker nodes are deployed in the US West availability zone of the Amazon EC2, while the front-end and local worker

Performance evaluation of elastic computing clusters

In the previous section we have analyzed the impact of deploying worker nodes in the EC2 cloud in several cluster subsystems. In this section we extend these results from the application point of view. The goal of this study is to incorporate application-level metrics to the structural analysis presented above. First, we determine the raw overhead imposed by the virtualization layer. Then, we quantify the effect of deploying part of the cluster in the cloud. Finally, we propose an elasticity
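Although the snippet above is truncated, the stated purpose of the elasticity proposal suggests an inverse use case to deadline-driven dimensioning: given a fixed budget, how many cloud worker nodes can be afforded, and what makespan do they yield? The sketch below assumes a simple linear-scaling throughput model; it is not the paper's model, and all names and figures are hypothetical.

```python
import math

# Budget-constrained sketch under an illustrative linear-scaling assumption:
# a fixed budget caps the number of rentable cloud worker nodes, which in
# turn determines the achievable makespan. Figures are hypothetical.

def affordable_nodes(budget, price_per_node_hour, hours):
    """Nodes that can be kept running for `hours` without exceeding `budget`."""
    return math.floor(budget / (price_per_node_hour * hours))

def estimated_makespan(tasks, nodes, tasks_per_node_hour):
    """Hours to drain `tasks` with `nodes` workers at a fixed per-node rate."""
    return tasks / (nodes * tasks_per_node_hour)

# Example: a $20 budget over a 40-hour run at $0.10/node-hour buys 5 nodes,
# and 5 nodes drain 10,000 tasks at 50 tasks/node/hour in exactly 40 hours.
nodes = affordable_nodes(20.0, 0.10, 40)
hours = estimated_makespan(10_000, nodes, 50)
```

A measured per-node rate (one that folds in the virtualization and cloud overheads quantified in this section) would slot into `tasks_per_node_hour` without changing the structure of the estimate.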

Conclusions

In this work we have presented an elastic architecture for clusters that allows a flexible management of these computing platforms by: (i) supporting the execution of heterogeneous application domains; (ii) dynamically partitioning the cluster capacity, adapting it to variable demands; and (iii) efficiently isolating the cluster workloads. Moreover, this architecture is able to transparently grow the cluster’s capacity using an external cloud provider.

We have evaluated this architecture in the

Acknowledgments

The authors would like to thank Javier Fontán and Tino Vázquez for their support in the development of the present work.


References (30)

  • R.S. Montero et al., Benchmarking of high throughput computing applications on grids, Parallel Comput. (2006)
  • K. Adams, O. Agesen, A comparison of software and hardware techniques for x86 virtualization, in: Proceedings of ASPLOS...
  • Amazon Elastic Compute Cloud,...
  • D.H. Bailey et al., The NAS parallel benchmarks, J. Supercomputer Applications (1991)
  • P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, A. Warfield, Xen and the art of...
  • J. Chase, D. Irwin, L. Grit, J. Moore, S. Sprenkle, Dynamic virtual clusters in a grid site manager, in: Proceedings of...
  • B. Clark, T. Deshane, E. Dow, S. Evanchik, M. Finlayson, J. Herne, J.N. Matthews, Xen and the art of repeated research, in: USENIX...
  • Y. Dong, J. Dai, Z. Huang, H. Guan, K. Tian, Y. Jiang, Towards high-quality I/O virtualization, in: Proceedings of...
  • W. Emeneker, D. Jackson, J. Butikofer, D. Stanzione, Dynamic virtual clustering with Xen and Moab, in: Proc. of the...
  • C. Evangelinos, C. Hill, Cloud computing for parallel scientific HPC applications: feasibility of running coupled...
  • N. Fallenbeck, H. Picht, M. Smith, B. Freisleben, Xen and the art of cluster scheduling, in: First International...
  • T. Freeman, K. Keahey, Flying low: simple leases with workspace pilot, in: Proceedings of the Euro-Par,...
  • T. Freeman, K. Keahey, Contextualization: providing one-click virtual clusters, in: Proceedings of the eScience08...
  • M.A. Frumkin et al., NAS grid benchmarks: a tool for grid space exploration, J. Cluster Computing (2002)
  • S. Garfinkel, An evaluation of Amazon's grid computing services: EC2, S3, and SQS, Center for Research on Computation...

Ruben S. Montero, Ph.D. is an associate professor in the Department of Computer Architecture at Complutense University of Madrid. In the past, he has held several visiting positions at ICASE (NASA Langley Research Center, VA). Over the last years, he has published more than 70 scientific papers in the field of High-Performance Parallel and Distributed Computing, and contributed to more than 20 research and development programmes. He is also heavily involved in organizing the Spanish e-science infrastructure as a member of the infrastructure expert panel of the national e-science initiative. His research interests lie mainly in resource provisioning models for distributed systems, in particular: Grid resource management and scheduling, distributed management of virtual machines and cloud computing. He is also actively involved in several open source grid initiatives like the Globus Toolkit, the GridWay metascheduler and OpenNebula.

Rafael Moreno-Vozmediano received the M.S. degree in Physics and the Ph.D. degree from the Universidad Complutense de Madrid (UCM), Spain, in 1991 and 1995 respectively. In 1991, he joined the Department of Computer Science of the UCM, where he worked as a Research Assistant and Assistant Professor until 1997. Since 1997 he has been an Associate Professor of Computer Science and Electrical Engineering at the Department of Computer Architecture of the UCM, Spain. He has about 17 years of research experience in the fields of High-Performance Parallel and Distributed Computing, Grid Computing and Virtualization.

Ignacio M. Llorente has a graduate degree in Physics (B.S. in Physics and M.S. in Computer Science), a Ph.D. in Physics (Program in Computer Science) and an Executive Master's in Business Administration. He has about 15 years of research experience in the field of High Performance Parallel and Distributed Computing, Grid Computing and Virtualization. Currently, he is a Full Professor in Computer Architecture and Technology at Universidad Complutense de Madrid, where he leads the Distributed Systems Architecture Group.

This research was supported by Consejería de Educación of Comunidad de Madrid, Fondo Europeo de Desarrollo Regional (FEDER) and Fondo Social Europeo (FSE) through MEDIANET Research Program S2009/TIC-1468; by Ministerio de Ciencia e Innovación of Spain through research grant TIN2009-07146; and European Union through the research project RESERVOIR Grant Agreement 215605.
