SERAC3: Smart and economical resource allocation for big data clusters in community clouds

https://doi.org/10.1016/j.future.2018.03.044

Highlights

  • We address the problem of workload heterogeneity by designing a machine learning based application classifier, which can automatically classify different jobs, while diminishing the heterogeneity within groups.

  • We leverage Bayesian Optimization to pick the optimal or a near optimal configuration economically for each group, with just a few samples needed.

  • We propose a quasi-realtime optimizing mechanism, which helps to improve the accuracy and adaptivity of SERAC3’s resource allocating strategy.

Abstract

Big data analysis jobs on clouds have been gaining popularity in recent years. It is critical but challenging to pick the right configuration for an incoming job, since the configuration space is large and the relationship between allocated resources and job performance is not deterministic. In this paper, we propose SERAC3 to allocate resources smartly and economically for big data clusters in community clouds. SERAC3 is a system that can automatically extract representative workloads from incoming big data analysis jobs, smartly decide an optimal configuration for each job, and adjust its assignment strategy in a quasi-realtime mode. With experiments on a community cloud built on OpenStack, we show that on average, SERAC3 can select a configuration within 2.2% of the exact optimal one, while saving about 80.1% of the search cost compared to exhaustive search.

Introduction

In recent decades, big data analysis jobs on clouds have attracted attention from both academia and industry. There are various kinds of big data analysis workloads, involving both traditional ones such as SQL queries, and advanced analysis jobs like deep learning for image recognition. These workloads vary so much from each other that there must be a great difference in resource requirements (e.g., CPU cores, memory size, disk type and size, network, etc.) among them.

It is difficult for a non-expert cloud user to select a suitable configuration, since a great number of configurations are available for executing those workloads. Traditional approaches that assign resources by static rules may lead to low efficiency in both time and resources. Choosing an unsuitable configuration can lead to 12 times more cost for the same performance target [1], while choosing an appropriate configuration can improve performance by up to 1.9x at the same cost [2]. It is thus critical to efficiently recognize the features of real-world applications despite their heterogeneity and to pick the right configuration accordingly. However, given this heterogeneity, it is challenging to provide a solution with high accuracy, high adaptivity and low overhead.

The input workloads can differ greatly from each other, so it is hard to extract a representative workload automatically. Existing work typically focuses on particular applications [2] or on repeatable workloads [1], [3] (e.g., recurring jobs like daily log analysis).

The relationship between cloud resources and performance is non-linear, so it is difficult to model the relationship directly. Moreover, the cloud environment is complex and dynamic, which may lead to further uncertainty in time and cost under the same amount of resources. Therefore, it is important to design a dynamic system that can adapt to the dynamics and maintain accuracy. Prediction-based solutions [1], [2], [4] fail to react in time to the dynamic cloud environment, which may lead to an accumulation of inaccuracy in the long run.

It is always accurate to search the configuration space exhaustively to get the exact optimal configuration. However, the cost and complexity increase exponentially when the configuration space grows. We need a cheaper and simpler way to reach the optimal or a near optimal configuration quickly.
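To make the growth of the search space concrete, the following minimal sketch enumerates a configuration space as the Cartesian product of resource dimensions. The dimension names and values are hypothetical, chosen only so that the product happens to be 40; this excerpt does not list the actual configurations used by SERAC3.

```python
from itertools import product

# Hypothetical resource dimensions; the values are illustrative only.
DIMENSIONS = {
    "cpu_cores": [2, 4, 8, 16],
    "memory_gb": [4, 8, 16, 32, 64],
    "disk_type": ["hdd", "ssd"],
}

def enumerate_configs(dimensions):
    """Return every combination of the given dimension values as a dict."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]

configs = enumerate_configs(DIMENSIONS)
# An exhaustive search needs one trial run per entry: 4 * 5 * 2 combinations here.
print(len(configs))
```

Each additional dimension with k options multiplies the number of trial runs by k, which is why exhaustive search quickly becomes impractical.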

In this paper, we propose SERAC3, a smart and economical resource allocation mechanism for big data clusters in community clouds. SERAC3 can automatically recognize the distinct features of an incoming job, map it to a specific group, smartly select a near optimal configuration for each group with low overhead, and adjust itself dynamically to adapt to the varying cloud environment.
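The mapping from a job to a group can be sketched as a lookup keyed by the application label and a data-size bucket. This is only an illustration under assumptions: the bucket edges, the labels, and the names `size_bucket`, `group_key` and `lookup_config` are hypothetical, and the application label is assumed to come from a separate classifier that is not shown here.

```python
import bisect

# Hypothetical bucket edges in GB; the excerpt does not state the real thresholds.
SIZE_EDGES_GB = [1, 10, 100]
SIZE_LABELS = ["small", "medium", "large", "xlarge"]

def size_bucket(data_gb):
    """Map an input size in GB to a coarse bucket label."""
    return SIZE_LABELS[bisect.bisect_right(SIZE_EDGES_GB, data_gb)]

def group_key(app_label, data_gb):
    """Combine the (assumed) classifier label with the data-size bucket."""
    return (app_label, size_bucket(data_gb))

def lookup_config(group_to_config, app_label, data_gb, default=None):
    """Return the configuration stored for the job's group, or a default."""
    return group_to_config.get(group_key(app_label, data_gb), default)

# Hypothetical table of per-group configurations.
table = {("sql", "large"): "cfg-8core-32gb"}
print(lookup_config(table, "sql", 50))
```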

The key idea of our work is that no single configuration is optimal for all workloads. We therefore design a machine learning based classifier that gathers similar jobs together based on their applications. The groups are further subdivided by data size, so that jobs in the same group are more likely to share the same optimal configuration. Then, for each group of jobs, we use Bayesian Optimization [5] (BO) as the core algorithm, since it works well on black-box optimization problems. With BO, we can reach the optimal or a near optimal configuration with low overhead and high accuracy. To address the dynamic cloud environment, we add a quasi-realtime optimization module to our system. Inspired by the data-driven paradigm, we use the actual data from users' executed jobs to tune the allocation policy online.
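As a rough illustration of how Bayesian Optimization can search a finite configuration list with few trial runs, the sketch below fits a Gaussian process to the costs observed so far and picks the next configuration by expected improvement. It is a generic BO loop under stated assumptions, not SERAC3's implementation: the kernel, its length scale, the stopping threshold, the candidate list, and the cost model `synthetic_cost` are all hypothetical.

```python
import math

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two feature vectors."""
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2 * length ** 2))

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i, bi in enumerate(b):
        y.append((bi - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward_sub(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(X, y, x_new, noise=1e-4):
    """Posterior mean and std of a zero-mean GP at x_new given observations (X, y)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, y))
    k_star = [rbf(a, x_new) for a in X]
    mu = sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(L, k_star)
    return mu, math.sqrt(max(1.0 - sum(t * t for t in v), 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for minimization: expected amount by which a sample beats `best`."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return (best - mu) * cdf + sigma * pdf

def bayes_opt(candidates, features, objective, budget=10, ei_stop=0.01):
    """Search a finite list; returns (best candidate, best cost, number of runs)."""
    n = len(candidates)
    tried = {i: objective(candidates[i]) for i in (0, n // 2, n - 1)}  # initial samples
    while len(tried) < min(budget, n):
        idx = list(tried)
        vals = [tried[i] for i in idx]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
        y = [(v - mean) / std for v in vals]          # standardize costs for the GP
        X = [features(candidates[i]) for i in idx]
        scores = {i: expected_improvement(*gp_posterior(X, y, features(candidates[i])), min(y))
                  for i in range(n) if i not in tried}
        nxt = max(scores, key=scores.get)
        if scores[nxt] < ei_stop:                     # little expected gain left: stop early
            break
        tried[nxt] = objective(candidates[nxt])
    best = min(tried, key=tried.get)
    return candidates[best], tried[best], len(tried)

# Hypothetical candidates and a synthetic cost model, used only to exercise the loop.
CANDIDATES = [(c, m) for c in (2, 4, 8, 16) for m in (4, 8, 16, 32, 64)]

def features(cfg):
    """Scale (cores, memory GB) into roughly [0, 1] on a log scale."""
    cores, mem = cfg
    return (math.log2(cores) / 4, math.log2(mem) / 6)

def synthetic_cost(cfg):
    """Made-up runtime times made-up price; stands in for a real trial run."""
    cores, mem = cfg
    runtime = 400 / cores ** 0.8 + (300 if mem < 16 else 0)
    price = 0.04 * cores + 0.005 * mem
    return runtime * price

best_cfg, best_cost, n_runs = bayes_opt(CANDIDATES, features, synthetic_cost)
print(best_cfg, round(best_cost, 2), n_runs)
```

The loop never evaluates more than `budget` configurations, so the number of trial runs is bounded regardless of the size of the candidate list. A real deployment would replace `synthetic_cost` with an actual run of the representative workload.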

We evaluate SERAC3 in a community cloud deployed on OpenStack [6], with 40 configurations and hundreds of different big data analysis jobs. Results show that on average, SERAC3 can reach a near optimal configuration within 2.2% of the exact optimal one, while saving about 80.1% of the search cost compared to exhaustive search. We also compare SERAC3 with CherryPick [1], and show that SERAC3 performs better than CherryPick while introducing little overhead.
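The two headline figures can be read as simple ratios. The sketch below assumes that "within 2.2% of the exact optimal one" means the relative excess cost over the optimum, and that the saving is measured against the total cost of trying every configuration; these definitions and the input numbers are assumptions for illustration, not values taken from the experiments.

```python
def gap_to_optimal(chosen_cost, optimal_cost):
    """Relative excess cost of the chosen configuration over the exact optimum."""
    return (chosen_cost - optimal_cost) / optimal_cost

def search_cost_saving(search_cost, exhaustive_cost):
    """Fraction of the exhaustive-search cost that was avoided."""
    return 1.0 - search_cost / exhaustive_cost

# Illustrative inputs only, chosen to reproduce the averages quoted above.
print(f"{gap_to_optimal(10.22, 10.00):.1%}")      # 2.2%
print(f"{search_cost_saving(19.9, 100.0):.1%}")   # 80.1%
```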

The major contributions of this paper are as follows:

  • We address the problem of workload heterogeneity by designing a machine learning based application classifier. With this classifier, we can automatically classify different jobs, while diminishing the heterogeneity within groups. Further configuration selection can then be performed at the group level.

  • We leverage Bayesian Optimization [5] to pick the optimal or a near optimal configuration economically at the group level, with just a few samples needed.

  • We propose a quasi-realtime optimizing mechanism, which helps to improve the accuracy and adaptivity of SERAC3’s resource allocating strategy.
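One plausible shape for the quasi-realtime mechanism is a feedback loop that folds the observed cost of each finished job into per-group statistics and revises the group's configuration when the data favors another one. The sketch below is a hypothetical illustration of that idea only: the class `OnlineTuner`, the `min_samples` threshold, and the running-mean rule are assumptions, not the mechanism described in the paper.

```python
from collections import defaultdict

class OnlineTuner:
    """Tracks observed cost per (group, config) and revises each group's choice."""

    def __init__(self, offline_choice, min_samples=3):
        self.choice = dict(offline_choice)           # group -> config from the offline phase
        self.min_samples = min_samples               # observations needed before trusting a mean
        self.stats = defaultdict(lambda: [0, 0.0])   # (group, config) -> [count, mean cost]

    def record(self, group, config, cost):
        """Fold one finished job's cost into the running mean, then re-evaluate."""
        entry = self.stats[(group, config)]
        entry[0] += 1
        entry[1] += (cost - entry[1]) / entry[0]
        self._reselect(group)

    def _reselect(self, group):
        """Switch to the cheapest config that has enough observations, if any."""
        seen = {c: mean for (g, c), (count, mean) in self.stats.items()
                if g == group and count >= self.min_samples}
        if seen:
            self.choice[group] = min(seen, key=seen.get)

# Hypothetical group and configuration names.
tuner = OnlineTuner({"sql-small": "cfg-A"})
for cost in (5.0, 5.2, 5.1):
    tuner.record("sql-small", "cfg-A", cost)
for cost in (4.0, 4.1, 3.9):
    tuner.record("sql-small", "cfg-B", cost)
print(tuner.choice["sql-small"])  # cfg-B
```

Requiring a minimum number of observations before switching keeps a single unusually fast or slow run from flipping the choice.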

The rest of the paper is organized as follows. Section 2 describes the motivation and challenges of designing SERAC3. In Section 3, we depict the architecture of SERAC3 at a high level. In Section 4, we describe the design and implementation details of SERAC3 Offline, a group-based configuration selection module. In Section 5, we describe SERAC3 Online and the details of its online optimization. Section 6 presents the experiments and results with detailed explanations. We discuss the related work in Section 7 and conclude the paper in Section 8.

Section snippets

Motivation scenario

The NIST Definition of Cloud Computing [7] lists four deployment models: Private cloud, Community cloud, Public cloud and Hybrid cloud. In public clouds, a large number of users request resources in the pay-on-demand cloud infrastructure. Private clouds serve a specific group of enterprises and institutions. Community clouds serve a specific community of consumers from organizations that have shared concerns. In hybrid clouds, the cloud infrastructure is a composition of two or more

Design of SERAC3

As discussed above, the challenges we face can be summarized as the following four questions:

  1. How to pick out the representative workloads?

  2. How to make a tradeoff between resource utilization rate and execution efficiency in a community cloud?

  3. How to select an optimal configuration with limited knowledge?

  4. How to adapt to the dynamic environment?

To effectively deal with the above four “How-to”s, we propose a two-phase workflow (the offline and the online phase), as shown in Fig. 2.

SERAC3 Offline

In this section, we discuss the details of the SERAC3 Offline module, as shown in Fig. 2. There are two major components in this module: a job-classifying logic and an offline configuration selecting logic. We will describe them respectively in this section.

SERAC3 Online

The SERAC3 Online module is necessary for the following reasons:

  1. Online tuning is crucial for correcting errors from the offline phase: In the SERAC3 Offline module, we pick the optimal configuration of one representative workload in each group as the optimal configuration of the whole group. However, the grouping produced by the Application Classifier is relatively coarse-grained, and the picked representative workload may not be accurate enough.

  2. The performance under the same amount of

Experiment environment

The experiment platform is a campus cloud supported by the government of Shanghai, which is an OpenStack [6] based community cloud. The users of this platform can be scholars of various fields, such as biology, mathematics and computer science. Currently, the platform holds a capacity of 1 PB (Petabyte), with 0.5 PB for storage and the other 0.5 PB for computing. In the future, the total capacity of the platform can be up to 5 PB, with 4 PB for storage and 1 PB for computing. The goal of SERAC3

Related work

There are three key contributions of SERAC3: (1) automatically picking out the representative workloads, (2) using a data-driven paradigm to effectively select the optimal configuration, and (3) online tuning and optimization. A large body of previous work has addressed some of these aspects. SERAC3 borrows and extends ideas from relevant research and synthesizes them into a smart and economical resource allocation mechanism for big data clusters in community clouds. The key

Conclusion and future work

In this paper, we present SERAC3, which can smartly and economically choose a near optimal configuration for an incoming big data analysis job. Despite the heterogeneity of applications, data size, and the cloud environment, SERAC3 can achieve the goals of high accuracy, low overhead, and high adaptivity. Our experiments on a community cloud built on OpenStack [6] demonstrate that SERAC3 is able to automatically pick the optimal or a near optimal configuration with much lower search cost than

Acknowledgments

The work of this paper is supported by the National Natural Science Foundation of China under Grant No. 61728202 (Research on Internet of Things Big Data Transmission and Processing Architecture based on Cloud-Fog Hybrid Computing Model) and Grant No. 61572137 (Multiple Clouds based CDN as a Service Key Technology Research), and by the Shanghai 2016 Innovation Action Project, China, under Grant 16DZ1100200 (Data-trade-supporting Big Data Testbed).

References (40)

  • Nenavath Srinivas Naik et al., Performance improvement of MapReduce framework in heterogeneous context using reinforcement learning, Procedia Comput. Sci. (2015)
  • Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, Ming Zhang, CherryPick:...
  • Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, Ion Stoica, Ernest: Efficient performance...
  • A. Khan, X. Yan, S. Tao, N. Anerousis, Workload characterization and prediction in the cloud: A multiple time series...
  • Sadeka Islam et al., Empirical prediction models for adaptive resource provisioning in the cloud, Future Gener. Comput. Syst. (2011)
  • Eric Brochu et al., A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning, Comput. Sci. (2010)
  • OpenStack documentation for Ocata (February 2017),...
  • Peter Mell et al., The NIST definition of cloud computing, Commun. ACM (2009)
  • Andrew D. Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, Rodrigo Fonseca, Jockey: guaranteed job latency in data...
  • Sameer Agarwal, Srikanth Kandula, Nicolas Bruno, Ming Chuan Wu, Ion Stoica, Jingren Zhou, Re-optimizing data-parallel...
  • S. Mozer et al., Reinforcement learning: An introduction, IEEE Trans. Neural Netw. (2005)
  • R. Lippmann, An introduction to computing with neural nets, ACM SIGARCH Comput. Archit. News (2003)
  • I. Arel et al., Deep machine learning - a new frontier in artificial intelligence research [research frontier], Comput. Intell. Mag. IEEE (2010)
  • Vinod Nair, Geoffrey E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: International...
  • A. Tuerk, S.J. Young, Polynomial Softmax Functions for Pattern Classification,...
  • Per-second billing for EC2 instances and EBS volumes,...
  • Jasper Snoek, Hugo Larochelle, Ryan P. Adams, Practical Bayesian optimization of machine learning algorithms, in:...
  • D. J. C. Mackay, Introduction to Gaussian processes, Nato. Asi. (2008)
  • Songthip T. Ounpraseuth, Gaussian processes for machine learning (2006)
  • Donald R. Jones et al., Efficient global optimization of expensive black-box functions, J. Global Optim. (1998)

Junnan Li is a master's student at the School of Computer Science, Fudan University. Her major is computer application technology. Her research interests are cloud computing and big data analysis systems.

Zhihui Lu received a Ph.D. degree in computer science from Fudan University in 2004 and is an Associate Professor in the School of Computer Science, Fudan University. He is a member of the IEEE. His research interests are big data architecture, cloud computing and service computing technology, edge computing, and software defined networks.

Wei Zhang is a master's student at the School of Computer Science, Fudan University. His research interests are data-driven optimization, cloud computing, and distributed systems.

Jie Wu is a Professor at the School of Computer Science, Fudan University. He received a Ph.D. degree in computer science from Fudan University in 2008. His research interests are internet technology, big data architecture, service computing, cloud computing, software defined networks and P2P streaming technology.

Hao Qiang is a master's student at the School of Computer Science, Fudan University. His research and programming experience includes cloud computing and big data processing architecture.

Bo Li is a master's student at the School of Computer Science, Fudan University. His research interests include distributed system architecture, cloud computing and blockchain applications.

Patrick C.K. Hung is a Professor at the Faculty of Business and Information Technology at the University of Ontario Institute of Technology. Patrick has been working with Boeing Research and Technology in Seattle, Washington on aviation services-related research projects. He owns a U.S. patent on Mobile Network Dynamic Workflow Exception Handling System with Boeing. His research interests include services computing, big data architecture, business process and security.
