A simulation framework for priority scheduling on heterogeneous clusters

https://doi.org/10.1016/j.future.2015.04.008

Highlights

  • A trace-driven cluster management framework for priority scheduling is proposed.

  • The tradeoff between evictions and the response times of priority classes is analyzed.

  • The effects of different eviction policies are explored by trace-driven simulations.

  • A workload-aware system is proposed to improve the response times of heterogeneous workloads.

Abstract

Executing heterogeneous workloads with different priorities, resource demands and performance objectives is one of the key operations for today’s data centers to increase resource as well as energy efficiency. In order to meet the performance objectives of diverse workloads, schedulers rely on evictions, even at the cost of resources wasted on the lost executions of evicted tasks. It is not straightforward to design priority schedulers that capture the key aspects of workloads and systems while striking a balance between resource (in)efficiency and application performance. To explore the large design space of such schedulers, we propose a trace-driven cluster management framework that models a comprehensive set of system configurations and general priority-based scheduling policies. In particular, we focus on the impact of task evictions on resource inefficiency and on the task response times of multiple priority classes, driven by a Google production cluster trace. Moreover, as a use case, we propose a system design that exploits workload heterogeneity and introduces workload-awareness into the system configuration and task assignment.

Introduction

A common approach in today’s data centers is to execute different applications on separate clusters dimensioned for the peak load in order to meet application-specific service level objectives (SLOs). Consequently, the systems suffer from low resource utilization. To improve system resource efficiency, hosting multiple types of applications on the same cluster is often sought, but meanwhile the system complexity, particularly that of the schedulers, greatly increases [1].

Many different schedulers have been proposed in the literature [2], [3], [4] with the main focus of increasing energy efficiency by consolidating workload and minimizing the number of active servers. Most of them assume rather homogeneous systems and workloads, i.e., servers with the same capacities and tasks with similar resource demands and priorities. However, workloads executed on real systems are highly heterogeneous and have diversified resource usages, SLOs and class privileges [5], [6], which makes existing solutions suboptimal. Moreover, to meet the SLOs under heterogeneous workloads, schedulers rely on task evictions. Evictions are often overlooked by prior work, although they are a major source of resource inefficiency. Task evictions are triggered by the scheduler due to task congestion, reservation excess or hardware failure. Depending on the system, the execution progress of a task is either suspended or lost at the time of eviction. In particular, non-resume systems, which do not preserve task state during eviction, suffer from significant wasted resources under bursty arrivals [7]. Overall, it is hard to design scheduling policies that capture the complex system and workload characteristics and optimize the tradeoff between response times and the resource (in)efficiency due to evictions.
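To make the eviction mechanism concrete, the sketch below shows how a preemptive eviction in a non-resume system could work. This is a minimal illustration with hypothetical names (`Task`, `Server`, `try_dispatch`), not the scheduler modeled in this paper:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    tid: int
    priority: int           # higher value = higher priority
    work_done: float = 0.0  # progress lost on eviction in a non-resume system

@dataclass
class Server:
    slots: int
    running: list = field(default_factory=list)

def try_dispatch(server, task):
    """Place `task` on `server`; evict the lowest-priority task if full."""
    if len(server.running) < server.slots:
        server.running.append(task)
        return None
    victim = min(server.running, key=lambda t: t.priority)
    if victim.priority < task.priority:
        server.running.remove(victim)
        victim.work_done = 0.0   # non-resume: all progress is wasted
        server.running.append(task)
        return victim            # evicted task, to be re-queued
    return task                  # could not be placed
```

Note how a non-resume eviction zeroes the victim's accumulated work: every resource already spent on it is lost, which is exactly the inefficiency the paper quantifies.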

Motivated by the lack of a framework to analyze the system performance with different scheduling policies under highly heterogeneous workloads and evictions, we propose a new cluster management framework. Our objective is twofold:

  • (i)

    to provide a platform to analyze and better understand system behavior under different workload conditions, system settings, scheduling policies and eviction strategies, making it easier to propose new energy-efficient systems with improved task response times;

  • (ii)

    to quantify the performance of workload-aware and workload-agnostic systems in order to achieve greenness for highly heterogeneous workloads.

In the subsequent sections, we investigate the main causes of resource “inefficiency” in current systems in order to motivate the design decisions.

We first pinpoint the impact of priority scheduling and task eviction on wasted resources using the Google Cluster trace [8], which contains a rich heterogeneous mix of workloads running on a large heterogeneous cluster for 29 days. For more details about the trace, we refer the interested reader to the characterization studies [9], [5], which present overall task scheduling statistics.

We focus on eviction events, which are recorded in the task events, task resource usage, and machine events tables. Unfortunately, the detailed reasons for such events are not provided in the trace. Our prior characterization study qualitatively shows that priority is the main cause of eviction, while insufficient resources, such as memory, contribute to a small number of eviction events [7]. Our 7-day trace analysis identifies two main causes of eviction: across all evictions, around 95% are priority evictions and 2% are memory evictions; the rest occur due to either machine failure or disk excess. Evicted tasks can be rescheduled; however, we observe that more than 43% of evicted tasks experience subsequent evictions, i.e., are evicted at least twice. As a result, only 38.09% of the evicted tasks across all priorities could successfully complete their execution [7]. The low percentage of successful executions demonstrates the strong negative impact of evictions on the task success rate, which in turn leads to wasted resources. Indeed, all resources spent on an unsuccessful task execution are wasted. The central role of prioritization is clearly emphasized by the distribution of evictions across priorities: low-priority tasks are evicted more frequently, resulting in high failure rates and wasted resources, whereas high-priority tasks are rarely preempted.
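Statistics of this kind can be distilled from a simplified event log; a minimal sketch, assuming records reduced to `(task_id, event_type)` pairs (the real trace schema is considerably richer):

```python
from collections import Counter

def eviction_stats(events):
    """Compute repeat-eviction and success fractions among evicted tasks.

    events: iterable of (task_id, event_type) records, e.g. distilled
    from the task events table; event types simplified to strings.
    """
    evictions = Counter(tid for tid, etype in events if etype == "EVICT")
    finished = {tid for tid, etype in events if etype == "FINISH"}
    evicted = set(evictions)
    repeat = sum(1 for tid in evicted if evictions[tid] >= 2)
    success = len(evicted & finished)
    n = len(evicted)
    return {
        "evicted_tasks": n,
        "repeat_eviction_frac": repeat / n if n else 0.0,
        "success_frac": success / n if n else 0.0,
    }
```

On the full trace, `repeat_eviction_frac` and `success_frac` would correspond to the 43% and 38.09% figures reported above.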

Motivated by the significant negative impact of evictions, and in order to better understand and reduce the resource inefficiency of priority scheduling, we propose detailed eviction models, presented in Section 3.3.3 (priority eviction model) and Section 3.3.4 (memory eviction model).

Another significant cause of “inefficiency” is resource overbooking, where the scheduler allocates resources according to user-set task requirements. However, users usually overestimate the resource usage of tasks. Hence, the cluster looks full even though the actual usage is far below the resource reservations. In this case, the illusion of a full cluster triggers unnecessary evictions, which affect the system negatively. The Google Cluster trace analysis shows a heavily overbooked system: the total resource reservations at almost any time account for more than 80% of the cluster memory capacity and more than 100% of the cluster CPU capacity [5]. However, the measured overall usage is much lower: averaging over one-hour time windows, memory usage does not exceed 50% of the cluster capacity and CPU usage does not exceed 60%. Users usually overestimate their resource requirements and over-provision to guarantee successful service execution, because a task that exceeds its resource requirements is automatically evicted by the scheduler. Hence, Google clusters are usually fully booked even though the actual resource utilization is only about half of the resource reservations.
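This overbooking pattern (reservations near capacity while measured usage stays low) can be flagged per time window; a sketch under assumed inputs of per-window totals, with illustrative thresholds motivated by the figures cited above (the function name is hypothetical):

```python
def illusion_of_full(reservations, usages, capacity,
                     booked_thresh=0.8, used_thresh=0.5):
    """Return indices of time windows where the cluster *looks* full
    (total reservations above booked_thresh of capacity) while the
    measured usage stays below used_thresh -- the overbooking pattern
    observed in the Google trace.

    reservations, usages: per-window totals (e.g. one-hour averages).
    """
    flagged = []
    for i, (r, u) in enumerate(zip(reservations, usages)):
        if r / capacity > booked_thresh and u / capacity < used_thresh:
            flagged.append(i)
    return flagged
```

Every flagged window is a period in which evictions may fire even though physical resources are actually available.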

To overcome this problem, we use a dynamic, elastic resource allocation scheme, namely slot-based resource assignment, where the number of tasks concurrently running on a server is limited by the number of slots. The slot-based system has the advantage that it does not require user-defined resource demands. Furthermore, since we do not specifically limit the resource usage of individual tasks, our approach does not suffer from resource fragmentation and overbooking. The details of the resource allocation schemes are explained in Section 3.3.1.
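A minimal sketch of slot-based assignment, assuming a join-the-least-loaded rule over free slots (the actual assignment policies are those of Section 3.3.1; the function name and the tie-breaking rule here are hypothetical):

```python
def pick_server(servers):
    """Slot-based assignment: a server can accept a task whenever it has
    a free slot; no per-task resource reservation is needed.

    servers: list of (running_count, slot_count) pairs.
    Returns the index of the server with the most free slots,
    or None if every server is full.
    """
    best, best_free = None, 0
    for i, (running, slots) in enumerate(servers):
        free = slots - running
        if free > best_free:
            best, best_free = i, free
    return best
```

Because admission depends only on slot counts, user resource estimates never enter the decision, which is precisely what removes the overbooking problem.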

We first analyze the Google Cluster trace to better understand the challenges in developing an effective cluster scheduler. Due to the lack of public information about the internals of the Google cluster scheduler, we turn our attention to trace analysis in order to investigate the inefficiencies and extract some working principles of the Google scheduler. The most remarkable properties revealed by the trace analysis are a significant amount of resource inefficiency and a high degree of workload heterogeneity, which constitute the main motivations of this study. Due to the complex reciprocal dependency of events and system dynamics, it is nearly infeasible to quantitatively infer the main contributors to evictions without a simulation framework.

According to these findings, we propose a “cluster management framework” designed to quantify and minimize the inefficiencies discovered in the Google trace. The framework also incorporates many complex design parameters that enable exploring the design space of scheduling policies, with a particular focus on the impact of task evictions. Utilizing this framework, we propose and evaluate several scheduling designs and eviction policies and reveal the impact of prioritization, evictions and workload characteristics. Hence, we present a comprehensive set of experiments on priority scheduling and evictions in large computing clusters with various tunable parameters and policies.

The contributions of this paper can be summarized as follows:

  • 1.

    We introduce a new “cluster management framework” which provides control over the response time of each priority class via priority scheduling.

  • 2.

    We explore the behavior of different eviction policies by trace-driven simulations.

  • 3.

    We demonstrate the importance of workload-awareness by means of a use case: we propose a workload-aware slot configuration and task assignment scheme, combined with priority scheduling, that exploits the heterogeneity of the workload in order to improve response times and resource efficiency for highly heterogeneous workloads.

  • 4.

    We show that a workload-aware system offers significant potential to better utilize resources for highly heterogeneous workloads with diverse and unknown resource demands.

In this study, we propose a resource- and workload-aware cluster management framework, which quantifies system performance for different system settings and scheduling policies. In Section 3, we describe the system properties and eviction policies in detail. Then, we characterize the workload and server heterogeneity of the trace in Section 4. We investigate the system performance under different design decisions in Section 5. Finally, in Section 6, we present a system design as a use case, which provides better performance in terms of class-based response times by exploiting the workload heterogeneity.

Section snippets

Related work

Interest in energy [10] and resource efficiency [11] has been growing over the past years. Some studies achieve significant energy savings by focusing on workload consolidation and dynamic right-sizing, such as [4], [12], [13], while ignoring task priorities and evictions. The related energy-aware proposals deal with aggregate workloads, hence the allocations are done based on overall resource demand, overlooking task-level constraints, requirements and performance objectives. As a

System model

At the heart of our framework lies a cluster system simulator, which is capable of capturing the energy consumption and response times of complex workloads under various what-if scenarios. The system consists of two main parts: the scheduler and a set of servers. Incoming tasks are enqueued at the scheduler and dispatched to available servers in a time-slotted fashion. The tasks are processed by a set of servers, where the server environment consists of heterogeneous multicore machines.
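The time-slotted dispatch loop just described can be sketched as follows. This is a bare-bones FIFO version with hypothetical names, omitting the priorities, evictions, heterogeneity and energy accounting that the full simulator models:

```python
import heapq
from collections import deque

def simulate(arrivals, n_servers, horizon):
    """Minimal time-slotted simulator: tasks arrive into a FIFO queue at
    the scheduler and are dispatched to free servers at each slot.

    arrivals: dict mapping time slot -> list of service demands (in slots).
    Returns the per-task response times (completion - arrival).
    """
    queue = deque()
    busy = []        # min-heap of (finish_time, arrival_time, demand)
    response = []
    for t in range(horizon):
        # release servers whose tasks have completed by this slot
        while busy and busy[0][0] <= t:
            finish, arr, _ = heapq.heappop(busy)
            response.append(finish - arr)
        # enqueue new arrivals at the scheduler
        for demand in arrivals.get(t, []):
            queue.append((t, demand))
        # dispatch queued tasks to free servers
        while queue and len(busy) < n_servers:
            arr, demand = queue.popleft()
            heapq.heappush(busy, (t + demand, arr, demand))
    return response
```

For example, two 2-slot tasks arriving at slot 0 on a single server finish at slots 2 and 4, giving response times of 2 and 4.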

Workload and server environment analysis

Our framework allows both: (i) synthetically generated task and server attributes following predefined distributions such as exponential or Pareto; and (ii) input based on real system traces and characteristics. In this study, we focus on the second option to be closer to real-world conditions. In particular, we consider the publicly available Google Cluster trace, which represents a rich heterogeneous workload mix on a large heterogeneous cluster. The trace provides information of both
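For mode (i), synthetic service demands could be drawn as in this sketch; the distribution parameters are illustrative, not the framework's defaults:

```python
import random

def synthetic_tasks(n, seed=0, dist="pareto"):
    """Generate n synthetic service demands from the distribution
    families the framework supports (exponential or Pareto).
    Rates/shapes below are illustrative placeholders.
    """
    rng = random.Random(seed)
    if dist == "exponential":
        return [rng.expovariate(1.0) for _ in range(n)]
    return [rng.paretovariate(2.0) for _ in range(n)]
```

A Pareto choice yields the heavy-tailed demands typical of real cluster workloads, whereas the exponential case gives a light-tailed baseline.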

Design comparisons

To better understand the tradeoffs and correlations between different system design decisions, we conduct several analyses under different system configurations. First, we analyze the effect of priority and memory eviction policies separately; then we investigate their combined effect. The details of the experiment settings are described in Table 4. The first set is composed of three experiments (S1, S2, S3) and shows the effect of priority eviction policies. The second set of experiments S4
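Conceptually, comparing eviction policies amounts to sweeping a pluggable victim-selection rule over the same workload; a sketch of such a rule (the policy names below are illustrative and are not the definitions of the S1–S6 experiments):

```python
def pick_victim(running, policy):
    """Select a task to evict from `running` under a given policy.

    running: list of dicts with 'priority' and 'elapsed' fields.
    Policy names are illustrative placeholders, not the paper's set.
    """
    if policy == "lowest_priority":
        return min(running, key=lambda t: t["priority"])
    if policy == "least_progress":   # least work lost on eviction
        return min(running, key=lambda t: t["elapsed"])
    raise ValueError(f"unknown policy: {policy}")
```

The choice of victim directly controls how much completed work is wasted, which is why the experiments vary this rule while holding the rest of the system fixed.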

Case study

In the previous section, we analyzed the system performance under different system designs. We showed that by using preemptive priority scheduling and efficient eviction policies, i.e., MRS and LSF, it is possible to significantly reduce the wasted resources/resource inefficiency while improving class-based response times and energy savings. In this section, we add a workload-aware system configuration and task assignment scheme to the proposed framework in order to obtain further improvement
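The idea of workload-aware assignment can be sketched as routing each task class to its configured slot pool, with a shared fallback; the pool structure and names below are hypothetical, not the paper's exact scheme:

```python
def workload_aware_assign(task_class, pools):
    """Route a task to the slot pool configured for its class; fall back
    to a shared pool when the dedicated one is full.

    pools: dict mapping class name -> remaining free slots
           (mutated in place); illustrative structure only.
    Returns the name of the pool used, or None if no slot is free.
    """
    if pools.get(task_class, 0) > 0:
        pools[task_class] -= 1
        return task_class
    if pools.get("shared", 0) > 0:
        pools["shared"] -= 1
        return "shared"
    return None
```

Dedicating pools to classes isolates latency-sensitive tasks from batch work while the shared pool keeps overall utilization high.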

Conclusion

Motivated by the high complexity of systems and workloads and the significant resource inefficiency of today’s priority-based schedulers, we propose a trace-driven cluster management framework that enables exploring the design space of scheduling policies with a particular focus on the impact of task evictions. The proposed framework models not only a comprehensive set of system and workload parameters, i.e., CPU cores, memory capacities, task slots, priorities, CPU/memory demands, but also a general

Acknowledgments

The research presented in this paper has been supported by the Swiss National Science Foundation (200021_141002) and the EU Commission under the FP7 GENiC project (608826).

Derya Çavdar received her M.S. and B.S. degrees in Computer Engineering from Bogazici University, Istanbul, Turkey in 2009, and 2007, respectively. She worked as a visiting scientist at IBM Research Zurich Lab, Switzerland in 2014. She is now pursuing her Ph.D. in Computer Engineering at Bogazici University, Istanbul, Turkey. Her current research interests include resource management and scheduling in clouds, and green computing.

References (32)

  • G. Sakellari et al.

    A survey of mathematical models, simulation approaches and testbeds used for research in cloud computing

    Simul. Modell. Pract. Theory

    (2013)
  • M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, J. Wilkes, Omega: Flexible, scalable schedulers for large compute...
  • S. Spicuglia, L. Chen, W. Binder, Join the best queue: Reducing performance variability in heterogeneous systems, in:...
  • A. Gandhi et al.

    Autoscale: Dynamic, robust capacity management for multi-tier data centers

    ACM Trans. Comput. Syst. (TOCS)

    (2012)
  • M. Lin, A. Wierman, L. Andrew, E. Thereska, Dynamic right-sizing for power-proportional data centers, in: INFOCOM,...
  • C. Reiss, A. Tumanov, G.R. Ganger, R.H. Katz, M.A. Kozuch, Heterogeneity and dynamicity of clouds at scale: Google...
  • Y. Chen et al.

    Analysis and lessons from a publicly available Google cluster trace, Tech. Rep.

    (2010)
  • D. Çavdar et al.

    Quantifying the brown side of priority schedulers: Lessons from big clusters

    SIGMETRICS Perform. Eval. Rev.

    (2014)
  • J. Wilkes, More Google cluster data, Google research blog, 2011....
  • B. Sharma, T. Wood, C.R. Das, HybridMR: A hierarchical MapReduce scheduler for hybrid data centers, in: ICDCS, 2013,...
  • Q. Zhang et al.

    Dynamic heterogeneity-aware resource provisioning in the cloud

    IEEE Trans. Cloud Comput.

    (2014)
  • B. Sharma, R. Prabhakar, S. Lim, M. Kandemir, C. Das, MROrchestrator: A fine-grained resource orchestration framework...
  • A. Gandhi, P. Dube, A. Karve, A. Kochut, L. Zhang, Adaptive, model-driven autoscaling for cloud applications, in: ICAC,...
  • C. Delimitrou, C. Kozyrakis, Paragon: QoS-aware scheduling for heterogeneous datacenters, in: ASPLOS, 2013, pp....
  • J. Nair, K. Jagannathan, A. Wierman, When heavy-tailed and light-tailed flows compete: The response time tail under...
  • A. Sleptchenko et al.

    An exact solution for the state probabilities of the multi-class, multi-server queue with preemptive priorities

    Queueing Syst.

    (2005)

    Robert Birke received his Ph.D. degree from the Politecnico di Torino in 2009 with the telecommunications group under the supervision of professor Fabio Neri. Currently he works as a postdoc in the cloud server technologies group at IBM Research Zurich. He is coauthor of more than 40 scientific papers. His main research interests are high performance computing, cloud computing and datacenter networks with special focus on performance, quality of service, and virtualization.

    Lydia Y. Chen received her Ph.D. in Operations Research and Industrial Engineering from Penn State University in Dec 2006. She completed her undergraduate studies at National Taiwan University and British Columbia University. She is currently a performance analyst at the Energy Management group of IBM Zurich Research Lab. Her research interests are performance modeling, evaluation and optimal control of computer and communication systems using techniques in operations research and statistics.

    Fatih Alagöz received the B.Sc. degree from Middle East Technical University, Ankara, Turkey, in 1992, and the M.Sc. and D.Sc. degrees from The George Washington University, Washington, DC, USA, in 1995 and 2000, respectively, all in electrical engineering. He is a Professor with the Department of Computer Engineering, Bogazici University, Istanbul, Turkey. He has contributed to many research projects for various agencies/organizations, including the US Army Intelligence Center, Naval Research Laboratory, UAE Research Fund, Turkish Scientific Research Council, State Planning Organization of Turkey, etc. He is the Satellite Systems Advisor to the Kandilli Earthquake Research Institute, Istanbul, Turkey. He is an editor of five books and an author of more than 100 scholarly papers in selected journals and conferences. His research interests span various aspects of wireless/mobile/satellite communication networks. Dr. Alagöz has served on several major conference technical committees and organized and chaired many technical sessions at many international conferences. He is a member of the IEEE Satellite and Space Communications Technical Committee. He has received numerous professional awards.
